I need to remove all non-English words from a data frame that looks like this:
ID text
1 they all went to the store bonkobuns and bought chicken
2 if we believe no exomunch standards are in order then we're ok
3 living among the calipodians seems reasonable
4 given the state of all relimited editions we should be fine
I would like to end up with a data frame like this:
ID text
1 they all went to the store and bought chicken
2 if we believe no standards are in order then we're ok
3 living among the seems reasonable
4 given the state of all editions we should be fine
I have a vector containing all English words: word_vec
Using the tm package, I can remove every word in that vector from the data frame:
library(tm)
for (k in 1:nrow(frame)) {
  for (i in 1:length(word_vec)) {
    # strip the i-th word of word_vec from the text of row k
    frame$text[k] <- removeWords(frame$text[k], word_vec[i])
  }
}
But I want the opposite: I only want to keep the words that are found in the vector.
Here is a simple way to do it:
txt <- "Hi this is an example"
words <- c("this", "is", "an", "example")
paste(intersect(strsplit(txt, "\\s")[[1]], words), collapse=" ")
[1] "this is an example"
Of course, the devil is in the details, so you may need to tweak this a little to account for apostrophes and other punctuation.
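As a rough sketch of how this could be applied to every row of the data frame in the question: the version below filters tokens with %in% instead of intersect(), because intersect() returns unique values and would therefore drop repeated words such as a second "the". The helper name keep_known and the short word_vec are assumptions made for illustration; a real word_vec would hold the full English word list. Matching is done on a lower-cased copy of each token with leading and trailing punctuation stripped, which is one possible way to handle the punctuation caveat, not the only one.

```r
# Sample data from the question; this `word_vec` is a tiny stand-in
# for the full English word list (an assumption for illustration).
frame <- data.frame(
  ID = 1:4,
  text = c(
    "they all went to the store bonkobuns and bought chicken",
    "if we believe no exomunch standards are in order then we're ok",
    "living among the calipodians seems reasonable",
    "given the state of all relimited editions we should be fine"
  ),
  stringsAsFactors = FALSE
)
word_vec <- c("they", "all", "went", "to", "the", "store", "and", "bought",
              "chicken", "if", "we", "believe", "no", "standards", "are",
              "in", "order", "then", "we're", "ok", "living", "among",
              "seems", "reasonable", "given", "state", "of", "editions",
              "should", "be", "fine")

# keep_known() is a hypothetical helper, not part of tm or base R.
keep_known <- function(txt, vocab) {
  # Split the text on runs of whitespace.
  tokens <- strsplit(txt, "\\s+")[[1]]
  # Compare on a normalized copy: lower-case, with leading and trailing
  # punctuation stripped, so "Store," still matches "store".
  key <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens))
  # %in% keeps every matching token, including repeats, in original order.
  paste(tokens[key %in% vocab], collapse = " ")
}

# Apply the helper to each row's text.
frame$text <- vapply(frame$text, keep_known, character(1),
                     vocab = word_vec, USE.NAMES = FALSE)
frame
```

With the sample data, the made-up words (bonkobuns, exomunch, calipodians, relimited) are dropped and everything else is kept, including "we're", since the apostrophe sits inside the token. The vocabulary is assumed to be lower-case; if word_vec contains mixed case, it would need the same tolower() treatment.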