I am working with a Twitter dataset, but I have not figured out how to subset the data based on a list of hashtags.
df:
rowID Hashtags
1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 onlarkonusurakpartiyapar
5 anfal,halabja,kurdistan,kobani
6 onlarkonusurakpartiyapar
7 kurdistan
The hashtags to match are stored in a character vector:
hashtag_list:
"onlarkonusurakpartiyapar" "kurdistan"
I tried this code, but it does not work for me.
new_df=df[df$Hashtags %in% hashtag_list,]
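It only gives me the subset for the "onlarkonusurakpartiyapar" hashtag. I know it looks simple, but I could not figure it out even after looking through all the posts on the site. Thanks for your help.
Here is a modification of your approach: split each comma-separated string into individual hashtags, and treat a row as a match if any of its hashtags appears in the list.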
df <- data.frame(
  rowID = 1:8,
  Hashtags = c(
    "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
    "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
    "kurdish,mahabad,justiceforfarinaz,kurdistan",
    "onlarkonusurakpartiyapar",
    "anfal,halabja,kurdistan,kobani",
    "onlarkonusurakpartiyapar",
    "kurdistan",
    "this,willnot,befound"
  ),
  stringsAsFactors = FALSE
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")
# Split each row's comma-separated string into individual hashtags and
# flag the row if any of them appears in hashtag_list
find_ht <- function(hashtags, hashtag_list) {
  sapply(strsplit(hashtags, split = ","), function(x) any(x %in% hashtag_list))
}
find_ht(hashtags = df$Hashtags, hashtag_list = hashtag_list)
This returns:
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
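For contrast, here is a minimal sketch of why the original `df$Hashtags %in% hashtag_list` attempt misses rows: `%in%` compares each whole string against the list, so a row matches only when its entire Hashtags value is exactly one listed hashtag. The `hashtags` vector below is a shortened, illustrative sample rather than the full data set, and the names `whole_string_match` and `per_tag_match` are made up for this example.

```r
# Illustrative sample (shortened from the example data above)
hashtags <- c(
  "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
  "kurdistan",
  "anfal,halabja,kurdistan,kobani",
  "this,willnot,befound"
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# Whole-string comparison: only an element that is exactly one listed
# hashtag counts as a match
whole_string_match <- hashtags %in% hashtag_list

# Per-hashtag comparison: split on commas first, then test each piece
per_tag_match <- sapply(strsplit(hashtags, split = ","),
                        function(x) any(x %in% hashtag_list))

print(whole_string_match)  # FALSE  TRUE FALSE FALSE
print(per_tag_match)       #  TRUE  TRUE  TRUE FALSE
```

Only the second element matches in the whole-string version, because it is the only one that consists of a single listed hashtag with nothing else attached.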
To do the subsetting, you just need:
sub.index <- find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
df[sub.index,]
which returns
rowID Hashtags
1 1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 4 onlarkonusurakpartiyapar
5 5 anfal,halabja,kurdistan,kobani
6 6 onlarkonusurakpartiyapar
7 7 kurdistan
Alternatively, if you want the indices, use which(sub.index). To extract only the rowID values of the subset, use df[sub.index,"rowID"]. In this case, both return [1] 1 2 3 4 5 6 7
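As a possible alternative to splitting, the same row filter can be sketched with a single regular expression and `grepl`. This is not part of the approach above. It assumes the hashtags contain only letters and digits (so no regex metacharacters need escaping) and that entries are separated by bare commas with no surrounding spaces. The sample rows are shortened for illustration, and `pattern`, `regex_match`, and `matched_ids` are names introduced here.

```r
# Illustrative sample (shortened; "kurdish,mahabad" checks that a partial
# match such as "kurdish" does not count as "kurdistan")
df <- data.frame(
  rowID = 1:4,
  Hashtags = c(
    "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
    "kurdish,mahabad",
    "anfal,halabja,kurdistan,kobani",
    "this,willnot,befound"
  ),
  stringsAsFactors = FALSE
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# Match a listed hashtag only when it is a complete comma-delimited entry:
# preceded by the start of the string or a comma, and followed by a comma
# or the end of the string
pattern <- paste0("(^|,)(", paste(hashtag_list, collapse = "|"), ")(,|$)")

regex_match <- grepl(pattern, df$Hashtags)
matched_ids <- df[regex_match, "rowID"]

print(regex_match)  #  TRUE FALSE  TRUE FALSE
print(matched_ids)  # 1 3
```

The anchors on both sides of the alternation are what keep this equivalent to the split-based check; without them, a listed hashtag would also match as a substring of a longer hashtag.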