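What I want to do is write code that lets me categorize tweets. In the example below, I'd like to take tweets about credit cards and determine whether or not they are related to travel.
Here is the initial data set: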
id<- c(123,124,125,126,127)
text<- c("Since I love to travel, this is what I rely on every time.",
"I got this card for the no international transaction fee",
"I got this card mainly for the flight perks",
"Very good card, easy application process",
"The customer service is outstanding!")
travel_cat<- c(1,0,1,0,0)
df_all<- data.frame(id,text,travel_cat)
Output 1:
id text travel_cat
123 Since I love to travel, this is what I rely on every time. 1
124 I got this card for the no international transaction fee 0
125 I got this card mainly for the flight perks 1
126 Very good card, easy application process 0
127 The customer service is outstanding! 0
I then create a data frame containing only the text field and run the text analysis:
myvars<- c("text")
df<- df_all[myvars]
library(tm)
corpus<- Corpus(DataframeSource(df))
corpus<- tm_map(corpus, content_transformer(tolower))
corpus<- tm_map(corpus, removePunctuation)
corpus<- tm_map(corpus, removeWords, stopwords("english"))
corpus<- tm_map(corpus, stripWhitespace)
dtm<- as.matrix(DocumentTermMatrix(corpus))
Output 2 (dtm):
Docs application card customer easy every ... etc.
1 0 0 0 1 0
2 0 1 0 0 1
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
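How do I then bind this back to the original data, so that the result contains the fields from both the original data set and the matrix (Output 1 + Output 2): id, text, travel_cat + application, card, customer, easy, every, ...?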
Just try a cbind():
allcombined <- cbind(dtm,df_all)
Is this what you're after?
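For what it's worth, here is a minimal sketch of that cbind() idea in base R only. The small df_all and the hand-built dtm matrix below are illustrative stand-ins (not the real output of DocumentTermMatrix), and I'm assuming the matrix rows are in the same order as the rows of df_all, since cbind() pairs rows by position rather than by id.

```r
# Toy stand-ins for the objects in the question (values are illustrative only).
df_all <- data.frame(
  id = c(123, 124, 125),
  text = c("love to travel", "no international fee", "flight perks"),
  travel_cat = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical term-count matrix: one row per document, same order as df_all.
dtm <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("travel", "fee", "flight"))
)

# cbind() matches rows by position, so make sure the row counts agree first.
stopifnot(nrow(dtm) == nrow(df_all))

# Putting df_all first keeps id, text, travel_cat as the leading columns.
allcombined <- cbind(df_all, dtm)
str(allcombined)
```

Putting df_all first gives the column order asked for in the question (id, text, travel_cat, then the terms). One thing to watch for: if a term in the matrix happens to be called "id" or "text", you end up with duplicate column names, so it may be worth renaming or prefixing the term columns before binding.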