I have a data frame like this:
-------------------------------------------------------------------
| | Keywords | Paragraph | Date | Decision |
|===+==================+==================+============+==========|
| 1 | a; b | A lot. of words. | 12/15/2015 | TRUE |
|---+------------------+------------------+------------+----------|
| 2 | c; d | more. words. many| 01/23/2015 | FALSE |
|---+------------------+------------------+------------+----------|
| 3 | a; d; c; foo; bar| words, words, etc| 12/13/2015 | FALSE |
-------------------------------------------------------------------
but with about 1,500 records.
I am trying to find the most common features associated with each Decision. For example:
Group 1: Keywords: "a", Paragraph words: ["trouble", "abhorrent"], Date: "12/12/2015",
Outcome: FALSE, odds of FALSE Decision: 60%
Group 2: Keywords: "b", Paragraph words: ["good", "maximum"], Date: "02/02/2015",
Outcome: TRUE, odds of TRUE Decision: 30%
Also, it would be great if I could plot the odds on a chart like this:
| -----
60% | |///|
| |///| -----
30% | |///| |\\\|
| |///| |\\\|
0% +---|---|------|---|---
Group 1 Group 2
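I think I am looking for regression modeling, but all the examples I find seem to deal with purely numeric data. How can I do this with non-numeric data?
Edit: here is a link to the dput file on Google Drive: https://drive.google.com/open?id=0BwrbzZiF0KGtVVZ4Tk1kdDdBZXM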
Here is a simple example using the data you uploaded:
mod <- glm(Decision ~ Keywords, data = df1, family = "binomial")
predictions <- predict(mod, newdata = df1, type = "response")
predictions
  1   2   3   4   5   6
0.6 0.6 0.6 0.6 0.6 1.0
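This works with non-numeric data because glm expands a factor or character predictor into 0/1 indicator columns before fitting. A minimal sketch of that expansion, assuming a made-up data frame (df_toy and its values are hypothetical, not the uploaded data):

```r
# Hypothetical toy data, NOT the uploaded dput file
df_toy <- data.frame(
  Keywords = c("a", "a", "b", "b", "a", "b"),
  Decision = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# model.matrix() builds the same kind of design matrix that glm() uses
# internally: an intercept plus one 0/1 column per non-reference level
X <- model.matrix(~ Keywords, data = df_toy)
print(colnames(X))
print(X)
```

The first level ("a" here) becomes the reference level, so the fitted coefficient for the other level is a log-odds difference relative to it.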
And here is the plot you wanted, with the groups defined by Keywords:
res <- aggregate(predictions, by=list(df1$Keywords), mean)
barplot(res$x, names.arg = c("Group 1", "Group 2"))
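Note that each row of Keywords holds several keywords separated by "; ", so using the whole string as one level treats "a; b" and "a" as unrelated groups. One possible alternative, sketched here on made-up data (df_toy, kw_list, indicators and rate_true are hypothetical names, and the values are not from the uploaded file), is to split the strings and look at the share of TRUE decisions per individual keyword:

```r
# Hypothetical toy data mimicking the layout shown in the question
df_toy <- data.frame(
  Keywords = c("a; b", "c; d", "a; d; c; foo; bar", "a", "b; c", "d"),
  Decision = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Split each row's keyword string on ";" followed by optional whitespace
kw_list <- strsplit(df_toy$Keywords, ";\\s*")
all_kw  <- sort(unique(unlist(kw_list)))

# Logical matrix: one row per record, one named column per keyword,
# TRUE where the record contains that keyword
indicators <- sapply(all_kw, function(k) {
  sapply(kw_list, function(x) k %in% x)
})

# Share of TRUE decisions among the records containing each keyword
rate_true <- sapply(all_kw, function(k) mean(df_toy$Decision[indicators[, k]]))

# Bar chart in percent, one bar per keyword
barplot(100 * rate_true, ylab = "% TRUE decisions", ylim = c(0, 100))
```

The indicator columns could also be bound to the data frame and passed to glm as separate predictors, although with only a handful of rows per keyword the estimates will be unstable.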