I have a data frame, dftime, with many variables, but a snapshot of the data looks like this:
| gene | country | case_month | case_year |
| ----- | ------- | ---------- | --------- |
| gene1 | Senegal | February | 2020 |
| gene2 | Botswana| January | 2021 |
| gene3 | Congo | March | 2021 |
| gene4 | Guinea | September | 2020 |
Here is a reproducible example:
structure(list(gene = c("gene1", "gene2",
"gene3", "gene4", "gene5",
"gene6"), date = structure(c(18319, 18328,
18320, 18323, 18325, 18324), class = "Date"), country = c("Nigeria",
"South Africa", "Senegal", "Senegal", "Senegal", "Senegal"),
case_month = c("February", "March", "February", "March",
"March", "March"), case_year = c("2020", "2020", "2020",
"2020", "2020", "2020")), row.names = c(1L, 3L, 22L, 23L,
24L, 25L), class = "data.frame")
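I left the date variable in, in case it helps! I derived case_month and case_year from the date.
There are 38 countries in total, all 12 months are represented, and there are only two years, 2020 and 2021. I am trying to arrange the data so that I get the number of genes for Senegal in January 2020, Senegal in February 2020, and so on, giving a count n of all genes per country for each month of both years. I would like output like this: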
| country | case_month | case_year | n |
| ------- | ---------- | --------- |---|
| Senegal | January | 2020 | 4 |
| Senegal | February | 2020 | 6 |
| Senegal | March | 2020 | 5 |
| Botswana| January | 2021 | 1 |
| Congo | March | 2021 | 2 |
and so on...
The goal is to use this count to produce a stacked bar chart like this, where n is the new count variable:
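The table above amounts to counting rows per country, month and year. A minimal sketch with dplyr's `count()`, assuming the reproducible data from the question is stored in `dftime` (rebuilt here in shortened form, without the date column, so the block runs on its own):

```r
library(dplyr)

# Shortened rebuild of the reproducible example (date column omitted)
dftime <- data.frame(
  gene       = paste0("gene", 1:6),
  country    = c("Nigeria", "South Africa", "Senegal",
                 "Senegal", "Senegal", "Senegal"),
  case_month = c("February", "March", "February",
                 "March", "March", "March"),
  case_year  = "2020"
)

# One row per country / month / year combination, with the row count in n
counts <- dftime %>%
  count(country, case_month, case_year)

print(counts)
# Senegal / March / 2020 has three genes in this sample, so its n is 3
```

Combinations that never occur in the data simply do not appear in the result, which is usually fine for a stacked bar chart.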
dftime_stacked <- ggplot(dftime_ord, aes(fill=country, y= n, x=case_month)) +
geom_bar(position="stack", stat="identity")
dftime_stacked + facet_wrap(~ case_year)
I tried to order the data with dplyr, using mutate:
dftime_ord <- mutate(dftime, country = reorder(country, -n, sum),
case_month = reorder(case_month, -n, sum))
However, this throws two errors. The first comes from -n:
Error in -n : invalid argument to unary operator
The second appears when I take -n out (sorting from largest to smallest is not essential here, since my countries are in alphabetical order anyway):
Error in tapply(X = X, INDEX = x, FUN = FUN, ...) :
arguments must have same length
All my variables are character. Is there a reason they can't be ordered this way in dplyr? Any idea why it throws these errors? Many thanks for any help!
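Both error messages can be reproduced in base R, which hints at the likely cause: `dftime` has no column called `n` yet. The sketch below assumes that, with dplyr attached, the bare `n` inside `mutate()` resolves to the function `dplyr::n`; a hypothetical stand-in function is used so the block does not need dplyr.

```r
# Toy stand-in for dftime: note that there is no column called n
dftime <- data.frame(
  country    = c("Nigeria", "Senegal", "Senegal"),
  case_month = c("February", "February", "March")
)

# 1) Negating a function fails. `n` here is a hypothetical stand-in
#    for dplyr::n, which is what a bare `n` would find in mutate().
n <- function() 1L
err1 <- tryCatch(-n, error = conditionMessage)

# 2) Without -n, `sum` moves into reorder()'s second argument (X),
#    so tapply() gets a function of length 1 instead of one score per row.
err2 <- tryCatch(reorder(dftime$country, sum), error = conditionMessage)

print(err1)
print(err2)
```

Either way, `n` has to exist as a column (for example, after counting rows per group) before `reorder()` can use it as a score.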
You can count the rows and control the order with a data.table solution:
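library(data.table)

# Rebuild the snapshot table from the question; "|" is the column separator.
# Note: values keep the padding around "|" unless strip.white = TRUE is added.
df <- read.table(textConnection(' gene | country | case_month | case_year
gene1 | Senegal | February | 2020
gene2 | Botswana| January | 2021
gene3 | Congo | March | 2021
gene4 | Guinea | September | 2020 '),sep='|',header=TRUE)
setDT(df)
# Count rows per country / year / month
df <- df[,.(n=.N),by=c('country','case_year','case_month')]
# Sort by country, then month, both descending (-1)
setorderv(df,c('country','case_month'),c(-1,-1))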
Output:
country case_year case_month n
<fct> <dbl> <fct> <int>
1 " Senegal " 2020 " February " 1
2 " Guinea " 2020 " September " 1
3 " Congo " 2021 " March " 1
4 " Botswana" 2021 " January " 1
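Two details in the output above may matter for the plot: the values still carry the padding from the `|`-separated text, and months stored as text sort alphabetically. A hedged variant of the same data.table approach, assuming it is acceptable to trim whitespace with `strip.white = TRUE` and to turn `case_month` into a factor ordered by base R's `month.name`:

```r
library(data.table)

# Same toy table; strip.white = TRUE trims the padding around "|"
df <- read.table(text = "gene | country | case_month | case_year
gene1 | Senegal | February | 2020
gene2 | Botswana | January | 2021
gene3 | Congo | March | 2021
gene4 | Guinea | September | 2020",
  sep = "|", header = TRUE, strip.white = TRUE)

setDT(df)

# Count rows per country / year / month
counts <- df[, .(n = .N), by = .(country, case_year, case_month)]

# Calendar order for months instead of alphabetical order
counts[, case_month := factor(case_month, levels = month.name)]

# Sort ascending by country, then year, then calendar month
setorder(counts, country, case_year, case_month)
print(counts)
```

With `case_month` as a factor, ggplot2 should place the months on the x axis in calendar order as well.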