我是R的新手。我正在尝试根据多种条件从data.frame(df)过滤行:
我的data.frame的示例:df的图像
df:
SNPA SNPB value block1 block2 score_T
A1 A22 0.379927 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 12
A2 A23 0.449074 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 25
A3 A24 0.464135 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 584
A4 A22 0.328866 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 51
A5 A22 0.326026 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 64
A22 A27 0.57169 A22|A23|A24|A25 A27|A28|A29|A30|A31 77
A23 A28 0.416178 A22|A23|A24|A25 A27|A28|A29|A30|A31 90
A24 A29 0.456144 A22|A23|A24|A25 A27|A28|A29|A30|A31 103
A34 A39 0.379927 A31|A32|A33|A34 A39|A40|A41|A42 116
A34 A40 0.759074 A31|A32|A33|A34 A39|A40|A41|A42 129
A34 A41 0.562303 A31|A32|A33|A34 A39|A40|A41|A42 142
A39 A57 0.322303 A39|A40|A41|A42 A52|A53|A54|A55|A56|A57|A58|A59|A60|A61 25
A40 A57 0.372303 A39|A40|A41|A42 A52|A53|A54|A55|A56|A57|A58|A59|A60|A61 198
A41 A57 0.562303 A39|A40|A41|A42 A52|A53|A54|A55|A56|A57|A58|A59|A60|A61 356
我要使用的dplyr
是仅保留块(block1和block2)至少具有两个SNP的行(来自block的SNPA列和block2的SNPB列),并删除包含1个SNP的成对的块(示例:第9至14行)。
想要的结果:结果
SNPA SNPB value block1 block2 score_T
A1 A22 0.379927 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 12
A2 A23 0.449074 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 25
A3 A24 0.464135 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 584
A4 A22 0.328866 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 51
A5 A22 0.326026 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 64
A22 A27 0.57169 A22|A23|A24|A25 A27|A28|A29|A30|A31 77
A23 A28 0.416178 A22|A23|A24|A25 A27|A28|A29|A30|A31 90
A24 A29 0.456144 A22|A23|A24|A25 A27|A28|A29|A30|A31 103
你知道我该怎么做吗?
result <- df %>% group_by(block1, block2) %>% filter(...) %>% summarise(mean_s = mean(score_T), number = n())
谢谢。
有点慢的base-dplyr
解决方案。该解决方案的一些问题包括需要在我们的过滤器功能中手动设置“块”和“快照”。一个人可能可以使这一过程自动化。
my_filter <- function(df,block, snp){
res<-strsplit(df[[block]],"|", fixed= TRUE)
lengths(lapply(res, function(x) which(x %in% df[[snp]]))) > 1
}
df %>%
filter(my_filter(., "block1", "SNPA"), my_filter(., "block2","SNPB"))
SNPA SNPB value block1 block2 score_T
1 A1 A22 0.379927 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 12
2 A2 A23 0.449074 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 25
3 A3 A24 0.464135 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 584
4 A4 A22 0.328866 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 51
5 A5 A22 0.326026 A1|A2|A3|A4|A5|A6 A22|A23|A24|A25 64
6 A22 A27 0.571690 A22|A23|A24|A25 A27|A28|A29|A30|A31 77
7 A23 A28 0.416178 A22|A23|A24|A25 A27|A28|A29|A30|A31 90
8 A24 A29 0.456144 A22|A23|A24|A25 A27|A28|A29|A30|A31 103
数据:
df <-structure(list(SNPA = c("A1", "A2", "A3", "A4", "A5", "A22",
"A23", "A24", "A34", "A34", "A34", "A39", "A40", "A41"), SNPB = c("A22",
"A23", "A24", "A22", "A22", "A27", "A28", "A29", "A39", "A40",
"A41", "A57", "A57", "A57"), value = c(0.379927, 0.449074, 0.464135,
0.328866, 0.326026, 0.57169, 0.416178, 0.456144, 0.379927, 0.759074,
0.562303, 0.322303, 0.372303, 0.562303), block1 = c("A1|A2|A3|A4|A5|A6",
"A1|A2|A3|A4|A5|A6", "A1|A2|A3|A4|A5|A6", "A1|A2|A3|A4|A5|A6",
"A1|A2|A3|A4|A5|A6", "A22|A23|A24|A25", "A22|A23|A24|A25", "A22|A23|A24|A25",
"A31|A32|A33|A34", "A31|A32|A33|A34", "A31|A32|A33|A34", "A39|A40|A41|A42",
"A39|A40|A41|A42", "A39|A40|A41|A42"), block2 = c("A22|A23|A24|A25",
"A22|A23|A24|A25", "A22|A23|A24|A25", "A22|A23|A24|A25", "A22|A23|A24|A25",
"A27|A28|A29|A30|A31", "A27|A28|A29|A30|A31", "A27|A28|A29|A30|A31",
"A39|A40|A41|A42", "A39|A40|A41|A42", "A39|A40|A41|A42", "A52|A53|A54|A55|A56|A57|A58|A59|A60|A61",
"A52|A53|A54|A55|A56|A57|A58|A59|A60|A61", "A52|A53|A54|A55|A56|A57|A58|A59|A60|A61"
), score_T = c(12L, 25L, 584L, 51L, 64L, 77L, 90L, 103L, 116L,
129L, 142L, 25L, 198L, 356L)), class = "data.frame", row.names = c(NA,
-14L))
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句