我有以下数据集(示例)
idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
df = data.frame(idnumber,reg,code1,code2)
这使:
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
4 14 US T78 S46
5 14 US G78 D78
6 15 TZ G76 G56
7 16 MK G64 H89
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
我想子集df
只保留值G56
出现在列code1
or 中code 2
的原始数据,但idnumber
如果 id 值是与该值匹配的相同 id 值,则保留原始数据,G56
例如:
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
6 15 TZ G76 G56
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
我有数百万个观察值和大约 30code
列。希望问题足够清楚,欢迎提出任何建议!
干杯
library(dplyr)
df %>%
semi_join(df %>%
filter(code1=="G56" | code2=="G56"),by="idnumber")
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
4 15 TZ G76 G56
5 17 GR T65 G56
6 18 ES G79 W34
7 18 ES G56 T89
编辑:这可能更简单,有 30 个代码列:
df %>%
semi_join(df %>%
pivot_longer(cols=-c(idnumber, reg)) %>%
filter(value=="G56") %>%
pivot_wider(id_cols=c(idnumber, reg)),
by="idnumber")
第三选择:
df %>%
semi_join(df %>%
filter_at(vars(starts_with("code")), any_vars(. == "G56")),
by="idnumber")
编辑:如果“G56”在“代码”列中至少出现两次,OP 现在想要过滤记录(见下面的评论)
df %>%
semi_join(df %>%
mutate(n=rowSums(.[grep("code", names(.))] =="G56")) %>%
group_by(idnumber) %>%
filter(sum(n)>1),
by="idnumber")
idnumber reg code1 code2 code3
1 12 FR F56 G56 M56
2 12 FR G76 I89 G56
3 18 ES G79 W34 W33
4 18 ES G56 G56 T89
idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','G56')
code3=c('M56','G56','J83','S46','D78','G46','H89','J56','W33','T89')
df = data.frame(idnumber,reg,code1,code2,code3)
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句