I have a dataset coming in with an odd report format, and I need to turn it into a workable data frame. The data I'm working with looks like this:
ids<-(c("A101","","","","B101","","","C101","","",""))
dx<-c("Lung","","","","Kidney","","","Prostate","","","")
alt<-c("","A766","G283","F933","","B293","T432","","U920","D289","S203")
val<-c(NA,3.2,4.3,7.2,NA,2.1,3.8,NA,8.1,5.3,7.1)
df.in<-data.frame(ids,dx,alt,val)
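This produces a layout in which the data values are not aligned with their sample IDs. I would like them aligned as in the final data frame shown here: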
ids<-(c("A101","A101","A101","B101","B101","C101","C101","C101"))
dx<-c("Lung","Lung","Lung","Kidney","Kidney","Prostate","Prostate","Prostate")
alt<-c("A766","G283","F933","B293","T432","U920","D289","S203")
val<-c(3.2,4.3,7.2,2.1,3.8,8.1,5.3,7.1)
df.out<-data.frame(ids,dx,alt,val)
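I have explored different approaches using plyr and lapply, but I can't seem to get the result to look like the 'df.out' data frame above. Note that the number of values per sample is not uniform (i.e., some samples may have only 1 value, while others may have as many as 10). Any ideas on how to solve this?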
One way, using tidyr and dplyr:
library(dplyr)
library(tidyr)
# Replace blank cells "" with NA
df.in[df.in == ""] <- NA
# Fill NA values with value of row above it
df.in %>%
  fill(c(ids, dx), .direction = "down") %>%
  drop_na() %>%
  mutate_if(is.factor, as.character) # optional
# A tibble: 8 x 4
  ids   dx       alt     val
  <chr> <chr>    <chr> <dbl>
1 A101  Lung     A766   3.20
2 A101  Lung     G283   4.30
3 A101  Lung     F933   7.20
4 B101  Kidney   B293   2.10
5 B101  Kidney   T432   3.80
6 C101  Prostate U920   8.10
7 C101  Prostate D289   5.30
8 C101  Prostate S203   7.10
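For readers who want to see what the fill-down step amounts to, here is a minimal base-R sketch of the same idea. It is not the approach used in this answer, just an illustration. It assumes the first row always carries a sample ID and that the columns are character vectors; the names is_header, block, filled and df.base are made up for this example.

```r
# Same toy input as in the question, kept as character columns.
df.in <- data.frame(
  ids = c("A101", "", "", "", "B101", "", "", "C101", "", "", ""),
  dx  = c("Lung", "", "", "", "Kidney", "", "", "Prostate", "", "", ""),
  alt = c("", "A766", "G283", "F933", "", "B293", "T432", "", "U920", "D289", "S203"),
  val = c(NA, 3.2, 4.3, 7.2, NA, 2.1, 3.8, NA, 8.1, 5.3, 7.1),
  stringsAsFactors = FALSE
)

# Rows that carry a sample ID are the "header" row of each block.
is_header <- df.in$ids != ""

# cumsum() over the header flags numbers the blocks: 1 1 1 1 2 2 2 3 3 3 3
block <- cumsum(is_header)

# Look up each row's ID and diagnosis from the header row of its block.
filled <- df.in
filled$ids <- df.in$ids[is_header][block]
filled$dx  <- df.in$dx[is_header][block]

# Drop the header rows, which hold no alt/val measurement.
df.base <- filled[!is_header, ]
rownames(df.base) <- NULL
```

Because the block index comes from cumsum(), this works no matter how many values each sample has, which was the concern raised in the question.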
The last line in the chain, mutate_if(is.factor, as.character), is optional and converts factors to characters. We can avoid this step by passing stringsAsFactors = FALSE when creating the dataset.
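As a sketch of that variant: the input below is built with stringsAsFactors = FALSE (which is already the default from R 4.0.0 onward), so the factor conversion can be left out of the chain. The names df.chr and df.tidy are only illustrative, and the data is the same toy example from the question.

```r
library(dplyr)
library(tidyr)

# Build the input with character columns instead of factors.
df.chr <- data.frame(
  ids = c("A101", "", "", "", "B101", "", "", "C101", "", "", ""),
  dx  = c("Lung", "", "", "", "Kidney", "", "", "Prostate", "", "", ""),
  alt = c("", "A766", "G283", "F933", "", "B293", "T432", "", "U920", "D289", "S203"),
  val = c(NA, 3.2, 4.3, 7.2, NA, 2.1, 3.8, NA, 8.1, 5.3, 7.1),
  stringsAsFactors = FALSE
)

# Replace blank cells "" with NA so fill() and drop_na() can see them.
df.chr[df.chr == ""] <- NA

df.tidy <- df.chr %>%
  fill(ids, dx, .direction = "down") %>%  # carry each ID/diagnosis down its block
  drop_na()                               # drop the header rows (alt and val are NA)
```

Note that drop_na() with no arguments removes any row containing an NA, so a measurement row with a genuinely missing val would be dropped as well. If that matters for the real data, restricting it to a column, for example drop_na(alt), may be a better fit.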