How to automatically check whether the values of a list appear in a data frame column in R

Biotech enthusiast

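In an existing csv, one column holds one of the following codes for each row, e.g. outer membrane [GO:0019867]. I want to add a column to the csv that gives each row a category, e.g. OuterMembrane. So I added an empty column, and I want to build the list below so that the general category is filled in automatically whenever one of its codes appears in the csv. (Not all codes are included.)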

categ <- list(OuterMembrane = c("outer membrane [GO:0019867]","cell outer membrane [GO:0009279]", "integral component of membrane [GO:0016021]", "membrane [GO:0016020]"),
              Cytoplasmic =c("ribosome [GO:0005840]", "cytoplasm [GO:0005737]"),
              Extracellular=c(),
              InnerMembrane=c("plasma membrane [GO:0005886]", "membrane [GO:0016020]"),
              Periplasmic=c("periplasmic space [GO:0042597]"),
              CellWall=c(),
              Vacuole=c(),
              Lipoproteins=c())


library(dplyr)    # provides %>%
library(tibble)   # provides add_column()

csv1 <- csv1 %>%
  add_column("Subcellular Localization" = NA)

for (row in (categ)){ 
   if row(categ) %in% csv1{

......??????

卡兰帕利基

The following for loop may help with your problem.

csv1$subcellular_localization <- NA_character_   # add a new, empty column

for (i in seq_len(nrow(csv1))) {                  # fill in the new column row by row
  for (j in seq_along(categ)) {
    # if a code is listed under several categories, the last matching one wins
    if (csv1$cell_comp[i] %in% categ[[j]]) {
      csv1$subcellular_localization[i] <- names(categ)[j]
    }
  }
}

csv1

Input:

> csv1
  name                      cell_comp
1   p1    outer membrane [GO:0019867]
2   p2         cytoplasm [GO:0005737]
3   p3 periplasmic space [GO:0042597]

Output:

> csv1
  name                      cell_comp subcellular_localization
1   p1    outer membrane [GO:0019867]            OuterMembrane
2   p2         cytoplasm [GO:0005737]              Cytoplasmic
3   p3 periplasmic space [GO:0042597]              Periplasmic
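For larger files, the same one-code-per-row lookup can be done without explicit loops. The following is only a sketch in base R, not part of the answer above: it assumes the column is called cell_comp as in the example, rebuilds a small sample of the data, and uses a shortened copy of categ. The name lookup is introduced here purely for illustration.

```r
# Sample data mirroring the input shown above (illustrative only)
csv1 <- data.frame(
  name = c("p1", "p2", "p3"),
  cell_comp = c("outer membrane [GO:0019867]",
                "cytoplasm [GO:0005737]",
                "periplasmic space [GO:0042597]"),
  stringsAsFactors = FALSE
)

# Shortened copy of the categ list from the question
categ <- list(
  OuterMembrane = c("outer membrane [GO:0019867]", "cell outer membrane [GO:0009279]"),
  Cytoplasmic   = c("ribosome [GO:0005840]", "cytoplasm [GO:0005737]"),
  Extracellular = c(),
  Periplasmic   = c("periplasmic space [GO:0042597]")
)

# Flatten the named list into a two-column lookup table.
# Empty categories have length 0 and therefore contribute no rows.
lookup <- data.frame(
  term     = unlist(categ, use.names = FALSE),
  category = rep(names(categ), lengths(categ)),
  stringsAsFactors = FALSE
)

# match() returns the position of the first matching term (NA if none),
# so unmatched rows stay NA
csv1$subcellular_localization <- lookup$category[match(csv1$cell_comp, lookup$term)]

print(csv1)
```

One difference to be aware of: match() picks the first category that lists a code, whereas the loop keeps the last one. This only matters for codes that appear under several categories, such as membrane [GO:0016020] in the question's list.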

Edit

If a protein has several cellular components, a for loop of the following form can be used (it relies on the stringr library):

library(stringr)

# assumes subcellular_localization already exists and is still NA (see above)
for (i in seq_len(nrow(csv1))) {
  # split the ";"-separated components and strip surrounding whitespace
  components <- str_trim(unlist(strsplit(as.character(csv1$cell_comp[i]), ";")))
  found <- character(0)                 # categories matched so far for this row
  for (component in components) {
    for (j in seq_along(categ)) {
      if (component %in% categ[[j]]) {
        found <- union(found, names(categ)[j])   # union() skips duplicates
      }
    }
  }
  if (length(found) > 0) {
    csv1$subcellular_localization[i] <- paste(found, collapse = "; ")
  }
}

Input*:

> csv1
  name                                                                cell_comp
1   p1 outer membrane [GO:0019867]; integral component of membrane [GO:0016021]
2   p2                   cytoplasm [GO:0005737]; periplasmic space [GO:0042597]
3   p3                                           periplasmic space [GO:0042597]

Output*:

> csv1
  name                                                                cell_comp subcellular_localization
1   p1 outer membrane [GO:0019867]; integral component of membrane [GO:0016021]            OuterMembrane
2   p2                   cytoplasm [GO:0005737]; periplasmic space [GO:0042597] Cytoplasmic; Periplasmic
3   p3                                           periplasmic space [GO:0042597]              Periplasmic
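The multi-component case can also be written as a small function applied to every row. This is a sketch under stated assumptions rather than the method from the answer: classify_row is a hypothetical helper name, the data and the shortened categ list are rebuilt here so the block runs on its own, and base R's trimws() stands in for stringr's str_trim().

```r
# Sample data mirroring the multi-component input shown above (illustrative only)
csv1 <- data.frame(
  name = c("p1", "p2", "p3"),
  cell_comp = c(
    "outer membrane [GO:0019867]; integral component of membrane [GO:0016021]",
    "cytoplasm [GO:0005737]; periplasmic space [GO:0042597]",
    "periplasmic space [GO:0042597]"
  ),
  stringsAsFactors = FALSE
)

# Shortened copy of the categ list from the question
categ <- list(
  OuterMembrane = c("outer membrane [GO:0019867]",
                    "integral component of membrane [GO:0016021]"),
  Cytoplasmic   = c("cytoplasm [GO:0005737]"),
  Periplasmic   = c("periplasmic space [GO:0042597]")
)

# Hypothetical helper: returns every category that lists at least one of the
# row's components, joined with "; ", or NA when nothing matches
classify_row <- function(cell_comp, categ) {
  components <- trimws(strsplit(cell_comp, ";", fixed = TRUE)[[1]])
  hit <- vapply(categ, function(terms) any(components %in% terms), logical(1))
  if (!any(hit)) {
    return(NA_character_)
  }
  paste(names(categ)[hit], collapse = "; ")
}

csv1$subcellular_localization <- vapply(
  csv1$cell_comp, classify_row, character(1),
  categ = categ, USE.NAMES = FALSE
)

print(csv1)
```

Here the categories are listed in the order they appear in categ, not in the order the components appear in the cell, and each category shows up at most once per row.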
