Suppose I have a data frame with the following columns:
df.head()
ref_loc ref_chr REF ALT coverage base
9532728 21 G [A] 1 A
9540473 21 C [G] 2 G
9540473 21 CTATT [C] 2 G
9540794 21 C [T] 1 A
9542965 21 C [A] 1 T
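I want to compare the ALT column with the base column and see where they match and where they differ. Based on the matches and differences, I want to generate a new column named cate.
To do this, I tried the following function: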
def grouping(row):
    if row['ALT'] == row['base']:
        val = "same_variants"
    elif row['ALT'] != row['base']:
        val = "diff_variants"
    return val
df["cate"] = df.apply(grouping,axis=0)
However, the function throws this error when I try to apply it to the data frame:
KeyError Traceback (most recent call last)
<ipython-input-13-a265dee72ec1> in <module>
----> 1 df["group"] =df.apply(grouping,axis=0)
~/software/anaconda/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6911 kwds=kwds,
6912 )
-> 6913 return op.get_result()
6914
6915 def applymap(self, func):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-11-098066170c2f> in grouping(row)
1 def grouping(row):
----> 2 if row['ALT'] == row['base']:
3 val = "same_variants"
4 elif row['ALT'] != row['base']:
5 val= "diff_variants"
~/software/anaconda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
1066 key = com.apply_if_callable(key, self)
1067 try:
-> 1068 result = self.index.get_value(self, key)
1069
1070 if not is_scalar(result):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4728 k = self._convert_scalar_indexer(k, kind="getitem")
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: ('ALT', 'occurred at index ref_loc')
I would appreciate any suggestions on how to proceed.
In the end, the output should look like this:
ref_loc ref_chr REF ALT coverage base cate
9532728 21 G [A] 1 A same_variants
9540473 21 C [G] 2 G same_variants
9540473 21 CTATT [C] 2 G diff_variants
9540794 21 C [T] 1 A diff_variants
9542965 21 C [A] 1 T diff_variants
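Note that because the values in the ALT column are wrapped in square brackets, they will always compare as different from base. You can first extract the content inside the brackets: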
df["ALT"] = df.ALT.apply(lambda l: l[0])
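The line above assumes each ALT value is a Python list such as ['A']. If the column instead holds strings like "[A]" (for example after reading a CSV), l[0] would return the "[" character. A minimal sketch that handles both cases follows; the helper name first_alt and the sample values are hypothetical and not part of the original question.

```python
import pandas as pd

def first_alt(value):
    """Return the first ALT allele from a list or from a string like "[A]".

    Hypothetical helper: assumes alleles are comma-separated inside the brackets.
    """
    if isinstance(value, (list, tuple)):
        return value[0]
    text = str(value).strip()
    if text.startswith("[") and text.endswith("]"):
        # Drop the brackets, take the first comma-separated item,
        # then remove surrounding whitespace and quotes.
        inner = text[1:-1]
        return inner.split(",")[0].strip().strip("'\"")
    return text

# Illustrative data mixing real lists and bracketed strings
df = pd.DataFrame({"ALT": [["A"], ["G"], "[C]", "['T']"]})
df["ALT"] = df["ALT"].apply(first_alt)
print(df["ALT"].tolist())
```

Checking df["ALT"].map(type).unique() beforehand shows which of the two cases applies to your data.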
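You also need to use axis=1 to iterate over rows; axis=0 iterates over columns. With axis=0 the function receives each column as a Series whose index holds the row labels rather than the column names, so row['ALT'] raises the KeyError shown in the traceback (ref_loc is the first column processed).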
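To see what apply actually passes to the function under each axis, here is a small sketch on a two-row frame. The record helper and the sample values are hypothetical and only serve to illustrate the behavior.

```python
import pandas as pd

df = pd.DataFrame({"ALT": ["A", "G"], "base": ["A", "T"]})

def record(store):
    """Build a function that stores the index of every Series apply passes in."""
    def _inner(series):
        store[series.name] = list(series.index)
        return 0
    return _inner

axis0, axis1 = {}, {}
df.apply(record(axis0), axis=0)  # one call per column
df.apply(record(axis1), axis=1)  # one call per row

print(axis0)  # keys are column names, values are row labels
print(axis1)  # keys are row labels, values are column names

# Looking up a column name on a column Series fails, as in the question
try:
    df.apply(lambda row: row["ALT"] == row["base"], axis=0)
except KeyError as exc:
    print("KeyError:", exc)
```

With axis=1, the original function works unchanged: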
df["cate"] = df.apply(grouping,axis=1)
print(df)
ref_loc ref_chr REF ALT coverage base cate
0 9532728 21 G A 1 A same_variants
1 9540473 21 C G 2 G same_variants
2 9540473 21 CTATT C 2 G diff_variants
3 9540794 21 C T 1 A diff_variants
4 9542965 21 C A 1 T diff_variants
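As an alternative to a row-wise apply, the same column can be built with a vectorized comparison. This is a different technique from the one used above, sketched here with numpy.where; it assumes ALT holds one-element Python lists, and the frame is rebuilt from the sample rows in the question.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; ALT is assumed to hold one-element lists
df = pd.DataFrame({
    "ref_loc": [9532728, 9540473, 9540473, 9540794, 9542965],
    "ref_chr": [21, 21, 21, 21, 21],
    "REF": ["G", "C", "CTATT", "C", "C"],
    "ALT": [["A"], ["G"], ["C"], ["T"], ["A"]],
    "coverage": [1, 2, 2, 1, 1],
    "base": ["A", "G", "G", "A", "T"],
})

# Unwrap the single allele from each list
df["ALT"] = df["ALT"].map(lambda alleles: alleles[0])

# Compare the two columns element-wise and label each row
df["cate"] = np.where(df["ALT"] == df["base"], "same_variants", "diff_variants")
print(df)
```

This avoids calling a Python function once per row, which tends to matter on large variant tables. Note that a missing value in either column compares as unequal and is therefore labeled diff_variants.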