我有一个MultiIndex Pandas DataFrame,看起来像下面的样子:
import pandas as pd
import numpy as np
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
我要为至少其中一个样本(在这种情况下为Sample1或Sample2)具有DP> = 50&GQ <4的任何行过滤df。在这些情况下,除第一行外,所有行均应过滤掉。
我不知道从哪里开始,不胜感激。
编辑:
感谢jezrael的帖子,我找到了解决方案。代码如下:
genotype = df.ix[:,3:]
DP = genotype.xs('DP', axis=1, level=1)
GQ = genotype.xs('GQ', axis=1, level=1)
conditions = (DP.ge(50) & GQ.le(4)).T.any()
df = df[conditions]
return df
我认为您可以使用:
#data in sample change for matching (first 99 in HQ in Sample1 was changed to 3)
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
print (df)
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
0 12 100 0/1 120,60 180 5 0.50 0/1 200,2 202 99 0.01
1 23 200 0/1 200,20 60 3 0.10 0/1 200,50 250 99 0.40
2 40 40 0/1 200,2 202 99 0.01 0/1 200,2 202 99 0.01
df1 = df.filter(like='Sample')
df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(50) &
df1.xs('GQ', axis=1, level=1).lt(4)).any(1).idxmax()]]
print (df)
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
1 23 200 0/1 200,20 60 3 0.1 0/1 200,50 250 99 0.4
编辑:
如果需要按条件返回所有行,请删除loc
和idmax
:
df1 = df.filter(like='Sample')
#changed condition to lt(10) (<10)
df = df[(df1.xs('DP', axis=1, level=1).ge(50) & df1.xs('GQ', axis=1, level=1).lt(10)).any(1)]
print (df)
INFO Sample1 Sample2
AC DEPTH GT AD DP GQ AB GT AD DP GQ AB
0 12 100 0/1 120,60 180 5 0.5 0/1 200,2 202 99 0.01
1 23 200 0/1 200,20 60 3 0.1 0/1 200,50 250 99 0.40
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句