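My Python script produces the following DataFrame. I want to add another column holding the number of rows per PMID, with the count shown on the first row of each PMID and the other rows kept as they are.
The DataFrame looks like this: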
df
PMID gene_symbol gene_label gene_mentions
0 33377242 MTHFR Matched Gene 2
1 33414971 CSF3R Matched Gene 13
2 33414971 BCR Other Gene 2
3 33414971 ABL1 Matched Gene 1
4 33414971 ESR1 Matched Gene 1
5 33414971 NDUFB3 Other Gene 1
6 33414971 CSF3 Other Gene 1
7 33414971 TP53 Matched Gene 2
8 33414971 SRC Matched Gene 1
9 33414971 JAK1 Matched Gene 1
The expected result is:
PMID gene_symbol gene_label gene_mentions count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 9
2 33414971 BCR Other Gene 2 9
3 33414971 ABL1 Matched Gene 1 9
4 33414971 ESR1 Matched Gene 1 9
5 33414971 NDUFB3 Other Gene 1 9
6 33414971 CSF3 Other Gene 1 9
7 33414971 TP53 Matched Gene 2 9
8 33414971 SRC Matched Gene 1 9
9 33414971 JAK1 Matched Gene 1 9
10 33414972 MAK2 Matched Gene 1 1
How can I get this output?
Thanks
You can use groupby().transform.
The following command adds the per-PMID count to every row:
df['count'] = df.groupby('PMID')['PMID'].transform('size')
Output:
PMID gene_symbol gene_label gene_mentions count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 9
2 33414971 BCR Other Gene 2 9
3 33414971 ABL1 Matched Gene 1 9
4 33414971 ESR1 Matched Gene 1 9
5 33414971 NDUFB3 Other Gene 1 9
6 33414971 CSF3 Other Gene 1 9
7 33414971 TP53 Matched Gene 2 9
8 33414971 SRC Matched Gene 1 9
9 33414971 JAK1 Matched Gene 1 9
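For anyone who wants to try this out, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the ten rows shown in the question (the construction code is an assumption, since the original script that produced df is not shown); the groupby line is the one from the answer.

```python
import pandas as pd

# Rebuild the ten rows from the question by hand (assumed construction;
# the original script that produced df is not shown).
df = pd.DataFrame({
    "PMID": [33377242] + [33414971] * 9,
    "gene_symbol": ["MTHFR", "CSF3R", "BCR", "ABL1", "ESR1",
                    "NDUFB3", "CSF3", "TP53", "SRC", "JAK1"],
    "gene_label": ["Matched Gene", "Matched Gene", "Other Gene",
                   "Matched Gene", "Matched Gene", "Other Gene",
                   "Other Gene", "Matched Gene", "Matched Gene",
                   "Matched Gene"],
    "gene_mentions": [2, 13, 2, 1, 1, 1, 1, 2, 1, 1],
})

# transform('size') computes the group size and broadcasts it back
# onto every row of that group, so the result aligns with df's index.
df["count"] = df.groupby("PMID")["PMID"].transform("size")

print(df)
```

Unlike a plain groupby().size(), which returns one value per group, transform returns a Series of the same length as df, which is why it can be assigned directly as a new column.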
Now, if you really want the count only on the first row of each PMID, you can use mask:
df['count'] = df['count'].mask(df['PMID'].duplicated())
You will then have:
PMID gene_symbol gene_label gene_mentions count
0 33377242 MTHFR Matched Gene 2 1.0
1 33414971 CSF3R Matched Gene 13 9.0
2 33414971 BCR Other Gene 2 NaN
3 33414971 ABL1 Matched Gene 1 NaN
4 33414971 ESR1 Matched Gene 1 NaN
5 33414971 NDUFB3 Other Gene 1 NaN
6 33414971 CSF3 Other Gene 1 NaN
7 33414971 TP53 Matched Gene 2 NaN
8 33414971 SRC Matched Gene 1 NaN
9 33414971 JAK1 Matched Gene 1 NaN
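Note that the counts above are displayed as 1.0 and 9.0: masking introduces NaN, which forces the column to a float dtype. If integer counts are preferred, one possible variation (not part of the original answer) is to convert to pandas' nullable Int64 dtype before masking, so the masked rows hold <NA> instead. A minimal sketch, using only the PMID column for brevity:

```python
import pandas as pd

# Only the PMID column is needed to illustrate the idea.
df = pd.DataFrame({"PMID": [33377242] + [33414971] * 9})

# Per-PMID row count, broadcast to every row.
counts = df.groupby("PMID")["PMID"].transform("size")

# duplicated() is False for the first occurrence of each PMID and True
# for every later one; mask() blanks out the rows where it is True.
# Casting to the nullable "Int64" dtype first keeps the values integers.
df["count"] = counts.astype("Int64").mask(df["PMID"].duplicated())

print(df)
```

Keep in mind that duplicated() looks at the whole column, so a PMID that reappears later in the frame, even non-contiguously, is still treated as a repeat.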