我有一张这样的表:
colour number letter
0 red one a
1 red two b
2 red two c
3 blue two a
4 blue two b
5 green one a
6 green two b
7 green three c
我做的:
df = pd.DataFrame([
('red', 'one', 'a'),
('red', 'two', 'b'),
('red', 'two', 'c'),
('blue', 'two', 'a'),
('blue', 'two', 'b'),
('green', 'one', 'a'),
('green', 'two', 'b'),
('green', 'three', 'c')
], columns=['colour', 'number', 'letter'])
我想按颜色对表格进行分组,然后为每个剩余的列获取三个最常见的值。如果一列没有三个唯一值,那么最后一个可以重复,也可以是NaN
,两者都可以。输出将如下所示:
colour red blue green
number 1 two two one
2 one two two
3 one two three
letter 1 a a a
2 b b b
3 c b c
或者:
colour red blue green
number 1 two two one
2 one NaN two
3 NaN NaN three
letter 1 a a a
2 b b b
3 c NaN c
我已经为单列做了这个:
df.groupby('colour').number
.value_counts()
.groupby(level=0)
.head(3)
输出:
colour number
blue two 2
green one 1
two 1
three 1
red two 2
one 1
但是,我想对数据框中的所有列执行此操作,并获得与示例类似的输出。我完全被困住了。
尝试:
def fn(x):
return pd.Series(
(x.value_counts().index[:3].tolist() + [np.nan, np.nan])[:3],
index=range(1, 4),
)
out = pd.concat(
[
df.groupby("colour")[col].apply(fn).unstack(level=0).ffill()
for col in df.loc[:, "number":]
],
keys=df.loc[:, "number":],
)
print(out)
打印:
colour blue green red
number 1 two three two
2 two two one
3 two one one
letter 1 b b b
2 a a a
3 a c c
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句