Conditionally replacing grouped values

Posted by Uni 13 in Dev

Uni 13

Suppose I have the following df:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'category': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'price': [1, 2, 3, 10, 2, 3, 4, 20, 1, 10, 1, 4]})
print(df)
    id category  price
0    1        A      1
1    1        A      2
2    1        A      3
3    1        A     10
4    2        B      2
5    2        B      3
6    2        B      4
7    2        B     20
8    3        C      1
9    3        C     10
10   3        C      1
11   3        C      4

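Within each group of rows sharing the same id and category, I want to replace any value of 'price' that fails a condition with the mean of the remaining values in that group. For example, for id 1 and category A, I want to replace 10 with the mean of the other three values (1, 2, 3). I have tried many things, but nothing seems to work. Any suggestions on how to solve this? Thanks.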
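To make the arithmetic in the example concrete, the mean of the other values in a group can be written as (group sum - own value) / (group size - 1). The following is only a minimal sketch under the assumption that every group has at least two rows; `others_mean` is a name chosen here for illustration and does not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'category': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'price': [1, 2, 3, 10, 2, 3, 4, 20, 1, 10, 1, 4]})

grouped = df.groupby(['id', 'category'])['price']

# Mean of the *other* values in the same group:
# (group sum - own value) / (group size - 1).
# Assumption: every group has at least two rows (otherwise this divides by zero).
others_mean = (grouped.transform('sum') - df['price']) / (grouped.transform('count') - 1)

# Row 3 is the price 10 in group (1, A): (16 - 10) / 3 = 2.0, the mean of 1, 2, 3.
print(others_mean[3])
```

This only computes the candidate replacement for every row; deciding which rows should actually be replaced still requires a condition.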

mozway

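Here is a solution that flags outliers relative to the group mean (values more than twice or less than half of it), then replaces them with the mean of the non-outliers in the same group: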

# mean price of each row's (id, category) group
means = df.groupby(['id', 'category'])['price'].transform('mean')
# a price is an outlier if it is more than twice or less than half its group mean
is_outlier = df['price'].gt(2*means) | df['price'].lt(0.5*means)
# set the outliers to NaN, then fill them with the mean of the remaining values per group
df['new_price'] = df['price'].mask(is_outlier)
df['new_price'] = df['new_price'].fillna(df.groupby(['id', 'category'])['new_price'].transform('mean'))

# for debugging only
df['outlier'] = is_outlier

Output:

    id category  price  new_price  outlier
0    1        A      1        2.5     True
1    1        A      2        2.0    False
2    1        A      3        3.0    False
3    1        A     10        2.5     True
4    2        B      2        4.0     True
5    2        B      3        4.0     True
6    2        B      4        4.0    False
7    2        B     20        4.0     True
8    3        C      1        4.0     True
9    3        C     10        4.0     True
10   3        C      1        4.0     True
11   3        C      4        4.0    False
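Note that with this rule the value 1 in group (1, A) is also flagged, because it is below half the group mean of 4. As a hedged sketch, the same idea can be wrapped in a function with adjustable thresholds. The function name `replace_group_outliers` and its parameters `upper` and `lower` are hypothetical and introduced here only for illustration; passing `lower=None` is an assumed convention for disabling the lower bound.

```python
import pandas as pd


def replace_group_outliers(df, keys, col, upper=2.0, lower=0.5):
    """Return `col` with outliers replaced by the mean of the non-outliers
    of their group. A value is an outlier when it is above `upper` times the
    group mean or, unless `lower` is None, below `lower` times the group mean.
    """
    means = df.groupby(keys)[col].transform('mean')
    is_outlier = df[col].gt(upper * means)
    if lower is not None:
        is_outlier = is_outlier | df[col].lt(lower * means)
    kept = df[col].mask(is_outlier)  # outliers become NaN
    # The group mean skips NaN, so it is the mean of the non-outliers only.
    group_mean = kept.groupby([df[k] for k in keys]).transform('mean')
    return kept.fillna(group_mean)


df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'category': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'price': [1, 2, 3, 10, 2, 3, 4, 20, 1, 10, 1, 4]})

# Same thresholds as above: reproduces the new_price column.
both_bounds = replace_group_outliers(df, ['id', 'category'], 'price')

# Upper bound only: in group (1, A) only 10 is replaced, by the mean of 1, 2, 3.
upper_only = replace_group_outliers(df, ['id', 'category'], 'price', lower=None)
print(upper_only.tolist())
```

With `lower=None`, the 10 in group (1, A) becomes 2.0, which matches the example in the question. Which thresholds are appropriate depends on the data.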
