Python Pandas - Faster way to iterate through categories in data while removing outliers (without a for loop)

Posted by edutt in Dev

edutt

Suppose I have a DataFrame like this:

import pandas as pd
import numpy as np

data = [[5123, '2021-01-01 00:00:00', 'cash','sales$', 105],
        [5123, '2021-01-01 00:00:00', 'cash','items', 20],
        [5123, '2021-01-01 00:00:00', 'card','sales$', 190],
        [5123, '2021-01-01 00:00:00', 'card','items', 40],
        [5123, '2021-01-02 00:00:00', 'cash','sales$', 75],
        [5123, '2021-01-02 00:00:00', 'cash','items', 10],
        [5123, '2021-01-02 00:00:00', 'card','sales$', 170],
        [5123, '2021-01-02 00:00:00', 'card','items', 35],
        [5123, '2021-01-03 00:00:00', 'cash','sales$', 1000],
        [5123, '2021-01-03 00:00:00', 'cash','items', 500],
        [5123, '2021-01-03 00:00:00', 'card','sales$', 150],
        [5123, '2021-01-03 00:00:00', 'card','items', 20]]

columns = ['Store', 'Date', 'Payment Method', 'Attribute', 'Value']

df = pd.DataFrame(data = data, columns = columns)

Store	Date	Payment Method	Attribute	Value
5123	2021-01-01 00:00:00	cash	sales$	105
5123	2021-01-01 00:00:00	cash	items	20
5123	2021-01-01 00:00:00	card	sales$	190
5123	2021-01-01 00:00:00	card	items	40
5123	2021-01-02 00:00:00	cash	sales$	75
5123	2021-01-02 00:00:00	cash	items	10
5123	2021-01-02 00:00:00	card	sales$	170
5123	2021-01-02 00:00:00	card	items	35
5123	2021-01-03 00:00:00	cash	sales$	1000
5123	2021-01-03 00:00:00	cash	items	500
5123	2021-01-03 00:00:00	card	sales$	150
5123	2021-01-03 00:00:00	card	items	20

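I want to filter out outliers and replace them with the average of the previous two days. My "outlier rule" is: if a value for a given attribute/payment method is more than twice, or less than half, the average of that attribute/payment method over the previous two days, replace that outlier with the average of the previous two days. Otherwise, keep the value. In this example, every value should be kept except the $1000 in sales and the 500 items for 5123/'2021-01-03'/'cash'. Those two values should be replaced with $90 in sales and 15 items.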
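As a minimal sketch of the rule above, applied to a single series (the cash sales$ values from the example table). The variable names are illustrative and not part of the original post, and the rows are assumed to be in date order:

```python
import pandas as pd

# Cash sales$ for store 5123 on 2021-01-01, 2021-01-02 and 2021-01-03.
values = pd.Series([105.0, 75.0, 1000.0])

# Mean of the two preceding days; NaN until two earlier days exist.
prev_mean = values.rolling(2).mean().shift()

# Flag values more than double, or less than half, the preceding mean.
# Comparisons against NaN are False, so the first two days are never flagged.
is_outlier = (values > 2 * prev_mean) | (values < 0.5 * prev_mean)

# Replace flagged values with the preceding mean and keep everything else.
cleaned = values.mask(is_outlier, prev_mean)
print(cleaned.tolist())  # [105.0, 75.0, 90.0]
```

Here the third day is flagged because 1000 exceeds twice the preceding mean of 90.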

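Here is my attempt (it uses for loops, and it doesn't work). Whenever I find myself using loops together with Pandas, a red flag goes up in my head. What is the right way to do this?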

stores = df['Store'].unique()
payment_methods = df['Payment Method'].unique()
attributes = df['Attribute'].unique()

df_no_outliers = pd.DataFrame()

for store in stores:
    for payment_method in payment_methods:
        for attribute in attributes:

            df_temp = df.loc[df['Store'] == store]
            df_temp = df_temp.loc[df_temp['Payment Method'] == payment_method]
            df_temp = df_temp.loc[df_temp['Attribute'] == attribute]

            df_temp['Value'] = np.where(df_temp['Value'] <= (df_temp['Value'].shift(-1)
                                                                +df_temp['Value'].shift(-2))*2/2,
                                         df_temp['Value'],
                                        (df_temp['Value'].shift(-1)+df_temp['Value'].shift(-2))/2)

            df_temp['Value'] = np.where(df_temp['Value'] >= (df_temp['Value'].shift(-1)
                                                                +df_temp['Value'].shift(-2))*0.5/2,
                                         df_temp['Value'],
                                        (df_temp['Value'].shift(-1)+df_temp['Value'].shift(-2))/2)


            df_no_outliers = df_no_outliers.append(df_temp)

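In case anyone is curious why I'm using this rolling-average approach instead of something like Tukey's method, which cuts off data more than 1.5*IQR below Q1 or above Q3: my data is a time series that spans COVID, so the IQR is very large (sales were high before COVID, followed by a deep trough of low sales), and the IQR ends up filtering nothing. I don't want to remove the COVID drop, only some bad data-entry errors (some stores are bad about this and may enter a few extra zeros on some days...). Rather than using the last two days as the rolling filter, I will probably end up using 5 or 7 days (a week). I'm also open to other approaches to cleaning/outlier removal.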
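A note on the loop attempt above, offered as a likely diagnosis rather than a definitive one: `shift(-1)` and `shift(-2)` pull values from later rows, so the mean being compared looks forward instead of back, and any comparison against `NaN` is `False`, which sends rows without a full window into the replacement branch. Separately, `DataFrame.append` was removed in pandas 2.0, so the last line of the loop would need `pd.concat` on current versions. The sketch below uses only the cash sales$ values from the example:

```python
import numpy as np
import pandas as pd

# Cash sales$ values in date order, from the question's example.
s = pd.Series([105.0, 75.0, 1000.0])

# Negative shifts pull values from LATER rows: a forward-looking mean.
forward_mean = (s.shift(-1) + s.shift(-2)) / 2

# Positive shifts pull values from EARLIER rows, as the rule describes.
backward_mean = (s.shift(1) + s.shift(2)) / 2

print(forward_mean.tolist())   # [537.5, nan, nan]
print(backward_mean.tolist())  # [nan, nan, 90.0]

# The first np.where from the attempt: rows whose mean is NaN fail the
# comparison and are overwritten with that NaN.
attempt = np.where(s <= 2 * forward_mean, s, forward_mean)
print(attempt)  # first value kept, the other two become nan
```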

not_speshal

Try:

#groupby the required columns and compute the rolling 2-day average
average = (df.groupby(["Store","Payment Method","Attribute"], as_index=False)
           .apply(lambda x: x["Value"].rolling(2).mean().shift())
           .droplevel(0).sort_index()
           )

#divide values by the average and keep only those ratios that fall between 0.5 and 2
df['Value'] = df["Value"].where(df["Value"].div(average).fillna(1).between(0.5,2),average)
>>> df
    Store                 Date Payment Method Attribute  Value
0    5123  2021-01-01 00:00:00           cash    sales$  105.0
1    5123  2021-01-01 00:00:00           cash     items   20.0
2    5123  2021-01-01 00:00:00           card    sales$  190.0
3    5123  2021-01-01 00:00:00           card     items   40.0
4    5123  2021-01-02 00:00:00           cash    sales$   75.0
5    5123  2021-01-02 00:00:00           cash     items   10.0
6    5123  2021-01-02 00:00:00           card    sales$  170.0
7    5123  2021-01-02 00:00:00           card     items   35.0
8    5123  2021-01-03 00:00:00           cash    sales$   90.0
9    5123  2021-01-03 00:00:00           cash     items   15.0
10   5123  2021-01-03 00:00:00           card    sales$  150.0
11   5123  2021-01-03 00:00:00           card     items   20.0
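Since the question mentions possibly moving to a 5- or 7-day window, the same idea can be sketched with the window as a parameter. This is a variation on the answer above, not the answerer's code: `replace_outliers` and its parameters are hypothetical names, it uses `groupby(...).transform` instead of `apply`, and it assumes the rows of each group are already in date order.

```python
import pandas as pd

def replace_outliers(df, keys, value_col="Value", window=2, low=0.5, high=2.0):
    """Hypothetical helper: replace values whose ratio to the trailing mean
    of the previous `window` rows (per group) falls outside [low, high]."""
    # Trailing mean per group, shifted so a row never feeds its own baseline.
    baseline = df.groupby(keys)[value_col].transform(
        lambda s: s.rolling(window).mean().shift()
    )
    ratio = df[value_col] / baseline
    # Rows without a full baseline (NaN ratio) are kept unchanged.
    keep = ratio.isna() | ratio.between(low, high)
    return df.assign(**{value_col: df[value_col].where(keep, baseline)})

# Rebuild the example data from the question.
dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
series = {
    ("cash", "sales$"): [105, 75, 1000],
    ("cash", "items"): [20, 10, 500],
    ("card", "sales$"): [190, 170, 150],
    ("card", "items"): [40, 35, 20],
}
rows = [
    (5123, date, method, attribute, values[i])
    for i, date in enumerate(dates)
    for (method, attribute), values in series.items()
]
df = pd.DataFrame(rows, columns=["Store", "Date", "Payment Method", "Attribute", "Value"])

cleaned = replace_outliers(df, ["Store", "Payment Method", "Attribute"], window=2)
print(cleaned)
```

With `window=2` this reproduces the table above. One caveat of this design: the baseline is computed from the raw values, so an outlier still inflates the trailing mean used for the rows that follow it.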


Edited on 2021-12-7
