熊猫：通过Groupby滚动时间加权移动平均线

Thor 发表于 Dev

雷神

我有以下客户销售历史记录的数据框（这只是其中一部分，实际数据框超过7万行）：

import pandas as pd
import datetime as DT

df_test = pd.DataFrame({
    'Cus_ID': ["T313","T348","T313","T348","T313","T348","T329","T329","T348","T313","T329","T348"],
    'Value': [3,2,3,4,5,3,7.25,10.25,4.5,11.75,6.25,6],
    'Date' : [
        DT.datetime(2015,10,18),
        DT.datetime(2015,11,14),
        DT.datetime(2015,11,18),
        DT.datetime(2015,12,13),
        DT.datetime(2015,12,19),
        DT.datetime(2016,1,24),
        DT.datetime(2016,1,31),
        DT.datetime(2016,2,17),
        DT.datetime(2016,3,28),
        DT.datetime(2016,3,31),
        DT.datetime(2016,4,3),            
        DT.datetime(2016,4,16),            
    ]})

我想在数据框中添加新列，以显示该客户过去90天的时间加权平均值的结果。

预期结果（第列Value_Result）：

     Cus_ID   Date    Value    Value_Result   
0    T313 2015-10-18   3.00          NaN      (No 90days history)
1    T348 2015-11-14   2.00          NaN      (No 90days history)
2    T313 2015-11-18   3.00            3      (3*31)/31
3    T348 2015-12-13   4.00            2      (2*29)/29
4    T313 2015-12-19   5.00            3      (3*62+3*31)/(62+31)
5    T348 2016-01-24   3.00        2.743      (4*42+2*71)/(42+71)
6    T329 2016-01-31   7.25          NaN      (No 90days history)
7    T329 2016-02-17  10.25         7.25      (7.25*17)/17
8    T348 2016-03-28   4.50            3      (3*64)/64
9    T313 2016-03-31  11.75          NaN      (No 90days history)
10   T329 2016-04-03   6.25        8.516      (10.25*46+7.25*63)/(46+63)
11   T348 2016-04-16   6.00        3.279      (4.5*19+3*83)/(19+83)

我尝试使用groupby('Cus_ID')滚动应用，但是我很难编写仅考虑倒退90天的函数。

任何输入高度赞赏。

乔希

我不确定滚动功能是否将与加权平均值一起使用，尽管也许其他人知道如何使用它，但我不能保证这将是最优化的方法，但会产生您想要的结果，您可以根据需要进行构建。

非常感谢这篇pbpython文章。我建议通读它。

我的方法是创建一个将应用于组（按Cus_ID分组）的函数。该函数将遍历该组中的行，并按照上述方法进行加权平均，然后将其应用于该组并返回该组。为了清楚说明起见，此代码段很冗长，您可以通过删除所有创建的变量（如果需要）来缩小代码段。

apply函数如下所示

def tw_avg(group, value_col, time_col, new_col_name="time_weighted_average", days_back='-90 days', fill_value=np.nan):
"""
Will calculate the weighted (by day) time average of the group passed.
It will not operate on the day it is evaulating but the previous days_back.
Should be used with the apply() function in Pandas with groupby function


Args:
    group (pandas.DataFrame): Will be passed by pandas
    value_col (str): Name of column with value to be averaged by weight
    time_col (str): Name of column of with times in them
    new_col_name (str): Name of new column to place time weighted average into, default: time_weighted_average
    days_back (str): Time delta description as described in panda time deltas documentation, default: -90 days
    fill_value (any): The value to fill rows which do not have data in days_back period, default: np.nan

Returns:
    (pandas.DataFrame): The modified DataFrame with time weighted average added to columns, np.nan if no
    time weight average exist
"""
for idx, row in group.iterrows():
    # Filter for only values that are days_back for averaging.
    days_back_fil = (group[time_col] < row[time_col]) & (group[time_col] >= row[time_col] + pd.Timedelta(days_back))
    df = group[days_back_fil] 

    df['days-back'] = (row[time_col] - df[time_col]) / np.timedelta64(1, 'D') # need to divide by np.timedelta day to get number back
    df['weight'] = df[value_col] * df['days-back']

    try:
        df['tw_avg'] = df['weight'].sum() / df['days-back'].sum()
        time_avg = df['tw_avg'].iloc[0] # Get single value of the tw_avg
        group.loc[idx, new_col_name] = time_avg
    except ZeroDivisionError:
        group.loc[idx, new_col_name] = fill_value     

return group

然后，您可以在此行中返回要查找的DataFrame

df_test.groupby(by=['Cus_ID']).apply(tw_avg, 'Value', 'Date')

这将产生，

    Cus_ID  Date        Value  time_weighted_average
0   T313    2015-10-18  3.0    NaN
1   T348    2015-11-14  2.0    NaN
2   T313    2015-11-18  3.0    3.0
3   T348    2015-12-13  4.0    2.0
4   T313    2015-12-19  5.0    3.0
5   T348    2016-01-24  3.0    2.743362831858407
6   T329    2016-01-31  7.25   NaN  
7   T329    2016-02-17  10.25  7.25
8   T348    2016-03-28  4.5    3.0
9   T313    2016-03-31  11.75  NaN  
10  T329    2016-04-03  6.25   8.51605504587156
11  T348    2016-04-16  6.0    3.2794117647058822

现在，您可以使用该函数将带value_col参数的加权平均值应用于其他值列，或带参数更改时间窗口长度days_back。有关如何描述时间增量的信息，请参见熊猫时间增量页面。

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。

编辑于 2020-11-4

我来说两句

0 条评论

登录后参与评论

上一篇：诊断2013与MySQL的连接断开

熊猫：通过Groupby滚动时间加权移动平均线

熊猫：通过Groupby滚动时间加权移动平均线

UITableView的项目向下滚动后更改颜色，然后快速备份

Linux的官方Adobe Flash存储库是否已过时？

用日期数据透视表和日期顺序查询

应用发明者仅从列表中选择一个随机项一次

Mac OS X更新后的GRUB 2问题

验证REST API参数

Java Eclipse中的错误13，如何解决？

带有错误“ where”条件的查询如何返回结果？

ggplot：对齐多个分面图-所有大小不同的分面

尝试反复更改屏幕上按钮的位置 - kotlin android studio

如何从视图一次更新多行（ASP.NET - Core）

计算数据帧中每行的NA

蓝屏死机没有修复解决方案

在 Python 2.7 中。如何从文件中读取特定文本并分配给变量

离子动态工具栏背景色

VB.net将2条特定行导出到DataGridView

通过 Git 在运行 Jenkins 作业时获取 ClassNotFoundException

在Windows 7中无法删除文件（2）

python中的boto3文件上传

当我尝试下载 StanfordNLP en 模型时，出现错误

Node.js中未捕获的异常错误，发生调用