Group and average dataframe rows based on a condition

米洛什德拉戈

I have the following dataframe:

Company_ID  Year   Metric_1  Metric_2  Bankrupt
1           2010   10        20        0.0
1           2011   NaN       30        0.0
1           2012   30        40        0.0
1           2013   50        NaN       1.0
2           2012   50        60        0.0
2           2013   60        NaN       0.0
2           2014   10        10        0.0
3           2011   100       100       1.0
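
For anyone who wants to reproduce this, the dataframe above could be built roughly as follows. This is only a sketch: the float values in the metric columns are an assumption that follows from the NaN entries.

```python
import numpy as np
import pandas as pd

# sample data copied from the table above; NaN marks the missing values
df = pd.DataFrame({
    'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
    'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    'Metric_1': [10, np.nan, 30, 50, 50, 60, 10, 100],
    'Metric_2': [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    'Bankrupt': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})
print(df)
```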

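For each company, I want to average all metrics over all years except the last year. It should average only the values that are present and ignore missing values (NaN). Also, it should not average the Bankrupt column.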

So the output should look like this:

Company_ID  Year        Metric_1  Metric_2  Bankrupt
1           2010-2012   20        30        0.0
1           2013        50        NaN       1.0
2           2012-2013   55        60        0.0
2           2014        10        10        0.0
3           2011        100       100       1.0

Thanks for your help.

我想要一片T骨牛排

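This approach is similar to @Stef's, but I am keeping it because it works for any number of Metric columns (as long as their names contain Metric, which is what filter(like='Metric') matches). If you end up using this solution, please accept their answer instead.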
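
To illustrate the column selection, here is a small sketch on a made-up frame. The names `demo`, `Metric_3` and `Old_Metric` are hypothetical and only there to show the difference between a substring match and a match anchored at the start of the name.

```python
import pandas as pd

# hypothetical one-row frame with three metric columns and one look-alike name
demo = pd.DataFrame({'Company_ID': [1], 'Metric_1': [10.0], 'Metric_2': [20.0],
                     'Metric_3': [30.0], 'Old_Metric': [5.0], 'Bankrupt': [0.0]})

# like= keeps every column whose name contains the substring
contains = demo.filter(like='Metric').columns.tolist()
# regex= with ^ keeps only the columns whose name starts with it
starts = demo.filter(regex='^Metric').columns.tolist()

print(contains)
print(starts)
```

If other columns might contain the word Metric somewhere in their name, the regex form is probably the safer choice.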

You can do it like this:

import pandas as pd

# mask that is True on each company's last year
m = df.groupby('Company_ID')['Year'].transform('max').eq(df['Year'])
# group each company's rows, leaving out its last year
gr = df[~m].groupby('Company_ID')

# aggregate the earlier years; adjust the aggregations to your needs
averaged = pd.concat([gr.agg(Bankrupt=('Bankrupt', 'first'),  # I'm not sure which value you want here
                             Year=('Year', lambda x: f'{x.min()}-{x.max()}')),
                      gr[df.filter(like='Metric').columns].mean()],  # mean skips NaN by default
                     axis=1).reset_index()

df_ = (pd.concat([averaged, df[m]])  # add the last-year rows back
         .sort_values('Company_ID', kind='mergesort')  # stable sort keeps the averaged row first
         .reset_index(drop=True)
      )
print(df_)
   Company_ID  Bankrupt       Year  Metric_1  Metric_2
0           1       0.0  2010-2012      20.0      30.0
1           1       1.0       2013      50.0       NaN
2           2       0.0  2012-2013      55.0      60.0
3           2       0.0       2014      10.0      10.0
4           3       1.0       2011     100.0     100.0
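
To see what the mask does, it may help to look at it on its own. The sketch below uses only the two columns the mask depends on, copied from the question; the helper column names `last_year` and `is_last` are made up for display.

```python
import pandas as pd

# only the two columns the mask depends on, copied from the question
df = pd.DataFrame({'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
                   'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011]})

# transform('max') broadcasts each company's latest year back onto every row
last_year = df.groupby('Company_ID')['Year'].transform('max')
# True where the row's year is that company's latest year
m = last_year.eq(df['Year'])

print(df.assign(last_year=last_year, is_last=m))
```

Company 3 has a single row, so that row is its last year and nothing is left to average for it.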

Another version, which avoids appending the last-year rows and calling sort_values, uses a different lambda function for the Year column:

# mask that is True on each company's last year (same as before)
m = df.groupby('Company_ID')['Year'].transform('max').eq(df['Year'])
# group by company and by the mask, instead of using the mask as a filter
gr = df.groupby(['Company_ID', m.rename('is_last')])

df_ = (pd.concat([gr.agg(Bankrupt=('Bankrupt', 'first'),
                         Year=('Year', lambda x: f'{x.min()}-{x.max()}' if x.min() != x.max()
                                                 else x.max())),  # different lambda function
                  gr[df.filter(like='Metric').columns].mean()],
                 axis=1)
         # no more appending or sort_values
         .reset_index()
         .drop(columns='is_last')  # the helper grouping key is no longer needed
      )
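
If you want something you can run end to end, the same idea can be wrapped in a function. This is a sketch under a few assumptions: the function name `average_all_but_last_year` and its parameters are hypothetical, the Year column is always returned as a string so the column has a single type, and Bankrupt takes the first value of each group, which may not be what you need.

```python
import numpy as np
import pandas as pd

def average_all_but_last_year(df, key='Company_ID', year='Year', like='Metric'):
    """Hypothetical helper: average metric columns over all but each group's last year."""
    # True on the row holding each group's latest year
    is_last = df.groupby(key)[year].transform('max').eq(df[year]).rename('is_last')
    # one named aggregation per metric column; mean skips NaN by default
    named = {c: (c, 'mean') for c in df.filter(like=like).columns}
    # a year range for the averaged rows, a single year otherwise (always a string)
    named[year] = (year, lambda x: f'{x.min()}-{x.max()}' if x.min() != x.max()
                   else str(x.max()))
    named['Bankrupt'] = ('Bankrupt', 'first')  # assumption: keep the first value
    out = df.groupby([key, is_last]).agg(**named).reset_index()
    # drop the helper key and restore the original column order
    return out.drop(columns='is_last')[df.columns]

# sample data copied from the question
df = pd.DataFrame({
    'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
    'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    'Metric_1': [10, np.nan, 30, 50, 50, 60, 10, 100],
    'Metric_2': [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    'Bankrupt': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})
result = average_all_but_last_year(df)
print(result)
```

Building the named aggregations from the filtered column list keeps everything in a single agg call, so no concat is needed.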

Edited on 2021-01-24