带有多级列的Pandas Groupby

席德

我想知道如何转换表并获得期望的结果:

我的样本数据集:

df=pd.DataFrame({
    "ID":[111,111,111,111,222,222,222,333,333],
    "Section":["CS01","CS01","IT01","IT01","CS02","CS02","CS02","HS01","HS01"],
    "Subject":["Hist","Pol","Pol","Arts","Pol","Hist","Arts","Pol","Hist"],
    "Activity":["Quiz 1","Quiz 2","Quiz 3","Quiz 1","Quiz 2","Quiz 3","Quiz 1","Quiz 2","Quiz 3"],
    "Pass":[1,0,0,1,1,1,0,1,0],
    })

看起来像什么:

    ID      Section     Subject     Activity    Pass
0   111     CS01        Hist        Quiz 1      1
1   111     CS01        Pol         Quiz 2      0
2   111     IT01        Pol         Quiz 3      0
3   111     IT01        Arts        Quiz 1      1
4   222     CS02        Pol         Quiz 2      1
5   222     CS02        Hist        Quiz 3      1
6   222     CS02        Arts        Quiz 1      0
7   333     HS01        Pol         Quiz 2      1
8   333     HS01        Hist        Quiz 3      0

我正在尝试做的是:

ID  Section Subject Quiz 1      Quiz 2      Quiz 3      
                    0   1   NA  0   1   NA  0   1   NA
111 CS01    Hist    0   1   0   0   0   1   0   0   1
111 CS01    Pol     0   0   1   1   0   0   0   0   1
111 IT01    Arts    0   1   0   0   0   1   0   0   1
111 IT01    Pol     0   0   1   0   0   1   1   0   0
222 CS02    Arts    1   0   0   0   0   0   0   0   0
222 CS02    Hist    0   0   1   0   0   1   0   1   0
222 CS02    Pol     0   0   1   0   1   0   0   0   1
333 HS01    Hist    0   0   1   0   0   1   1   0   0
333 HS01    Pol     0   0   1   0   1   0   0   0   1

我想要的是将“主题”列设置为2级,将其“通过”列设置为1级,并使用“ NA”列。

到目前为止,我仅有的是:

df.groupby(["ID","Section", "Subject","Activity"])["Pass"].value_counts().unstack().fillna(0)

但这没有级别2的“ NA”列和“活动”

耶斯列尔

想法是首先Series.reindex使用MultiIndex.from_product创建所有可能的组合,然后在value_countsMultiIndexdropna=False应用您的解决方案

s = df.set_index(["ID","Section", "Subject","Activity"])["Pass"] 
df = (s.reindex(pd.MultiIndex.from_product(s.index.levels))
       .groupby(level=[0,1,2,3])
       .value_counts(dropna=False)
       .unstack([3,4], fill_value=0)
       .sort_index(axis=1))

print (df)
Activity            Quiz 1         Quiz 2         Quiz 3        
Pass                   0.0 1.0 NaN    0.0 1.0 NaN    0.0 1.0 NaN
ID  Section Subject                                             
111 CS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   1   0      0   0   1      0   0   1
            Pol          0   0   1      1   0   0      0   0   1
    CS02    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    HS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    IT01    Arts         0   1   0      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      1   0   0
222 CS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    CS02    Arts         1   0   0      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   1   0
            Pol          0   0   1      0   1   0      0   0   1
    HS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    IT01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
333 CS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    CS02    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    HS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      1   0   0
            Pol          0   0   1      0   1   0      0   0   1
    IT01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
            

编辑:使用重复的解决方案:

df=pd.DataFrame({
    "ID":[111,111,111,111,222,222,222,333,333],
    "Section":["CS01","CS01","IT01","IT01","CS02","CS02","CS02","HS01","HS01"],
    "Subject":["Hist","Pol","Pol","Arts","Pol","Hist","Arts","Pol","Hist"],
    "Activity":["Quiz 1","Quiz 2","Quiz 3","Quiz 1","Quiz 2","Quiz 3","Quiz 1","Quiz 2","Quiz 3"],
    "Pass":[1,0,0,1,1,1,0,1,0],
    })

df = pd.concat([df, df.head()])
print (df)
    ID Section Subject Activity  Pass
0  111    CS01    Hist   Quiz 1     1
1  111    CS01     Pol   Quiz 2     0
2  111    IT01     Pol   Quiz 3     0
3  111    IT01    Arts   Quiz 1     1
4  222    CS02     Pol   Quiz 2     1
5  222    CS02    Hist   Quiz 3     1
6  222    CS02    Arts   Quiz 1     0
7  333    HS01     Pol   Quiz 2     1
8  333    HS01    Hist   Quiz 3     0
0  111    CS01    Hist   Quiz 1     1 <- duplicates
1  111    CS01     Pol   Quiz 2     0 <- duplicates
2  111    IT01     Pol   Quiz 3     0 <- duplicates
3  111    IT01    Arts   Quiz 1     1 <- duplicates
4  222    CS02     Pol   Quiz 2     1 <- duplicates

首先使用SeriesGroupBy.value_counts由和重塑ONY去年的水平Series.unstack,加上所有可能的组合的levels通过DataFrame.reindex,并添加列NaN由实心1如果所有值都0在通过测试的两列DataFrame.eqDataFrame.all,最后unstackMultiIndex列,层次和形式的变化顺序MultiIndex

df1 = (df.groupby(["ID","Section", "Subject","Activity"])["Pass"]
         .value_counts()
         .unstack(fill_value=0))

df1 = df1.reindex(pd.MultiIndex.from_product(df1.index.levels), fill_value=0)
df1[np.nan] = df1.eq(0).all(axis=1).view('i1')
df1 = df1.unstack().swaplevel(1,0, axis=1).sort_index(axis=1)

print (df1)
Activity            Quiz 1         Quiz 2         Quiz 3        
Pass                   0.0 1.0 NaN    0.0 1.0 NaN    0.0 1.0 NaN
ID  Section Subject                                             
111 CS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   2   0      0   0   1      0   0   1
            Pol          0   0   1      2   0   0      0   0   1
    CS02    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    HS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    IT01    Arts         0   2   0      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      2   0   0
222 CS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    CS02    Arts         1   0   0      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   1   0
            Pol          0   0   1      0   2   0      0   0   1
    HS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    IT01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
333 CS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    CS02    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1
    HS01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      1   0   0
            Pol          0   0   1      0   1   0      0   0   1
    IT01    Arts         0   0   1      0   0   1      0   0   1
            Hist         0   0   1      0   0   1      0   0   1
            Pol          0   0   1      0   0   1      0   0   1

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章

数据分组具有多级列索引的Pandas groupby

带有q列表的多级列MultiIndex groupby对象上的熊猫分位数

带有布尔OR的Pandas groupby

将groupby中的某些列转换为pandas中的多级

带有分类列的Pandas Groupby返回NaN

带有2个列的Groupby-“ pandas.core.groupby.generic.DataFrameGroupBy”终端响应

带有lambda参数的pandas groupby

带有 lambda 和条件的 Pandas groupby

带有 bin sum 聚合的 Pandas Groupby

带有 bin 的 Pandas groupby 值

将Pandas groupby转换为具有带有布尔值列的数据框

Groupby的多级索引

Pandas Groupby比较Excel中带有子行的2列中的计数相等值

带有自定义功能的Pandas groupby以数组形式返回列值

使用groupby和mean()在Pandas中保留带有分类变量的列

带有多个列的Pandas v 0.25 groupby提供了内存错误

带有自定义聚合函数的pandas groupby()并将结果放入新列中

从Pandas groupBy到PySpark groupBy

带有冗余Nan的类别的Pandas groupby

带有过滤器的Pandas groupby()transform()max()

带有 agg() nth() 和/或 iloc() 的 Python Pandas groupby

带有groupby的pandas数据框滚动窗口

Pandas GroupBy:应用带有两个参数的函数

带有Agg最小/最大日期的Pandas Groupby

带有dict.update()的Pandas groupby函数

列表理解中带有lambda的Pandas groupby()。agg()

带有cumsum的pandas groupby不会建立索引

带有误差范围的 Pandas Groupby 整数

带有自定义聚合函数的pandas groupby()可以使用pandas连接列和行