Pandas: adding columns for any column index level to a MultiIndex

Moritz

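I want to add columns for the missing labels of the child level (index = 1) to every parent level (index = 0) of a DataFrame. For a simple DataFrame this works fine: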

import numpy as np
import pandas as pd

index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'], ['ac', 'aac', 'bc', 'ac', 'bc']]
data = np.random.random((4, 5))
df = pd.DataFrame(data=data, index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1']

The DataFrame:

col_name_0        AC                  BC        DC        CC
col_name_1        ac       aac        bc        ac        bc
A a         0.169402  0.899434  0.644941  0.330402  0.805702
B b         0.933743  0.994497  0.060507  0.609129  0.545999
C a         0.064937  0.686350  0.740594  0.985218  0.717699
D b         0.151031  0.932294  0.948751  0.538251  0.085700    

Processing steps:

feature_index = [index for index, item in enumerate(df.columns.names) if item == 'col_name_1'][0]
all_features = df.columns.levels[feature_index].to_list()

for idx, item in df.groupby(level=0, axis=1):
    features = item.columns.get_level_values(1).to_list()
    missing = list(set(all_features) - set(features))
    for m_item in missing:
        df[idx, m_item] = np.nan * np.ones(df.shape[0])

df after processing:

col_name_0        AC                BC      ...  CC            DC              
col_name_1       aac        ac  bc aac  ac  ...  ac        bc aac        ac  bc
A a         0.561247  0.353270 NaN NaN NaN  ... NaN  0.733714 NaN  0.343174 NaN
B b         0.699053  0.696892 NaN NaN NaN  ... NaN  0.144768 NaN  0.267141 NaN
C a         0.624581  0.064629 NaN NaN NaN  ... NaN  0.856559 NaN  0.772735 NaN
D b         0.563903  0.192823 NaN NaN NaN  ... NaN  0.071497 NaN  0.000361 NaN

However, for a DataFrame with more column levels (like the one below), this approach fails:

index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'], ['ac', 'aac', 'bc', 'ac', 'bc'], ['Xc', 'Xc', 'Xc', 'Xc', 'Xc']]
data = np.random.random((4, 5))
df = pd.DataFrame(data=data, index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1', 'col_name_2']

The original DataFrame:

col_name_0        AC                  BC        DC        CC
col_name_1        ac       aac        bc        ac        bc
col_name_2        Xc        Xc        Xc        Xc        Xc
A a         0.317022  0.700635  0.305712  0.934382  0.315501
B b         0.601277  0.726890  0.737907  0.571935  0.716260
C a         0.679046  0.314987  0.846560  0.962516  0.770071
D b         0.124029  0.626421  0.967531  0.193875  0.395897

Processing steps:

feature_index = [index for index, item in enumerate(df.columns.names) if item == 'col_name_1'][0]
all_features = df.columns.levels[feature_index].to_list()

for idx, item in df.groupby(level=0, axis=1):
    features = item.columns.get_level_values(1).to_list()
    missing = list(set(all_features) - set(features))
    for m_item in missing:
        df[idx, m_item] = np.nan * np.ones(df.shape[0])

Error message:

ValueError: Item must have length equal to number of levels.

Any ideas on how to make my approach more general, so that it works with any number of column levels?

Beny

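You can simply use stack and unstack here. stack moves col_name_1 from the columns into the row index, unstack moves it back as the innermost column level (filling combinations that did not exist with NaN), and swaplevel restores the original level order: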

out = df.stack(level = 1).unstack().swaplevel(1, 2, axis = 1)
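
As a rough sketch of what that one-liner does, here it is split into its three steps on the three-level frame from the question. The fixed random seed and the names `stacked`, `wide` and `features_per_parent` are my own additions for illustration. The loop in the question most likely fails because the key `(idx, m_item)` has two entries while the columns now have three levels; the reshaping route never builds column keys by hand, so it should not depend on the number of levels.

```python
import numpy as np
import pandas as pd

# Rebuild the three-level frame from the question.
index = [['A', 'B', 'C', 'D'], ['a', 'b', 'a', 'b']]
cols = [['AC', 'AC', 'BC', 'DC', 'CC'],
        ['ac', 'aac', 'bc', 'ac', 'bc'],
        ['Xc', 'Xc', 'Xc', 'Xc', 'Xc']]
rng = np.random.default_rng(0)  # assumption: fixed seed, not in the original
df = pd.DataFrame(rng.random((4, 5)), index=index, columns=cols)
df.columns.names = ['col_name_0', 'col_name_1', 'col_name_2']

# 1) stack: move col_name_1 out of the columns and into the row index.
stacked = df.stack(level=1)

# 2) unstack: move it back as the innermost column level; combinations
#    that did not exist before show up as all-NaN columns.
wide = stacked.unstack()

# 3) swaplevel: restore the order col_name_0, col_name_1, col_name_2.
out = wide.swaplevel(1, 2, axis=1)

# Collect the col_name_1 labels found under each col_name_0 parent.
features_per_parent = {}
for parent, feature, _ in out.columns:
    features_per_parent.setdefault(parent, set()).add(feature)

print(features_per_parent)
```

The column order of `out` can differ between pandas versions, so if the order matters it is probably worth finishing with `out.sort_index(axis=1)`.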
