I have a dataframe like this (sample):
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V2 B2 Clearing C2 104457.22
5 V2 B2 Invoice C2 -400073.56
6 V2 B2 Payment C2 297856.45
7 V3 B3 Clearing C3 1989462.95
8 V3 B3 CreditMemo C3 538.95
9 V3 B3 CustomerPayment_Difference C3 2112329.00
10 V3 B3 Invoice C3 -4066485.69
11 V4 B4 Clearing C4 -123946.13
12 V4 B4 CreditMemo C4 127624.66
13 V4 B4 Accounting C4 424774.52
14 V4 B4 Invoice C4 -40446521.41
15 V4 B4 Payment C4 44441419.95
I want to reshape this dataframe like below:
A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
V1 B1 C1 NaN 1538884.46 NaN 13537679.7
V2 B2 C2 NaN 104457.22 NaN NaN
V3 B3 C3 NaN 1989462.95 538.95 2112329.0
V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
So far I have tried pivot: df.pivot(index='A',columns='C', values='E').reset_index()
It gives the following result:
C A Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 NaN 1538884.46 NaN 13537679.7
1 V2 NaN 104457.22 NaN NaN
2 V3 NaN 1989462.95 538.95 2112329.0
3 V4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
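The table above drops columns B and D, and I need those columns as well.
I provided this sample data for simplicity. In the future, though, the data will also look like this (note rows 4 and 5):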
A B C D E
0 V1 B1 Clearing C1 1538884.46
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B2 Clearing C1 88.9
5 V1 B2 Clearing C2 79.9
In this case my code will throw a duplicate index error.
To fix both problems I need to specify A, B and D as the index. I need code along these lines:
df.pivot(index=['A','B','D'],columns='C', values='E').reset_index()
This code gives me an error.
How can I solve this? How do I provide multiple columns as the index in a pandas pivot?
I think you need:
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
C A B D Accounting Clearing CreditMemo CustomerPayment_Difference \
0 V1 B1 C1 NaN 1538884.46 NaN 13537679.7
1 V2 B2 C2 NaN 104457.22 NaN NaN
2 V3 B3 C3 NaN 1989462.95 538.95 2112329.0
3 V4 B4 C4 424774.52 -123946.13 127624.66 NaN
C Invoice Payment PaymentDifference
0 -15771005.81 NaN 0.0
1 -400073.56 297856.45 NaN
2 -4066485.69 NaN NaN
3 -40446521.41 44441419.95 NaN
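For anyone who wants to run the set_index + unstack approach end to end, here is a minimal sketch. It rebuilds only the V1 and V2 rows of the sample data by hand; the variable name wide and the final step that clears the leftover "C" label on the column axis are my own additions, not part of the original answer.

```python
import pandas as pd

# Subset of the sample data from the question (V1 and V2 rows only)
df = pd.DataFrame({
    "A": ["V1", "V1", "V1", "V1", "V2", "V2", "V2"],
    "B": ["B1", "B1", "B1", "B1", "B2", "B2", "B2"],
    "C": ["Clearing", "CustomerPayment_Difference", "Invoice",
          "PaymentDifference", "Clearing", "Invoice", "Payment"],
    "D": ["C1", "C1", "C1", "C1", "C2", "C2", "C2"],
    "E": [1538884.46, 13537679.70, -15771005.81, 0.00,
          104457.22, -400073.56, 297856.45],
})

# Put A, B, D and C in the index, keep only E, then move the
# innermost index level (C) into the columns.
wide = df.set_index(["A", "B", "D", "C"])["E"].unstack().reset_index()

# Optional: remove the "C" name that unstack leaves on the column axis
wide.columns.name = None

print(wide)
```

As far as I know, recent pandas versions (1.1.0 and later) also accept a list for index in DataFrame.pivot, so the call from the question may work unchanged there; the set_index + unstack form does not depend on that.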
Another solution is to use pivot_table:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
But it aggregates if there are duplicates in columns A, B, C, D. In the first solution, duplicates raise an error:
print (df)
A B C D E
0 V1 B1 Clearing C1 3000.00 <-V1,B1,Clearing,C1
1 V1 B1 CustomerPayment_Difference C1 13537679.70
2 V1 B1 Invoice C1 -15771005.81
3 V1 B1 PaymentDifference C1 0.00
4 V1 B1 Clearing C1 1000.00 <-V1,B1,Clearing,C1
df = df.set_index(['A','B','D', 'C'])['E'].unstack().reset_index()
print (df)
ValueError: Index contains duplicate entries, cannot reshape
But pivot_table aggregates:
df = df.pivot_table(index=['A','B','D'], columns='C', values='E')
print (df)
C Clearing CustomerPayment_Difference Invoice PaymentDifference
A B D
V1 B1 C1 2000.0 13537679.7 -15771005.81 0.0
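The 2000.0 above is the mean of the two duplicated Clearing rows (3000 and 1000), because pivot_table aggregates with the mean by default. If a different aggregation is wanted, the aggfunc parameter can be set explicitly. The sketch below rebuilds the duplicate example by hand; the names mean_wide and sum_wide are only illustrative.

```python
import pandas as pd

# The duplicate example: rows 0 and 4 share the same (A, B, C, D) key
df = pd.DataFrame({
    "A": ["V1"] * 5,
    "B": ["B1"] * 5,
    "C": ["Clearing", "CustomerPayment_Difference", "Invoice",
          "PaymentDifference", "Clearing"],
    "D": ["C1"] * 5,
    "E": [3000.00, 13537679.70, -15771005.81, 0.00, 1000.00],
})

# Default aggregation is the mean: (3000 + 1000) / 2
mean_wide = df.pivot_table(index=["A", "B", "D"], columns="C", values="E")

# Summing the duplicates instead: 3000 + 1000
sum_wide = df.pivot_table(index=["A", "B", "D"], columns="C", values="E",
                          aggfunc="sum")

print(mean_wide["Clearing"])
print(sum_wide["Clearing"])
```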
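So the question is: is it a good idea to always use pivot_table?
In my opinion it depends on whether you need to care about duplicates. If you use pivot or set_index + unstack and there are duplicates, you get an error, so you know about the dupes. pivot_table always aggregates, so you never find out about them.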
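One way to get the best of both is to look for duplicated keys first and only fall back to aggregation deliberately. This is a sketch under my own assumptions: the names keys, dupes and wide are illustrative, and summing the duplicates is just one possible choice that may not suit your data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["V1"] * 5,
    "B": ["B1"] * 5,
    "C": ["Clearing", "CustomerPayment_Difference", "Invoice",
          "PaymentDifference", "Clearing"],
    "D": ["C1"] * 5,
    "E": [3000.00, 13537679.70, -15771005.81, 0.00, 1000.00],
})

keys = ["A", "B", "D", "C"]

# keep=False marks every row that belongs to a duplicated key,
# not just the second and later occurrences
dupes = df[df.duplicated(subset=keys, keep=False)]

if dupes.empty:
    # No duplicates: a plain reshape is safe and loses nothing
    wide = df.set_index(keys)["E"].unstack().reset_index()
else:
    # Duplicates exist: show them, then aggregate on purpose
    # (assumption: summing is the desired behaviour)
    print(dupes)
    wide = df.pivot_table(index=["A", "B", "D"], columns="C",
                          values="E", aggfunc="sum").reset_index()

print(wide)
```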