I would like to create an additional column with averages based on column A, using dynamic ranges.
import numpy as np
import pandas as pd
test = {'A' : [100, 120, 70, 300, 190, 70, 300, 190, 70],
'B' : [80, 50, 64, 288, 172, 64, 288, 172, 64],
'C' : ['NO', 'NO', 'YES', 'NO', 'YES', 'YES', 'NO', 'YES', 'YES'],
'D' : [0, 1, 0, 3, 2, 2, 3, 1, 4] }
df = pd.DataFrame(data=test)
A B C D
0 100 80 NO 0
1 120 50 NO 1
2 70 64 YES 0
3 300 288 NO 3
4 190 172 YES 2
5 70 64 YES 2
6 300 288 NO 3
7 190 172 YES 1
8 70 64 YES 4
When item in column C
is YES
get average value from dynamic range in column A
by using value in column D
as starting row index and row index of current row -1
as highest row index.
Below is the outcome I am seeking to achieve.
A B C D Dyn_Ave
0 100 80 NO 0 NaN
1 120 50 NO 1 NaN
2 70 64 YES 0 110
3 300 288 NO 3 NaN
4 190 172 YES 2 185
5 70 64 YES 2 187
6 300 288 NO 3 NaN
7 190 172 YES 1 175
8 70 64 YES 4 188
My attempt at creating the column has lead me to a np.where approach, although I am coming across the following error - TypeError: Cannot index by location index with a non-integer key
df['Dyn_Ave'] = np.where(df['C'] == 'YES', df['A'].iloc[df['D']:df.loc['C'][-1]].mean(), np.nan)
Let's try:
s = df['A'].cumsum().shift(fill_value=0)
df['Dyn_Ave'] = np.where(df['C'] == 'YES',
(s - s.reindex(df['D']).values) / (np.arange(len(df)) - df['D']),
np.nan)
Output:
A B C D Dyn_Ave
0 100 80 NO 0 NaN
1 120 50 NO 1 NaN
2 70 64 YES 0 110.000000
3 300 288 NO 3 NaN
4 190 172 YES 2 185.000000
5 70 64 YES 2 186.666667
6 300 288 NO 3 NaN
7 190 172 YES 1 175.000000
8 70 64 YES 4 187.500000
Explanation: Let's first forget about C=='YES'
for a moment and paying attention to dynamic average. The average from row df['D']
to row j-1
can be seen as
(cumsum[j-1] - cumsum[df['D']-1])/(j-df['D'])
or:
(cumsum.shift()[j] - cumsum.shift()[df['D']) / (j-df['D'])
That's why we first calculate the cumsum, then shift it:
s = df['A'].cumsum().shift(fill_value=0)
To get the cumsum at df['D']
, we use reindex and pass the underlying numpy array for subtraction:
(s - s.reindex(df['D']).values)
The number of rows can be easily seen as:
(np.arange(len(df)) - df['D'])
The last part is just filling in where C=='YES'
, just as you were trying to accomplish.
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments