Create dynamic ranges and calculate mean

windwalker

I would like to create an additional column with averages based on column A, using dynamic ranges.

import numpy as np
import pandas as pd
    
test = {'A' : [100, 120, 70, 300, 190, 70, 300, 190, 70],
        'B' : [80, 50, 64, 288, 172, 64, 288, 172, 64],
        'C' : ['NO', 'NO', 'YES', 'NO', 'YES', 'YES', 'NO', 'YES', 'YES'],
        'D' : [0, 1, 0, 3, 2, 2, 3, 1, 4] }

df = pd.DataFrame(data=test)

     A      B      C      D    
0   100    80     NO      0
1   120    50     NO      1
2    70    64    YES      0
3   300   288     NO      3
4   190   172    YES      2
5    70    64    YES      2
6   300   288     NO      3
7   190   172    YES      1
8    70    64    YES      4

When item in column C is YES get average value from dynamic range in column A by using value in column D as starting row index and row index of current row -1 as highest row index.

Below is the outcome I am seeking to achieve.

     A      B      C      D    Dyn_Ave    
0   100    80     NO      0     NaN
1   120    50     NO      1     NaN
2    70    64    YES      0     110
3   300   288     NO      3     NaN
4   190   172    YES      2     185
5    70    64    YES      2     187
6   300   288     NO      3     NaN
7   190   172    YES      1     175
8    70    64    YES      4     188

My attempt at creating the column has lead me to a np.where approach, although I am coming across the following error - TypeError: Cannot index by location index with a non-integer key

df['Dyn_Ave'] = np.where(df['C'] == 'YES', df['A'].iloc[df['D']:df.loc['C'][-1]].mean(), np.nan)
Quang Hoang

Let's try:

s = df['A'].cumsum().shift(fill_value=0)

df['Dyn_Ave'] = np.where(df['C'] == 'YES', 
                         (s - s.reindex(df['D']).values) / (np.arange(len(df)) - df['D']),           
                         np.nan)

Output:

     A    B    C  D     Dyn_Ave
0  100   80   NO  0         NaN
1  120   50   NO  1         NaN
2   70   64  YES  0  110.000000
3  300  288   NO  3         NaN
4  190  172  YES  2  185.000000
5   70   64  YES  2  186.666667
6  300  288   NO  3         NaN
7  190  172  YES  1  175.000000
8   70   64  YES  4  187.500000

Explanation: Let's first forget about C=='YES' for a moment and paying attention to dynamic average. The average from row df['D'] to row j-1 can be seen as

(cumsum[j-1] - cumsum[df['D']-1])/(j-df['D'])

or:

(cumsum.shift()[j] - cumsum.shift()[df['D']) / (j-df['D'])

That's why we first calculate the cumsum, then shift it:

s = df['A'].cumsum().shift(fill_value=0)

To get the cumsum at df['D'], we use reindex and pass the underlying numpy array for subtraction:

(s - s.reindex(df['D']).values)

The number of rows can be easily seen as:

(np.arange(len(df)) - df['D'])

The last part is just filling in where C=='YES', just as you were trying to accomplish.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related