Pandas DataFrame mean of data in columns occurring before certain date time

Matthi9000

I have a DataFrame with client IDs and their expenses for 2014-2018. I want the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (so the 'Date' column dictates which columns are included in the mean).

Example: for index 0 (ID: 12), the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', which gives 111.0. If the date is too early (here, somewhere in 2014 or earlier), NaN should be returned (see index 6 and 9).

Desired output:

   y_2014  y_2015  y_2016  y_2017  y_2018        Date  ID    mean
0   100.0   122.0     324     632     NaN  2016-03-08  12   111.0
1   120.0   159.0      54     452   541.0  2015-04-09  96   120.0
2     NaN   164.0     687     165   245.0  2016-02-15  20   164.0
3   180.0   421.0     512     184   953.0  2018-05-01  73  324.25
4   110.0   654.0     913     173   103.0  2017-08-04  84   559.0
5   130.0     NaN     754     124   207.0  2016-07-03  26   130.0
6   170.0   256.0     843      97   806.0  2013-02-04  87     NaN
7   140.0   754.0      95     101   541.0  2016-06-08  64     447
8    80.0   985.0     184      84    90.0  2019-03-05  11   284.6
9    96.0    65.0     127     130   421.0  2014-05-14  34     NaN

This is the code I tried:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID":     [12,96,20,73,84,26,87,64,11,34],
                   "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
                   "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
                   "y_2016": [324,54,687,512,913,754,843,95,184,127],
                   "y_2017": [632,452,165,184,173,124,97,101,84,130],
                   "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                   "Date":   ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                              '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

print(df)

# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int)

# the years from Date
years = pd.to_datetime(df.Date).dt.year.values

df['mean'] = data.where(data_years<years[:,None]).mean(1)
print(df)

-> ValueError: Lengths must match to compare
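
A likely cause of the error: `data_years` is a pandas Series of length 5, while `years[:,None]` is a NumPy array of shape (10, 1), and pandas checks lengths instead of broadcasting. The sketch below assumes that converting `data_years` to a plain NumPy array is enough to let the comparison broadcast to a (10, 5) mask. It also assumes that "before the date" means "column year strictly smaller than the year of 'Date'", as in the attempt above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID":     [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date":   ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
               "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

data = df.filter(like="y_")

# Years taken from the column names, as a plain NumPy array of shape (5,)
data_years = data.columns.str.extract(r"(\d+)")[0].astype(int).to_numpy()

# Year of each row's Date, as a column vector of shape (10, 1)
years = pd.to_datetime(df["Date"]).dt.year.to_numpy()[:, None]

# Broadcasting (5,) against (10, 1) gives a (10, 5) boolean mask
mask = data_years < years

# Values outside the mask become NaN; mean skips NaN, and a row
# with no eligible year at all yields NaN
df["mean"] = data.where(mask).mean(axis=1)
print(df)
```

With the sample data this should reproduce the 'mean' column of the desired output, including NaN for index 6 and 9.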


Matthi9000

Solved: one possible answer to my own question

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID":     [12,96,20,73,84,26,87,64,11,34],
                   "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
                   "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
                   "y_2016": [324,54,687,512,913,754,843,95,184,127],
                   "y_2017": [632,452,165,184,173,124,97,101,84,130],
                   "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                   "Date":   ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                              '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

# Subset from original df to calculate the mean
subset = df.loc[:,['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]

# An expense value is only available for the mean once that year has passed.
# Therefore each column is renamed to the first day of the following year
# ('y_2014' becomes '2015-01-01', etc.) so it can be compared with the 'Date' column.
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

# Boolean mask: True where the column's cutoff date lies before the row's Date
# (ISO-formatted date strings compare correctly as plain strings)
s = subset.columns[0:].values < df.Date.values[:,None]

# Turn the mask into 1.0 / NaN so that multiplying blanks out excluded years
t = s.astype(float)
t[t == 0] = np.nan

df['mean'] = (subset.iloc[:,0:]*t).mean(1)

print(df)

# Additionally: the sum of expenses before the date in the 'Date' column
df['sum'] = (subset.iloc[:,0:]*t).sum(1)

print(df)
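
One detail of the approach above: `sum` skips NaN by default, so rows with no eligible year (index 6 and 9) get a sum of 0.0 rather than NaN. A minimal sketch of a variant follows, assuming NaN is wanted there as well. It compares real datetimes instead of strings and uses `where` in place of the 1.0/NaN multiplication. The names `cutoffs`, `eligible` and `masked` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID":     [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date":   ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
               "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

subset = df.filter(like="y_")

# Cutoff per column: first day of the year after the expense year,
# e.g. 'y_2014' -> 2015-01-01 (same convention as the answer above)
cutoffs = pd.to_datetime(
    [f"{int(col[2:]) + 1}-01-01" for col in subset.columns]
).to_numpy()

# Row dates as a (10, 1) column vector so the comparison broadcasts to (10, 5)
dates = pd.to_datetime(df["Date"]).to_numpy()[:, None]
eligible = cutoffs < dates

# Keep only eligible values; everything else becomes NaN
masked = subset.where(eligible)

df["mean"] = masked.mean(axis=1)
# min_count=1 makes the sum NaN when a row has no eligible value
df["sum"] = masked.sum(axis=1, min_count=1)
print(df)
```

Whether 0.0 or NaN is the better result for rows without eligible years depends on how the sum is used afterwards; dropping `min_count=1` restores the 0.0 behaviour.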

