I have this dataframe to begin with:
ID  PRODUCT_ID  NAME        STOCK  SELL_COUNT  DELIVERED_BY  PRICE_A  PRICE_B
1   P1          PRODUCT_P1  12     15          UPS           32,00    40,00
2   P2          PRODUCT_P2  4      3           DHL           8,00     NaN
3   P3          PRODUCT_P3  120    22          DHL           NaN      144,00
4   P1          PRODUCT_P1  423    18          UPS           98,00    NaN
5   P2          PRODUCT_P2  0      5           GLS           12,00    18,00
6   P3          PRODUCT_P3  53     10          DHL           84,00    NaN
7   P4          PRODUCT_P4  22     0           UPS           2,00     NaN
8   P1          PRODUCT_P1  94     56          GLS           NaN      49,00
9   P1          PRODUCT_P1  9      24          GLS           NaN      1,00
What I'm trying to achieve: after grouping by PRODUCT_ID, sum PRICE_A or PRICE_B for each row depending on which of them has a value (prioritizing PRICE_A if both are set).
Based on @WeNYoBen's helpful answer, I now know how to conditionally apply aggregation functions depending on different columns:
    import pandas as pd

    def custom_aggregate(grouped):
        data = {
            'STOCK': grouped.loc[grouped['DELIVERED_BY'] == 'UPS', 'STOCK'].min(),
            'TOTAL_SELL_COUNT': grouped.loc[grouped['ID'] > 6, 'SELL_COUNT'].sum(min_count=1),
            'COND_SELL_COUNT': grouped.loc[grouped['SELL_COUNT'] > 10, 'SELL_COUNT'].sum(min_count=1),
            # THIS IS WHERE THINGS GET FOGGY...
            # I somehow need to add a second condition here, that says
            # if PRICE_B is set - use the PRICE_B value for the sum()
            'COND_PRICE': grouped.loc[grouped['PRICE_A'].notna(), 'PRICE_A'].sum()
        }
        d_series = pd.Series(data)
        return d_series

    result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
I really don't know whether this is possible using .loc. One way to solve it would be to create an additional column, before calling .groupby, that already contains the correct price values. But I thought there might be a more flexible way of doing this. Ideally I'd apply a custom function for the 'COND_PRICE' calculation that runs before the results are passed to sum(). In SQL I could nest any number of CASE WHEN ... END expressions to implement this kind of logic; I'm curious how to get the same flexibility in pandas.
Thanks a lot.
Here is the solution: use fillna.
    def custom_aggregate(grouped):
        data = {
            'STOCK': grouped.loc[grouped['DELIVERED_BY'] == 'UPS', 'STOCK'].min(),
            'TOTAL_SELL_COUNT': grouped.loc[grouped['ID'] > 6, 'SELL_COUNT'].sum(min_count=1),
            'COND_SELL_COUNT': grouped.loc[grouped['SELL_COUNT'] > 10, 'SELL_COUNT'].sum(min_count=1),
            # fillna: use PRICE_A where it has a value, otherwise fall back to PRICE_B;
            # rows where both are NaN stay NaN (and are skipped by sum())
            'COND_PRICE': grouped['PRICE_A'].fillna(grouped['PRICE_B']).sum()
        }
        d_series = pd.Series(data)
        return d_series
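To check the fillna idea in isolation, here is a minimal sketch on the sample data from the question. It assumes the prices have already been parsed as floats (the decimal-comma strings such as "32,00" would need converting first), and the names effective_price and cond_price are only illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

# Sample data from the question, reduced to the columns needed here.
# Assumption: prices are already floats, not "32,00"-style strings.
df_products = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "PRODUCT_ID": ["P1", "P2", "P3", "P1", "P2", "P3", "P4", "P1", "P1"],
    "PRICE_A": [32.0, 8.0, np.nan, 98.0, 12.0, 84.0, 2.0, np.nan, np.nan],
    "PRICE_B": [40.0, np.nan, 144.0, np.nan, 18.0, np.nan, np.nan, 49.0, 1.0],
})

# Row-wise choice: PRICE_A where present, otherwise PRICE_B.
effective_price = df_products["PRICE_A"].fillna(df_products["PRICE_B"])

# Sum the chosen prices per product.
cond_price = effective_price.groupby(df_products["PRODUCT_ID"]).sum()
print(cond_price.to_dict())
# {'P1': 180.0, 'P2': 20.0, 'P3': 228.0, 'P4': 2.0}
```

For P1 this adds 32 and 98 from PRICE_A plus 49 and 1 from PRICE_B, so PRICE_A wins on the row where both are set. Note that sum() skips NaN by default, so a group in which both prices are missing on every row would come out as 0; passing min_count=1, as in the other aggregations, would give NaN instead.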