Group by Month and Perform Cumulative Sum of Unique Values

LeoGER

Below is a simplified version of the df in question:

import pandas as pd

df = pd.DataFrame({'date':['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-02-01','2021-02-02','2021-02-03','2021-02-04'],
                   'month':  ['Jan','Jan','Jan','Jan','Feb','Feb','Feb','Feb'],
                   'label':  ['A','A','B','A','A','B', 'C', 'A']})

df

     date       month  label
0   2021-01-01   Jan     A
1   2021-01-02   Jan     A
2   2021-01-03   Jan     B
3   2021-01-04   Jan     A
4   2021-02-01   Feb     A
5   2021-02-02   Feb     B
6   2021-02-03   Feb     C
7   2021-02-04   Feb     A

I would like to have a new column showing the cumulative sum of unique labels on a monthly basis.

Intended df:

    date       month    label   count
0   2021-01-01  Jan       A       1
1   2021-01-02  Jan       A       1
2   2021-01-03  Jan       B       2
3   2021-01-04  Jan       A       2
4   2021-02-01  Feb       A       1
5   2021-02-02  Feb       B       2
6   2021-02-03  Feb       C       3
7   2021-02-04  Feb       A       3

Erfan

We can sort by month and label, then use shift to flag the rows where the label differs from the previous row. Join this boolean Series back to the dataframe and use groupby.cumsum to get the counter:

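# Sort so that identical labels within a month sit next to each other
d = df.sort_values(["month", "label"])
# Flag the first row of each (month, label) run; comparing month as well
# keeps a run from carrying over a month boundary
s = (d["month"].ne(d["month"].shift()) | d["label"].ne(d["label"].shift())).rename("count")
df = df.join(s)

# Running total of the flags within each month
df["count"] = df.groupby("month")["count"].cumsum()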

        date month label  count
0 2021-01-01   Jan     A      1
1 2021-01-02   Jan     A      1
2 2021-01-03   Jan     B      2
3 2021-01-04   Jan     A      2
4 2021-02-01   Feb     A      1
5 2021-02-02   Feb     B      2
6 2021-02-03   Feb     C      3
7 2021-02-04   Feb     A      3
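
The same "flag the first occurrence, then cumsum" idea can also be sketched without sorting. This is not the code from the answer above but a variant that uses DataFrame.duplicated on the (month, label) pair; the name is_first is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
             "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04"],
    "month": ["Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "label": ["A", "A", "B", "A", "A", "B", "C", "A"],
})

# True on the first row where a label appears within its month
is_first = ~df.duplicated(subset=["month", "label"])

# Running total of first occurrences, restarted for every month
df["count"] = is_first.astype(int).groupby(df["month"]).cumsum()

print(df["count"].tolist())  # [1, 1, 2, 2, 1, 2, 3, 3]
```

Because duplicated looks at the month and label together, a label that repeats across a month boundary is still counted once in each month.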

OLD ANSWER

We can use a cumulative sum of booleans: check whether each label differs from the previous one, then groupby and cumsum:

s = df["label"].ne(df["label"].shift())
df["count"] = s.groupby(df["month"]).cumsum()

        date month label  count
0 2021-01-01   Jan     A      1
1 2021-01-01   Jan     A      1
2 2021-01-03   Jan     B      2
3 2021-02-01   Feb     A      1
4 2021-02-02   Feb     B      2
5 2021-02-03   Feb     C      3
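
Note that comparing each label with the previous row counts label changes, not distinct labels. A small sketch of what that means on the eight-row sample from the question, where A reappears after B and February starts with the label January ended on:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "label": ["A", "A", "B", "A", "A", "B", "C", "A"],
})

# True wherever the label differs from the row directly above it
changed = df["label"].ne(df["label"].shift())

# Cumulative number of label changes per month
df["count"] = changed.astype(int).groupby(df["month"]).cumsum()

print(df["count"].tolist())  # [1, 1, 2, 3, 0, 1, 2, 3]
```

January ends at 3 instead of 2 because the second A is counted again, and February starts at 0 because its first row matches the last row of January. This approach is therefore only suitable when equal labels are always adjacent and never span a month boundary.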

Or, to be safer, use your dates and group by year-month instead, so that the same month name in different years is not merged:

df["date"] = pd.to_datetime(df["date"])

s = df["label"].ne(df["label"].shift())
df["count"] = s.groupby(df["date"].dt.strftime("%Y-%m")).cumsum()

        date month label  count
0 2021-01-01   Jan     A      1
1 2021-01-01   Jan     A      1
2 2021-01-03   Jan     B      2
3 2021-02-01   Feb     A      1
4 2021-02-02   Feb     B      2
5 2021-02-03   Feb     C      3
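
To illustrate why the year-month key matters, here is a minimal sketch on made-up data spanning two Januaries. The dates and the ym and is_first names are assumptions for the example, and it flags first occurrences with duplicated rather than with shift:

```python
import pandas as pd

# Hypothetical data: January 2021 and January 2022 share the month name "Jan"
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2022-01-01", "2022-01-02", "2022-01-03"],
    "label": ["A", "B", "A", "A", "C"],
})
df["date"] = pd.to_datetime(df["date"])

# Year-month key, e.g. "2021-01", so the two Januaries stay separate
ym = df["date"].dt.strftime("%Y-%m")

# True on the first row where a label appears within its year-month
is_first = ~df.assign(ym=ym).duplicated(subset=["ym", "label"])

df["count"] = is_first.astype(int).groupby(ym).cumsum()

print(df["count"].tolist())  # [1, 2, 1, 1, 2]
```

Grouping on a plain month name would treat all five rows as one group here, so the counter would not restart in 2022.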
