Group by Month and Perform Cumulative Sum of Unique Values

LeoGER Published at Dev

LeoGER

Below is a simplified version of the df in question:

import pandas as pd

df = pd.DataFrame({'date':  ['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-02-01','2021-02-02','2021-02-03','2021-02-04'],
                   'month': ['Jan','Jan','Jan','Jan','Feb','Feb','Feb','Feb'],
                   'label': ['A','A','B','A','A','B','C','A']})

df

         date month label
0  2021-01-01   Jan     A
1  2021-01-02   Jan     A
2  2021-01-03   Jan     B
3  2021-01-04   Jan     A
4  2021-02-01   Feb     A
5  2021-02-02   Feb     B
6  2021-02-03   Feb     C
7  2021-02-04   Feb     A

I would like to add a new column showing the cumulative count of unique labels, restarting each month.

Intended df:

         date month label  count
0  2021-01-01   Jan     A      1
1  2021-01-02   Jan     A      1
2  2021-01-03   Jan     B      2
3  2021-01-04   Jan     A      2
4  2021-02-01   Feb     A      1
5  2021-02-02   Feb     B      2
6  2021-02-03   Feb     C      3
7  2021-02-04   Feb     A      3

Erfan

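We can sort by month and label so that repeated labels within a month become adjacent, then use shift to flag the first row of each label. Join this boolean Series back to the dataframe (the join aligns on the index, so the original row order is kept) and use groupby.cumsum to get the counter: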

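# Sort so equal labels within a month are adjacent, then flag each row whose
# (month, label) pair differs from the previous sorted row.
d = df.sort_values(["month", "label"])
keys = d[["month", "label"]]
s = keys.ne(keys.shift()).any(axis=1).rename("count")
df = df.join(s)  # aligns on the original index

df["count"] = df.groupby("month")["count"].cumsum()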


        date month label  count
0 2021-01-01   Jan     A      1
1 2021-01-02   Jan     A      1
2 2021-01-03   Jan     B      2
3 2021-01-04   Jan     A      2
4 2021-02-01   Feb     A      1
5 2021-02-02   Feb     B      2
6 2021-02-03   Feb     C      3
7 2021-02-04   Feb     A      3
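
The same first-occurrence flag can also be built without sorting. The following is a minimal sketch, not part of the original answer: it assumes the eight-row frame from the question and uses DataFrame.duplicated on the month/label pair. The name first_seen is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
             "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04"],
    "month": ["Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "label": ["A", "A", "B", "A", "A", "B", "C", "A"],
})

# True where a (month, label) pair appears for the first time, in row order.
first_seen = ~df.duplicated(["month", "label"])

# Running total of first occurrences within each month.
df["count"] = first_seen.astype(int).groupby(df["month"]).cumsum()

print(df["count"].tolist())  # [1, 1, 2, 2, 1, 2, 3, 3]
```

Because the flag is computed on the month/label pair, a label that ends one month and starts the next is still counted as new in the second month.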

OLD ANSWER

We can make use of a cumulative sum of booleans by checking whether the current label differs from the previous one, then applying groupby and cumsum:

s = df["label"].ne(df["label"].shift())
df["count"] = s.groupby(df["month"]).cumsum()

        date month label  count
0 2021-01-01   Jan     A      1
1 2021-01-01   Jan     A      1
2 2021-01-03   Jan     B      2
3 2021-02-01   Feb     A      1
4 2021-02-02   Feb     B      2
5 2021-02-03   Feb     C      3
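
Note that this approach counts label changes between consecutive rows rather than distinct labels, so it only matches the intended result when identical labels sit next to each other within a month. A small sketch, assuming the eight-row frame from the question (the names changed and run_count are illustrative), shows where the two differ:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "label": ["A", "A", "B", "A", "A", "B", "C", "A"],
})

# Flag rows whose label differs from the row directly above.
changed = df["label"].ne(df["label"].shift())

# Cumulative number of changes within each month.
run_count = changed.astype(int).groupby(df["month"]).cumsum()

print(run_count.tolist())  # [1, 1, 2, 3, 0, 1, 2, 3]
```

Row 3 is counted as a third label although A was already seen in January, and row 4 starts February at 0 because its label equals the last January label.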

Or, to be safer, make use of your dates and group by year-month, so that the same month in different years is not merged into one group:

df["date"] = pd.to_datetime(df["date"])

s = df["label"].ne(df["label"].shift())
df["count"] = s.groupby(df["date"].dt.strftime("%Y-%m")).cumsum()

        date month label  count
0 2021-01-01   Jan     A      1
1 2021-01-01   Jan     A      1
2 2021-01-03   Jan     B      2
3 2021-02-01   Feb     A      1
4 2021-02-02   Feb     B      2
5 2021-02-03   Feb     C      3
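
The year-month grouping can also be combined with a first-occurrence flag, so that a label repeated later in the same month is not counted twice. This is a sketch under stated assumptions rather than the original answer's code: the five-row sample and the names ym and first_seen are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the same calendar month in two different years.
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2022-01-01", "2022-01-02"],
    "label": ["A", "B", "A", "B", "B"],
})
df["date"] = pd.to_datetime(df["date"])

# Year-month key such as "2021-01", keeping January 2021 and January 2022 apart.
ym = df["date"].dt.strftime("%Y-%m")

# True on the first occurrence of each label within its year-month.
first_seen = ~df.assign(ym=ym).duplicated(["ym", "label"])

# Running total of first occurrences per year-month.
df["count"] = first_seen.astype(int).groupby(ym).cumsum()

print(df["count"].tolist())  # [1, 2, 2, 1, 1]
```

Grouping on a month name alone would put both Januaries in one group, and label B in 2022 would then not be counted as new.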
