I am trying to write code that deletes all rows sharing the same 'SCU_KEY' if any of those rows has 'STATUS' == 0. In the sample below, SCU_KEY 5 has a 0 in STATUS, so I want to delete every row whose SCU_KEY is 5. Here is a sample dataframe and the desired output.
Dataframe:
df = pd.DataFrame({'SCU_KEY': [3, 3, 3, 5, 5, 5, 5, 5, 16, 16, 16],
                   'STATUS' : [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]})
Desired output:
df_2 = pd.DataFrame({'SCU_KEY': [3, 3, 3, 16, 16, 16],
                     'STATUS' : [1, 1, 1, 1, 1, 1]})
Use groupby + filter
# filter out all 'SCU_KEY' groups
# that have at least one 'STATUS' == 0
df2 = df.groupby('SCU_KEY').filter(lambda g: ~g['STATUS'].eq(0).any())
Although I find this solution somewhat more idiomatic, Corralien's solution is way faster if your DataFrame is large.
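As a quick sanity check, here is a minimal sketch that runs both approaches on the sample data from the question and confirms they select the same rows. The names by_filter, bad_keys and by_isin are just illustrative, and the lambda uses `not` instead of `~`, which behaves the same here because `.any()` returns a single boolean.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'SCU_KEY': [3, 3, 3, 5, 5, 5, 5, 5, 16, 16, 16],
                   'STATUS':  [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]})

# groupby + filter: keep only the groups that contain no STATUS == 0
by_filter = df.groupby('SCU_KEY').filter(lambda g: not g['STATUS'].eq(0).any())

# isin approach: collect the keys that have a 0, then drop every row with those keys
bad_keys = df.loc[df['STATUS'] == 0, 'SCU_KEY']
by_isin = df[~df['SCU_KEY'].isin(bad_keys)]

# Both keep the original row order and index labels
same = by_filter.equals(by_isin)
print(by_filter)
```

Both results keep the original index labels, so add .reset_index(drop=True) if you need an index starting at 0 like in the desired output.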
Setup
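import numpy as np
import pandas as pd

# Assumed: the original snippet does not show how rng was created
rng = np.random.default_rng()

n = 500_000
max_groups = 20
df1 = pd.DataFrame({
    'SCU_KEY': rng.integers(max_groups, size=n),
    'STATUS': rng.integers(2, size=n)
})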
Results
Here are the results for comparison
# Corralien's
>>> %timeit df1[~df1['SCU_KEY'].isin(df1.loc[df1['STATUS'] == 0, 'SCU_KEY'])]
15.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# My solution
>>> %timeit df1.groupby('SCU_KEY').filter(lambda g: ~g['STATUS'].eq(0).any())
59.4 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Solution suggested by wwnde (see comments)
>>> %timeit df1[df1.groupby('SCU_KEY')['STATUS'].transform(lambda x: (x!=0).all())]
210 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
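The transform version above calls a Python lambda once per group. A possible variant, which I have not timed and which is not part of the comparison above, builds the boolean mask first and passes the built-in 'all' aggregation to transform by name. A minimal sketch on the sample data from the question (keep and result are illustrative names):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'SCU_KEY': [3, 3, 3, 5, 5, 5, 5, 5, 16, 16, 16],
                   'STATUS':  [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]})

# True where STATUS is nonzero, then broadcast "all True within the group"
# back to every row of that group
keep = df['STATUS'].ne(0).groupby(df['SCU_KEY']).transform('all')

# Select the surviving rows and renumber the index from 0
result = df[keep].reset_index(drop=True)
print(result)
```

Whether this is actually faster on the 500,000-row DataFrame would need to be measured with %timeit in the same way as the other solutions.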