What is the fastest way to find the group by max in a column in a Python Pandas dataframe AND mark it?

Climbs_lika_Spyder

UPDATE2: I actually have 2000 draws not 3.

UPDATE: My df column A was wrong. I fixed it.

I have a really large version of df below.

import pandas as pd

data = {'A': [11111, 11111, 33333, 11111], 'B': [101, 101, 102, 101], 'C': [1, 2, 3, 4],
        'draw0': [5, 6, 2, 1], 'draw1': [4, 3, 2, 1], 'draw2': [2, 3, 4, 6]}
df = pd.DataFrame(data)

     A     B   C  draw0   draw1   draw2
0  11111  101  1      5      4      2
1  11111  101  2      6      3      3
2  33333  102  3      2      2      4
3  11111  101  4      1      1      6

For each draw column, I am trying to find which row has the maximum value within each (A, B) group, then mark that row with 1 and every other row with 0. My current attempt below works, but it's slow. I feel like there should be a way to make it faster, with apply or something similar.

draw_cols = [col for col in df if col.startswith('draw')]

for col in draw_cols:
    max_idx = df.groupby(['A', 'B'])[col].idxmax().values
    df.loc[max_idx, col] = 1
    df.loc[~df.index.isin(max_idx), col] = 0

Desired Output:

     A     B   C  draw0  draw1  draw2
0  11111  101  1      0      1      0
1  11111  101  2      1      0      0
2  33333  102  3      1      1      1
3  11111  101  4      0      0      1

I generate the 2000 columns like so:

import numpy as np

def simulateDraw(df, n=2000):
    # simulate n drawings per row from a beta distribution whose alpha and beta
    # values are both taken from column C, and append them as columns draw0, draw1, ...
    return pd.concat(
        [df,
         df.apply(lambda row: pd.Series(np.random.beta(row.C, row.C, size=n)), axis=1).add_prefix('draw')],
        axis=1)
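
As a side note, the row-wise apply used to generate the draws can probably be replaced by a single broadcast call. The following is only a minimal sketch under stated assumptions: the helper name `simulate_draw_vectorized` and the `seed` parameter are hypothetical, and it swaps `np.random.beta` for NumPy's `Generator.beta`, so the random stream differs from the original function.

```python
import numpy as np
import pandas as pd

def simulate_draw_vectorized(df, n=2000, seed=None):
    # Hypothetical helper, not part of the original post.
    rng = np.random.default_rng(seed)
    # Column C as a (rows, 1) array so it broadcasts against size=(rows, n).
    c = df['C'].to_numpy(dtype=float)[:, None]
    draws = rng.beta(c, c, size=(len(df), n))
    # Same column names that add_prefix('draw') produces: draw0, draw1, ...
    draw_df = pd.DataFrame(draws, index=df.index,
                           columns=[f'draw{i}' for i in range(n)])
    return pd.concat([df, draw_df], axis=1)

base = pd.DataFrame({'A': [11111, 11111, 33333, 11111],
                     'B': [101, 101, 102, 101],
                     'C': [1, 2, 3, 4]})
sim = simulate_draw_vectorized(base, n=5, seed=0)
```

This builds the whole draw matrix in one call instead of creating one Series per row, which should matter more as the number of rows grows.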

It_is_Chris

# group by A and B; for every draw column, get the index label of the row holding the group maximum
max_idx = df.groupby(['A', 'B'])[df.columns[3:]].transform('idxmax')
# add a helper column holding each row's own index label
# (done in case your real data does not have a range index)
max_idx['index'] = max_idx.index.values
# a cell is True where the group's idxmax equals the row's own index label;
# cast to int and write the result back into the original df
df.update(max_idx.isin(max_idx['index']).astype(int))

       A    B  C  draw0  draw1  draw2
0  11111  101  1      0      1      0
1  11111  101  2      1      0      0
2  33333  102  3      1      1      1
3  11111  101  4      0      0      1
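
A possible alternative, offered as a sketch rather than a benchmarked solution, is to compare every draw column against its group maximum in one step. It selects the columns by the `draw` prefix instead of by position. One assumption to be aware of: if two rows in a group tie for the maximum, this marks both with 1, whereas `idxmax` marks only the first; with continuous simulated draws, ties should be rare.

```python
import pandas as pd

data = {'A': [11111, 11111, 33333, 11111], 'B': [101, 101, 102, 101], 'C': [1, 2, 3, 4],
        'draw0': [5, 6, 2, 1], 'draw1': [4, 3, 2, 1], 'draw2': [2, 3, 4, 6]}
df = pd.DataFrame(data)

draw_cols = [col for col in df.columns if col.startswith('draw')]

# Per-group maximum of every draw column, aligned to the original rows.
group_max = df.groupby(['A', 'B'])[draw_cols].transform('max')

# 1 where the row holds its group's maximum, 0 elsewhere (ties all get 1).
df[draw_cols] = df[draw_cols].eq(group_max).astype(int)
```

On the sample data this yields the same table as the desired output in the question.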
