I have the following dataframe:
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a | 1 | 2 | abc |
| b | 1 | 2 | abc |
| c | 3 | 2 | def |
I want the rows which have duplicates based on col2, col3, col4 for unique values of col1.
In this case the output would be:
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| a | 1 | 2 | abc |
| b | 1 | 2 | abc |
Calling df.duplicated on col2, col3, col4 alone won't work, since I need the col1 information to be contained in the result. I have millions of rows, and further analysis would be difficult without this direct information. I can't set col1 as the index because another column needs to be the index.
Is there a pythonic/pandaic way to achieve this?
We can use groupby with filter, keeping only the groups that have more than one row and no repeated col1 value:
df.groupby(['col2','col3','col4']).filter(lambda x : (x['col1'].nunique()==x['col1'].count())&(x['col1'].nunique()>1))
Out[65]:
col1 col2 col3 col4
0 a 1 2 abc
1 b 1 2 abc
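For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same groupby/filter idea on the sample data from the question. The names keys and has_distinct_col1 are my own, introduced only for readability; they are not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [1, 1, 3],
    "col3": [2, 2, 2],
    "col4": ["abc", "abc", "def"],
})

keys = ["col2", "col3", "col4"]  # hypothetical helper name

def has_distinct_col1(group):
    """Keep a group only if it has several rows and no repeated col1 value."""
    n_unique = group["col1"].nunique()
    # More than one distinct col1, and as many distinct values as non-null rows.
    return bool(n_unique > 1 and n_unique == group["col1"].count())

result = df.groupby(keys).filter(has_distinct_col1)
print(result)
```

Note that filter calls a Python function once per group, so with millions of rows and many distinct key combinations it may be noticeably slower than a purely vectorized mask.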
Alternatively, use duplicated twice: the first mask keeps rows whose (col2, col3, col4) values occur more than once, and the second excludes rows that are exact duplicates across all four columns, col1 included:
df[df.duplicated(['col2','col3','col4'],keep=False)&~df.duplicated(['col1','col2','col3','col4'],keep=False)]
Out[70]:
col1 col2 col3 col4
0 a 1 2 abc
1 b 1 2 abc
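The two approaches give the same result on the sample data, but they are not strictly equivalent. As a rough sketch, assuming a hypothetical extra group (rows d, d, e, which are not in the question) where one col1 value repeats, the duplicated mask works row by row while filter keeps or drops whole groups:

```python
import pandas as pd

# Rows 0-2 are from the question; rows 3-5 are made up to show the edge case.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "d", "e"],
    "col2": [1, 1, 3, 5, 5, 5],
    "col3": [2, 2, 2, 6, 6, 6],
    "col4": ["abc", "abc", "def", "xyz", "xyz", "xyz"],
})

keys = ["col2", "col3", "col4"]  # hypothetical helper name

# Row-wise masks: shares its key combination with another row,
# and is not an exact repeat across all four columns.
shares_keys = df.duplicated(keys, keep=False)
fully_repeated = df.duplicated(["col1"] + keys, keep=False)
by_mask = df[shares_keys & ~fully_repeated]

# Group-wise test: the whole group is dropped if any col1 value repeats.
by_filter = df.groupby(keys).filter(
    lambda g: bool(g["col1"].nunique() > 1 and g["col1"].is_unique)
)

print(by_mask["col1"].tolist())    # ['a', 'b', 'e']
print(by_filter["col1"].tolist())  # ['a', 'b']
```

Which behaviour is wanted depends on how a partly repeated group should be treated; the question does not say, so it is worth checking on your own data.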