How to select values with a condition on multiple columns and multiple rows in pandas (best practice)

guitarokh

I want to select (unique) values from one column in a pandas data frame based on conditions on multiple columns and multiple rows. Consider the following example data frame:

import pandas as pd

df = pd.DataFrame({'Developer': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Language': ['Java', 'Python', 'Python', 'Java', 'Python', 'Python', 'Java', 'Python', 'C++'],
                   'Skill_Level': [1, 3, 3, 3, 2, 3, 3, 1, 3],
                   'Version': ["x.x", "2.x", "3.x", "x.x", "2.x", "3.x", "x.x", "3.x", "x.x"]
                   })
    Developer    Language    Skill_Level    Version
0           A        Java              1        x.x
1           A      Python              3        2.x
2           A      Python              3        3.x
3           B        Java              3        x.x
4           B      Python              2        2.x
5           B      Python              3        3.x
6           C        Java              3        x.x
7           C      Python              1        3.x
8           C         C++              3        x.x

Now I want to find all developers who know Java with a skill level of at least 3 and also know Python (no matter the version) with a skill level of at least 2.

The way I solved it for now was by selecting one set based on the Java condition, another set based on the Python condition and then doing an inner merge to get the set of developers matching all conditions:

result_java_df = df[(df["Language"] == "Java") & (df["Skill_Level"] >= 3)][["Developer"]]
result_python_df = df[(df["Language"] == "Python") & (df["Skill_Level"] >= 2)][["Developer"]]
result_df = result_java_df.merge(result_python_df, on="Developer")
result_df = result_df.drop_duplicates()
    Developer
0   B

Is there a more "elegant" way to do this? I feel like I am overlooking something. Especially if I want to select based on more row-based conditions (e.g. selecting developers who know 4 languages at certain skill levels), this will become quite convoluted and would justify writing a function to handle such selections. Hence I am wondering whether pandas supports this somehow and I just didn't find the feature.

Acccumulation

When I ran

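    qualified = df.groupby("Developer").apply(
        lambda x: (
            # at least one Java row with skill level >= 3 ...
            ((x.Language == "Java") & (x.Skill_Level >= 3)).any()
            # ... and at least one Python row with skill level >= 2
            and ((x.Language == "Python") & (x.Skill_Level >= 2)).any()
        )
    )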

I got

Developer
A    False
B     True
C    False
dtype: bool

You can then subset with various methods, such as

[developer for developer, status in qualified.items() if status]

(returns a list)

or

qualified[qualified]

(returns a Series)
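To make the subsetting step concrete, here is a small sketch. It hard-codes a stand-in for the `qualified` Series printed above (so it runs on its own) and uses a shortened copy of the question's data frame; the variable names `developers` and `matching_rows` are mine, not from the question.

```python
import pandas as pd

# Stand-in for the boolean Series produced by the groupby above
qualified = pd.Series(
    [False, True, False],
    index=pd.Index(["A", "B", "C"], name="Developer"),
)

# Names of the developers whose flag is True
developers = qualified[qualified].index.tolist()

# Shortened copy of the example data, used to pull the matching rows back out
df = pd.DataFrame({
    "Developer": ["A", "A", "B", "B", "C", "C"],
    "Language": ["Java", "Python", "Java", "Python", "Java", "Python"],
    "Skill_Level": [1, 3, 3, 2, 3, 1],
})
matching_rows = df[df["Developer"].isin(developers)]
```

`developers` is a plain list, while `matching_rows` keeps every original row for the qualifying developers, which is handy if you still need their languages and skill levels.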

If you want to make it more general, you could do something like:

minimum_skill_levels = {"Java": 3, "Python": 2}

qualified = df.groupby("Developer").apply(
    lambda x: all(
        # the developer needs at least one row meeting each requirement
        ((x.Language == language) & (x.Skill_Level >= min_level)).any()
        for language, min_level in minimum_skill_levels.items()
    )
)
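If the per-group lambda gets slow on a large frame, one possible alternative (a sketch, not something I benchmarked) is to map each row's language to its required minimum, keep the rows that meet it, and count how many distinct required languages each developer satisfies. The names `required`, `meets`, `satisfied` and `qualified_developers` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Developer": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Language": ["Java", "Python", "Python", "Java", "Python", "Python",
                 "Java", "Python", "C++"],
    "Skill_Level": [1, 3, 3, 3, 2, 3, 3, 1, 3],
    "Version": ["x.x", "2.x", "3.x", "x.x", "2.x", "3.x", "x.x", "3.x", "x.x"],
})

minimum_skill_levels = {"Java": 3, "Python": 2}

# Required minimum per row; languages without a requirement become NaN
required = df["Language"].map(minimum_skill_levels)

# Comparing against NaN is False, so rows for other languages drop out here
meets = df[df["Skill_Level"] >= required]

# Number of distinct required languages each developer satisfies
satisfied = meets.groupby("Developer")["Language"].nunique()

# A developer qualifies only if every requirement is satisfied
qualified_developers = satisfied[
    satisfied == len(minimum_skill_levels)
].index.tolist()
```

Developers who satisfy none of the requirements simply do not appear in `satisfied`, so they are excluded without any special handling.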

