I want to select (unique) values from one column in a pandas data frame based on conditions on multiple columns and multiple rows. Consider the following example data frame:
df = pd.DataFrame({'Developer': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Language': ['Java', 'Python', 'Python', 'Java', 'Python', 'Python', 'Java', 'Python', 'C++'],
'Skill_Level': [1, 3, 3, 3, 2, 3, 3, 1, 3],
'Version': ["x.x", "2.x", "3.x", "x.x", "2.x", "3.x", "x.x", "3.x", "x.x"]
})
Developer Language Skill_Level Version
0 A Java 1 x.x
1 A Python 3 2.x
2 A Python 3 3.x
3 B Java 3 x.x
4 B Python 2 2.x
5 B Python 3 3.x
6 C Java 3 x.x
7 C Python 1 3.x
8 C C++ 3 x.x
Now I want to find all developers who know Java with a skill level of at least 3 and also know Python (no matter the version) with a skill level of at least 2.
The way I solved it for now was by selecting one set based on the Java condition, another set based on the Python condition and then doing an inner merge to get the set of developers matching all conditions:
result_java_df = df[(df["Language"] == "Java") & (df["Skill_Level"] >= 3)][["Developer"]]
result_python_df = df[(df["Language"] == "Python") & (df["Skill_Level"] >= 2)][["Developer"]]
result_df = result_java_df.merge(result_python_df, on="Developer")
result_df = result_df.drop_duplicates()
Developer
0 B
Is there a more "elegant" way to do this? I feel like I am overlooking smth. Especially if I want to select based on more row-based conditions (e.g. selecting developers who know 4 languages at certain skill levels) this will become quite convoluted, and of course justify writing a function to handle such selections. Hence I am wondering if pandas supports this somehow and I just didn't find that feature.
When I ran
qualified= df.groupby("Developer").apply(
lambda x:
any(
(x.Language == "Java") &
(x.Skill_Level >=3)
) &
any(
(x.Language == "Python") &
(x.Skill_Level >= 2))
)
I got
Developer
A False
B True
C False
dtype: bool
You can then subset with various methods, such as
[developer for developer,status in qualified.items() if status]
(returns a list)
or
qualified[qualified]
(returns a Series)
If you want to make it more general, you could do something like:
minimum_skill_levels = {"Java":3,
"Python":2}
qualified= df.groupby("Developer").apply(
lambda x:
all([any(
(x.Language == Language)&
(x.Skill_Level >= Skill_Level)
)
for Language, Skill_Level in minimum_skill_levels.items()
])
)
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments