Say that I have a list of strings, such as
listStrings = [ 'cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up']
Is there a way to return all rows where a particular column contains 3 or more of the strings from the list?
For example
If the column contained 'My dad is a hero for saving the cat'
Then the row would be returned.
But if the column only contained 'the cat and bat teamed up to find some food'
That row wouldn't be returned.
The only way I can think of is to generate every combination of 3 strings from the list and join each with AND, e.g. 'cat' AND 'bat' AND 'hat'.
But this seems neither computationally efficient nor Pythonic.
Is there a more efficient, compact way to do this?
Edit
Here is a pandas example
import pandas as pd
listStrings = [ 'cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up']
df = pd.DataFrame(['test1', 'test2', 'test3'], ['My dad is a hero for saving the cat', 'the cat and bat teamed up to find some food', 'The dog found a bowl'])
df.head()
                                                 0
My dad is a hero for saving the cat          test1
the cat and bat teamed up to find some food  test2
The dog found a bowl                         test3
So using listStrings, I would like row 1 returned, but not row 2 or row 3.
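You can use set intersection: split each cell into words, intersect them with the set of target strings, and keep the rows where the intersection is large enough (2 in this small example; use 3 for your case):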
import pandas as pd
listStrings = {'A', 'B'}
df = pd.DataFrame({'text': ['A B', 'B C', 'C D']})
df = df.loc[df.text.apply(lambda x: len(listStrings.intersection(x.split())) >= 2)]
print(df)
Output:
text
0 A B
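As a rough sketch, here is the same idea applied to the data from the question with a threshold of 3. The column names 'text' and 'label' and the helper count_matches are my own naming for illustration, not something from the question, and lowercasing the sentence is an extra assumption so that capitalised words still match. Note that with listStrings as given, the second sentence also contains three listed words ('cat', 'bat' and 'up'), so it passes the threshold too.

```python
import pandas as pd

listStrings = ['cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up']
keywords = set(listStrings)  # a set makes the intersection cheap

# Assumed layout: sentences in a 'text' column, labels in a 'label' column.
df = pd.DataFrame({
    'text': [
        'My dad is a hero for saving the cat',
        'the cat and bat teamed up to find some food',
        'The dog found a bowl',
    ],
    'label': ['test1', 'test2', 'test3'],
})


def count_matches(sentence):
    """Count how many distinct keywords appear as whole words in the sentence."""
    words = set(sentence.lower().split())
    return len(keywords & words)


# Count the distinct keywords per row, then keep rows with at least 3.
df['n_matches'] = df['text'].apply(count_matches)
result = df[df['n_matches'] >= 3]
print(result)
```

This only matches whole, whitespace-separated words, so 'cat.' or 'cats' would not count as 'cat'. If you need that, you would have to strip punctuation or switch to a regex-based count first.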