What is the fastest way to check a pandas dataframe for elements?

EB2127

I'm a bit confused regarding the best way to check a pandas dataframe column for items.

I am writing a program whereby if the dataframe has elements in a certain column which are not allowed, an error is raised.

Here's an example:

import pandas as pd

raw_data = {'first_name': ['Jay', 'Jason', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze'], 
        'age': [47, 42, 36, 24, 73], 
        'preTestScore': [4, 4, 31, 2, 3],
        'postTestScore': [27, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)

which outputs

  first_name last_name  age  preTestScore  postTestScore
0        Jay     Jones   47             4             27
1      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70

If column last_name contains anything besides Jones, Miller, Ali, Milner, or Cooze, raise a warning.

One could possibly use pandas.DataFrame.isin, but it's not clear to me this is the most efficient approach.

Something like:

if not df.isin({'last_name': ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']})['last_name'].all():
    raise ValueError("Column `last_name` includes ill-formed elements.")

jezrael

I think you can use isin with all to check that every value in the column is allowed:

if not df['last_name'].isin(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']).all():
    raise ValueError("Column `last_name` includes ill-formed elements.")

Another solution with issubset (the values in the column must be a subset of the allowed names):

if not set(df['last_name']).issubset(['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']):
    raise ValueError("Column `last_name` includes ill-formed elements.")
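
Both checks above only say whether something is wrong, not what. A minimal sketch of a helper that also reports the offending values follows; the name find_disallowed and the sample value 'Smith' are made up for illustration and are not part of the question.

```python
import pandas as pd


def find_disallowed(series, allowed):
    """Return the unique values of `series` that are not in `allowed`."""
    # isin gives a boolean mask of allowed rows; invert it to select the bad ones
    mask = ~series.isin(allowed)
    return series[mask].unique().tolist()


# Hypothetical data: 'Smith' is not in the allowed list
df = pd.DataFrame({'last_name': ['Jones', 'Miller', 'Ali', 'Smith', 'Smith']})
allowed = ['Jones', 'Miller', 'Ali', 'Milner', 'Cooze']

bad = find_disallowed(df['last_name'], allowed)
if bad:
    print("Column `last_name` includes ill-formed elements: {}".format(bad))
```

In the real program the print could be replaced by raise ValueError(...) with the same message.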

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = list('abcdefghijklmno')

df = pd.DataFrame({'last_name': np.random.choice(L, N)})
print(df)

In [245]: %timeit df['last_name'].isin(L).all()
The slowest run took 4.73 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 421 µs per loop

In [247]: %timeit set(L).issubset(df['last_name'])
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [248]: %timeit df.loc[~df['last_name'].isin(L), 'last_name'].any()
1000 loops, best of 3: 562 µs per loop

Caveat:

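Performance really depends on the data: the number of rows and the number of non-matching values. Note also that the issubset timing above measures set(L).issubset(df['last_name']), which tests whether every value in L occurs in the column, not whether the column contains only values from L.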
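
To see how much the data matters, something like the following rough sketch can be run with different row counts and different numbers of invalid values. The function names (check_isin, check_subset, time_checks), the 'INVALID' marker and the chosen sizes are assumptions for illustration; the resulting numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def check_isin(series, allowed):
    # True when every value in the column is in the allowed list
    return bool(series.isin(allowed).all())


def check_subset(series, allowed):
    # True when the set of column values is a subset of the allowed list
    return set(series).issubset(allowed)


def time_checks(n_rows, n_bad, repeats=20, seed=123):
    """Return average seconds per call for each check on synthetic data."""
    rng = np.random.RandomState(seed)
    allowed = list('abcdefghijklmno')
    # object dtype so the longer 'INVALID' string is not truncated
    values = rng.choice(allowed, n_rows).astype(object)
    values[:n_bad] = 'INVALID'
    series = pd.Series(values)
    results = {}
    for name, func in [('isin', check_isin), ('issubset', check_subset)]:
        total = timeit.timeit(lambda: func(series, allowed), number=repeats)
        results[name] = total / repeats
    return results


# Compare a clean column with one containing 100 invalid values
for n_bad in (0, 100):
    print(n_bad, time_checks(10000, n_bad))
```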
