Fastest way to filter a pandas dataframe on multiple columns

RoachLord

I have a pandas dataframe in which several columns label the data in a final column, for example,

import pandas as pd

df = pd.DataFrame( {'1_label' : ['a1','b1','c1','d1'],
                    '2_label' : ['a2','b2','c2','d2'],
                    '3_label' : ['a3','b3','c3','d3'],
                    'data'    : [1,2,3,4]})

df =      1_label 2_label 3_label  data
     0      a1      a2      a3     1
     1      b1      b2      b3     2
     2      c1      c2      c3     3
     3      d1      d2      d3     4

and a list of tuples,

list_t = [('a1','a2','a3'), ('d1','d2','d3')]

I want to filter this dataframe and return a new dataframe containing only the rows that correspond to the tuples in my list.

result =        1_label 2_label 3_label  data
            0      a1      a2      a3     1
            1      d1      d2      d3     4

My naive (and C++ inspired) solution was to use append (like vector::push_back):

result = pd.DataFrame()
for l1, l2, l3 in list_t:
    # Rows whose three labels all match the current tuple
    match = df[(df['1_label'] == l1) &
               (df['2_label'] == l2) &
               (df['3_label'] == l3)]
    if not match.empty:
        # DataFrame.append returns a new dataframe each call (removed in pandas 2.0)
        result = result.append(match)

While my solution works, I suspect it is horrendously slow for large dataframes and long lists of tuples, since I think pandas creates a new dataframe on each call to append. Could anyone suggest a faster/cleaner way to do this? Thanks!

Ilja Everilä

Assuming no duplicates, you could create an index out of the columns you want to "filter" on and select the tuples with .loc:

In [10]: df
Out[10]: 
  1_label 2_label 3_label  data
0      a1      a2      a3     1
1      b1      b2      b3     2
2      c1      c2      c3     3
3      d1      d2      d3     4

In [11]: df.set_index(['1_label', '2_label', '3_label'])\
    .loc[[('a1','a2','a3'), ('d1','d2','d3')]]\
    .reset_index()
Out[11]: 
  1_label 2_label 3_label  data
0      a1      a2      a3     1
1      d1      d2      d3     4
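
If the no-duplicates assumption does not hold, or you would rather not move the label columns into the index, a boolean mask is one possible alternative. The following is only a sketch under the question's example data: it builds a MultiIndex from the label columns and tests each row for membership in list_t, then shows an inner merge against a small dataframe of the wanted tuples. The names label_cols, mask, wanted and result_merge are made up for illustration, and neither variant has been benchmarked here.

```python
import pandas as pd

df = pd.DataFrame({'1_label': ['a1', 'b1', 'c1', 'd1'],
                   '2_label': ['a2', 'b2', 'c2', 'd2'],
                   '3_label': ['a3', 'b3', 'c3', 'd3'],
                   'data': [1, 2, 3, 4]})
list_t = [('a1', 'a2', 'a3'), ('d1', 'd2', 'd3')]

# Hypothetical helper name: the columns that together form the lookup key
label_cols = ['1_label', '2_label', '3_label']

# Variant 1: one boolean per row, True where the row's label tuple is in list_t
mask = pd.MultiIndex.from_frame(df[label_cols]).isin(list_t)
result = df[mask].reset_index(drop=True)

# Variant 2: inner merge against a dataframe built from the wanted tuples
wanted = pd.DataFrame(list_t, columns=label_cols)
result_merge = df.merge(wanted, on=label_cols, how='inner')

print(result)
print(result_merge)
```

The mask variant keeps every matching row, duplicates included, and tuples with no match simply select nothing. With the merge variant, a tuple repeated in list_t would repeat the matching rows, so deduplicate the list first if that matters.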
