Improve speed of pandas boolean indexing

jim basquiat

Boolean indexing worked fine on a small sample, but as I increased the size of the data, the computing time grew far faster than linearly (example below). Does anyone know a way to speed up this particular boolean indexer?

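import pandas as pd
import numpy as np
# One timestamp per minute for 2019 ('min' is the non-deprecated spelling of '1T')
a = pd.date_range('2019-01-01', '2019-12-31', freq='min')
# Random prices centred on 50
b = np.random.normal(size=len(a), loc=50)
c = pd.DataFrame(index=a, data=b, columns=['price'])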

1500 rows:

z = c.head(1500)
z[z.index.map(lambda x : 8 <= x.hour <= 16 ) & z.index.map(lambda x : x.weekday() < 5 )]

CPU times: user 149 ms, sys: 8.71 ms, total: 158 ms Wall time: 157 ms

5000 rows:

z = c.head(5000)
z[z.index.map(lambda x : 8 <= x.hour <= 16 ) & z.index.map(lambda x : x.weekday() < 5 )]

CPU times: user 14.1 s, sys: 9.07 s, total: 23.2 s Wall time: 23.2 s

I tried with z = c.head(10000), but it took more than 15 minutes to compute, so I stopped it. The data I want to use that indexer on has about 30000 rows.

Willem Van Onsem

This is slow because you map a lambda expression over the index, which means a Python function call is made for each item. That is typically not a good idea if you want to process data in "bulk". You can speed this up by using the vectorized attributes of the index instead:

hour = z.index.hour
z[(8 <= hour) & (hour <= 16) & (z.index.weekday < 5)]
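As a minimal sketch of the same idea, the mask can be wrapped in a small helper and checked against an element-wise reference on a small sample. The helper name business_hours_mask and its start_hour/end_hour parameters are illustrative assumptions, not part of pandas; only DatetimeIndex.hour and DatetimeIndex.weekday are library attributes.

```python
import numpy as np
import pandas as pd


def business_hours_mask(index, start_hour=8, end_hour=16):
    """Boolean mask: Monday-Friday with start_hour <= hour <= end_hour.

    Hypothetical helper for illustration; `index` must be a DatetimeIndex.
    """
    hour = index.hour        # hour of every timestamp, computed in bulk
    weekday = index.weekday  # Monday=0 ... Sunday=6
    return np.asarray((hour >= start_hour) & (hour <= end_hour) & (weekday < 5))


# Small sample frame shaped like the one in the question (about one week of minutes)
idx = pd.date_range('2019-01-01', periods=10000, freq='min')
z = pd.DataFrame({'price': np.random.normal(loc=50, size=len(idx))}, index=idx)

fast = business_hours_mask(z.index)

# Element-wise reference built with a plain Python loop, for comparison only
slow = np.array([8 <= ts.hour <= 16 and ts.weekday() < 5 for ts in z.index])

assert (fast == slow).all()  # both approaches select the same rows
result = z[fast]
```

Both masks select the same rows; the helper only differs in doing the comparisons on whole arrays rather than one timestamp at a time.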

With z = c (so a total of 524'161 rows), we get the following timings:

>>> from timeit import timeit
>>> z = c
>>> timeit(lambda: z[(8 <= z.index.hour) & (z.index.hour <= 16) & (z.index.weekday < 5)], number=100)
11.825318349001464

Since that is the total for 100 runs, each run takes roughly 118 milliseconds.

When we use the first 5'000 rows, we get:

>>> z = c.head(5000)
>>> timeit(lambda: z[(8 <= z.index.hour) & (z.index.hour <= 16) & (z.index.weekday < 5)], number=100)
0.1542488380218856

So each run takes about 1.5 milliseconds.
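To reproduce this kind of comparison on your own machine, a rough timing harness might look like the following. It is only a sketch: the function names vectorized and elementwise, the sample size, and the number of runs are arbitrary choices for illustration, and the element-wise version converts its masks to NumPy arrays before combining them, so it is not an exact replica of the code in the question. Absolute timings will vary with hardware and pandas version.

```python
from timeit import timeit

import numpy as np
import pandas as pd

# Sample data: 5000 one-minute rows, as in the second test above
idx = pd.date_range('2019-01-01', periods=5000, freq='min')
z = pd.DataFrame({'price': np.random.normal(loc=50, size=len(idx))}, index=idx)


def vectorized():
    # Comparisons run on whole arrays at once
    hour = z.index.hour
    return z[(8 <= hour) & (hour <= 16) & (z.index.weekday < 5)]


def elementwise():
    # One Python-level evaluation per timestamp
    in_hours = np.array([8 <= ts.hour <= 16 for ts in z.index])
    on_weekday = np.array([ts.weekday() < 5 for ts in z.index])
    return z[in_hours & on_weekday]


runs = 20  # arbitrary; increase for more stable numbers
t_vec = timeit(vectorized, number=runs) / runs
t_elem = timeit(elementwise, number=runs) / runs
print(f"vectorized:   {t_vec * 1e3:.3f} ms per run")
print(f"element-wise: {t_elem * 1e3:.3f} ms per run")
```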
