Boolean indexing worked fine on a small sample, but as I increased the size of the data, the computing time grew far faster than linearly (example below). Does anyone know a way to speed up this particular boolean indexer?
import pandas as pd
import numpy as np
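# One year of 1-minute timestamps, with random 'price' values around 50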
a = pd.date_range('2019-01-01', '2019-12-31', freq='1T')
b = np.random.normal(size=len(a), loc=50)
c = pd.DataFrame(index=a, data=b, columns=['price'])
1500 rows:
z = c.head(1500)
z[z.index.map(lambda x: 8 <= x.hour <= 16) & z.index.map(lambda x: x.weekday() < 5)]
CPU times: user 149 ms, sys: 8.71 ms, total: 158 ms
Wall time: 157 ms
5000 rows:
z = c.head(5000)
z[z.index.map(lambda x: 8 <= x.hour <= 16) & z.index.map(lambda x: x.weekday() < 5)]
CPU times: user 14.1 s, sys: 9.07 s, total: 23.2 s
Wall time: 23.2 s
I tried z = c.head(10000), but it was taking more than 15 minutes to compute, so I stopped. The data I actually want to apply this indexer to has about 30,000 rows.
This is slow because the mapping is done with a lambda expression: a Python function call is made for every single item. That is typically a bad idea when you want to process data in bulk. You can speed this up with vectorized operations on the index:
hour = z.index.hour
z[(8 <= hour) & (hour <= 16) & (z.index.weekday < 5)]
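Since the index is a DatetimeIndex at minute resolution, an alternative sketch is to use DataFrame.between_time for the hour filter and a vectorized mask for the weekday (between_time is inclusive of both endpoints by default, so '16:59' matches 8 <= hour <= 16 for minute-frequency data; the variable w is just an illustrative name):
w = z.between_time('08:00', '16:59')  # keep 08:00:00 through 16:59:00
w = w[w.index.weekday < 5]            # keep Monday (0) through Friday (4)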
With z = c (so all 524,161 rows: 364 days × 1,440 minutes per day, plus one endpoint), we get the following timings:
>>> from timeit import timeit
>>> z = c
>>> timeit(lambda: z[(8 <= z.index.hour) & (z.index.hour <= 16) & (z.index.weekday < 5)], number=100)
11.825318349001464
So 100 runs take ~11.8 seconds in total, i.e. ~118 milliseconds per run.
When we use the first 5,000 rows, we get:
>>> z = c.head(5000)
>>> timeit(lambda: z[(8 <= z.index.hour) & (z.index.hour <= 16) & (z.index.weekday < 5)], number=100)
0.1542488380218856
So this runs in ~1.5 milliseconds per run.
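As a quick sanity check that the vectorized mask and the between_time variant select exactly the same rows, here is a minimal sketch; pd.testing.assert_frame_equal raises an AssertionError if the two results differ:
hour = c.index.hour
masked = c[(8 <= hour) & (hour <= 16) & (c.index.weekday < 5)]

by_time = c.between_time('08:00', '16:59')
by_time = by_time[by_time.index.weekday < 5]

pd.testing.assert_frame_equal(masked, by_time)  # raises if the selections differ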