There are a few existing questions about finding the overlap of date or time ranges (e.g.). I used those to solve my problem, but I ended up with a very slow (and not at all elegant) solution. I would be grateful if anyone has an idea of how to make this faster (and more elegant):
The problem:
I have two dataframes, df1 and df2, each with two columns representing a start time and an end time:
>>> df1
datetime_start datetime_end
0 2016-09-11 06:00:00 2016-09-11 06:30:00
1 2016-09-11 07:00:00 2016-09-11 07:30:00
2 2016-09-11 07:30:00 2016-09-11 08:00:00
3 2016-09-11 08:00:00 2016-09-11 08:30:00
4 2016-09-11 08:30:00 2016-09-11 09:00:00
5 2016-09-11 09:00:00 2016-09-11 09:30:00
6 2016-09-11 09:30:00 2016-09-11 10:00:00
7 2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00
>>> df2
datetime_start datetime_end catg
4 2016-09-11 08:48:33 2016-09-11 09:41:53 a
6 2016-09-11 09:54:25 2016-09-11 10:00:50 a
8 2016-09-11 10:01:47 2016-09-11 10:04:55 b
10 2016-09-11 10:08:00 2016-09-11 10:08:11 b
12 2016-09-11 10:30:28 2016-09-11 10:30:28 b
14 2016-09-11 10:38:18 2016-09-11 10:38:18 a
18 2016-09-11 13:44:05 2016-09-11 13:44:05 a
20 2016-09-11 13:46:52 2016-09-11 14:11:41 d
23 2016-09-11 14:22:17 2016-09-11 14:33:40 b
25 2016-09-11 15:00:12 2016-09-11 15:02:55 b
27 2016-09-11 15:04:19 2016-09-11 15:06:36 b
29 2016-09-11 15:08:43 2016-09-11 15:31:29 d
31 2016-09-11 15:38:04 2016-09-11 16:09:24 a
33 2016-09-11 16:18:40 2016-09-11 16:44:32 b
35 2016-09-11 16:45:59 2016-09-11 16:59:01 b
37 2016-09-11 17:08:31 2016-09-11 17:12:23 b
39 2016-09-11 17:16:13 2016-09-11 17:16:33 c
41 2016-09-11 17:17:23 2016-09-11 17:20:00 b
45 2016-09-13 12:27:59 2016-09-13 12:34:21 a
47 2016-09-13 12:38:39 2016-09-13 12:38:45 a
What I want is to find where the ranges in df2 overlap with the ranges in df1, how long the overlap is (in seconds), and what the value of df2.catg is for each overlap. The length of each overlap should be inserted into new columns of df1, one column per catg value, named after it.
Desired output:
>>> df1
datetime_start datetime_end a b d c
0 2016-09-11 06:00:00 2016-09-11 06:30:00 0.0 0.0 0.0 0.0
1 2016-09-11 07:00:00 2016-09-11 07:30:00 0.0 0.0 0.0 0.0
2 2016-09-11 07:30:00 2016-09-11 08:00:00 0.0 0.0 0.0 0.0
3 2016-09-11 08:00:00 2016-09-11 08:30:00 0.0 0.0 0.0 0.0
4 2016-09-11 08:30:00 2016-09-11 09:00:00 687.0 0.0 0.0 0.0
5 2016-09-11 09:00:00 2016-09-11 09:30:00 1800.0 0.0 0.0 0.0
6 2016-09-11 09:30:00 2016-09-11 10:00:00 1048.0 0.0 0.0 0.0
7 2016-09-11 10:30:00 2016-09-11 11:00:00 0.0 0.0 0.0 0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00 0.0 463.0 701.0 0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00 0.0 220.0 0.0 0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00 0.0 300.0 1277.0 0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00 1316.0 0.0 89.0 0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00 564.0 680.0 0.0 0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00 0.0 1654.0 0.0 0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00 0.0 389.0 0.0 20.0
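As a sanity check on those numbers: each cell is just the interval overlap max(0, min(ends) - max(starts)), in seconds. A minimal standalone sketch for the 687.0 cell (catg a, 08:30-09:00 row), with the timestamps taken from the tables above:
import pandas as pd
from pandas import Timestamp

# the 08:30-09:00 slot from df1 and the first catg 'a' event from df2
slot_start, slot_end = Timestamp('2016-09-11 08:30:00'), Timestamp('2016-09-11 09:00:00')
ev_start, ev_end = Timestamp('2016-09-11 08:48:33'), Timestamp('2016-09-11 09:41:53')

# overlap = max(0, min(ends) - max(starts))
overlap = max(min(slot_end, ev_end) - max(slot_start, ev_start), pd.Timedelta(0))
print(overlap.total_seconds())  # 687.0, i.e. 08:48:33 -> 09:00:00 is 11 min 27 s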
The absurdly slow way of doing it:
Based on this nice answer, I achieved what I wanted with the following hard-to-read code:
from collections import namedtuple

Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    # overlap in seconds of two [start, end] ranges, 0 if they are disjoint
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    return max(0, delta)

# one column per category, then accumulate overlaps pair by pair
# (note that overlap() is evaluated twice per matching pair)
for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if overlap(row1, row2) > 0:
            df1.loc[idx1, row2.catg] += overlap(row1, row2)
This works, but it is so slow on larger dataframes that it is basically unusable. If anyone has any ideas on how to speed it up, I would be glad for your input.
Thanks in advance, and let me know if anything is unclear!
Dataframe setup:
import pandas as pd
from pandas import Timestamp
d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}
d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
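For reference, the double loop above can also be replaced by numpy broadcasting over the full df1 x df2 grid; this is only a sketch built on the setup above, separate from the timed answer below, and overlap_matrix is a made-up helper name:
import numpy as np

def overlap_matrix(df1, df2):
    # all pairwise overlaps at once: max(0, min(ends) - max(starts)),
    # broadcast over a (len(df1), len(df2)) grid, in seconds
    s1 = df1.datetime_start.values[:, None]   # shape (len(df1), 1)
    e1 = df1.datetime_end.values[:, None]
    s2 = df2.datetime_start.values[None, :]   # shape (1, len(df2))
    e2 = df2.datetime_end.values[None, :]
    ov = (np.minimum(e1, e2) - np.maximum(s1, s2)) / np.timedelta64(1, 's')
    return np.clip(ov, 0, None)

ov = overlap_matrix(df1, df2)
for cat in df2.catg.unique():
    df1[cat] = ov[:, (df2.catg == cat).values].sum(axis=1)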
Per timeit tests (100 executions each), the namedtuple approach from the question averaged 15.7314 seconds on my machine, versus an average of 1.4794 seconds for this approach:
import numpy as np

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second
# in which an event occurred (list() is needed on Python 3,
# where map() returns an iterator)
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
                np.concatenate(list(map(np.arange, duration))) * pd.Timedelta('1S')

df1.merge(pd.DataFrame({'datetime_start': seconds_range,
                        'catg': np.repeat(df2.catg, duration)}). \
              groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
              size(). \
              unstack(level=0). \
              reset_index(),
          how="left")
# datetime_end datetime_start a b c d
# 0 2016-09-11 06:30:00 2016-09-11 06:00:00 NaN NaN NaN NaN
# 1 2016-09-11 07:30:00 2016-09-11 07:00:00 NaN NaN NaN NaN
# 2 2016-09-11 08:00:00 2016-09-11 07:30:00 NaN NaN NaN NaN
# 3 2016-09-11 08:30:00 2016-09-11 08:00:00 NaN NaN NaN NaN
# 4 2016-09-11 09:00:00 2016-09-11 08:30:00 687.0 NaN NaN NaN
# 5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800.0 NaN NaN NaN
# 6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048.0 NaN NaN NaN
# 7 2016-09-11 11:00:00 2016-09-11 10:30:00 NaN NaN NaN NaN
# 8 2016-09-11 14:30:00 2016-09-11 14:00:00 NaN 463.0 NaN 701.0
# 9 2016-09-11 15:00:00 2016-09-11 14:30:00 NaN 220.0 NaN NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00 NaN 300.0 NaN 1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00 1316.0 NaN NaN 89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00 564.0 680.0 NaN NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00 NaN 1654.0 NaN NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00 NaN 389.0 20.0 NaN
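If the NaNs should read as 0.0 as in the desired output, a fillna(0) on the merged frame finishes the job; assuming the merge result above is assigned to a variable, say out (name made up):
out = out.fillna(0)   # NaN -> 0.0, matching the desired output above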