Efficiently find the overlap of datetime ranges from 2 dataframes

安全:

There are some questions about finding the overlap of date or time ranges (for example, this one). I used those to solve my problem, but I ended up with a very slow (and not at all elegant) solution. I would be grateful if anyone has an idea of how to do this faster (and more elegantly).

The problem:

I have 2 dataframes, df1 and df2, each with 2 columns that represent a start time and an end time:

>>> df1

        datetime_start        datetime_end
0  2016-09-11 06:00:00 2016-09-11 06:30:00
1  2016-09-11 07:00:00 2016-09-11 07:30:00
2  2016-09-11 07:30:00 2016-09-11 08:00:00
3  2016-09-11 08:00:00 2016-09-11 08:30:00
4  2016-09-11 08:30:00 2016-09-11 09:00:00
5  2016-09-11 09:00:00 2016-09-11 09:30:00
6  2016-09-11 09:30:00 2016-09-11 10:00:00
7  2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00

>>> df2

        datetime_start        datetime_end catg
4  2016-09-11 08:48:33 2016-09-11 09:41:53    a
6  2016-09-11 09:54:25 2016-09-11 10:00:50    a
8  2016-09-11 10:01:47 2016-09-11 10:04:55    b
10 2016-09-11 10:08:00 2016-09-11 10:08:11    b
12 2016-09-11 10:30:28 2016-09-11 10:30:28    b
14 2016-09-11 10:38:18 2016-09-11 10:38:18    a
18 2016-09-11 13:44:05 2016-09-11 13:44:05    a
20 2016-09-11 13:46:52 2016-09-11 14:11:41    d
23 2016-09-11 14:22:17 2016-09-11 14:33:40    b
25 2016-09-11 15:00:12 2016-09-11 15:02:55    b
27 2016-09-11 15:04:19 2016-09-11 15:06:36    b
29 2016-09-11 15:08:43 2016-09-11 15:31:29    d
31 2016-09-11 15:38:04 2016-09-11 16:09:24    a
33 2016-09-11 16:18:40 2016-09-11 16:44:32    b
35 2016-09-11 16:45:59 2016-09-11 16:59:01    b
37 2016-09-11 17:08:31 2016-09-11 17:12:23    b
39 2016-09-11 17:16:13 2016-09-11 17:16:33    c
41 2016-09-11 17:17:23 2016-09-11 17:20:00    b
45 2016-09-13 12:27:59 2016-09-13 12:34:21    a
47 2016-09-13 12:38:39 2016-09-13 12:38:45    a

What I want is to find where the ranges in df2 overlap with the ranges in df1, how long each overlap is (in seconds), and what the value of df2.catg is. I want to insert the length of the overlap into columns of df1, one column per catg value (the columns take their names from catg).

Desired output:

>>> df1
        datetime_start        datetime_end       a       b       d     c
0  2016-09-11 06:00:00 2016-09-11 06:30:00     0.0     0.0     0.0   0.0
1  2016-09-11 07:00:00 2016-09-11 07:30:00     0.0     0.0     0.0   0.0
2  2016-09-11 07:30:00 2016-09-11 08:00:00     0.0     0.0     0.0   0.0
3  2016-09-11 08:00:00 2016-09-11 08:30:00     0.0     0.0     0.0   0.0
4  2016-09-11 08:30:00 2016-09-11 09:00:00   687.0     0.0     0.0   0.0
5  2016-09-11 09:00:00 2016-09-11 09:30:00  1800.0     0.0     0.0   0.0
6  2016-09-11 09:30:00 2016-09-11 10:00:00  1048.0     0.0     0.0   0.0
7  2016-09-11 10:30:00 2016-09-11 11:00:00     0.0     0.0     0.0   0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00     0.0   463.0   701.0   0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00     0.0   220.0     0.0   0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00     0.0   300.0  1277.0   0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00  1316.0     0.0    89.0   0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00   564.0   680.0     0.0   0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00     0.0  1654.0     0.0   0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00     0.0   389.0     0.0  20.0
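As a sanity check on the desired output (this worked example is my own addition, not part of the original question): the first non-zero cell, a = 687.0 in the 08:30:00 to 09:00:00 row, comes from the category-a event that runs 08:48:33 to 09:41:53. Its intersection with that bin is 08:48:33 to 09:00:00, i.e. 687 seconds:

```python
from datetime import datetime

bin_start = datetime(2016, 9, 11, 8, 30, 0)
bin_end   = datetime(2016, 9, 11, 9, 0, 0)
ev_start  = datetime(2016, 9, 11, 8, 48, 33)
ev_end    = datetime(2016, 9, 11, 9, 41, 53)

# overlap in seconds = max(0, min(ends) - max(starts))
overlap = max(0.0, (min(bin_end, ev_end) - max(bin_start, ev_start)).total_seconds())
print(overlap)  # 687.0
```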

The ludicrous way of doing it:

Based on this nice answer, I achieved what I want with the following unwieldy code:

from collections import namedtuple
Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    overlap = max(0, delta)
    return overlap

for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        ovl = overlap(row1, row2)  # compute once instead of twice
        if ovl > 0:
            df1.loc[idx1, row2.catg] += ovl

This works, but on larger dataframes it is far too slow, to the point of being basically unusable. If anyone has any ideas on how to speed it up, I would be glad for your input.

Thanks in advance, and please let me know if anything is unclear!
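For reference, the per-pair arithmetic inside overlap(), namely max(0, min(ends) - max(starts)), can also be vectorized with numpy broadcasting to compute the whole pairwise overlap matrix at once. A minimal sketch of this idea (the toy data and variable names here are illustrative additions, not from the code above), assuming both interval sets fit in memory as an n-by-m matrix:

```python
import numpy as np
import pandas as pd

# two toy interval sets: one bin from df1, one event from df2
s1 = pd.to_datetime(['2016-09-11 08:30:00']).values  # df1 starts
e1 = pd.to_datetime(['2016-09-11 09:00:00']).values  # df1 ends
s2 = pd.to_datetime(['2016-09-11 08:48:33']).values  # df2 starts
e2 = pd.to_datetime(['2016-09-11 09:41:53']).values  # df2 ends

# pairwise overlap matrix in seconds: max(0, min(ends) - max(starts))
latest_start = np.maximum(s1[:, None], s2[None, :])
earliest_end = np.minimum(e1[:, None], e2[None, :])
overlap_sec = np.clip((earliest_end - latest_start) / np.timedelta64(1, 's'), 0, None)
print(overlap_sec)  # [[687.]]
```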

Dataframe setup:

import pandas as pd
from pandas import Timestamp

d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}

d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
cmaher:

According to timeit tests (100 executions each), the namedtuple approach in the question averaged 15.7314 seconds on my machine, versus an average of 1.4794 seconds with this approach:

import numpy as np

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second
# in which an event occurred (materialize map() so this works on Python 3)
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
                np.concatenate(list(map(np.arange, duration))) * pd.Timedelta('1s')

df1.merge(pd.DataFrame({'datetime_start':seconds_range,
                        'catg':np.repeat(df2.catg, duration)}). \
              groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
              size(). \
              unstack(level=0). \
              reset_index(), 
          how="left")

#           datetime_end      datetime_start       a       b     c       d
# 0  2016-09-11 06:30:00 2016-09-11 06:00:00     NaN     NaN   NaN     NaN
# 1  2016-09-11 07:30:00 2016-09-11 07:00:00     NaN     NaN   NaN     NaN
# 2  2016-09-11 08:00:00 2016-09-11 07:30:00     NaN     NaN   NaN     NaN
# 3  2016-09-11 08:30:00 2016-09-11 08:00:00     NaN     NaN   NaN     NaN
# 4  2016-09-11 09:00:00 2016-09-11 08:30:00   687.0     NaN   NaN     NaN
# 5  2016-09-11 09:30:00 2016-09-11 09:00:00  1800.0     NaN   NaN     NaN
# 6  2016-09-11 10:00:00 2016-09-11 09:30:00  1048.0     NaN   NaN     NaN
# 7  2016-09-11 11:00:00 2016-09-11 10:30:00     NaN     NaN   NaN     NaN
# 8  2016-09-11 14:30:00 2016-09-11 14:00:00     NaN   463.0   NaN   701.0
# 9  2016-09-11 15:00:00 2016-09-11 14:30:00     NaN   220.0   NaN     NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00     NaN   300.0   NaN  1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00  1316.0     NaN   NaN    89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00   564.0   680.0   NaN     NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00     NaN  1654.0   NaN     NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00     NaN   389.0  20.0     NaN
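One cosmetic difference from the desired output remains: the left merge leaves NaN in bins with no overlap, whereas the question wants 0.0. Chaining .fillna(0) onto the merged result closes that gap; a minimal standalone illustration (the tiny frame below is my own stand-in for the merged result, not the data above):

```python
import numpy as np
import pandas as pd

# toy stand-in for the merged result: NaN marks bins with no overlap
merged = pd.DataFrame({'a': [687.0, np.nan], 'b': [np.nan, 463.0]})
filled = merged.fillna(0)  # NaN becomes 0.0, matching the desired output
```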
