I have some event data that looks like this:
| time | id | status | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2 | started | 8 |
| 2020-08-26T21:29:01.000 | 3 | started | 4 |
| 2020-08-26T21:29:02.000 | 2 | finished | 8 |
| 2020-08-26T21:29:03.000 | 4 | started | 12 |
| 2020-08-26T21:29:04.000 | 5 | started | 2 |
| 2020-08-26T21:29:05.000 | 6 | started | 24 |
| 2020-08-26T21:29:06.000 | 4 | finished | 12 |
| 2020-08-26T21:29:06.000 | 3 | finished | 4 |
| 2020-08-26T21:29:07.000 | 1 | finished | 1 |
| 2020-08-26T21:29:10.000 | 7 | started | 4 |
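Note that event recording only began after some things had already started, and that some events have not finished yet.
From this data, I am trying to get, for every second, a running count of the rows and a running sum of the values.
A running count immediately makes me think of window queries, but I am struggling to work out how to get the desired output from this data.
Ideally, I would like the following result: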
| time | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1 | 1 |
| 2020-08-26T21:29:01.000 | 3 | 13 |
| 2020-08-26T21:29:02.000 | 2 | 5 |
| 2020-08-26T21:29:03.000 | 3 | 17 |
| 2020-08-26T21:29:04.000 | 4 | 19 |
| 2020-08-26T21:29:05.000 | 5 | 43 |
| 2020-08-26T21:29:06.000 | 3 | 29 |
| 2020-08-26T21:29:07.000 | 2 | 28 |
| 2020-08-26T21:29:08.000 | 2 | 28 |
| 2020-08-26T21:29:09.000 | 2 | 28 |
| 2020-08-26T21:29:10.000 | 3 | 32 |
| 2020-08-26T21:29:11.000 | 3 | 32 |
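I would also be happy with an answer that ignores the record for id `1`, which was already running before event recording began. That would give the following result: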
| time | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0 | 0 |
| 2020-08-26T21:29:01.000 | 2 | 12 |
| 2020-08-26T21:29:02.000 | 1 | 4 |
| 2020-08-26T21:29:03.000 | 2 | 16 |
| 2020-08-26T21:29:04.000 | 3 | 18 |
| 2020-08-26T21:29:05.000 | 4 | 42 |
| 2020-08-26T21:29:06.000 | 2 | 28 |
| 2020-08-26T21:29:07.000 | 2 | 28 |
| 2020-08-26T21:29:08.000 | 2 | 28 |
| 2020-08-26T21:29:09.000 | 2 | 28 |
| 2020-08-26T21:29:10.000 | 3 | 32 |
| 2020-08-26T21:29:11.000 | 3 | 32 |
Since Athena / Presto does not support full joins, I can get the start and stop time for each `id` with the following query (also available on SQL Fiddle):
WITH started AS (
SELECT *
FROM foo
WHERE status = 'started'
), finished AS (
SELECT *
FROM foo
WHERE status = 'finished'
)
SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
FROM started LEFT JOIN finished ON started.id = finished.id
I think you want a cumulative conditional sum:
select time,
sum(sum(case when status = 'started' then 1
when status = 'finished' then -1
end)
) over (order by time) as running_count,
sum(sum(case when status = 'started' then value
when status = 'finished' then - value
end)
) over (order by time) as running_value
from foo
group by time
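The nested `sum()` calls are needed because the outer one is a window function and the inner one performs the aggregation for each `time` group.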
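To check the idea against the sample data, here is a minimal sketch that runs an equivalent query through Python's built-in `sqlite3` module. SQLite only stands in for Athena / Presto here, and this assumes a build with window-function support (SQLite 3.25 or later). The names `ROWS`, `QUERY`, `running_totals`, `delta_count` and `delta_value` are made up for this illustration. The aggregation is moved into a subquery rather than nested inside the window function. That is a rewrite of the same logic, not the query from the answer verbatim.

```python
import sqlite3

# Sample rows from the question: (time, id, status, value).
ROWS = [
    ("2020-08-26T21:29:01.000", 2, "started", 8),
    ("2020-08-26T21:29:01.000", 3, "started", 4),
    ("2020-08-26T21:29:02.000", 2, "finished", 8),
    ("2020-08-26T21:29:03.000", 4, "started", 12),
    ("2020-08-26T21:29:04.000", 5, "started", 2),
    ("2020-08-26T21:29:05.000", 6, "started", 24),
    ("2020-08-26T21:29:06.000", 4, "finished", 12),
    ("2020-08-26T21:29:06.000", 3, "finished", 4),
    ("2020-08-26T21:29:07.000", 1, "finished", 1),
    ("2020-08-26T21:29:10.000", 7, "started", 4),
]

# Inner query: net change per timestamp (+1 / +value on start,
# -1 / -value on finish). Outer query: cumulative sum of those changes.
QUERY = """
SELECT time,
       SUM(delta_count) OVER (ORDER BY time) AS running_count,
       SUM(delta_value) OVER (ORDER BY time) AS running_value
FROM (
    SELECT time,
           SUM(CASE status WHEN 'started' THEN 1
                           WHEN 'finished' THEN -1
                           ELSE 0 END) AS delta_count,
           SUM(CASE status WHEN 'started' THEN value
                           WHEN 'finished' THEN -value
                           ELSE 0 END) AS delta_value
    FROM foo
    GROUP BY time
) AS per_time
ORDER BY time
"""


def running_totals(rows):
    """Load rows into an in-memory table and return (time, count, sum) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE foo (time TEXT, id INTEGER, status TEXT, value INTEGER)"
        )
        conn.executemany("INSERT INTO foo VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals(ROWS):
        print(row)
```

Two things are worth noting when comparing the output with the desired tables. Only timestamps that actually occur in `foo` produce a row, so seconds without events (such as 21:29:08) are missing. The unmatched `finished` event for id `1` is also subtracted like any other.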