My dataset looks like this.
ts c1 c2 c3
2019-01-04T01:50:00.000Z C 25.48801612854004 33.317527770996094
2019-01-04T01:51:00.000Z C 25.74610710144043 33.392295837402344
2019-01-04T01:52:00.000Z C 25.978872299194336 33.29177474975586
2019-01-04T01:53:00.000Z B 26.12158203125 33.2805061340332
2019-01-04T01:54:00.000Z B 26.28511619567871 33.26923751831055
2019-01-04T01:55:00.000Z C 26.470335006713867 33.25796890258789
2019-01-04T01:56:00.000Z C 26.63957977294922 33.24669647216797
2019-01-04T01:57:00.000Z C 26.954004287719727 33.23542785644531
2019-01-04T01:58:00.000Z C 27.08258056640625 33.224159240722656
2019-01-04T01:59:00.000Z A 27.25551986694336 33.212890625
2019-01-04T02:00:00.000Z A 27.514263153076172 33.201622009277344
2019-01-04T02:01:00.000Z A 27.588970184326172 33.17148971557617
2019-01-04T02:02:00.000Z B 27.727638244628906 33.13819122314453
2019-01-04T02:03:00.000Z B 27.956039428710938 33.104896545410156
2019-01-04T02:04:00.000Z B 28.152463912963867 33.10499954223633
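I want to get the first and last value of "ts" for each partition value in column "c1". I tried the following query, but it does not return the correct result.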
SELECT ts, c1, c2, c3,
first_value(ts) OVER (partition by c1 order by ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first,
last_value(ts) OVER (partition by c1 order by ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last
FROM `default`.`a07_a15`
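Problem: first returns only three distinct ts values, and last comes back completely wrong.
Expected: I need the first and last value for each repeated run of a partition value.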
ts c1 c2 c3 first last
2019-01-04T01:50:00.000Z C 25.48801612854004 33.317527770996094 2019-01-04T01:50:00.000Z 2019-01-04T01:52:00.000Z
2019-01-04T01:51:00.000Z C 25.74610710144043 33.392295837402344 2019-01-04T01:50:00.000Z 2019-01-04T01:52:00.000Z
2019-01-04T01:52:00.000Z C 25.978872299194336 33.29177474975586 2019-01-04T01:50:00.000Z 2019-01-04T01:52:00.000Z
2019-01-04T01:53:00.000Z B 26.12158203125 33.2805061340332 2019-01-04T01:53:00.000Z 2019-01-04T01:54:00.000Z
2019-01-04T01:54:00.000Z B 26.28511619567871 33.26923751831055 2019-01-04T01:53:00.000Z 2019-01-04T01:54:00.000Z
2019-01-04T01:55:00.000Z C 26.470335006713867 33.25796890258789 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:56:00.000Z C 26.63957977294922 33.24669647216797 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:57:00.000Z C 26.954004287719727 33.23542785644531 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:58:00.000Z C 27.08258056640625 33.224159240722656 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:59:00.000Z A 27.25551986694336 33.212890625 2019-01-04T01:59:00.000Z 2019-01-04T02:01:00.000Z
2019-01-04T02:00:00.000Z A 27.514263153076172 33.201622009277344 2019-01-04T01:59:00.000Z 2019-01-04T02:01:00.000Z
2019-01-04T02:01:00.000Z A 27.588970184326172 33.17148971557617 2019-01-04T01:59:00.000Z 2019-01-04T02:01:00.000Z
2019-01-04T02:02:00.000Z B 27.727638244628906 33.13819122314453 2019-01-04T02:02:00.000Z 2019-01-04T02:04:00.000Z
2019-01-04T02:03:00.000Z B 27.956039428710938 33.104896545410156 2019-01-04T02:02:00.000Z 2019-01-04T02:04:00.000Z
2019-01-04T02:04:00.000Z B 28.152463912963867 33.10499954223633 2019-01-04T02:02:00.000Z 2019-01-04T02:04:00.000Z
Use lag() and lead():
select t.*
from (select t.*,
lag(c1) over (order by ts) as prev_c1,
lead(c1) over (order by ts) as next_c1
from t
) t
where prev_c1 is null or next_c1 is null or
prev_c1 <> c1 or next_c1 <> c1;
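This puts the values in separate rows. If you want them in the same row, the simplest solution is probably to treat this as a gaps-and-islands problem. The difference between the two row numbers stays constant within each run of consecutive rows that share the same c1, so it can serve as a grouping key: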
select c1, min(ts), max(ts)
from (select t.*,
row_number() over (order by ts) as seqnum,
row_number() over (partition by c1 order by ts) as seqnum_2
from t
) t
group by c1, (seqnum - seqnum_2);
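To see why the row-number difference identifies each run, here is a minimal sketch that runs the same query against the sample data from the question. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions), which is not the engine used in the question. The function name find_islands, the column aliases first_ts and last_ts, and the final order by are additions for illustration. Only ts and c1 are loaded, since c2 and c3 do not affect the grouping.

```python
import sqlite3

# (ts, c1) pairs copied from the question; c2 and c3 are omitted
# because they play no part in the grouping.
ROWS = [
    ("2019-01-04T01:50:00.000Z", "C"),
    ("2019-01-04T01:51:00.000Z", "C"),
    ("2019-01-04T01:52:00.000Z", "C"),
    ("2019-01-04T01:53:00.000Z", "B"),
    ("2019-01-04T01:54:00.000Z", "B"),
    ("2019-01-04T01:55:00.000Z", "C"),
    ("2019-01-04T01:56:00.000Z", "C"),
    ("2019-01-04T01:57:00.000Z", "C"),
    ("2019-01-04T01:58:00.000Z", "C"),
    ("2019-01-04T01:59:00.000Z", "A"),
    ("2019-01-04T02:00:00.000Z", "A"),
    ("2019-01-04T02:01:00.000Z", "A"),
    ("2019-01-04T02:02:00.000Z", "B"),
    ("2019-01-04T02:03:00.000Z", "B"),
    ("2019-01-04T02:04:00.000Z", "B"),
]

# Same query as above, with named output columns and a stable ordering.
ISLANDS_SQL = """
select c1, min(ts) as first_ts, max(ts) as last_ts
from (select t.*,
             row_number() over (order by ts) as seqnum,
             row_number() over (partition by c1 order by ts) as seqnum_2
      from t
     ) t
group by c1, (seqnum - seqnum_2)
order by first_ts
"""


def find_islands(rows):
    """Return one (c1, first_ts, last_ts) tuple per consecutive run of c1."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (ts text, c1 text)")
        conn.executemany("insert into t values (?, ?)", rows)
        return conn.execute(ISLANDS_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for c1, first_ts, last_ts in find_islands(ROWS):
        print(c1, first_ts, last_ts)
```

For this data the differences are 0, 3, 2, 9 and 10 for the five runs in timestamp order. Two runs of the same c1 can never share a difference, because the rows of other values in between push seqnum ahead of seqnum_2. That is why grouping by c1 together with the difference is enough.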
Edit:
If you need to keep the original rows, use window functions:
select t.*,
min(ts) over (partition by c1, (seqnum - seqnum_2)) as min_ts,
max(ts) over (partition by c1, (seqnum - seqnum_2)) as max_ts
from (select t.*,
row_number() over (order by ts) as seqnum,
row_number() over (partition by c1 order by ts) as seqnum_2
from t
) t