我正在查看BigQuery中的文本序列,并尝试在多个行中标识单词的补全(共享ID)。数据如下:
ID, Text
1, t
1, th
1, the
1, the
1, the c
1, the ca
1, the cat
1, the cat
1, the cat s
...
1, the cat sat on the mat
2, r
...
对于每个给定的ID和序列,我正在尝试查找下一个单词边界。因此理想的输出为:
ID, Text, Boundary
1, t, the
1, th, the
1, the c, the cat
1, the ca, the cat
1, the cat s, the cat sat
在上面的下一个共享ID并以空格结尾的下一行给出了下一个(可以有多个)单词补全边界。
以下是BigQuery标准SQL
注意:这是蛮力方法,因此查询可能不会那么优雅-希望这会给您一个良好的开端。
#standardSQL
SELECT id, item, boundary
FROM (
SELECT id, grp,
STRING_AGG(IF(boundary, text, ''), '') boundary,
ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
FROM (
SELECT id, text,
LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
SUBSTR(text, -1) = ' ' boundary
FROM `project.dataset.table`
)
GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)
是否适用于您的问题中的伪数据,如下所示
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 't' text UNION ALL
SELECT 1, 'th' UNION ALL
SELECT 1, 'the' UNION ALL
SELECT 1, 'the ' UNION ALL
SELECT 1, 'the c' UNION ALL
SELECT 1, 'the ca' UNION ALL
SELECT 1, 'the cat' UNION ALL
SELECT 1, 'the cat ' UNION ALL
SELECT 1, 'the cat s' UNION ALL
SELECT 1, 'the cat sat '
)
SELECT id, item, boundary
FROM (
SELECT id, grp,
STRING_AGG(IF(boundary, text, ''), '') boundary,
ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
FROM (
SELECT id, text,
LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
SUBSTR(text, -1) = ' ' boundary
FROM `project.dataset.table`
)
GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)
ORDER BY id, grp, pos
结果是
Row id item boundary
1 1 t the
2 1 th the
3 1 the c the cat
4 1 the ca the cat
5 1 the cat s the cat sat
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句