bigquery:查找以下行匹配条件

米什15

我正在查看BigQuery中的文本序列,并尝试在多个行中标识单词的补全(共享ID)。数据如下:

ID, Text
1, t
1, th
1, the
1, the
1, the c
1, the ca
1, the cat
1, the cat 
1, the cat s
...
1, the cat sat on the mat
2, r
...

对于每个给定的ID和序列,我正在尝试查找下一个单词边界。因此理想的输出为:

ID, Text, Boundary
1, t, the
1, th, the
1, the c, the cat
1, the ca, the cat
1, the cat s, the cat sat 

在上面的下一个共享ID并以空格结尾的下一行给出了下一个(可以有多个)单词补全边界。

米哈伊尔·伯利安(Mikhail Berlyant)

以下是BigQuery标准SQL

注意:这是蛮力方法,因此查询可能不会那么优雅-希望这会给您一个良好的开端。

#standardSQL
SELECT id, item, boundary
FROM (
  SELECT id, grp, 
    STRING_AGG(IF(boundary, text, ''), '') boundary,
    ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
  FROM (
    SELECT id, text, 
      LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
      SUBSTR(text, -1) = ' ' boundary
    FROM `project.dataset.table`
  )
  GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)

是否适用于您的问题中的伪数据,如下所示

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 't' text UNION ALL
  SELECT 1, 'th' UNION ALL
  SELECT 1, 'the' UNION ALL
  SELECT 1, 'the ' UNION ALL
  SELECT 1, 'the c' UNION ALL
  SELECT 1, 'the ca' UNION ALL
  SELECT 1, 'the cat' UNION ALL
  SELECT 1, 'the cat ' UNION ALL
  SELECT 1, 'the cat s' UNION ALL
  SELECT 1, 'the cat sat ' 
)
SELECT id, item, boundary
FROM (
  SELECT id, grp, 
    STRING_AGG(IF(boundary, text, ''), '') boundary,
    ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
  FROM (
    SELECT id, text, 
      LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
      SUBSTR(text, -1) = ' ' boundary
    FROM `project.dataset.table`
  )
  GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)
ORDER BY id, grp, pos   

结果是

Row     id      item        boundary     
1       1       t           the  
2       1       th          the  
3       1       the c       the cat  
4       1       the ca      the cat  
5       1       the cat s   the cat sat  

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章