改进 SQL 查询以查找开始日期和结束日期之间的范围

威拉西我

我与一个叫数据库工作international_education从world_bank_intl_education的数据集bigquery-public-data。

FIELDS            
country_name      
country_code      
indicator_name    
indicator_code    
value   
year

我的目标是用人口增长（年百分比）（其中一个indicator_name值）变化最大和最小的国家绘制折线图。

我在下面使用两个分区完成了这项工作，以查找每个国家/地区当年的第一个和最后一个值，但我对 SQL 很粗略，想知道是否有优化此公式的方法。

query = """
WITH differences AS
    (
    SELECT country_name, year, value, 
      FIRST_VALUE(value)
      OVER (
          PARTITION BY country_name
          ORDER BY year
          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS small_value,
      LAST_VALUE(value)
      OVER (
          PARTITION BY country_name
          ORDER BY year
          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS large_value
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_name = 'Population growth (annual %)'
    ORDER BY year
    )
SELECT country_name, year, (large_value-small_value) AS total_range, value
FROM differences
ORDER BY total_range
        """

转换为熊猫数据框。

df= wbed.query_to_pandas_safe(query)
df.head(10)

结果表。

    country_name            year    total_range value
0   United Arab Emirates    1970    -13.195183  14.446942
1   United Arab Emirates    1971    -13.195183  16.881671
2   United Arab Emirates    1972    -13.195183  17.689814
3   United Arab Emirates    1973    -13.195183  17.695296
4   United Arab Emirates    1974    -13.195183  17.125615
5   United Arab Emirates    1975    -13.195183  16.211873
6   United Arab Emirates    1976    -13.195183  15.450884
7   United Arab Emirates    1977    -13.195183  14.530119
8   United Arab Emirates    1978    -13.195183  13.033461
9   United Arab Emirates    1979    -13.195183  11.071306

然后我会用 python 绘制它如下。

all_countries = df.groupby('country_name', as_index=False).max().sort_values(by='total_range').country_name.values

countries = np.concatenate((all_countries[:3], all_countries[-4:]))

plt.figure(figsize=(16, 8))
sns.lineplot(x='year',y='value', data=df[df.country_name.isin(countries)], hue='country_name')

戈登·利诺夫

您不需要 CTE，也不需要窗口框架定义。所以这应该是等价的：

SELECT country_name, year, value,
       (first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR DESC) -
        first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR)
       ) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)';

请注意，这LAST_VALUE()对窗口框架定义很挑剔。所以我通常只是使用FIRST_VALUE()相反的顺序。

如果您只需要每个国家/地区一行，那么您需要聚合。BigQuery 没有“first”和“last”聚合函数，但它们很容易用数组实现：

SELECT country_name, 
       ((array_agg(value ORDER BY year DESC LIMIT 1))[ordinal(1)] -
        (array_agg(value ORDER BY year LIMIT 1))[ordinal(1)]
       ) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
GROUP BY country_name
ORDER BY total_range;

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。