我与一个叫数据库工作international_education
从world_bank_intl_education
的数据集bigquery-public-data
。
FIELDS
country_name
country_code
indicator_name
indicator_code
value
year
我的目标是用人口增长(年百分比)(其中一个indicator_name
值)变化最大和最小的国家绘制折线图。
我在下面使用两个分区完成了这项工作,以查找每个国家/地区当年的第一个和最后一个值,但我对 SQL 很粗略,想知道是否有优化此公式的方法。
query = """
WITH differences AS
(
SELECT country_name, year, value,
FIRST_VALUE(value)
OVER (
PARTITION BY country_name
ORDER BY year
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS small_value,
LAST_VALUE(value)
OVER (
PARTITION BY country_name
ORDER BY year
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS large_value
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
ORDER BY year
)
SELECT country_name, year, (large_value-small_value) AS total_range, value
FROM differences
ORDER BY total_range
"""
转换为熊猫数据框。
df= wbed.query_to_pandas_safe(query)
df.head(10)
结果表。
country_name year total_range value
0 United Arab Emirates 1970 -13.195183 14.446942
1 United Arab Emirates 1971 -13.195183 16.881671
2 United Arab Emirates 1972 -13.195183 17.689814
3 United Arab Emirates 1973 -13.195183 17.695296
4 United Arab Emirates 1974 -13.195183 17.125615
5 United Arab Emirates 1975 -13.195183 16.211873
6 United Arab Emirates 1976 -13.195183 15.450884
7 United Arab Emirates 1977 -13.195183 14.530119
8 United Arab Emirates 1978 -13.195183 13.033461
9 United Arab Emirates 1979 -13.195183 11.071306
然后我会用 python 绘制它如下。
all_countries = df.groupby('country_name', as_index=False).max().sort_values(by='total_range').country_name.values
countries = np.concatenate((all_countries[:3], all_countries[-4:]))
plt.figure(figsize=(16, 8))
sns.lineplot(x='year',y='value', data=df[df.country_name.isin(countries)], hue='country_name')
您不需要 CTE,也不需要窗口框架定义。所以这应该是等价的:
SELECT country_name, year, value,
(first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR DESC) -
first_value(value) OVER (PARTITION BY country_name ORDER BY YEAR)
) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)';
请注意,这LAST_VALUE()
对窗口框架定义很挑剔。所以我通常只是使用FIRST_VALUE()
相反的顺序。
如果您只需要每个国家/地区一行,那么您需要聚合。BigQuery 没有“first”和“last”聚合函数,但它们很容易用数组实现:
SELECT country_name,
((array_agg(value ORDER BY year DESC LIMIT 1))[ordinal(1)] -
(array_agg(value ORDER BY year LIMIT 1))[ordinal(1)]
) as total_range
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE indicator_name = 'Population growth (annual %)'
GROUP BY country_name
ORDER BY total_range;
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句