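The COVID Tracking Project (API described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic has hit different states differently: New York and its neighbors were hit hardest first, and other states were hit later. Here is a subset of the data: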
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the whole dataset, I'm doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
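I'd like to aggregate the data by summing the values in one column, but only for certain states, and then add the same column summed over the states that were not included. I can do part of this, for example: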
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
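This creates a new dataframe from all_states, grouped by date, summing over every state that is not "NY". What I want to do, though, is exclude multiple states, with something like a "not in" function (this doesn't work):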
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
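Is there a way to do that? Another approach I tried was to create a new dataframe as a pivot table, with one row per date and one column per state, like this: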
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
But this still leaves me unable to create new columns by summing only some of the columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
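The end result I'm looking for is something like this (using the sample data above and summing the "positive" column, with "NY" standing in for "tristate"):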
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
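To get the expected output, you can use groupby on date together with np.where, which labels each row by whether its state isin the states you want; then sum the positive column, unstack, and assign to get the total column: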
import numpy as np

# Group by date and by a tristate / not_tristate label, then sum 'positive'
df_f = all_states.groupby(['date',
                           np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                    'tristate', 'not_tristate')])\
                 ['positive'].sum()\
                 .unstack()\
                 .assign(total=lambda x: x.sum(axis=1))
print(df_f)
          not_tristate  tristate   total
date
20200502         53128    312977  366105
20200503         54563    316415  370978
20200504         55893    318953  374846
20200505         57179    321192  378371
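For anyone who wants to try this without fetching the full feed, here is a minimal self-contained sketch of the same groupby approach, run on the sample rows from the question. The names `csv_text`, `sample`, `group`, and `summary` are only illustrative, and since the sample contains no NJ or CT rows, NY alone ends up in the tristate column.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question (a small subset of the real data).
csv_text = """date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Label every row as 'tristate' or 'not_tristate' (element-wise membership test).
group = np.where(sample["state"].isin(["NY", "NJ", "CT"]), "tristate", "not_tristate")

summary = (
    sample.groupby(["date", group])["positive"]
    .sum()                                   # one sum per (date, label) pair
    .unstack()                               # labels become columns
    .assign(total=lambda x: x.sum(axis=1))   # row-wise total of both columns
)
print(summary)
```

To sum a different column, such as hospitalizedCurrently from the full dataset, swap it in for "positive".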
Or use pivot_table, which gives a similar result:
print(all_states.assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                       'tristate', 'not_tristate'))
      .pivot_table(index='date', columns='state', values='positive',
                   aggfunc='sum', margins=True))
state     not_tristate  tristate      All
date
20200502         53128    312977   366105
20200503         54563    316415   370978
20200504         55893    318953   374846
20200505         57179    321192  1490300
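On the narrower "not in" part of the question: Python's `not in` operator tests a Series as a single object rather than element by element, so it cannot build a row filter (as far as I know it typically fails with an ambiguous-truth-value error). The usual pandas idiom is to negate an `isin` mask with `~`. A minimal sketch on a few of the sample rows, where `sample`, `tristate`, and `mask` are illustrative names:

```python
import pandas as pd

# A few rows taken from the sample data in the question.
sample = pd.DataFrame({
    "date": [20200505, 20200505, 20200505, 20200504, 20200504, 20200504],
    "state": ["CA", "NY", "WY", "CA", "NY", "WY"],
    "positive": [56212, 321192, 596, 54937, 318953, 586],
})

tristate = ["NY", "NJ", "CT"]

# isin() gives a boolean Series; ~ inverts it, which is the "not in" filter.
mask = sample["state"].isin(tristate)
not_tristate = sample[~mask].groupby("date", as_index=False)["positive"].sum()
print(not_tristate)
```

The same mask without the `~` selects the tristate rows, so both halves can be computed from a single `isin` call.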