I have two DataFrames like this:
import pandas as pd

data_2019_dict = {'state': ['Ohio', 'Texas', 'Pennsylvania', 'Nevada', 'New York', 'Nevada', 'Ohio', 'Virginia', 'Louisiana', 'Florida', 'Nevada'],
'industry': ['Agriculture', 'Agriculture', 'Agriculture', 'Agriculture', 'Medicine', 'Medicine', 'Medicine', 'Medicine', 'Manufacture', 'Manufacture', 'Manufacture'],
'value': [3.6, 3.2, 2.9, 2.4, 3.1, 1.5, 1.4, 0.9, 4.4, 2.0, 1.9]}
data_2020_dict = {'state': ['Kansas', 'Texas', 'California', 'Idaho', 'Nevada', 'Ohio', 'Virginia', 'Louisiana', 'Texas', 'Nevada'],
'industry': ['Agriculture', 'Agriculture', 'Agriculture', 'Medicine', 'Medicine', 'Finance', 'Finance', 'Manufacture', 'Manufacture', 'Manufacture'],
'value': [2.3, 1.8, 1.6, 7.2, 5.9, 4.1, 0.2, 5.1, 2.3, 2.2]}
data_2019 = pd.DataFrame(data_2019_dict)
data_2020 = pd.DataFrame(data_2020_dict)
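Each DataFrame shows which states performed well in which industries in one year. What I want to produce, but am stuck on, is: for each state, which industries performed well in both years? The resulting DataFrame would look like this: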
# Manually generated for illustration
data_both_dict = {'state': ['Ohio', 'Texas', 'Pennsylvania', 'Nevada', 'Nevada', 'New York', 'Virginia', 'Louisiana', 'Florida', 'Kansas', 'California', 'Idaho'],
'common_industry': ['', 'Agriculture', '', 'Medicine', 'Manufacture', '', '', 'Manufacture', '', '', '', ''],
'common_industry_count': [0, 1, 0, 2, 2, 0, 0, 1, 0, 0, 0, 0]
}
data_both = pd.DataFrame(data_both_dict)
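First use DataFrame.merge to keep the rows common to both DataFrames on the two columns, rename the column, and add the per-state count with Series.value_counts and Series.map: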
df = (data_2019.merge(data_2020, on=['state', 'industry'])
               .rename(columns={'industry': 'common_industry'}))
df['common_industry_count'] = df['state'].map(df['state'].value_counts())
df = df[['state', 'common_industry', 'common_industry_count']]
print(df)
       state common_industry  common_industry_count
0      Texas     Agriculture                      1
1     Nevada        Medicine                      2
2  Louisiana     Manufacture                      1
3     Nevada     Manufacture                      2
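As a side note on the counting step, here is a minimal sketch of what the value_counts plus map combination does, next to a groupby-based equivalent. The frame `merged` is a hypothetical stand-in for the merged result above, and `transform('size')` is an alternative, not what the answer itself uses.

```python
import pandas as pd

# Hypothetical toy frame standing in for the merged result shown above.
merged = pd.DataFrame({
    'state': ['Texas', 'Nevada', 'Louisiana', 'Nevada'],
    'common_industry': ['Agriculture', 'Medicine', 'Manufacture', 'Manufacture'],
})

# value_counts() returns a Series indexed by state;
# map() then looks up each row's state in that Series.
via_map = merged['state'].map(merged['state'].value_counts())

# Alternative: compute the size of each state's group directly, aligned to the rows.
via_transform = merged.groupby('state')['state'].transform('size')

print(via_map.tolist())
print(via_transform.tolist())
```

Both approaches should give the same per-row counts; which one reads better is largely a matter of taste.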
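Then collect all states: join the two state columns with concat, remove the duplicates with Series.drop_duplicates, and convert the result to a one-column DataFrame with Series.to_frame: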
both = pd.concat([data_2019['state'], data_2020['state']]).drop_duplicates().to_frame()
print(both)
          state
0          Ohio
1         Texas
2  Pennsylvania
3        Nevada
4      New York
7      Virginia
8     Louisiana
9       Florida
0        Kansas
2    California
3         Idaho
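Note that the index labels in this output are carried over from the two original frames, so they have gaps and repeat. The later merge builds a fresh index anyway, but if a clean index is wanted at this stage, one option is to reset it. The sketch below uses two small hypothetical Series (`s_2019`, `s_2020`) rather than the real data.

```python
import pandas as pd

# Hypothetical short state columns, standing in for data_2019['state'] and data_2020['state'].
s_2019 = pd.Series(['Ohio', 'Texas', 'Nevada'], name='state')
s_2020 = pd.Series(['Kansas', 'Texas', 'Idaho'], name='state')

states = (pd.concat([s_2019, s_2020])
            .drop_duplicates()         # keeps the first occurrence of each state
            .reset_index(drop=True)    # replaces the inherited labels with 0..n-1
            .to_frame())
print(states)
```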
Finally merge with a left join and replace the missing values with Series.fillna:
df = both.merge(df, how='left')
df['common_industry_count'] = df['common_industry_count'].fillna(0).astype(int)
df['common_industry'] = df['common_industry'].fillna('')
print(df)
           state common_industry  common_industry_count
0           Ohio                                      0
1          Texas     Agriculture                      1
2   Pennsylvania                                      0
3         Nevada        Medicine                      2
4         Nevada     Manufacture                      2
5       New York                                      0
6       Virginia                                      0
7      Louisiana     Manufacture                      1
8        Florida                                      0
9         Kansas                                      0
10    California                                      0
11         Idaho                                      0
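For reuse, the three steps can be wrapped in one function. This is only a sketch: the name `common_industries` and the small input frames are hypothetical, and it assumes each (state, industry) pair appears at most once per frame, since repeated pairs would multiply rows in the merge.

```python
import pandas as pd

def common_industries(first, second):
    """Hypothetical helper: per state, the industries present in both frames."""
    # Step 1: rows common to both frames, with a per-state count.
    common = (first.merge(second, on=['state', 'industry'])
                   .rename(columns={'industry': 'common_industry'}))
    common['common_industry_count'] = common['state'].map(common['state'].value_counts())
    common = common[['state', 'common_industry', 'common_industry_count']]

    # Step 2: every state that appears in either frame.
    states = pd.concat([first['state'], second['state']]).drop_duplicates().to_frame()

    # Step 3: left join, then fill the gaps for states with nothing in common.
    out = states.merge(common, on='state', how='left')
    out['common_industry_count'] = out['common_industry_count'].fillna(0).astype(int)
    out['common_industry'] = out['common_industry'].fillna('')
    return out

# Small made-up inputs, only to exercise the function.
first = pd.DataFrame({'state': ['Texas', 'Nevada', 'Ohio'],
                      'industry': ['Agriculture', 'Medicine', 'Medicine'],
                      'value': [3.2, 1.5, 1.4]})
second = pd.DataFrame({'state': ['Texas', 'Nevada', 'Kansas'],
                       'industry': ['Agriculture', 'Medicine', 'Agriculture'],
                       'value': [1.8, 5.9, 2.3]})

result = common_industries(first, second)
print(result)
```

With the question's data, calling `common_industries(data_2019, data_2020)` should reproduce the table above.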