I have a large DataFrame of about 450K rows. It contains a House_ID column, which serves as the identifier of a household. For example:
House_ID Resident_name SSN_ID Occupation Offer
9211 Aaron 1122 Unemployed No_offer
9211 Emelda 9831 Unemployed No_offer
9211 Brandt 9744 Prosecutor Household_offer
9080 Elise 8903 Teacher No_offer
9531 Ryan 7856 Unemployed Household_offer
9531 Gillian 9002 Unemployed Household_offer
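In the example DataFrame there are 2 households and 1 individual. My goal is to modify the "Offer" column based on the "Occupation" column within each household.
The logic is as follows:
The desired output is as follows: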
House_ID Resident_name SSN_ID Occupation Offer
9211 Aaron 1122 Unemployed Household_offer
9211 Emelda 9831 Unemployed Household_offer
9211 Brandt 9744 Prosecutor Household_offer
9080 Elise 8903 Teacher Household_offer
9531 Ryan 7856 Unemployed No_offer
9531 Gillian 9002 Unemployed No_offer
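Aaron and Emelda receive the household offer (even though both are unemployed) because another member of their household is working, whereas the members of Ryan's household receive no offer because none of them is working.
My current solution uses groupby and a loop: I group the rows by house ID, then check each member's employment status before assigning the offer, as follows: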
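import pandas as pd

# Group the rows by household (the column is named 'House_ID')
df = HouseDF.groupby('House_ID')
mergedDF = pd.DataFrame()
for i in range(len(list(df.groups))):
    # Copy so the assignment below does not write into a slice of HouseDF
    tempDF = df.get_group(list(df.groups)[i]).copy()
    employment_list = tempDF['Occupation'].unique().tolist()
    n_occupation = len(employment_list)
    # A single distinct occupation that is 'Unemployed' means nobody works
    if n_occupation == 1 and 'Unemployed' in employment_list:
        tempDF['Offer'] = 'No_offer'
    mergedDF = pd.concat([mergedDF, tempDF])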
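Although this solution barely works, the iteration takes a long time to finish: the original dataset has about 200K House_IDs (and 450K SSN_IDs), so looping over the households (and over each member to check their employment status) is too slow and inefficient.
Is there a more efficient way to set values based on conditions that span multiple rows and columns?
Thanks~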
Try groupby transform to check, for each household, whether all of its members are unemployed, then use np.where to assign 'No_offer' or 'Household_offer':
import numpy as np
import pandas as pd
HouseDF = pd.DataFrame({
    'House_ID': {0: 9211, 1: 9211, 2: 9211, 3: 9080, 4: 9531, 5: 9531},
    'Resident_name': {0: 'Aaron', 1: 'Emelda', 2: 'Brandt', 3: 'Elise',
                      4: 'Ryan', 5: 'Gillian'},
    'SSN_ID': {0: 1122, 1: 9831, 2: 9744, 3: 8903, 4: 7856, 5: 9002},
    'Occupation': {0: 'Unemployed', 1: 'Unemployed', 2: 'Prosecutor',
                   3: 'Teacher', 4: 'Unemployed', 5: 'Unemployed'},
    'Offer': {0: 'No_offer', 1: 'No_offer', 2: 'Household_offer',
              3: 'No_offer', 4: 'Household_offer', 5: 'Household_offer'}
})
all_unemployed = HouseDF.groupby('House_ID')['Occupation'] \
    .transform(lambda o: o.eq('Unemployed').all())
HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')
print(HouseDF)
HouseDF:
House_ID Resident_name SSN_ID Occupation Offer
0 9211 Aaron 1122 Unemployed Household_offer
1 9211 Emelda 9831 Unemployed Household_offer
2 9211 Brandt 9744 Prosecutor Household_offer
3 9080 Elise 8903 Teacher Household_offer
4 9531 Ryan 7856 Unemployed No_offer
5 9531 Gillian 9002 Unemployed No_offer
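The same rule can also be phrased without groupby: collect the households that have at least one working member and test membership with isin. This is only a minimal sketch of an alternative, not part of the original answer; the names employed_houses and has_worker are made up for illustration, and the frame is a reduced copy of the example data.

```python
import numpy as np
import pandas as pd

# Reduced copy of the example data (only the columns the rule needs).
HouseDF = pd.DataFrame({
    'House_ID': [9211, 9211, 9211, 9080, 9531, 9531],
    'Occupation': ['Unemployed', 'Unemployed', 'Prosecutor',
                   'Teacher', 'Unemployed', 'Unemployed'],
})

# Households with at least one member whose occupation is not 'Unemployed'.
employed_houses = HouseDF.loc[
    HouseDF['Occupation'].ne('Unemployed'), 'House_ID'
].unique()

# True for every row whose household appears in that set.
has_worker = HouseDF['House_ID'].isin(employed_houses)

HouseDF['Offer'] = np.where(has_worker, 'Household_offer', 'No_offer')
print(HouseDF)
```

Whether this is faster than the transform-based version on the real data would need to be measured.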
Groupby transform without np.where:
HouseDF['Offer'] = HouseDF.groupby('House_ID')['Occupation'].transform(
    lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer'
)
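A lambda passed to transform is called once per group in Python, which may become noticeable with around 200K households. One possible way to avoid it, sketched here under the assumption that the column names match the example, is to build the boolean column first and pass the built-in aggregation name 'all' to transform. The name is_unemployed is made up for illustration, and this variant is not included in the timings reported in this answer.

```python
import numpy as np
import pandas as pd

# Reduced copy of the example data (only the columns the rule needs).
HouseDF = pd.DataFrame({
    'House_ID': [9211, 9211, 9211, 9080, 9531, 9531],
    'Occupation': ['Unemployed', 'Unemployed', 'Prosecutor',
                   'Teacher', 'Unemployed', 'Unemployed'],
})

# Boolean Series: True where the resident is unemployed.
is_unemployed = HouseDF['Occupation'].eq('Unemployed')

# Group the boolean Series by household and broadcast "every member is
# unemployed" back to each row; 'all' selects pandas' built-in aggregation.
all_unemployed = is_unemployed.groupby(HouseDF['House_ID']).transform('all')

HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')
print(HouseDF)
```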
Some timing information (number=1000; total seconds as returned by timeit.timeit, measured on the 6-row example):
loop
3.6998247
transform + where
1.6054817000000003
transform
1.5982574999999999
import timeit
setup = '''
import numpy as np
import pandas as pd
HouseDF = pd.DataFrame(
{'House_ID': {0: 9211, 1: 9211, 2: 9211, 3: 9080, 4: 9531, 5: 9531},
'Resident_name': {0: 'Aaron', 1: 'Emelda', 2: 'Brandt', 3: 'Elise',
4: 'Ryan', 5: 'Gillian'},
'SSN_ID': {0: 1122, 1: 9831, 2: 9744, 3: 8903, 4: 7856, 5: 9002},
'Occupation': {0: 'Unemployed', 1: 'Unemployed', 2: 'Prosecutor',
3: 'Teacher', 4: 'Unemployed', 5: 'Unemployed'},
'Offer': {0: 'No_offer', 1: 'No_offer', 2: 'Household_offer',
3: 'No_offer', 4: 'Household_offer', 5: 'Household_offer'}})
'''
loop = '''
df = HouseDF.groupby('House_ID')
mergedDF = pd.DataFrame()
for i in range(len(list(df.groups))):
    # Copy so the assignment below does not write into a slice of HouseDF
    tempDF = df.get_group(list(df.groups)[i]).copy()
    employment_list = tempDF['Occupation'].unique().tolist()
    n_occupation = len(employment_list)
    # A single distinct occupation that is 'Unemployed' means nobody works
    if n_occupation == 1 and 'Unemployed' in employment_list:
        tempDF['Offer'] = 'No_offer'
    mergedDF = pd.concat([mergedDF, tempDF])
'''
transform1 = '''
all_unemployed = HouseDF.groupby('House_ID')['Occupation'] \
.transform(lambda o: o.eq('Unemployed').all())
HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')
'''
transform2 = '''
HouseDF['Offer'] = HouseDF.groupby('House_ID')['Occupation'].transform(
lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer'
)
'''
if __name__ == '__main__':
    print('loop')
    print(timeit.timeit(setup=setup, stmt=loop, number=1000))
    print('transform + where')
    print(timeit.timeit(setup=setup, stmt=transform1, number=1000))
    print('transform')
    print(timeit.timeit(setup=setup, stmt=transform2, number=1000))
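Because the script above times a 6-row frame, the numbers mostly reflect fixed overhead. To get a rough idea of how the two transform variants behave on more data, they can be timed on a synthetic frame. The following is only a sketch: make_frame, the sizes, the seed, and the occupation list are all made up for illustration, and no results are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_houses, seed=0):
    """Build a synthetic frame; sizes and occupations are illustrative only."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'House_ID': rng.integers(0, n_houses, size=n_rows),
        'Occupation': rng.choice(
            ['Unemployed', 'Teacher', 'Prosecutor'], size=n_rows),
    })


def with_where(df):
    # transform + np.where, as in the first variant of the answer
    all_unemployed = df.groupby('House_ID')['Occupation'].transform(
        lambda o: o.eq('Unemployed').all())
    return np.where(all_unemployed, 'No_offer', 'Household_offer')


def with_transform_only(df):
    # transform returning the label directly, as in the second variant
    return df.groupby('House_ID')['Occupation'].transform(
        lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer')


# Kept small so the sketch runs quickly; raise toward 450_000 rows and
# 200_000 households to approximate the dataset described in the question.
df = make_frame(n_rows=4_500, n_houses=2_000)

# Both variants should label every row identically.
assert (with_where(df) == with_transform_only(df).to_numpy()).all()

for name, func in [('transform + where', with_where),
                   ('transform', with_transform_only)]:
    print(name, timeit.timeit(lambda: func(df), number=3))
```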