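I have data from sporting events, and I know that each home arena has its own bias, which I want to adjust for. I have built a dictionary where the arena is the key and the value is the adjustment I want to apply.
So for each row, I want to take the home team, look up its adjustment, and subtract it from the distance column. I have the following code, but it doesn't seem to work.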
# Making the dictionary, this is working properly
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
    df_temp = df[df.home_team == team]
    average = round(df_temp.event_distance.mean(), 2)
    adj_shot_dict[team] = average

def make_adjustment(df):
    team = df.home_team
    distance = df.event_distance
    adj_dist = distance - adj_shot_dict[team]
    return adj_dist

df['adj_dist'] = df['event_distance'].apply(make_adjustment)
IIUC, you already have the dictionary and simply want to subtract the adj_shot_dict value for each row's home team from the event_distance column:
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
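For context on why the original attempt fails: Series.apply passes make_adjustment one scalar distance at a time, so there is no home_team to look up inside the function. The sketch below uses invented data (the team names, distances, and dictionary values are assumptions for illustration only) to show the map approach next to a row-wise apply with axis=1, which is slower but closer to the original function.

```python
import pandas as pd

# Toy data; team names and distances are invented for illustration.
df = pd.DataFrame({
    "home_team": ["team1", "team2", "team2", "team3"],
    "event_distance": [10.0, 20.0, 50.0, 60.0],
})

# Hypothetical per-arena adjustments, standing in for adj_shot_dict.
adj_shot_dict = {"team1": 10.0, "team2": 35.0, "team3": 60.0}

# map looks up each row's home team in the dictionary and returns a
# Series aligned with df, so the subtraction is vectorised.
df["adj_dist"] = df["event_distance"] - df["home_team"].map(adj_shot_dict)

# Row-wise variant of the original function: with axis=1, DataFrame.apply
# passes each row, so both columns are available.
def make_adjustment(row):
    return row["event_distance"] - adj_shot_dict[row["home_team"]]

df["adj_dist_rowwise"] = df.apply(make_adjustment, axis=1)
```

Note that map yields NaN for any team missing from the dictionary, whereas the row-wise version raises a KeyError.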
Old answer
Group by home_team, compute the mean of event_distance for each group, then subtract that mean from event_distance:
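df['adj_dist'] = df['event_distance'] \
                 - df.groupby('home_team')['event_distance'] \
                     .transform('mean').round(2)
# OR
# group_keys=False keeps the original index so the result aligns with df
# (pandas 2.0+ otherwise prepends the group key to the index)
df['adj_dist'] = df.groupby('home_team', group_keys=False)['event_distance'] \
                   .apply(lambda x: x - x.mean().round(2))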
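As a quick sanity check, the dictionary route and the transform route should give the same adjusted distances. This is a minimal sketch on invented data (the frame below is an assumption, not the asker's dataset):

```python
import pandas as pd

# Toy data; values are invented for illustration.
df = pd.DataFrame({
    "home_team": ["team1", "team2", "team2", "team3", "team3"],
    "event_distance": [10.0, 20.0, 50.0, 60.0, 61.0],
})

# Dictionary route: per-team mean rounded to 2 decimals, then map.
adj_shot_dict = (
    df.groupby("home_team")["event_distance"].mean().round(2).to_dict()
)
via_map = df["event_distance"] - df["home_team"].map(adj_shot_dict)

# transform route: broadcast each group's mean back onto its rows.
group_mean = df.groupby("home_team")["event_distance"].transform("mean")
via_transform = df["event_distance"] - group_mean.round(2)
```

Both Series keep the original index of df, so either can be assigned directly to a new column.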
Performance
>>> len(df)
60000
>>> df.sample(5)
  home_team  event_distance
5     team3              60
4     team2              50
1     team2              20
1     team2              20
0     team1              10
def loop():
    teams = df.home_team.unique().tolist()
    adj_shot_dict = {}
    for team in teams:
        df_temp = df[df.home_team == team]
        average = round(df_temp.event_distance.mean(), 2)
        adj_shot_dict[team] = average

def loop2():
    df.groupby('home_team')['event_distance'].transform('mean').round(2)

>>> %timeit loop()
13.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit loop2()
3.62 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Total process
>>> %timeit df['event_distance'] - df.groupby('home_team')['event_distance'].transform('mean').round(2)
3.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)