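I want to create a column in pandas based on conditions on two other columns. I tried using if conditions inside a function applied row by row, but I get an error when checking the string values.
My DataFrame: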
import pandas as pd

df = pd.DataFrame({"Area": ['USA','India','China','UK','France','Germany','USA','USA','India','Germany'],
                   "Sales": [2,3,7,1,4,3,5,6,9,10]})
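I want to create a column Rating based on these conditions:
If the country is in ASIA and Sales > 2, then 1
If the country is in NA and Sales > 3, then 1
If the country is in EUR and Sales >= 4, then 1, otherwise 0
I am using a function: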
ASIA=['India','China']
NA= ['USA']
EUR=['UK','France','Germany']
def label_race(row):
    if row['Area'].isin(ASIA) & row['Sales'] >2 :
        return 1
    if row['Area'].isin(NA) & row['Sales'] >3 :
        return 1
    if row['Area'].isin(EUR) & row['Sales'] >=4 :
        return 1
    return 0
df['Rating']=df.apply(lambda row: label_race(row),axis=1)
This raises the following error:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
Please tell me what I am doing wrong in the function, or suggest any simpler way to do this.
Use a vectorized solution with numpy.select:
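import numpy as np

# One boolean mask per rule; each comparison is parenthesized because & binds tighter than > and >=
m = [df['Area'].isin(ASIA) & (df['Sales'] > 2),
     df['Area'].isin(NA) & (df['Sales'] > 3),
     df['Area'].isin(EUR) & (df['Sales'] >= 4)]
df['Rating'] = np.select(m, [1,1,1], default=0)
print(df)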
      Area  Sales  Rating
0      USA      2       0
1    India      3       1
2    China      7       1
3       UK      1       0
4   France      4       1
5  Germany      3       0
6      USA      5       1
7      USA      6       1
8    India      9       1
9  Germany     10       1
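Because all three rules assign the same value 1, the masks can also be combined with | and cast to int. This is only a sketch of that alternative, not part of the original answer; it reuses the question's df, ASIA, NA and EUR, and the name mask is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Area": ['USA','India','China','UK','France','Germany','USA','USA','India','Germany'],
                   "Sales": [2,3,7,1,4,3,5,6,9,10]})
ASIA = ['India', 'China']
NA = ['USA']
EUR = ['UK', 'France', 'Germany']

# A row is rated 1 if any of the three region rules holds.
# Comparisons are parenthesized because & and | bind tighter than > and >=.
mask = (
    (df['Area'].isin(ASIA) & (df['Sales'] > 2))
    | (df['Area'].isin(NA) & (df['Sales'] > 3))
    | (df['Area'].isin(EUR) & (df['Sales'] >= 4))
)
df['Rating'] = mask.astype(int)  # True -> 1, False -> 0
print(df)
```

numpy.select remains the more general tool, since it also handles the case where different rules should produce different values.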
In your own solution, use in and and instead of isin and &:
def label_race(row):
    if row['Area'] in (ASIA) and row['Sales'] >2 :
        return 1
    if row['Area'] in (NA) and row['Sales'] >3 :
        return 1
    if row['Area'] in (EUR) and row['Sales'] >=4 :
        return 1
    return 0
df['Rating']=df.apply(lambda row: label_race(row),axis=1)
print (df)
      Area  Sales  Rating
0      USA      2       0
1    India      3       1
2    China      7       1
3       UK      1       0
4   France      4       1
5  Germany      3       0
6      USA      5       1
7      USA      6       1
8    India      9       1
9  Germany     10       1
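To see why the original function failed, it may help to look at a single row in isolation. The sketch below builds one row by hand (a stand-in for what apply passes with axis=1) and shows two things: row['Area'] is a plain Python string, which has no isin method, and & binds tighter than >, so the unparenthesized condition is grouped differently from what was intended. The variable names are illustrative only.

```python
import pandas as pd

ASIA = ['India', 'China']
# Hand-built stand-in for one row as passed by df.apply(..., axis=1)
row = pd.Series({'Area': 'India', 'Sales': 3})

# Indexing a row Series yields scalars, not Series.
area = row['Area']
print(type(area).__name__)    # str
print(hasattr(area, 'isin'))  # False, hence the AttributeError

# `&` binds tighter than `>`, so this groups as ((area in ASIA) & Sales) > 2.
unparenthesized = (area in ASIA) & row['Sales'] > 2
# `and` has lower precedence than `>`, so this groups as intended.
intended = (area in ASIA) and row['Sales'] > 2
print(unparenthesized, intended)
```

For this row the two expressions disagree, which is why switching to and (or adding parentheses around each comparison) matters in addition to replacing isin.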
The difference between the two solutions is performance:
#[10000 rows x 3 columns]
df = pd.concat([df] * 1000, ignore_index=True)
In [216]: %timeit df['Rating1']=df.apply(lambda row: label_race(row),axis=1)
275 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [217]: %timeit df['Rating'] = np.select(m, [1,1,1], default=0)
215 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I also tried checking @Anton vBR's idea from the comments (using elif/else):
def label_race(row):
    if row['Area'] in (ASIA) and row['Sales'] >2 :
        return 1
    elif row['Area'] in (NA) and row['Sales'] >3 :
        return 1
    elif row['Area'] in (EUR) and row['Sales'] >=4 :
        return 1
    else:
        return 0
df['Rating1']=df.apply(lambda row: label_race(row),axis=1)
In [223]: %timeit df['Rating1']=df.apply(lambda row: label_race(row),axis=1)
268 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
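The timings above were taken with IPython's %timeit. Outside IPython, a rough comparison can be reproduced with the standard timeit module. This is a sketch only: the helper rate_select and the variable names are made up here, the numbers will vary by machine and pandas version, and unlike the %timeit line above it also counts the time to build the masks.

```python
import timeit

import numpy as np
import pandas as pd

ASIA, NA, EUR = ['India', 'China'], ['USA'], ['UK', 'France', 'Germany']

def label_race(row):
    """Row-wise rule from the answer above."""
    if row['Area'] in ASIA and row['Sales'] > 2:
        return 1
    if row['Area'] in NA and row['Sales'] > 3:
        return 1
    if row['Area'] in EUR and row['Sales'] >= 4:
        return 1
    return 0

def rate_select(frame):
    """Vectorized rule: build the masks, then pick values with np.select."""
    m = [frame['Area'].isin(ASIA) & (frame['Sales'] > 2),
         frame['Area'].isin(NA) & (frame['Sales'] > 3),
         frame['Area'].isin(EUR) & (frame['Sales'] >= 4)]
    return np.select(m, [1, 1, 1], default=0)

small = pd.DataFrame({"Area": ['USA','India','China','UK','France','Germany','USA','USA','India','Germany'],
                      "Sales": [2,3,7,1,4,3,5,6,9,10]})
big = pd.concat([small] * 1000, ignore_index=True)  # 10000 rows

# Both approaches should agree before their speed is compared.
by_apply = big.apply(label_race, axis=1)
by_select = rate_select(big)
assert (by_apply.to_numpy() == by_select).all()

runs = 3
t_apply = timeit.timeit(lambda: big.apply(label_race, axis=1), number=runs) / runs
t_select = timeit.timeit(lambda: rate_select(big), number=runs) / runs
print(f"apply:     {t_apply * 1e3:.2f} ms per run")
print(f"np.select: {t_select * 1e3:.2f} ms per run")
```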