Creating a new column in a df using 2 parameters

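Dear all,

I need to create a new column based on 2 conditions: countries with a population above 50,000 (the population column is in thousands), ranked by recovery rate in descending order.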


df1['Recovery Rate'] = df1.apply(lambda x: (x['Total Recovered']/x['Total Infected']), axis = 1)

df1['Populated Country'] = df1.apply(if lambda row: row.Country == Country and (row: row.Population 2020 (in thousands) >= 50000), axis = 1) 

df1.sort_values(['Recovery Rate'], ascending = [False])

print(df1[['Populated Country','Recovery Rate']].head(10))

But I get the following error on the line that creates the new column.


File "<ipython-input-25-ab35558abd61>", line 4
df1['Populated Country'] = df1.apply(if lambda row: row.Country == Country and (row: row.Population 2020 (in thousands) >= 50000), axis = 1)
                                         ^
SyntaxError: invalid syntax

>Country    Daily Tests Daily Tests per 100000 people   Pop density per sq. km  Urban Population (%)    Start Date of Quarantine/Lockdown   Start Date of Schools Closure   Start Date of Public Place Restrictions Hospital Beds per 1000 people   M-to-F Gender Ratio at Birth    ... Death rate from lung diseases per 100k people for male  Median Age  GDP 2018    Crime Index Population 2020 (in thousands)  Smokers in Population (%)   % of Females in Population  Total Infected  Total Deaths    Total Recovered
>0  Albania NaN NaN 105 63  NaN NaN NaN 2.9 1.08    ... 17.04   32.9    1.510250e+10    40.02   2877.797    28.7    49.063095   949 31  742
>1  Algeria NaN NaN 18  73  NaN NaN NaN 1.9 1.05    ... 12.81   28.1    1.737580e+11    54.41   43851.044   15.6    49.484268   7377    561 3746
>2  Argentina   NaN NaN 17  93  3/20/2020   NaN NaN 5.0 1.05    ... 42.59   31.7    5.198720e+11    62.96   45195.774   21.8    51.237348   8809    393 2872
>3  Armenia 694.0   2.342029    104 63  NaN NaN NaN 4.2 1.13    ... 35.99   35.1    1.243309e+10    20.78   2963.243    24.1    52.956577   5041    64  2164
>4  Australia   31635.0 12.405939   3   86  NaN NaN 3/23/2020   3.8 1.06    ... 22.16   38.7    1.433900e+12    42.70   25499.884   14.7    50.199623   7072    100 6431

Here is the data: https://raw.githubusercontent.com/ptw2/PRGA/main/covid19_by_country.csv

This is the result I should get:

>         Country  Recovery Rate
>17         China       0.943459
>87      Thailand       0.941972
>47   South Korea       0.906031
>32       Germany       0.875705
>95       Vietnam       0.811728

Can anyone help?

帕克佩

In this case, it is cleaner to define a function that does the calculation and then apply that function in a lambda statement:

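def compute_rr(row):
    # Column names that contain spaces need bracket access, not row.Name
    if row['Population 2020 (in thousands)'] >= 50000:
        return row['Total Recovered'] / row['Total Infected']
    return float('nan')  # countries below the threshold get NaN

df1['Recovery Rate'] = df1.apply(lambda row: compute_rr(row), axis=1)
# sort_values returns a new DataFrame, so assign the result back
df1 = df1.sort_values(['Recovery Rate'], ascending=[False])

print(df1[['Country', 'Total Recovered', 'Total Infected', 'Recovery Rate']].head())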

#Output:
        Country  Total Recovered  Total Infected  Recovery Rate
17        China            79310           84063       0.943459
87     Thailand             2857            3033       0.941972
47  South Korea            10066           11110       0.906031
32      Germany           155681          177778       0.875705
95      Vietnam              263             324       0.811728
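
For comparison, the same filter-then-rank idea can also be written without a row-wise apply. This is only a sketch on a small toy frame: the countries A to D and all of their numbers are made up for illustration, and only the column names are taken from the question. It builds the 'Populated Country' flag the question asked for, then uses Series.where to keep the rate only where that flag is True.

```python
import pandas as pd

# Toy data with made-up numbers; only the column names come from the question.
df = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Population 2020 (in thousands)': [60000.0, 2500.0, 80000.0, 50000.0],
    'Total Infected': [1000, 500, 2000, 400],
    'Total Recovered': [900, 450, 1000, 300],
})

pop_col = 'Population 2020 (in thousands)'

# Boolean flag: True where the population threshold is met.
df['Populated Country'] = df[pop_col] >= 50000

# Column-wise division, then replace the rate with NaN below the threshold.
rate = df['Total Recovered'] / df['Total Infected']
df['Recovery Rate'] = rate.where(df['Populated Country'])

# sort_values returns a new frame, so the result has to be assigned.
ranked = df.sort_values('Recovery Rate', ascending=False)
print(ranked[['Country', 'Recovery Rate']])
```

By default sort_values places NaN rows last, so the countries below the threshold end up at the bottom of the ranking.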

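If you do want to change the dataframe so that countries below the 50,000 population threshold are removed, just add the following line to the bottom of the previous code. It deletes every row that contains NaN in the 'Recovery Rate' column.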
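
The line itself does not appear in this post. As a hedged sketch of what such a step could look like, pandas' dropna with the subset argument removes rows whose 'Recovery Rate' is NaN. The toy frame below uses made-up values and is not the author's code.

```python
import pandas as pd

# Toy frame with made-up values; NaN marks a country below the population threshold.
df1 = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Recovery Rate': [0.9, float('nan'), 0.5],
})

# Keep only the rows where 'Recovery Rate' is not NaN.
df1 = df1.dropna(subset=['Recovery Rate'])
print(df1)
```

Restricting dropna to that one column matters here, because the dataset has NaN in many other columns (for example 'Daily Tests') that should not cause rows to be dropped.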