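Pandas: reading columns and adding a new column

Posted by Liam in Dev

Liam

I'm currently using pandas in Python to load a large CSV file. I'm struggling to efficiently create and add a new column that is computed from the values in three columns of the DataFrame.

There are three columns (time, CO2eq and cost), and I'd like to add a new column named gcost based on some calculations.

The code below works, but it is slow. I believe it is the row['time'] lookups that slow it down:

Input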

Id,time,CO2eq,cost
0,10,10,10
1,5,5,5
2,2,3,6

Expected result

Id,time,CO2eq,cost,gcost
0,10,10,10,X
1,5,5,5,X
2,2,3,6,X

Code

# wftime, wfco2eq and wfcost are supplied by the front end.
    # Column names are listed in the same order as the header of the input file.
    hhinfo_input_df = pd.read_csv(input_file_path, header=0,
                              names=['Id', 'time', 'CO2eq', 'cost'])

    hhinfo_input_df['gcost'] = hhinfo_input_df.apply(cost_generate, axis=1)
    return hhinfo_input_df

#Normalized weighted values of each criterion (input by user)
def cost_generate(row):
    Norm_time = (row['time'] * (wftime / max_time)) * 100000
    Norm_co2eq = (row['CO2eq'] * (wfco2eq / max_co2eq)) * 100000
    Norm_cost = (row['cost'] * (wfcost / max_cost)) * 100000

    gcost = int(round(Norm_time)) + int(round(Norm_co2eq)) + int(round(Norm_cost))

    #gcost should never be 0.
    if gcost == 0:
        return 1
    return gcost

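Robert

There is no need to perform these operations at the row level. If you use the vectorized versions of these operations, pandas will handle this much faster: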

# Column names are listed in the same order as the header of the input file.
df = pd.read_csv(input_file_path, header=0,
                 names=['Id', 'time', 'CO2eq', 'cost'])

Norm_time = (df['time'] * (wftime / max_time)) * 100000
Norm_co2eq = (df['CO2eq'] * (wfco2eq / max_co2eq)) * 100000
Norm_cost = (df['cost'] * (wfcost / max_cost)) * 100000
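df["gcost"] = (Norm_time.round().astype(int)
               + Norm_co2eq.round().astype(int)
               + Norm_cost.round().astype(int))
# Keep the rule from the question: gcost should never be 0.
df.loc[df["gcost"] == 0, "gcost"] = 1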
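
To check that the column-wise approach gives the same numbers as the row-wise one, here is a minimal sketch run on the three sample rows from the question. The weights (0.5, 0.3, 0.2) and the maxima (all 10) are assumptions made up for illustration, since the post does not show where they come from, and gcost_vectorized is a hypothetical helper name.

```python
import io

import pandas as pd

# The three sample rows from the question.
csv_text = "Id,time,CO2eq,cost\n0,10,10,10\n1,5,5,5\n2,2,3,6\n"
df = pd.read_csv(io.StringIO(csv_text))

# Assumed values: the post does not show the real weights or maxima.
wftime, wfco2eq, wfcost = 0.5, 0.3, 0.2
max_time, max_co2eq, max_cost = 10, 10, 10


def cost_generate(row):
    """Row-wise version, as in the question."""
    norm_time = (row["time"] * (wftime / max_time)) * 100000
    norm_co2eq = (row["CO2eq"] * (wfco2eq / max_co2eq)) * 100000
    norm_cost = (row["cost"] * (wfcost / max_cost)) * 100000
    gcost = int(round(norm_time)) + int(round(norm_co2eq)) + int(round(norm_cost))
    # gcost should never be 0.
    return 1 if gcost == 0 else gcost


def gcost_vectorized(frame):
    """Column-wise version: every operation runs on a whole column at once."""
    norm_time = (frame["time"] * (wftime / max_time)) * 100000
    norm_co2eq = (frame["CO2eq"] * (wfco2eq / max_co2eq)) * 100000
    norm_cost = (frame["cost"] * (wfcost / max_cost)) * 100000
    total = (norm_time.round().astype(int)
             + norm_co2eq.round().astype(int)
             + norm_cost.round().astype(int))
    # Same rule as the row-wise version: replace any 0 with 1.
    return total.where(total != 0, 1)


row_wise = df.apply(cost_generate, axis=1)
column_wise = gcost_vectorized(df)

print(column_wise.tolist())
print((row_wise == column_wise).all())
```

Both versions round each term before summing, so they should agree row for row. The difference is that apply with axis=1 calls a Python function once per row, while the column-wise version does the arithmetic on whole columns.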
