Merging multiple DataFrames

Posted by PEBKAC in Dev

PEBKAC

This question refers to an earlier post.

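The solution proposed there works well for smaller datasets. Here I am dealing with 7 .txt files totaling 750 MB. That should not be too large, so I must be doing something wrong along the way.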

import pandas as pd

# Same column selection for every file; only the file name changes
df1 = pd.read_csv('Data1.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df2 = pd.read_csv('Data2.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df3 = ...
df4 = ...

This is what one of my DataFrames (df1) looks like. Head:

  name_profile depth           VAR1  ...  year  month  day
0  profile_1   0.6           0.2044  ...  2012     11  26
1  profile_1   0.6           0.2044  ...  2012     11  26
2  profile_1   1.1           0.2044  ...  2012     11  26
3  profile_1   1.2           0.2044  ...  2012     11  26
4  profile_1   1.4           0.2044  ...  2012     11  26
...

And tail:

       name_profile     depth              VAR1  ...  year  month  day
955281  profile_1300   194.600006          0.01460  ...  2015      3  20
955282  profile_1300   195.800003          0.01095  ...  2015      3  20
955283  profile_1300   196.899994          0.01095  ...  2015      3  20
955284  profile_1300   198.100006          0.00730  ...  2015      3  20
955285  profile_1300   199.199997          0.01825  ...  2015      3  20

I followed a suggestion and dropped duplicates:

df1.drop_duplicates()
...

and so on for the other DataFrames.
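One detail worth checking here: `DataFrame.drop_duplicates()` returns a new DataFrame by default and leaves the original untouched unless the result is assigned back (or `inplace=True` is passed). The sketch below illustrates this on a toy frame; the values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-in for df1; the values are illustrative only
df1 = pd.DataFrame({
    "name_profile": ["profile_1", "profile_1", "profile_1"],
    "depth": [0.6, 0.6, 1.1],
    "VAR1": [0.2044, 0.2044, 0.2044],
})

df1.drop_duplicates()          # result is discarded, df1 is unchanged
rows_before = len(df1)

df1 = df1.drop_duplicates()    # assign back to keep the deduplicated rows
rows_after = len(df1)
```

If the call in the question was written exactly as shown, the duplicated rows may still be present in df1 when the index is built.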

Similarly, df2 has VAR2, df3 has VAR3, and so on.

I modified the solution based on one of the answers to the earlier post.

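The goal is to create a new, merged DataFrame that holds every VARX (one from each dfX) as an additional column next to depth, profile, and the other three columns (year, month, day), so I tried something like this: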

dfs = [df.set_index(['depth','name_profile', 'year', 'month', 'day']) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = (pd.concat(dfs, axis=1).reset_index())

The current error is:

ValueError: cannot handle a non-unique multi-index!

What am I doing wrong?
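A quick way to see whether the index is the culprit is to check if the five key columns identify each row uniquely. The following is a minimal sketch on a made-up frame (the values and the name `key_cols` are assumptions for illustration): two rows share the same key but differ in VAR1, so dropping full-row duplicates does not remove them.

```python
import pandas as pd

key_cols = ["depth", "name_profile", "year", "month", "day"]

# Toy frame: the two rows at depth 0.6 share a key but differ in VAR1
df = pd.DataFrame({
    "name_profile": ["profile_1"] * 3,
    "depth": [0.6, 0.6, 1.1],
    "VAR1": [0.2044, 0.2100, 0.2044],
    "year": [2012] * 3,
    "month": [11] * 3,
    "day": [26] * 3,
})

df = df.drop_duplicates()  # no full-row duplicates here, so nothing is removed

# False when at least one key combination occurs more than once
index_is_unique = df.set_index(key_cols).index.is_unique

# All rows whose key combination is repeated, for inspection
repeated_keys = df[df.duplicated(subset=key_cols, keep=False)]
```

Running the same two checks on each of the real DataFrames should show which files contain repeated keys.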

Parfait

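Again, consider horizontal concatenation with pandas.concat. Because you have multiple rows sharing the same profile, depth, year, month, and day, add a running count to the multi-index so that every key becomes unique. The count is calculated with groupby().cumcount():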

grp_cols = ['depth', 'name_profile', 'year', 'month', 'day']

dfs = [(df.assign(grp_count = df.groupby(grp_cols).cumcount())
          .set_index(grp_cols + ['grp_count'])
       ) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = pd.concat(dfs, axis=1).reset_index()

print(df_merged)
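To make the effect of the running count concrete, here is a small self-contained sketch of the same approach on two made-up frames. The helper `toy` and all values are hypothetical and stand in for the real files.

```python
import pandas as pd

grp_cols = ["depth", "name_profile", "year", "month", "day"]

def toy(var_name, values):
    """Build a 3-row toy frame in which depth 0.6 appears twice."""
    return pd.DataFrame({
        "name_profile": ["profile_1"] * 3,
        "depth": [0.6, 0.6, 1.1],
        var_name: values,
        "year": [2012] * 3,
        "month": [11] * 3,
        "day": [26] * 3,
    })

df1 = toy("VAR1", [0.20, 0.21, 0.22])
df2 = toy("VAR2", [1.0, 1.1, 1.2])

# cumcount numbers the rows inside each key group 0, 1, 2, ...
# so (key, grp_count) is unique even where the key alone repeats
dfs = [
    df.assign(grp_count=df.groupby(grp_cols).cumcount())
      .set_index(grp_cols + ["grp_count"])
    for df in [df1, df2]
]

merged = pd.concat(dfs, axis=1).reset_index()
```

Note that rows with a repeated key are paired across frames by their order of appearance within that key, so this assumes the files list such rows in a consistent order. The helper column can be removed afterwards with `merged.drop(columns="grp_count")`.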
