How to pair the most similar columns of two DataFrames using a greedy approach in Python

Shaykh_Python

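I have two DataFrames of size 24x10 (the real DataFrames are much larger). All columns need to be paired greedily: enumerate the columns of df1 and, for each one, find the most similar (not necessarily identical) column in df2. The end result should pair every column of df1 with a previously unassigned column of df2. The DataFrames are as follows.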

import pandas as pd

df1 = pd.DataFrame([[1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 0.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 2., 1., 1., 1., 0., 1., 1., 0.],
   [2., 1., 1., 3., 1., 1., 0., 1., 1., 1.],
   [2., 1., 1., 2., 1., 1., 1., 1., 1., 0.],
   [2., 1., 1., 3., 1., 1., 1., 1., 1., 0.],
   [2., 1., 1., 3., 1., 2., 1., 1., 1., 0.],
   [2., 1., 1., 4., 2., 2., 1., 1., 1., 1.],
   [2., 4., 1., 4., 3., 1., 1., 1., 1., 1.],
   [2., 4., 1., 4., 3., 1., 1., 1., 1., 1.],
   [2., 4., 1., 5., 2., 1., 0., 1., 1., 1.],
   [2., 4., 1., 6., 2., 1., 0., 1., 1., 1.],
   [2., 4., 1., 5., 2., 1., 1., 1., 1., 1.],
   [2., 4., 1., 5., 1., 1., 0., 1., 1., 1.],
   [2., 4., 1., 5., 3., 1., 1., 1., 1., 1.],
   [1., 4., 1., 4., 2., 1., 1., 1., 1., 1.],
   [1., 4., 2., 4., 2., 1., 1., 1., 1., 1.],
   [1., 1., 2., 3., 2., 1., 1., 1., 1., 1.],
   [1., 1., 2., 1., 2., 1., 1., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.]])

df2 = pd.DataFrame([[0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
   [0., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
   [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
   [0., 0., 1., 1., 1., 1., 1., 1., 0., 1.],
   [0., 0., 1., 1., 1., 1., 1., 1., 0., 1.],
   [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
   [0., 1., 1., 1., 1., 1., 0., 0., 1., 1.],
   [0., 0., 1., 1., 1., 1., 1., 0., 1., 1.],
   [0., 1., 1., 1., 1., 1., 0., 0., 1., 1.],
   [0., 0., 1., 1., 1., 1., 1., 1., 0., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [1., 1., 1., 1., 1., 1., 0., 0., 0., 1.],
   [0., 1., 1., 1., 0., 0., 1., 1., 1., 1.],
   [0., 0., 1., 1., 1., 1., 0., 1., 1., 1.],
   [0., 1., 1., 1., 1., 1., 0., 0., 1., 1.],
   [1., 1., 1., 1., 0., 0., 0., 1., 1., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [1., 1., 1., 1., 1., 0., 0., 0., 1., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [1., 1., 1., 1., 1., 1., 0., 0., 0., 1.],
   [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
   [0., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
   [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.]] )

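"Most similar" can be defined as the largest number of common elements between two columns. Any help or hints are appreciated. I tried the following.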

for key1, value1 in df1.items():  # iteritems() was removed in pandas 2.0
    for key2, value2 in df2.items():
        # elements of the df1 column that also occur anywhere in the df2 column
        common_elements = [e for e in list(value1) if e in list(value2)]
        l = len(common_elements)

Joe Ferndz

Here is one way to do the matching.

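Assumptions:

#1: Once a column in df1 is matched to a column in df2, both are excluded from further matching. For example, if columns 1 and 3 of df1 both match column 5 of df2 exactly, only column 1 of df1 is paired with column 5 of df2. Column 3 of df1 has to look for a different match.

#2: df1 and df2 have the same number of rows. For this example I also assumed the same number of columns. A small tweak to the code can handle a different number of columns, but the row counts must match.

#3: Columns are compared position by position. In other words, if row 1 of the df1 column holds 1, then row 1 of the df2 column must also hold 1 to count as a match. Values are matched on the same row only; the data is not rearranged to look for matches.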
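Assumption #3 amounts to counting the rows where two columns hold the same value. A minimal sketch on two made-up columns (the names `a` and `b` and their values are illustrative, not taken from the question):

```python
import pandas as pd

# Two toy columns of equal length (hypothetical data, not from the question)
a = pd.Series([1., 1., 2., 0.])
b = pd.Series([1., 0., 2., 0.])

# Element-wise equality gives a boolean Series; summing it counts the True values
match_count = int((a == b).sum())
print(match_count)
```

Both Series use the default index here, so the comparison lines up row by row.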

Based on these assumptions, the code is as follows.

#create a list to store all the match counts
df_list = []

#iterate through df1 first
for cols1 in df1.columns:

    #convert df1 column value to a list
    x = df1[cols1].tolist()

    #iterate through df2 to match to df1 column data
    for cols2 in df2.columns:

        #convert df2 column value to a list
        y = df2[cols2].tolist()

        #iterate and compare each value in df1[col1] with df2[col2]
        #i==j will result in True or False
        #sum() will count all True values (i.e., all matched values)

        z = sum((i==j) for i,j in zip(x,y))

        #store match count, col 1, col 2 into the list
        df_list.append((z,cols1,cols2))

#once you have iterated through df2 for each df1
#sort the df_list by descending order of match count, ascending order of df1 column
#highest match will be first, then df1 column
df_list = sorted(df_list,key=lambda x:(-x[0],x[1]))

dfc1,dfc2,points = [],[],[]

#iterate thru df_list and pick only if df1 column and df2 column were not picked earlier
#dfc1, dfc2, points will store each matched pair

for p,c1,c2 in df_list:
    if (c1 not in dfc1) and (c2 not in dfc2):
        points.append(p)
        dfc1.append(c1)
        dfc2.append(c2)

#print the matched values

for i in range(len(dfc1)):
    print (f'{points[i]:2} rows of df1[{dfc1[i]}] matches with df2[{dfc2[i]}]')

For the input DataFrames df1 and df2, the output is:

24 rows of df1[7] matches with df2[3]
23 rows of df1[8] matches with df2[2]
20 rows of df1[5] matches with df2[4]
18 rows of df1[2] matches with df2[5]
17 rows of df1[6] matches with df2[9]
15 rows of df1[9] matches with df2[1]
12 rows of df1[1] matches with df2[6]
 9 rows of df1[4] matches with df2[7]
 5 rows of df1[0] matches with df2[8]
 1 rows of df1[3] matches with df2[0]
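The code above ranks all column pairs globally before picking. The question also describes a sequential variant: walk through the columns of df1 in order and give each one the best df2 column that is still unassigned. The following is a rough sketch of that variant, assuming equal row counts; the function name `greedy_pair_in_order` and the toy frames `left` and `right` are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def greedy_pair_in_order(df1, df2):
    """Pair each df1 column, in order, with its best still-unassigned df2 column."""
    a = df1.to_numpy()
    b = df2.to_numpy()
    # scores[i, j] = number of rows where df1 column i equals df2 column j
    scores = (a[:, :, None] == b[:, None, :]).sum(axis=0)
    free = list(range(b.shape[1]))  # positions of df2 columns not yet taken
    pairs = []
    for i, c1 in enumerate(df1.columns):
        if not free:
            break  # df2 has run out of columns
        # max() returns the first maximum, so ties go to the earliest free column
        j = max(free, key=lambda k: scores[i, k])
        free.remove(j)
        pairs.append((c1, df2.columns[j], int(scores[i, j])))
    return pairs

# Toy data (hypothetical) to show the call
left = pd.DataFrame({"a": [1, 1, 0], "b": [1, 0, 0]})
right = pd.DataFrame({"x": [1, 0, 0], "y": [1, 1, 0]})
pairs = greedy_pair_in_order(left, right)
print(pairs)
```

The broadcast comparison builds an intermediate boolean array of shape rows x df1 columns x df2 columns, so for very large frames it may be preferable to compute the scores one df1 column at a time.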

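You can choose a cutoff (for example, only keep pairs with more than 15 matching rows). The filter can be applied before appending the data to the list.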
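One way such a filter might look is sketched below. The helper `match_counts` and its `min_matches` parameter are hypothetical names introduced for illustration, and whether the cutoff is inclusive is a choice to adjust as needed.

```python
import pandas as pd

def match_counts(df1, df2, min_matches=0):
    """Return (count, df1 column, df2 column) tuples with count >= min_matches."""
    results = []
    for c1 in df1.columns:
        for c2 in df2.columns:
            # position-wise comparison on the underlying arrays
            count = int((df1[c1].to_numpy() == df2[c2].to_numpy()).sum())
            # apply the cutoff before storing the pair
            if count >= min_matches:
                results.append((count, c1, c2))
    return results

# Toy data (hypothetical) to show the call
left = pd.DataFrame({"a": [1, 1, 0], "b": [1, 0, 0]})
right = pd.DataFrame({"x": [1, 0, 0], "y": [1, 1, 0]})
kept = match_counts(left, right, min_matches=3)
print(kept)
```

The resulting list can then be sorted and consumed by the same selection loop as before. Columns whose best score falls below the cutoff are left unpaired.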


Edited on 2021-02-22
