Pandas: getting combinations of columns with high correlation

Posted by Peter in Dev

Peter

I have a dataset with 6 columns, and I had pandas compute the correlation matrix from it, with the following result:

               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000

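How can I get the combinations of columns whose correlation is above some threshold, say 0.5, excluding the pairs where a column is compared with itself? In this case, the output should look something like: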

[('height', 'weight')]

I tried doing this with for loops, but I don't think this is the right or most efficient way:

correlated = []
for column1 in columns:
    for column2 in columns:
        if column1 != column2:
            correlation = df[column1].corr(df[column2])
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))

Here df is my original DataFrame. This outputs the desired result:

[(u'height', u'weight')]
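
For reference, the loop above recomputes each correlation twice and needs the extra `(column2, column1)` check to avoid duplicates. One possible tidier variant is sketched below: it computes the correlation matrix once and walks over unordered column pairs with `itertools.combinations`. The function name `correlated_pairs` and the `toy` DataFrame are made up for illustration and are not the dataset from the question.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs(df, threshold=0.5):
    """Return column pairs whose pairwise correlation exceeds threshold."""
    corr = df.corr()  # compute the full correlation matrix once
    # combinations() yields each unordered pair exactly once, so there are
    # no self-pairs and no (b, a) duplicates of (a, b).
    return [(a, b) for a, b in combinations(corr.columns, 2)
            if corr.loc[a, b] > threshold]


# Hypothetical toy data, NOT the dataset from the question.
toy = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 90],
    "siblings": [2, 0, 3, 0, 2],
})
print(correlated_pairs(toy))
```

On this toy data only height and weight are strongly correlated, so the call prints `[('height', 'weight')]`.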

Michael Brennan

The following uses numpy and assumes that you already have the correlation matrix in df:

import numpy as np

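# Row and column positions of every entry above the threshold.
indices = np.where(df > 0.5)
# Keep only the upper triangle (x < y); this drops the diagonal
# and the mirrored duplicate of each pair.
indices = [(df.index[x], df.columns[y]) for x, y in zip(*indices)
                                        if x < y]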

This results in indices containing: