Is there a package that can smooth or even out the distribution of an array by removing samples?

Posted by Shane Smiskol in Dev

Shane Smiskol

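I have an array of values ranging from 0 to 1 that serve as the ground-truth outputs for a neural network I'm building. However, the distribution is very wide and uneven, so I'm curious whether there is a Python package that can remove samples so that the values are spread more evenly across the range.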

Here is the distribution plot from seaborn.distplot().

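What I want to do is essentially specify how many "sections" to split the array into, and then remove values from the largest sections so that the distribution is more even.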

A plot of the output of such a function might look like this:

Is there anything built into numpy or scipy that does this?
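To make the idea of "sections" concrete, here is a minimal sketch that counts how many values fall into each equal-width section using np.histogram. The data, the seed, and the choice of 10 sections are all made up for illustration; they are not from my actual dataset.

```python
import numpy as np

# Hypothetical data: 1,000 values in [0, 1], clustered around 0.5
rng = np.random.default_rng(0)
y = np.clip(rng.normal(0.5, 0.1, size=1000), 0.0, 1.0)

n_sections = 10  # illustrative choice

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(y, bins=n_sections, range=(0.0, 1.0))

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f} to {hi:.1f}: {c}")
```

The sections with the largest counts are the ones I would want to thin out.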

Shane Smiskol

In case this helps anyone in the future, here is what I came up with:

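import random

import numpy as np


def reject_outliers(x_t, y_t, m):
    # Keep only the (x, y) pairs whose y lies within m standard deviations of the mean.
    mean = np.mean(y_t)
    std = np.std(y_t)
    x_t, y_t = zip(*[[x, y] for x, y in zip(x_t, y_t) if abs(y - mean) < (m * std)])
    return list(x_t), np.array(y_t)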


def even_out_distribution(x_t, y_t, n_sections, reduction=0.5, reduce_min=.5, m=2):
    # Note: reduce_min is currently unused; see the todo below.
    x_t, y_t = reject_outliers(x_t, y_t, m)
    # Split the range of y into n_sections equal-width sections.
    linspace = np.linspace(np.min(y_t), np.max(y_t), n_sections + 1)
    sections = [[] for i in range(n_sections)]
    for x, y in zip(x_t, y_t):
        # Index of the section containing y; the minimum value maps to section 0.
        where = max(np.searchsorted(linspace, y) - 1, 0)
        sections[where].append([x, y])
    sections = [sec for sec in sections if sec != []]  # drop empty sections

    # todo: use np.mean([len(i) for i in sections]) * reduce_min in place of the minimum
    min_section = min([len(i) for i in sections])
    print([len(i) for i in sections])  # section sizes before thinning
    new_sections = []
    for section in sections:
        this_section = list(section)
        if len(section) > min_section:
            # Randomly drop a fraction `reduction` of the excess over the smallest section.
            to_remove = (len(section) - min_section) * reduction
            for i in range(int(to_remove)):
                this_section.pop(random.randrange(len(this_section)))

        new_sections.append(this_section)
    print([len(i) for i in new_sections])  # section sizes after thinning
    output = [inner for outer in new_sections for inner in outer]
    x_t, y_t = zip(*output)

    return list(x_t), np.array(y_t)
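
For anyone who prefers index arrays over Python lists, the same thinning idea can be sketched with np.digitize and np.bincount. This is an illustrative variant, not a drop-in replacement: the name even_out_vectorized and the seed parameter are my own, it skips the outlier-rejection step, and it assumes x and y can be converted to NumPy arrays of equal length. The example data is made up.

```python
import numpy as np


def even_out_vectorized(x, y, n_sections, reduction=0.5, seed=None):
    """Thin over-full sections of y while keeping x and y aligned (sketch)."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)

    edges = np.linspace(y.min(), y.max(), n_sections + 1)
    # Use interior edges only, so section ids run from 0 to n_sections - 1
    section = np.digitize(y, edges[1:-1])

    counts = np.bincount(section, minlength=n_sections)
    smallest = counts[counts > 0].min()  # size of the smallest non-empty section

    keep = []
    for s in np.flatnonzero(counts):
        idx = np.flatnonzero(section == s)
        # Drop a fraction `reduction` of the excess over the smallest section
        n_remove = int((len(idx) - smallest) * reduction)
        keep.append(rng.choice(idx, size=len(idx) - n_remove, replace=False))

    keep = np.sort(np.concatenate(keep))  # preserve the original sample order
    return x[keep], y[keep]


# Hypothetical usage: x holds the original indices so alignment is easy to check
data_rng = np.random.default_rng(0)
y = np.clip(data_rng.normal(0.5, 0.1, size=2000), 0.0, 1.0)
x = np.arange(len(y))
x_even, y_even = even_out_vectorized(x, y, n_sections=10, reduction=1.0, seed=0)
print(len(y), "->", len(y_even))
```

With reduction=1.0 every non-empty section is cut down to the size of the smallest one; smaller values remove only part of the excess, as in the function above.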