Comparing the column names of multiple DataFrames against the original column names in Pandas

ahbon posted in Dev

ahbon

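Suppose I have a DataFrame df with the headers a, b, c, d.

I want to compare its column names with those of other DataFrames (df1, df2, df3, ...). Every one of them should have exactly the same column names as df (note that the same names in a different order should not count as different column names).

For example:

Original DataFrame: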

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

col = ['a', 'b', 'c']

DataFrames to check:

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'c', 'b'])

should return 'identical columns name';

df2 = pd.DataFrame(np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]]),
                   columns=['a', 'c', 'e', 'b'])

should return 'extra columns in dataframe';

df3 = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8]]),
                   columns=['a', 'c'])

should return 'missing columns in dataframe';

df4 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', '*c', 'b'])

should return 'errors in dataframe's column names';

df5 = pd.DataFrame(np.array([[1, 2, 3, 9], [4, 5, 6, 9], [7, 8, 9, 10]]),
                   columns=['a', 'b', 'b', 'c'])

should return 'extra columns in dataframe'.

If that is too complicated, returning 'columns names are incorrect' for every kind of error is also fine.

How can I do this in pandas? Thanks.

jezrael

I think sets are a good choice here, because the order of the names does not matter:

def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)

    # a set drops repeated names, so a shorter set means duplicated columns
    if len(c) != len(df1.columns):
        return 'extra columns in dataframe'
    # same names, in any order
    elif c == orig:
        return 'identical columns name'
    # every name is expected, but some expected names are absent
    elif c.issubset(orig):
        return 'missing columns in dataframe'
    # all expected names are present, plus others
    elif orig.issubset(c):
        return 'extra columns in dataframe'
    # some names are unexpected and some expected names are absent
    else:
        return 'columns names are incorrect'

print(compare(df, df1))                    
print(compare(df, df2))    
print(compare(df, df3))    
print(compare(df, df4))    
print(compare(df, df5))    

identical columns name
extra columns in dataframe
missing columns in dataframe
columns names are incorrect
extra columns in dataframe
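
The first branch of compare relies on two properties of sets: they ignore order, and they drop repeated labels. A minimal sketch of both properties on bare column labels follows; the variable names are illustrative only and do not come from the answer.

```python
import pandas as pd

orig = pd.Index(['a', 'b', 'c'])
reordered = pd.Index(['a', 'c', 'b'])
duplicated = pd.Index(['a', 'b', 'b', 'c'])

# Sets ignore order, so a reordered header compares equal.
same_names = set(reordered) == set(orig)

# Sets also drop repeats, so a header with duplicates shrinks when converted.
has_duplicates = len(set(duplicated)) != len(duplicated)

# Index.duplicated() marks every repeat after the first occurrence.
repeated = duplicated[duplicated.duplicated()].unique().tolist()

print(same_names, has_duplicates, repeated)
```

This prints True True ['b'], which is why df1 is reported as identical and df5 falls into the duplicate branch.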

To also return the offending column names:

def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)

    # a set drops repeated names, so a shorter set means duplicated columns
    if len(c) != len(df1.columns):
        col = df1.columns.tolist()
        a = {str(x) for x in col if col.count(x) > 1}
        return f'duplicated columns: {", ".join(a)}'
    # same names, in any order
    elif c == orig:
        return 'identical columns name'
    # every name is expected, but some expected names are absent
    elif c.issubset(orig):
        a = (str(x) for x in orig - c)
        return f'missing columns: {", ".join(a)}'
    # all expected names are present, plus others
    elif orig.issubset(c):
        a = (str(x) for x in c - orig)
        return f'extra columns: {", ".join(a)}'
    # some names are unexpected and some are absent; only the unexpected ones are listed
    else:
        a = (str(x) for x in c - orig)
        return f'incorrect: {", ".join(a)}'

print(compare(df, df1))                    
print(compare(df, df2))    
print(compare(df, df3))    
print(compare(df, df4))    
print(compare(df, df5)) 

identical columns name
extra columns: e
missing columns: b
incorrect: *c
duplicated columns: b
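
Since the question involves several DataFrames, one option is to keep them in a dict and run the check in a loop. The sketch below is a variant under stated assumptions, not the answer's code: column_report and frames are hypothetical names, and the helper reports missing, extra, and duplicated names together instead of stopping at the first matching branch.

```python
import numpy as np
import pandas as pd


def column_report(reference, other):
    """Hypothetical helper: describe how other's column names differ from reference's."""
    ref_cols = pd.Index(reference.columns)
    cols = pd.Index(other.columns)
    return {
        # expected names that other lacks
        'missing': sorted(map(str, set(ref_cols) - set(cols))),
        # names in other that the reference does not have
        'extra': sorted(map(str, set(cols) - set(ref_cols))),
        # names that occur more than once in other
        'duplicated': sorted(map(str, set(cols[cols.duplicated()]))),
    }


df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# Illustrative subset of the question's DataFrames, keyed by name.
frames = {
    'df1': pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'c', 'b']),
    'df4': pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', '*c', 'b']),
    'df5': pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'b', 'c']),
}

reports = {name: column_report(df, frame) for name, frame in frames.items()}
for name, report in reports.items():
    print(name, report)
```

With this layout df4 is reported with both the missing name c and the extra name *c, whereas the 'incorrect' branch above lists only *c. Sorting the names keeps the output stable, since set iteration order is not guaranteed.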
