假设我有一个df
带标头的数据框a, b, c, d
。
我想与其他dfs (df1, df2, df3, ...)
列名称进行比较。我需要所有dfs
的列名都应与完全相同df
(请注意,列名的不同顺序不应视为不同的列名)。
例如:
原始数据框:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
col = ['a', 'b', 'c']
dfs:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'c', 'b'])
退货identical columns name
;
df2 = pd.DataFrame(np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]]),
columns=['a', 'c', 'e', 'b'])
退货extra columns in dataframe
;
df3 = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8]]),
columns=['a', 'c'])
退货missing columns in dataframe
;
df4 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', '*c', 'b'])
退货errors in dataframe's column names
;
df5 = pd.DataFrame(np.array([[1, 2, 3, 9], [4, 5, 6, 9], [7, 8, 9, 10]]),
columns=['a', 'b', 'b', 'c'])
返回extra columns in dataframe
。
如果过于复杂,则返回columns names are incorrect
各种错误也可以。
我该如何在熊猫中做到这一点?谢谢。
我认为这里设置是个不错的选择,因为顺序并不重要:
def compare(df, df1):
orig = set(df.columns)
c = set(df1.columns)
#testing if length of set is same like length of columns names
if len(c) != len(df1.columns):
return ('extra columns in dataframe')
#if same sets
elif (c == orig):
return ('identical columns name')
#compared subsets
elif c.issubset(orig):
return ('missing columns in dataframe')
#compared subsets
elif orig.issubset(c):
return ('extra columns in dataframe')
else:
return ('columns names are incorrect')
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns in dataframe
missing columns in dataframe
columns names are incorrect
extra columns in dataframe
对于返回值:
def compare(df, df1):
orig = set(df.columns)
c = set(df1.columns)
#testing if length of set is same like length of columns names
if len(c) != len(df1.columns):
col = df1.columns.tolist()
a = set([str(x) for x in col if col.count(x) > 1])
return f'duplicated columns: {", ".join(a)}'
#if same sets
elif (c == orig):
return ('identical columns name')
#compared subsets
elif c.issubset(orig):
a = (str(x) for x in orig - c)
return f'missing columns: {", ".join(a)}'
#compared subsets
elif orig.issubset(c):
a = (str(x) for x in c - orig)
return f'extra columns: {", ".join(a)}'
else:
a = (str(x) for x in c - orig)
return f'incorrect: {", ".join(a)}'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns: e
missing columns: b
incorrect: *c
duplicated columns: b
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句