Python Pandas data cleaning

Roger White

I am trying to read a large log file whose rows were written with several different delimiters (a legacy issue).

Code

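import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        # read each line as a single column (recent pandas releases may reject sep='\n')
        df = pd.read_csv(os.path.join(root, file), sep='\n', header=None, skipinitialspace=True)
        # split once on the first run of delimiters
        df = df[0].str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        df.email = df.email.str.lower()
        print(df)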

Input file

[email protected]         address1
[email protected]    address2
 [email protected],address3
  [email protected];;addre'ss4
[email protected],,address"5
[email protected],,address;6
single.col1;
 single.col2                 [spaces at the beginning of the row]
    single.col3              [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22

Problems

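Rows that contain any non-ASCII characters need to be removed from the DF (I mean the whole row needs to be excluded and dropped).
Rows with tabs or spaces at the beginning need to be trimmed. I have 'skipinitialspace=True', but it seems this does not remove tabs.
'df.email' needs to be checked against a regex for a valid email format. If it does not match, the whole row needs to be dropped.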

Any help would be appreciated.

Steve

# read each line as a single column (recent pandas releases may reject sep='\n';
# there, build the frame from the file's lines instead)
df = pd.read_csv(file, sep='\n', header=None)

# remove leading/trailing whitespace (spaces and tabs) and split into two columns
df = df[0].str.strip().str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})

# drop rows whose data contains characters outside the range 32-255
# (lower the upper bound to '~' for strict ASCII)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

# drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]
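
For anyone who wants to try the three steps without the log files, here is a minimal self-contained sketch. It is not the code above verbatim: it reads the lines with plain Python instead of read_csv, checks the whole line for strict 7-bit ASCII with str.isascii, and uses made-up sample rows. The names SAMPLE, EMAIL_RE and clean_lines are hypothetical and only serve the illustration.

```python
import io

import pandas as pd

# Hypothetical sample lines modeled on the input file in the question.
SAMPLE = (
    "Alice@Example.com         address1\n"
    " bob@example.com,address3\n"
    "\tcarol@example.com;;addre'ss4\n"
    "single.col1;\n"
    "nonascii.row;data.is.junk-\u0152\u0153\n"
    "not.email;address11\n"
)

EMAIL_RE = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"


def clean_lines(handle):
    """Return a DataFrame with 'email' and 'data' columns from raw log lines."""
    # One stripped line per row; lines that are empty after stripping are skipped.
    lines = pd.Series([ln.strip() for ln in handle if ln.strip()], dtype="object")
    # Drop every line that contains a character outside 7-bit ASCII.
    lines = lines[lines.map(str.isascii)]
    # Split once, on the first run of delimiter characters.
    df = lines.str.split(r"[,|;: \t]+", n=1, expand=True)
    df = df.reindex(columns=[0, 1]).rename(columns={0: "email", 1: "data"})
    df["email"] = df["email"].str.lower()
    # Keep only the rows whose email column matches the pattern.
    valid = df["email"].fillna("").str.contains(EMAIL_RE)
    return df[valid].reset_index(drop=True)


cleaned = clean_lines(io.StringIO(SAMPLE))
print(cleaned)
```

To run it on a real file, pass an open file handle instead of the StringIO object, for example clean_lines(open(path, encoding='utf-8')), where the encoding is an assumption about the log files.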

The email regex was taken from here (I only changed the parentheses to non-capturing groups). If you want to be thorough, you can also use this monster regex.
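
On the non-capturing change: as far as I know, pandas' str.contains emits a UserWarning when the pattern contains capture groups, which is one practical reason to write (?:...) instead of (...). The sketch below uses only the standard re module to show that the two forms accept the same strings and differ only in the number of groups. The capturing pattern is my reconstruction (the same regex with plain parentheses), and the test strings are made up.

```python
import re

# Capturing form: a reconstruction, i.e. the same pattern with plain parentheses.
CAPTURING = r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$"
# Non-capturing form, as used in the answer.
NON_CAPTURING = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"

# Hypothetical test strings, valid and invalid.
candidates = [
    "first.last@example.com",
    "o'neil@mail.example.org",
    "not.email",
    "not_email",
    "a@b",
]

for text in candidates:
    # Both forms accept and reject exactly the same strings.
    assert bool(re.search(CAPTURING, text)) == bool(re.search(NON_CAPTURING, text))

# The only difference is how many groups the compiled pattern records.
print(re.compile(CAPTURING).groups)
print(re.compile(NON_CAPTURING).groups)
```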
