我正在尝试读取一个大的日志文件,该文件已使用不同的分隔符(旧版问题)进行了解析。
码
for root, dirs, files in os.walk('.', topdown=True):
for file in files:
df = pd.read_csv(file, sep='\n', header=None, skipinitialspace=True)
df = df[0].str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
df.email = df.email.str.lower()
print(df)
输入文件
[email protected] address1
[email protected] address2
[email protected],address3
[email protected];;addre'ss4
[email protected],,address"5
[email protected],,address;6
single.col1;
single.col2 [spaces at the beginning of the row]
single.col3 [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22
问题
希望有帮助
df = pd.read_csv(file, sep='\n', header=None)
#remove leading/trailing whitespace and split into columns
df = df[0].str.strip().str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
#drop rows with non-ASCII (<32 or >255, you can adopt the second to your needs)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]
#drop rows with invalid email addresses
email_re = "^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句