I have a directory containing more than 18,000 .txt files. Most of them are emails, so they mostly follow this format:
(Some text)
Subject: Re: Relevant text
(More text)
From each .txt file, I need to extract the "Relevant text".
My best attempt so far is:
re.findall(r"(Subject:[^.]*\n\n\n?)",text)
The output for 3 sample files is:
['Subject: Re: DMORPH\n\nIn article <> (Armstrong Jay N) writes:\n>Can someone please tell me where I can ftp DTA or DMORPH?\n\n']
['Subject: Alias phone number wanted\n\n']
['Subject: Re: The 1994 Mustang\n\n']
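The first output above runs past the Subject line because `[^.]*` excludes only a literal period, so it also matches newlines and keeps going until the last blank line before the first period in the file. The sketch below reproduces that on a made-up sample (the `From:` line and the final sentence of `sample` are assumptions added for illustration) and compares it with `[^\n]*`, which cannot cross a line break.

```python
import re

# Sample modelled on the first output above; the From: line and the
# closing sentence are invented for illustration.
sample = (
    "From: someone\n"
    "Subject: Re: DMORPH\n"
    "\n"
    "In article <> (Armstrong Jay N) writes:\n"
    ">Can someone please tell me where I can ftp DTA or DMORPH?\n"
    "\n"
    "Try the usual archive. Good luck.\n"
)

# Original pattern: [^.]* matches newlines too, so the match extends
# to the last blank line that comes before the first period.
greedy = re.findall(r"(Subject:[^.]*\n\n\n?)", sample)

# Variant: [^\n]* stops at the end of the Subject line.
single_line = re.findall(r"Subject:[^\n]*", sample)

print(greedy)
print(single_line)
```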
Try:
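import os
import re

relevant_texts = {}
textfilesdir = 'path/to/txt/files'  # placeholder: enter your text file dir here

for file in os.listdir(textfilesdir):
    # splitext returns (root, ext), so compare only the extension
    if os.path.splitext(file.lower())[1] == '.txt':
        with open(os.path.join(textfilesdir, file)) as f:
            # '.' does not match a newline, so this stays on the Subject line
            subject = re.findall('[sS]ubject:.+\n+', f.read())
        if len(subject):
            # drop the 'Subject:' label and a leading 'Re:' from the first match
            relevant_texts[file] = re.sub('[sS]ubject:[ ]*(Re:)*', '', subject[0].strip()).strip()
        else:
            relevant_texts[file] = 'SUBJECT NOT FOUND !!!'
            print('relevant text not found in %s!!!' % file)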
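The per-file extraction can also be pulled into a small function so it can be checked on a few strings before running it over the whole directory. This is only a sketch: `extract_subject`, `SUBJECT_RE` and `REPLY_PREFIX_RE` are hypothetical names, and it assumes the subject sits on a single line that starts with `Subject:` in any letter case.

```python
import re

# Hypothetical helper names; not part of the answer above.
# MULTILINE lets ^ and $ anchor to individual lines of the file.
SUBJECT_RE = re.compile(r"^subject:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)
# One or more leading "Re:" markers, in any letter case.
REPLY_PREFIX_RE = re.compile(r"^(?:re:\s*)+", re.IGNORECASE)


def extract_subject(text):
    """Return the first Subject line without 'Re:' prefixes, or None."""
    match = SUBJECT_RE.search(text)
    if match is None:
        return None
    # strip() also removes a trailing '\r' from files with CRLF line endings
    return REPLY_PREFIX_RE.sub("", match.group(1)).strip()


print(extract_subject("From: a\nSubject: Re: The 1994 Mustang\n\nbody"))
```

Returning `None` instead of a sentinel string leaves it to the caller to decide how to record files that have no Subject line.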