Remove embedded CRLFs from rows that end with CRLF

Posted by binway in Dev

binway

I have been looking for some Python code that counts the delimiters in a record, but I can't find any examples.

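I have a pipe-delimited text file that uses double quotes as the text qualifier and CRLF to mark the end of each row. As usual, some columns contain a CRLF inside the text, which breaks the output format.

"Start of record"|""|"SomeText"|"More content with CRLF and then more text"|"May even contain CRLF"|""CRLF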

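At the moment I have the file open in Notepad++ and am manually using the regular expression (?<!")\r\n to find every CRLF that is not preceded by a double quote. Since I have several large files to fix, I would like Python to go to the start of a record, count 5 pipes, and remove any CRLF within that count, but I only have very basic Python knowledge. I have some basic Python code that finds and replaces a few characters, but I don't think it is enough for what is needed.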

replacement = {'","': '"|"'}
lines = []
# raw strings keep the backslashes in the Windows paths literal
with open(r'C:\OriginalRplPipe.txt') as infile:
    for line in infile:
        for src, target in replacement.items():
            line = line.replace(src, target)
        lines.append(line)
with open(r'C:\PipeDel.txt', 'w') as outfile:
    for line in lines:
        outfile.write(line)
print("Finished")
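The "count 5 pipes" idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: every record has exactly 5 pipe delimiters, no field contains a literal pipe, and the last field does not itself contain a CRLF. The function name `join_split_records` and its parameter are hypothetical, not part of the original post.

```python
def join_split_records(data: bytes, pipes_per_record: int = 5) -> bytes:
    """Merge physical lines until a full record (5 pipes) has been collected."""
    out = []
    buffer = b''
    for line in data.splitlines(keepends=True):
        buffer += line
        # Assumption: a record is complete once it holds 5 pipes and the
        # text before the line break ends with a closing double quote.
        complete = (buffer.count(b'|') >= pipes_per_record
                    and buffer.rstrip(b'\r\n').endswith(b'"'))
        if complete:
            # Drop every CR/LF in the record, then restore the row-ending CRLF.
            out.append(buffer.replace(b'\r', b'').replace(b'\n', b'') + b'\r\n')
            buffer = b''
    # Any incomplete trailing data is passed through unchanged.
    return b''.join(out) + buffer


sample = (b'"Start"|""|"SomeText"|"More\r\ntext"|"Even\r\nmore"|""\r\n'
          b'"A"|"B"|"C"|"D"|"E"|"F"\r\n')
cleaned = join_split_records(sample)
```

Working on bytes (for example via `Path.read_bytes()`) avoids Python's newline translation, so the CRLFs stay visible to the code.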

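user9611000

In the meantime I have managed to eliminate the flaws in my first answer. Below is new code that should do what you want. It should work regardless of the number and position of CRLFs inside the record fields.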

from pathlib import Path
import re

regex_lin = rb'(".*?"\|".*?"\|".*?"\|".*?"\|".*?"\|".*?"\r\n)' # one logical row: six quoted fields, then CRLF
reo_lin = re.compile(regex_lin, re.DOTALL)
regex_rec = rb'".*?"'                                          # one quoted field within a row
reo_rec = re.compile(regex_rec, re.DOTALL)

in_file = Path('input.txt')
out_file = Path('output.txt')

old_content = in_file.read_bytes()                             # read as binary file although it is a text file!

lines = reo_lin.findall(old_content)
new_lines = []
for line in lines:
    old_records = reo_rec.findall(line)
    # strip CR and LF inside each field
    new_records = [record.replace(b'\r', b'').replace(b'\n', b'') for record in old_records]
    # join with pipes so that no trailing pipe is added before the CRLF
    new_lines.append(b'|'.join(new_records) + b'\r\n')
new_content = b''.join(new_lines)

out_file.write_bytes(new_content)                              # write as binary file although it is a text file!
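As an alternative to hand-written regular expressions, Python's standard `csv` module can parse quoted fields that span several physical lines. The following is a minimal sketch, not the answerer's method. It assumes that any double quote inside a field is doubled (the usual CSV convention), which the question does not confirm. The helper name `clean_rows` is made up for illustration.

```python
import csv
import io


def clean_rows(text: str) -> str:
    """Parse pipe-delimited, quoted text and strip CR/LF inside fields."""
    # newline='' keeps the CRLFs untouched so the csv module sees them as-is
    reader = csv.reader(io.StringIO(text, newline=''),
                        delimiter='|', quotechar='"')
    out = io.StringIO(newline='')
    writer = csv.writer(out, delimiter='|', quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator='\r\n')
    for row in reader:
        writer.writerow([field.replace('\r', '').replace('\n', '')
                         for field in row])
    return out.getvalue()


sample = '"Start"|""|"SomeText"|"More\r\ntext"|"Even\r\nmore"|""\r\n'
result = clean_rows(sample)
```

For real files, the same reader and writer can be attached to files opened with `open(path, newline='')`, which lets large inputs be processed row by row instead of being read into memory at once.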
