Regex to match multiple lines

护肤品

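I am currently trying to do some basic cleanup of a PDF so that I can convert it to ePub for use on my e-reader. All I am doing is removing page numbers (easy) and footnotes (so far, hard). Basically, I want an expression that finds the marker pattern at the start of each footnote ( <br> followed by a newline, a number, and either a letter or a quotation mark), then selects the pattern and everything after it until it reaches the <hr/> tag at the start of the next page. Here is some sample text: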

The phantoms, for so they then seemed, were flitting on the other side of <br>
the deck, and, with a noiseless celerity, were casting loose the tackles and bands <br>
of the boat which swung there. This boat had always been deemed one of the spare boats <br>
technically called the captain’s, on account of its hanging from the starboard quarter.<br>
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>

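Since all the footnotes are formatted this way, I want to select every group of lines that begins with  <br> (note the space) and ends with the <hr/> tag. This is my first real attempt at using regex, so I have tried a few solutions: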

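1. \s \n\d+\s[a-zA-Z“].*
   This correctly selects the first line of the footnote, but stops at the line break. \s \n\d+\s[a-zA-Z“].*\n.*\n.*\n.*\n.*\n.* selects the right number of lines, but obviously only works for footnotes with exactly three lines of text.
2. \s \n\d+\s[a-zA-Z“]((.*\n)*)<hr\/>
   This starts in the right place at the first footnote, but ends up selecting the rest of the entire document. My reading of this expression is "start with  <br>, a number, followed by a space, followed by a letter or quotation mark, then select everything, including newlines, until reaching <hr/>."
3. \s \n\d+\s[a-zA-Z“]((?:.*\r?\n?)*)<hr\/>\n
   Same idea as (2), with the same result, although I am not familiar enough with regex to fully understand what this one is doing.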

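Basically, my problem is that my expressions either exclude newlines (and ignore the closing pattern), or include every newline and return the entire text (while apparently still ignoring the closing pattern).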

How do I get it to return only the text between the patterns, including the newlines?

罗德米拉

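Your attempts are very close. In the first one, you probably need to set the flag that allows . to match newlines; by default it does not. Second, you need to make every .* non-greedy by appending ?, giving .*?. Otherwise .* tries to match the rest of the entire text.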

It would be something like this: /^ \n\d+\s[a-zA-Z"“](.*?\n)*?<hr\/>/
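
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The three-page sample text and all variable names are made up for illustration, and the pattern is a simplified version of the one above. Both patterns use the s flag so that . matches newlines; the only difference is .* versus .*?.

```perl
use strict;
use warnings;

# Hypothetical sample: two pages that each end with a footnote and an <hr/>.
my $text = join "\n",
    'Body of page one. <br>',
    ' <br>',
    '1 First footnote. <br>',
    '<hr/>',
    'Body of page two. <br>',
    ' <br>',
    '2 Second footnote. <br>',
    '<hr/>',
    'Body of page three.';

# Greedy: .* takes as much as it can, so the match runs to the LAST <hr/>.
my ($greedy) = $text =~ m{(^ <br>\n\d+ .*<hr/>)}ms;

# Non-greedy: .*? stops at the FIRST <hr/> after the marker line.
my ($lazy) = $text =~ m{(^ <br>\n\d+ .*?<hr/>)}ms;

my @greedy_lines = split /\n/, $greedy;
my @lazy_lines   = split /\n/, $lazy;

print "greedy match spans ", scalar(@greedy_lines), " lines\n";   # 7
print "lazy match spans ",   scalar(@lazy_lines),   " lines\n";   # 3
```

The greedy match swallows the body of page two along with both footnotes, which is the "selects the rest of the document" behaviour described in the question; the non-greedy one covers only the first footnote.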

But in any case, this is something best done in Perl. Perl is where much of the advanced regex functionality comes from.

use strict;
use diagnostics;

our $text =<<EOF;
The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
 <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>
More text.
EOF

our $regex = qr{^ <br>\n\d+ +[A-Z"“].*?<hr/>}ism;
$text =~ s/($regex)/<!-- Removed -->/;
print "Removed text:\n[$1]\n\n";
print "New text:\n[$text]\n";

That prints:

Removed text:
[ <br>
1 "Hardly" had they pulled out from under the ship’s lee, when a <br>
fourth keel, coming from the windward side, pulled round under the stern, <br>
and showed the five strangers <br>
127 <br>
<br>
<hr/>]

New text:
[The figure that now stood by its bows was tall and swart, with one white tooth <br>
evilly protruding from its steel-like lips. <br>
<!-- Removed -->
More text.
]

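The qr operator builds a regex so that it can be stored in a variable. The ^ at the beginning anchors the match to the start of a line. The ism at the end stands for case insensitive, single string, multiline. s allows . to match newlines. m allows ^ to match at the start of lines embedded within the string. To replace every footnote rather than just the first, add a g flag at the end of the substitution for a global replace: s///g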
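
As a rough sketch of the global replace, the following removes every footnote block in one pass. The sample text and variable names are hypothetical, and the character class uses only the ASCII double quote to sidestep encoding questions. The trailing \n in the pattern is an addition here so that no empty line is left behind.

```perl
use strict;
use warnings;

# Hypothetical sample: each page ends with a footnote, a page number, and <hr/>.
my $text = join "\n",
    'Body of page one. <br>',
    ' <br>',
    '1 "First" footnote. <br>',
    '12 <br>',
    '<hr/>',
    'Body of page two. <br>',
    ' <br>',
    '2 Second footnote. <br>',
    '13 <br>',
    '<hr/>',
    'Body of page three.';

# i = case insensitive, s = . matches newlines, m = ^ matches at each line start.
my $regex = qr{^ <br>\n\d+ +[A-Z"].*?<hr/>\n}ism;

# With /g, s/// replaces every match and returns the number of replacements.
my $clean = $text;
my $count = ($clean =~ s/$regex//g);

print "Removed $count footnote blocks\n";
print "$clean\n";
```

With this input the substitution removes two blocks and leaves only the three body lines.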

The Perl regex documentation explains everything: https://perldoc.perl.org/perlretut

See also the question "Extended expressions in a multiline replace in Perl not working".

HTH

Edited on 2021-08-17
