Can't scrape some scattered text

Posted by SMTH in Dev

潜行

I'm trying to parse "I wish to get all this" from the HTML elements below using a regular expression, but my current attempt gets me "<DIV>I wish</DIV> to get all this" instead.

This is what I tried:

import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
matches = re.findall(r">(.*)<", itemtxt)
print(' '.join(matches))

How can I parse "I wish to get all this" from the HTML elements above using a regular expression?

Tim Biegeleisen

Use the Beautiful Soup library for this. You should not use regular expressions. OK, you actually can use a regex here, but Soup is the better choice.
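To illustrate the parser-based route recommended above, here is a minimal sketch. Beautiful Soup is a third-party package, so this example swaps in the standard library's html.parser module to show the same idea: let a parser find the text nodes instead of a pattern. The class name TextCollector and the variable names are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text nodes (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, including the
        # whitespace-only runs used for indentation, which are skipped.
        text = data.strip()
        if text:
            self.chunks.append(text)


itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""

collector = TextCollector()
collector.feed(itemtxt)
collector.close()

result = ' '.join(collector.chunks)
print(result)  # I wish to get all this
```

Because the parser tracks tags itself, nesting depth does not matter here. If a regex is still preferred, the pattern that follows handles this particular input.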

import re

itemtxt = """
<TABLE>
    <TR>
        <TD><DIV>I wish</DIV></TD>
        <TD>to</TD>
        <TD>get all this</TD>
    </TR>
</TABLE>
"""
matches = re.findall(r'<[^>]+>((?!<[^>]+>).*?)</[^>]+>', itemtxt)
print(' '.join(matches))

This prints:

I wish to get all this

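In the case of nested HTML tags, the regex pattern uses a negative lookahead, in the style of a tempered dot, to match only the innermost content. Here is a brief explanation: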

<[^>]+>           match an opening HTML tag
((?!<[^>]+>).*?)  match any content without crossing over
                  another opening HTML tag, before reaching
</[^>]+>          a closing HTML tag

Then, we join the matched pieces together with spaces.
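One caveat worth sketching: as written, the lookahead in the pattern above is tested only once, at the position right after the opening tag. That is enough for the sample input, where the nested <DIV> immediately follows its <TD>. A full tempered dot repeats the lookahead before every character. The comparison below is an illustration under that assumption; the sample string and variable names are hypothetical and not part of the original question.

```python
import re

# Pattern from the answer: the lookahead is checked once, before the content.
loose = r'<[^>]+>((?!<[^>]+>).*?)</[^>]+>'

# Tempered-dot variant: the lookahead is checked before every character,
# so the captured content can never contain the start of a tag.
strict = r'<[^>]+>((?:(?!<[^>]+>).)*?)</[^>]+>'

# Hypothetical input where text comes before the nested tag.
sample = '<TD>to <B>get</B></TD>'

loose_result = re.findall(loose, sample)
strict_result = re.findall(strict, sample)

print(loose_result)   # ['to <B>get']
print(strict_result)  # ['get']
```

Neither variant recovers the leading "to " cleanly in this mixed-content case, which is one more reason to prefer an HTML parser over a regex once the markup gets less regular.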
