Regex pattern not working in Python script

Aubrey

I need to find a specific word in the HTML of a list of pages. I'm using regex instead of BeautifulSoup because I often find it easier.

The code is:

import re
import requests

links = ['http://www-01.sil.org/iso639-3/documentation.asp?id=alr','http://www-01.sil.org/iso639-3/documentation.asp?id=ami', ...]
for link in links:
    d = requests.get(link)
    p = re.compile(r'<td valign=\"top\">Name:<\/td>\n\t+<td>\n\t+(\w+)\n\t+<\/td>')
    lang = re.search(p, d.text)

This is a snippet of d.text:

<div id="main">
<h1>Documentation for ISO 639 identifier: bnn</h1>
<hr style="margin-bottom: 6pt">

        <table>
            <tr>
                <td valign="top">Identifier:</td>
                <td>bnn</td>
            </tr>

                <tr>
                    <td valign="top">Name:</td>
                    <td>
                    Bunun
                    </td>
                </tr>

            <tr>
                <td valign="top">Status:</td>
                <td>Active</td>
            </tr>

I don't know why, but lang is None. I checked my regex pattern on regex101 and in Sublime. I printed d.text, and the HTML looks normal: if I paste d.text into Sublime and search for the same pattern, it matches.
I don't understand why the pattern fails in the script but works everywhere else. I'm using Python 3. I must be doing something silly, but I can't see what.

AndreyS Scherbakov

Be very careful with '\n'. Lines may end with '\n' (Unix/Linux style), with '\r' (classic Mac OS style), or with '\r\n' (Windows style). In your case the expression is easy to fix: accept [\n\r]+ in place of \n, and it works fine with your example links:

p = re.compile(r'<td valign="top">Name:</td>[\n\r]+\t+<td>[\n\r]+\t+(\w+)[\n\r]+\t+</td>')
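
To see the difference without hitting the site, here is a minimal sketch. The sample string is hypothetical: it imitates the page's markup and assumes the server sends Windows-style '\r\n' line endings, which is my guess at what d.text contains. Printing repr(d.text) instead of d.text makes such hidden '\r' characters visible.

```python
import re

# Hypothetical sample imitating the page markup, with assumed CRLF line endings.
sample = '<td valign="top">Name:</td>\r\n\t\t<td>\r\n\t\tBunun\r\n\t\t</td>'

# Original pattern: expects a bare '\n' right after each tag.
strict = re.compile(r'<td valign="top">Name:</td>\n\t+<td>\n\t+(\w+)\n\t+</td>')
# Corrected pattern: accepts any run of '\n' and '\r' characters.
tolerant = re.compile(r'<td valign="top">Name:</td>[\n\r]+\t+<td>[\n\r]+\t+(\w+)[\n\r]+\t+</td>')

strict_match = strict.search(sample)
tolerant_match = tolerant.search(sample)

print(repr(sample))             # repr() shows the '\r' that print() hides
print(strict_match)             # no match: '\r' sits where '\n' is expected
print(tolerant_match.group(1))  # the captured language name
```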

However, I strongly advise against relying on the exact whitespace layout of a document. What if the site changes it? Such a change would not even be visible on the rendered page. I believe it's better to allow any whitespace, like this:

p = re.compile(r'<td valign="top">Name:</td>\s*<td>\s*(\w+)\s*</td>')
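
As a rough check of the \s* version, the sketch below runs it against a few layouts. The helper name extract_name and the three sample strings are made up for illustration; they are not taken from the real pages.

```python
import re

# Whitespace-agnostic pattern from the answer above.
NAME_PATTERN = re.compile(r'<td valign="top">Name:</td>\s*<td>\s*(\w+)\s*</td>')

def extract_name(html):
    """Return the captured name, or None if the pattern does not match."""
    match = NAME_PATTERN.search(html)
    return match.group(1) if match else None

# Hypothetical layouts: LF with tabs, CRLF with spaces, and no whitespace at all.
variants = [
    '<td valign="top">Name:</td>\n\t\t<td>\n\t\tBunun\n\t\t</td>',
    '<td valign="top">Name:</td>\r\n    <td>\r\n    Bunun\r\n    </td>',
    '<td valign="top">Name:</td><td>Bunun</td>',
]
results = [extract_name(v) for v in variants]
print(results)
```

Note that (\w+) captures a single word only, so a name made of several words would still not match and extract_name would return None for it.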

It should also be noted that the valign attribute is obsolete in HTML5 (CSS should be used instead), so it may disappear from these documents in the future.
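
If the attribute does go away, a pattern that hard-codes valign="top" would stop matching. One possible way to guard against that, sketched here as an assumption rather than something tested against the real site, is to accept any attributes on the td tags with [^>]*. The sample strings are hypothetical.

```python
import re

# Sketch: allow arbitrary (or no) attributes on both <td> tags.
p = re.compile(r'<td[^>]*>\s*Name:\s*</td>\s*<td[^>]*>\s*(\w+)\s*</td>')

# Hypothetical variants of the markup: current form, CSS-styled, and bare tags.
samples = [
    '<td valign="top">Name:</td>\n<td>\nBunun\n</td>',
    '<td style="vertical-align: top">Name:</td><td>Bunun</td>',
    '<td>Name:</td> <td>Bunun</td>',
]
names = [p.search(s).group(1) for s in samples]
print(names)
```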
