How to write a regex when the string contains a comma (,), spaces, and other characters like (

itsmnthn

I have this string:

<td class="cspan">Proximates</td>\n\t<td style="text-align:left">Total lipid (fat)\n\t\t\n\t\t\n\t\t</td>\n\t\t\n\t\t<td>g</td>\n\t\t\n\t\t\t<td style="text-align:right;">78.30</td>

and I need a regex for it. I have tried many patterns, such as this one:

Total lipid\(fat\)\\n\\t\\t\\n\\t\\t\\n\\t\\t\<\/td\>\\n\\t\\t\\n\\t\\t\<td\>g\<\/td\>\\n\\t\\t\\n\\t\\t\\t\<td style\=\"text\-align\:right\;\"\>(.*?)\<\/td\>

I also have another string:

<td style="text-align:left">Vitamin C, total ascorbic acid\n\t\t\n\t\t\n\t\t</td>\n\t\t\n\t\t<td>mg</td>\n\t\t\n\t\t\t<td style="text-align:right;">0.0</td>

and I have tried many regexes for that one as well, such as:

Vitamin C\, total ascorbic acid\\n\\t\\t\\n\\t\\t\\n\\t\\t\<\/td\>\\n\\t\\t\\n\\t\\t\<td\>mg\<\/td\>\\n\\t\\t\\n\\t\\t\\t\<td style\=\"text\-align\:right\;\"\>(.*?)\<\/td\>

and my third string is:

<td style="text-align:left">Vitamin B-12\n\t\t\n\t\t\n\t\t</td>\n\t\t\n\t\t<td>\xb5g</td>\n\t\t\n\t\t\t<td style="text-align:right;">0.07</td>

and I have tried this one and others like it:

data = re.search('Vitamin B\-12\\n\\t\\t\\n\\t\\t\\n\\t\\t\<\/td\>\\n\\t\\t\\n\\t\\t\<td\>µg\<\/td\>\\n\\t\\t\\n\\t\\t\\t\<td style\=\"text\-align\:right\;\"\>(.*?)\<\/td\>',tb)

From those strings I am trying to extract these values:

from the first string: 78.30
from the second: 0.0
from the third: 0.07

I need a regex like the ones I have written above, with only minor changes, because I know I am missing something.

Stephen Rauch

As you have discovered, XML (HTML) and regexes do not mix well. However, this problem is quite straightforward when using BeautifulSoup:

Code:

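from bs4 import BeautifulSoup
# Parse one row of HTML; the value is the text of its last <td> cell.
soup = BeautifulSoup(row, 'html.parser')
print(soup.find_all('td')[-1].text)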
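
If a regex is still required, here is a minimal sketch of a more tolerant pattern. It is not the approach recommended above, only an illustration. Two things appear to trip up the patterns in the question: the first one omits the space between "lipid" and "(fat)", and all of them hard-code the exact run of newlines and tabs. The helper name `extract_value` and the `GAP` fragment are hypothetical names introduced here. The sketch assumes the gaps between cells are either real whitespace characters or the literal two-character sequences `\n` and `\t`, and that the unit cell is a plain `<td>` with no attributes.

```python
import re

# Gap between cells: real whitespace, or the literal two-character
# sequences \n and \t (an assumption about how the text was captured).
GAP = r"(?:\s|\\[nt])*"


def extract_value(html, label):
    """Return the text of the value cell that follows `label`, or None."""
    pattern = (
        re.escape(label)                # escapes '(', ')', '-' and similar
        + GAP + r"</td>"                # end of the label cell
        + GAP + r"<td>[^<]*</td>"       # unit cell (g, mg, µg), any content
        + GAP + r"<td[^>]*>(.*?)</td>"  # value cell, captured lazily
    )
    match = re.search(pattern, html)
    return match.group(1) if match else None


row = (
    '<td style="text-align:left">Total lipid (fat)\n\t\t\n\t\t\n\t\t</td>'
    '\n\t\t\n\t\t<td>g</td>\n\t\t\n\t\t\t'
    '<td style="text-align:right;">78.30</td>'
)
print(extract_value(row, "Total lipid (fat)"))  # 78.30
```

Letting `re.escape` handle the label avoids hand-escaping commas, hyphens, and parentheses, and matching the unit cell with `[^<]*` sidesteps any encoding mismatch on the µ character.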

Test Code:

data = (
    """
    <td class="cspan">Proximates</td>
    <td style="text-align:left">Total lipid (fat)


    </td>
    <td>g</td>
        <td style="text-align:right;">78.30</td>
    """,
    """
    <td style="text-align:left">Vitamin C, total ascorbic acid


    </td>
    <td>mg</td>
    <td style="text-align:right;">0.0</td>
    """,
    """
    <td style="text-align:left">Vitamin B-12


    </td>
    <td>\xb5g</td>
    <td style="text-align:right;">0.07</td>
    """,
)


from bs4 import BeautifulSoup

for row in data:
    # Name the parser explicitly; the value is in the last <td> of each row.
    soup = BeautifulSoup(row, 'html.parser')
    print(soup.find_all('td')[-1].text)

Results: