BeautifulSoup: Extract the text that is not in a given tag

Mel Published at Dev


I have a variable, header, with the following value:

<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>

I want to extract only the date, February 11, 2017, from this variable. How can I do that using BeautifulSoup in Python?

Josh Crozier

If you know that the date is always the last text node in the header variable, then you could access the .contents property and get the last element in the returned list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.contents[-1].strip()
> February 11, 2017
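For reference, here is a self-contained sketch of the same idea. It assumes the markup from the question is held in a string named html (the snippets in this answer use that name without defining it). Printing the child list shows why the last element is the date:

```python
from bs4 import BeautifulSoup

# Assumption: the markup from the question, stored in a string named html.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# .contents is a list of the direct children (text nodes and tags) in document order.
for child in header.contents:
    print(repr(child))

# The date is the trailing text node; strip() removes the leading newline.
date_text = header.contents[-1].strip()
print(date_text)
```

If the paragraph could end with a tag rather than text, contents[-1] would be a Tag and calling strip() on it would not work, so this approach only fits markup with a known layout.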

Or, as MYGz pointed out in the comments, you could split the text at newlines and take the last element of the resulting list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.text.split('\n')[-1]
> February 11, 2017
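The split above depends on the source markup containing a newline before the date. As a possible alternative, stripped_strings yields every string inside the tag with surrounding whitespace trimmed, so it does not depend on how the HTML was formatted. The single-line html value below is an assumption, chosen to show that case:

```python
from bs4 import BeautifulSoup

# Assumption: same content as the question, but with no newlines in the markup.
html = '<p>Andrew Anglin<br/><strong>Daily Stormer</strong><br/>February 11, 2017</p>'

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# stripped_strings walks all descendant strings, trims whitespace,
# and skips strings that are empty after trimming.
strings = list(header.stripped_strings)
last_string = strings[-1]
print(last_string)
```

Note that this includes text inside nested tags such as strong, so it still relies on the date being the last piece of text.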

If you don't know the position of the date text node, another option is to use a regular expression to pull out any strings that look like a date:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]
> February 11, 2017
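The pattern above only checks the shape of the text, so something like "Foo 99, 2017" would also match. One way to tighten this, sketched here as an assumption rather than part of the original answer, is to pass the match to datetime.strptime and keep it only if it parses:

```python
import re
from datetime import datetime
from bs4 import BeautifulSoup

# Assumption: the markup from the question, stored in a string named html.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

header = BeautifulSoup(html, 'html.parser').find('p')

match = re.search(r'\w+ \d{1,2}, \d{4}', header.text)
parsed = None
if match:
    try:
        # %B matches the full month name in the current locale (English assumed here).
        parsed = datetime.strptime(match.group(0), '%B %d, %Y').date()
    except ValueError:
        parsed = None  # matched the pattern but is not a real date
print(parsed)
```

Using re.search avoids the IndexError that indexing [0] on an empty findall result would raise when no date is present.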

However, as your title implies, if you only want the text nodes that aren't wrapped in an element, you could use the following, which filters out the tags:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

text_nodes = [e.strip() for e in header if not e.name and e.strip()]

Keep in mind that this returns the following, since the first text node isn't wrapped in an element either:

> ['Andrew Anglin', 'February 11, 2017']
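The same filtering can also be written with find_all, shown here as a sketch of an alternative rather than the answer's own method. Passing string=True selects text nodes, and recursive=False limits the search to direct children, so text inside strong is skipped. Older Beautiful Soup releases call the string argument text, so treat the keyword as version-dependent:

```python
from bs4 import BeautifulSoup

# Assumption: the markup from the question, stored in a string named html.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# Direct-child text nodes only; whitespace-only nodes are dropped after stripping.
direct_strings = header.find_all(string=True, recursive=False)
text_nodes = [s.strip() for s in direct_strings if s.strip()]
print(text_nodes)
```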

Of course, you could also combine the last two options and look for date strings only in the unwrapped text nodes:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017
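If this is needed in more than one place, the loop could be wrapped in a small helper. The function name extract_date and the choice to return None when nothing matches are assumptions made for illustration. It uses isinstance with NavigableString to pick out text nodes and re.fullmatch in place of the ^ and $ anchors:

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Same date shape as above: a word, a 1-2 digit day, a comma, a 4-digit year.
DATE_RE = re.compile(r'\w+ \d{1,2}, \d{4}')

def extract_date(html):
    """Return the first unwrapped text node in the first <p> that is a date, else None."""
    header = BeautifulSoup(html, 'html.parser').find('p')
    if header is None:
        return None
    for node in header.children:
        # Only consider direct text nodes, not tags such as <strong> or <br/>.
        if isinstance(node, NavigableString):
            text = node.strip()
            if DATE_RE.fullmatch(text):
                return text
    return None

sample = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""
print(extract_date(sample))
```

Returning None instead of raising lets the caller decide how to handle paragraphs that have no date.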


Edited at 2020-10-25
