BeautifulSoup: Extract the text that is not in a given tag

mel

I have the following variable, header, which is equal to:

<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>

I want to extract only the date, February 11, 2017, from this variable. How can I do it using BeautifulSoup in Python?

Josh Crozier

If you know that the date is always the last text node in the header variable, then you could access the .contents property and get the last element in the returned list:

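from bs4 import BeautifulSoup

# The markup from the question
html = '''<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>'''

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# The last child of <p> is the text node holding the date
header.contents[-1].strip()
> February 11, 2017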
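
To see why the last element is the date, it can help to list the children of the paragraph first. The following is a minimal sketch, assuming html holds the markup from the question; the isinstance check is just one way to tell text nodes from tags:

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: html holds the markup from the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

header = BeautifulSoup(html, 'html.parser').find('p')

# Each direct child of <p> is either a text node or a tag.
for child in header.contents:
    kind = 'text' if isinstance(child, NavigableString) else 'tag'
    print(kind, repr(str(child)))

# The date is the final text node; strip() removes the leading newline.
last_text = header.contents[-1].strip()
print(last_text)
```

The listing also shows that the text nodes keep the line breaks from the source markup, which is why the strip() call is needed.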

Or, as MYGz pointed out in the comments, you could split the text at newlines and retrieve the last element in the list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.text.split('\n')[-1]
> February 11, 2017
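
Splitting on '\n' depends on the line breaks in the source markup rather than on the <br/> tags. If the same markup might arrive on a single line, a sketch like the following, which uses the stripped_strings generator, may be more robust (the single-line html value here is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Assumption: the question's markup, but delivered on a single line.
html = '<p>Andrew Anglin<br/><strong>Daily Stormer</strong><br/>February 11, 2017</p>'

header = BeautifulSoup(html, 'html.parser').find('p')

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that contain only whitespace.
strings = list(header.stripped_strings)
last_string = strings[-1]
print(last_string)
```

Like the split approach, this also returns text inside child tags such as <strong>, so it only helps when the date is known to come last.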

If you don't know the position of the date text node, then another option would be to use a regular expression to pull out any strings that look like dates:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]
> February 11, 2017
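
Indexing the result of re.findall with [0] raises an IndexError when nothing matches. If that is a concern, one option is to wrap the search in a small helper; find_date below is a hypothetical name, not part of BeautifulSoup, and would be called with header.text:

```python
import re

# Same pattern as above: a word, a 1-2 digit day, a comma, a 4-digit year.
DATE_PATTERN = re.compile(r'\w+ \d{1,2}, \d{4}')


def find_date(text):
    """Return the first date-like substring in text, or None if there is none."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


# The first string mirrors what header.text holds for the question's markup.
found = find_date('Andrew Anglin\nDaily Stormer\nFebruary 11, 2017')
missing = find_date('Andrew Anglin\nDaily Stormer')
print(found, missing)
```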

However, as your title implies, if you only want the text nodes that aren't wrapped in an element, then you could use the following, which filters out the child elements and any whitespace-only text nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# Keep direct children that are non-empty text nodes (tags have a name)
text_nodes = [e.strip() for e in header if not e.name and e.strip()]

Keep in mind that this returns the following, since the first text node (the author name) isn't wrapped either:

> ['Andrew Anglin', 'February 11, 2017']
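
A similar result can be had from find_all itself. The sketch below assumes a BeautifulSoup release that accepts the string argument (older releases call it text) and again uses the markup from the question:

```python
from bs4 import BeautifulSoup

# Assumption: html holds the markup from the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

header = BeautifulSoup(html, 'html.parser').find('p')

# string=True matches text nodes only; recursive=False restricts the
# search to direct children, so the text inside <strong> is not returned.
direct_strings = header.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and trim the rest.
text_nodes = [s.strip() for s in direct_strings if s.strip()]
print(text_nodes)
```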

Of course, you could also combine the last two options and parse out the date strings in the returned text nodes:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017
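
The \w+ in the pattern accepts any word in place of the month, so a string such as Issue 11, 2017 would also match. If that matters, the candidate can be validated with the standard library. This is only a sketch: parse_date is a hypothetical helper, and the %B directive assumes English month names in the current locale:

```python
from datetime import datetime


def parse_date(candidate):
    """Return a date for strings like 'February 11, 2017', else None."""
    try:
        return datetime.strptime(candidate.strip(), '%B %d, %Y').date()
    except ValueError:
        # Not a real month name or not a valid calendar date.
        return None


parsed = parse_date('February 11, 2017')
rejected = parse_date('Issue 11, 2017')
print(parsed, rejected)
```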

