How to print paragraphs and headings simultaneously while scraping in Python?

akisonlyforu

I am a beginner in Python. I am currently using BeautifulSoup to scrape a website.

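import urllib.request
import bs4 as bs

url = ''  # my_url
source = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(source, 'lxml')
match = soup.find('article', class_='xyz')
text = ''  # separate accumulator, so the built-in str is not shadowed
for paragraph in match.find_all('p'):
    text += paragraph.text + "\n"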

My tag structure:

<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>         
</article>


I am getting output like this (I am able to extract only the paragraphs):

 efkl
 efkl
 efkl
 efkl

The output I want (the headings as well as the paragraphs):

 dr
 efkl
 dr
 efkl
 dr
 efkl
 dr
 efkl     

I want my output to contain the headings along with the paragraphs. How can I modify the code so that each heading appears before its paragraph, as in the original HTML?

SIM

There is more than one way to do this. Here are a few of them:

Using .find_next():

from bs4 import BeautifulSoup

content="""
<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>         
</article>
"""
soup = BeautifulSoup(content,"lxml")

for items in soup.find_all(class_="xyz"):
    data = '\n'.join(['\n'.join([item.text,item.find_next("p").text]) for item in items.find_all("h4")])
    print(data)

Using .find_previous_sibling():

for items in soup.find_all(class_="xyz"):
    data = '\n'.join(['\n'.join([item.find_previous_sibling("h4").text,item.text]) for item in items.find_all("p")])
    print(data)
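
One thing to watch with the sibling-based lookup: .find_previous_sibling() returns None when a paragraph has no earlier <h4> sibling, and calling .text on None raises AttributeError. The following is only a sketch of one way to guard against that. The markup with a leading <p>intro</p> is a made-up example, not the asker's page, and html.parser is used here instead of lxml purely so the snippet runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has no <h4> before it.
content = """
<article class="xyz">
<p>intro</p>
<h4>dr</h4>
<p>efkl</p>
</article>
"""
soup = BeautifulSoup(content, "html.parser")

lines = []
for article in soup.find_all(class_="xyz"):
    for paragraph in article.find_all("p"):
        # None when there is no earlier <h4> sibling
        heading = paragraph.find_previous_sibling("h4")
        if heading is not None:
            lines.append(heading.text)
        lines.append(paragraph.text)

print("\n".join(lines))
```

Note that if several paragraphs follow one heading, this lookup finds the same heading for each of them, so the heading is repeated once per paragraph.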

The most common approach is to pass a list of tag names to .find_all(), which returns the matching tags in document order:

for items in soup.find_all(class_="xyz"):
    data = '\n'.join([item.text for item in items.find_all(["h4","p"])])
    print(data)
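
Because .find_all() with a list returns the tags in document order, each tag's .name can be checked to tell headings from paragraphs. As a rough sketch (the sections list and its (heading, paragraphs) layout are my own choice, not something the question asks for), the text can be grouped per heading like this. It uses html.parser rather than lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

content = """
<article class="xyz">
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
</article>
"""
soup = BeautifulSoup(content, "html.parser")

sections = []  # list of (heading, [paragraphs]) pairs
for article in soup.find_all(class_="xyz"):
    for tag in article.find_all(["h4", "p"]):
        text = tag.get_text(strip=True)
        if tag.name == "h4":
            # a heading starts a new section
            sections.append((text, []))
        elif sections:
            # a paragraph belongs to the most recent heading
            sections[-1][1].append(text)
        else:
            # paragraph before any heading: file it under an empty heading
            sections.append(("", [text]))

for heading, paragraphs in sections:
    print(heading)
    for paragraph in paragraphs:
        print(paragraph)
```

Grouping this way also copes with a heading followed by several paragraphs, which the one-heading-one-paragraph lookups do not handle as cleanly.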

All three approaches produce the same result:

dr
efkl
dr
efkl
dr
efkl
dr
efkl
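
To fit this back into the code from the question, something along these lines should work. It is a minimal sketch: article_text and its class_name parameter are names made up for illustration, the sample string stands in for the downloaded page because the real URL is not given, and html.parser replaces lxml only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup


def article_text(html, class_name="xyz"):
    """Return heading and paragraph text of the first matching <article>, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    match = soup.find("article", class_=class_name)
    if match is None:  # no such article on the page
        return ""
    return "\n".join(tag.get_text(strip=True) for tag in match.find_all(["h4", "p"]))


# With a live page, html would come from something like
# urllib.request.urlopen(url).read(); a literal string stands in here.
sample = '<article class="xyz"><h4>dr</h4><p>efkl</p><h4>dr</h4><p>efkl</p></article>'
print(article_text(sample))
```

Returning an empty string when the article is missing avoids the AttributeError that match.find_all would otherwise raise on None.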
