In beautifulsoup4, when scraping a website purely based off of an element and the text within, how do you return more than one result?

Senuvox Published at Dev

Senuvox

I'm currently at the end of my rope dealing with a frustrating program, and I'm posting here for help for the first time. Using beautifulsoup4, I'm attempting to, in short, scrape a website with no reliable HTML classes or IDs to work with. All I have is the anchor element and, for the example I'm providing below, I am attempting to grab the phrase "Where the Red Fern Grows" using only the the lowercase text "red fern". So in conclusion, I am attempting to identify and collect/print the text of each unclassified/unidentified anchor element that contains the phrase "Where the Red Fern Grows", without having to type the entire string and remain case insensitive.

I've tried a multitude of things so far, with my greatest success being only a half measure. I was able to successfully collect the very first anchor element that contained 'WTRFG'. Unfortunately, despite my best efforts, that's about as much as I've been able to get. I've used both find and find_all, tried to use re.search with regex, and tried a number of other things I found in other stack overflow answers. No dice. Here's what I got right now.

import bs4
import requests
import re
import pretty_errors

url = "http://fake.site/search.php?req=where+the+red+fern+grows&lg_topic=fakesite&open=0&view=simple&res=25&phrase=1&column=def"
page = requests.get(url)
fernSoup = bs4.BeautifulSoup(page.content, "html.parser")
redFern = "red fern"

print(type(fernSoup))
print(type(redFern))

anchor = fernSoup.find_all("a", class_=False, text=lambda text: text and redFern in text.lower())

print(anchor)

Which outputs as:

<class 'bs4.BeautifulSoup'>
<class 'str'>
[<a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a>]

# This is only the first of three different results, but the only one I can access usually. The other two contain the exact same structure, minus differences in the href url and ID number.

Any advice would be greatly appreciated, and thank you for taking the time to read my post.

Edit: The three anchors I am attempting to access, copy pasted directly from the result of print(fernSoup)

<td width="500"><a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a></td>

<td width="500"><a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a></td>

<td width="500"><a href="book/index.php?md5=9DD3079644E043E530682DA95C95B999" id="2413155" title="">Where the Red Fern Grows: The Story of Two Dogs and a Boy<br/> <font color="green" face="Times"><i>978-0-307-78156-7, 0307781569, 0553274295, 9780440412670</i></

Andrej Kesely

To select multiple <a> tags with the text "red fern", you can do:

from bs4 import BeautifulSoup

html_doc = """
 <td width="500"><a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a></td> <td width="500"><a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a></td> 
"""

fernSoup = BeautifulSoup(html_doc, "html.parser")
redFern = "red fern"

anchor = fernSoup.find_all(
    lambda tag: tag.name == "a" and redFern in tag.text.lower()
)

print(anchor)

Prints:

[<a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a>, <a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a>]

Or CSS selector (but this is case sensitive):

print(fernSoup.select('a:-soup-contains("Red Fern")'))

Prints:

[<a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a>, <a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a>]

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-06-4

Comments

0 comments

How do you get the location of touches for a UILongPressGestureRecognizer when the number of required touches is more than one in swift?

TOP Ranking

Article

In beautifulsoup4, when scraping a website purely based off of an element and the text within, how do you return more than one result?

In beautifulsoup4, when scraping a website purely based off of an element and the text within, how do you return more than one result?

Loopback Error: connect ECONNREFUSED 127.0.0.1:3306 (MAMP)

Can't pre-populate phone number and message body in SMS link on iPhones when SMS app is not running in the background

pump.io port in URL

How to import an asset in swift using Bundle.main.path() in a react-native native module

Failed to listen on localhost:8000 (reason: Cannot assign requested address)

Spring Boot JPA PostgreSQL Web App - Internal Authentication Error

Emulator wrong screen resolution in Android Studio 1.3

3D Touch Peek Swipe Like Mail

Double spacing in rmarkdown pdf

Svchost high CPU from Microsoft.BingWeather app errors

How to how increase/decrease compared to adjacent cell

Using Response.Redirect with Friendly URLS in ASP.NET

java.lang.NullPointerException: Cannot read the array length because "<local3>" is null

BigQuery - concatenate ignoring NULL

How to fix "pickle_module.load(f, **pickle_load_args) _pickle.UnpicklingError: invalid load key, '<'" using YOLOv3?

ngClass error (Can't bind ngClass since it isn't a known property of div) in Angular 11.0.3

Can a 32-bit antivirus program protect you from 64-bit threats

Make a B+ Tree concurrent thread safe

Bootstrap 5 Static Modal Still Closes when I Click Outside

Vector input in shiny R and then use it

Assembly definition can't resolve namespaces from external packages