Finding direct child of an element

loremIpsum1771

I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.

The problem I'm running into arises for a page like this, where the first <p> has multiple levels of child elements and the first <a> tag after the first set of parentheses needs to be extracted. To extract this link, you have to skip over the parentheses and then take the very next anchor tag/href. In most articles my algorithm skips the parentheses correctly, but because of the way it looks for links before the parentheses (or when there are none), it finds the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>

The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element iteratively, and first checking to see if it contains either an '(' or an '

Is there any straightforward way to avoid embedded anchor tags and only take the first link that is a direct child of the first <p>?

Below is the function with this code for reference:

def getValidLink(self, currResponse):
        currRoot = BeautifulSoup(currResponse.text, "lxml")
        temp = currRoot.body.findAll('p')[0]
        link = None  # stays None if no anchor is found before a '('
        parenOpened = False
        parenCompleted = False
        openCount = 0
        foundParen = False
        # First pass: walk the document from the first <p> and stop at
        # whichever comes first, a '(' in a text node or a tag containing '<a'.
        while temp.next:
            temp = temp.next
            curr = str(temp)
            if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
                foundParen = True
                break
            if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
                link = temp
                break

        temp = currRoot.body.findAll('p')[0]
        if foundParen:
            # Second pass: advance until the first parenthesized group closes.
            while temp.next and not parenCompleted:
                temp = temp.next
                curr = str(temp)
                if '(' in curr:
                    openCount += 1
                    if parenOpened is False:
                        parenOpened = True
                if ')' in curr and parenOpened and openCount > 1:
                    openCount -= 1
                elif ')' in curr and parenOpened and openCount == 1:
                    parenCompleted = True
            try:
                return temp.findNext('a').attrs['href']
            except (AttributeError, KeyError):  # no following <a>, or no href
                print("\nReached article with no main body!\n")
                return None
        try:
            return str(link.attrs['href'])
        except (AttributeError, KeyError):  # link is None, or has no href
            print("\nReached article with no main body\n")
            return None

alecxe

I think you are seriously overcomplicating the problem.

There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:

In [1]: import requests  

In [2]: from bs4 import BeautifulSoup   

In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"    

In [4]: response = requests.get(url)    

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]: 
['West Africa',
 'Guinea',
 'Liberia',
 ...
 'Allen Iverson',
 'Magic Johnson',
 'Victor Oladipo',
 'Frances Tiafoe']

Here we've found the a elements that are direct children of p elements, which are in turn direct children of the element with id="mw-content-text". From what I understand, this is where the main Wikipedia article content is located.

If you need a single element, use select_one() instead of select().
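To see the difference between the two selector forms without hitting the network, here is a minimal sketch on an inline HTML fragment. The fragment is made up for illustration (loosely modeled on the coordinates span from the question), and the variable names are arbitrary:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first anchor is nested inside two spans,
# the other two are direct children of the <p>.
html = """
<div id="mw-content-text">
  <p>
    <span><span id="coordinates"><a href="/wiki/Geographic_coordinate_system">Coordinates</a></span></span>
    <b>Sierra Leone</b> is a country in <a href="/wiki/West_Africa">West Africa</a>,
    bordered by <a href="/wiki/Guinea">Guinea</a>.
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (a space): anchors at any depth under the <p>.
any_depth = [a["href"] for a in soup.select("#mw-content-text > p a")]

# Child combinator (>): only anchors that are direct children of the <p>.
direct = [a["href"] for a in soup.select("#mw-content-text > p > a")]

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("#mw-content-text > p > a")
first_href = first["href"] if first is not None else None

print(any_depth)
print(direct)
print(first_href)
```

With the descendant form the nested "Coordinates" link comes first; with the child form it is skipped entirely.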

Also, if you want to solve it via find*(), pass the recursive=False argument.
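As a rough sketch of the find*() route, the snippet below compares the default recursive search with recursive=False on a made-up fragment. It also includes a hypothetical helper, first_direct_link(), that is not part of BeautifulSoup: it walks only the direct children of a paragraph and skips anchors that sit inside parentheses, which is one possible way to approach the requirement in the question. It assumes the parentheses appear in the paragraph's own text nodes; parentheses inside nested tags are not counted.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def first_direct_link(paragraph):
    """Hypothetical helper: href of the first <a> that is a direct child
    of `paragraph` and lies outside parentheses, or None if there is none."""
    depth = 0  # parenthesis nesting level seen so far in direct text nodes
    for child in paragraph.children:  # direct children only, in order
        if isinstance(child, NavigableString):
            depth += child.count("(") - child.count(")")
            depth = max(depth, 0)  # tolerate a stray ')'
        elif isinstance(child, Tag) and child.name == "a" and depth == 0:
            href = child.get("href")
            if href:
                return href
    return None


# Made-up fragment: one nested anchor, one inside parentheses, one valid.
html = (
    '<p><span><a href="/wiki/Nested">n</a></span><b>X</b> '
    '(from <a href="/wiki/In_parens">y</a>) is a '
    '<a href="/wiki/Direct">z</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

# Default search is recursive, so it finds the anchor nested in the <span>.
recursive_href = first_p.find("a")["href"]

# recursive=False restricts the search to direct children of the <p>.
direct_href = first_p.find("a", recursive=False)["href"]

# The helper additionally skips the anchor inside the parentheses.
outside_parens_href = first_direct_link(first_p)

print(recursive_href, direct_href, outside_parens_href)
```

Note that recursive=False on its own only solves the nesting problem; the parenthesis handling still has to be done separately, as in the helper.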


Edited at 2021-04-17
