Finding direct child of an element

loremIpsum1771

I'm writing a solution to test this phenomenon in Python. I have most of the logic done, but there are many edge cases that arise when following links in Wikipedia articles.

The problem I'm running into arises for a page like this, where the first <p> has multiple levels of child elements and the link to extract is the first <a> tag after the first set of parentheses. To extract this link, you have to skip over the parentheses and then take the very next anchor tag/href. In most articles my algorithm skips the parentheses correctly, but because of the way it looks for links that come before the parentheses (or when there are no parentheses), it finds the anchor tag in the wrong place. Specifically, here: <span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>

The algorithm works by iterating through the elements in the first paragraph tag (in the main body of the article), stringifying each element in turn, and first checking whether it contains either a '(' or an '<a'.

Is there any straightforward way to avoid embedded anchor tags and take only the first link that is a direct child of the first <p>?

Below is the function with this code for reference:

from bs4 import BeautifulSoup

def getValidLink(self, currResponse):
    currRoot = BeautifulSoup(currResponse.text, "lxml")
    temp = currRoot.body.findAll('p')[0]
    parenOpened = False
    parenCompleted = False
    openCount = 0
    foundParen = False
    # First pass: walk the document from the first <p> and stop at
    # whichever comes first, a '(' in a text node or an <a> tag.
    while temp.next:
        temp = temp.next
        curr = str(temp)
        if '(' in curr and str(type(temp)) == "<class 'bs4.element.NavigableString'>":
            foundParen = True
            break
        if '<a' in curr and str(type(temp)) == "<class 'bs4.element.Tag'>":
            link = temp
            break

    # Second pass: if a parenthesis came first, skip to its matching ')'
    # and return the next link after it.
    temp = currRoot.body.findAll('p')[0]
    if foundParen:
        while temp.next and not parenCompleted:
            temp = temp.next
            curr = str(temp)
            if '(' in curr:
                openCount += 1
                if parenOpened is False:
                    parenOpened = True
            if ')' in curr and parenOpened and openCount > 1:
                openCount -= 1
            elif ')' in curr and parenOpened and openCount == 1:
                parenCompleted = True
        try:
            return temp.findNext('a').attrs['href']
        except KeyError:
            print("\nReached article with no main body!\n")
            return None
    try:
        return str(link.attrs['href'])
    except KeyError:
        print("\nReached article with no main body\n")
        return None
alecxe

I think you are seriously overcomplicating the problem.

There are multiple ways to use the direct parent-child relationship between the elements in BeautifulSoup. One way is the > CSS selector:

In [1]: import requests  

In [2]: from bs4 import BeautifulSoup   

In [3]: url = "https://en.wikipedia.org/wiki/Sierra_Leone"    

In [4]: response = requests.get(url)    

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [a.get_text() for a in soup.select("#mw-content-text > p > a")]
Out[6]: 
['West Africa',
 'Guinea',
 'Liberia',
 ...
 'Allen Iverson',
 'Magic Johnson',
 'Victor Oladipo',
 'Frances Tiafoe']

Here we've found the a elements that are direct children of the p elements that are themselves direct children of the element with id="mw-content-text". From what I understand, this is where the main Wikipedia article content is located.

If you need a single element, use select_one() instead of select().
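
To make the difference concrete, here is a minimal sketch run against a small hand-written snippet rather than the live page. The markup in SAMPLE_HTML is an assumption made up for illustration (it only imitates the nested "Coordinates" link from the question), and the real Wikipedia page structure may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup imitating the start of a Wikipedia article.
# The first <p> only contains a link nested inside two <span> elements.
SAMPLE_HTML = """
<div id="mw-content-text">
  <p><span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system">Coordinates</a></span></span></p>
  <p><b>Sierra Leone</b> is a country in <a href="/wiki/West_Africa">West Africa</a>,
  bordered by <a href="/wiki/Guinea">Guinea</a>.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every match; select_one() returns the first
# match, or None when nothing matches.
all_links = soup.select("#mw-content-text > p > a")
first_link = soup.select_one("#mw-content-text > p > a")

direct_hrefs = [a["href"] for a in all_links]
first_href = first_link["href"] if first_link is not None else None

print(direct_hrefs)
print(first_href)
```

The nested "Coordinates" anchor is skipped because it is a grandchild of the <p>, not a direct child, so the > combinator never matches it.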

Also, if you want to solve it via find*(), pass the recursive=False argument.
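
A rough sketch of the find*() variant follows. The helper name first_direct_link and the sample markup are hypothetical, introduced only for this example, and the sketch deliberately does not handle the parentheses-skipping part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a nested link followed by a direct-child link.
SAMPLE_HTML = (
    '<div id="mw-content-text">'
    '<p><span id="coordinates">'
    '<a href="/wiki/Geographic_coordinate_system">Coordinates</a></span>'
    ' <b>Sierra Leone</b> is in <a href="/wiki/West_Africa">West Africa</a>.</p>'
    '</div>'
)


def first_direct_link(html):
    """Return the href of the first <a> that is a direct child of a <p>
    sitting directly under #mw-content-text, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")
    if content is None:
        return None
    # recursive=False limits the search to direct children only.
    for paragraph in content.find_all("p", recursive=False):
        link = paragraph.find("a", href=True, recursive=False)
        if link is not None:
            return link["href"]
    return None


result = first_direct_link(SAMPLE_HTML)
print(result)
```

Returning None instead of raising keeps the "no link found" case explicit, which avoids relying on a KeyError handler as the original function does.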
