Multiprocessing BeautifulSoup bs4.element.Tag

slaw

I'm trying to use multiprocessing together with BeautifulSoup, but I'm encountering a "maximum recursion depth exceeded" error:

import multiprocessing
import urllib2
from bs4 import BeautifulSoup

def process_card(card):
    result = card.find("p")
    # Do some more parsing with BeautifulSoup

    return result


articles = []
pool = multiprocessing.Pool(processes=4)
# BeautifulSoup parses markup, not a URL, so fetch the page first
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
cards = soup.find_all("li")
for card in cards:
    result = pool.apply_async(process_card, [card])
    article = result.get()
    if article is not None:
        print article
        articles.append(article)
pool.close()
pool.join()

From what I can gather, card is of type <class 'bs4.element.Tag'>, and the problem may have to do with pickling this object: multiprocessing pickles every argument it passes to a worker process, and a Tag carries references to the rest of the parse tree. It's not clear how I'd have to modify my code to resolve this.
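
To see where it blows up, here's a minimal sketch that isolates just the pickling step, since multiprocessing pickles every argument it ships to a worker. The synthetic HTML is mine, and whether this actually raises depends on the bs4 and Python versions:

import pickle
from bs4 import BeautifulSoup

# Build a document with many sibling tags; a Tag keeps references to its
# parent and siblings, so pickling one Tag drags in the entire parse tree.
soup = BeautifulSoup("<ul>" + "<li><p>x</p></li>" * 1000 + "</ul>", "html.parser")
card = soup.find("li")

try:
    # multiprocessing does exactly this to every argument it sends to a worker
    pickle.dumps(card)
except RuntimeError as e:
    print e  # maximum recursion depth exceeded (version-dependent)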

slaw

It was pointed out in the comments that one could simply cast card to unicode before handing it to the pool. However, doing only that made the process_card function error out with slice indices must be integers or None or have an __index__ method. It turns out this error stems from the fact that card is no longer a bs4 object and therefore has no access to bs4's methods; it is plain unicode, and the error is a unicode-related one (unicode has its own find, which returns an index rather than a Tag). So one needs to turn card back into soup inside the worker first and then proceed from there. This works!

import multiprocessing
import urllib2
from bs4 import BeautifulSoup

def process_card(unicode_card):
    # Re-parse the serialized markup so we get a real bs4 Tag back
    card = BeautifulSoup(unicode_card, 'html.parser')
    result = card.find("p")
    # Do some more parsing with BeautifulSoup

    return result


articles = []
pool = multiprocessing.Pool(processes=4)
# BeautifulSoup parses markup, not a URL, so fetch the page first
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
cards = soup.find_all("li")
for card in cards:
    # unicode(card) turns the Tag into plain markup, which pickles cleanly
    result = pool.apply_async(process_card, [unicode(card)])
    article = result.get()
    if article is not None:
        print article
        articles.append(article)
pool.close()
pool.join()
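
One caveat with the loop above: result.get() blocks right after each apply_async, so the cards are still processed one at a time. Here is a minimal sketch (assuming the same cards list as above) that lets the pool actually run in parallel by using map instead:

import multiprocessing
from bs4 import BeautifulSoup

def process_card(unicode_card):
    card = BeautifulSoup(unicode_card, 'html.parser')
    result = card.find("p")
    # Serialize the result too, so the return trip to the parent pickles cleanly
    return unicode(result) if result is not None else None

pool = multiprocessing.Pool(processes=4)
# Ship plain strings to the workers and let map fan the cards out in parallel
results = pool.map(process_card, [unicode(card) for card in cards])
articles = [article for article in results if article is not None]
pool.close()
pool.join()

Returning a string rather than a Tag keeps the result pickle-friendly in both directions; the parent can always re-soup it if more parsing is needed.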
