In BeautifulSoup, what's the proper way to use a strainer with lxml parsing?

Dave

I'm using Beautiful Soup 4 and Python 3.8. I want to parse only certain elements from an HTML page, so I decided to use a strainer like so:

import urllib.request  # urllib2 is Python 2 only; Python 3 uses urllib.request

req = urllib.request.Request(full_url, headers=settings.HDR)
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)

...

    @staticmethod
    def idiom_match_strainer(elem, attrs):
        if elem == 'ul' and 'class' in attrs and attrs['class'] == 'idiKw':
            return True
        return False

Unfortunately, when I try to parse any URL (https://idioms.thefreedictionary.com/testing is an example), I get the error below:

Internal Server Error: /ajax/get_hints
Traceback (most recent call last):
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 126, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 124, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/views.py", line 194, in get_hints
    objects = s.get_hints(article)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/article_service.py", line 398, in get_hints
    idioms = DictionaryService.get_idioms(word)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/dictionary_service.py", line 75, in get_idioms
    soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 281, in __init__
    self._feed()
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 342, in _feed
    self.builder.feed(self.markup)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 287, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 1364, in lxml.etree._FeedParser.feed
  File "src/lxml/parsertarget.pxi", line 148, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/parsertarget.pxi", line 136, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/saxparser.pxi", line 389, in lxml.etree._handleSaxTargetStartNoNs
  File "src/lxml/saxparser.pxi", line 404, in lxml.etree._callTargetSaxStart
  File "src/lxml/parsertarget.pxi", line 80, in lxml.etree._PythonSaxParserTarget._handleSaxStart
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 220, in start
    self.soup.handle_starttag(name, namespace, nsprefix, attrs)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 582, in handle_starttag
    and (self.parse_only.text
AttributeError: 'function' object has no attribute 'text'

Is there a different way I should be using the strainer?

Martin Honnen

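The parse_only argument expects a SoupStrainer instance, not a bare function. The traceback shows BeautifulSoup reading parse_only.text, an attribute a plain function does not have. It should suffice to use the SoupStrainer from the package: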

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

html = '<html><body><section><ul class="foo"><li>a<li>b</ul><ul><li>1<li>2</ul></section><ul class="foo"><li>c<li>d</ul></body></html>'

soup = BeautifulSoup(html, features="lxml", parse_only=SoupStrainer('ul', class_='foo'))

print(soup.prettify())

gives

<ul class="foo">
 <li>
  a
 </li>
 <li>
  b
 </li>
</ul>
<ul class="foo">
 <li>
  c
 </li>
 <li>
  d
 </li>
</ul>

So for your call, I think you want parse_only=SoupStrainer('ul', class_='idiKw').
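
As a rough sketch of how that could look in the asker's code, the strainer can be wrapped in a small helper. The function name extract_idioms and the sample markup are made up for illustration; the real page's structure beyond the ul with class idiKw is an assumption.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <div><ul class="idiKw"><li>testing the waters</li><li>test of time</li></ul></div>
  <ul class="other"><li>ignored</li></ul>
</body></html>
"""


def extract_idioms(html):
    """Return the text of each <li> inside <ul class="idiKw"> elements."""
    # Pass a SoupStrainer instance (not a bare function) as parse_only,
    # so only matching <ul> elements and their children are kept.
    only_idiom_lists = SoupStrainer("ul", class_="idiKw")
    soup = BeautifulSoup(html, features="lxml", parse_only=only_idiom_lists)
    return [li.get_text(strip=True) for li in soup.find_all("li")]


idioms = extract_idioms(SAMPLE_HTML)
print(idioms)
```

Because everything outside the matching ul elements is discarded while parsing, the later find_all("li") call only sees the list items of interest.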

Edited at 2021-05-30
