In BeautifulSoup, what's the proper way to use a strainer with lxml parsing?

Dave

I'm using Beautiful Soup 4 and Python 3.8. I want to parse only certain elements from an HTML page, so I decided to use a strainer like so ...

req = urllib2.Request(full_url, headers=settings.HDR)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)

...

    @staticmethod
    def idiom_match_strainer(elem, attrs):
        if elem == 'ul' and 'class' in attrs and attrs['class'] == 'idiKw':
            return True
        return False

Unfortunately, when I try to parse any URL (for example, https://idioms.thefreedictionary.com/testing), I get the error below:

Internal Server Error: /ajax/get_hints
Traceback (most recent call last):
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 126, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 124, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/views.py", line 194, in get_hints
    objects = s.get_hints(article)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/article_service.py", line 398, in get_hints
    idioms = DictionaryService.get_idioms(word)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/dictionary_service.py", line 75, in get_idioms
    soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 281, in __init__
    self._feed()
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 342, in _feed
    self.builder.feed(self.markup)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 287, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 1364, in lxml.etree._FeedParser.feed
  File "src/lxml/parsertarget.pxi", line 148, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/parsertarget.pxi", line 136, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/saxparser.pxi", line 389, in lxml.etree._handleSaxTargetStartNoNs
  File "src/lxml/saxparser.pxi", line 404, in lxml.etree._callTargetSaxStart
  File "src/lxml/parsertarget.pxi", line 80, in lxml.etree._PythonSaxParserTarget._handleSaxStart
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 220, in start
    self.soup.handle_starttag(name, namespace, nsprefix, attrs)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 582, in handle_starttag
    and (self.parse_only.text
AttributeError: 'function' object has no attribute 'text'

Is there a different way I should be using the strainer?

Martin Honnen

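The parse_only argument expects a SoupStrainer object rather than a plain function, which is why the parser fails when it looks up the strainer's text attribute. It should suffice to use the SoupStrainer class from the bs4 package: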

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

html = '<html><body><section><ul class="foo"><li>a<li>b</ul><ul><li>1<li>2</ul></section><ul class="foo"><li>c<li>d</ul></body></html>'

soup = BeautifulSoup(html, features="lxml", parse_only=SoupStrainer('ul', class_='foo'))

print(soup.prettify())

gives

<ul class="foo">
 <li>
  a
 </li>
 <li>
  b
 </li>
</ul>
<ul class="foo">
 <li>
  c
 </li>
 <li>
  d
 </li>
</ul>

So for your call, I think you want parse_only=SoupStrainer('ul', class_='idiKw').
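
Applied to the markup from the question, that might look roughly like the sketch below. The sample HTML is made up for illustration (only the 'idiKw' class name comes from the question), and the variable names are assumptions, so treat it as a starting point rather than a drop-in replacement for get_idioms:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample markup standing in for the real page; only the
# 'idiKw' class name is taken from the question.
SAMPLE_HTML = (
    '<html><body><h1>heading</h1>'
    '<ul class="idiKw"><li>first idiom</li><li>second idiom</li></ul>'
    '<ul class="other"><li>unrelated</li></ul>'
    '</body></html>'
)

# Build the strainer object once and pass the object itself,
# not a function, to parse_only.
idiom_strainer = SoupStrainer('ul', class_='idiKw')

soup = BeautifulSoup(SAMPLE_HTML, features='lxml', parse_only=idiom_strainer)

# Only the matching <ul> and its children end up in the tree.
idioms = [li.get_text(strip=True) for li in soup.find_all('li')]
print(idioms)
```

In the real code, SAMPLE_HTML would be replaced by the bytes read from the response, and the staticmethod from the question would no longer be needed.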
