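I have written a Python script that uses XPath to scrape links from a site that serves XML content. Since I have never worked with XML before, I can't figure out where I'm going wrong. Thanks in advance for any workaround. Here is what I'm trying: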
import requests
from lxml import html
response = requests.get("https://drinkup.london/sitemap.xml").text
tree = html.fromstring(response)
for item in tree.xpath('//div[@class="expanded"]//span[@class="text"]'):
    print(item)
The XML content where the links are located:
<div xmlns="http://www.w3.org/1999/xhtml" class="collapsible" id="collapsible4"><div class="expanded"><div class="line"><span class="button collapse-button"></span><span class="html-tag"><url></span></div><div class="collapsible-content"><div class="line"><span class="html-tag"><loc></span><span class="text">https://drinkup.london/</span><span class="html-tag"></loc></span></div></div><div class="line"><span class="html-tag"></url></span></div></div><div class="collapsed hidden"><div class="line"><span class="button expand-button"></span><span class="html-tag"><url></span><span class="text">...</span><span class="html-tag"></url></span></div></div></div>
The error raised on execution is as follows:
value = etree.fromstring(html, parser, **kw)
File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593)
File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119053)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
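Switch to .content, which returns bytes, instead of .text, which returns unicode. The sitemap begins with an XML encoding declaration, and as the error message says, lxml will not parse an already-decoded string that carries one: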
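import requests
from lxml import html
# .content gives the raw bytes, so lxml can honor the encoding declaration itself
response = requests.get("https://drinkup.london/sitemap.xml").content
tree = html.fromstring(response)
# select the text of every <loc> inside a <url>
for item in tree.xpath('//url/loc/text()'):
    print(item)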
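Also note the corrected XPath expression. The div and span markup shown in the question is how the browser renders the XML for viewing, not what the server actually sends; the real document contains url and loc elements, so the expression has to target those.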
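For comparison, here is a minimal sketch that parses the sitemap as XML rather than HTML. It swaps in the standard library's xml.etree.ElementTree in place of lxml, so it is not what the answer above does. It assumes the sitemap declares the standard sitemaps.org namespace. The inline SAMPLE_SITEMAP bytes, the second URL in it, and the helper name extract_locs are hypothetical stand-ins so the example runs without a network request.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical stand-in for requests.get(...).content (bytes, not str).
SAMPLE_SITEMAP = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b'  <url><loc>https://drinkup.london/</loc></url>\n'
    b'  <url><loc>https://example.com/hypothetical-page</loc></url>\n'
    b'</urlset>\n'
)


def extract_locs(xml_bytes):
    """Return the text of every <loc> element found under a <url> element."""
    root = ET.fromstring(xml_bytes)
    # An XML parser keeps the namespace, so each step of the path is qualified.
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


if __name__ == "__main__":
    for link in extract_locs(SAMPLE_SITEMAP):
        print(link)
```

The HTML parser used in the answer ignores the namespace, which is why the plain //url/loc/text() expression works there. With a namespace-aware XML parser, the prefix mapping shown here is needed instead.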