urllib.request.urlopen returns bytes, but I cannot decode it

布拉扎德

I am trying to parse a web page using urlopen() from urllib.request, like this:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried to decode it, like this:

html = urlopen(req).read().decode("utf-8")

However, an error occurs:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

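After some research, I found a related answer that parses the charset to decide how to decode. However, this page does not return a charset, and when I checked it in the Chrome Web Inspector, its header contained the following line: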

<meta charset="utf-8">

So why can't I decode it with utf-8? And how can I parse the web page successfully?

The website URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the images to disk.

Note that I am using Python 3.5.1. I also noticed that everything I wrote above works fine in my other scraping programs.

虚假的

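The content is compressed with gzip. A gzip stream always starts with the bytes 0x1f 0x8b, which is why decoding fails on byte 0x8b at position 1. You need to decompress it: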

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')

If you use requests, it decompresses the content for you automatically:

import requests
html = requests.get(url).text  # => str, not bytes
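
If you want to stay with urllib but are not sure whether a given server will send gzip, a small helper along these lines may help. This is only a sketch: decode_body and GZIP_MAGIC are hypothetical names made up for illustration, and it assumes the page really is UTF-8 once decompressed, as the meta tag in the question suggests.

```python
import gzip

# First two bytes of every gzip stream (hypothetical constant name).
GZIP_MAGIC = b"\x1f\x8b"

def decode_body(raw, content_encoding="", charset="utf-8"):
    """Hypothetical helper: decompress only if the body is gzip, then decode.

    raw              -- the bytes returned by response.read()
    content_encoding -- value of the Content-Encoding header, if any
    charset          -- assumed text encoding of the decompressed body
    """
    if content_encoding.lower() == "gzip" or raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset)

# Possible usage with urllib (performs a network request, so left commented out):
# from urllib.request import Request, urlopen
# with urlopen(Request(url)) as resp:
#     html = decode_body(resp.read(), resp.headers.get("Content-Encoding", ""))
```

Checking both the header and the leading bytes is a defensive choice: the header is the proper signal, while the magic-byte check covers servers that compress without declaring it.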

Edited on 2020-11-1