So I've started learning web scraping in Python using urllib and bs4.
I was looking for some code to analyze and found this: https://stackoverflow.com/a/38620894/14252018. Here is the code:
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)
for result in page.cssselect(".r a"):
    url = result.get("href")
    # Google wraps result links as /url?q=<target>&...
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print(url[0])
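The `/url?` handling in that loop can be tried on its own, without any network access. This is a minimal sketch, assuming the hrefs look like `/url?q=<target>&...`; `unwrap_google_href` is a hypothetical helper name and the sample href is made up for illustration, neither comes from the linked answer:

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_href(href):
    """Return the target URL from a '/url?q=...' redirect link, or None for any other href."""
    if not href or not href.startswith("/url?"):
        return None
    # parse_qs maps each query parameter to a list of values
    targets = parse_qs(urlparse(href).query).get("q")
    return targets[0] if targets else None


if __name__ == "__main__":
    sample = "/url?q=https://stackoverflow.com/&sa=U&ved=abc"  # made-up href for illustration
    print(unwrap_google_href(sample))
```

Returning None for hrefs that are not redirect links lets the caller skip them instead of indexing into a plain string.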
When I try to run it, it doesn't print anything.
So then I tried using bs4, and this time I chose https://www.duckduckgo.com
and changed the code to this:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
I get an error:
Change your DuckDuckGo URL to the address the site tries to redirect you to when JavaScript is not enabled:
import bs4 as bs
import urllib.request
# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
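If the search term varies, the no-JavaScript URL above can be built with `urlencode` rather than by hand, so spaces and special characters are escaped. This is a rough sketch, not a definitive implementation: `build_search_url` and `fetch_html` are hypothetical helper names, and the browser-like `User-Agent` header is an assumption about what the site may expect, not something stated in the answer.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://html.duckduckgo.com/html"  # no-JavaScript endpoint used above


def build_search_url(query):
    """Return the no-JavaScript search URL with the query safely encoded."""
    return BASE_URL + "?" + urlencode({"q": query})


def fetch_html(query, user_agent="Mozilla/5.0"):
    """Download the result page as bytes (performs a network request)."""
    # Assumption: a browser-like User-Agent may be needed; adjust or drop as required.
    request = Request(build_search_url(query), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read()


if __name__ == "__main__":
    # Only prints the URL; call fetch_html("dinosaur") to actually download the page.
    print(build_search_url("dinosaur"))
```

The bytes returned by `fetch_html` can be passed to `bs.BeautifulSoup(..., 'lxml')` exactly as `sauce` is in the code above.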