Unable to get the entire HTML page with Python Requests

Mikel Manzanal

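I am working on a Cards Against Humanity card editor. To get card ideas, I would like to programmatically download entire card decks from the following web page. Using the browser's inspect tool, I found where the cards are stored: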

[Image: location of the card descriptions]

As you can see, inside the whitecards and blackcards classes there is an element for each card id, which contains the card's phrase or idea.

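The overall purpose of my code is to take a deck URL and return all of its cards (white and black). My first approach was to use the Requests package in Python. I used the following code: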

import requests
from bs4 import BeautifulSoup

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

root = soup.find(id='root')

However, when I inspect the root object, I find that it is empty, although it should contain all of the whitecards and blackcards classes.

Booboo

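It is often the case that a web page is not complete after the initial page load. Typically, once the page has loaded, JavaScript code performs one or more AJAX requests that modify the DOM, which is why fetching the page with requests does not produce the final, complete DOM. So I loaded the page in my browser and looked at the XHR network requests made after the page load. However, none of them seemed to return the missing information, which was a bit puzzling. My solution, therefore, is to use Selenium to drive a browser (Chrome in the example below) and scrape the page. It is necessary to wait a second or two after the initial page load to ensure that the DOM is complete: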

from selenium import webdriver
from bs4 import BeautifulSoup
import time

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get(URL)
time.sleep(1) # wait a second for <div id="root"> to be fully loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
root = soup.find(id='root')
print(root)
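
As a variation on the fixed one-second sleep, Selenium's explicit waits can poll until the content is actually present. The following is a minimal sketch, not a tested solution for this site: the function name `fetch_rendered_html`, the 10-second default timeout, and the assumption that "at least one child element under `<div id="root">`" means the cards have rendered are all illustrative choices, not something confirmed by the page.

```python
def fetch_rendered_html(url, timeout=10):
    """Load `url` in headless Chrome and return the HTML once #root has children.

    Hypothetical helper: the readiness condition below is an assumption.
    """
    # Imported lazily so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until <div id="root"> has at least one child element.
        # Raises selenium.common.exceptions.TimeoutException if that does
        # not happen within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, "#root > *")
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` exactly as in the code above. If the cards are inserted later than the first child of `root`, a more specific CSS selector would be needed in the wait condition.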

Update

I looked more closely at the AJAX calls, and it appears that the following URL returns the actual data you are interested in:

https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get

import requests


URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get'
resp = requests.get(URL)
print(resp.json())

Prints:

{'success': True, 'expansion': {'_id': '5e758e4034489b003f4529f6', 'name': 'Global Pandemic Pack', 'author': '5dfde1f4897a0f003e2fb547', 'description': "Who says in-house quarantine has to suck? For the price of a handful of toilet paper rolls, you can gain some original pandemic-themed cards that'll surely spice up your card games. Get your hands on the first-ever official Cards Lacking Originality card pack now! I mean it, right now!", 'price': 0, 'published': True, 'featured': True, 'dateCreated': '2020-03-21T03:47:12.167Z', '__v': 0, 'gamesUsed': 655, 'whiteCards': ['$1,200 Trump bucks.', 'A free extra week on the cruise ship!', 'A long Zoom meeting with no obvious purpose.', 'A lukewarm bowl of bat soup.', 'A mass panic caused by a sneeze.', 'Babies concieved under quarantine.', 'Beautiful cross-cultural friendships.', 'Binging 30 straight seasons of "The Simpsons."', 'Burying your head in a screen to escape family time.', 'Costco: Battle Royale.', 'Craving any excuse to party.', 'Crying and then sleeping and then crying.', 'Eating all the quarantine food within a day.', 'Ejaculating into the air and trying to catch it in your mouth.', 'Exchanging blowjobs for Kleenex and toilet paper.', 'Forgetting what genuine human connection feels like.', 'Groupons at funeral homes.', 'Hating the media.', 'Insatiable horniness.', 'Kung Flu fighting.', "My Gram-Gram's loooooong vacation!", 'Online class shootings.', 'Only washing hands after the CDC says you have to.', 'Plague, Inc.', 'Praying for the sweet release of death.', 'Raging Ebola.', 'Rediscovering the wonders of video games.', 'Some Lyme disease to go with your Coronavirus.', 'The National Guard.', 'The other eighteen COVIDs.', 'Unnecessarily sensual Zoom messages.'], 'blackCards': ['America: #1 in _______!', "Doctor, I've been doing _______ lately and I fear that I may be very sick.", 'I cannot BELIEVE that the grocery store is sold out of _______ already!', 'We regret to inform you that _______ has officially been cancelled due to COVID-19.', 'What is the one good thing about this pandemic?', 'What was the most difficult thing to give up for social distancing?', "What's really to blame for the spread of the virus?", "What's the best way to kill time while trapped inside the house?", "_______ is the entire reason I'm still holding onto some sanity."]}}
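
Since the original goal was the two lists of cards, the response can be reduced to just those. Here is a minimal sketch based only on the keys visible in the output above (`success`, `expansion`, `whiteCards`, `blackCards`); the helper names `extract_cards` and `fetch_cards`, the 10-second timeout, and the error handling are my own illustrative additions, and the endpoint may behave differently for other decks.

```python
# Endpoint taken from the answer above.
URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get'


def extract_cards(payload):
    """Return (white_cards, black_cards) from a decoded /get response.

    Hypothetical helper; assumes the JSON layout shown in the printed output.
    """
    if not payload.get('success'):
        raise ValueError('server did not report success')
    expansion = payload['expansion']
    # Fall back to empty lists in case a deck lacks one of the two kinds.
    return expansion.get('whiteCards', []), expansion.get('blackCards', [])


def fetch_cards(url=URL, timeout=10):
    """Download a deck and return its (white_cards, black_cards) lists."""
    # Imported here so extract_cards can be used without Requests installed.
    import requests

    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx responses instead of parsing them
    return extract_cards(resp.json())
```

Calling `white, black = fetch_cards()` would then give two plain Python lists of strings, with no need for BeautifulSoup or Selenium.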
