I am trying to extract the URLs from the hyperlinks on this site: https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/
I used the following Python code:
import requests
from bs4 import BeautifulSoup
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
The problem is that this code does not return any URLs.
You can't parse the links out of the page because the data is loaded dynamically. When you download the HTML source, the data written into the page is not actually there yet; JavaScript later parses the window.__SITE variable and extracts the data from it.
However, we can replicate this in Python. After downloading the page:
import requests
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
you can extract the encoded page source with re (regex):
import re
encoded_data = re.search(r"window\.__SITE=\"(.*)\"", req.text).groups()[0]
After that, you can URL-decode the text with urllib and parse the JSON string data with json:
from urllib.parse import unquote
from json import loads
json_data = loads(unquote(encoded_data))
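As a quick offline illustration of these two steps (regex extraction, then URL-decode and JSON-parse), here is a minimal sketch on a synthetic page source; the tiny payload and variable names are invented for the example and are not from the real site:

```python
import re
from urllib.parse import quote, unquote
from json import loads

# Build a synthetic page source that mimics the window.__SITE pattern
# (the JSON payload here is made up for illustration).
payload = quote('{"site": {"title": "demo"}}')
page_source = f'<script>window.__SITE="{payload}"</script>'

# Step 1: pull the percent-encoded JSON out of the source with a regex.
encoded = re.search(r"window\.__SITE=\"(.*)\"", page_source).group(1)

# Step 2: URL-decode the string, then parse it into a dictionary.
data = loads(unquote(encoded))
print(data["site"]["title"])  # demo
```

The raw-string regex works because quote() percent-encodes the double quotes inside the JSON, so the first unescaped `"` after `window.__SITE=` closes the capture.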
Then you can walk the JSON tree to get the HTML source data:
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
At this point, you can use your own code to parse the HTML:
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
Putting it all together, here is the final script:
import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup
# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
# Get encoded JSON from HTML source
encoded_data = re.search(r"window\.__SITE=\"(.*)\"", req.text).groups()[0]
# Decode and load as dictionary
json_data = loads(unquote(encoded_data))
# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
# Get links
links = soup.find_all('a')
# For each link...
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
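One caveat: the ["sections"][1]["rows"][0]["cards"][0] path is tied to this page's current layout and will break if the site is rearranged. A more defensive variation (a sketch; the helper name is my own, assuming the same sections/rows/cards nesting) walks every card and yields whatever markdown it finds:

```python
def iter_markdown(layout):
    """Yield the markdown text of every card in a layout dictionary
    shaped like the one under json_data["site"]["data"]["values"]["layout"]."""
    for section in layout.get("sections", []):
        for row in section.get("rows", []):
            for card in row.get("cards", []):
                # Cards that are not markdown components simply yield nothing.
                markdown = (card.get("component", {})
                                .get("settings", {})
                                .get("markdown"))
                if markdown:
                    yield markdown
```

You would then parse each chunk in turn, e.g. `for html_src in iter_markdown(json_data["site"]["data"]["values"]["layout"]): ...`, instead of hard-coding the indices.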