Unable to scrape <strong> tags correctly using BeautifulSoup

Posted by itsDV7 in Dev

itsDV7

I am trying to scrape the product release dates from this Adidas website using the following code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                         '84.0.4147.105 Safari/537.36'}
url = "https://www.adidas.com.sg/release-dates"
productsource = requests.get(url, headers=headers, timeout=15)
productinfo = BeautifulSoup(productsource.text, "lxml")


def jdMonitor():
    # webscraper
    all_items = productinfo.find_all(name="div", class_="gl-product-card")
    # print(all_items)
    for item in all_items:
        # print(item)
        pname = item.find(name="div", class_="plc-product-name___2cofu").text
        pprice = item.find(name="div", class_="gl-price-item").text
        imagelink = item.find(name="img")['src']
        plink = f"https://www.adidas.com.sg/{item.a['href']}"
        try:
            pdate = item.find(name="div", class_="plc-product-date___1zgO_").strong.text
        except AttributeError as e:
            print(e)
            pdate = "Data Not Available"
        print(f"""
        Product Name: {pname}
        Product Price: {pprice}
        Image Link: {imagelink}
        Product Link: {plink}
        Product Date: {pdate}
""")


jdMonitor()

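But I run into a problem at pdate. If I instead use print(productinfo.find_all(name="strong")) to extract every <strong> tag on the page, all of the tags are extracted correctly except the one I need. The output I get is: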

... <strong>All Recycled Materials</strong>, <strong> </strong> ...

The blank space between the second pair of <strong> tags should contain the date, for example:

<strong>Wednesday 30 Jun 21:30</strong>

Can someone explain why this happens, and how to extract the date?

Dmitriy Zub

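It seems the dates are filled in dynamically and are not present in the page source (open the source and search for "WEDNESDAY 30 JUN 19:00"; nothing shows up). The most obvious approach is to use selenium, although that may be a slow solution. requests-html did not work for me, just like bs4. Rendering the page did not help either (or I did something wrong).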

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
# Running in headless mode for some reason gives no result or throws an error.
# options.headless = True
driver = webdriver.Chrome(options=options)
driver.get('https://www.adidas.com.sg/release-dates')

# find_elements(By.CSS_SELECTOR, ...) is used here because the older
# find_elements_by_css_selector helper was deprecated in Selenium 4 and later removed.
for date in driver.find_elements(By.CSS_SELECTOR, '.plc-product-date___1zgO_.gl-label.gl-label--m.gl-label--condensed'):
    print(date.text)
driver.quit()

# output:
'''
WEDNESDAY 30 JUN 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
THURSDAY 01 JUL 05:00
'''
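To see why bs4 returns an empty <strong> element, it can help to look at what the server actually sends before any JavaScript runs. The following is only a sketch: static_html is a hypothetical stand-in for productsource.text, modelled on the output shown in the question, and StrongTextCollector and strong_texts are illustrative names, not part of any library. It uses only the standard library's html.parser.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collects the text found inside every <strong> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while inside a <strong> element
        self.texts = []   # one entry per <strong> element encountered

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside a <strong> element.
        if self._depth:
            self.texts[-1] += data


def strong_texts(html):
    """Return the stripped text of each <strong> element in html."""
    parser = StrongTextCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Hypothetical stand-in for the static response (assumption, not real page data).
static_html = """
<div class="gl-product-card">
  <strong>All Recycled Materials</strong>
  <div class="plc-product-date___1zgO_"><strong> </strong></div>
</div>
"""

print(strong_texts(static_html))   # the date slot is present but empty
print("30 Jun" in static_html)     # the date text is nowhere in the markup
```

If the same check on the real response also finds an empty date slot, no HTML parser will recover the date from it, and a browser-driven tool such as selenium is needed to let the page's scripts fill it in.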

You can also use regex to grab these dates (if they appear), like this:

import re

test = '''
Wednesday 30 Jun 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 FEb 25:00
'''
matches = re.findall(r"[a-zA-Z]+\s\d+\s\w+\s\d+:\d+", test)
final_matches = '\n'.join(matches)
print(final_matches)

# output before joining: "['Wednesday 30 Jun 19:00', 'THURSDAY 01 JUL 05:00', 'THURSDAY 01 FEb 25:00']"

# output after joining:
'''
Wednesday 30 Jun 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 FEb 25:00
'''
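Note that the pattern above only checks the shape of the text, so an impossible time such as "25:00" still matches. One possible follow-up, sketched here under assumptions, is to pass each match through datetime.strptime and keep only the ones that parse. The function name parse_release_dates and the year argument are hypothetical: the page shows no year, so the caller has to supply one. The sketch also assumes English day and month names in the active locale.

```python
import re
from datetime import datetime

# Same pattern as in the answer above.
DATE_PATTERN = re.compile(r"[a-zA-Z]+\s\d+\s\w+\s\d+:\d+")


def parse_release_dates(text, year):
    """Return a datetime for every match in text that is a valid date and time.

    year is an assumption supplied by the caller, since the scraped
    strings do not include one.
    """
    parsed = []
    for match in DATE_PATTERN.findall(text):
        try:
            # strptime matches day and month names case-insensitively.
            parsed.append(datetime.strptime(f"{match} {year}", "%A %d %b %H:%M %Y"))
        except ValueError:
            # Shape matched, but the value is not a real date or time.
            continue
    return parsed


sample = """
Wednesday 30 Jun 19:00
THURSDAY 01 JUL 05:00
THURSDAY 01 FEb 25:00
"""

for moment in parse_release_dates(sample, year=2021):
    print(moment.isoformat())
```

With the sample text, the first two lines parse and the third is dropped because of the invalid hour. strptime does not verify that the weekday name agrees with the calendar date, so that would need a separate check if it matters.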


Edited on 2021-09-1
