How to handle elements that are missing from some pages when scraping with BeautifulSoup

Posted by schalkjoubert in Dev

schalkjoubert

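I need to scrape the markup below from a series of product pages and then split it so that the author and the illustrator are shown separately.

The problem is:

Some pages contain both an author <li> and an illustrator <li>, like page 1.

Some pages have only an author <li>, like page 2.

Some pages have neither an author nor an illustrator, so there is no <ul> at all, like page 3.

The only way to tell whether an <li> belongs to the illustrator is whether it contains the text "(Illustreerder)".

How can I assign default values to author and illustrator when they are empty?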

<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose  Palmer &amp; Reinette Lombard">Jose  Palmer &amp; Reinette Lombard</a>
    </li>
</ul>

from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}

# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'


illustrator = '(Illustreerder)'
productlist = []

r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')
if stocks is not None:
    stock = stocks.text.strip()
if stocks is None:
    stock = 'n/a'
 
for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip() or 'None'

        if illustrator not in author:
            author = author

for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip()

        if illustrator in author:
            illustrator = author
          
bookdata = [isbn, stock, author, illustrator]
print(bookdata)

Expected output with r = requests.get(page1, headers=headers):

['9781776356515', 'In voorraad', 'Jose  Palmer & Reinette Lombard', 'Zinelda McDonald']

Expected output with r = requests.get(page2, headers=headers):

['9780799383874', 'In voorraad', 'Jaco Jacobs', 'None']

Expected output with r = requests.get(page3, headers=headers):

['9780799383690', 'In voorraad', 'None', 'None']

內存

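You can do it like this.

First, select the <ul> you need with find():

ul = soup.find('ul', class_='product-brands')

Now check whether the <ul> exists. If it does, the page has an author, an illustrator, or both. In that case, collect the strings inside the <li> tags of the <ul> and return them as a list. You can use .stripped_strings to get every string inside a tag with the surrounding whitespace removed.

If the <ul> does not exist, find() returns None, so simply return None.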

if ul:
    return list(ul.stripped_strings)
return None
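
To make the check concrete without touching the network, here is a minimal sketch that runs the same find() and .stripped_strings logic against the markup from the question. The helper name brand_names, the WITHOUT_BRANDS snippet, and the use of the built-in html.parser are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Markup copied from the question (a page with both contributors).
WITH_BRANDS = """
<ul class="product-brands">
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
    </li>
    <li class="brand-item">
        <a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose  Palmer &amp; Reinette Lombard">Jose  Palmer &amp; Reinette Lombard</a>
    </li>
</ul>
"""

# Assumed stand-in for a page like page3, which has no <ul> at all.
WITHOUT_BRANDS = "<div class='product-info'><h1>Some title</h1></div>"


def brand_names(html):
    """Return the stripped strings inside ul.product-brands, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", class_="product-brands")
    if ul is None:  # find() returns None when nothing matches
        return None
    return list(ul.stripped_strings)


names_present = brand_names(WITH_BRANDS)
names_missing = brand_names(WITHOUT_BRANDS)
print(names_present)
print(names_missing)
```

Testing explicitly against None keeps the intent obvious: the branch is about a missing element, not about an element that happens to be empty.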

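Depending on the number of items in the returned list, I think it is easy to work out what you mentioned in the question:

The only way to tell whether an <li> belongs to the illustrator is whether it contains the text "(Illustreerder)".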
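
One possible way to do that split, assuming a page lists at most one author entry and one illustrator entry, is sketched below. The helper split_contributors and the constant ILLUSTRATOR_MARKER are hypothetical names; the function takes the list (or None) produced by the approach above and fills in 'None' as the default the question asks for.

```python
ILLUSTRATOR_MARKER = "(Illustreerder)"


def split_contributors(names, default="None"):
    """Split scraped names into (author, illustrator), using default when one is missing."""
    author = default
    illustrator = default
    for name in names or []:  # names may be None when the <ul> is absent
        if ILLUSTRATOR_MARKER in name:
            # Drop the marker so only the illustrator's name remains.
            illustrator = name.replace(ILLUSTRATOR_MARKER, "").strip()
        else:
            author = name
    return author, illustrator


both = split_contributors(["Zinelda McDonald (Illustreerder)", "Jose  Palmer & Reinette Lombard"])
author_only = split_contributors(["Jaco Jacobs"])
neither = split_contributors(None)
print(both)
print(author_only)
print(neither)
```

If a page can list several author entries, the last one wins in this sketch; collecting them into a list instead would be a small change.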

Here is code that returns the list of authors and illustrators if any of them exist, and None otherwise.

import requests
from bs4 import BeautifulSoup

# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'

# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'

# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'

# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'


def test(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    ul = soup.find('ul', class_='product-brands')

    # Return a list only if ul is not None
    if ul:
        return list(ul.stripped_strings)

    return None

print(test(page1))
print(test(page2))
print(test(page3))
print(test(page4))

['Zinelda McDonald (Illustreerder)', 'Jose  Palmer & Reinette Lombard']
['Jaco Jacobs']
None
['Jan de Wet']
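
The same None check applies to the other fields in the question's script: calling .text directly on the result of find() raises AttributeError when the element is missing. A small helper can centralise the default, as in the sketch below. The name text_or_default and the simplified markup are assumptions for illustration; only the selectors come from the question's code.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default="n/a"):
    """Return the tag's stripped text, or default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


# Hypothetical, simplified markup; the real pages contain much more.
IN_STOCK = """
<div class="value" itemprop="sku">9780799383874</div>
<div class="stock available"><span>In voorraad</span></div>
"""
NO_STOCK = '<div class="value" itemprop="sku">9780799383874</div>'

rows = []
for html in (IN_STOCK, NO_STOCK):
    soup = BeautifulSoup(html, "html.parser")
    isbn = text_or_default(soup.find("div", class_="value", itemprop="sku"))
    stock = text_or_default(soup.find("div", class_="stock available"))
    rows.append([isbn, stock])

print(rows)
```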


Edited on 2021-11-5
