How do I scrape data from multiple pages of one website using Python and BeautifulSoup?

Posted by Helena in Dev

Helena

# -*- coding: utf-8 -*-
"""
Created on Fri Jun 29 10:38:46 2018

@author: Cinthia
"""

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
array = ['146-face', '153-palettes-sets', 'https://www.sociolla.com/147-eyes', 'https://www.sociolla.com/150-lips', 'https://www.sociolla.com/149-brows', 'https://www.sociolla.com/148-lashes']
base_url='https://www.sociolla.com/142-face'
uClient = uReq(base_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grab the product
kosmetik = page_soup.findAll("div", {"class":"col-md-3 col-sm-6 ipad-grid col-xs-12 productitem"})
print(len(kosmetik))

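I want to scrape data from this website. The code above only counts how many products are on the base URL. I don't know how to use the array so that the product data (for example description, image, price) is collected from every page I listed in it.

I am new to Python and don't know much about loops.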

Bertrand Martel

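You can find the root element of the table/grid, which here has id=product-list-grid, and extract the attributes that contain all the information you need (brand, link, category, name, price), plus the src attribute of the first <img> tag.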
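To see the attribute extraction in isolation, here is a minimal sketch that runs against a hand-written HTML fragment instead of the live site. The fragment and all of its values are made up for illustration; only the id, the product-item class, and the data-eec-* attribute names come from this answer. extract_products is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described above.
SAMPLE_HTML = """
<div id="product-list-grid">
  <div class="product-item" data-eec-brand="BrandA" data-eec-category="Nails"
       data-eec-href="/nails/1-polish" data-eec-name="Polish" data-eec-price="10000">
    <img src="/img/1.jpg"><img src="/img/1-alt.jpg">
  </div>
  <div class="product-item" data-eec-brand="BrandB" data-eec-category="Nails"
       data-eec-href="/nails/2-remover" data-eec-name="Remover" data-eec-price="25000">
    <img src="/img/2.jpg">
  </div>
</div>
"""

def extract_products(html):
    """Return one tuple per product found inside the grid (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("div", attrs={"id": "product-list-grid"})
    if grid is None:
        # No grid on this page, so there is nothing to extract.
        return []
    products = []
    for item in grid.find_all("div", attrs={"class": "product-item"}):
        img = item.find("img")  # first <img> inside the product block
        products.append((
            item.get("data-eec-brand"),     # brand
            item.get("data-eec-category"),  # category
            item.get("data-eec-href"),      # product link
            item.get("data-eec-name"),      # product name
            item.get("data-eec-price"),     # price
            img.get("src") if img else None,  # image link
        ))
    return products

products = extract_products(SAMPLE_HTML)
for product in products:
    print(product)
```

Using .get() instead of indexing means a product that lacks one of the attributes yields None for that field rather than raising KeyError. Whether the live pages ever omit an attribute is not something this thread confirms.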

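For pagination, it seems you can get the next page by adding p=<page number> to the URL, and when that page does not exist the site redirects to the first page. One workaround is to look at the response URL and check whether it is the same as the URL you requested. If it is the same, you can increment the page number; otherwise you have scraped all the pages.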

from bs4 import BeautifulSoup
import urllib.request

count = 1
url = "https://www.sociolla.com/142-nails?p=%d"

def get_url(url):
    req = urllib.request.Request(url)
    return urllib.request.urlopen(req)

expected_url = url % count
response = get_url(expected_url)

results = []

while (response.url == expected_url):
    print("GET {0}".format(expected_url))
    soup = BeautifulSoup(response.read(), "html.parser")

    products = soup.find("div", attrs = {"id" : "product-list-grid"})

    # extend (not append) keeps results a flat list of tuples
    results.extend(
        (
            t["data-eec-brand"],    #brand
            t["data-eec-category"], #category
            t["data-eec-href"],     #product link
            t["data-eec-name"],     #product name
            t["data-eec-price"],    #price
            t.find("img")["src"]    #image link
        )
        for t in products.find_all("div", attrs = {"class" : "product-item"})
    )

    count += 1
    expected_url = url % count
    response = get_url(expected_url)

print(results)

The results are stored in results, which is a list of tuples.
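The code above covers a single category. To cover every entry in the question's array, one option is to turn the entries into full URLs and run the same pagination rule once per category. The following is a minimal sketch, not tested against the live site. category_urls, scrape_category, scrape_all, the fetch_page callback, and the max_pages cap are all hypothetical names introduced here, and the sketch assumes the p=<page number> pattern and the redirect-to-first-page behaviour hold for every category.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.sociolla.com/"

# Entries copied from the question: a mix of bare slugs and full URLs.
array = ['146-face', '153-palettes-sets', 'https://www.sociolla.com/147-eyes',
         'https://www.sociolla.com/150-lips', 'https://www.sociolla.com/149-brows',
         'https://www.sociolla.com/148-lashes']

def category_urls(entries, root=SITE_ROOT):
    """Turn slugs and full URLs alike into absolute URLs."""
    # urljoin resolves a bare slug against root and leaves absolute URLs unchanged.
    return [urljoin(root, entry) for entry in entries]

def scrape_category(category_url, fetch_page, max_pages=100):
    """Collect the items from every page of one category.

    fetch_page(url) must return (final_url, items): the URL after any
    redirect, and the list of product tuples parsed from that page.
    max_pages is a safety cap (an assumption) in case the site never redirects.
    """
    items = []
    for page in range(1, max_pages + 1):
        expected_url = "{0}?p={1}".format(category_url, page)
        final_url, page_items = fetch_page(expected_url)
        if final_url != expected_url:
            # Redirected elsewhere: the requested page does not exist.
            break
        items.extend(page_items)
    return items

def scrape_all(entries, fetch_page):
    """Run scrape_category over every entry and return one flat list."""
    results = []
    for category_url in category_urls(entries):
        results.extend(scrape_category(category_url, fetch_page))
    return results

# Show the normalised category URLs; no network access happens here.
for url in category_urls(array):
    print(url)
```

Here fetch_page stands for whatever does the real work. For example, it could wrap get_url and the BeautifulSoup parsing from the answer above and return response.url together with the parsed tuples. Passing it in as a parameter keeps the looping logic separate from the network code, so the loop can be checked with a fake fetcher before it is pointed at the site.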


Edited on 2020-12-3
