Below I have code that pulls records from craigslist. Everything works great, but I need to be able to go to the next set of records and repeat the same process; being new to programming, though, I'm stuck. From looking at the page code, it appears that I should be clicking the arrow button contained in the span here until it contains no href:
<a href="/search/syp?s=120" class="button next" title="next page">next > </a>
I was thinking that maybe this was a loop within a loop, but I suppose it could also be a try/except situation. Does that sound right? How would you implement it?
import requests
from urllib.request import urlopen
import pandas as pd

response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")

soup = BeautifulSoup(response.text, "lxml")

listings = soup.find_all('li', class_="result-row")

base_url = 'https://nh.craigslist.org/d/computer-parts/search/'
next_url = soup.find_all('a', class_="button next")

dates = []
titles = []
prices = []
hoods = []

while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles': titles, 'Price': prices, 'Location': hoods})

#write to a file
listings_df.to_csv("craigslist_listings.csv")
For each page you scrape, you can find the next url to crawl and add it to a list.

This is how I would do it, without changing your code too much. I added some comments so you can see what's going on, but leave me a comment if you need any extra explanation:
import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'

urls = []
urls.append(base_url)

dates = []
titles = []
prices = []
hoods = []

while len(urls) > 0:  # while we have urls to crawl
    print(urls)
    url = urls.pop(0)  # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    next_url = soup.find('a', class_="button next")  # finds the next url to crawl
    if next_url:  # if a next button was found
        urls.append(base_search_url + next_url['href'])  # adds the next url to the list of urls to crawl

    listings = soup.find_all('li', class_="result-row")  # get all current url listings

    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles': titles, 'Price': prices, 'Location': hoods})

#write to a file
listings_df.to_csv("craigslist_listings.csv")
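One detail worth noting, tying back to your own observation that the arrow button stops carrying an href on the last page: the if next_url check above only tests that the tag exists. If Craigslist were to render a next button with an empty or missing href on the final page, the loop would append a bad url. A slightly stricter guard (a minimal sketch, using a hypothetical next_button variable and assuming the last page behaves the way you described) would be:

next_button = soup.find('a', class_="button next")
# follow the link only if the tag exists AND actually carries a non-empty href;
# Tag.get('href') returns None instead of raising when the attribute is absent
if next_button and next_button.get('href'):
    urls.append(base_search_url + next_button['href'])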
Edit: you also forgot to import BeautifulSoup in your code; I added it in my answer.

Edit2: you only need to find the first instance of the next button, since the page can (and in this case does) contain more than one next button.

Edit3: in order to crawl computer parts, base_url should be changed to the one present in this code.
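To make Edit2 concrete: find_all returns a list of every matching tag, while find returns just the first match (or None), which is why the loop above uses find for the next button. A minimal sketch with hypothetical variable names:

all_next = soup.find_all('a', class_="button next")  # every "next" button on the page (a list)
first_next = soup.find('a', class_="button next")    # only the first one, or None if there is no match
# find(...) behaves like find_all(..., limit=1)[0] whenever at least one match exists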