BeautifulSoup4 scraping cannot get beyond the first page of a website (Python 3.6)

刘爱莉

I am trying to scrape from the first page through page 14 of this website: https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All

Here is my code:

import requests as r
from bs4 import BeautifulSoup as soup
import pandas 

#make a list of all web pages' urls
webpages=[]
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page='+ str(i)
    webpages.append(root_url)
    print(webpages)

#start looping through all pages
for item in webpages:  
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

#find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

#export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    df.index.name = 'ArticleID'
    df.to_csv('example31.csv',encoding="utf-8")

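The output CSV file only contains the targeted information from the last page. When I print "webpages", it shows that the URLs of all the pages were correctly put into the list. What am I doing wrong? Thank you in advance!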

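You are overwriting the same output CSV file on every iteration of the loop, so only the last page survives. You can call .to_csv() in "append" mode to add the new data to the end of the existing file: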

df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)
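As a minimal sketch of the append approach, the snippet below writes each page's titles to the same file and emits the header row only when the file does not exist yet. The helper name append_page, the temporary directory, and the sample titles are assumptions made for illustration; they are not part of the original question. It also passes index=False, because in append mode every page's DataFrame would otherwise restart its index at 0 and the ArticleID column would repeat.

```python
import os
import tempfile

import pandas


def append_page(csv_path, titles):
    """Append one page of titles to csv_path (hypothetical helper).

    The header is written only if the file does not exist yet, so
    repeated calls produce a single header row followed by all rows.
    """
    df = pandas.DataFrame({'Title': titles})
    write_header = not os.path.exists(csv_path)
    df.to_csv(csv_path, mode='a', encoding='utf-8',
              header=write_header, index=False)


# Made-up data standing in for the titles scraped from two pages.
pages = [['Title A', 'Title B'], ['Title C']]

# Write into a fresh temporary directory so the file starts out absent.
csv_path = os.path.join(tempfile.mkdtemp(), 'example31.csv')
for page_titles in pages:
    append_page(csv_path, page_titles)

# Read the file back to confirm that every page was kept.
result = pandas.read_csv(csv_path)
print(result)
```

Note that append mode also keeps whatever is left in the file from a previous run, so the output file should be deleted (or given a new name) before the script is run again.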

Or, even better, collect the titles from all pages into a single list and dump it to CSV once, after the loop:

#start looping through all pages
titles = []
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    #find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]

    titles += [el.replace('\n', '') for el in title_list]

# export to csv file via pandas
dataset = [{'Title': title} for title in titles]
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example31.csv', encoding="utf-8")
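The collect-then-write pattern can be tried without touching the network. The following is only a sketch under stated assumptions: the function extract_titles and the inline HTML strings are hypothetical stand-ins for the real pages, and the markup is assumed to mirror the div class used in the question. It matches on the single class field-name-node-title and uses get_text(strip=True), which trims surrounding whitespace instead of only removing newlines.

```python
from bs4 import BeautifulSoup
import pandas


def extract_titles(html):
    """Return the stripped text of every title div in one page (hypothetical helper)."""
    page_soup = BeautifulSoup(html, 'html.parser')
    divs = page_soup.find_all('div', class_='field-name-node-title')
    return [div.get_text(strip=True) for div in divs]


# Made-up HTML standing in for two downloaded pages.
sample_pages = [
    '<div class="field field-name-node-title">\nFirst article\n</div>'
    '<div class="field field-name-node-title">\nSecond article\n</div>',
    '<div class="field field-name-node-title">\nThird article\n</div>',
]

# Accumulate titles across all pages before building the DataFrame.
titles = []
for html in sample_pages:
    titles.extend(extract_titles(html))

# Build the DataFrame once, so the index runs continuously over all pages.
df = pandas.DataFrame({'Title': titles})
df.index.name = 'ArticleID'
print(df)
```

Because the DataFrame is built once from the full list, ArticleID runs continuously across pages, and a single call to df.to_csv() afterwards writes the header exactly once.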


Edited on 2020-12-7
