BeautifulSoup 用于具有不同模板的多个 URL

曼加拉保尔

我想用 2 个不同的 HTML 模板抓取多个 URL。我可以毫无问题地自行抓取每个 HTML，但是在尝试组合两个抓取器时遇到了问题。下面是我的代码：

import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url1 = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'
page_url2 = 'https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286'
page_url_lst = {'url': [page_url1, page_url2], 'template': [1,2]}
page_url_df = pd.DataFrame(page_url_lst)

data = []
if page_url_df['template'] == 1:
    for url in page_url_df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        for e in soup.select('#tabs-publications em'):
            data.append({
                'author':e.previous.get_text(strip=True)[:-1],
                'title':e.get_text(strip=True),
                'journal':e.next_sibling.get_text(strip=True),
                'source': url
            })
else:
    for url_2 in page_url_df['url']:
        r_2 = requests.get(url_2)
        soup_2 = BeautifulSoup(r_2.text, 'lxml')
        for a in soup_2.find_all('span',{'class':'fac_citation'}):
            data.append({
                'author':a.find('b').get_text(),
                'title':a.find('i').get_text(strip=True),
                'journal':a.find('i').next_sibling.get_text(strip=True),
                'source': url_2
            })

这里的逻辑如果“模板”列返回值 1，则使用第一个模板提取数据，否则使用第二个模板提取数据。但是，此代码返回此错误：The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

先感谢您！

安德烈·凯塞利

如果我理解正确，您想基于以下内容创建新数据框page_url_df：

import requests
import pandas as pd
from bs4 import BeautifulSoup


page_url1 = "https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory"
page_url2 = (
    "https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286"
)
page_url_lst = {"url": [page_url1, page_url2], "template": [1, 2]}
page_url_df = pd.DataFrame(page_url_lst)


def get_template_1(url):
    data = []
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for e in soup.select("#tabs-publications em"):
        data.append(
            {
                "author": e.previous.get_text(strip=True)[:-1],
                "title": e.get_text(strip=True),
                "journal": e.next_sibling.get_text(strip=True),
                "source": url,
            }
        )
    return data


def get_template_2(url):
    data = []
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for a in soup.find_all("span", {"class": "fac_citation"}):
        data.append(
            {
                "author": a.find("b").get_text(),
                "title": a.find("i").get_text(strip=True),
                "journal": a.find("i").next_sibling.get_text(strip=True),
                "source": url,
            }
        )
    return data


all_data = []
for _, row in page_url_df.iterrows():
    print("Getting", row["url"])
    if row["template"] == 1:
        all_data.extend(get_template_1(row["url"]))
    elif row["template"] == 2:
        all_data.extend(get_template_2(row["url"]))


df_out = pd.DataFrame(all_data)

# print sample data
print(df_out.head().to_markdown())

印刷：

	作者	标题	杂志	资源
0	老鼠丽莎、康菲尔德萨拉、安格拉蒙特塞拉特 C、埃普森 C 尼尔	炎症：母亲压力和后代神经精神风险之间的建议中介。[PMID30314641]	生物精神病学 85（2）：97-106，2019 年 1 月。	https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
1	塞拉伊莎贝尔，安格拉蒙特塞拉特 C	享受沉默：体细胞中 X 染色体失活的多样性。[PMID31108425]	遗传学与发育的当前观点 55：26-31，2019 年 5 月。	https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
2	Syrett Camille M，安格拉蒙特塞拉特 C	当平衡被打破时：来自两条 X 染色体的 X 连锁基因剂量和偏向于女性的自身免疫。[PMID31125996]	白细胞生物学杂志 2019 年 5 月。	https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
3	Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohammed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M，安格拉·蒙特塞拉特 C，威廉姆斯·亚当，Wherry E John，Henao-Mejia George	长链非编码 RNA 调节 CD8 T 细胞以响应病毒感染。[PMID31138702]	美国国家科学院院刊 116(24): 11916-11925，2019 年 6 月。	https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
4	Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C	T细胞中改变的X染色体失活可能会促进性别偏见的自身免疫性疾病。[PMID30944248	JCI 洞察 4(7)，2019 年 4 月。	https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。

编辑于 2022-08-15

我来说两句

0 条评论

登录后参与评论

上一篇：如何计算字符串编写较少代码中的大写和小写字母？

Beautifulsoup链接（URL）具有特殊字符

Wordpress URL路由，具有不同模板的多个永久链接

BeautifulSoup 用于具有不同模板的多个 URL

BeautifulSoup 用于具有不同模板的多个 URL

构建类似于Jarvis的本地语言应用程序

在 Avalonia 中是否有带有柱子的 TreeView 或类似的东西？

Qt Creator Windows 10 - “使用 jom 而不是 nmake”不起作用

SQL Server中的非确定性数据类型

使用next.js时出现服务器错误，错误：找不到react-redux上下文值；请确保组件包装在<Provider>中

Swift 2.1-对单个单元格使用UITableView

Hashchange事件侦听器在将事件处理程序附加到事件之前进行侦听

HttpClient中的角度变化检测

如何了解DFT结果

错误：找不到存根。请确保已调用spring-cloud-contract：convert

Embers js中的更改侦听器上的组合框

在Wagtail管理员中，如何禁用图像和文档的摘要项？

如何避免每次重新编译所有文件？

Java中的循环开关案例

ng升级性能注意事项

Swift中的指针替代品？

如何使用geoChoroplethChart和dc.js在Mapchart的路径上添加标签或自定义值？

使用分隔符将成对相邻的数组元素相互连接

在同一Pushwoosh应用程序上Pushwoosh多个捆绑ID

ggplot：对齐多个分面图-所有大小不同的分面

完全禁用暂停（在内核级别？-必须与使用的DE和登录状态无关！）