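I want to scrape multiple URLs that use 2 different HTML templates. I can scrape each template on its own without any problem, but I run into trouble when I try to combine the two scrapers. Here is my code: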
import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url1 = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'
page_url2 = 'https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286'
page_url_lst = {'url': [page_url1, page_url2], 'template': [1,2]}
page_url_df = pd.DataFrame(page_url_lst)

data = []
if page_url_df['template'] == 1:
    for url in page_url_df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        for e in soup.select('#tabs-publications em'):
            data.append({
                'author':e.previous.get_text(strip=True)[:-1],
                'title':e.get_text(strip=True),
                'journal':e.next_sibling.get_text(strip=True),
                'source': url
            })
else:
    for url_2 in page_url_df['url']:
        r_2 = requests.get(url_2)
        soup_2 = BeautifulSoup(r_2.text, 'lxml')
        for a in soup_2.find_all('span',{'class':'fac_citation'}):
            data.append({
                'author':a.find('b').get_text(),
                'title':a.find('i').get_text(strip=True),
                'journal':a.find('i').next_sibling.get_text(strip=True),
                'source': url_2
            })
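The logic is: if the 'template' column has the value 1, extract the data with the first template; otherwise extract it with the second template. However, this code raises the following error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks in advance!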
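The error can be reproduced without any scraping, which may help show where it comes from. The following is a minimal sketch, assuming a small stand-in DataFrame with placeholder example.com URLs (not the real pages): comparing a whole column to 1 produces one boolean per row, and `if` cannot reduce that to a single True or False.

```python
import pandas as pd

# Stand-in for page_url_df; the example.com URLs are placeholders (assumption).
df = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "template": [1, 2],
    }
)

# Comparing a column to a scalar yields a boolean Series, one value per row.
mask = df["template"] == 1
print(mask.tolist())  # [True, False]

# `if` needs a single truth value, so pandas raises instead of guessing.
error_message = None
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# The same mask can still be used to select the URLs for each template.
urls_template_1 = df.loc[mask, "url"].tolist()
urls_template_2 = df.loc[~mask, "url"].tolist()
print(urls_template_1, urls_template_2)
```

The check therefore has to happen per row (or per filtered group of rows) rather than once on the whole column.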
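If I understand you correctly, you want to create a new DataFrame based on page_url_df, choosing the parser for each row from its template value: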
import requests
import pandas as pd
from bs4 import BeautifulSoup

page_url1 = "https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory"
page_url2 = (
    "https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286"
)
page_url_lst = {"url": [page_url1, page_url2], "template": [1, 2]}
page_url_df = pd.DataFrame(page_url_lst)


def get_template_1(url):
    # Template 1: each publication title sits in an <em> inside #tabs-publications
    data = []
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for e in soup.select("#tabs-publications em"):
        data.append(
            {
                "author": e.previous.get_text(strip=True)[:-1],
                "title": e.get_text(strip=True),
                "journal": e.next_sibling.get_text(strip=True),
                "source": url,
            }
        )
    return data


def get_template_2(url):
    # Template 2: each publication is a <span class="fac_citation">
    data = []
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for a in soup.find_all("span", {"class": "fac_citation"}):
        data.append(
            {
                "author": a.find("b").get_text(),
                "title": a.find("i").get_text(strip=True),
                "journal": a.find("i").next_sibling.get_text(strip=True),
                "source": url,
            }
        )
    return data


all_data = []
# Check the template one row at a time instead of comparing the whole column
for _, row in page_url_df.iterrows():
    print("Getting", row["url"])
    if row["template"] == 1:
        all_data.extend(get_template_1(row["url"]))
    elif row["template"] == 2:
        all_data.extend(get_template_2(row["url"]))

df_out = pd.DataFrame(all_data)

# print sample data (to_markdown() needs the optional tabulate package)
print(df_out.head().to_markdown())
Prints:
| | author | title | journal | source |
|---|---|---|---|---|
| 0 | Hantsoo Liisa, Kornfield Sara, Anguera Montserrat C, Epperson C Neill | Inflammation: A proposed intermediary between maternal stress and offspring neuropsychiatric risk.[PMID30314641] | Biological Psychiatry 85(2):97-106, Jan 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 1 | Sierra Isabel, Anguera Montserrat C | Enjoy the silence: X-chromosome inactivation diversity in somatic cells.[PMID31108425] | Current Opinion in Genetics & Development 55:26-31, May 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 2 | Syrett Camille M, Anguera Montserrat C | When the balance is broken: X-linked gene dosage from two X chromosomes and female-biased autoimmunity.[PMID31125996] | Journal of Leukocyte Biology May 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 3 | Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohammed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M, Anguera Montserrat C, Williams Adam, Wherry E John, Henao-Mejia George | Long noncoding RNA regulates CD8 T cells in response to viral infection.[PMID31138702] | Proceedings of the National Academy of Sciences 116(24): 11916-11925, Jun 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 4 | Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C | Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases.[PMID30944248] | JCI Insight 4(7), Apr 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
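If more templates are added later, the if/elif chain could be replaced by a lookup table that maps each template number to its parser. The following is only a sketch of that idea: `parse_template_1`, `parse_template_2`, `PARSERS`, and `scrape_all` are hypothetical names, the two parsers are offline stand-ins that return fake rows instead of calling `requests`, and the example.com URLs are placeholders. In real use the table would point at `get_template_1` and `get_template_2`.

```python
import pandas as pd


def parse_template_1(url):
    # Hypothetical offline stand-in for get_template_1 (no network access).
    return [{"title": "row from template 1", "source": url}]


def parse_template_2(url):
    # Hypothetical offline stand-in for get_template_2 (no network access).
    return [{"title": "row from template 2", "source": url}]


# Map each template number to the function that knows how to parse it.
PARSERS = {1: parse_template_1, 2: parse_template_2}


def scrape_all(frame, parsers):
    """Run the parser registered for each row's template and collect the rows."""
    rows = []
    for url, template in zip(frame["url"], frame["template"]):
        parser = parsers.get(template)
        if parser is None:
            # Fail loudly rather than silently skipping an unknown template.
            raise ValueError(f"no parser registered for template {template!r}")
        rows.extend(parser(url))
    return pd.DataFrame(rows)


pages = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "template": [1, 2],
    }
)
result = scrape_all(pages, PARSERS)
print(result)
```

Unlike the if/elif version, which quietly ignores a row whose template is neither 1 nor 2, this sketch raises for an unregistered template; whether that is preferable depends on how the URL list is maintained.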