How do I scrape a website with multiple pages and internal links into a Pandas DataFrame?

马克·阿尔穆拉

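I need to get the data for every company listed at the link below, along with everything behind each company's link. Each company's data should end up in a single row. My problem is that I am not sure exactly how to do this: I don't know which approach to take or where to start.

Here is the website: https://www.adgm.com/public-registers/fsra

I tried at least to pull the information into my code and print it from my IDE, but that failed and I don't understand why.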

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("https://www.adgm.com/public-registers/fsra")
soup = BeautifulSoup(res.content,'html.parser')
table  = soup.find_all('.every-accord')

for element in table:
    print(element.text)

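This is the code I have been trying. Each table row sits inside the 'every-accord' class that I am trying to get. It doesn't give me any errors, but I don't get any results either.

Thanks in advance for any help.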

Ajax1234

You can iterate over the containers:

import requests
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://www.adgm.com/public-registers/fsra').text, 'html.parser')
# One list per company container: the label/value cell texts, then the detail link and the company name
results = [[c.text for c in i.find_all('div', {'class':'col-sm-6'})]+[i.a['href'], i.find('div', {'class':'col-lg-5'}).text] for i in d.find_all('div', {'class':'every-accord'})]
# Skip the first label and drop the remaining field labels so only the values are left
no_headers = [[i for i in c[1:] if i not in {'Company Status', 'Address'}] for c in results]

Output:

[['160024', 'Active', 'Level 7, Al Sila Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/aarna-capital-limited', 'Aarna Capital Limited'], ['160007', 'Active', 'Unit 8, 6th floor Al Khatem Tower, Abu Dhabi Global Markets Square, Al Maryah Island Abu Dhabi, United Arab Emirates P.O. Box 764605', '/public-registers/fsra/fsf/aberdeen-asset-middle-east-limited', 'Aberdeen Asset Middle East Limited'], ['180041', 'Active', 'Floor 22, Al Sila Tower, Abu Dhabi Global Market Square, Al Maryah Island', '/public-registers/fsra/fsf/abu-dhabi-catalyst-partners-limited', 'Abu Dhabi Catalyst Partners Limited'], ['180021', 'Active', 'Unit 5, 6th Floor, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/ad-global-investors-limited', 'AD Global Investors Limited'], ['180039', 'Active', '3419, 34th Floor, Al Maqam Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/ad-investment-management-limited', 'AD Investment Management Limited'], ['170036', 'Active', '10th Floor, Al Sila Tower, ADGM Square, Al Maryah Island', '/public-registers/fsra/fsf/adcb-asset-management-ltd', 'ADCB Asset Management Ltd.'], ['160006', 'Active', 'Level 34, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/adcm-altus-investment-management-limited', 'ADCM Altus Investment Management Limited'], ['160005', 'Active', '33rd floor, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/adcorp-ltd', 'ADCORP Ltd'], ['180024', 'Active', 'Unit 10, Level 6, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/adnoc-reinsurance-limited', 'ADNOC Reinsurance Limited'], ['170025', 
'Active', 'Office 712, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', '/public-registers/fsra/fsf/ads-investment-solutions-limited', 'ADS Investment Solutions Limited']]
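
As for why the attempt in the question printed nothing: `find_all` treats a plain string as a tag name, so `'.every-accord'` looks for a tag literally called `.every-accord` and matches none. CSS selectors go through `select` instead. The snippet below is a minimal sketch of the difference; the inline HTML is a made-up stand-in for the real page, not its actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page is larger and more deeply nested.
html = """
<div class="every-accord"><a href="/a">First</a></div>
<div class="every-accord"><a href="/b">Second</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string is interpreted as a tag name, so this finds nothing.
by_tag_name = soup.find_all(".every-accord")
# select() takes a CSS selector, so ".every-accord" means "class every-accord".
by_css = soup.select(".every-accord")
# The find_all form that filters on the class attribute.
by_class = soup.find_all("div", class_="every-accord")

print(len(by_tag_name), len(by_css), len(by_class))
```

Either of the last two forms should return the containers that the loop in the question expected.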

Edit: formatting results into labeled columns:

new_results = [{**{j[i]:j[i+1] for i in range(0, len(j), 2)}, **{'link':a, 'name':b}} for *j, a, b in results]

Output:

[{'Financial Services Permission Number': '160024', 'Company Status': 'Active', 'Address': 'Level 7, Al Sila Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/aarna-capital-limited', 'name': 'Aarna Capital Limited'}, {'Financial Services Permission Number': '160007', 'Company Status': 'Active', 'Address': 'Unit 8, 6th floor Al Khatem Tower, Abu Dhabi Global Markets Square, Al Maryah Island Abu Dhabi, United Arab Emirates P.O. Box 764605', 'link': '/public-registers/fsra/fsf/aberdeen-asset-middle-east-limited', 'name': 'Aberdeen Asset Middle East Limited'}, {'Financial Services Permission Number': '180041', 'Company Status': 'Active', 'Address': 'Floor 22, Al Sila Tower, Abu Dhabi Global Market Square, Al Maryah Island', 'link': '/public-registers/fsra/fsf/abu-dhabi-catalyst-partners-limited', 'name': 'Abu Dhabi Catalyst Partners Limited'}, {'Financial Services Permission Number': '180021', 'Company Status': 'Active', 'Address': 'Unit 5, 6th Floor, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/ad-global-investors-limited', 'name': 'AD Global Investors Limited'}, {'Financial Services Permission Number': '180039', 'Company Status': 'Active', 'Address': '3419, 34th Floor, Al Maqam Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/ad-investment-management-limited', 'name': 'AD Investment Management Limited'}, {'Financial Services Permission Number': '170036', 'Company Status': 'Active', 'Address': '10th Floor, Al Sila Tower, ADGM Square, Al Maryah Island', 'link': '/public-registers/fsra/fsf/adcb-asset-management-ltd', 'name': 'ADCB Asset Management Ltd.'}, {'Financial Services Permission Number': '160006', 'Company Status': 'Active', 'Address': 'Level 34, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu 
Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/adcm-altus-investment-management-limited', 'name': 'ADCM Altus Investment Management Limited'}, {'Financial Services Permission Number': '160005', 'Company Status': 'Active', 'Address': '33rd floor, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/adcorp-ltd', 'name': 'ADCORP Ltd'}, {'Financial Services Permission Number': '180024', 'Company Status': 'Active', 'Address': 'Unit 10, Level 6, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/adnoc-reinsurance-limited', 'name': 'ADNOC Reinsurance Limited'}, {'Financial Services Permission Number': '170025', 'Company Status': 'Active', 'Address': 'Office 712, Al Khatem Tower, Abu Dhabi Global Market Square, Al Maryah Island, Abu Dhabi, United Arab Emirates', 'link': '/public-registers/fsra/fsf/ads-investment-solutions-limited', 'name': 'ADS Investment Solutions Limited'}]
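
Since the question asks for one row per company in a Pandas DataFrame, a list of dictionaries like the one above can be passed straight to `pd.DataFrame`. The following is only a sketch: `sample_records` holds two records copied from the output above (in practice `new_results` would be used directly), and `BASE_URL` is an assumption about how the relative links resolve.

```python
from urllib.parse import urljoin

import pandas as pd

# Assumed base for the relative links; adjust if the site resolves them differently.
BASE_URL = "https://www.adgm.com"

# Two records copied from the output above, standing in for new_results.
sample_records = [
    {
        "Financial Services Permission Number": "160024",
        "Company Status": "Active",
        "Address": "Level 7, Al Sila Tower, Abu Dhabi Global Market Square, "
                   "Al Maryah Island, Abu Dhabi, United Arab Emirates",
        "link": "/public-registers/fsra/fsf/aarna-capital-limited",
        "name": "Aarna Capital Limited",
    },
    {
        "Financial Services Permission Number": "160005",
        "Company Status": "Active",
        "Address": "33rd floor, Al Khatem Tower, Abu Dhabi Global Market Square, "
                   "Al Maryah Island, Abu Dhabi, United Arab Emirates",
        "link": "/public-registers/fsra/fsf/adcorp-ltd",
        "name": "ADCORP Ltd",
    },
]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(sample_records)
# Turn the relative links into absolute URLs so each company page can be requested later.
df["link"] = df["link"].map(lambda href: urljoin(BASE_URL, href))

print(df[["name", "link"]])
```

The absolute URLs in the `link` column are the natural starting point for requesting each company's own page, whose markup would need to be inspected separately before extracting anything from it.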
