Below is my code for scraping this webpage. Of all the URLs on the page, I only need the ones that carry further information about the job postings, e.g. the company-name URLs such as "Abbott", "AbbVie", "Affymetrix", and so on.
import requests
import pandas as pd
import re
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
list = ['#medical-device','#engineering','#recruitment','#job','#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in list]
for info in list_of_pages:
    pages = requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    tags = [div.p for div in soup.find_all('div', attrs={'class': 'fusion-text'})]
    for m in tags:
        try:
            links = [link['href'] for link in tags]
        except KeyError:
            pass
        print(links)
The output I get is just a series of blank lists, like this:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
What should I add or change in the code above so that it scrapes these URLs, and then the further information inside them?
Thanks!!
What I noticed is that the anchored URLs do not really isolate the part of the HTML you actually want, so you end up grabbing every instance of <div class='fusion-text'> on the page.
The code below will retrieve all the URLs you are after:
import requests
from bs4 import BeautifulSoup

# Get webpage
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')

# Grab all URLs under each section
for section in ['medical-device', 'engineering', 'recruitment', 'job', 'linkedin']:
    subsection = soup.find('div', attrs={'id': section})
    links = [a['href'] for a in subsection.find_all('a')]
    print("{}: {}".format(section, links))
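To see why the `id`-based lookup works where the class-based one returned empty lists, here is a self-contained sketch run against a fabricated HTML snippet. The `div` ids, company names, and hrefs below are assumptions standing in for the real page's markup, which may differ:

```python
from bs4 import BeautifulSoup

# Fabricated stand-in for the directory page: each section is a <div>
# whose id matches the anchor fragment in the original URLs.
html_doc = """
<div id="medical-device" class="fusion-text">
  <p><a href="https://www.abbott.com">Abbott</a></p>
  <p><a href="https://www.abbvie.com">AbbVie</a></p>
</div>
<div id="engineering" class="fusion-text">
  <p><a href="https://example.com/eng">Example Eng</a></p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
for section in ['medical-device', 'engineering']:
    # Finding by id narrows the search to one section's <div>,
    # instead of collecting every 'fusion-text' div on the page.
    subsection = soup.find('div', attrs={'id': section})
    links = [a['href'] for a in subsection.find_all('a')]
    print("{}: {}".format(section, links))
```

With the links in hand, fetching "more information" from each company URL is then a matter of requesting each `href` in turn and parsing that page the same way.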