My single-page scraper:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for h3 in soup.select('h3.list_h3'):
    job_title = h3.get_text(strip=True)
    company = h3.find_next(class_="heading_secondary").get_text(strip=True)
    salary = h3.find_next(class_="salary_amount").get_text(strip=True)
    location = h3.find_next(class_="list_city").get_text(strip=True)
    print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))
    all_data.append({
        'Job Title': job_title,
        'Company': company,
        'Salary': salary,
        'Location': location
    })
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
#tips = sns.load_dataset('data.csv')
#print(tips)
gives me a csv file, but with only 50 rows. I want to scrape all the pages. At first I tried to find the pagination links in the HTML via the class 'prev_next', but the BACK and FORWARD links are identical apart from their href. So I decided to loop over a range and change the page number instead:
import requests
import pandas as pd
from bs4 import BeautifulSoup
#url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=1'
#soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for i in range(1, 9):
    url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=' + str(i)
    print(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for h3 in soup.select('h3.list_h3'):
        try:
            job_title = h3.get_text(strip=True)
            company = h3.find_next(class_="heading_secondary").get_text(strip=True)
            salary = h3.find_next(class_="salary_amount").get_text(strip=True)
            location = h3.find_next(class_="list_city").get_text(strip=True)
            print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))
        except AttributeError:
            all_data.append({
                'Job Title': job_title,
                'Company': company,
                'Salary': salary,
                'Location': location
            })
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
After running this code it saves only 5 rows, ten times fewer than my single-page scraper. How would you loop over the pages? They run from 1 to 8.
And how can I clean up the salary values? They are strings containing 'Nuo 2700', 'Iki 2500', or a range of two numbers such as '1000-3000'. I want the Salary column as integers so I can do some plotting with Seaborn.
You have indented the `all_data.append` call into the `except` block, so control only reaches it when an exception is raised. Running the following script, with the append moved back into the `try` block, gives about 365 rows in the csv file:
import requests
import pandas as pd
from bs4 import BeautifulSoup
#url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=1'
#soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for i in range(1, 9):
    url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=' + str(i)
    print(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for h3 in soup.select('h3.list_h3'):
        try:
            job_title = h3.get_text(strip=True)
            company = h3.find_next(class_="heading_secondary").get_text(strip=True)
            salary = h3.find_next(class_="salary_amount").get_text(strip=True)
            location = h3.find_next(class_="list_city").get_text(strip=True)
            print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))
            all_data.append({
                'Job Title': job_title,
                'Company': company,
                'Salary': salary,
                'Location': location
            })
        except AttributeError:
            pass
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
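As for the salary-cleaning part of the question: a minimal sketch, assuming the strings take the three forms mentioned ('Nuo 2700', 'Iki 2500', or '1000-3000'); `parse_salary` is a hypothetical helper that extracts the digits and takes the midpoint of a range:

```python
import re

def parse_salary(text):
    # Strip spaces first so thousands written as '1 000' parse as one number,
    # then extract every run of digits.
    numbers = [int(n) for n in re.findall(r'\d+', text.replace(' ', ''))]
    if not numbers:
        return None
    # 'Nuo 2700' / 'Iki 2500' yield a single number; '1000-3000' yields two,
    # for which we take the integer midpoint.
    return sum(numbers) // len(numbers)
```

This could then be applied before plotting with `df['Salary'] = df['Salary'].apply(parse_salary)`, leaving the column numeric for Seaborn.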