I'm currently trying to access some data from the ABS website (Table 5). The name of the Excel file changes with each release, so I want to automate downloading it and reading it into a dataframe.
Progress so far:
Thanks to BeautifulSoup, I can get the list of URLs on the page.
#####Step 1: start by importing all of the necessary packages#####
import requests #requesting URLs
import urllib.request #requesting URLs
import pandas as pd #for simplifying data operations (e.g. creating dataframe objects)
from bs4 import BeautifulSoup #for web-scraping operations
#####Step 2: connect to the URL in question for scraping#####
url = 'https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/latest-release'
response = requests.get(url) #Connect to the URL using the "requests" package
response #a successful request shows <Response [200]>
#####Step 3: read in the URL via the "BeautifulSoup" package#####
soup = BeautifulSoup(response.text, 'html.parser')
#####Step 4: html print#####
for link in soup('a'):
    print(link.get('href'))
##how to get the link to table 5?##
**url = ?**
##last step to save into data frame##
ws = pd.read_excel(url, sheet_name='Payroll jobs index-SA4', skiprows=5)
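As an aside, the href extraction in step 4 can also be sketched with only the standard library's `html.parser`, shown here on an embedded snippet rather than the live page (the div class name is taken from the real ABS markup; the sample filenames are placeholders):

```python
from html.parser import HTMLParser

# Sketch: collect <a href> values the way step 4 does, using only the stdlib.
# The sample HTML below stands in for the live ABS page.
sample_html = """
<div class="abs-data-download-right"><a href="/6160055001_DO004.xlsx">Download</a></div>
<div class="abs-data-download-right"><a href="/6160055001_DO005.xlsx">Download</a></div>
"""

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

parser = LinkCollector()
parser.feed(sample_html)
print(parser.hrefs)  # → ['/6160055001_DO004.xlsx', '/6160055001_DO005.xlsx']
```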
You can find the div class associated with the XLSX downloads on the page, use the find_all
method to return the list of matching elements, and look up the href at index 1:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/latest-release'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
url = soup.find_all("div", class_="abs-data-download-right")[1].find("a")['href']
pd.read_excel(url, sheet_name='Payroll jobs index-SA4', skiprows=5, engine='openpyxl')
To find all the URLs:
urls = soup.find_all("div", class_="abs-data-download-right")
for i in urls:
    print(i.find("a")['href'])
Output:
https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/week-ending-31-july-2021/6160055001_DO004.xlsx
https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/week-ending-31-july-2021/6160055001_DO005.xlsx
....
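Since the filename changes each release, relying on a fixed list index can silently break if ABS reorders the download links. A minimal sketch of a sturdier selection, assuming the `_DO005` suffix continues to identify the same table across releases (that suffix appears in the output above, but its stability is an assumption worth re-checking):

```python
# Pick the desired link by filename suffix instead of list position.
# Assumption: ABS keeps using "_DO005.xlsx" for this table in each release.
hrefs = [
    "https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/week-ending-31-july-2021/6160055001_DO004.xlsx",
    "https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/week-ending-31-july-2021/6160055001_DO005.xlsx",
]

def pick_table(hrefs, suffix="_DO005.xlsx"):
    """Return the first href ending with the table's suffix, or None if absent."""
    return next((h for h in hrefs if h.endswith(suffix)), None)

table5_url = pick_table(hrefs)
print(table5_url)  # the _DO005 link
```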