How can I download only the PDF documents under the 'Design Review' header from the URL below, using Selenium in Python?
https://platform.sustain-cert.com/public-project/2756
The Design Review header can be anywhere on the web page (top, middle, or bottom), and there can be many other unique headers besides it.
Here is one approach you could try:
import time

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = Chrome()
url = "https://platform.sustain-cert.com/public-project/2756"
driver.get(url)

# Wait until the file tiles are rendered, then count all of them.
files = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div.MuiBox-root.css-16uqhx7')))
print(f"total files: {len(files)}")

# The section headers are the h6 elements inside the main container.
container = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.MuiContainer-root.MuiContainer-maxWidthLg.css-got2s4')))
categories = container.find_elements(By.CSS_SELECTOR, 'div>h6')

for category in categories:
    if category.text == "Design Review":
        # The file tiles of a section live under the parent of its h6 header.
        design_files = category.find_element(By.XPATH, "parent::*").find_elements(By.CSS_SELECTOR, 'div.MuiBox-root.css-16uqhx7')
        print(f"total files under Design Review:: {len(design_files)}")
        delay = 5
        for file in design_files:
            # The first line of a tile's text is the file name in parentheses.
            file_detail = file.text.split('\n')
            if file_detail[0].endswith('.pdf)'):
                print("pdf files under Design Review:")
                print(file_detail[0].replace('(', '').replace(')', ''))
                # click button to download the pdf file
                file.find_element(By.TAG_NAME, 'button').click()
                # give the current download time to finish before the next one
                time.sleep(delay)
                delay += 10
output:
total files: 12
total files under Design Review:: 6
pdf files under Design Review:
03 Deviation Request Form-Zengjiang wind power project-20220209-V01.pdf
pdf files under Design Review:
20220901_GS4GG VAL FVR_Yunxiao Wind_clean.pdf
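The name check in the code above is case-sensitive and removes every parenthesis, including any that are part of the file name itself. A minimal sketch of a slightly stricter filter follows. It assumes, as the code above does, that the first line of a tile's text is the file name wrapped in one pair of parentheses. `pdf_name_from_tile_text` is a hypothetical helper, not part of Selenium.

```python
import re

# Assumption: the first line of a tile's text looks like "(some name.pdf)".
_NAME_IN_PARENS = re.compile(r"^\((?P<name>.+)\)$")


def pdf_name_from_tile_text(tile_text):
    """Return the PDF file name found in a tile's text, or None if it is not a PDF."""
    first_line = tile_text.strip().split("\n")[0].strip()
    match = _NAME_IN_PARENS.match(first_line)
    # Strip only the outer pair of parentheses; fall back to the raw line.
    name = match.group("name") if match else first_line
    # Compare case-insensitively so ".PDF" is accepted as well.
    return name if name.lower().endswith(".pdf") else None
```

Inside the loop this could be used as `name = pdf_name_from_tile_text(file.text)`, clicking the download button only when `name` is not `None`.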
A few things to note:
We need the files under the Design Review section, so we first locate the section headers using the h6 tag.
We get all the h6 tags and pick only the one with the Design Review text.
From the parent of that h6 tag, we find all the files and store them in the variable design_files.
Now we have all the files under Design Review, and we can easily pick out the ones that end with .pdf.
Downloading the files takes a bit of time, so we add an incremental delay to wait for the current file to finish downloading before starting the next one.
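A fixed or growing sleep either wastes time or is too short for a large file. One possible alternative, sketched below, is to watch the download folder and continue as soon as a new finished file appears. It assumes Chrome's behaviour of writing in-progress downloads with a `.crdownload` suffix. `finished_files` and `wait_for_new_download` are hypothetical helper names, and the folder you pass in is assumed to be the one Chrome saves to.

```python
import time
from pathlib import Path

# Assumption: Chrome marks in-progress downloads with this suffix.
PARTIAL_SUFFIXES = {".crdownload"}


def finished_files(download_dir):
    """Return the names of fully downloaded files in download_dir."""
    return {
        p.name
        for p in Path(download_dir).iterdir()
        if p.is_file() and p.suffix not in PARTIAL_SUFFIXES
    }


def wait_for_new_download(download_dir, already_present, timeout=120.0, poll_interval=0.5):
    """Poll download_dir until a finished file that was not there before shows up.

    Returns the set of new file names, or an empty set if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        new_files = finished_files(download_dir) - set(already_present)
        if new_files:
            return new_files
        if time.monotonic() >= deadline:
            return set()
        time.sleep(poll_interval)
```

In the loop, you would take `before = finished_files(download_dir)` just before clicking the button and call `wait_for_new_download(download_dir, before)` in place of `time.sleep(delay)`. To control which folder Chrome saves to, one option is to set the `download.default_directory` preference through `ChromeOptions.add_experimental_option("prefs", {...})` before creating the driver.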
I hope this answers your problem.