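I'm very new to web scraping. I'm currently trying to scrape information on all water utilities from this site, which offers a dropdown of different regions, and write the output to CSV files.
The site's URL does not change; it stays the same every time a dropdown option is selected. So far my code (influenced by this Stack Overflow post) is able to select the first region from the options, but it doesn't seem to get any further. Here is what I have so far: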
from bs4 import BeautifulSoup
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retriving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    print("Starting output for the region: " + region)
    # Select all options from drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))
    print("Now constructing output for: " + region)
    # Select table and wait for data to populate
    selectOption.select_by_visible_text(region)
    time.sleep(4)
    # Select the table containing the data and select all rows
    table = browser.find_element_by_xpath("//*[@id='MainContent_gvUtilities']")
    print(table)
    table_rows = table.find_elements_by_xpath(".//tr")
    # Create a list for each column in the table with each column number
    utility_name = []  #0
    country = []  #2
    city = []  #3
    population = []  #4
    for row in table_rows:
        column_element = row.find_elements_by_xpath(".//td")
        utility_name.append(column_element[0])
        country.append(column_element[2])
        city.append(column_element[3])
        population.append(column_element[4])
    #Create a dictionary of all utilities for each region
    dict_output = {
        "Utility Name": utility_name,
        "Country": country,
        "City": city,
        "Population": population,
    }
    df = pd.DataFrame.from_dict(dict_output)
    df.to_csv(region, index = False)

browser.close()
browser.quit()
I get this error every time:
File "/home/ken/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=91.0.4472.77)
(Driver info: chromedriver=2.26.436382 (70eb799287ce4c2208441fc057053a5b07ceabac),platform=Linux 5.8.0-59-generic x86_64)
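I'm stuck here and can't figure out what I'm doing wrong, or what I should actually do to fix this error. Any help or pointers would be greatly appreciated!
Thanks!!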
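I can't seem to reproduce your error, but after running your code, here are a couple of things:
The revised code below reads the table with pd.read_html(browser.page_source) rather than walking the rows element by element.
There is a typo in the regions list: 'Latin America (including USA and Canada'
should be 'Latin America (including USA and Canada)'.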
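A typo like this is easy to miss, because select_by_visible_text only fails once the loop reaches the bad entry. As a minimal sketch (the helper name check_regions is hypothetical, not part of Selenium), the hard-coded names could be compared against the option texts actually present in the dropdown before the loop starts:

```python
import difflib


def check_regions(wanted, available):
    """Return {bad_name: closest_available_name_or_None} for every
    name in `wanted` that is not an exact match in `available`."""
    problems = {}
    for name in wanted:
        if name not in available:
            # Suggest the most similar available option, if any is close enough
            matches = difflib.get_close_matches(name, available, n=1)
            problems[name] = matches[0] if matches else None
    return problems


# Hypothetical sample data standing in for the dropdown's option texts
available = ['Africa', 'Latin America (including USA and Canada)', 'South Asia']
wanted = ['Africa', 'Latin America (including USA and Canada']

for bad, suggestion in check_regions(wanted, available).items():
    print("Not in dropdown: %r, did you mean %r?" % (bad, suggestion))
```

With Selenium, the available list could be read from the Select object already used in the code, for example [opt.text for opt in selectOption.options]. The dropdown may also contain a placeholder entry, so I would check the printed list once by eye.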
The full revised code:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(4)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    print("Starting output for the region: " + region)
    # Re-locate the dropdown on each pass, since the page may re-render after a selection.
    # find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium releases removed.
    selectOption = Select(browser.find_element(By.ID, 'MainContent_ddRegion'))
    print("Now constructing output for: " + region)
    # Choose the region and wait for the data to populate
    selectOption.select_by_visible_text(region)
    time.sleep(4)
    # Parse the first table in the page source, then drop its last row
    # and any column that contains NaN
    table = pd.read_html(browser.page_source)[0][:-1].dropna(axis=1)
    print(table)
    # Write one CSV file per region (the method is to_csv, not csv)
    table.to_csv(region + '.csv', index=False)

browser.close()
browser.quit()
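If you would rather keep the row-by-row approach from the question, two things in that loop look fragile to me, although I have not verified either against this site. First, the lists store WebElement objects rather than their text, and such references can go stale once the page re-renders. Second, a header row normally holds th cells, so indexing into an empty td list would raise an IndexError. A minimal sketch of the idea, assuming the cell text has already been copied into plain strings (the names rows_to_columns and WANTED are hypothetical):

```python
# Column name -> index of the cell in each table row (taken from the question)
WANTED = {"Utility Name": 0, "Country": 2, "City": 3, "Population": 4}


def rows_to_columns(rows, wanted=WANTED):
    """Turn a list of rows (each a list of cell strings) into a dict of columns.

    Rows with too few cells, such as a header row that has no td cells
    or a pager row, are skipped instead of raising IndexError.
    """
    columns = {name: [] for name in wanted}
    needed = max(wanted.values()) + 1
    for cells in rows:
        if len(cells) < needed:
            continue
        for name, idx in wanted.items():
            columns[name].append(cells[idx].strip())
    return columns


# Hypothetical sample rows: an empty header row, one data row, one pager row
sample = [[], ["Utility A", "x", "Kenya", " Nairobi ", "1000"], ["1 2 3"]]
print(rows_to_columns(sample))
```

With Selenium, the rows argument could be built right after the table is located, for example [[td.text for td in tr.find_elements(By.XPATH, ".//td")] for tr in table_rows], so that only plain strings are kept once the page changes. The resulting dict can be passed to pd.DataFrame in the same way as dict_output in the question.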