Scraping data with Selenium from multiple JavaScript pages using the scrapy-selenium module

Karim Nabil

Hello, heroes of the modern world,

I am currently scraping this JS-based web page, https://golden.com/list-of-cryptocurrency-companies/, and this is the code I have implemented so far:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException


class ScrapperSpider(scrapy.Spider):
    name = 'scrapper'
    allowed_domains = ['golden.com']
    start_urls = ['https://golden.com/list-of-cryptocurrency-companies/']
    current_page = 1


    def __init__(self):
        
        chrome_path = which('chromedriver')
        self.driver = webdriver.Chrome(executable_path=chrome_path)  


    def parse(self, response):
        driver = self.driver 
        number_of_pages = 27

        for i in range(number_of_pages): 

            url = 'https://golden.com/list-of-cryptocurrency-companies/'
            driver.get(url + str(i+1))
            driver.set_window_size(1920, 1080)

            all_results = driver.find_element_by_xpath("//select[contains(@class, 'PageSize')]/option[3]").click()

            new_table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "NewTable__body")))

            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            import time
            time.sleep(5)

            driver.implicitly_wait(10)
                    # driver.find_element

            self.html = driver.page_source
         

            resp = Selector(text=self.html)
            for currency in resp.xpath("//div[@class='NewTable__body']/div"):
                exchange_name = currency.xpath('.//div[1]/div/div/div/span/a/span/text()').get()
                website = currency.xpath(".//div[3]/div/div/div/div/span/a/@href").get()

                industry_type = currency.xpath(".//div[4]/div/div/div/div")
                for industry in industry_type:
                    industry_1 = industry.xpath(".//div[1]/span/a/span/text()").get()
                    industry_2 = industry.xpath(".//div[2]/span/a/span/text()").get()
                    industry_3 = industry.xpath(".//div[3]/span/a/span/text()").get()
                    industry_4 = industry.xpath(".//div[4]/span/a/span/text()").get()
                    industry_5 = industry.xpath(".//div[5]/span/a/span/text()").get()


                    
                    location = currency.xpath(".//div[5]/div/div/div/div/div/span/a/span/text()").get()
                

                    yield {
                        'ex_name': exchange_name,
                        'url': website,
                        'industry_1': industry_1,
                        'industry_2': industry_2,
                        'industry_3': industry_3,
                        'industry_4': industry_4,
                        'industry_5': industry_5,
                        'location': location

                    }
            
        driver.close()   
        driver.quit()

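My main problem is that the browser goes from https://golden.com/list-of-cryptocurrency-companies/ to https://golden.com/list-of-cryptocurrency-companies/2 and then immediately jumps back to the original table, without scraping anything from any of the other pages. For the life of me, I cannot figure out what is happening, and I have been working on this for a whole week.

If anyone could help me here, it would be greatly appreciated, as I am really stuck.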

Sureshmani

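Here is sample code showing how to wait until the URL has changed to the requested page. It scrapes the company names from each page (along with the industry and location labels that share the same link class).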

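from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Assumes chromedriver is available on PATH, as in the question
driver = webdriver.Chrome()
number_of_pages = 27

for i in range(number_of_pages):
    url = 'https://golden.com/list-of-cryptocurrency-companies/' + str(i + 1)
    driver.get(url)
    # Wait up to 10 seconds for the browser URL to match the requested page
    WebDriverWait(driver, 10).until(EC.url_to_be(url))
    companies = driver.find_elements(By.XPATH, "//div[@class='QueryResults']//span[@class='TopicLink__text']")
    print("Printing from page#", i + 1)
    for company in companies:
        print(company.text)

# quit() closes every window and ends the WebDriver session
driver.quit()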

Here is the output:

Printing from page# 1
Temtum
CRYPTOCURRENCY
BLOCKCHAIN
Tortola
National Digital Asset Exchange Inc. (NDAX)
CRYPTOCURRENCY
...
Printing from page# 2
Dentacoin
CRYPTOCURRENCY
BLOCKCHAIN
HEALTHCARE
Netherlands
Waves Platform
...
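
The same URL wait could be folded back into the spider from the question. The following is only a sketch under a few assumptions: the helper names page_urls and iter_page_sources are hypothetical (they are not part of Scrapy, Selenium, or scrapy-selenium), the page numbering follows the base URL + 1..N pattern used in the loop above, and the NewTable__body class name is taken from the question's code rather than checked against the live site.

```python
from typing import Iterator, List

BASE_URL = "https://golden.com/list-of-cryptocurrency-companies/"


def page_urls(base_url: str, number_of_pages: int) -> List[str]:
    """Build the numbered page URLs the same way the loop above does (base + 1..N)."""
    return [base_url + str(page) for page in range(1, number_of_pages + 1)]


def iter_page_sources(driver, urls: List[str], timeout: float = 10) -> Iterator[str]:
    """Load each URL, wait for it to settle, and yield the rendered HTML.

    `driver` is an already-created Selenium WebDriver (hypothetical helper).
    """
    # Imported lazily so that page_urls() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    for url in urls:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Same check as in the sample code: the browser URL must match the request.
        wait.until(EC.url_to_be(url))
        # Assumption: the table container uses the class name from the question.
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "NewTable__body")))
        yield driver.page_source
```

Inside parse(), each yielded string could then be wrapped with Selector(text=html) and queried with the XPath expressions from the question, for example by looping over iter_page_sources(self.driver, page_urls(BASE_URL, 27)). Because both waits are explicit, the fixed time.sleep(5) from the question may no longer be needed, although that depends on how the site renders its table.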

Edited on 2021-01-25
