Unable to extract data from a website with Scrapy, although the XPath works in the XPath Helper extension

Posted by yeopcp in Dev

yeopcp

I created a spider to extract data from a site, for example https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21

Here is my code:

import scrapy
from totoprintasp.items import Result


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3

    def parse(self, response):
        # print(response.body)

        items = []
        # print(response.body)
        for each in response.xpath("/html/body/div/center/table/tbody"):
            item = Result()
            drawDate = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[1]/span/font/b/text()").extract() 
            drawNo = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[2]/span/b/font/text()").extract()
            gameType = each.xpath(
                "tr[4]/td/span/font/text()").extract()
            firstPrize = each.xpath(
                "tr[5]/td/table[1]/tbody/tr[2]/td[1]/span/b/font/text()").extract()

            item['drawDate'] = drawDate
            item['drawNo'] = drawNo
            item['gameType'] = gameType
            item['firstPrize'] = firstPrize
            items.append(item)
            yield item

It does not extract anything. I am running the command scrapy runspider totoprint.py and have set the values

FEED_URI = 'results.json'

FEED_FORMAT = 'json'

in my settings.py file,

so the results should be written to a JSON file.
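For reference, Scrapy 2.1 and later group these two settings into a single FEEDS dictionary, and the older FEED_URI / FEED_FORMAT pair is deprecated there. A minimal sketch of the equivalent settings.py entry, assuming the same results.json target as above:

```python
# settings.py (sketch): FEEDS maps each output target to its export options.
# 'results.json' and the 'json' format mirror the FEED_URI / FEED_FORMAT
# values shown above; nothing else about the project is assumed.
FEEDS = {
    'results.json': {
        'format': 'json',
    },
}
```

Either form only controls where items are written; it does not affect whether the XPath expressions match anything.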

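Strangely, nothing shows up and nothing gets extracted. I tried different variations, and even changed .extract() to .get().

The XPath itself works, because I tried it in the XPath Helper extension in my Chrome browser.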


I would appreciate any help or advice.

Murat Demir

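I rewrote your script, but you will have to adapt it to your own project. The problem here is that you are looking for a single tbody and a single child under it, but the page contains many tbody elements.

As I understand it, you want gameType as a list and the other fields as strings. I get the following output: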
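A common reason for an XPath to match in a browser extension but not in Scrapy is that browsers add tbody elements to tables when building the DOM, while the HTML the spider downloads may not contain them. Whether that is what happens on this particular page is an assumption. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for Scrapy's selectors, with made-up markup, only to show why a class-based lookup is less fragile than an absolute path:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the raw server response.
# Note that the table has no <tbody>, unlike what a browser displays.
RAW_HTML = """
<html><body>
  <table>
    <tr><td class="dataDD">Date:30/05/2021</td></tr>
    <tr><td class="dataDD">DrawNo. 5291/21</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Absolute path as copied from a browser, including the implied <tbody>.
with_tbody = root.findall("./body/table/tbody/tr/td")

# Class-based lookup that does not depend on the exact nesting.
by_class = [td.text for td in root.findall(".//td[@class='dataDD']")]

print(len(with_tbody))  # number of cells matched by the absolute path
print(by_class)         # text of the cells matched by class
```

In this sketch the absolute path matches nothing, while the class-based expression finds both cells.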

|------------------|-----------------|----------------------------------------|------------|
| drawDate         | drawNo          | gameType                               | firstPrize |
|------------------|-----------------|----------------------------------------|------------|
| Date:30/05/2021  | DrawNo. 5291/21 | TOTO 4D,TOTO 4D ZODIAC,TOTO 5D,TOTO 6D | 4800       |
|------------------|-----------------|----------------------------------------|------------|

By the way, you do not need a for loop over the URLs. parse is called once for each URL. So here is the script:

import scrapy


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my']  # domain only, without the page path
    start_urls = generate_start_urls()
    download_delay = 3
    custom_settings = {
        # Disable the robots.txt check, because the site's rules do not let the spider in.
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Both values share the same class, so they can be scraped together.
        drawDate, drawNo = response.xpath('//*[@class="dataDD"]//text()').extract()
        gameType = response.xpath('//*[@class="tit4D"]//text()').extract()
        # Your script only wants the first prize, hence the [1] in the XPath.
        firstPrize = response.xpath('(//*[@class="dataResultA"])[1]//text()').get()
        yield {
            # The raw text contains \t, \n and \r characters; remove them with replace.
            'drawDate': drawDate.replace("\t", "").replace("\n", "").replace("\r", ""),
            'drawNo': drawNo.replace("\t", "").replace("\n", "").replace("\r", ""),
            'gameType': gameType,
            'firstPrize': firstPrize,
        }
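The chained replace calls in the script above could also be collected into one small helper. This is only a sketch: the name clean and the sample strings are made up for illustration, and the helper removes the same three characters as the original calls.

```python
# Translation table that deletes tabs, newlines and carriage returns.
_STRIP_CONTROL = str.maketrans("", "", "\t\n\r")


def clean(text):
    """Remove tab, newline and carriage-return characters from text.

    None is passed through unchanged, since .get() returns None
    when an XPath matches nothing.
    """
    if text is None:
        return None
    return text.translate(_STRIP_CONTROL)


# Hypothetical raw text nodes, padded the way scraped table cells often are.
raw_date = "\r\n\t\tDate:30/05/2021\r\n"
raw_draw = "\r\n\t\tDrawNo. 5291/21\r\n"

print(clean(raw_date))
print(clean(raw_draw))
```

Inside parse, the two fields would then be written as clean(drawDate) and clean(drawNo).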

I think the script I wrote is what you want.
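Because parse is called once per URL, scraping several draws only requires a longer list of start URLs. A minimal sketch, assuming the draw numbers are passed in as an argument; the second draw number is made up for illustration:

```python
def generate_start_urls(draw_nums):
    """Build one result-page URL per draw number."""
    base = 'https://www.sportstoto.com.my/result_print.asp?drawNo={}'
    return [base.format(draw_num) for draw_num in draw_nums]


# '5292/21' is a hypothetical second draw, not taken from the site.
urls = generate_start_urls(['5291/21', '5292/21'])
for url in urls:
    print(url)
```

Assigning the returned list to start_urls makes Scrapy schedule one request per entry, each handled by its own parse call.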


Edited on 2021-08-31
