Python Scrapy SitemapSpider回调未被调用

Krostofski

I've read the documentation for the SitemapSpider class here: https://scrapy.readthedocs.io/en/latest/topics/spiders.html#sitemapspider

Here is my code:

import scrapy
from scrapy import Request


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    # if I comment this out, then the parse function should be called by default for every link, but it doesn't
    sitemap_rules = [('/Product', 'parse_product_url'), ('product','parse_product_url')]
    sitemap_follow = ['/newegg_sitemap_product', '/Product']

    def parse(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse " + response.url)
        self.this_function_does_not_exist()
        yield Request(response.url, callback=self.some_callback)

    def some_callback(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from some_callback " + response.url)
        self.this_function_does_not_exist()

    def parse_product_url(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse_product_url " + response.url)
        self.this_function_does_not_exist()

This can be run successfully with scrapy installed. Run pip install scrapy from the working directory to get scrapy, then execute scrapy crawl newegg.

My question is: why aren't any of these callbacks being called? The documentation claims that the callbacks defined in sitemap_rules get called, and that if I comment sitemap_rules out, parse() should be called by default, but it still never gets called. Is the documentation just 100% wrong? I'm checking the log file I set up and nothing is being written to it. I even set the file's permissions to 777. Furthermore, I'm calling a function that doesn't exist, which would raise an error and thereby prove the callback actually ran, yet no error occurs. What am I doing wrong?

Paul Tremberth

When I run your spider, here is what I see on the console:

$ scrapy runspider op.py 
2016-11-09 21:34:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 21:34:51 [scrapy] INFO: Spider opened
2016-11-09 21:34:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 21:34:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 21:34:51 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 21:34:53 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 21:34:53 [scrapy] ERROR: Spider error processing <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 44, in _parse_sitemap
    s = Sitemap(body)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/sitemap.py", line 17, in __init__
    rt = self._root.tag
AttributeError: 'NoneType' object has no attribute 'tag'

You may have noticed the AttributeError exception. In short, Scrapy had trouble parsing the body of the sitemap response.

And if Scrapy cannot make sense of the sitemap's content, it cannot parse the body as XML, so it cannot follow any <loc> URLs, and since nothing is found, none of your callbacks get called.

So you have in fact uncovered a bug in Scrapy (thanks for reporting it): https://github.com/scrapy/scrapy/issues/2389

As for the bug itself:

The various sub-sitemaps (e.g. http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz) are sent "over the wire" as gzip-compressed .gz files that are gzipped twice, so the HTTP response body needs to be decompressed twice before it parses correctly as XML.

Scrapy cannot handle this case, hence the exception that gets printed.
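What "gzipped twice" means can be reproduced with the standard library alone; here is a minimal sketch (the payload is illustrative, not newegg's real sitemap):

```python
import gzip

xml = b'<?xml version="1.0"?><urlset></urlset>'  # toy sitemap body
once = gzip.compress(xml)    # what a well-behaved .xml.gz would contain
twice = gzip.compress(once)  # what the server effectively sends

# After one round of decompression, what is left still starts with the
# gzip magic number (0x1f 0x8b) instead of XML:
first_pass = gzip.decompress(twice)
print(first_pass[:2])               # b'\x1f\x8b'
print(gzip.decompress(first_pass))  # the original XML
```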

Here is a basic sitemap spider that tries to gunzip the response a second time:

from scrapy.utils.gz import gunzip
import scrapy


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']

    def parse(self, response):
        self.logger.info('parsing %r' % response.url)

    def _get_sitemap_body(self, response):
        body = super(CurrentHarvestSpider, self)._get_sitemap_body(response)
        self.logger.debug("body[:32]: %r" % body[:32])
        try:
            body_unzipped_again = gunzip(body)
            self.logger.debug("body_unzipped_again[:32]: %r" % body_unzipped_again[:32])
            return body_unzipped_again
        except Exception:
            # body was not gzipped a second time; use it as-is
            pass
        return body

These logs show that newegg's .xml.gz sitemaps do indeed need to be gunzipped twice:

$ scrapy runspider spider.py 
2016-11-09 13:10:56 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 13:10:56 [scrapy] INFO: Spider opened
2016-11-09 13:10:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 13:10:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding='
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01'
2016-11-09 13:10:57 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
2016-11-09 13:10:57 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.newegg.com/Hubs/SubCategory/ID-26> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-11-09 13:10:59 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:59 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xe3\xfa\x1eX\x00\x0bnewegg_sitemap_product'
2016-11-09 13:10:59 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
(...)
2016-11-09 13:11:02 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512> (referer: http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz)
(...)
2016-11-09 13:11:02 [newegg] INFO: parsing 'http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512'
(...)
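Rather than the blanket try/except in the spider above, one could decide whether a second pass is needed by checking for the gzip magic number; a small sketch (`gunzip_if_gzipped` is a hypothetical helper, not part of Scrapy):

```python
import gzip


def gunzip_if_gzipped(body: bytes) -> bytes:
    # Decompress repeatedly while the payload still starts with the gzip
    # magic number (0x1f 0x8b), instead of relying on a blind try/except.
    while body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return body


# Simulate a double-compressed sitemap body:
doubly_compressed = gzip.compress(gzip.compress(b"<urlset/>"))
print(gunzip_if_gzipped(doubly_compressed))  # b'<urlset/>'
```

This makes the intent explicit and leaves non-gzipped bodies untouched, at the cost of assuming the magic bytes are a reliable signal.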
