Scrapy CrawlSpider: URL depth

Posted by An Se in Dev

An Se

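I'm trying to implement the same feature that ScreamingFrog has: measuring URL depth. To do that, I'm reading the depth value from response.meta, like this: response.meta.get('depth', 0), but my results differ significantly from ScreamingFrog's. To debug why this happens, I'd like to save all the pages the CrawlSpider went through in order to reach the current page.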
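For context on what that value means: Scrapy's DepthMiddleware sets a request's depth to the depth of the response that produced it plus one, and duplicate requests to an already-seen URL are dropped, so a page keeps the depth of whichever path reached it first. Scrapy's pending-request queue is LIFO by default, while ScreamingFrog reportedly crawls breadth-first. The toy model below is only a sketch of that effect, not Scrapy's actual scheduler (it ignores concurrency and priorities); the link graph and the function name are made up for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to
LINKS = {
    'home': ['p', 'q'],
    'p': ['s'],
    'q': ['r'],
    'r': ['s'],
    's': [],
}


def discovery_depths(links, start, lifo):
    """Return {page: depth}, where depth = depth of the discovering page + 1.

    A page keeps the depth of whichever link reached it first; later
    links to the same page are ignored, like filtered duplicate requests.
    """
    depths = {start: 0}
    pending = deque([start])
    while pending:
        # LIFO takes the newest pending page, FIFO (breadth-first) the oldest
        page = pending.pop() if lifo else pending.popleft()
        for target in links[page]:
            if target not in depths:
                depths[target] = depths[page] + 1
                pending.append(target)
    return depths


bfs = discovery_depths(LINKS, 'home', lifo=False)
dfs = discovery_depths(LINKS, 'home', lifo=True)
print(bfs['s'], dfs['s'])  # 2 3
```

In this graph, page 's' is two clicks from the start page, but the LIFO order reaches it first through the longer home, q, r chain and records depth 3.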

Here is what my spider currently looks like:

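import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Assumption: `website` is not defined in the original post; value inferred from the docstring
website = 'dior.com'


class FrSpider(scrapy.spiders.CrawlSpider):
    """Designed to crawl french version of dior.com"""

    name = 'Fr'
    allowed_domains = [website]
    denyList = []

    start_urls = ['https://www.%s/' % website]
    rules = (Rule(LinkExtractor(deny=denyList), follow=True, callback='processLink'),)

    def processLink(self, response):
        link = response.url
        # 'depth' is set by Scrapy's DepthMiddleware; fall back to 0 if it is missing
        depth = response.meta.get('depth', 0)
        print('%s: depth is %s' % (link, depth))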
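One possible way to record the pages a request passed through is to carry a list of parent URLs in the request's meta dict. The sketch below is not part of the spider above; the function name `record_trail` and the meta key 'trail' are hypothetical. It only relies on objects that expose `.url` and `.meta`, which is the shape of Scrapy's Request and Response objects, so stand-in objects are used here to keep it runnable without a crawl.

```python
from types import SimpleNamespace


def record_trail(request, response):
    """Copy the parent's trail into the new request and append the parent URL."""
    parent_trail = response.meta.get('trail', [])  # 'trail' is a hypothetical key
    # Build a new list so sibling requests do not share one mutable trail
    request.meta['trail'] = parent_trail + [response.url]
    return request  # returned so it can serve as a request-processing hook


# Stand-in objects with the same .url/.meta shape as Scrapy's Request/Response
home = SimpleNamespace(url='https://example.com/', meta={})
req_a = record_trail(SimpleNamespace(url='https://example.com/a', meta={}), home)
page_a = SimpleNamespace(url=req_a.url, meta=req_a.meta)
req_b = record_trail(SimpleNamespace(url='https://example.com/a/b', meta={}), page_a)
print(req_b.meta['trail'])  # ['https://example.com/', 'https://example.com/a']
```

Assuming Scrapy 2.0 or later, where a Rule's process_request hook receives both the request and the response it came from, this could be wired in as Rule(LinkExtractor(deny=denyList), follow=True, callback='processLink', process_request=record_trail), and processLink could then print response.meta.get('trail', []) next to the depth.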

Here is a comparison of crawl statistics between my crawler and ScreamingFrog (same website, first ~500 pages only):

Depth (Clicks from Start URL)  Number of URLs  % of Total
-----------------------------  --------------  ----------
1                              62              12.4
2                              72              14.4
3                              97              19.4
4                              49              9.8
5                              40              8.0
6                              28              5.6
7                              46              9.2
8                              50              10.0
9                              56              11.2

vs.