How to get the first request URL after a 302 redirect followed by a 301

Posted by Xie in Dev

xie

I'm using Scrapy (ver: 1.1.1) to scrape some data from the web. This is the spider I'm working with:

import codecs

import scrapy


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']
    # read the start URLs, one per line, from link.txt
    with codecs.open('link.txt', 'r', 'utf-8') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        print(response.url)

In the code above, 'start_urls' is a list:

start_urls = [
              'http://example_0.com/?id=0',
              'http://example_0.com/?id=1',
              'http://example_0.com/?id=2',
             ] # and so on

When Scrapy runs, the debug output tells me:

[scrapy] DEBUG: Redirecting (302) to (GET https://example_1.com/?subid=poison_apple) from (GET http://example_0.com/?id=0)
[scrapy] DEBUG: Redirecting (301) to (GET https://example_1/ture_a.html) from (GET https://example_1.com/?subid=poison_apple)
[scrapy] DEBUG: Crawled (200) (GET https://example_1/ture_a.html) (referer: None)

Now, how can I tell which 'http://example_0.com/?id=***' URL in 'start_urls' is paired with the final URL 'https://example_1/ture_a.html'? Can anyone help me?

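To extend the answer: if you want to control every request yourself, without automatic redirection (a redirect is an extra request), you can disable the RedirectMiddleware or simply pass the dont_redirect key in the meta of that request. In this case: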

import codecs

import scrapy
from scrapy import Request


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']

    # you'll have to control the initial requests with `start_requests`
    # instead of declaring start_urls

    def start_requests(self):
        with codecs.open('link.txt', 'r', 'utf-8') as f:
            start_urls = [url.strip() for url in f.readlines()]
        for start_url in start_urls:
            yield Request(
                start_url,
                callback=self.parse_handle1,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            )

    def parse_handle1(self, response):
        # here you'll have to handle the redirect yourself
        # with dont_redirect set, response.url is still the URL you requested
        # remember that the redirected url is in the header: `Location`
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            # header values are bytes, and Location may be a relative URL
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse_handle2,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            # the target is outside allowed_domains, so skip the offsite filter
            dont_filter=True,
        )

    def parse_handle2(self, response):
        # here you'll have to handle the second redirect yourself
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        # actual last response
        print(response.url)
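
If automatic redirects are left enabled instead, the pairing can be recovered without handling each hop by hand: Scrapy's RedirectMiddleware records the URLs a request passed through under the redirect_urls key of the request meta, so the first entry is the URL that was originally requested. Below is a minimal sketch of that lookup. The helper name original_url is hypothetical (it is not part of Scrapy), and it takes a plain dict in place of response.meta so it can be tried without running a crawl.

```python
def original_url(meta, final_url):
    """Return the first URL of a redirect chain.

    `meta` is expected to look like Scrapy's response.meta, where the
    RedirectMiddleware stores the chain under 'redirect_urls'.
    If the request was never redirected the key is absent, so fall
    back to the URL of the response itself.
    """
    redirect_urls = meta.get('redirect_urls') or [final_url]
    return redirect_urls[0]


# Inside a spider callback this would be called as (assumed usage):
#     def parse(self, response):
#         print(original_url(response.meta, response.url), '->', response.url)

# Stand-in for the meta of the final response in the debug log above.
meta = {
    'redirect_urls': [
        'http://example_0.com/?id=0',
        'https://example_1.com/?subid=poison_apple',
    ],
}
print(original_url(meta, 'https://example_1/ture_a.html'))
print(original_url({}, 'http://example_0.com/?id=1'))
```

This keeps a single callback and lets Scrapy follow both the 302 and the 301, at the cost of not being able to inspect the intermediate responses.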
