I'm using Scrapy (version 1.1.1) to scrape some data from the internet. Here is what I'm facing:
import codecs
import scrapy

class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']
    # read one start URL per line from link.txt
    with codecs.open('link.txt', 'r', 'utf-8') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        print(response.url)
In the code above, "start_urls" is a list:
start_urls = [
    'http://example_0.com/?id=0',
    'http://example_0.com/?id=1',
    'http://example_0.com/?id=2',
]  # and so on
When Scrapy runs, the debug output tells me:
[scrapy] DEBUG: Redirecting (302) to (GET https://example_1.com/?subid=poison_apple) from (GET http://example_0.com/?id=0)
[scrapy] DEBUG: Redirecting (301) to (GET https://example_1/ture_a.html) from (GET https://example_1.com/?subid=poison_apple)
[scrapy] DEBUG: Crawled (200) (GET https://example_1/ture_a.html) (referer: None)
Now, how can I tell which URL in "start_urls" (http://example_0.com/?id=***) is paired with the final URL "https://example_1/ture_a.html"? Can anyone help me?
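One possible approach, sketched under the assumption that automatic redirects stay enabled: Scrapy's RedirectMiddleware records the URLs a request has passed through under the `redirect_urls` key of the request meta, so the first entry of that list is the start URL. The helper name `first_url` is hypothetical and only illustrates the lookup on a plain dict.

```python
def first_url(meta, final_url):
    """Return the URL a redirect chain started from.

    `meta` is expected to look like response.meta; when no redirect
    happened, the `redirect_urls` key is absent and the final URL is
    itself the start URL.
    """
    chain = meta.get("redirect_urls", [])
    return chain[0] if chain else final_url


# Hypothetical use inside the spider's parse callback:
#     origin = first_url(response.meta, response.url)
#     print(origin, '->', response.url)
```

This keeps the spider unchanged apart from the lookup in the callback.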
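Extending the answer: if you want to control each request yourself, without automatic redirects (a redirect is an extra request), you can disable RedirectMiddleware, or just pass the dont_redirect meta parameter to the request. In this case: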
import codecs
import scrapy
from scrapy import Request

class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    # requests yielded from callbacks pass through the offsite filter,
    # so the domain you are redirected to must be allowed as well
    allowed_domains = ['example_0.com', 'example_1.com']

    # you'll have to control the initial requests with `start_requests`
    # instead of declaring start_urls
    def start_requests(self):
        with codecs.open('link.txt', 'r', 'utf-8') as f:
            start_urls = [url.strip() for url in f.readlines()]
        for start_url in start_urls:
            yield Request(
                start_url,
                callback=self.parse_handle1,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            )

    def parse_handle1(self, response):
        # here you'll have to handle the redirect yourself
        # remember that the redirected url is in the header: `Location`
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            # the header value is bytes: decode it and resolve relative URLs
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse_handle2,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        )

    def parse_handle2(self, response):
        # here you'll have to handle the second redirect yourself
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse,
        )

    def parse(self, response):
        # actual last response
        print(response.url)
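Handling the redirects by hand does not by itself record which start URL a final page came from. A minimal sketch of one way to keep track, assuming you are free to add your own meta key: `origin_url`, `next_hop` and `origin_meta` are hypothetical names introduced here for illustration and are not part of Scrapy. The helpers work on plain values, so they can be tried without running a crawl.

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin


def next_hop(current_url, location):
    """Resolve a Location header value (bytes or str) against the current URL."""
    if isinstance(location, bytes):
        location = location.decode("utf-8")
    return urljoin(current_url, location)


def origin_meta(meta, current_url):
    """Build meta for the follow-up request that remembers the chain's first URL.

    If an earlier hop already stored `origin_url`, keep it; otherwise the
    current URL is the start of the chain.
    """
    return {"origin_url": meta.get("origin_url", current_url)}


# Hypothetical use inside parse_handle1 / parse_handle2:
#     meta = origin_meta(response.meta, response.url)
#     meta.update({'dont_redirect': True, 'handle_httpstatus_list': [301, 302]})
#     yield Request(
#         next_hop(response.url, response.headers['Location']),
#         callback=self.parse_handle2,
#         meta=meta,
#     )
# and in the final callback:
#     origin = response.meta.get('origin_url', response.url)
```

Building a fresh meta dict for each hop, rather than copying response.meta wholesale, avoids carrying over keys that Scrapy sets internally for the previous request.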