I'm new to this, and I wrote a script with Python Scrapy to fetch information recursively.
First it scrapes the links of cities (which contain tour information), then it follows each city link to reach that city's page. There it collects the tour information for that city before moving on to the next page, and so on. The pagination runs on JavaScript, with no visible link.
The command I use to run it and get the CSV output is:
scrapy crawl practice -o practice.csv -t csv
The expected result is a CSV file like:
title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...
The problem is that the CSV file is empty. The run stops at parse_page, and the callback=self.parse_item is never called. I don't know how to fix this; maybe my workflow is invalid or there is a problem in my code. Thanks for your help.
name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]

def parse(self, response):  # Extract cities from country
    hxs = HtmlXPathSelector(response)
    urls = hxs.select("//div[@class='swiper-wrapper cityData']/a/@href").extract()
    for url in urls:
        url = urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page)  # Link to city

def parse_page(self, response):  # Move to next page
    url_ = response.request.url
    yield response.follow(url_, callback=self.parse_item)
    # I will use selenium to move to the next page because the next button
    # runs on javascript without a fixed url.

def parse_item(self, response):  # Extract tours
    for block in response.xpath("//div[@class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
        article = {}
        article['title'] = block.xpath('.//h3[@class="title"]/text()').extract()
        article['city'] = response.xpath(".//div[@class='g_v_c_mid t_mid']/h1/text()").extract()  # fixed
        article['price'] = re.sub(" +", "", block.xpath(".//span[@class='latest_price']/b/text()").extract_first()).strip()
        article['tour_url'] = 'www.klook.com' + block.xpath(".//a/@href").extract_first()
        yield article
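Since the comment in parse_page mentions Selenium for the JavaScript pagination, here is a rough sketch of that idea, kept separate from the spider. The next-button selector '.pagination .next' and the helper name iter_city_pages are assumptions for illustration, not Klook's real markup; each yielded Selector could be queried with the same XPaths used in parse_item.

# Hedged sketch: click the JS-driven "next" button with Selenium and hand each
# page's HTML back as a scrapy Selector. The CSS selector below is hypothetical.
import time

from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def iter_city_pages(city_url, max_pages=10):
    driver = webdriver.Chrome()
    try:
        driver.get(city_url)
        for _ in range(max_pages):
            # Parse this with the same XPaths as parse_item
            yield Selector(text=driver.page_source)
            try:
                next_btn = driver.find_element(By.CSS_SELECTOR, '.pagination .next')
            except NoSuchElementException:
                break  # no more pages
            next_btn.click()
            time.sleep(2)  # crude wait; an explicit WebDriverWait would be better
    finally:
        driver.quit()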
In parse, the response is already a Selector-backed object, so HtmlXPathSelector is unnecessary; use response.xpath directly instead of

hxs = HtmlXPathSelector(response)

Also change

url = urllib.parse.urljoin(response.url, url)

to:

url = response.urljoin(url)

And yes, it stops at parse_page because it re-requests the same URL and that request is dropped as a duplicate; you need to add dont_filter=True. Check the Scrapy documentation on Request and its dont_filter argument.
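Putting those points together, the affected methods would look roughly like this; it's a sketch that assumes the rest of the spider stays as in the question:

def parse(self, response):  # Extract cities from country
    # response is already a Selector-backed object, so query it directly
    for url in response.xpath("//div[@class='swiper-wrapper cityData']/a/@href").extract():
        url = response.urljoin(url)  # instead of urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page)  # Link to city

def parse_page(self, response):  # Move to next page
    # Re-requesting the same URL is dropped by the duplicate filter by default,
    # so dont_filter=True is required for parse_item to ever be called.
    yield response.follow(response.request.url, callback=self.parse_item, dont_filter=True)

With dont_filter=True the second request actually goes through, parse_item runs, and the CSV should no longer come out empty.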