Scrapy spider that parses data recursively never reaches its callback

潘阮定

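I'm a beginner, and I wrote a script with Python Scrapy to collect information recursively.

First it scrapes the links of the cities (which carry the tour information), then follows each city link to its page. There it collects the tour details for that city before moving on to the next page, and so on. Pagination runs on JavaScript, with no visible links.

The command I use to run the spider and get CSV output is: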

scrapy crawl practice -o practice.csv -t csv

The expected result is a CSV file like this:

title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...

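The problem is that the CSV file is empty. The run stops at parse_page, and callback=self.parse_item never fires. I don't know how to fix it. Maybe my workflow is invalid, or something is wrong with my code. Thanks for your help.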

name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]

def parse(self, response): # Extract cities from country
    hxs = HtmlXPathSelector(response)
    urls = hxs.select("//div[@class='swiper-wrapper cityData']/a/@href").extract()
    for url in urls:
        url = urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page) # Link to city


def parse_page(self, response): # Move to next page
    url_ = response.request.url
    yield response.follow(url_, callback=self.parse_item)

    # I will use selenium to move next page because of next button is running
    # on javascript without fixed url.

def parse_item(self, response): # Extract tours
    for block in response.xpath("//div[@class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
        article = {}
        article['title'] = block.xpath('.//h3[@class="title"]/text()').extract()
        article['city'] = response.xpath(".//div[@class='g_v_c_mid t_mid']/h1/text()").extract()# fixed
        article['price'] = re.sub("  +","",block.xpath(".//span[@class='latest_price']/b/text()").extract_first()).strip()
        article['tour_url'] = 'www.klook.com'+block.xpath(".//a/@href").extract_first()

        yield article
雷心
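hxs = HtmlXPathSelector(response)    # response already works as a selector; call `response.xpath` directly

Replace:

url = urllib.parse.urljoin(response.url, url)

with:

url = response.urljoin(url)

And yes, the crawl stops there because the request is a duplicate: parse_page requests the same URL that was just fetched, so Scrapy's duplicate filter drops it and parse_item is never called. Pass dont_filter=True to that request to skip the duplicate check.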
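To illustrate why the second request disappears, here is a minimal sketch of a duplicate filter in plain Python. It is a toy model, not Scrapy's implementation: the `Request` class, the `crawl` loop, `make_parse_page`, and the example.com URL are all hypothetical stand-ins, and a simple set of seen URLs stands in for Scrapy's request fingerprinting. It only mirrors the flow in the question, where parse_page re-requests the URL it was just called with.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterator, List, Set


@dataclass
class Request:
    """Hypothetical stand-in for a crawl request (not scrapy.Request)."""
    url: str
    callback: Callable[[str], Iterator["Request"]]
    dont_filter: bool = False


def crawl(start: Request) -> List[str]:
    """Run callbacks in order and return the names of those that were called."""
    seen: Set[str] = set()
    called: List[str] = []
    queue = deque([start])
    while queue:
        request = queue.popleft()
        if not request.dont_filter:
            if request.url in seen:
                continue  # dropped as a duplicate; its callback never runs
            seen.add(request.url)
        called.append(request.callback.__name__)
        queue.extend(request.callback(request.url))
    return called


def parse_item(url: str) -> Iterator[Request]:
    """Placeholder for the tour extraction step; schedules nothing further."""
    return iter(())


def make_parse_page(dont_filter: bool) -> Callable[[str], Iterator[Request]]:
    def parse_page(url: str) -> Iterator[Request]:
        # Re-request the same URL, as parse_page in the question does.
        yield Request(url, parse_item, dont_filter=dont_filter)
    return parse_page


CITY_URL = "https://example.com/city/1"  # placeholder URL, not from the question

filtered = crawl(Request(CITY_URL, make_parse_page(dont_filter=False)))
unfiltered = crawl(Request(CITY_URL, make_parse_page(dont_filter=True)))

print(filtered)    # ['parse_page']
print(unfiltered)  # ['parse_page', 'parse_item']
```

In the question's spider, the equivalent change would be passing dont_filter=True to the response.follow call inside parse_page. Another option, if the intermediate request serves no purpose, is to run the extraction directly on the response that parse_page already receives.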
