Recursive scraping in Python with Scrapy


I am trying to build a scraper that pulls the link, title, price and post body for listings on craigslist. I've been able to get the prices, but it returns the price for every listing on the page rather than just the price for that specific row. I also can't get it to move on to the next page and continue scraping.

This is the tutorial I'm working from: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/

I've tried the suggestions from this thread, but still can't get it to work: Scrapy Python Craigslist Scraper

The page I'm trying to scrape is http://medford.craigslist.org/cto/

In the link price variable, if I remove the // before span[@class="l2"] it returns no prices, but if I leave it in, it includes every price on the page.

For the rules, I've tried playing with the class tags, but the spider seems to hang on the first page. I'm thinking I might need separate spider classes?

Here is my code:

#-------------------------------------------------------------------------------
# Name:        module1
# Purpose:
#
# Author:      CD
#
# Created:     02/03/2014
# Copyright:   (c) CD 2014
# Licence:     <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
import sys

class PageSpider(BaseSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html", ), restrict_xpaths=('//span[@class="button next"]' ,))
        , callback="parse", follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="l2"]')

        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            item['link'] = title.select("a/@href").extract()
            item['price'] = title.select('//span[@class="l2"]//span[@class="price"]/text()').extract()

            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)


    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

The idea is simple: find all of the row paragraphs inside the div with class="content", then extract the link, link text and price from every row. Also note that the select() method is currently deprecated; use xpath() instead.

Here is a modified version of the parse() method:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    rows = hxs.select('//div[@class="content"]/p[@class="row"]')

    for row in rows:
        item = CraigslistSampleItem()
        link = row.xpath('.//span[@class="pl"]/a')
        item['title'] = link.xpath("text()").extract()
        item['link'] = link.xpath("@href").extract()
        item['price'] = row.xpath('.//span[@class="l2"]/span[@class="price"]/text()').extract()

        url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
        yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)
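The key change is the leading dot in `.//span[...]`: a bare `//` XPath always searches from the document root, even when applied to a single-row selector, which is why every price on the page came back; the dotted form is relative to the current row. A minimal sketch of the difference using lxml (the library Scrapy's selectors are built on), against made-up markup standing in for the Craigslist page:

```python
from lxml import html

# Illustrative stand-in for the Craigslist listing markup.
doc = html.fromstring("""
<div class="content">
  <p class="row"><span class="l2"><span class="price">$995</span></span></p>
  <p class="row"><span class="l2"><span class="price">$1200</span></span></p>
</div>
""")

rows = doc.xpath('//p[@class="row"]')
first = rows[0]

# '//' ignores the current context and matches across the whole document:
print(first.xpath('//span[@class="price"]/text()'))   # both prices
# './/' matches only within the current row:
print(first.xpath('.//span[@class="price"]/text()'))  # just this row's price
```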

Here is a sample of what I get:

{'description': [u"\n\t\tHave a nice, sturdy, compact car hauler/trailer.  May be used for other hauling like equipstment, ATV's and the like,   Very solid and in good shape.   Parice to sell at only $995.   Call Bill at 541 944 2929 top see or Roy at 541 9733421.   \n\t"],
 'link': [u'/cto/4354771900.html'],
 'price': [u'$995'],
 'title': [u'compact sturdy car trailer ']}
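A side note on building the follow-up URL: instead of concatenating strings by hand, the standard library's `urljoin` resolves relative links like `/cto/4354771900.html` against the page URL safely (the base and link values here are just the ones from the example above):

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base = 'http://medford.craigslist.org/cto/'
link = '/cto/4354771900.html'

# urljoin handles absolute paths, trailing slashes and relative links correctly.
print(urljoin(base, link))  # http://medford.craigslist.org/cto/4354771900.html
```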

Hope that helps.
