Renaming downloaded images in Scrapy 0.24 with content from an item field while avoiding filename conflicts?

jkupczak

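I'm trying to rename the images downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with the SHA1 hash of their URL as the file name. I'd like to name them after the value I extract into item['model'] instead. A question from 2011 outlines what I want, but its answers target earlier versions of Scrapy and don't work with the latest version.

Once I get this working, I'll also need to account for different images being downloaded with the same file name. So I'll need to download each image into its own uniquely named folder, presumably based on the original URL.

Here is a copy of the code I'm using in my pipeline. I took it from a more recent answer to that 2011 question, but it isn't working for me. Nothing errors out and the images download normally. My extra code doesn't seem to have any effect on the file names, which still appear as SHA1 hashes.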

pipelines.py

class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    #Name download version
    def file_path(self, request, response=None, info=None):
        item=request.meta['item'] # Like this you can use all from item, not just url.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    #Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        #yield Request(item['images']) # Adding meta. Dunno how to put it in one line :-)
        for image in item['images']:
            yield Request(image)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

settings.py

BOT_NAME = 'allenheath'

SPIDER_MODULES = ['allenheath.spiders']
NEWSPIDER_MODULE = 'allenheath.spiders'

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}

IMAGES_STORE = 'c:/allenheath/images'

products.py (my spider)

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract() # The value I'd like to use to name my images.
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

items.py

import scrapy

class ProductItem(scrapy.Item):
    model = scrapy.Field()
    itemcode = scrapy.Field()
    shortdesc = scrapy.Field()
    desc = scrapy.Field()
    series = scrapy.Field()
    imageorig = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

Here's a pastebin of the output I get at the command prompt when I run my spider: http://pastebin.com/ir7YZFqf

Any help would be greatly appreciated!

Skyline75489

pipelines.py:

# In Scrapy 0.24 the images pipeline lives under scrapy.contrib;
# from Scrapy 1.0 on, import it from scrapy.pipelines.images instead.
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy import log

class MyImagesPipeline(ImagesPipeline):

    # Name the full-size download after the item's model
    def file_path(self, request, response=None, info=None):
        # extract() returns a list, so take its first element
        image_guid = request.meta['model'][0]
        log.msg(image_guid, level=log.DEBUG)
        return 'full/%s' % (image_guid)

    # Name the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        log.msg(image_guid, level=log.DEBUG)
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # Pass the item as meta so file_path can read item['model'];
        # only the first image URL is requested here
        yield Request(item['image_urls'][0], meta=item)
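
The pipeline above names each file after the model alone, so two images that share a model name would end up at the same path, and the question also asks for collision-free names. Here is a minimal sketch of one way to handle that, assuming a folder named after the SHA1 of the source URL is acceptable. `slugify` and `build_image_path` are hypothetical helpers, not part of Scrapy, and the `.jpg` suffix assumes the images pipeline's usual conversion to JPEG.

```python
import hashlib
import re


def slugify(text):
    """Reduce arbitrary text to characters that are safe in a file name."""
    slug = re.sub(r'[^A-Za-z0-9._-]+', '-', text.strip())
    return slug.strip('-') or 'unnamed'


def build_image_path(model, url):
    """Return 'full/<sha1 of url>/<model slug>.jpg' (hypothetical layout)."""
    # extract() returns a list, so accept either a list or a plain string.
    if isinstance(model, (list, tuple)):
        model = model[0] if model else ''
    # Hashing the URL gives every distinct image its own folder.
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s/%s.jpg' % (url_hash, slugify(model))
```

Inside `file_path` this could be called as `return build_image_path(request.meta['model'], request.url)`. To fetch every image rather than only the first, loop over `item['image_urls']` in `get_media_requests` and yield one `Request` per URL with the same `meta`.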

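Your settings.py is wrong: it registers the built-in ImagesPipeline, so the overrides in MyImagesPipeline are never called. Use this instead: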

ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}
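
If you want to confirm that the dotted path in ITEM_PIPELINES really points at your class, something like the following can be run from the project root. It is only an illustrative sketch: `resolve_dotted_path` is a hypothetical helper written with the standard library, not Scrapy's own loader.

```python
from importlib import import_module


def resolve_dotted_path(path):
    """Import 'package.module.Name' and return the object called Name."""
    module_path, _, name = path.rpartition('.')
    module = import_module(module_path)
    # Raises AttributeError if the module has no attribute with that name.
    return getattr(module, name)
```

Calling `resolve_dotted_path('allenheath.pipelines.MyImagesPipeline')` should return the class; an ImportError or AttributeError suggests the path in settings.py does not match the module or class name.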

To make thumbnails work, add this to settings.py:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}
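
If the thumbnails should also be named after the model without colliding, the same idea can be applied to the thumbnail path. This is a rough sketch under the same assumptions as before: `build_thumb_path` is a hypothetical helper, `thumb_id` is one of the IMAGES_THUMBS keys ('small' or 'big'), and the folder layout is just one possible choice.

```python
import hashlib
import re


def build_thumb_path(thumb_id, model, url):
    """Return 'thumbs/<thumb_id>/<sha1 of url>/<model slug>.jpg'."""
    # One folder per source URL keeps same-named thumbnails apart.
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # Keep only characters that are safe in a file name.
    name = re.sub(r'[^A-Za-z0-9._-]+', '-', model.strip()).strip('-')
    return 'thumbs/%s/%s/%s.jpg' % (thumb_id, url_hash, name or 'unnamed')
```

From `thumb_path` it could be called as `return build_thumb_path(thumb_id, request.meta['model'][0], request.url)`.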


Edited on 2020-10-28
