尝试将数据发送到数据库时出现Scrapy ItemLoader错误

斯坦·S。

我给了一个要抓取的网站,一个额外的文件(dp_data_mgr.py)包括一个send_data函数,可以导入到我的蜘蛛脚本中并将抓取的数据发送到Web数据库。我困扰了几天的问题是我无法让ItemLoader在收到此消息时就发送数据:

Traceback (most recent call last):
  File "C:\Users\gs\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\gs\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\Users\gs\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\gs\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\gs\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\gs\scrapy_projects\DynamicPricing\avtogumibg\avtogumibg\spiders\avtogumibg.py", line 88, in parse_specs_page
    misc=params
  File "/Users/gs/scrapy_projects/DynamicPricing/avtogumibg/avtogumibg/spiders\dp_data_mgr.py", line 48, in send_data
    common_keys = data.keys() & misc.keys()
AttributeError: 'ItemLoader' object has no attribute 'keys'

`dp_data_mgr文件位于我的蜘蛛文件夹中。我需要如何修改代码才能使其正常工作?

Spider.py:

import scrapy
import sys

sys.path.append('/Users/gs/scrapy_projects/DynamicPricing/avtogumibg/avtogumibg/spiders')
from dp_data_mgr import send_data
from scrapy.spiders import Spider

from scrapy.loader import ItemLoader
from scrapy.http import Request

from avtogumibg.items import AvtogumiItem


class AvtogumiSpider(scrapy.Spider):
    name = 'avtogumibg'
    allowed_domains = ['bg.avtogumi.bg']
    start_urls = ['https://bg.avtogumi.bg/oscommerce/']
    BASE_URL = 'https://bg.avtogumi.bg/oscommerce/'


    def parse(self, response):
        brands = response.xpath('//div[@class="brands"]//@href').extract()
        if brands:
            for brand in brands:
                yield Request(url=self.BASE_URL + brand, callback=self.parse_page, dont_filter = True)

    def parse_page(self, response):

        brand = response.xpath('//h4[@class="brand-header"]/span/text()').extract_first()
        listing_url = response.url
        urls = response.xpath('//div[@class="col-xs-12 full-box"]//h4//@href').extract()
        if urls:
            for url in urls:
                yield Request(url=url, callback=self.parse_specs_page,meta={'brand':brand,'listing_url':listing_url})
        else:
            return

        next_page_url = response.xpath('//div[@class="col-md-12 text-center hidden-sh hidden-xs hidden-sm m-top"]//li/a[@class="next"]/@href').extract_first()
        if next_page_url:
            yield Request(url=self.BASE_URL + next_page_url[0], callback=self.parse_page)

    def parse_specs_page(self, response):

        subj = response.xpath('//div[@class="full br-5 bg-white top-yellow-bd"]')
        l = ItemLoader(item=AvtogumiItem(), selector=subj, response=response)

        l.add_value('url', response.url,)

        l.add_xpath('name', '//div[@class="product-box-desc"]/h4/text()',)
        l.add_xpath('prodId', '//div[@class="product-box-desc"]/p/text()',)

        l.add_xpath('category', './/div[@class="col-sh-6 col-xs-4 col-lg-1"]/p/text()',)

        l.add_value('brand', response.meta.get['brand'],)

        l.add_xpath('sPrice', './/p[@class="price font-bold"]//text()',)

        l.add_xpath('stock', './/div[@class="full m-top product-availability"]//span//text()',)
        l.add_xpath('images', './/div[@class="full-product-box main-product"]//@src',)

        specsTable = {}
        atms_key = subj.xpath('.//div[@class="full m-top product-features"]/div/p/span/text()').extract()[0]
        atms_val = subj.xpath('.//div[@class="full m-top product-features"]/div/p/text()').extract()[0]
        specsTable[atms_key] = atms_val

        speed_key = subj.xpath('.//div[@class="full m-top product-features"]/div/p/span/text()').extract()[1]
        speed_val = subj.xpath('.//div[@class="full m-top product-features"]/div/p/text()').extract()[1]
        specsTable[speed_key] = speed_val

        tyre_type_key = subj.xpath('.//div[@class="full m-top product-features"]/div/p/span/text()').extract()[2]
        tyre_type_val = subj.xpath('.//div[@class="full m-top product-features"]/div/p/text()').extract()[2]
        specsTable[tyre_type_key] = tyre_type_val

        manuf_key = subj.xpath('.//div[@class="full m-top product-features"]/div/p/text()').extract()[3]
        manuf_val = subj.xpath('.//div[@class="full m-top product-features"]/div/p/span/text()').extract()[3]
        specsTable[manuf_key] = manuf_val

        l.add_value('specsTable', specsTable)

        listing_url = response.meta.get['listing_url']

        params = l
        yield l.load_item()
        send_data(access_token='',  # Provided by DP
              site_id='https://bg.avtogumi.bg/oscommerce/',  # Provided by DP
              proxy_ip='No proxy_ip',  # When using a proxy servers provider, they can provide a response header with the IP of the proxy used for this page request
              page_url=response.url,  # The current URL of the product page
              listing_url=listing_url,  # URL of the listing page from where we came to the product page
              misc=params
              )

dp_data_mgr.py:

def send_data(access_token, site_id, proxy_ip, page_url, listing_url, misc):
# print('Gonna send req to: ', url_service, '    dev_mode: ', dev_mode)
headers = {
    'Dp-Craser-User-Token': access_token,
    'Dp-Craser-Dev-Mode': 'yes' if dev_mode else 'no'
}
data = OrderedDict([
    ('siteId', site_id),
    ('proxyIP', proxy_ip),
    ('urlPage', page_url),
    ('urlRef', listing_url),
])
common_keys = data.keys() & misc.keys()
assert not common_keys, 'You have passed some properties in "misc" that have the same names as the explicit params: ' + ', '.join(common_keys)
data.update(sorted(misc.items()))  # Append all misc items to the end, but sort only them
# print('Req data:\n', json.dumps(data, indent=4), '\n')
try:
    resp = requests.post(url=url_service, data=json.dumps(data), headers=headers, verify=False)
    if resp.status_code != 200:
        print('RECEIVED ERROR FROM SERVER:', resp.json())
except requests.exceptions.RequestException as e:
    print('REQUEST EXCEPTION:', e)


 #===================== Usage example ==========================================

def send_example_request():
    params = dict(
        # Here are some commonly used properties. Populate them whenever possible.
        prodId='3842JfK',  # The product ID (also known as SKU). Must be a string (even if it only contains digits).
        name='The name of product X',
        category='Hardware >> Laptops',  # breadcrumbs
        brand='ASUS',
        eans=['1234567'],  # The expected type is an array of strings. Do NOT assign a string directly, even if the product has exactly one EAN!
        partNums=[],  # The expected type is an array of strings. Do NOT assign a string directly, even if the product has exactly one part number!
        images=['http://example.com/3842JfK/p1.jpg', 'http://example.com/3842JfK/p2.jpg'],  # An array of image URLs. Do NOT assign a string directly, even if the product only has zero or one images!
        stock='Out of stock',  # Other example values are "In stock", "Not available", etc.
        specsTable=[
            {'key': 'Color', 'value': 'Brown'},  # Note that the keys and values will usually be localized (i.e. not necessarily in English)
            {'key': 'Series', 'value': 'X540'},
            {'key': 'CPU', 'value': 'Intel Core i3-5005U'},
            {'key': 'RAM', 'value': '4GB (1x 4096MB) - DDR3, 1600Mhz'},
        ],
        sPrice='1,299.99',  # The raw value as a string. If the product is in promotion, set the promo price here.
        sOldPrice='1,429.99',  # The raw value as a string. If the product is in promotion, this price will often be displayed as scratched.
        # We can also add some custom properties:
        someCustomProperty='abc',
        zzzzz=False
    )
    send_data(access_token='someCode',  # Provided by DP
              site_id=102,  # Provided by DP
              proxy_ip='SomeIP-here',  # When using a proxy servers provider, they can provide a response header with the IP of the proxy used for this page request
              page_url='SomeURL',  # The current URL of the product page
              listing_url='URL-FromWhereWeCameToThisProductPage',  # URL of the listing page from where we came to the product page
              misc=params
              )

items.py:

# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urljoin
from scrapy.loader.processors import MapCompose, Join, TakeFirst

from scrapy.item import Item, Field


class AvtogumiItem(scrapy.Item):


    def make_absolute_url(url, loader_context):
        return loader_context['response'].urljoin(url)

    strip = MapCompose(str.strip)

    url = scrapy.Field(input_processor=strip, output_processor=TakeFirst())
    name = scrapy.Field()
    prodId = scrapy.Field()

    category = scrapy.Field()
    brand = scrapy.Field(input_processor=strip,)

    sPrice = scrapy.Field(input_processor=strip, output_processor=TakeFirst())
    sOldPrice = scrapy.Field(input_processor=strip, output_processor=Join())

    stock = scrapy.Field(input_processor=strip,output_processor=TakeFirst())
    images = scrapy.Field(input_processor=MapCompose(make_absolute_url), output_processor=TakeFirst())    
    specsTable = scrapy.Field(input_processor=strip,output_processor=TakeFirst())

由于我对如何解决此问题一无所知,所有帮助将不胜感激。谢谢大家 !

恐龙龙

您正在使用ItemLoader对象,而不是它生成的Item / dict:

params = l
yield l.load_item()

应该:

params = l.load_item()
yield params

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章

我正在尝试使用 laravel 将多个标题和多个文件发送到数据库中,但出现错误

即使已成功将发布请求发送到数据库,为什么仍然出现500错误?

Bootstrap表单将数据错误发送到MongoDB数据库

ReactJS:表单将错误的日期值发送到数据库

使用React JS和Laravel将带有图像文件的POST请求发送到数据库时出现内部服务器错误

通过参数化的插入语句将错误的数据从VB.net发送到Mysql数据库

尝试推送到OCI数据库时发生错误

尝试在Python套接字中将数据从服务器发送到客户端时出现管道错误

我正在尝试将数据从活动发送到片段,但出现此错误,请问有什么可以帮助的吗?

尝试将 json 数据保存到数据库时出现 422 错误

尝试将多个 Pandas 数据帧合并到 postgresql 数据库时出现编程错误

尝试使用Entity Framework将值保存到数据库时出现元数据错误

尝试检索数据库表时出现PDO错误

尝试将音频从mpd发送到PulseAudio时出现“无法打开音频输出”错误

尝试将nodejs发送到mysql时出现sql语法错误

使用 .to_sql 将数据帧发送到 MySQL 时出现“SELECT name FROM sqlite_master”错误

将数据插入数据库时出现错误

尝试将消息上传到 Firebase 数据库时出现“发现冲突的 getter”错误

HTTP 状态 400 – 尝试将日期保存到数据库时出现错误请求

Laravel:尝试将帖子添加到数据库时出现 BadMethodCallException 错误

将图像发送到数据缓冲区,但出现关键错误

错误:使用GET方法将数据作为JSON发送到Rails资源时,URI错误

angularjs 在使用 POST 将数据发送到服务器时使 json 格式错误

将数据发送到iTunes商店时发生错误“计划很快重新启动”

将 POST 请求值从 react 发送到烧瓶时未定义名称错误数据

ViewPager-将数据从片段发送到活动时,Android错误setCurrentItem(int,boolean)

使用php将表单数据发送到邮件时发生错误

尝试从数据库中检索数据时,出现错误“ InvalidOperationException未处理”

尝试在C#中将数据插入数据库时,出现IEntityChangeTracker错误的多个实例