Writing a simple Python Scrapy crawler with MongoDB

Eamonn Woods

I have started writing a simple Scrapy module that stores its output in MongoDB. I'm new to Python, and there is a problem with the code I've written:

congress.py

import scrapy

from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
from congress.items import CongressItem

class CongressSpider(CrawlSpider):
    name = "congres"
    allowed_domains = ["www.congress.gov"]
    start_urls = [
            'https://www.congress.gov/members',
        ]
    #creating a rule for my crawler. I only want it to continue to the next page, don't follow any other links.
    rules = (Rule(LinkExtractor(allow=(),restrict_xpaths=("//a[@class='next']",)), callback="parse_page", follow=True),)

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            yield {
                'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
            }

items.py

import scrapy


class CongressItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    member = scrapy.Field()
    state = scrapy.Field()
    District = scrapy.Field()
    party = scrapy.Field()
    served = scrapy.Field()

pipelines.py

from pymongo import MongoClient
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class CongressPipeline(object):
    collection_name= 'members'
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )
    def open_spider(self,spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    def close_spider(self, spider):
        self.client.close()
    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item

settings.py

BOT_NAME = 'congres'

SPIDER_MODULES = ['congres.spiders']
NEWSPIDER_MODULE = 'congres.spiders'

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'congres'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
   'congress.pipelines.CongresPipeline': 300,
}

The error it shows is:

Unhandled error in Deferred:
2017-07-09 11:15:33 [twisted] CRITICAL: Unhandled error in Deferred:

2017-07-09 11:15:34 [twisted] CRITICAL:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 79, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
NameError: global name 'pymongo' is not defined
Tomáš Linhart

In pipelines.py you import just MongoClient:

from pymongo import MongoClient

but in the open_spider method you use it this way:

self.client = pymongo.MongoClient(self.mongo_uri)

You get the error because pymongo itself is never imported. Change that last line to:

self.client = MongoClient(self.mongo_uri)
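
Putting it together, pipelines.py with that single change applied would look roughly like this (a sketch: everything except the open_spider line is kept as posted, and the unused scrapy.conf, DropItem and log imports are left out, which is optional):

from pymongo import MongoClient


class CongressPipeline(object):
    collection_name = 'members'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Use the imported MongoClient directly instead of pymongo.MongoClient
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each scraped item as a document in the 'members' collection
        self.db[self.collection_name].insert(dict(item))
        return item

(If you are on PyMongo 3, insert() still works but is deprecated; insert_one() is the current equivalent.)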

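Once the crawl runs cleanly, you can confirm that documents are being written with a short pymongo session (a sketch assuming the MONGO_URI, MONGO_DATABASE and collection name used above):

from pymongo import MongoClient

# Connect with the same URI and database configured in settings.py
client = MongoClient('mongodb://localhost:27017')
db = client['congres']

# Print a few of the stored member documents from the 'members' collection
for member in db['members'].find().limit(5):
    print(member)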