I am trying to scrape the items starting with "A" from this page, using the following code.
import scrapy
from scrapy.selector import Selector
from ..items import RozeepkItem

class JobcatsSpider(scrapy.Spider):
    name = 'jobcats'
    allowed_domains = ['www.rozee.pk']
    start_urls = ['https://www.rozee.pk/jobs-by-industry']

    def parse(self, response):
        items = RozeepkItem()
        for job_cat in Selector(response).xpath("//div[@class = 'boxb job-dtl sitemap']"):
            category_title = job_cat.xpath(".//div[@id = 'A-block']/div[@class = 'row']/ul/li/a/@title").get()
            url = job_cat.xpath(".//div[@id = 'A-block']/div[@class = 'row']/ul/li/a/@href").get()
            items['job_category'] = category_title
            items['url_str'] = url
            yield items
Here is items.py:
import scrapy

class RozeepkItem(scrapy.Item):
    job_category = scrapy.Field()
    url_str = scrapy.Field()
It gives this output:
2020-06-27 06:55:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rozee.pk/jobs-by-industry>
{'job_category': 'Accounting Jobs in Pakistan',
'url_str': '//www.rozee.pk/search/accounting-jobs-in-pakistan'}
2020-06-27 06:55:00 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-27 06:55:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16977,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.227556,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 27, 1, 55, 0, 379725),
'item_scraped_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 6, 27, 1, 54, 58, 152169)}
2020-06-27 06:55:00 [scrapy.core.engine] INFO: Spider closed (finished)
As you can see, I only get one item and its corresponding link. On the other hand, if I try the same XPath in the browser, I get all of the entries, as shown in the screenshot below.
Can someone point out where I went wrong? Thanks.
From the documentation:

.get() always returns a single result; if there are multiple matches, the content of the first match is returned; if there are no matches, None is returned. .getall() returns a list with all results.
So use .getall() instead of .get() in your code.