I created a spider to extract data from a site, for example https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21
Here is my code:
import scrapy
from totoprintasp.items import Result


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3

    def parse(self, response):
        # print(response.body)
        items = []
        for each in response.xpath("/html/body/div/center/table/tbody"):
            item = Result()
            drawDate = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[1]/span/font/b/text()").extract()
            drawNo = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[2]/span/b/font/text()").extract()
            gameType = each.xpath(
                "tr[4]/td/span/font/text()").extract()
            firstPrize = each.xpath(
                "tr[5]/td/table[1]/tbody/tr[2]/td[1]/span/b/font/text()").extract()
            item['drawDate'] = drawDate
            item['drawNo'] = drawNo
            item['gameType'] = gameType
            item['firstPrize'] = firstPrize
            items.append(item)
            yield item
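It does not extract anything. I run it with the command scrapy runspider totoprint.py and have set the values
FEED_URI = 'results.json'
FEED_FORMAT = 'json'
in my settings.py file, so the results should be written to a JSON file.
Oddly, nothing shows up and nothing gets extracted. I tried different variations, even changing .extract() to .get().
The XPath itself works, because I tested it with the XPath Helper extension in my Chrome browser.
I would appreciate any help or suggestions.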
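I rewrote your script, but you will have to adapt it to your own project. The problem is that your XPath looks for a single tbody and a single child under it, while the page contains many tbody elements.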
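To illustrate why a position-based path is brittle, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, only so that it runs without a crawl, and the markup is a hypothetical, heavily simplified stand-in for the real page. Browsers typically insert tbody elements into tables when they build the DOM, so a path copied from Chrome can contain steps that are missing from the HTML a crawler downloads. Whether that is what happens on this particular page is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and differs in detail.
SAMPLE = """
<html><body>
  <table>
    <tr>
      <td><span class="dataDD">Date:30/05/2021</span></td>
      <td><span class="dataDD">DrawNo. 5291/21</span></td>
    </tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Position-based path with a tbody step, as a browser tool might report it.
# This markup has no tbody element, so nothing matches.
positional = root.findall("./body/table/tbody/tr/td/span")

# Attribute-based search anywhere in the tree, independent of nesting depth.
by_class = [el.text for el in root.findall(".//*[@class='dataDD']")]

print(len(positional))
print(by_class)
```

The class-based lookup keeps working when the nesting changes, which is why the rewritten spider selects by class rather than by position.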
As I understand it, you want gameType as a list and the other fields as strings. I get the following output:
|------------------|-----------------|----------------------------------------|------------|
| drawDate | drawNo | gameType | firstPrize |
|------------------|-----------------|----------------------------------------|------------|
| Date:30/05/2021 | DrawNo. 5291/21 | TOTO 4D,TOTO 4D ZODIAC,TOTO 5D,TOTO 6D | 4800 |
|------------------|-----------------|----------------------------------------|------------|
By the way, you do not need a for loop for each URL. Scrapy calls parse once for every URL in start_urls. So here is the script:
import scrapy


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    # allowed_domains expects bare domain names, not paths.
    allowed_domains = ['www.sportstoto.com.my']
    start_urls = generate_start_urls()
    download_delay = 3
    custom_settings = {
        # Turn off the robots.txt rule, because the site does not let you in otherwise.
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Both values share the same class, so they can be scraped together.
        drawDate, drawNo = response.xpath('//*[@class="dataDD"]//text()').extract()
        gameType = response.xpath('//*[@class="tit4D"]//text()').extract()
        # Your script only wants the first prize, hence the [1] in the XPath.
        firstPrize = response.xpath('(//*[@class="dataResultA"])[1]//text()').get()
        yield {
            # The text contains stray tab, newline and carriage-return characters; strip them.
            'drawDate': drawDate.replace("\t", "").replace("\n", "").replace("\r", ""),
            'drawNo': drawNo.replace("\t", "").replace("\n", "").replace("\r", ""),
            'gameType': gameType,
            'firstPrize': firstPrize,
        }
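The chained replace calls in the script could also be folded into a small helper. This is only a sketch of one alternative: the name clean and the sample string raw are made up for illustration. Unlike the replace chain, it also collapses runs of spaces into a single space and trims both ends, which may or may not be what you want.

```python
def clean(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace, including
    # tabs, newlines and carriage returns, and drops empty pieces.
    return " ".join(text.split())


# Hypothetical raw value, shaped like the text nodes that needed cleaning.
raw = "\r\n\t\tDate:30/05/2021\r\n\t"
print(clean(raw))
```

In the spider you would then write 'drawDate': clean(drawDate) and 'drawNo': clean(drawNo).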
I think the script I wrote does what you want.
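On the point that parse is called once per start URL: to scrape several draws, it should be enough to put more draw numbers into the list. A small sketch, in which the draw_nums parameter and the second draw number are assumptions added for illustration:

```python
BASE_URL = "https://www.sportstoto.com.my/result_print.asp?drawNo={}"


def generate_start_urls(draw_nums):
    """Build one result-page URL per draw number."""
    return [BASE_URL.format(num) for num in draw_nums]


# '5292/21' is a hypothetical second draw number, used only as an example.
urls = generate_start_urls(["5291/21", "5292/21"])
for url in urls:
    print(url)
```

Assigning such a list to start_urls makes Scrapy request each URL and pass each response to parse separately, so parse itself needs no loop over URLs.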