I am using the following code to get values from a site:
import scrapy


class scraping(scrapy.Spider):
    name = 'NewsSpider'
    start_urls = ['https://www.uol.com.br/']

    def parse(self, response):
        news = response.xpath('//article')
        for n in news:
            print({
                'Link': n.xpath("//a[@class='hyperlink headlineSub__link']").get(),
                'Title': n.xpath('//a/div/h3/text()').get(),
            })
For "Link" I am getting a lot of information, but I only want the link inside the href. Is it possible to get only that?
I have a sample that does this very same thing. You should use a selector like this one:
.css('a[href*=topic]::attr(href)')
The a tag in my case was something like <a ... href="topic/1321343">something</a>.
The key is a::attr(href).
Parse your response, narrow the selection down as far as you can, and then get the href value you want.
This is my solution from a project for scraping Microsoft Academia articles. The linked line gets the items in the "Related Topics" section.
Here is another example:
<span class="title">
<a href="https://www.example.com"></a>
</span>
Parse it with:
Link = Link1.css('span.title a::attr(href)').extract()[0]