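I'm trying to scrape a web page to find the link to a specific product using part of the product name.
Below is the portion of the HTML I want to extract the information from: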
<article class='product' data-json-url='/en/GB/men/products/omia066s188000161001.json' id='product_24793' itemscope='' itemtype='http://schema.org/Product'>
<header>
<h3>OMIA066S188000161001</h3>
</header>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001"><span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
<span content='OMIA066S188000161001' itemProp='model' style='display:none'></span>
<figure>
<img itemProp="image" alt="OMIA066S188000161001 image" class="top" src="https://cdn.off---white.com/images/156374/product_OMIA066S188000161001_1.jpg?1498806560" />
<figcaption>
<div class='brand-name'>
HIGH 3.0 SNEAKER
</div>
<div class='category-and-season'>
<span class='category'>Shoes</span>
</div>
<div class='price' itemProp='offers' itemscope='' itemtype='http://schema.org/Offer'>
<span content='530.0' itemProp='price'>
<strong>£ 530</strong>
</span>
<span content='GBP' itemProp='priceCurrency'></span>
</div>
<div class='size-box js-size-box'>
<!-- / .available-size -->
<!-- / = render 'availability', product: product -->
<div class='sizes'></div>
</div>
</figcaption>
</figure>
</a></article>
My code is as follows:
import requests
from bs4 import BeautifulSoup
item_to_find = 'off white shoes'
s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
#find_url = soup.find("a", {"content":item_to_find})['href']
#print(find_url)
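How can I filter only the elements whose "content" contains item_to_find, and then extract that product's "href"?
The final output should look like this: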
/en/GB/men/products/omia066s188000161001
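Try this.
import requests
from bs4 import BeautifulSoup

item_to_find = 'off white shoes'
s = requests.Session()
r = s.get('https://www.off---white.com/en/GB/section/new-arrivals.js')
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all("a")
for link in links:
    # decode_contents() returns the link's inner markup as a str;
    # compare case-insensitively so 'off white shoes' matches 'OFF WHITE Shoes'
    if item_to_find.lower() in link.decode_contents().lower():
        print(link.get('href'))
Because the text "OFF WHITE Shoes" sits in a span inside the link, we can use decode_contents() (the str counterpart of encode_contents(), which returns bytes in Python 3) to check all the markup inside each link. If the text we're searching for is present, we can get the link with BeautifulSoup's .get method.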
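As an alternative to scanning each link's raw markup, you could match on the span's content attribute directly and then walk up to the enclosing link. The following is only a minimal sketch: SAMPLE_HTML is a trimmed copy of the markup from the question (used instead of a live request), and find_product_links is a hypothetical helper name, not something from BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, used here instead of a live request.
SAMPLE_HTML = """
<article class='product'>
<a itemProp="url" href="/en/GB/men/products/omia066s188000161001">
<span content='OFF WHITE Shoes OMIA066S188000161001' itemProp='name' style='display:none'></span>
<span content='OFF WHITE' itemProp='brand' style='display:none'></span>
</a></article>
"""


def find_product_links(html, item_to_find):
    """Hypothetical helper: hrefs of <a> tags wrapping a span whose
    content attribute contains item_to_find (case-insensitive)."""
    needle = item_to_find.lower()
    soup = BeautifulSoup(html, 'html.parser')
    # A callable attribute filter is given the attribute value; guard against None.
    spans = soup.find_all(
        'span',
        attrs={'content': lambda v: v is not None and needle in v.lower()},
    )
    hrefs = []
    for span in spans:
        anchor = span.find_parent('a')  # nearest enclosing <a>, if any
        if anchor is not None and anchor.get('href') and anchor['href'] not in hrefs:
            hrefs.append(anchor['href'])
    return hrefs


print(find_product_links(SAMPLE_HTML, 'off white shoes'))
```

With the sample markup this should print a list containing only /en/GB/men/products/omia066s188000161001. Matching on the attribute rather than the whole inner markup avoids false hits from image URLs or visible text, though whether that matters depends on the real page.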