Scraping absolute URLs instead of relative paths in Python


user7800892:

I am trying to get all the hrefs from a page's HTML and store them in a list for later processing. For example:

Example URL: www.example-page-xl.com

 <body>
    <section>
    <a href="/helloworld/index.php"> Hello World </a>
    </section>
 </body>

I am using the following code to list the hrefs:

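import bs4 as bs
import urllib.request

# Fetch the page and parse it with the lxml parser
sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Restrict the search to the first <section> element
section = soup.section

# Print the href attribute of every <a> tag in the section
for url in section.find_all('a'):
    print(url.get('href'))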

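However, I want to store the URL as www.example-page-xl.com/helloworld/index.php, not just the relative path /helloworld/index.php.

I would rather not append or join the URL to the relative path myself, because dynamic links may come out differently when I join the URL and the relative path.

In short, I want to scrape the absolute URL directly, rather than scraping the relative path on its own (and without joining).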

Somil:

In this case, urlparse.urljoin can help (in Python 3 it lives at urllib.parse.urljoin). You should modify your code like this:

import bs4 as bs
import urllib.request
from urllib.parse import urljoin  # Python 3; in Python 2 this was urlparse.urljoin

web_url = 'https://www.example-page-xl.com'
sauce = urllib.request.urlopen(web_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

# Resolve each href against the page URL to get an absolute URL
for url in section.find_all('a'):
    print(urljoin(web_url, url.get('href')))

Here urljoin handles both absolute and relative paths.
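As a rough illustration of that behavior, here is a minimal sketch that needs only the standard library. The helper name to_absolute, the base URL with its /section/page.html path, and the sample hrefs are all hypothetical and chosen just for the demonstration. The HTML itself contains only the relative path, so resolving it against the page URL is the same step a browser performs when a link is followed.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url and return the results as a list.

    to_absolute is a hypothetical helper for this example, not part of any library.
    """
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical page URL, based on the example domain from the question.
base_url = "https://www.example-page-xl.com/section/page.html"

# Sample hrefs (assumed for illustration): root-relative,
# document-relative, and already absolute.
hrefs = [
    "/helloworld/index.php",
    "other.php",
    "https://other.example.org/a",
]

absolute_urls = to_absolute(base_url, hrefs)
for absolute in absolute_urls:
    print(absolute)
# https://www.example-page-xl.com/helloworld/index.php
# https://www.example-page-xl.com/section/other.php
# https://other.example.org/a
```

The root-relative path replaces the whole path of the base URL, the document-relative path is resolved against the base URL's directory, and the already-absolute URL comes back unchanged, so the same call is safe to apply to every href on the page.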
