Extracting links from a website

Sanaz

I want to go to http://www.medhelp.org/forums/list, which has links to forums for many different diseases. Inside each of those links there are several pages, and each page contains some of the links I want.

To collect those links I used this code:

import urllib.request
from bs4 import BeautifulSoup as bs

myArray = []
html_page = urllib.request.urlopen("http://www.medhelp.org/forums/list")
soup = bs(html_page)
temp = soup.findAll('div', attrs={'class': 'forums_link'})
for div in temp:
  myArray.append('http://www.medhelp.org' + div.a['href'])
myArray_for_questions = []
myPages = []

# this loop goes over all links on the main page, in this case all diseases
for link in myArray:

  # "link" is the URL of one disease forum linked from the main page
  html_page = urllib.request.urlopen(link)
  soup1 = bs(html_page)

  # getting the question links on the first page of this forum
  temp = soup1.findAll('div', attrs={'class': 'subject_summary'})
  for div in temp:
     myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

  # now getting the URLs of all the other pages of this forum
  pages = soup1.findAll('a', href=True, attrs={'class': 'page_nav'})
  for l in pages:
    html_page_t = urllib.request.urlopen('http://www.medhelp.org' + l.get('href'))
    soup_t = bs(html_page_t)
    other_pages = soup_t.findAll('a', href=True, attrs={'class': 'page_nav'})
    for p in other_pages:
        mystr = 'http://www.medhelp.org' + p.get('href')
        if mystr not in myPages:
            myPages.append(mystr)
        if p not in pages:
            pages.append(p)

  # getting all links inside these pages, which are people's questions
  for page in myPages:
      html_page1 = urllib.request.urlopen(page)
      soup2 = bs(html_page1)
      temp = soup2.findAll('div', attrs={'class': 'subject_summary'})
      for div in temp:
        myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

But it takes a very long time to get all of the links I want from all of the pages. Any ideas?

Thanks

Yaze

Try the Scrapy tutorial and work through it, substituting your own web pages for the ones used in its examples:

https://doc.scrapy.org/en/latest/intro/tutorial.html
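
In case it helps, here is a rough sketch of what a spider along the tutorial's lines might look like for this site. It only assumes the same CSS classes that appear in the question's code (forums_link, subject_summary, page_nav); the spider name and the question_url output field are placeholders chosen for illustration, and the sketch has not been tested against the live site.

import scrapy


class MedHelpSpider(scrapy.Spider):
    # The spider name is a placeholder chosen for this sketch.
    name = "medhelp_questions"
    start_urls = ["http://www.medhelp.org/forums/list"]

    def parse(self, response):
        # Follow every disease forum linked from the main list page.
        for href in response.css("div.forums_link a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

    def parse_forum(self, response):
        # Emit the question links found on the current forum page.
        for href in response.css("div.subject_summary a::attr(href)").getall():
            yield {"question_url": response.urljoin(href)}
        # Follow the pagination links; Scrapy filters duplicate requests by
        # default, so pages reachable from several places are fetched once.
        for href in response.css("a.page_nav::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

You could run something like this with scrapy runspider medhelp_spider.py -o questions.json (using whatever file name you save the spider as). Scrapy downloads pages concurrently and de-duplicates requests for you, which is the main reason it should be much faster than the sequential urlopen loop in the question.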
