Scraping an IMDb page with BeautifulSoup

Patrick Lee

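I am new to web scraping, Python, and BeautifulSoup, and I am having trouble getting my code to work.

I want to scrape the URL http://m.imdb.com/feature/bornondate to get:

  • Celebrity name
  • Celebrity image
  • Profession
  • Best work

for the 10 celebrities on that page. I am not sure what I am doing wrong.

Here is my code: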

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

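Currently, nothing is printed. Also, if this is too vague, could you explain how to extract just one field, for example the celebrity name?

In case it helps, here is the HTML for the first celebrity: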

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/[email protected]_V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
etc.

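First of all, the IMDb Conditions of Use explicitly prohibit screen scraping:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

Consider exploring the IMDb JSON API instead of scraping the web pages.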


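Your current problem is that the list of people born on a given date is loaded by a separate call to the IMDb API, driven by the page's JavaScript. The HTML that urllib2 downloads therefore does not contain the poster links yet, so the loop has nothing to iterate over.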
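To separate the parsing logic from the loading problem, here is a minimal sketch that runs BeautifulSoup against markup shaped like the snippet in the question. The image URL is a hypothetical placeholder and `parse_posters` is an illustrative helper name, not something from the original code. On a page where the posters have not been rendered, the same function returns an empty list.

```python
from bs4 import BeautifulSoup

# Markup shaped like the snippet in the question. The image URL is a
# hypothetical placeholder, not a real IMDb address.
SAMPLE_HTML = '''
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster "><img src="http://example.com/images/M/PLACEHOLDER._V1._CR0,0,1369,2019_SX40_SY59.jpg" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
</section>
'''


def parse_posters(html):
    """Return one dict per <a class="poster"> found inside the posters section."""
    soup = BeautifulSoup(html, 'html.parser')
    section = soup.find('section', class_='posters')
    if section is None:
        # This is the case when the list has not been rendered into the HTML.
        return []
    people = []
    for a in section.find_all('a', class_='poster'):
        src = a.find('img')['src']
        people.append({
            'name': a.find('span', class_='title').get_text(strip=True),
            'detail': a.find('div', class_='detail').get_text(strip=True),
            # Same URL rewrite as in the question: swap the thumbnail suffix.
            'image': src.split('._V1.')[0] + '._V1_SX214_AL_.jpg',
        })
    return people


people = parse_posters(SAMPLE_HTML)
```

With the sample markup, `people[0]['name']` holds the celebrity name from the `span.title` element, which is the field the question asks about.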

The easiest option now is to switch to the selenium browser automation tool. Here is a working example using the headless PhantomJS browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

http://ia.media-imdb.com/images/M/[email protected]_V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/[email protected]_V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/[email protected]@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/[email protected]@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/[email protected]@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/[email protected]@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/[email protected]@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/[email protected]@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/[email protected]_V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
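The `div.detail` text in the output above combines the profession and the best-known work in one string, such as `Actor, "Ozymandias"`. Here is a minimal sketch of separating the two, assuming the text always has the form `Profession, "Title"`. `split_detail` is a hypothetical helper name, not part of selenium or BeautifulSoup.

```python
def split_detail(detail):
    """Split 'Actor, "Ozymandias"' into ('Actor', 'Ozymandias').

    Only the first ', ' is used as the separator, so commas inside the
    title are preserved. If there is no separator, the second item is ''.
    """
    profession, _, best_work = detail.partition(', ')
    return profession.strip(), best_work.strip().strip('"')


profession, best_work = split_detail('Actor, "Ozymandias"')
```

Inside the loop, this could be called as `profession, best_work = split_detail(person)` to fill the two fields the question left commented out.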
