Python:Web爬网标签导航Wiki表

NoWaterNoKoolaid

我正在尝试使用bs4仅将NFL Qbs的'职业生涯历史'-球员参与过的球队列表-NFL Qbs的表格部分隔离出来:

我想要的输出是:

['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)']

我的代码是:

url = 'https://en.wikipedia.org/wiki/Ryan_Fitzpatrick'
table = BeautifulSoup(player_wiki.text , 'html.parser')

for tr  in table.find('tbody').find_all('ul'):
  v = [li.text for li in tr.find_all('li')]
  print(v)

当前输出:

['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)']
['Ivy League Player of the Year (2004)', 'First-team All–Ivy League (2004)', 'George H. “Bulger” Lowe Award (2004)']

我确定这是我的外循环的'ul'标签。如何缩小我的find_all()范围以防止出现不需要的数据?有小费吗?我是网络爬网的新手。

阿贾克斯1234

您可以使用soup.find_all

import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/Ryan_Fitzpatrick').text, 'html.parser')
result = [i.get_text(strip=True) for i in d.find('table', {'class':'infobox vcard'}).find_all('tr')[12].find_all('li')]

输出:

['St. Louis Rams(2005–2006)', 'Cincinnati Bengals(2007–2008)', 'Buffalo Bills(2009–2012)', 'Tennessee Titans(2013)', 'Houston Texans(2014)', 'New York Jets(2015–2016)', 'Tampa Bay Buccaneers(2017–2018)', 'Miami Dolphins(2019–present)']

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章