I am trying to use bs4 to isolate only the 'Career history' section of the table for NFL QBs, i.e. the list of teams the player has played for:
My desired output is:
['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)']
My code is:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Ryan_Fitzpatrick'
player_wiki = requests.get(url)
table = BeautifulSoup(player_wiki.text, 'html.parser')
for tr in table.find('tbody').find_all('ul'):
    v = [li.text for li in tr.find_all('li')]
    print(v)
Current output:
['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)', 'Buffalo Bills (2009–2012)', 'Tennessee Titans (2013)', 'Houston Texans (2014)', 'New York Jets (2015–2016)', 'Tampa Bay Buccaneers (2017–2018)', 'Miami Dolphins (2019–present)']
['Ivy League Player of the Year (2004)', 'First-team All–Ivy League (2004)', 'George H. “Bulger” Lowe Award (2004)']
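I'm fairly sure the problem is the 'ul' tag in my outer loop. How can I narrow my find_all() so it doesn't pick up the unwanted data? Any tips? I'm new to web scraping.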
You can use find_all, scoped to the infobox table and to a single row of it:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/Ryan_Fitzpatrick').text, 'html.parser')
result = [i.get_text(strip=True) for i in d.find('table', {'class':'infobox vcard'}).find_all('tr')[12].find_all('li')]
Output:
['St. Louis Rams(2005–2006)', 'Cincinnati Bengals(2007–2008)', 'Buffalo Bills(2009–2012)', 'Tennessee Titans(2013)', 'Houston Texans(2014)', 'New York Jets(2015–2016)', 'Tampa Bay Buccaneers(2017–2018)', 'Miami Dolphins(2019–present)']
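The row index 12 above is tied to this one page, and get_text(strip=True) drops the space before the years. As a possible alternative, here is a minimal sketch that looks the row up by its header text instead. It assumes the infobox has a th cell reading exactly "Career history" and that the team list sits in the next row containing li items; that layout is an assumption about the page markup and may change. The names career_history and SAMPLE_HTML are hypothetical, and the sample markup is a simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

def career_history(html):
    """Return the list items from the row that follows the 'Career history' header."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: the section header is a <th> whose text is exactly "Career history".
    header = soup.find(
        lambda tag: tag.name == 'th'
        and tag.get_text(strip=True) == 'Career history'
    )
    if header is None:
        return []
    # Walk the rows after the header row and take the first one that holds a list.
    for row in header.find_parent('tr').find_next_siblings('tr'):
        items = row.find_all('li')
        if items:
            # Collapse runs of whitespace but keep the space before the years.
            return [' '.join(li.get_text().split()) for li in items]
    return []

# Simplified stand-in for the infobox markup (hypothetical sample, not the real page).
SAMPLE_HTML = '''
<table class="infobox vcard"><tbody>
<tr><th colspan="2">Career history</th></tr>
<tr><td colspan="2"><ul>
<li><a>St. Louis Rams</a> (2005–2006)</li>
<li><a>Cincinnati Bengals</a> (2007–2008)</li>
</ul></td></tr>
<tr><th colspan="2">Career highlights and awards</th></tr>
<tr><td colspan="2"><ul><li>Ivy League Player of the Year (2004)</li></ul></td></tr>
</tbody></table>
'''

print(career_history(SAMPLE_HTML))
# prints ['St. Louis Rams (2005–2006)', 'Cincinnati Bengals (2007–2008)']
```

To try it on the live page, pass requests.get(url).text as the html argument. If the header text or row layout differs on a given player's page, the function returns an empty list rather than the wrong rows.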