Web scraping with Python to access table data

Spencer Sprouse

So I'm trying to do some web scraping of this site, http://www.killedbypolice.net/kbp2013.html, with Beautiful Soup and access the data in the table. My current code is:

url = "http://www.killedbypolice.net/kbp2013.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html.parser")

data = soup.find_all('table')
data[0]

But... I'm getting a maximum recursion depth RuntimeError. I'm also not sure how to access the "td" fields of the table, which is where the data is stored. Thanks.

Padraic Cunningham

The error is because the HTML is very badly formatted; you get RuntimeError: maximum recursion depth exceeded creating the soup object with both lxml and html.parser. The only parser that works at all is html5lib:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.killedbypolice.net/kbp2013.html").content
soup = BeautifulSoup(html, "html5lib")

# grab the first table
table = soup.find("table")

That gets you the table:

    <table background="/kbp/bg1.jpg" border="1" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr><td><img src="http://www.killedbypolice.net/size.jpg" width="185"/><br/><b><center># since Jan 1st '14</center></b></td>
<td><b><center>St.</center></b></td>
<td><b><center>g/r</center></b></td>
<td><img src="http://www.killedbypolice.net/size.jpg" width="200"/><b><center>Name, Age</center></b></td>
<td></td>
<td><img src="http://www.killedbypolice.net/size.jpg" width="297"/><b><center>KBP link <font color="red">(plus extensive follow-ups)</font></center></b></td>
<td><b><center>News link</center></b></td>
......................................................................
</font></a></td></tr><tr><td><center>(2) May 2, 2013        </center>
</td><td>CA</td><td>M/B</td><td>Kenneth Bernard Williams, 55   </td><td><font size="2">G</font></td><td><a href="http://facebook.com/KilledByPolice/posts/622539181107556" target="new"><font size="2"></font></a><font size="2"><center><a href="http://facebook.com/KilledByPolice/posts/622539181107556" target="new">facebook.com/KilledByPolice/posts/622539181107556  </a></center></font></td><td><a href="http://www.nbclosangeles.com/news/local/Police-Shoot-Kill-Suspect-in-Skid-Row-Prompting-Angry-Crowd-to-Gather-205646861.html" target="new"><font size="2">http://www.nbclosangeles.com/news/local/Police-Shoot-Kill-Suspect-in-Skid-Row-Prompting-Angry-Crowd-to-Gather-205646861.html
</font></a></td></tr><tr><td><center>(1) May 1, 2013        </center></td><td>MI</td><td>M/B</td><td><a href="http://www.killedbypolice.net/victims/2261.jpg" target="new">Jordan West-Morson, 26   </a></td><td><font size="2">G</font></td><td><a href="http://facebook.com/KilledByPolice/posts/1033800406648096" target="new"></a><center><a href="http://facebook.com/KilledByPolice/posts/1033800406648096" target="new"><font size="2">facebook.com/KilledByPolice/posts/1033800406648096    </font></a></center></td><td><a href="http://www.mlive.com/news/detroit/index.ssf/2013/09/detroit_transit_officer_charge.html" target="new"><font size="2">http://www.mlive.com/news/detroit/index.ssf/2013/09/detroit_transit_officer_charge.html</font></a><font size="2"><br/><i>Detroit transit officer not guilty in fatal shooting: </i><a href="http://www.clickondetroit.com/news/detroit-transit-officer-not-guilty-in-fatal-shooting/32405878" target="new">http://www.clickondetroit.com/news/detroit-transit-officer-not-guilty-in-fatal-shooting/32405878

</a></font></td></tr></tbody></table>

But then a simple call to find_all:

print(table.find_all("tr"))

gives you:

 AttributeError: 'NoneType' object has no attribute 'next_element'

The html is just a mess; unfortunately I don't see an easy way to parse it with bs4. This may be one of those rare cases where you need to use some regex.
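As a rough illustration of that regex route (this sketch is not from the original answer, so treat the patterns as illustrative rather than tested against every row of this page), you could pull the text out of each td cell straight from the raw markup:

import re
import requests

html = requests.get("http://www.killedbypolice.net/kbp2013.html").text

# split into rows on the opening <tr> tags, then capture everything between
# <td ...> and </td>; re.DOTALL lets the pattern span newlines in the markup
for row in re.split(r"<tr[^>]*>", html):
    cells = re.findall(r"<td[^>]*>(.*?)</td>", row, re.DOTALL)
    if cells:
        # strip any remaining inner tags and collapse whitespace in each cell
        cleaned = [re.sub(r"<[^>]+>", " ", c) for c in cells]
        print([" ".join(c.split()) for c in cleaned])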
