Python Beautiful Soup: how do I extract the text next to a tag?

挥发物3

I have the following HTML:

<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b> 
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>

I need to separate each label from its text, e.g. Mother: Diane, and so on.

In the end I want a list of dictionaries like this:

[{"label":"Mother","value":"Diane"}]

I am trying the following, but it does not work:

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Dmitriy Fialkovskiy

from bs4 import BeautifulSoup
from urllib.request import urlopen

#html = '''<p>
#<b>Father:</b> Michael Haughton
#<br>
#<b>Mother:</b> Diane
#<br><b>Brother:</b> 
#Rashad Haughton<br>
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''

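page = urlopen('http://www.nndb.com/people/742/000024670/')
source = page.read()

# Name the parser explicitly (the question uses 'lxml'); the index below
# depends on the parser and on the current layout of the live page.
soup = BeautifulSoup(source, 'lxml')

# The family details sat in the ninth <p> when this was written.
needed_p = soup.find_all('p')[8]

bs = needed_p.find_all('b')

res = {}

for b in bs:
    sibling = b.next_sibling
    # Plain text right after the label, e.g. "<b>Mother:</b> Diane".
    text = sibling.strip() if isinstance(sibling, str) else ''
    if not text:
        # Otherwise the value is a link, e.g. "<b>Husband:</b> <a>R. Kelly</a>".
        link = b.find_next_sibling()
        text = link.text.strip() if link is not None and link.name == 'a' else ''
    if text:
        res[b.text] = text

res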

Output:

{'Brother:': 'Rashad Haughton',
 'Mother:': 'Diane',
 'Husband:': 'R. Kelly',
 'Father:': 'Michael Haughton',
 'Boyfriend:': 'Damon Dash'}
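
The result above is a single dict keyed by label and keeps only the first piece of each value. As a minimal sketch of the list-of-dictionaries shape the question asks for, the snippet below walks the siblings after each `<b>` until the next `<br>` or `<b>`, so trailing notes such as "(m. 1994, annulled that same year)" are kept. The helper name `collect_rows` is hypothetical, and the built-in `html.parser` is assumed so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup copied from the question.
html = '''<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b> 
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''


def collect_rows(paragraph):
    """Return [{'label': ..., 'value': ...}] for each <b> label in the paragraph."""
    rows = []
    for b in paragraph.find_all('b'):
        parts = []
        for node in b.next_siblings:
            is_tag = isinstance(node, Tag)
            # A <br> or the next <b> ends the current value.
            if is_tag and node.name in ('b', 'br'):
                break
            parts.append(node.get_text() if is_tag else str(node))
        # Collapse newlines and repeated spaces into single spaces.
        value = ' '.join(''.join(parts).split())
        label = b.get_text(strip=True).rstrip(':')
        rows.append({'label': label, 'value': value})
    return rows


soup = BeautifulSoup(html, 'html.parser')
rows = collect_rows(soup.p)
for row in rows:
    print(row)
```

Dropping the trailing colon from the label is a choice made here to match the `{"label":"Mother","value":"Diane"}` example in the question; remove the `rstrip(':')` call to keep it.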

Edit: for the additional information at the top of the page:

... (code above) ...
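soup = BeautifulSoup(source, 'lxml')

paragraphs = soup.find_all('p')
# Explicitly select the <p> tags to parse: the profile block at the top
# of the page plus the family paragraph.
needed_p = paragraphs[1:4] + [paragraphs[8]]

res = {}

for p in needed_p:
    for b in p.find_all('b'):
        sibling = b.next_sibling
        # Prefer plain text right after the label; fall back to a link.
        text = sibling.strip() if isinstance(sibling, str) else ''
        if not text:
            link = b.find_next_sibling()
            text = link.text.strip() if link is not None and link.name == 'a' else ''
        if text:
            res[b.text] = text

res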

Output:

{'Race or Ethnicity:': 'Black',
 'Husband:': 'R. Kelly',
 'Died:': '25-Aug',
 'Nationality:': 'United States',
 'Executive summary:': 'R&B singer, died in plane crash',
 'Mother:': 'Diane',
 'Birthplace:': 'Brooklyn, NY',
 'Born:': '16-Jan',
 'Boyfriend:': 'Damon Dash',
 'Sexual orientation:': 'Straight',
 'Occupation:': 'Singer',
 'Cause of death:': 'Accident - Airplane',
 'Brother:': 'Rashad Haughton',
 'Remains:': 'Interred,',
 'Gender:': 'Female',
 'Father:': 'Michael Haughton',
 'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}
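
Hard-coded positions such as `[8]` or `[1:4]` stop working as soon as the page gains or loses a paragraph. One possible alternative, sketched here under assumptions, is to locate a paragraph by a label it is known to contain. The helper `paragraph_with_label` is hypothetical, and the markup is a reduced stand-in built from values shown above, not the real page structure.

```python
from bs4 import BeautifulSoup

# Reduced stand-in markup (an assumption, not the real page layout).
html = '''<div>
<p><b>Gender:</b> Female<br><b>Occupation:</b> Singer<br></p>
<p><b>Father:</b> Michael Haughton<br><b>Mother:</b> Diane<br></p>
</div>'''


def paragraph_with_label(soup, label):
    """Return the first <p> containing a <b> whose text equals `label`, else None."""
    for b in soup.find_all('b'):
        if b.get_text(strip=True) == label:
            return b.find_parent('p')
    return None


soup = BeautifulSoup(html, 'html.parser')
family_p = paragraph_with_label(soup, 'Father:')
labels = [b.get_text() for b in family_p.find_all('b')]
print(labels)
```

The returned paragraph can then be passed to the same `<b>` loop as before, with the `None` case handled for pages that lack the label.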

For this page you can also scrape, for example, the high school like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()
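
Note that `split(':')[1]` keeps only the text between the first and second colon, so a value that itself contains a colon is cut short. A small sketch of a more forgiving variant using `str.partition` follows; the function name `value_after_label` and the sample string are hypothetical and serve only to show the difference.

```python
def value_after_label(text):
    """Return the text after the first colon, or '' if there is no colon."""
    _label, sep, value = text.partition(':')
    return value.strip() if sep else ''


# Hypothetical paragraph text, used only to show the difference.
sample = 'Label: first part: second part'

print(sample.split(':')[1].strip())   # first part
print(value_after_label(sample))      # first part: second part
```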


Edited on 2020-11-27
