Web scraping with BS4: how to limit the range that is searched

鲨鱼

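I'm trying to scrape the "Events" section of this Wikipedia page: https://en.wikipedia.org/wiki/2020. The page's HTML isn't the easiest to navigate, because most of the tags are siblings rather than nested.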

I want to make sure that the only data I scrape lies between the two h2 tags shown below.
Here is the condensed HTML:

<h2>                  #I ONLY WANT TO SEARCH BETWEEN HERE
    <span id="Events">Events</span>
</h2>
<h3>...</h3>
<ul>...</ul>
<h3>...</h3>
<ul>
    <li>
        <a title="June 17">...</a>   #My code below is looking for this, if not found it jumps to another section
    </li>
</ul>
<h3>...</h3>
<ul>...</ul>
<h2>                 #AND HERE. DON'T WANT TO GO PAST HERE
    <span id="Predicted_and_scheduled_events">Predicted_and_scheduled_events</span>
</h2>

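In case it isn't clear, every tag (except the spans) is a sibling. My code currently works if the date exists between the two h2 tags, but if the date doesn't exist there, it goes to another part of the page to pull data, which I don't want.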

Here is my code:

import sys
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"lxml")
todaysNews = soup.find('a', {"title": "June 17"}) #goes to date's stories
简单

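BS has many useful functions and parameters. It is worth reading the whole documentation.

It has functions to get the parent element, the next sibling elements, elements that have a title attribute with any value, and so on.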
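As a quick illustration of those three features, here is a minimal sketch run on a small, made-up HTML string rather than the real page. The HTML content is an assumption for demonstration only, and it uses the built-in 'html.parser' so it runs without lxml installed.

```python
import bs4

# Hypothetical, simplified HTML standing in for the real page.
html = (
    '<h2><span id="Events">Events</span></h2>'
    '<ul><li><a title="June 16" href="#">June 16</a></li></ul>'
    '<h2><span id="Other">Other</span></h2>'
)

soup = bs4.BeautifulSoup(html, 'html.parser')

# `.parent` goes from the <span> up to the enclosing <h2>
span = soup.find('span', {'id': 'Events'})
heading = span.parent
print(heading.name)

# `.next_siblings` yields every element that follows the <h2> at the same level
sibling_names = [sib.name for sib in heading.next_siblings]
print(sibling_names)

# {'title': True} matches an <a> that has a title attribute with any value
link = soup.find('a', {'title': True})
print(link['title'])
```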


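First I search for <span id="Events">Events</span>, then I get its parent element <h2>, which gives me the start of the data.

Next, I can get next_siblings and run through them in a for loop until I reach an item whose name is h2, which marks the end of the data.

Inside the for loop I can check every ul element and search only its direct li elements, skipping nested li elements (recursive=False). Inside each li I can get the first a that has a title with any text ({"title": True}).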

import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'lxml')

# found start of data `h2`
start = soup.find('span', {'id': 'Events'}).parent

# check sibling items
for item in start.next_siblings:

    # found end of data `h2`
    if item.name == 'h2': 
        break

    if item.name == 'ul':

        # only direct `li` without nested `li`
        for li in item.find_all('li', recursive=False): 

            # `a` which have `title`
            a = li.find('a', {'title': True}) 

            if a:
                print(a['title'])

Result:

January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
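To look up one specific date and get nothing back when it is missing from the section, the same loop can be wrapped in a function. This is only a sketch: the function name find_link_in_section and the sample HTML are hypothetical, and it uses 'html.parser' so it runs without lxml.

```python
import bs4

def find_link_in_section(soup, section_id, title):
    """Return the first <a> with the given title that sits between the
    section's <h2> and the next <h2>, or None if there is no such link."""
    span = soup.find('span', {'id': section_id})
    if span is None:
        return None

    for item in span.parent.next_siblings:
        # the next <h2> marks the end of the section
        if item.name == 'h2':
            break
        if item.name == 'ul':
            # only direct <li> children, not nested ones
            for li in item.find_all('li', recursive=False):
                a = li.find('a', {'title': title})
                if a:
                    return a
    return None

# Made-up HTML: "June 17" appears only after the second <h2>.
html = (
    '<h2><span id="Events">Events</span></h2>'
    '<h3>June</h3>'
    '<ul><li><a title="June 16">June 16</a></li></ul>'
    '<h2><span id="Predicted_and_scheduled_events">Predicted and scheduled events</span></h2>'
    '<ul><li><a title="June 17">June 17</a></li></ul>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

found = find_link_in_section(soup, 'Events', 'June 16')
missing = find_link_in_section(soup, 'Events', 'June 17')
print(found['title'])
print(missing)
```

Because the loop stops at the next h2, a date that exists only in a later section is reported as None instead of being picked up from elsewhere on the page.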
