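I am trying to scrape the "Events" section of this Wikipedia page: https://en.wikipedia.org/wiki/2020. The page does not have the easiest HTML to navigate, because most of the tags are siblings rather than nested.
I want to make sure the only data I scrape lies between the two h2 tags shown below.
Here is the condensed HTML: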
<h2> #I ONLY WANT TO SEARCH BETWEEN HERE
<span id="Events">Events</span>
</h2>
<h3>...</h3>
<ul>...</ul>
<h3>...</h3>
<ul>
<li>
<a title="June 17"></a> #My code below is looking for this, if not found it jumps to another section
</li>
</ul>
<h3>...</h3>
<ul>...</ul>
<h2> #AND HERE. DON'T WANT TO GO PAST HERE
<span id="Predicted_and_scheduled_events">Predicted_and_scheduled_events</span>
</h2>
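In case it is unclear, every tag (except the spans) is a sibling. My code currently works if the date exists between the two h2 tags, but if the date does not exist there, it goes to another part of the page to pull data, which I don't want.
Here is my code: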
import sys
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"lxml")
todaysNews = soup.find('a', {"title": "June 17"}) #goes to date's stories
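BeautifulSoup has many useful functions and parameters. It is worth reading the whole documentation.
It has functions to get the parent element, the next sibling elements, elements that have a title with any value, and so on.
First I search for <span id="Events">Events</span>, then I get its parent element, the <h2>, which gives me the start of the data.
Next I can get next_siblings and run through them in a for loop until I reach an item named h2, which marks the end of the data.
Inside the for loop I can check every ul element and search only its direct li children, not nested li elements (recursive=False). Inside each li I can get the first a that has a title attribute with any text ({"title": True}).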
import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

# find the start of the data: the `h2` that wraps the "Events" span
start = soup.find('span', {'id': 'Events'}).parent

# walk the siblings that follow the starting `h2`
for item in start.next_siblings:
    # the next `h2` marks the end of the data
    if item.name == 'h2':
        break
    if item.name == 'ul':
        # only direct `li` children, not nested `li`
        for li in item.find_all('li', recursive=False):
            # first `a` that has a `title` attribute
            a = li.find('a', {'title': True})
            if a:
                print(a['title'])
Result:
January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
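The loop above prints every date in the section. The question asks for one specific date and wants nothing returned when that date is missing from the section. A minimal sketch of that variant follows. It assumes the sibling layout shown in the question; the helper name find_date_in_section and the inline SAMPLE_HTML are made up for illustration, and it uses the built-in 'html.parser' so it runs without lxml.

```python
import bs4

# Hypothetical stand-in for the page, mirroring the sibling layout in the question.
SAMPLE_HTML = """
<h2><span id="Events">Events</span></h2>
<h3>June</h3>
<ul>
  <li><a title="June 16">June 16</a> First story</li>
</ul>
<h2><span id="Predicted_and_scheduled_events">Predicted and scheduled events</span></h2>
<ul>
  <li><a title="June 17">June 17</a> Scheduled story</li>
</ul>
"""


def find_date_in_section(soup, section_id, date_title):
    """Return the first <a title=date_title> found between the section's
    h2 and the next h2, or None if the date is not in that section."""
    span = soup.find('span', {'id': section_id})
    if span is None:
        return None
    # span.parent is the h2 that opens the section
    for item in span.parent.next_siblings:
        if item.name == 'h2':
            # reached the next section, so stop searching
            break
        if item.name == 'ul':
            # only direct `li` children, as in the loop above
            for li in item.find_all('li', recursive=False):
                a = li.find('a', {'title': date_title})
                if a:
                    return a
    return None


soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(find_date_in_section(soup, 'Events', 'June 16'))
print(find_date_in_section(soup, 'Events', 'June 17'))
```

With this sample, the first call returns the "June 16" link, and the second returns None because "June 17" only appears after the closing h2. For the live page, pass the soup built from res.text instead of SAMPLE_HTML.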