Scraping data from a <span> inside a <div> using BeautifulSoup, Requests, and Pandas

Herman L

I am trying to extract "T", "0-0", and "(2 OT)" from the HTML below. I started writing the code that follows, but it is too hard for a beginner. Thanks for your help.


    <div class="sidearm-schedule-game-details flex item-1 columns">
        <div class="sidearm-schedule-game-result text-italic">
            <span></span>
            <span>T,</span>
            <span>0-0</span>
            <span>(2 OT)</span>
        </div>
    </div>


    import requests
    import pandas as pd
    from pandas import ExcelWriter
    from bs4 import BeautifulSoup


    url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
    school = requests.get(url).text
    soup = BeautifulSoup(school, 'lxml')

    rows = soup.find_all('div', class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")
    results = []
    for row in rows:
        # Combined text of every <span> in the result block, still one string
        result = row.find('div', class_="sidearm-schedule-game-result").text.strip()
        results.append(result)

    # One row per game, with the unsplit result text in a single column
    sheet = pd.DataFrame(results, columns=['result'])

安德烈·凯斯利(Andrej Kesely)

You can use the re module to parse the text inside the <span>s and store each piece of information in its own column: Result, Score, and OT.

For example:

    import re
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
    school = requests.get(url).text
    soup = BeautifulSoup(school, 'lxml')

    rows = soup.find_all('div', class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

    data = []
    for row in rows:
        # Opponent name: the logo's alt text with its last word dropped
        opponent = row.select_one('.sidearm-schedule-game-opponent-logo img')['alt'].rsplit(maxsplit=1)[0]
        # Short name plus date and time, taken from the link's aria-label
        name_date = row.select_one('.sidearm-schedule-game-opponent-name a')['aria-label']

        # Join the <span> texts with spaces, e.g. "T, 0-0 (2 OT)", then split into
        # (result letter, score, overtime note); [0] takes the first match
        result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' '))[0]

        data.append([opponent, *result, name_date])

    df = pd.DataFrame(data, columns=['Name', 'Result', 'Score', 'OT', 'Info'])
    print(df)

Prints:

                            Name Result Score      OT                                             Info
0      University of Connecticut      L   1-2                                UConn on August 24 7 p.m.
1              Drexel University      L   1-2    (OT)                       Drexel on August 27 7 p.m.
2   George Washington University      W   1-0                  George Washington on September 1 4 p.m.
3          St. John's University      W   1-0                      St. John's on September 4 7:30 p.m.
4          Binghamton University      L   1-2                         Binghamton on September 7 8 p.m.
5               Rider University      W   1-0  (2 OT)                     Rider on September 11 7 p.m.
6     University of Pennsylvania      T   0-0  (2 OT)                      Penn on September 15 6 p.m.
7                           Army      W   3-0                              Army on September 22 7 p.m.
8             Cornell University      L   2-3    (OT)                   Cornell on September 25 7 p.m.
9              Boston University      W   2-1    (OT)                  Boston U on September 29 4 p.m.
10            Colgate University      W   1-0                              Colgate on October 3 7 p.m.
11   United States Naval Academy      W   1-0                                 Navy on October 6 6 p.m.
12             Lafayette College      L   0-1                          Lafayette on October 13 12 p.m.
13             Dartmouth College      T   0-0  (2 OT)                   Dartmouth on October 16 6 p.m.
14           American University      L   0-1                            American on October 20 6 p.m.
15           Bucknell University      W   1-0                            Bucknell on October 24 7 p.m.
16       Loyola University (Md.)      L   0-1                        Loyola (Md.) on October 27 3 p.m.
17                    Holy Cross      W   3-1                          Holy Cross on November 3 6 p.m.
18            Colgate University      L   1-2          No. 3 Colgate (Semifinals) on November 9 7 p.m.
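
The regular expression can also be tried on its own, without fetching the page. The sketch below is only an illustration: `parse_result` is a hypothetical helper name, and the input strings are assumed to be what `get_text(strip=True, separator=' ')` returns for a result block such as the one in the question. Unlike indexing `re.findall(...)[0]`, it returns empty strings when a row has no parsable result, which may be worth considering if some games on a schedule have not been played yet.

```python
import re

# Same pattern as in the answer: result letter, score, optional overtime note.
RESULT_RE = re.compile(r'([A-Z]),\s+([\d-]+)\s*(.*)')


def parse_result(text):
    """Return (result, score, ot); three empty strings if the text does not match.

    Hypothetical helper, not part of the original answer.
    """
    match = RESULT_RE.search(text)
    return match.groups() if match else ('', '', '')


# Assumed inputs: span texts already joined with single spaces.
for sample in ['T, 0-0 (2 OT)', 'W, 1-0', '']:
    print(repr(sample), '->', parse_result(sample))
```

Inside the loop from the answer, the call would take the place of the `re.findall(...)[0]` expression, with the rest of the code unchanged.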
