I'm trying to extract "T", "0-0" and "(2 OT)" from the HTML below. I started writing the code that follows, but it's too hard for a beginner. Thanks for your help.

<div class="sidearm-schedule-game-details flex item-1 columns">
  <div class="sidearm-schedule-game-result text-italic">
    <span></span>
    <span>T,</span>
    <span>0-0</span>
    <span>(2 OT)</span>
  </div>

import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school,'lxml')
rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")
sheet = pd.DataFrame()
for row in rows:
    result = row.find('div',class_="sidearm-schedule-game-result").text.strip()
    df = pd.DataFrame([[result]], columns=['result'])
    sheet = sheet.append(df,sort=True).reset_index(drop=True)
results.append(sheet)

You can use the re module to parse the text inside the <span> elements and store each piece of information in its own column: Result, Score, and OT.

For example:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
school = requests.get(url).text
soup = BeautifulSoup(school, 'lxml')
rows = soup.find_all('div', class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

data = []
for row in rows:
    # Opponent name: the logo's alt text with its last word dropped
    opponent = row.select_one('.sidearm-schedule-game-opponent-logo img')['alt'].rsplit(maxsplit=1)[0]
    name_date = row.select_one('.sidearm-schedule-game-opponent-name a')['aria-label']
    # Join the <span> texts with single spaces, e.g. "T, 0-0 (2 OT)"
    text = row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' ')
    # Groups: result letter, score, and the overtime note (empty if absent)
    result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', text)[0]
    data.append([opponent, *result, name_date])

df = pd.DataFrame(data, columns=['Name', 'Result', 'Score', 'OT', 'Info'])
print(df)

Prints:

                            Name  Result  Score      OT                                             Info
0      University of Connecticut       L    1-2                                UConn on August 24 7 p.m.
1              Drexel University       L    1-2    (OT)                       Drexel on August 27 7 p.m.
2   George Washington University       W    1-0                  George Washington on September 1 4 p.m.
3          St. John's University       W    1-0                      St. John's on September 4 7:30 p.m.
4          Binghamton University       L    1-2                         Binghamton on September 7 8 p.m.
5               Rider University       W    1-0  (2 OT)                     Rider on September 11 7 p.m.
6     University of Pennsylvania       T    0-0  (2 OT)                      Penn on September 15 6 p.m.
7                           Army       W    3-0                              Army on September 22 7 p.m.
8             Cornell University       L    2-3    (OT)                   Cornell on September 25 7 p.m.
9              Boston University       W    2-1    (OT)                  Boston U on September 29 4 p.m.
10            Colgate University       W    1-0                              Colgate on October 3 7 p.m.
11   United States Naval Academy       W    1-0                                 Navy on October 6 6 p.m.
12             Lafayette College       L    0-1                          Lafayette on October 13 12 p.m.
13             Dartmouth College       T    0-0  (2 OT)                   Dartmouth on October 16 6 p.m.
14           American University       L    0-1                            American on October 20 6 p.m.
15           Bucknell University       W    1-0                            Bucknell on October 24 7 p.m.
16       Loyola University (Md.)       L    0-1                        Loyola (Md.) on October 27 3 p.m.
17                    Holy Cross       W    3-1                          Holy Cross on November 3 6 p.m.
18            Colgate University       L    1-2          No. 3 Colgate (Semifinals) on November 9 7 p.m.
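
One caveat about the regex step: `re.findall(...)[0]` raises an IndexError when the result text does not match, for example if a row has no result yet (whether the live page ever contains such rows is an assumption, not something shown above). A minimal sketch of a more forgiving version follows. It uses the same pattern, and the helper name `parse_result` is hypothetical. It takes the string you would get from `get_text(strip=True, separator=' ')` and needs no network access:

```python
import re

# Same pattern as above, compiled once: result letter, score, optional overtime note
RESULT_RE = re.compile(r'([A-Z]),\s+([\d-]+)\s*(.*)')


def parse_result(text):
    """Split a string such as 'T, 0-0 (2 OT)' into (result, score, overtime).

    Returns three empty strings when the text does not match, instead of
    raising IndexError. `parse_result` is a hypothetical helper name.
    """
    match = RESULT_RE.search(text)
    if match is None:
        return ('', '', '')
    return match.groups()


print(parse_result('T, 0-0 (2 OT)'))  # ('T', '0-0', '(2 OT)')
print(parse_result('W, 1-0'))         # ('W', '1-0', '')
print(parse_result(''))               # ('', '', '')
```

Inside the loop you could then write `data.append([opponent, *parse_result(text), name_date])`, which keeps the same five columns and leaves them blank for rows without a result.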