Python BS4 Beautiful Soup html.parser not working on a website

Luv Tomar

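I have Python 3.7 code that tries to extract football statistics from the following website ( https://www.whoscored.com/Matches/1294545/LiveStatistics/Germany-Bundesliga-2018-2019-Bayern-Munich-Hoffenheim ). It seems the HTML parser I am using with BS4 Beautiful Soup is not extracting any tags from the site at all.

I first tried to extract specific tags, such as the two different div tags representing the home and away teams and the tags containing the player names. When that returned an empty list, I simply tried to extract every div tag on the site, but I still got an empty list.

Here is the code I used: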

from requests import get
from bs4 import BeautifulSoup

url = 'https://www.whoscored.com/Matches/1294545/LiveStatistics/Germany-Bundesliga-2018-2019-Bayern-Munich-Hoffenheim'

response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
containers_home_offensive = html_soup.find_all('div')

abdusco

You don't need Selenium when you can extract the match statistics directly from the HTML:

import re
from ast import literal_eval

import requests

url = 'https://www.whoscored.com/Matches/1294545/LiveStatistics/Germany-Bundesliga-2018-2019-Bayern-Munich-Hoffenheim'
res = requests.get(
    url,
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
    }
)
res.raise_for_status()
html = res.text

Nothing special so far.

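match_data = re.search(r'var matchStats = ([^;]+)', html, flags=re.MULTILINE).group(1)
# Two passes: re.sub matches do not overlap, so one pass leaves a ',,' behind in runs like ',,,'.
match_data_clean = re.sub(',,', ",'',", match_data)
match_data_clean = re.sub(',,', ",'',", match_data_clean)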

stats = literal_eval(match_data_clean)

When we inspect match_data, we can see a bunch of arrays with invalid syntax, like this:

ams',,'yellow',,,21,328

So we clean it up with a bit of re magic, inserting empty strings between consecutive commas.
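To see the cleanup in isolation, here is a minimal sketch that runs offline on a made-up fragment shaped like the snippet above. The `raw` string and the `fill_empty_slots` helper are illustrative names, not part of the page or of any library. It uses a lookahead so that a single pass handles runs of commas of any length, which is an alternative to calling re.sub twice. Note that it would also touch a ',,' that happened to sit inside a quoted string.

```python
import re
from ast import literal_eval

# Hypothetical fragment shaped like the page's data; not copied from the site.
raw = "[['Kasim Adams',,'yellow',,,21,328428,0]]"


def fill_empty_slots(text):
    # Insert '' after every comma that is directly followed by another comma.
    # The lookahead does not consume the second comma, so runs are fully handled.
    return re.sub(r",(?=,)", ",''", text)


cleaned = fill_empty_slots(raw)
row = literal_eval(cleaned)[0]
print(cleaned)
print(row)
```

Passing `raw` itself to literal_eval raises a SyntaxError, which is why the cleanup step has to come first.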

Printing stats gives us:

[[[37,
   1211,
   'Bayern Munich',
   'Hoffenheim',
   '24/08/2018 19:30:00',
   '24/08/2018 00:00:00',
   6,
   'FT',
   '1 : 0',
   '3 : 1',
   '',
   '',
   '3 : 1',
   'Germany',
   'Germany'],
  [[[21, [], [['Kasim Adams', '', 'yellow', '', '', 21, 328428, 0]], 0, 1],
    [23,
     [['Thomas Müller',
       'Joshua Kimmich',
       'goal',
       '(1-0)',
       '',
       23,
       37099,
       283323]],
     [],
     1,
     0],

From here on, it is just a matter of finding the right indices for the data you are looking for.
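As a rough sketch of that index hunting, the snippet below walks a hand-built sample that mirrors the first entries of the printed output, with the truncated tail closed off. The field meanings (which slot holds the final score, which list belongs to the home side) are guesses from the printed data, not something the site documents, so check them against a real response.

```python
# Hand-built sample mirroring the start of the printed `stats`; the real
# structure is longer, and the closing brackets here are an assumption.
stats = [[
    [37, 1211, 'Bayern Munich', 'Hoffenheim', '24/08/2018 19:30:00',
     '24/08/2018 00:00:00', 6, 'FT', '1 : 0', '3 : 1', '', '', '3 : 1',
     'Germany', 'Germany'],
    [[
        [21, [], [['Kasim Adams', '', 'yellow', '', '', 21, 328428, 0]], 0, 1],
        [23, [['Thomas Müller', 'Joshua Kimmich', 'goal', '(1-0)', '', 23,
               37099, 283323]], [], 1, 0],
    ]],
]]

header = stats[0][0]
home_team, away_team = header[2], header[3]
score = header[9]  # assumption: index 9 is the full-time score

# Each incident row looks like [minute, first_list, second_list, ...].
# Assumption: the first list is the home team's, the second the away team's.
events = []
for minute, home_events, away_events, *_ in stats[0][1][0]:
    for team, entries in ((home_team, home_events), (away_team, away_events)):
        for entry in entries:
            # entry[0] is the player name, entry[2] the incident type.
            events.append((minute, team, entry[0], entry[2]))

print(home_team, score, away_team)
for event in events:
    print(event)
```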
