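I'm trying to scrape the body text of a news article on the London Stock Exchange website, but it doesn't show up when I try to pull it out with BeautifulSoup. Does anyone know how I can get this information?
I can find the tag when I click Inspect, but the text isn't there when I view the page source (Ctrl + U). I think the content may be loaded into the page from somewhere else, but I'm not sure, and I don't know how to scrape it.
The page I'm looking at is: https://www.londonstockexchange.com/news-article/PFG/interim-results-for-six-months-ended-30-june-2020/14665452
I'm trying to get the main content about the interim results.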
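The article is stored inside a <script> tag in the page (the element with id ng-lseg-state), as escaped JSON. You can extract it with this example: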
import json

import requests
from bs4 import BeautifulSoup

url = 'https://www.londonstockexchange.com/news-article/PFG/interim-results-for-six-months-ended-30-june-2020/14665452'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# The <script id="ng-lseg-state"> tag holds JSON with a few characters escaped.
# '&a;' is decoded last so that a freshly decoded '&' cannot form a new escape.
data = (
    soup.select_one('#ng-lseg-state').string
    .replace('&q;', '"')
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&s;', "'")
    .replace('&a;', '&')
)
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))


def find_news_article(data):
    # Walk nested dicts and lists, yielding every value stored under 'newsArticle'.
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'newsArticle':
                yield v
            else:
                yield from find_news_article(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_news_article(v)


# The article body is an HTML fragment, so parse it again to strip the tags.
article = BeautifulSoup(next(find_news_article(data))['value'], 'html.parser')

# print text from article on screen:
print(article.get_text(strip=True, separator='\n'))
Prints:
RNS Number : 1348X
Provident Financial PLC
26 August 2020
Provident Financial plc
Interim results for the six months ended 30 June 2020
Provident Financial plc ('the Group') is the leading provider of credit products to consumers who are underserved by mainstream lenders. The Group serves c.2.2 million customers and its operations consist of Vanquis Bank, Moneybarn, and the Consumer Credit Division ('CCD') comprising Provident home credit and Satsuma.
...and so on.
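If you want to check the decoding and the recursive search without hitting the site, here is a minimal offline sketch using only the standard library. The `sample_raw` payload and its `page`/`components` nesting are made up for illustration (the real page's JSON layout will differ), and `decode_state` and `find_key` are just hypothetical helper names; only the escape sequences and the `newsArticle`/`value` keys come from the example above.

```python
import json

# Escape sequences taken from the replace() chain in the answer.
# '&a;' is decoded last so a freshly decoded '&' cannot form a new escape.
ESCAPES = [('&q;', '"'), ('&l;', '<'), ('&g;', '>'), ('&s;', "'"), ('&a;', '&')]


def decode_state(raw):
    """Undo the escapes and parse the result as JSON."""
    for escaped, plain in ESCAPES:
        raw = raw.replace(escaped, plain)
    return json.loads(raw)


def find_key(data, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                yield v
            else:
                yield from find_key(v, key)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, key)


# Hypothetical payload, shaped only loosely like the real one.
sample_raw = (
    '{&q;page&q;: {&q;components&q;: [{&q;newsArticle&q;: '
    '{&q;value&q;: &q;&l;p&g;Tom &a; Co&s;s results&l;/p&g;&q;}}]}}'
)

state = decode_state(sample_raw)

# next() with a default avoids StopIteration when the key is absent.
article = next(find_key(state, 'newsArticle'), None)
if article is None:
    print('no newsArticle key found')
else:
    print(article['value'])
```

Passing a default to `next()` is worth copying into the real script as well: if the site changes its JSON layout, you get a clear "not found" branch instead of an uncaught `StopIteration`.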