如何通过不更改 URL 的“显示更多”按钮获取数据？

171

戴帽

我正在尝试使用站点搜索关键字从 Vogue 中抓取文章标题和链接。我无法获得前 100 个结果，因为“显示更多”按钮掩盖了它们。我以前通过使用更改的 URL 解决了这个问题，但是 Vogue 的 URL 没有更改为包含页码、结果编号等。

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.vogue.com/search?q=HARRY+STYLES&sort=score+desc'
r = requests.get(url)
soup = bs(r.content, 'html')

links = soup.find_all('a', {'class':"summary-item-tracking__hed-link summary-item__hed-link"})
titles = soup.find_all('h2', {'class':"summary-item__hed"})

res = []
for i in range(len(titles)):
    entry = {'Title': titles[i].text.strip(), 'Link': 'https://www.vogue.com'+links[i]['href'].strip()}
    res.append(entry)

关于如何通过“显示更多”按钮抓取数据的任何提示？

哈桑·胡塞因·尤塞尔

您必须通过开发人员工具检查网络。然后你必须确定如何向网站请求数据。您可以在屏幕截图中看到请求和响应。

如您所见，该网站正在使用页面参数。

每页有 8 个标题。所以你必须使用循环来获得 100 个标题。

代码：

import cloudscraper,json,html
counter=1
for i in range(1,14):
    url = f'https://www.vogue.com/search?q=HARRY%20STYLES&page={i}&sort=score%20desc&format=json'
    scraper = cloudscraper.create_scraper(browser={'browser': 'firefox','platform': 'windows','mobile': False},delay=10)
    byte_data = scraper.get(url).content
    json_data = json.loads(byte_data)
    for j in range(0,8):
        title_url = 'https://www.vogue.com' + (html.unescape(json_data['search']['items'][j]['url']))
        t = html.unescape(json_data['search']['items'][j]['source']['hed'])
        print(counter," - " + t + ' - ' + title_url)
        if (counter == 100):
            break
        counter = counter + 1

输出：