How to use Python's BeautifulSoup library to scrape data from a website with a "View More" option

妮维达

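I am trying to parse the comments on this web page (the URL is in the code below). I need to get 1000 comments, but by default the page shows only 10.

Clicking "View More" displays more comments on the page, but I cannot find a way to fetch that additional content.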

So far, I have the following code:

import urllib.request
from bs4 import BeautifulSoup
import sys

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

response = urllib.request.urlopen("https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/")

srcode = response.read()

soup = BeautifulSoup(srcode, "html.parser")

all_comments_div = soup.find_all('div', class_="comment_body")

all_comments = []
for div in all_comments_div:
    all_comments.append(div.find('p').text.translate(non_bmp_map))

print(all_comments)
print(len(all_comments))

妈妈

You can use a while loop to fetch the next page
(i.e. keep going while there is a next page and fewer than 1000 comments have been collected).

import sys
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Map characters outside the Basic Multilingual Plane to U+FFFD so the
# comments can be printed on consoles that cannot display them.
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []
max_comments = 1000
base_url = 'https://www.mygov.in/'
next_page = urljoin(base_url, '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/')

# Follow the "next page" link until it disappears or enough comments are collected.
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")

    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))

    # The pager's "next" item holds the link to the following page of comments.
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        # urljoin avoids a doubled slash when the href already starts with '/'.
        next_page = urljoin(base_url, next_page.find('a').get('href'))
    print('comments: {}'.format(len(all_comments)))

print(all_comments)
print(len(all_comments))
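
If you want to try the loop without hitting the site, the control flow can be separated from the fetching and parsing. The following is only a minimal sketch under assumptions: `collect_comments`, `fetch_page` and the in-memory `pages` dict are hypothetical names made up for illustration, not part of BeautifulSoup or of the code above. Compared with the loop above it also cuts the result to exactly `max_comments` and stops if a "next" link points back to a page that was already visited.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher takes a URL and returns
# (comments found on that page, URL of the next page or None).
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def collect_comments(start_url: str,
                     fetch_page: PageFetcher,
                     max_comments: int = 1000) -> List[str]:
    """Follow 'next page' links until enough comments are collected or pages run out."""
    comments: List[str] = []
    visited = set()  # guards against a pager that links back to an earlier page
    next_url: Optional[str] = start_url
    while next_url and next_url not in visited and len(comments) < max_comments:
        visited.add(next_url)
        page_comments, next_url = fetch_page(next_url)
        comments.extend(page_comments)
    # The last page may push the total past the limit, so trim it.
    return comments[:max_comments]


# Hypothetical in-memory "site", used only to exercise the loop.
pages = {
    "/p0": (["a", "b"], "/p1"),
    "/p1": (["c", "d"], "/p2"),
    "/p2": (["e"], None),
}

result = collect_comments("/p0", lambda url: pages[url], max_comments=3)
print(result)
```

For the real site, `fetch_page` would wrap the `urlopen` call and the `soup.find_all` / `soup.find` parsing from the code above and return the page's comments together with the next URL.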