I have been happily scraping with R, but I have run into its limitations. While trying to scrape the case summaries (sumarios) of Argentina's Supreme Court, I hit a problem I could not find an answer to. This is most likely a result of learning by doing, so apologies in advance if my code works but follows some rather bad practices; feel free to point them out. In any case, I managed to type my query into #voces, click "search", and then scrape .datosSumarios to get the information I need (case name, date, reporter, etc.). The code is as follows:
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
  // type the query and wait until the autocomplete menu is ready
  await Promise.all([
    page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
    page.waitForSelector('.ui-menu-item')
  ]);
  await page.click('.ui-menu-item');
  await Promise.all([
    page.click('.glyphicon-search'),
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
  ]);
  // Here we are where we want to be, and we capture what we need:
  const result = await page.evaluate(() => {
    let data = []; // create an empty array that will store our data
    let elements = document.querySelectorAll('.row'); // select all result rows
    for (let element of elements) { // loop through each row
      // query within the current row, not the whole document;
      // otherwise every iteration grabs the first match on the page
      let sumario = element.querySelector('.datosSumario');
      if (sumario) {
        data.push({ title: sumario.innerText }); // push an object with the data onto our array
      }
    }
    return data; // return our data array
  });
  // review ->
  await page.click('#paginate_button2');
  await browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // Success!
});
What I cannot seem to do is navigate through the different pages. If you visit the page you will see that the pagination is quite odd: there is no "next page" button, just a set of page-number buttons. I can click one of them, but I cannot then repeat the scraping part of the code above. I tried writing a loop (could not get it to work), and I have read several pagination tutorials, but none of them deals with this particular setup.
# Update
I was able to solve the pagination, but now I cannot get the function that actually scrapes the text to work inside the pagination loop (it runs fine on its own against a single page). I am sharing it in case someone can point out the obvious mistake I am probably making.
const puppeteer = require('puppeteer');
const fs = require('fs');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
  // type the query and wait until the autocomplete menu is ready
  await Promise.all([
    page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
    page.waitForSelector('.ui-menu-item')
  ]);
  await page.click('.ui-menu-item');
  await Promise.all([
    page.click('.glyphicon-search'),
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
  ]);
  var results = []; // variable to hold the "sumarios" I need
  var lastPageNumber = 2; // I am using 2 to test, but any number works (in my case, the 31 pages I need to scrape)
  for (let index = 0; index < lastPageNumber; index++) {
    // wait 5 s for the page to load
    await page.waitFor(5000);
    // call and await the extraction and concatenate results on every iteration.
    // You could use results.push, but then you would get a collection of collections at the end
    results = results.concat(await MyFunction); // I call my function here, but it does not work, see below
    if (index != lastPageNumber - 1) {
      await page.click('li.paginate_button.active + li a[onclick]'); // this does the trick
      await page.waitFor(5000);
    }
  }
  browser.close();
  return results;
};

async function MyFunction() {
  const data = await page.evaluate(() => // this bit works outside of the async function and gets the text I need on a single page
    Array.from(
      document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
      element => element.textContent)
  );
}

scrape().then((results) => {
  console.log(results); // Success!
});
You can try document.querySelector('li.paginate_button.active + li a[onclick]') as the equivalent of a next-page button. After clicking it, you can wait for a response whose URL starts with 'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex='.
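That wait can be sketched with page.waitForResponse (a standard Puppeteer API), replacing the fixed page.waitFor(5000) delays. The URL prefix is the one just mentioned; goToNextPage and isPaginarSumarios are hypothetical helper names:

```javascript
// Predicate for the pagination XHR mentioned above.
const isPaginarSumarios = (url) =>
  url.startsWith('https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex=');

// Hypothetical helper: click the next-page button and wait for the
// matching XHR instead of sleeping a fixed number of milliseconds.
async function goToNextPage(page) {
  await Promise.all([
    page.waitForResponse((response) => isPaginarSumarios(response.url())),
    page.click('li.paginate_button.active + li a[onclick]'),
  ]);
}
```

Starting the waitForResponse before (or together with) the click avoids a race where the response arrives before the listener is attached.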
# Update
At first glance, there are a couple of problems:

MyFunction is never called: you need await MyFunction() rather than await MyFunction.

You need to pass page into MyFunction's scope:

results = results.concat(await MyFunction(page));
//...
async function MyFunction(page) {
  // ...
}
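Putting the two fixes together, and noting one further issue not mentioned above: the original MyFunction stores the result in data but never returns it, so results.concat(...) would concatenate undefined. A corrected sketch:

```javascript
// Corrected version: takes page as a parameter and returns the data.
async function MyFunction(page) {
  const data = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
      (element) => element.textContent
    )
  );
  return data; // without this return, the caller receives undefined
}

// Caller side (inside the pagination loop):
// results = results.concat(await MyFunction(page));
```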