I have been happily scraping with R, but I have run into its limitations. While trying to scrape the case summaries (sumarios) of Argentina's Supreme Court, I hit a problem I could not find an answer to. This is most likely a result of learning by doing, so apologies in advance if my code works but follows some rather bad practices; feel free to point them out. In any case, I managed to type my query into #voces, click "search", and then scrape .datosSumarios to get the information I need (case name, date, reporter, etc.). The code is as follows:
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
  // type the query and wait until the autocomplete menu is ready
  await Promise.all([
    page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
    page.waitForSelector('.ui-menu-item')
  ]);
  await page.click('.ui-menu-item');
  await Promise.all([
    page.click('.glyphicon-search'),
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
  ]);
  // Here we are where we want to be, and we capture what we need:
  const result = await page.evaluate(() => {
    let data = []; // create an empty array that will store our data
    let elements = document.querySelectorAll('.row'); // select all result rows
    for (let element of elements) { // loop through each row
      // query within the current row, not the whole document;
      // otherwise every iteration grabs the first match on the page
      let sumario = element.querySelector('.datosSumario');
      if (sumario) {
        data.push({ title: sumario.innerText }); // push an object with the data onto our array
      }
    }
    return data; // return our data array
  });
  // review ->
  await page.click('#paginate_button2');
  await browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // Success!
});
What I cannot seem to do is navigate through the different pages. If you visit the page you will see that the pagination is quite odd: there is no "next page" button, just a set of page-number buttons. I can click one of them, but I cannot then repeat the scraping part of the code above. I tried writing a loop (could not get it to work), and I have read several pagination tutorials, but none of them deals with this particular setup.
# Update
I was able to solve the pagination, but now I cannot get the function that actually scrapes the text to work inside the pagination loop (it runs fine on its own against a single page). I am sharing it in case someone can point out the obvious mistake I am probably making.
const puppeteer = require('puppeteer');
const fs = require('fs');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
  // type the query and wait until the autocomplete menu is ready
  await Promise.all([
    page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
    page.waitForSelector('.ui-menu-item')
  ]);
  await page.click('.ui-menu-item');
  await Promise.all([
    page.click('.glyphicon-search'),
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
  ]);
  var results = []; // variable to hold the "sumarios" I need
  var lastPageNumber = 2; // I am using 2 to test, but any number works (in my case, the 31 pages I need to scrape)
  for (let index = 0; index < lastPageNumber; index++) {
    // wait 5 s for the page to load
    await page.waitFor(5000);
    // call and await the extraction and concatenate results on every iteration.
    // You could use results.push, but then you would get a collection of collections at the end
    results = results.concat(await MyFunction); // I call my function here, but it does not work, see below
    if (index != lastPageNumber - 1) {
      await page.click('li.paginate_button.active + li a[onclick]'); // this does the trick
      await page.waitFor(5000);
    }
  }
  browser.close();
  return results;
};

async function MyFunction() {
  const data = await page.evaluate(() => // this bit works outside of the async function and gets the text I need on a single page
    Array.from(
      document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
      element => element.textContent)
  );
}

scrape().then((results) => {
  console.log(results); // Success!
});
You can try document.querySelector('li.paginate_button.active + li a[onclick]') as the equivalent of a next-page button. After clicking it, you can wait for a response whose URL starts with 'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex='.
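That wait can be sketched with page.waitForResponse (a standard Puppeteer API), replacing the fixed page.waitFor(5000) delays. The URL prefix is the one just mentioned; goToNextPage and isPaginarSumarios are hypothetical helper names:

```javascript
// Predicate for the pagination XHR mentioned above.
const isPaginarSumarios = (url) =>
  url.startsWith('https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex=');

// Hypothetical helper: click the next-page button and wait for the
// matching XHR instead of sleeping a fixed number of milliseconds.
async function goToNextPage(page) {
  await Promise.all([
    page.waitForResponse((response) => isPaginarSumarios(response.url())),
    page.click('li.paginate_button.active + li a[onclick]'),
  ]);
}
```

Starting the waitForResponse before (or together with) the click avoids a race where the response arrives before the listener is attached.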
# Update
At first glance, there are a couple of problems:

MyFunction is never called: you need await MyFunction() rather than await MyFunction.

You need to pass page into MyFunction's scope:

results = results.concat(await MyFunction(page));
//...
async function MyFunction(page) {
  // ...
}
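Putting the two fixes together, and noting one further issue not mentioned above: the original MyFunction stores the result in data but never returns it, so results.concat(...) would concatenate undefined. A corrected sketch:

```javascript
// Corrected version: takes page as a parameter and returns the data.
async function MyFunction(page) {
  const data = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
      (element) => element.textContent
    )
  );
  return data; // without this return, the caller receives undefined
}

// Caller side (inside the pagination loop):
// results = results.concat(await MyFunction(page));
```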