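I want to scrape data from a site like this one (a stats site for a game I play), where interactive charts are rendered in <canvas> elements and none of the data is exposed as scrapable HTML elements. Inspecting the HTML, the page appears to use chartjs.
Python help is preferred, but if I really need to use some JavaScript, that is fine too.
I would also like to avoid methods that require extra files, such as phantomjs, but again, if that is the only way, please share it.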
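One way to solve this is to look at the contents of the <script> tag around line 1050 of the page source, which is where the charts are actually initialized. The initialization follows a recurring pattern: each canvas element is queried in turn to get its context, followed by variables that supply the chart's labels and statistics.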
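To make that recurring pattern concrete, here is a minimal sketch that runs against a made-up script excerpt. The canvas IDs, variable names, and numbers are hypothetical and only imitate the shape described above; they are not copied from the real page. It uses a plain regular expression to list the canvas IDs, which is more fragile than the AST-based solution that follows but is handy for a quick look at what a page contains.

```javascript
// Hypothetical excerpt imitating the shape of the chart initialization script.
// The IDs, variable names, and numbers are invented for illustration.
const sampleScript = `
  var ctx = document.getElementById('killsChart').getContext('2d');
  var values = [12, 7, 3];
  var labels = ['Rifle', 'Sniper', 'Pistol'];
  var ctx = document.getElementById('winsChart').getContext('2d');
  var values = [5, 9];
  var labels = ['Solo', 'Team'];
`;

// Collect the ID passed to each getElementById(...) call, in source order.
function listCanvasIds(source) {
  const pattern = /getElementById\(\s*['"]([^'"]+)['"]\s*\)/g;
  return Array.from(source.matchAll(pattern), match => match[1]);
}

const canvasIds = listCanvasIds(sampleScript);
console.log(canvasIds);
```

Each ID found this way corresponds to one chart, and the array declarations that follow it hold that chart's data.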
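This solution uses node.js (a reasonably recent version) with the following modules, all of which are required at the top of the code: cheerio, axios, and abstract-syntax-tree. They can be installed with npm install cheerio axios abstract-syntax-tree.
Here is the solution with its source code: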
const cheerio = require('cheerio');
const axios = require('axios');
const { parse, each, find } = require('abstract-syntax-tree');

async function main() {
  // get the page source
  const { data } = await axios.get(
    'https://stats.warbrokers.io/players/i/5d2ead35d142affb05757778'
  );

  // load the page source with cheerio to query the elements
  const $ = cheerio.load(data);

  // get the script tag that contains the string 'Chart.defaults'
  const contents = $('script')
    .toArray()
    .map(script => $(script).html())
    .find(source => source && source.includes('Chart.defaults'));

  // stop early if no matching script exists (e.g. the page layout changed)
  if (!contents) {
    throw new Error("No <script> containing 'Chart.defaults' was found");
  }

  // convert the script content to an AST
  const ast = parse(contents);

  // we'll put all declarations in this object
  const declarations = {};

  // current key
  let key = null;

  // iterate over all variable declarations inside the script
  each(ast, 'VariableDeclaration', node => {
    // iterate over possible declarations, e.g. comma separated
    node.declarations.forEach(item => {
      // the key groups the statistics and their labels;
      // we use the ID of the canvas itself in this case
      if (item.id.name === 'ctx') { // is this a canvas context variable?
        // get the only string literal that is not '2d'
        const literal = find(item, 'Literal').find(v => v.value !== '2d');
        if (literal) { // do we have a non-'2d' string literal?
          // then assign it as the current key
          key = literal.value;
        }
      }

      // ensure that the variable we're getting is an array expression
      if (key && item.init && item.init.type === 'ArrayExpression') {
        // get the array expression's literal values
        const array = item.init.elements.map(v => v.value);

        // did we already store the first array (the statistics) for this key?
        if (declarations[key]) {
          // zip the two arrays: the second array supplies the keys,
          // the first (stored) array supplies the values
          const result = {};
          for (let index = 0; index < array.length; index++) {
            result[array[index]] = declarations[key][index];
          }
          declarations[key] = result;

          // reset the key so later, unrelated array expressions are ignored
          key = null;
        } else {
          // store the values
          declarations[key] = array;
        }
      }
    });
  });

  // logging it here, it's up to you how you deal with the data itself
  console.log(declarations);
}

// report network or parsing failures instead of an unhandled rejection
main().catch(console.error);
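The step that pairs the two arrays is what turns the raw declarations into usable data, so it may help to see it in isolation. The sketch below is an illustration rather than part of the solution: zipToObject is a hypothetical helper name and the sample data is invented. It assumes, as the solution above does, that the first array declared after a canvas holds the values and the second holds the labels. Object.fromEntries needs Node.js 12 or later.

```javascript
// Hypothetical helper: pair each label with the value at the same index.
function zipToObject(labels, values) {
  return Object.fromEntries(
    labels.map((label, index) => [label, values[index]])
  );
}

// Invented sample data, in the order the page script is assumed to
// declare it: values first, labels second.
const values = [12, 7, 3];
const labels = ['Rifle', 'Sniper', 'Pistol'];

const stats = zipToObject(labels, values);

// Serialize the result so it can be saved or handed to another tool.
console.log(JSON.stringify(stats, null, 2));
```

Since Python was the preferred language in the question, printing JSON like this is one way to pass the scraped numbers on to a Python script for further processing.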