Scrape data from AJAX webpage with python

Nice Guy Eddie

I'm having an issue: I need to scrape a dynamic table from this webpage - https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses AJAX to generate the table I want to fetch. I inspected the element and it seemed straightforward: I have the request URL with its parameter, I send a request and get response code 200, but the response body is empty.

I must be doing something wrong, but I'm not sure how to fetch this data in Python, even though it seems fairly straightforward. Could anyone help me out?

I want to get the same table as the one that is displayed on the website.

baduker

Actually, this page turned out to be a pretty cool challenge!

Breakdown:

  • The link to the report sits in the source HTML; the table itself is rendered dynamically by JavaScript, but you can easily scoop the link out
  • The safeargs_data value
5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

is just a silly hex obfuscation of this query string:

__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
  • I've decoded it for readability and for easy editing of, e.g., the data key (the report date)
  • Finally, I use the table_link, payload data, and updated headers to make a POST request
  • Then, it's easy to pull the table out of the JSON response and parse it with pandas
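The decoding step above can be sketched on its own (a minimal example; only the hex blob comes from the page source):

```python
import binascii
from urllib.parse import parse_qs

# The safeargs_data blob lifted from the page source, split at each
# parameter boundary for readability
hex_blob = (
    "5f5f7265706f72743d504c5f5553455f524242"            # __report=PL_USE_RBB
    "265f5f63616c6c547970653d70"                        # &__callType=p
    "26646174613d323032332d30382d3130"                  # &data=2023-08-10
    "265f737667737570706f72743d74727565"                # &_svgsupport=true
    "267265736f7572636549443d72656e646572696e6755524c"  # &resourceID=renderingURL
    "265f5f706167654e756d6265723d31"                    # &__pageNumber=1
    "265f5f626174636849443d31383964663635613634642d31"  # &__batchID=189df65a64d-1
)

# Hex -> plain query string
decoded = binascii.unhexlify(hex_blob).decode()
print(decoded)  # __report=PL_USE_RBB&__callType=p&data=2023-08-10&...

# parse_qs turns it into a dict, so individual keys (e.g. data) are easy to edit
params = {k: v[0] for k, v in parse_qs(decoded).items()}
```

From there, editing the `data` key and re-encoding with `binascii.hexlify` gives you a valid blob for any other report date.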

By the way, if you decode the hex value and append the resulting parameters to the page's action URL, you'll still get your report.

Here's a full, decoded URL:

https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

import binascii
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )

    headers.update({"X-Requested-With": "XMLHttpRequest"})

    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }

    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()

    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )

    df = pd.read_html(
        # .replace() is used to get rid of NBSPs
        table_data.json()["reportContent"].replace("\xa0", ""),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)

    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a PL_USE_RBB.csv file and then print this:

+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
|   ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34')   | ('Koniec', '2023-08-10 15:16:06')   |   ('Początek', '17') |   ('Koniec', '24') |   ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
|                         23 | 2023-08-10 14:15:42                   | 2023-08-10 14:16:16                 |                   16 |                 24 |              4.88788e+07 |
|                         22 | 2023-08-10 13:15:39                   | 2023-08-10 13:16:24                 |                   15 |                 24 |              4.50884e+07 |
|                         21 | 2023-08-10 12:15:36                   | 2023-08-10 12:16:10                 |                   14 |                 24 |              4.09294e+07 |
|                         20 | 2023-08-10 11:15:33                   | 2023-08-10 11:16:15                 |                   13 |                 24 |              3.12136e+07 |
|                         19 | 2023-08-10 10:15:41                   | 2023-08-10 10:16:07                 |                   12 |                 24 |              2.55946e+07 |
|                         18 | 2023-08-10 09:15:40                   | 2023-08-10 09:16:05                 |                   11 |                 24 |              2.26086e+07 |
|                         17 | 2023-08-10 08:15:40                   | 2023-08-10 08:16:00                 |                   10 |                 24 |              1.58324e+07 |
|                         16 | 2023-08-10 07:15:35                   | 2023-08-10 07:15:56                 |                    9 |                 24 |              1.11414e+07 |
|                         15 | 2023-08-10 06:15:33                   | 2023-08-10 06:15:52                 |                    8 |                 24 |              1.11796e+07 |
|                         14 | 2023-08-10 05:15:32                   | 2023-08-10 05:15:52                 |                    7 |                 24 |              9.639e+06   |
|                         13 | 2023-08-10 04:15:41                   | 2023-08-10 04:16:11                 |                    6 |                 24 |              9.0502e+06  |
|                         12 | 2023-08-10 03:15:36                   | 2023-08-10 03:15:55                 |                    5 |                 24 |              7.871e+06   |
|                         11 | 2023-08-10 02:15:35                   | 2023-08-10 02:16:03                 |                    4 |                 24 |              8.395e+06   |
|                         10 | 2023-08-10 01:15:41                   | 2023-08-10 01:16:04                 |                    3 |                 24 |              7.8954e+06  |
|                          9 | 2023-08-10 00:15:37                   | 2023-08-10 00:15:55                 |                    2 |                 24 |              8.2582e+06  |
|                          8 | 2023-08-09 23:15:03                   | 2023-08-09 23:15:24                 |                    1 |                 24 |              6.6784e+06  |
|                          7 | 2023-08-09 22:15:08                   | 2023-08-09 22:15:16                 |                    1 |                 24 |         603200           |
|                          6 | 2023-08-09 21:15:12                   | 2023-08-09 21:15:22                 |                    1 |                 24 |              0           |
|                          5 | 2023-08-09 20:15:06                   | 2023-08-09 20:15:12                 |                    1 |                 24 |              0           |
|                          4 | 2023-08-09 19:15:04                   | 2023-08-09 19:15:14                 |                    1 |                 24 |              0           |
|                          3 | 2023-08-09 18:15:11                   | 2023-08-09 18:15:32                 |                    1 |                 24 |              0           |
|                          2 | 2023-08-09 17:15:11                   | 2023-08-09 17:15:22                 |                    1 |                 24 |              0           |
|                          1 | 2023-08-09 16:15:09                   | 2023-08-09 16:15:31                 |                    1 |                 24 |              0           |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
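To fetch a different day, only the data key in the payload needs to change before it's hexlified. Here's a small helper (a sketch based on the payload above; I'm assuming the server accepts the same __batchID for other dates, which may not hold across sessions):

```python
import binascii
from urllib.parse import urlencode

def build_safeargs(date: str, batch_id: str = "189df65a64d-1") -> str:
    """Hex-encode the report parameters into the safeargs_data format."""
    payload = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": date,             # report date, YYYY-MM-DD
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": batch_id,    # captured from one session; may need refreshing
    }
    return binascii.hexlify(urlencode(payload).encode()).decode()

# The POST is then the same as above, just with a new blob:
# session.post(table_link, data={"safeargs_data": build_safeargs("2023-08-09")}, headers=headers)
```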

