Scrape data from AJAX webpage with python

Nice Guy Eddie

I'm having an issue: I need to scrape a dynamic table from this webpage - https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses AJAX to generate the table I want to fetch. I inspected the element and it seemed straightforward: I have the request URL with its parameter, I send a request and get response code 200, but the response body is empty.

I must be doing something wrong, but I'm not sure how to fetch this data in Python, even though it seems fairly straightforward. Could anyone help me out?

I want to get the same table as the one that is displayed on the website.

baduker

Actually, this page turned out to be a pretty cool challenge!

Breakdown:

  • The link to the report sits in the source HTML; the table itself is rendered dynamically by JavaScript, but you can easily scoop the link out
  • The safeargs_data value
5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

is just a silly hex obfuscation of this query string:

__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
  • I've decoded it for readability and for easy editing of, e.g., the data key (the report date)
  • Finally, I use the table_link, payload data, and updated headers to make a POST request
  • Then, it's easy to pull the table out of the JSON response and parse it with pandas
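The decoding step above can be sketched on its own (a minimal example; only the hex blob comes from the page source):

```python
import binascii
from urllib.parse import parse_qs

# The safeargs_data blob lifted from the page source, split at each
# parameter boundary for readability
hex_blob = (
    "5f5f7265706f72743d504c5f5553455f524242"            # __report=PL_USE_RBB
    "265f5f63616c6c547970653d70"                        # &__callType=p
    "26646174613d323032332d30382d3130"                  # &data=2023-08-10
    "265f737667737570706f72743d74727565"                # &_svgsupport=true
    "267265736f7572636549443d72656e646572696e6755524c"  # &resourceID=renderingURL
    "265f5f706167654e756d6265723d31"                    # &__pageNumber=1
    "265f5f626174636849443d31383964663635613634642d31"  # &__batchID=189df65a64d-1
)

# Hex -> plain query string
decoded = binascii.unhexlify(hex_blob).decode()
print(decoded)  # __report=PL_USE_RBB&__callType=p&data=2023-08-10&...

# parse_qs turns it into a dict, so individual keys (e.g. data) are easy to edit
params = {k: v[0] for k, v in parse_qs(decoded).items()}
```

From there, editing the `data` key and re-encoding with `binascii.hexlify` gives you a valid blob for any other report date.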

By the way, if you decode the hex value and append the resulting parameters to the page's action URL, you'll still get your report.

Here's a full, decoded URL:

https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

import binascii
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )

    headers.update({"X-Requested-With": "XMLHttpRequest"})

    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }

    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()

    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )

    df = pd.read_html(
        # .replace() is used to get rid of NBSPs
        table_data.json()["reportContent"].replace("\xa0", ""),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)

    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a PL_USE_RBB.csv file and then print this:

+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
|   ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34')   | ('Koniec', '2023-08-10 15:16:06')   |   ('Początek', '17') |   ('Koniec', '24') |   ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
|                         23 | 2023-08-10 14:15:42                   | 2023-08-10 14:16:16                 |                   16 |                 24 |              4.88788e+07 |
|                         22 | 2023-08-10 13:15:39                   | 2023-08-10 13:16:24                 |                   15 |                 24 |              4.50884e+07 |
|                         21 | 2023-08-10 12:15:36                   | 2023-08-10 12:16:10                 |                   14 |                 24 |              4.09294e+07 |
|                         20 | 2023-08-10 11:15:33                   | 2023-08-10 11:16:15                 |                   13 |                 24 |              3.12136e+07 |
|                         19 | 2023-08-10 10:15:41                   | 2023-08-10 10:16:07                 |                   12 |                 24 |              2.55946e+07 |
|                         18 | 2023-08-10 09:15:40                   | 2023-08-10 09:16:05                 |                   11 |                 24 |              2.26086e+07 |
|                         17 | 2023-08-10 08:15:40                   | 2023-08-10 08:16:00                 |                   10 |                 24 |              1.58324e+07 |
|                         16 | 2023-08-10 07:15:35                   | 2023-08-10 07:15:56                 |                    9 |                 24 |              1.11414e+07 |
|                         15 | 2023-08-10 06:15:33                   | 2023-08-10 06:15:52                 |                    8 |                 24 |              1.11796e+07 |
|                         14 | 2023-08-10 05:15:32                   | 2023-08-10 05:15:52                 |                    7 |                 24 |              9.639e+06   |
|                         13 | 2023-08-10 04:15:41                   | 2023-08-10 04:16:11                 |                    6 |                 24 |              9.0502e+06  |
|                         12 | 2023-08-10 03:15:36                   | 2023-08-10 03:15:55                 |                    5 |                 24 |              7.871e+06   |
|                         11 | 2023-08-10 02:15:35                   | 2023-08-10 02:16:03                 |                    4 |                 24 |              8.395e+06   |
|                         10 | 2023-08-10 01:15:41                   | 2023-08-10 01:16:04                 |                    3 |                 24 |              7.8954e+06  |
|                          9 | 2023-08-10 00:15:37                   | 2023-08-10 00:15:55                 |                    2 |                 24 |              8.2582e+06  |
|                          8 | 2023-08-09 23:15:03                   | 2023-08-09 23:15:24                 |                    1 |                 24 |              6.6784e+06  |
|                          7 | 2023-08-09 22:15:08                   | 2023-08-09 22:15:16                 |                    1 |                 24 |         603200           |
|                          6 | 2023-08-09 21:15:12                   | 2023-08-09 21:15:22                 |                    1 |                 24 |              0           |
|                          5 | 2023-08-09 20:15:06                   | 2023-08-09 20:15:12                 |                    1 |                 24 |              0           |
|                          4 | 2023-08-09 19:15:04                   | 2023-08-09 19:15:14                 |                    1 |                 24 |              0           |
|                          3 | 2023-08-09 18:15:11                   | 2023-08-09 18:15:32                 |                    1 |                 24 |              0           |
|                          2 | 2023-08-09 17:15:11                   | 2023-08-09 17:15:22                 |                    1 |                 24 |              0           |
|                          1 | 2023-08-09 16:15:09                   | 2023-08-09 16:15:31                 |                    1 |                 24 |              0           |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
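To fetch a different day, only the data key in the payload needs to change before it's hexlified. Here's a small helper (a sketch based on the payload above; I'm assuming the server accepts the same __batchID for other dates, which may not hold across sessions):

```python
import binascii
from urllib.parse import urlencode

def build_safeargs(date: str, batch_id: str = "189df65a64d-1") -> str:
    """Hex-encode the report parameters into the safeargs_data format."""
    payload = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": date,             # report date, YYYY-MM-DD
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": batch_id,    # captured from one session; may need refreshing
    }
    return binascii.hexlify(urlencode(payload).encode()).decode()

# The POST is then the same as above, just with a new blob:
# session.post(table_link, data={"safeargs_data": build_safeargs("2023-08-09")}, headers=headers)
```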

