Storing range-based time series data in Postgres

Adam Charnock

I need to store netflow data in PostgreSQL. This is data about network traffic flows. Each record contains the following:

  • Connection start time
  • Connection end time
  • Total data transferred
  • Source/destination IP/ASN
  • (plenty more besides, but this is enough for the purposes of this question)

My question is: how should I store this data so that I can efficiently calculate data transfer rates over the past X days/hours? For example, I might want to chart all traffic to Netflix's ASN over the past 7 days, at hourly resolution.

The gap between a connection's start and end times may be a few milliseconds, or it may be more than an hour.


My first pass is to store each connection in a TSTZRANGE field with a GiST index. Then, to query hourly traffic for the past 7 days, I would (see the sketch after this list):

  1. Use a CTE to generate a series of hourly time buckets
  2. Find any TSTZRANGEs that overlap each bucket
  3. Calculate the duration of each overlap
  4. Calculate the record's data rate in bytes per second
  5. Multiply the overlap duration by the bytes per second to get the total data
  6. Group everything by bucket, summing the total data values
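
For concreteness, here is roughly what that query would look like. This is only a sketch: the traffic_flow schema below is a minimal stand-in (the range, bytes, and as_number columns are illustrative, as is AS2906 as an example Netflix ASN).

create table traffic_flow (
    range     tstzrange not null,  -- connection start/end as a range
    bytes     bigint not null,     -- total bytes transferred
    as_number int not null         -- destination ASN
);

create index traffic_flow_range_idx on traffic_flow using gist (range);

-- Hourly traffic to one ASN over the past 7 days
with buckets as (
    select ts, tstzrange(ts, ts + interval '1 hour') as bucket
    from generate_series(
        date_trunc('hour', now() - interval '7 days'),
        date_trunc('hour', now()),
        interval '1 hour'
    ) as ts
)
select
    b.ts as hour,
    sum(
        -- bytes per second for the whole flow...
        f.bytes / extract(epoch from upper(f.range) - lower(f.range))
        -- ...multiplied by the seconds of overlap with this bucket
        * extract(epoch from upper(f.range * b.bucket) - lower(f.range * b.bucket))
    ) as bytes
from buckets b
join traffic_flow f on f.range && b.bucket
where f.as_number = 2906
  -- avoid division by zero on zero-length ranges
  and upper(f.range) > lower(f.range)
group by b.ts
order by b.ts;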

However, this sounds like a lot of heavy lifting. Can anyone think of a better option?

Adam Charnock

After researching this some more, I think the real answer is that there is no out-of-the-box way to do this performantly, particularly as the data volume grows. Ultimately, aggregating thousands of rows will be slow simply because it involves so much data access.

Instead, I went a different route. I use a PostgreSQL trigger on the table that stores the raw flows (traffic_flow). Every time a record is inserted into traffic_flow, the trigger upserts the new data into separate aggregate tables at daily, hourly, per-minute, and per-second resolutions.


Here is my experimental implementation, in case it is useful to someone. It could be improved to handle updates and deletes.
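
The aggregate tables themselves are not shown here. A plausible shape for one of them, inferred from the INSERT column list and the ON CONFLICT clause in the function below, would be:

create table traffic_perhourtraffic (
    timestamp   timestamptz not null,  -- start of the aggregation bucket
    customer_ip inet not null,
    as_number   int not null,
    bytes       int not null default 0,
    packets     int not null default 0,
    -- the ON CONFLICT clause in the function relies on this constraint
    unique (customer_ip, timestamp, as_number)
);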

create or replace function update_aggregated_traffic(NEW RECORD, table_name TEXT, interval_name text, store_customer BOOLEAN)
    returns void
    language plpgsql
as
$body$
declare
    aggregate_interval interval;
    customer_ip_ inet;
begin
    -- Update the data aggregated traffic data given the insertion of a new flow.
    -- A flow is the data about a single connection (start time, stop time, total
    -- bytes/packets). This function essentially rasterises that data into a
    -- series of aggregation buckets.

    -- interval_name should be second, hour, or minute
    -- turn the interval_name into an actual INTERVAL
    aggregate_interval = ('1 ' || interval_name)::INTERVAL;
    if store_customer then
        customer_ip_ = NEW.source_address;
    else
        customer_ip_ = '100.64.0.0'::INET;
    end if;

    -- We need to insert into a dynamically generated table name. There is
    -- no way to do this without writing the whole SQL statement as a string.
    -- Instead, let's use a trick. Create a temporary view, then insert into that.
    -- Postgres will proxy this insert into the desired table
    drop view if exists table_pointer;
    -- %I quotes table_name as an identifier, guarding against SQL injection
    execute format('create temporary view table_pointer as select * from %I', table_name);

    -- We use a CTE to keep things readable, even though it is pretty long
    with aggregate_range AS (
        -- Create all the aggregate buckets spanned by the inserted flow
        SELECT generate_series(
            date_trunc(interval_name, lower(NEW.range)),
            date_trunc(interval_name, upper(NEW.range)),
            aggregate_interval
        ) as range_lower
    ),
    -- For each bucket, figure out its overlap with the provided flow data.
    -- Only the first and last buckets will have less than complete overlap,
    -- but we do the calculation for all buckets anyway
    with_overlaps AS (
        SELECT
            NEW.range * tstzrange(range_lower, range_lower + aggregate_interval) AS overlap,
            range_lower
        FROM
        aggregate_range
    ),
    -- Convert the overlap intervals into seconds (FLOAT)
    with_overlap_seconds AS (
        SELECT
            extract(epoch from (upper(overlap) - lower(overlap))) as overlap_seconds,
            range_lower
        FROM
            with_overlaps
    )
    -- Now we have enough information to do the inserts
    insert into table_pointer as traffic
        (timestamp, customer_ip, as_number, bytes, packets)
        select
            range_lower,
            customer_ip_,
            NEW.as_number,
            -- Scale the bytes/packets per second up to a total number
            -- of bytes/packets for the overlap
            round(NEW.bytes_per_second * overlap_seconds)::INT,
            round(NEW.packets_per_second * overlap_seconds)::INT
        from with_overlap_seconds
        -- We shouldn't have any 0-second overlaps, but let's just be sure
        where overlap_seconds > 0
        -- If there is already existing data, then increment the bytes/packets values
        on conflict (customer_ip, timestamp, as_number) DO UPDATE SET
            bytes = EXCLUDED.bytes + traffic.bytes,
            packets = EXCLUDED.packets + traffic.packets
    ;
end;
$body$;


create or replace function update_aggregated_traffic_hourly() returns trigger
    language plpgsql
as
$body$
begin
    -- Store aggregated data for different resolutions. For each we also store data
    -- without the customer information. This way we can efficiently see traffic data
    -- for the whole network
    PERFORM update_aggregated_traffic(NEW, 'traffic_perdaytraffic','day', True);
    PERFORM update_aggregated_traffic(NEW, 'traffic_perdaytraffic','day', False);

    PERFORM update_aggregated_traffic(NEW, 'traffic_perhourtraffic','hour', True);
    PERFORM update_aggregated_traffic(NEW, 'traffic_perhourtraffic','hour', False);

    PERFORM update_aggregated_traffic(NEW, 'traffic_perminutetraffic','minute', True);
    PERFORM update_aggregated_traffic(NEW, 'traffic_perminutetraffic','minute', False);

    PERFORM update_aggregated_traffic(NEW, 'traffic_persecondtraffic','second', True);
    PERFORM update_aggregated_traffic(NEW, 'traffic_persecondtraffic','second', False);

    return NEW;
end;
$body$;

create trigger update_aggregated_traffic_hourly_trigger AFTER INSERT ON traffic_flow
    FOR EACH ROW EXECUTE PROCEDURE update_aggregated_traffic_hourly();
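
With the trigger keeping the aggregates up to date, the original 7-day hourly chart becomes a cheap read from a small, pre-aggregated table. For example (again using an illustrative ASN, and the '100.64.0.0' placeholder rows that hold whole-network totals):

-- Whole-network hourly traffic to one ASN over the past 7 days
select timestamp, bytes
from traffic_perhourtraffic
where customer_ip = '100.64.0.0'::inet  -- rows aggregated without customer info
  and as_number = 2906
  and timestamp >= now() - interval '7 days'
order by timestamp;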
