How to aggregate columns into a JSON array?

robynico

How can I transform the data below so that it can be stored in Elasticsearch?

Here is a dataset of beans that I would like to aggregate by product into a JSON array.

List<Bean> data = new ArrayList<Bean>();
data.add(new Bean("book","John",59));
data.add(new Bean("book","Björn",61));
data.add(new Bean("tv","Roger",36));
Dataset ds = spark.createDataFrame(data, Bean.class);

ds.show(false);

+------+-------+---------+
|amount|product|purchaser|
+------+-------+---------+
|59    |book   |John     |
|61    |book   |Björn    |
|36    |tv     |Roger    |
+------+-------+---------+
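
The `Bean` class itself is not shown in the question. Below is a minimal sketch of what it might look like, assuming a plain JavaBean whose constructor takes (product, purchaser, amount) in the order used by the `data.add` calls. The field names and the `Serializable` marker are assumptions. Spark's bean-based `createDataFrame` reads the columns from the public getters, which would be consistent with the column names shown above.

```java
import java.io.Serializable;

// Hypothetical reconstruction: the original Bean is not shown in the question.
public class Bean implements Serializable {
    private String product;
    private String purchaser;
    private int amount;

    // No-arg constructor, conventional for JavaBeans.
    public Bean() {}

    // Argument order follows the question's calls, e.g. new Bean("book", "John", 59).
    public Bean(String product, String purchaser, int amount) {
        this.product = product;
        this.purchaser = purchaser;
        this.amount = amount;
    }

    // Each getter/setter pair defines one bean property (one column).
    public String getProduct() { return product; }
    public void setProduct(String product) { this.product = product; }

    public String getPurchaser() { return purchaser; }
    public void setPurchaser(String purchaser) { this.purchaser = purchaser; }

    public int getAmount() { return amount; }
    public void setAmount(int amount) { this.amount = amount; }
}
```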


ds = ds.groupBy(col("product")).agg(collect_list(map(ds.col("purchaser"),ds.col("amount")).as("map")));
ds.show(false);

+-------+---------------------------------------------+
|product|collect_list(map(purchaser, amount) AS `map`)|
+-------+---------------------------------------------+
|tv     |[[Roger -> 36]]                              |
|book   |[[John -> 59], [Björn -> 61]]                |
+-------+---------------------------------------------+

This is what I want to transform it into:

+-------+------------------------------------------------------------------+
|product|json                                                              |
+-------+------------------------------------------------------------------+
|tv     |[{purchaser: "Roger", amount:36}]                                 |
|book   |[{purchaser: "John", amount:59}, {purchaser: "Björn", amount:61}] |
+-------+------------------------------------------------------------------+
robynico

The solution:

ds.groupBy(col("product"))
  .agg(collect_list(to_json(struct(col("purchaser"), col("amount")))).alias("json"));
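
Note that `collect_list(to_json(struct(...)))` yields an array whose elements are individual JSON strings. If a single JSON array string per product is wanted instead, one option is to collect the structs first and call `to_json` once on the collected array. The following is a minimal sketch of that variant, assuming a Spark version whose `to_json` accepts an array of structs. The class name `AggregateToJson`, the helper methods, and the `local[1]` master are illustrative assumptions, not part of the original answer.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class AggregateToJson {

    // Same three rows as the question, built from Rows so no Bean class is needed here.
    static Dataset<Row> sample(SparkSession spark) {
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("product", DataTypes.StringType, false),
            DataTypes.createStructField("purchaser", DataTypes.StringType, false),
            DataTypes.createStructField("amount", DataTypes.IntegerType, false)
        });
        List<Row> rows = Arrays.asList(
            RowFactory.create("book", "John", 59),
            RowFactory.create("book", "Björn", 61),
            RowFactory.create("tv", "Roger", 36));
        return spark.createDataFrame(rows, schema);
    }

    // One JSON array string per product: collect the structs, then serialize once.
    static Dataset<Row> aggregate(Dataset<Row> ds) {
        return ds.groupBy(col("product"))
                 .agg(to_json(collect_list(struct(col("purchaser"), col("amount")))).alias("json"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]") // assumption: local run, for illustration only
            .appName("aggregate-to-json")
            .getOrCreate();
        aggregate(sample(spark)).show(false);
        spark.stop();
    }
}
```

The order of elements inside each array follows `collect_list`, which does not guarantee a stable order after a shuffle, so downstream code should not rely on it.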

