使用OpenRefine(或R)解析复杂的大型Json文件的最佳方法

埃托尔·里扎(Ettore Rizza)

我知道如何在Openfine中解析json单元格,但是这一步对我来说太棘手了。

在此处输入图片说明

我已经使用API​​提取了4730 AirBNB房间的日历,这些日历由其ID标识。

这里是一个JSON文件的一个例子:https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20¤cy=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions

对于从现在到2017年11月的一年中的每个ID,我想提取该房间的可用性(正确或错误)及其价格。

在此处输入图片说明

I can't figure out how to parse out these informations. I guess that it implies a series of nested forEach, but i can't find the right way to do this with Open Refine.

I've tried, of course,

forEach(value.parseJson().calendar_months, e, e.days) 

The result is an array of arrays of dictionnaries that disrupts me.

Any help would be appreciate. If the operation is too difficult in Open Refine, a solution with R (or Python) would also be fine for me.

Owen Stephens

I think you are on the right track. The output of:

forEach(value.parseJson().calendar_months, e, e.days) 

is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.

To keep the steps manageable I'd suggest tackling it like this:

First use

forEach(value.parseJson().calendar_months,m,m.days).join("|")

You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.

Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR

Then use:

forEach(value.parseJson(),d,d).join("|")

This splits the JSON down into the individual days

Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.

Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.

You've now got some pretty easy JSON in each cell - so you can extract availability using

value.parseJson().available

etc.

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章