Exploding JSON in PySpark SQL

Pikun95

I want to flatten a nested JSON file into a CSV file, parsing the nested JSON into rows and columns.

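from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql import Row

# Create (or reuse) a session; in the pyspark shell `spark` already exists.
spark = SparkSession.builder.getOrCreate()

# multiline=true lets Spark read a JSON document that spans several lines.
df = spark.read.option("multiline", "true").json("sample1.json")
df.printSchema()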

root
 |-- pid: struct (nullable = true)
 |    |-- Body: struct (nullable = true)
 |    |    |-- Vendor: struct (nullable = true)
 |    |    |    |-- RC: struct (nullable = true)
 |    |    |    |    |-- Updated_From_Date: string (nullable = true)
 |    |    |    |    |-- Updated_To_Date: string (nullable = true)
 |    |    |    |-- RD: struct (nullable = true)
 |    |    |    |    |-- Supplier: struct (nullable = true)
 |    |    |    |    |    |-- Supplier_Data: struct (nullable = true)
 |    |    |    |    |    |    |-- Days: long (nullable = true)
 |    |    |    |    |    |    |-- Reference: struct (nullable = true)
 |    |    |    |    |    |    |    |-- ID: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |-- Expected: long (nullable = true)
 |    |    |    |    |    |    |-- Payments: long (nullable = true)
 |    |    |    |    |    |    |-- Approval: struct (nullable = true)
 |    |    |    |    |    |    |    |-- ID: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |-- Areas_Changed: struct (nullable = true)
 |    |    |    |    |    |    |    |-- Alternate_Names: long (nullable = true)
 |    |    |    |    |    |    |    |-- Attachments: long (nullable = true)
 |    |    |    |    |    |    |    |-- Classifications: long (nullable = true)
 |    |    |    |    |    |    |    |-- Contact_Information: long (nullable = true)

My code:

df2=(df.select(F.explode("pid").alias('pid'))
         .select('pid.*')
         .select(F.explode('Body').alias('Body'))
         .select('Body.*')
         .select((F.explode('Vendor').alias('Vendor'))
         .select('Vendor.*')
         .select((F.explode('RC').alias('RC'))
         .select('RC.*'))))

Error: AnalysisException: cannot resolve "explode(pid)" due to data type mismatch: input to function explode should be array or map type, not struct<Body:struct< .....

How do I parse the struct fields? Any help is much appreciated :)

Mohana B C

The explode function can only be used on map or array types. To access the fields of a struct type, just use the . operator.

Suppose you want to get the columns under RC and RD; the code should then look like this.

df.select("pid.Body.Vendor.RC.*", "pid.Body.Vendor.RD.*")
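
The same dot-operator idea can be applied to the whole schema when the goal is a flat CSV. The following is only a minimal sketch under some assumptions: the helper names leaf_paths, flatten_for_csv and run_example are hypothetical (they are not part of PySpark), the output directory name is made up, and every array in the data is assumed to be an array of strings, as in the schema shown in the question. Because the CSV writer cannot store array columns, the sketch joins each array into a single delimited string; F.explode on the array column would be the alternative if one row per element is wanted.

```python
def leaf_paths(schema, prefix=""):
    """Return (dotted_path, is_array) for every non-struct field.

    `schema` is a dict in the shape produced by df.schema.jsonValue().
    """
    paths = []
    for field in schema["fields"]:
        name = prefix + field["name"]
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype["type"] == "struct":
            # Struct: descend with the dot operator instead of explode.
            paths.extend(leaf_paths(ftype, name + "."))
        else:
            is_array = isinstance(ftype, dict) and ftype["type"] == "array"
            paths.append((name, is_array))
    return paths


def flatten_for_csv(df, sep="|"):
    """Select every leaf as a top-level column; join arrays into one string."""
    from pyspark.sql import functions as F  # imported lazily, only needed here

    cols = []
    for path, is_array in leaf_paths(df.schema.jsonValue()):
        col = F.col(path)
        if is_array:
            # Assumption: arrays hold strings, so array_join can collapse them.
            col = F.array_join(col, sep)
        # Dots are not convenient in CSV headers, so use underscores.
        cols.append(col.alias(path.replace(".", "_")))
    return df.select(cols)


def run_example(json_path="sample1.json", out_dir="sample1_csv"):
    """Hypothetical end-to-end usage; out_dir is an assumed output location."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("multiline", "true").json(json_path)
    flat = flatten_for_csv(df)
    flat.write.mode("overwrite").option("header", "true").csv(out_dir)
```

With the schema from the question, this should yield columns such as pid_Body_Vendor_RC_Updated_From_Date and pid_Body_Vendor_RD_Supplier_Supplier_Data_Days. Field names that themselves contain dots would need backtick quoting, which this sketch does not handle.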
