Reading JSON object data as a MapType in Spark

马欣达

I wrote a sample Spark application in which I create a DataFrame with a MapType column and write it to disk. Then I read the same file back and print its schema. However, the output file's schema differs from the input schema, and I don't see the MapType in the output. How can I read the output file with the MapType?

import org.apache.spark.sql.{SaveMode, SparkSession}

case class Department(Id:String,Description:String)
case class Person(name:String,department:Map[String,Department])

object sample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
    import spark.implicits._

    val schemaData = Seq(
      Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
      Person("Persion2", Map("It" -> Department("1", "It Department")))
    )
    val df = spark.sparkContext.parallelize(schemaData).toDF()
    println("Input schema")
    df.printSchema()
    // write the DataFrame to disk as JSON
    df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

    println("Output schema")
    // read the files back and print the schema inferred from the JSON data
    spark.read.json("D:\\save\\output\\*.json").printSchema()
  }
}

Output

Input schema
root
 |-- name: string (nullable = true)
 |-- department: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Description: string (nullable = true)
Output schema
root
 |-- department: struct (nullable = true)
 |    |-- HR: struct (nullable = true)
 |    |    |-- Description: string (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |    |-- It: struct (nullable = true)
 |    |    |-- Description: string (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |-- name: string (nullable = true)

JSON file

{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}

Edit: I added the file-saving part above only to explain my requirement. In the real scenario I will only be reading the JSON data given above and working on that DataFrame.

You can pass the schema from the previous DataFrame while reading the JSON data:

println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output").printSchema()

Input schema

root
 |-- name: string (nullable = true)
 |-- department: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Description: string (nullable = true)

Output schema

root
 |-- name: string (nullable = true)
 |-- department: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Description: string (nullable = true)
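
If the original DataFrame is not available (as noted in the edit above, where only the JSON data is read), the schema can be supplied up front instead: either derived from the case classes via an encoder or written out explicitly with MapType. A minimal sketch, assuming the Person and Department case classes from the question (the variable names personSchema, explicitSchema and restored are only illustrative):

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Option 1: derive the schema from the case classes
val personSchema = Encoders.product[Person].schema

// Option 2: build the same schema by hand
val departmentStruct = StructType(Seq(
  StructField("Id", StringType),
  StructField("Description", StringType)))
val explicitSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("department", MapType(StringType, departmentStruct))))

// Either schema restores department as a map when reading the JSON
val restored = spark.read.schema(personSchema).json("D:\\save\\output")
restored.printSchema()

// Quick check that the column behaves as a map
restored
  .select(col("name"), col("department").getItem("It").getField("Description"))
  .show(false)

Either way the map column comes back without relying on the DataFrame that originally wrote the file.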

Hope this helps!
