I have the following query:
val CubeData = spark.sql("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable, loansTable
WHERE borrowersTable.bid = loansTable.bid
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")
I want to export four files, each with specific data and a specific name:
File1: gender and department; file name gender_departments
File2: gender, null; file name gender_null
File3: department, null; file name departments_null
File4: null, null; file name null_null
These files are the results of the SQL query (with CUBE).
I tried the following:
val df1 = CubeData.withColumn("combination",concat(col("gender") ,lit(","), col("department")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("final")
But I got more than 4 files, one for each gender/department combination, and the file names are random. Is there a way to choose the names of these files?
Maybe this is a bug in Spark; I don't see anything wrong with your query, but the query below seems to work. You don't need to qualify a column with the table name if the column name is unique across the tables.
val CubeData = spark.sql ("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable
JOIN loansTable USING(bid)
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")
But there seems to be a problem with how your files are parsed; try this:
val borrowersDF = spark.read.format("csv").option("delimiter", "|").option("header", "true").option("inferSchema", "true").load("BORROWERS.txt")
borrowersDF.createOrReplaceTempView("borrowersTable")
val loansDF = spark.read.format("csv").option("delimiter", "|").option("header", "true").option("inferSchema", "true").load("LOANS.txt")
loansDF.createOrReplaceTempView("loansTable")
val CubeData = spark.sql ("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable
JOIN loansTable USING(bid)
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")