I have the following DataFrames:
scala> val df1 = Seq(("1","1_10"), ("1","1_11"), ("2","2_20"), ("3","3_30"), ("3","3_31")).toDF("c1","c2")
+---+----+
| c1|  c2|
+---+----+
|  1|1_10|
|  1|1_11|
|  2|2_20|
|  3|3_30|
|  3|3_31|
+---+----+
val df2 = Seq(("2","200"), ("3","300")).toDF("c1","val")
+---+---+
| c1|val|
+---+---+
|  2|200|
|  3|300|
+---+---+
If I do a left join, I get the following result:
scala> df1.join(df2,Seq("c1"),"left").select(df1("c1").alias("df1_c1"),df1("c2"),df2("val")).show
+------+----+----+
|df1_c1|  c2| val|
+------+----+----+
|     1|1_10|null|
|     1|1_11|null|
|     2|2_20| 200|
|     3|3_30| 300|
|     3|3_31| 300|
+------+----+----+
But how can I also get the right table's join key (df2's c1) in the result?
Expected output:
+------+----+----+------+
|df1_c1|  c2| val|df2_c1|
+------+----+----+------+
|     1|1_10|null|  null|
|     1|1_11|null|  null|
|     2|2_20| 200|     2|
|     3|3_30| 300|     3|
|     3|3_31| 300|     3|
+------+----+----+------+
If I try df1.join(df2,Seq("c1"),"left").select(df1("c1").alias("df1_c1"),df1("c2"),df2("val"),df2("c1")).show,
I get the following error:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) c1#19639 missing from c1#19630,c2#19631,val#19640 in operator !Project [c1#19630 AS df1_c1#19667, c2#19631, val#19640, c1#19639]. Attribute(s) with the same name appear in the operation: c1. Please check if the right attribute(s) are used.;
When you join with a column sequence like Seq("c1"), Spark deduplicates the join key and keeps only a single c1 column in the output, so df2("c1") can no longer be resolved. You can use an explicit join expression on aliased DataFrames instead:
import org.apache.spark.sql.functions.expr

df1.as("df1").join(df2.as("df2"), expr("df1.c1 == df2.c1"), "left")
  .select($"df1.c1".alias("df1_c1"), $"df1.c2", $"df2.val", $"df2.c1".as("df2_c1"))
  .show(false)
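For completeness, here is a minimal alternative sketch (assuming the same df1 and df2 defined above): pass the join condition as a column expression built from the original DataFrame references. Since this is no longer a USING join, both c1 columns survive, and the failing select works once df2("c1") is aliased.

// the column-expression condition keeps df2's c1 in the join output
df1.join(df2, df1("c1") === df2("c1"), "left")
  .select(df1("c1").alias("df1_c1"), df1("c2"), df2("val"), df2("c1").alias("df2_c1"))
  .show

Both versions produce the expected output shown above.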