我有数百万行这样的数据框:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
现在,我想将此数据分为三个单独的dataFrame,如下所示:
如何计算activeDF和inactiveDF?
我知道bothDF可以像
df.select("id").distinct.except(activeDF).except(inactiveDF)
,但这将涉及改组(因为“区别”操作需要相同)。有没有更好的方法来计算两者
版本:
Spark : 2.2.1
Scala : 2.11
另一种方式-groupBy,按集合收集,然后如果集合的大小为1,则它仅是活动的或不活动的,否则两者
scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE |
|id5|ACTIVE |
|id6|INACTIVE|
|id7|ACTIVE |
|id7|INACTIVE|
+---+--------+
scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE] |
|id5|[ACTIVE] |
|id6|[INACTIVE] |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE] |
|id4|[ACTIVE] |
+---+------------------+
scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus |status |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH |
|id3|[INACTIVE] |INACTIVE|
|id5|[ACTIVE] |ACTIVE |
|id6|[INACTIVE] |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH |
|id2|[ACTIVE] |ACTIVE |
|id4|[ACTIVE] |ACTIVE |
+---+------------------+--------+
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句