想象一个如下所示的数据框:
+-------+--------+---------+---------+--------+-----------------+---+
|address|lastname|firstname|patientid|policyno|visitid |id |
+-------+--------+---------+---------+--------+-----------------+---+
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_1| 1 |
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_2| 1 |
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_3| 1 |
|addr2 |Dobs |Aliya |patid2 |policy2 |visituid_aliya_1 | 2 |
|addr2 |Dobs |Aliya |patid2 |policy2 |visituid_aliya_2 | 2 |
|addr2 |Dobs |OP |patid3 |policy3 |visituid_OP_1 | 3 |
+-------+--------+---------+---------+--------+-----------------+---+
当名称(firstname
+ lastname
)相同,名称不同时,“ id”的列值保持不变-我想分配一个新的ID。
我需要这样做是因为我想确定一个特定的数据集(其中包含敏感的详细信息),以便我可以在我的应用程序中使用相同的数据,但值已更改。该字段id
将用作与其他数据帧联接的索引键。
关于的另一部分id
是,其他虚拟数据帧也将包含一个类似的id
列,可能会使用来填充该列,monotonically_increasing_id()
因此最好以id
连续增加的方式从0或1开始。
如何使用Scala在火花中实现这一目标?
使用window
功能。
scala> df.show(false)
+-------+--------+---------+---------+--------+-----------------+
|address|lastname|firstname|patientid|policyno|visitid |
+-------+--------+---------+---------+--------+-----------------+
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_1|
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_2|
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_3|
|addr2 |Dobs |Aliya |patid2 |policy2 |visituid_aliya_1 |
|addr2 |Dobs |Aliya |patid2 |policy2 |visituid_aliya_2 |
|addr2 |Dobs |OP |patid3 |policy3 |visituid_OP_1 |
|addr4 |AN |OTHER |patid4 |policy4 |visituid_OP_1 |
|addr2 |ANO |THER |patid5 |policy5 |visituid_OP_1 |
+-------+--------+---------+---------+--------+-----------------+
val expr = Seq("lastname","firstname")
.map(c => hash(col(c)).as(c))
.reduce(concat(_,_).asc)
val winSpec = dense_rank()
.over(Window.orderBy(expr))
输出量
scala> df.withColumn("id",winSpec).show(false)
+-------+--------+---------+---------+--------+-----------------+---+
|address|lastname|firstname|patientid|policyno|visitid |id |
+-------+--------+---------+---------+--------+-----------------+---+
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_1|1 |
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_2|1 |
|addr1 |Pits |Rodney |patid1 |policy1 |visituid_rodney_3|1 |
|addr4 |AN |OTHER |patid4 |policy4 |visituid_OP_1 |2 |
|addr2 |Dobs |OP |patid3 |policy3 |visituid_OP_1 |3 |
|addr2 |Dobs |Aliya |patid2 |policy2 |visituid_aliya_1 |4 |
|addr2 |Dobs |Aliya |patid2 |policy2 |visituid_aliya_2 |4 |
|addr2 |ANO |THER |patid5 |policy5 |visituid_OP_1 |5 |
+-------+--------+---------+---------+--------+-----------------+---+
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句