Given the following DataFrame:
+----+--------+--------+-----+------+------+------+
|name|platform|group_id|width|height| x| y|
+----+--------+--------+-----+------+------+------+
| a| plat_a| 0|500.0|1000.0|250.41|500.01|
| a| plat_a| 0|250.0| 500.0|125.75| 250.7|
| a| plat_a| 0|300.0| 800.0| 120.0| 111.7|
| b| plat_b| 0|500.0|1000.0| 250.5|500.67|
| b| plat_b| 1|400.0| 800.0|100.67|200.67|
| b| plat_b| 1|800.0|1600.0|201.07|401.07|
+----+--------+--------+-----+------+------+------+
I want to group by name, platform, and group_id, and count the rows in each group using the following column logic:
// normalize each value to a fraction of its dimension, rounded to 2 decimal places
new_x = Math.round(x / width * 100.0) / 100.0
new_y = Math.round(y / height * 100.0) / 100.0
So the output DataFrame would be:
+----+--------+--------+------+------+-----+
|name|platform|group_id| new_x| new_y|count|
+----+--------+--------+------+------+-----+
| a| plat_a| 0| 0.5| 0.5| 2|
| a| plat_a| 0| 0.4| 0.14| 1|
| b| plat_b| 0| 0.5| 0.5| 1|
| b| plat_b| 1| 0.25| 0.25| 2|
+----+--------+--------+------+------+-----+
How should I approach this?
This should be fairly simple with groupBy and count:
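import org.apache.spark.sql.functions._
// The $"col" syntax also needs `import spark.implicits._`, where `spark` is your
// SparkSession (spark-shell brings it into scope automatically).

// Derive the rounded ratios first, then include them in the grouping key.
df.withColumn("new_x", round($"x" / $"width" * 100.0) / 100.0)
  .withColumn("new_y", round($"y" / $"height" * 100.0) / 100.0)
  .groupBy("name", "platform", "group_id", "new_x", "new_y")
  .count()
  .show(false)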
Output:
+----+--------+--------+-----+-----+-----+
|name|platform|group_id|new_x|new_y|count|
+----+--------+--------+-----+-----+-----+
|a |plat_a |0 |0.5 |0.5 |2 |
|b |plat_b |0 |0.5 |0.5 |1 |
|b |plat_b |1 |0.25 |0.25 |2 |
|a |plat_a |0 |0.4 |0.14 |1 |
+----+--------+--------+-----+-----+-----+
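If you want to sanity-check the numbers without starting Spark, the same logic can be sketched with plain Scala collections. This is only an illustration under some assumptions: `Row`, `NormalizedCounts`, `normalize`, `countGroups`, and `sample` are hypothetical names made up for this sketch, and the rows are copied from the table in the question.

```scala
// Plain-Scala cross-check of the grouping logic (no Spark dependency).
// All names below are hypothetical and exist only for this sketch.
final case class Row(name: String, platform: String, groupId: Int,
                     width: Double, height: Double, x: Double, y: Double)

object NormalizedCounts {
  // Same arithmetic as in the question: scale by 100, round, scale back.
  def normalize(value: Double, size: Double): Double =
    math.round(value / size * 100.0) / 100.0

  // Group by (name, platform, group_id, new_x, new_y) and count each group.
  def countGroups(rows: Seq[Row]): Map[(String, String, Int, Double, Double), Int] =
    rows
      .groupBy(r => (r.name, r.platform, r.groupId,
                     normalize(r.x, r.width), normalize(r.y, r.height)))
      .map { case (key, members) => key -> members.size }

  // Rows copied from the input table in the question.
  val sample: Seq[Row] = Seq(
    Row("a", "plat_a", 0, 500.0, 1000.0, 250.41, 500.01),
    Row("a", "plat_a", 0, 250.0, 500.0, 125.75, 250.7),
    Row("a", "plat_a", 0, 300.0, 800.0, 120.0, 111.7),
    Row("b", "plat_b", 0, 500.0, 1000.0, 250.5, 500.67),
    Row("b", "plat_b", 1, 400.0, 800.0, 100.67, 200.67),
    Row("b", "plat_b", 1, 800.0, 1600.0, 201.07, 401.07)
  )

  def main(args: Array[String]): Unit =
    countGroups(sample).foreach { case (key, n) => println(s"$key -> $n") }
}
```

Note that for the third row, 111.7 / 800 * 100 is 13.9625, which rounds to 14, so new_y comes out as 0.14, matching the Spark output above. Grouping on rounded doubles works here because every key is produced by the same arithmetic; for real data you may prefer grouping on the integer value before dividing by 100.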