I have a DataFrame in PySpark like the one below. I want to select serial_num, devicetype, device_model, and the count of distinct timestamp values for each serial_num from the DataFrame below:
+-------------+-----------------+---------------+------------------------+
| serial_num | devicetype | device_model | timestamp |
+-------------+-----------------+---------------+------------------------+
| 58172A0396 | | | 2003-01-02 17:37:15.0 |
| 58172A0396 | | | 2003-01-02 17:37:15.0 |
| 46C5Y00693 | Mac Pro | Mac PC | 2018-01-03 17:17:23.0 |
| 1737K7008F | Windows PC | Windows PC | 2018-01-05 11:12:31.0 |
| 1737K7008F | Network Device | Unknown | 2018-01-05 11:12:31.0 |
| 1737K7008F | Network Device | Unknown | 2018-01-05 11:12:31.0 |
| 1737K7008F | Network Device | | 2018-01-06 03:12:52.0 |
| 1737K7008F | Windows PC | Windows PC | 2018-01-06 03:12:52.0 |
| 1737K7008F | Network Device | Unknown | 2018-01-06 03:12:52.0 |
| 1665NF01F3 | Network Device | Unknown | 2018-01-07 03:42:34.0 |
+-------------+-----------------+---------------+------------------------+
I have tried the following:
df1 = df.select('serial_num', 'devicetype', 'device_model', f.count('distinct timestamp').over(Window.partitionBy('serial_num')).alias('val'))
The result I want is:
+-------------+-----------------+---------------+-----+
| serial_num | devicetype | device_model |count|
+-------------+-----------------+---------------+-----+
| 58172A0396 | | | 1 |
| 58172A0396 | | | 1 |
| 46C5Y00693 | Mac Pro | Mac PC | 1 |
| 1737K7008F | Windows PC | Windows PC | 2 |
| 1737K7008F | Network Device | Unknown | 2 |
| 1737K7008F | Network Device | Unknown | 2 |
| 1737K7008F | Network Device | | 2 |
| 1737K7008F | Windows PC | Windows PC | 2 |
| 1737K7008F | Network Device | Unknown | 2 |
| 1665NF01F3 | Network Device | Unknown | 1 |
+-------------+-----------------+---------------+-----+
How can I achieve this?
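Unfortunately, countDistinct is not supported over windows. However, a combination of collect_set and size can be used to achieve the same end result: collect_set gathers the distinct timestamp values within each window partition, and size counts them. This is only supported in Spark 2.0+. Use it as follows: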
import pyspark.sql.functions as F
from pyspark.sql import Window
w = Window.partitionBy('serial_num')
df1 = df.select(..., F.size(F.collect_set('timestamp').over(w)).alias('count'))
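To make clear what the windowed collect_set plus size computes, here is a minimal plain-Python sketch of the same logic applied to the sample rows from the question. It is not Spark code: the function name add_distinct_count and the tuple layout are assumptions made for illustration, and it is only meant as a reference for checking the Spark output on a small sample.

```python
from collections import defaultdict

# Sample rows from the question: (serial_num, devicetype, device_model, timestamp).
rows = [
    ("58172A0396", "", "", "2003-01-02 17:37:15.0"),
    ("58172A0396", "", "", "2003-01-02 17:37:15.0"),
    ("46C5Y00693", "Mac Pro", "Mac PC", "2018-01-03 17:17:23.0"),
    ("1737K7008F", "Windows PC", "Windows PC", "2018-01-05 11:12:31.0"),
    ("1737K7008F", "Network Device", "Unknown", "2018-01-05 11:12:31.0"),
    ("1737K7008F", "Network Device", "Unknown", "2018-01-05 11:12:31.0"),
    ("1737K7008F", "Network Device", "", "2018-01-06 03:12:52.0"),
    ("1737K7008F", "Windows PC", "Windows PC", "2018-01-06 03:12:52.0"),
    ("1737K7008F", "Network Device", "Unknown", "2018-01-06 03:12:52.0"),
    ("1665NF01F3", "Network Device", "Unknown", "2018-01-07 03:42:34.0"),
]


def add_distinct_count(rows):
    """Hypothetical helper: replace each row's timestamp with the number of
    distinct timestamps in that row's serial_num partition."""
    # Plays the role of collect_set('timestamp') over a window partitioned
    # by serial_num: one set of distinct timestamps per serial_num.
    seen = defaultdict(set)
    for serial_num, _, _, timestamp in rows:
        seen[serial_num].add(timestamp)
    # Plays the role of size(...): every input row is kept, in order,
    # and receives the size of its partition's set.
    return [
        (serial_num, devicetype, device_model, len(seen[serial_num]))
        for serial_num, devicetype, device_model, _ in rows
    ]


result = add_distinct_count(rows)
for row in result:
    print(row)
```

Unlike a groupby, no rows are collapsed: all ten input rows are kept, and each one carries the count for its own serial_num.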
For older Spark versions, you can use groupby and countDistinct to create a new DataFrame containing all the counts, then join it back to the original DataFrame on serial_num:
df2 = df.groupby('serial_num').agg(F.countDistinct('timestamp').alias('count'))
df1 = df.join(df2, 'serial_num')