Why does all data end up in one partition after reduceByKey?

Patrick McGloin

I have a Spark application that contains the following section:

val repartitioned = rdd.repartition(16)
val filtered: RDD[(MyKey, myData)] = MyUtils.filter(repartitioned, startDate, endDate)
val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2))
val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)

When I run it with some logging, this is what I see:

repartitioned ======> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
filtered ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
mapped ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
reduced ======> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]

My logging is done with these two lines:

val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ======> [${sizes.collect().toList}]")

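My question is: why does my data end up in a single partition after reduceByKey? After the filter the data is evenly distributed across the 16 partitions, but after reduceByKey all 922 records sit in one partition.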

Patrick McGloin

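I'll answer my own question, since I figured it out. My DateTimes all have their seconds and milliseconds set to zero, because I want to group data belonging to the same minute. The hashCode() values of Joda DateTimes that are one minute apart differ by a constant: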

scala> val now = DateTime.now
now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z

scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
res42: Int = 60000
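
Why a constant gap of 60000 matters can be sketched in plain Scala, without Spark. This assumes the shuffle used hash partitioning with 16 partitions (as the 16-element output in the question suggests), where the partition index is the non-negative remainder of key.hashCode divided by the partition count. The object name and baseHash are made up for illustration. Since 60000 = 16 * 3750, every minute-aligned key has the same remainder modulo 16:

```scala
object HashStrideDemo {
  // Non-negative modulo, the same idea Spark's HashPartitioner applies to key.hashCode.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  val numPartitions = 16    // matches repartition(16) in the question
  val stride = 60000        // hashCode gap between consecutive minutes, from the REPL output above
  val baseHash = 123456789  // hypothetical starting hashCode, chosen arbitrarily

  // Hash codes for 1000 consecutive minute-aligned keys.
  val hashes: Seq[Int] = (0 until 1000).map(i => baseHash + i * stride)

  // Distinct partition indices those keys would be assigned to.
  val usedPartitions: Seq[Int] = hashes.map(h => nonNegativeMod(h, numPartitions)).distinct

  def main(args: Array[String]): Unit =
    println(s"distinct partitions used: ${usedPartitions.size}")
}
```

Whatever the starting hash code is, adding a multiple of 16 never changes the remainder, so all keys collapse into a single partition.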

The following example shows that if hashCode values are spaced at a regular interval like this, they can all end up in the same partition:

scala> val nums = for(i <- 0 to 1000000) yield ((i*20 % 1000), i)
nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...

scala> val rddNum = sc.parallelize(nums)
rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> val reducedNum = rddNum.reduceByKey(_+_)
reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at reduceByKey at <console>:25

scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, true).collect.toList

res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
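
The result above can be reproduced without a cluster. The sketch below assumes hash partitioning by the non-negative remainder of hashCode, and that an Int key's hashCode is the Int itself. The 50 distinct keys are all multiples of 20, so with 20 partitions they all map to index 0. The partition count of 19 is my own hypothetical comparison, not something from the original session:

```scala
object IntKeySpreadDemo {
  // Non-negative modulo, as used for hash partitioning.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // The 50 distinct keys produced by (i * 20 % 1000): 0, 20, ..., 980.
  val keys: Seq[Int] = (0 until 1000 by 20)

  // Number of distinct keys assigned to each partition for a given partition count.
  def keysPerPartition(numPartitions: Int): List[Int] = {
    val counts = Array.fill(numPartitions)(0)
    keys.foreach(k => counts(nonNegativeMod(k.hashCode, numPartitions)) += 1)
    counts.toList
  }

  val with20: List[Int] = keysPerPartition(20) // all 50 keys in partition 0
  val with19: List[Int] = keysPerPartition(19) // hypothetical count that shares no factor with 20

  def main(args: Array[String]): Unit = {
    println(s"20 partitions: $with20")
    println(s"19 partitions: $with19")
  }
}
```

If this reasoning holds for the real job, choosing a partition count that shares no factor with the key spacing (for example through the reduceByKey overload that takes a numPartitions argument) might be a simpler alternative to a custom partitioner, though I have not measured it on the original data.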

To distribute my data more evenly over the partitions, I created my own custom Partitioner:

class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
  def numPartitions: Int = rddNumPartitions
  def getPartition(key: Any): Int = {
    key match {
      case dateTime: DateTime =>
        val sum = dateTime.getYear + dateTime.getMonthOfYear +  dateTime.getDayOfMonth + dateTime.getMinuteOfDay  + dateTime.getSecondOfDay
        sum % numPartitions
      case _ => 0
    }
  }
}
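
A rough check of why this sum spreads minute-aligned keys: for such keys, getSecondOfDay is 60 times getMinuteOfDay, so the sum grows by 61 per minute, and 61 shares no factor with 16. The sketch below redoes the same arithmetic using java.time.LocalDateTime as a stand-in for Joda's DateTime, so that it runs without extra dependencies. The object and method names are hypothetical:

```scala
import java.time.LocalDateTime
import java.time.temporal.ChronoField

object JodaPartitionerArithmeticDemo {
  // Same sum as JodaPartitioner.getPartition, computed on java.time values
  // (an assumption made only to keep this sketch self-contained).
  def partitionFor(t: LocalDateTime, numPartitions: Int): Int = {
    val sum = t.getYear + t.getMonthValue + t.getDayOfMonth +
      t.get(ChronoField.MINUTE_OF_DAY) + t.toLocalTime.toSecondOfDay
    sum % numPartitions
  }

  // 16 consecutive minute-aligned timestamps on the same day.
  val start: LocalDateTime = LocalDateTime.of(2015, 11, 23, 11, 14)
  val minutes: Seq[LocalDateTime] = (0 until 16).map(i => start.plusMinutes(i.toLong))

  // Partition index for each timestamp, with 16 partitions.
  val partitions: Seq[Int] = minutes.map(t => partitionFor(t, 16))

  def main(args: Array[String]): Unit =
    println(s"partitions for 16 consecutive minutes: $partitions")
}
```

In the Spark job itself, the partitioner would be passed to the shuffle, for instance with mapped.reduceByKey(new JodaPartitioner(16), _ + _), which uses the reduceByKey overload that accepts a Partitioner.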

Edited on 2021-04-7
