Iterating over records in a Spark DataFrame and concatenating the current value with a previous value based on a condition

Vijay B

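I am very new to Spark and Scala. I am currently working with Spark DataFrames, and I need to iterate over the records and repeat the same value until the next condition is met. Please see the example below; the file has only one column. It contains two types of values: header data and detail data. Header data is always 10 characters long and detail data is always 15 characters long. I want to prepend each 10-character header to every 15-character record that follows it, until the next 10-character header is reached, and so on.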

df
---------------
1RHGTY567U //header data 
6786TYUIOPTR141 //detail data
6786TYUIOPTYU67 //detail data
T7997999HHBFFE6 //detail data
8YUITY567U      //header data 
HJS7890876997BB //detail data
BFJFBFKFN787897
GS678790877656H
BFJFDK786WQ4243
74849469GJGNVFM
67YUBMHJKH
VFJF788968FJFJD
HFJFGKJD789768D
GFJFHFFLLJFJDLD

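I tried this by collecting the DataFrame, looping over it and concatenating the records, as shown below. This approach is expensive, since calling collect() is not advisable. I could use the lag window function to concatenate the current value with the previous one, but my scenario is slightly different: the header may be any number of rows back, not just one.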

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.length
import spark.implicits._

val srcDF = spark.read.format("csv").load(location + "/" + filename)

   // Add a column holding the length of the value in the first column
   val newDF = srcDF.withColumn("col_length", length($"_c0"))

   // Convert the DataFrame to an RDD of (value, length) pairs
   val rdd = newDF.rdd.map(row => (row.getString(0), row.getInt(1)))

   // Iterate on the driver, prefixing each detail record with the latest header
   var rec = ""
   val srcModified = ListBuffer[String]()
   for ((value, len) <- rdd.collect) {
      if (len == 10) {   // header record (10 characters)
         rec = value
         srcModified += rec
      }
      else {             // detail record (15 characters)
         srcModified += rec + value
      }
   }

   // Convert the ListBuffer back to an RDD
   val modifiedRDD = sc.parallelize(srcModified.toSeq)

The output I expect is as follows:

new_DF
------

1RHGTY567U //header data 
1RHGTY567U6786TYUIOPTR141 //header data concatenated with detail data
1RHGTY567U6786TYUIOPTYU67 //header data concatenated with detail data
1RHGTY567UT7997999HHBFFE6 //header data concatenated with detail data
8YUITY567U      //header data 
8YUITY567UHJS7890876997BB //header data concatenated with detail data
8YUITY567UBFJFBFKFN787897 //header data concatenated with detail data
8YUITY567UGS678790877656H //header data concatenated with detail data
8YUITY567UBFJFDK786WQ4243 //header data concatenated with detail data
8YUITY567U74849469GJGNVFM //header data concatenated with detail data
67YUBMHJKH
67YUBMHJKHVFJF788968FJFJD
67YUBMHJKHHFJFGKJD789768D
67YUBMHJKHGFJFHFFLLJFJDLD

Any suggestions, please?

pasha701

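An increasing id column can be added to the DataFrame, and a Window ordered by that column can then find the latest header with the "last" function: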

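import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Tag each row with an increasing id so the row order can be used in a window
val withId = originalDF.select($"value", monotonically_increasing_id().alias("id"))

// No partitionBy here, so Spark moves all rows into a single partition for this window
val idWindow = Window.orderBy("id")
val result = withId
  // Headers are shorter than 15 characters; last(..., true) ignores nulls,
  // so it carries the most recent header forward to every following row
  .withColumn("previousHeader",
      last( when(length($"value") < 15, $"value")
            .otherwise(null), true).over(idWindow)
          )
  .select(
      // A header row is kept as-is; a detail row is prefixed with its header
      when($"value"=== $"previousHeader", $"value")
      .otherwise(concat($"previousHeader", $"value")).alias("value")
  )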
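
To see what the window expression computes for each row, the same forward-fill can be modelled on a plain Scala collection. This is only a minimal local sketch, not Spark code: the names ForwardFillSketch, isHeader and attachHeaders are hypothetical and introduced here for illustration, and the sample records are taken from the question.

```scala
object ForwardFillSketch {
  // Hypothetical helper: a record counts as a header when it is shorter than
  // 15 characters, mirroring length($"value") < 15 in the answer above.
  def isHeader(record: String): Boolean = record.length < 15

  // Hypothetical helper: walks the records in order, remembering the latest
  // header seen so far, and prefixes every detail record with it.
  // Assumption: a detail record that appears before any header is kept
  // unchanged here, whereas the Spark expression would yield null for it.
  def attachHeaders(records: Seq[String]): Seq[String] =
    records
      .scanLeft((Option.empty[String], "")) { case ((header, _), record) =>
        if (isHeader(record)) (Some(record), record)
        else (header, header.getOrElse("") + record)
      }
      .drop(1)      // scanLeft emits its seed element first; discard it
      .map(_._2)    // keep only the rewritten record

  def main(args: Array[String]): Unit = {
    val records = Seq(
      "1RHGTY567U", "6786TYUIOPTR141", "6786TYUIOPTYU67",
      "8YUITY567U", "HJS7890876997BB"
    )
    attachHeaders(records).foreach(println)
  }
}
```

The sketch relies on seeing the records sequentially in one place, which is exactly what a distributed DataFrame does not offer by default; the ordered window in the answer is what provides that ordering inside Spark.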


Edited on 2021-07-21
