Getting different results using RDD and DataFrames

Fisseha Berhane

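I am using a Spark RDD and a DataFrame to produce a word count of a text file, but the answers are slightly different. The data I am using is available here. Why are the answers different?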

from pyspark.sql.functions import col, trim, lower, regexp_replace, explode, split, length
from pyspark.sql import Row

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

DataFrame

def clean1(line):
    # Drop everything except letters, digits and whitespace, then lowercase and trim
    return trim(lower(regexp_replace(line, r'[^a-zA-Z0-9\s]', '')))

df1 = (spark.read.text("Quran.txt")
    .select(clean1('value').alias('line'))
    .select(explode(split('line', ' ')).alias('word'))
    .filter(length(col("word")) > 0)
    .filter(~col('word').isin(stopwords))
    .groupBy('word').count()
    .orderBy('count', ascending=False)
)

df1.show(10)

RDD

import re

def cleanline(line):
    # Drop everything except letters, digits and whitespace, then lowercase and strip
    return re.sub(r'[^a-zA-Z0-9\s]', '', line).lower().strip()

df2 = (sc.textFile('Quran.txt')
    .map(cleanline)
    .flatMap(lambda x: x.split(' '))
    .filter(lambda x: len(x) > 0)
    .filter(lambda x: x not in stopwords)
    .map(lambda x: (x, 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda x: -x[1])
    .map(lambda x: Row(word=x[0], count=x[1]))
    .toDF()
    .select(['word', 'count'])
)

df2.show(10)

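Update

I found that if I change the regular expression in the DataFrame part from '[^a-zA-Z0-9\s]' to '[^a-zA-Z0-9 ]', the answers become the same. Aren't these two regex patterns equivalent?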

Ramesh Maharjan

If you look at the definition of the trim function:

'ltrim': 'Trim the spaces from left end for the specified string value.',
'rtrim': 'Trim the spaces from right end for the specified string value.',
'trim': 'Trim the spaces from both ends for the specified string column.',

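This says that trim removes only spaces, not tabs (\t). However, some lines in the file, such as the one below, contain tabs, and the trim function does not remove them.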

God could destroy him if He chose, v. 19 (488)

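That is why the god in the line above, which has a tab in front of it, is not counted: after cleaning, the token still carries the tab and so does not match "god". Python's strip() function, by contrast, removes all leading and trailing whitespace, including the tab at the front.

The same applies to the other counts.

So one solution is to define a udf that calls Python's strip() function.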
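The effect can be reproduced in plain Python without a Spark session. This is only a sketch: `str.strip(' ')` stands in for Spark's trim (an assumption based on the docstring quoted above), the sample line is assumed to start with a tab, and the helper names are made up for illustration.

```python
# A cleaned, lowercased line that still begins with a tab character.
line = "\tgod could destroy him if he chose v 19 488"

def tokens_space_trim(text):
    # Stand-in for Spark's trim: removes only the space character at both ends.
    return [w for w in text.strip(' ').split(' ') if len(w) > 0]

def tokens_full_strip(text):
    # Python's strip() with no argument removes all leading/trailing whitespace.
    return [w for w in text.strip().split(' ') if len(w) > 0]

space_trimmed = tokens_space_trim(line)
fully_stripped = tokens_full_strip(line)

print(repr(space_trimmed[0]))   # '\tgod'
print(repr(fully_stripped[0]))  # 'god'
```

With the space-only trim, the first token keeps its tab and is grouped under a different key from "god", which would explain the lower count in the DataFrame version.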

from pyspark.sql import functions as F
from pyspark.sql import types as T

def stripUdf(x):
    # Python's strip() removes all surrounding whitespace, tabs included
    return x.strip() if x is not None else None

callStripUdf = F.udf(stripUdf, T.StringType())

def clean1(line):
    return callStripUdf(F.trim(F.lower(F.regexp_replace(line, r'[^a-zA-Z0-9\s]', ''))))

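Now, as you mentioned, changing [^a-zA-Z0-9\s] to [^a-zA-Z0-9 ] solves the problem. That is because \s matches every whitespace character, including tabs (\t), so the first pattern keeps the tabs. The second pattern keeps only the space character, so regexp_replace replaces the tabs with the empty string, and trim then has only spaces left to remove.

I hope the answer is helpful.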
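The difference between the two patterns can be checked with Python's `re` module, which treats `\s` the same way as Spark's regex engine as far as tabs are concerned. A minimal sketch, assuming the sample line begins with a tab:

```python
import re

# Sample line from the answer, assumed to start with a tab.
raw = "\tGod could destroy him if He chose, v. 19 (488)"

# \s inside the negated class means "keep all whitespace", so the tab survives.
keep_all_whitespace = re.sub(r'[^a-zA-Z0-9\s]', '', raw)

# A literal space inside the negated class means "keep spaces only", so the tab is removed.
keep_spaces_only = re.sub(r'[^a-zA-Z0-9 ]', '', raw)

print(repr(keep_all_whitespace))  # '\tGod could destroy him if He chose v 19 488'
print(repr(keep_spaces_only))     # 'God could destroy him if He chose v 19 488'
```

So the two patterns are not equivalent: they differ on every whitespace character other than the plain space.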

Edited on 2020-11-28