PySpark：连接具有“Struc”数据类型的两列--> 错误：由于数据类型不匹配而无法解析

松子0

我在 PySpark 中有一个数据表，其中包含数据类型为“struc”的两列。

请参阅下面的示例数据框：

word_verb                   word_noun
{_1=cook, _2=VB}            {_1=chicken, _2=NN}
{_1=pack, _2=VBN}           {_1=lunch, _2=NN}
{_1=reconnected, _2=VBN}    {_1=wifi, _2=NN}

我想将两列连接在一起，以便我可以对连接的动词和名词块进行频率计数。

我试过下面的代码：

df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))

但我收到以下错误：

AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]

我想要的输出表如下。连接的新字段的数据类型为字符串：

word_verb                   word_noun               word_chunk_final
{_1=cook, _2=VB}            {_1=chicken, _2=NN}     cook chicken
{_1=pack, _2=VBN}           {_1=lunch, _2=NN}       pack lunch
{_1=reconnected, _2=VBN}    {_1=wifi, _2=NN}        reconnected wifi

泡利

你的代码就快到了。

假设您的架构如下：

df.printSchema()
#root
# |-- word_verb: struct (nullable = true)
# |    |-- _1: string (nullable = true)
# |    |-- _2: string (nullable = true)
# |-- word_noun: struct (nullable = true)
# |    |-- _1: string (nullable = true)
# |    |-- _2: string (nullable = true)

您只需要访问_1每一列的字段值：

import pyspark.sql.functions as F

df.withColumn(
    "word_chunk_final", 
    F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
).show()
#+-----------------+------------+----------------+
#|        word_verb|   word_noun|word_chunk_final|
#+-----------------+------------+----------------+
#|        [cook,VB]|[chicken,NN]|    cook chicken|
#|       [pack,VBN]|  [lunch,NN]|      pack lunch|
#|[reconnected,VBN]|   [wifi,NN]|reconnected wifi|
#+-----------------+------------+----------------+

此外，您应该使用concat_ws("concatenate with separator") 而不是concat将字符串添加在一起，并在它们之间留一个空格。它类似于str.join在 python 中的工作方式。

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。

编辑于 2021-07-11

我来说两句

0 条评论

登录后参与评论

上一篇：AWS S3 HTTP POST - 重定向到带有 URL 参数的页面

PySpark：连接具有“Struc”数据类型的两列--> 错误：由于数据类型不匹配而无法解析

PySpark：连接具有“Struc”数据类型的两列--> 错误：由于数据类型不匹配而无法解析

UITableView的项目向下滚动后更改颜色，然后快速备份

Linux的官方Adobe Flash存储库是否已过时？

用日期数据透视表和日期顺序查询

应用发明者仅从列表中选择一个随机项一次

Mac OS X更新后的GRUB 2问题

验证REST API参数

Java Eclipse中的错误13，如何解决？

带有错误“ where”条件的查询如何返回结果？

ggplot：对齐多个分面图-所有大小不同的分面

尝试反复更改屏幕上按钮的位置 - kotlin android studio

如何从视图一次更新多行（ASP.NET - Core）

计算数据帧中每行的NA

蓝屏死机没有修复解决方案

在 Python 2.7 中。如何从文件中读取特定文本并分配给变量

离子动态工具栏背景色

VB.net将2条特定行导出到DataGridView

通过 Git 在运行 Jenkins 作业时获取 ClassNotFoundException

在Windows 7中无法删除文件（2）

python中的boto3文件上传

当我尝试下载 StanfordNLP en 模型时，出现错误

Node.js中未捕获的异常错误，发生调用