PySpark unable to resolve column due to data type mismatch

Data view

Getting the following error in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"

Data structure:

root
 |-- result_set: struct (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- dates: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- trackers: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- countries: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- country: string (nullable = true)
 |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
 |    |    |    |    |    |-- token: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- token: string (nullable = true)

I am trying to create a view that shows the currency, the dates and the tracker tokens:

df.select('result_set.currency', 'result_set.dates.date', 'result_set.dates.trackers.token').show()
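
For context, a hedged note on what the error means (assuming df is the DataFrame read from this JSON): dotted field extraction only reaches through one level of array of structs, so trackers.token cannot be resolved as a field; Spark instead parses it as an ordinal lookup ['token'] on the array, which requires an integer index.

#dotted extraction works through one level of array of structs
df.select('result_set.dates.date').show()
#trackers sits inside a nested array, so '.token' falls back to ['token'] and
#fails; an explicit index avoids the ordinal lookup
df.selectExpr('result_set.dates.trackers[0].token').show()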

Data sample:

"result_set": {
        "token": "abcdef",
        "name": "Facebook",
        "currency": "EUR",
        "dates": [
            {
                "date": "2020-03-11",
                "trackers": [
                    {
                        "token": "12345",
                        "countries": [
                            {
                                "country": "am",
                                "os_names": [
                                    {
                                        "os_name": "android",
                                        "kpi_values": [
                                            0,
                                            0,
                                            0,
                                            0,
                                            0,
                                            0,
                                            1,
                                            0,
                                            0
                                        ]
                                    }
                                ]
                            },

I am trying to create views based on some of these levels of the JSON data.
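
Since the goal is a view, one hedged option (the view name "report" and the aliases below are placeholders, not from the original post) is to register the DataFrame as a temporary view and flatten it with LATERAL VIEW explode in Spark SQL:

df.createOrReplaceTempView("report")

spark.sql("""
    SELECT r.result_set.currency AS currency,
           d.date                AS date,
           t.token               AS token
    FROM report r
    LATERAL VIEW explode(r.result_set.dates) dt AS d
    LATERAL VIEW explode(d.trackers) tr AS t
""").show()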

Update:

Duplicated tokens

df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
selectExpr("currency","date","token", "explode(countries)").\
select("currency","date","token", "col.*").\
selectExpr("currency","date","token", "country", "explode(os_names)").\
select("currency","date","token", "country", "col.*").\
selectExpr("currency","date","token", "country", "os_name", "explode(kpi_values)").\
show(20)

After the explodes, each token is now repeated 8 times.
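
That repetition is expected: once kpi_values is exploded, the token row is replicated once per KPI value. A hedged sketch of two ways to keep one row per token (the variable names are illustrative):

from pyspark.sql import functions as F

#flatten only down to trackers, so the token is not repeated per kpi value
tokens = df.select(F.col("result_set.currency").alias("currency"),
                   F.explode("result_set.dates").alias("d")).\
select("currency", F.col("d.date").alias("date"),
       F.explode("d.trackers").alias("t")).\
select("currency", "date", F.col("t.token").alias("token"))
tokens.show()

#or keep the fully exploded frame (here called `exploded`) and de-duplicate
#the columns of interest:
#exploded.dropDuplicates(["currency", "date", "token"]).show()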

Shu

When accessing an array of structs, we need to provide the index of the element we want to access: 0, 1, 2, ...etc.

  • If you need to select all elements of the array, then you need to use explode()

Example:

df.printSchema()
#root
# |-- result_set: struct (nullable = true)
# |    |-- currency: string (nullable = true)
# |    |-- dates: array (nullable = true)
# |    |    |-- element: struct (containsNull = true)
# |    |    |    |-- date: string (nullable = true)
# |    |    |    |-- trackers: array (nullable = true)
# |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |-- countries: array (nullable = true)
# |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |-- country: string (nullable = true)
# |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
# |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
# |    |    |    |    |    |    |    |    |    |    |-- element: long (containsNull = true)
# |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
# |    |    |    |    |    |-- token: string (nullable = true)
# |    |-- name: string (nullable = true)
# |    |-- token: string (nullable = true)

#accessing token,date from array
df.selectExpr('result_set.dates.trackers[0].token','result_set.currency', 'result_set.dates.date').show()
#+--------------------------------------------------+--------+------------+
#|result_set.dates.trackers AS trackers#194[0].token|currency|        date|
#+--------------------------------------------------+--------+------------+
#|                                           [12345]|     EUR|[2020-03-11]|
#+--------------------------------------------------+--------+------------+

#accessing first elements from dates, trackers array and extracting date,token values
df.selectExpr('result_set.dates[0].trackers[0].token as token','result_set.currency', 'result_set.dates[0].date as date').show()
#+-----+--------+----------+
#|token|currency|      date|
#+-----+--------+----------+
#|12345|     EUR|2020-03-11|
#+-----+--------+----------+

#if you need to select all elements of the array then explode the array and select the data
df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
select("currency","date","token").\
show()

#+--------+----------+-----+
#|currency|      date|token|
#+--------+----------+-----+
#|     EUR|2020-03-11|12345|
#+--------+----------+-----+
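
For completeness, a hedged sketch (the column aliases are illustrative, not part of the original answer) that extends the same explode chain down to the KPI values and keeps each value's position with posexplode:

from pyspark.sql import functions as F

df.select(F.col("result_set.currency").alias("currency"),
          F.explode("result_set.dates").alias("d")).\
select("currency", F.col("d.date").alias("date"),
       F.explode("d.trackers").alias("t")).\
select("currency", "date", F.col("t.token").alias("token"),
       F.explode("t.countries").alias("c")).\
select("currency", "date", "token", F.col("c.country").alias("country"),
       F.explode("c.os_names").alias("o")).\
select("currency", "date", "token", "country",
       F.col("o.os_name").alias("os_name"),
       F.posexplode("o.kpi_values").alias("kpi_index", "kpi_value")).\
show(truncate=False)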
