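Converting a PySpark DataFrame containing an array of structs to a Python class and back

Thomas R

I am using Zeppelin with PySpark DataFrames in a Spark 2.3.2 environment, and I need to move data into a class and back out of it.

I am having trouble adding the array of structs the right way.

Edit: the DataFrame can be generated as follows: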

dfPre =  sqlContext.createDataFrame([
  (1,11,53,8),
  (1,12,54,7),
  (1,16,51,11),
  (2,21,63,13),
  (2,23,65,15),
],("ID", "itemID", "Attribute1", "Attribute2"))

import pyspark.sql.functions as f
df = dfPre.groupBy(f.col("ID")).agg(f.collect_list(f.struct(f.col("itemID"),f.col("Attribute1"),f.col("Attribute2"))).alias("items"))

df.printSchema()

root 
|-- ID: string (nullable = true) 
|-- items: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- itemID: string (nullable = true) 
| | |-- Attribute1: double (nullable = true) 
| | |-- Attribute2: double (nullable = true)

df.show(2,False)

+---+------------------------------------------+ 
|ID |items                                     | 
+---+------------------------------------------+ 
|1  |[[11, 53, 11], [16, 51, 8], [12, 54, 7]]  | 
|2  |[[23, 65, 13], [21, 63, 15]]              | 
+---+------------------------------------------+

The classes look, for example, like this:

class Request:
    def __init__(self, data):
        self.ID = data["ID"]
        self.items = map(Items, data["items"])
    def __repr__(self):
        return "<ID:%s items:%s>" % (self.ID, self.items)
    def __str__(self):
        return "ID:%s items:%s" % (self.ID, self.items)

class Items: 
    def __init__(self, data):
        self.itemID = data["itemID"]
        self.Attribute1 = data["Attribute1"]
        self.Attribute2 = data["Attribute2"]
    def __repr__(self):
        return "<itemID:%s Attribute1:%s Attribute2:%s>" % (self.itemID, self.Attribute1, self.Attribute2)
    def __str__(self):
        return "itemID:%s Attribute1:%s Attribute2:%s" % (self.itemID, self.Attribute1, self.Attribute2)

To get the array into the class, I tried the following:

data = df.toPandas()
row = 0

ID = data['ID'][row]

itemList =[]
for i in range(len(data['items'][row])):
    itemList.append({"itemID": data['items'][row][i]['itemID'],
        "Attribute1": data['items'][row][i]['Attribute1'],
        "Attribute2": data['items'][row][i]['Attribute2']    })

items = {'items': itemList}

requestDataDict = {"ID": ID,"items": itemList}
request = Request(requestDataDict)

However, either I am not passing the array into the class properly, or I cannot get it back out of the class.

print(request)

>> ID:102 items:<map object at 0x7fb54e234cf8>

def classExport(request):
    return request.items

test = classExport(request)

z.show(test)

>> <map object at 0x7fb54e234cf8>

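In the end, I would like to get the first row of the original DataFrame back from the class.

Thanks in advance.

Thomas R

I found a solution myself:

I had made the Request class and the Items class printable, but a map object itself has no useful printed representation.

However, the elements of the map are instances of the Items class, so they can be printed: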

for x in test:
    print(x)

> itemID:16 Attribute1:51 Attribute2:11 
> itemID:11 Attribute1:53 Attribute2:8 
> itemID:12 Attribute1:54 Attribute2:7
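
One caveat with the loop above: in Python 3, `map` returns a lazy, single-use iterator, so once it has been looped over it is exhausted and a second loop prints nothing. A minimal sketch of that behavior, using a simplified stand-in class `Item` (hypothetical, not the `Items` class from the question):

```python
class Item:
    """Simplified stand-in for the Items class (hypothetical, for illustration)."""

    def __init__(self, data):
        self.itemID = data["itemID"]

    def __repr__(self):
        return "<itemID:%s>" % self.itemID


raw = [{"itemID": 16}, {"itemID": 11}, {"itemID": 12}]

lazy = map(Item, raw)  # Python 3: a lazy iterator, not a list

first_pass = [repr(x) for x in lazy]   # consumes the iterator
second_pass = [repr(x) for x in lazy]  # nothing left to iterate over

print(first_pass)   # ['<itemID:16>', '<itemID:11>', '<itemID:12>']
print(second_pass)  # []
```

This is why storing the result of `map` directly on an object tends to cause surprises: the attribute can only be read once.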

If the definition of the Request class is changed so that the map object is converted to a list, it can be printed right away:

self.items = list(map(Items, data["items"]))

The output then changes to:

print(request)

> ID:1 items:[<itemID:16 Attribute1:51 Attribute2:11>, <itemID:11 Attribute1:53 Attribute2:8>, <itemID:12 Attribute1:54 Attribute2:7>]
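
The question also asks how to get the original row back out of the class. One possible approach, sketched here in plain Python without Spark, is to give each class a `to_dict` method that rebuilds the nested dict the object was created from. The `to_dict` methods and the sample values in `source` are assumptions for illustration and are not part of the original post.

```python
class Items:
    def __init__(self, data):
        self.itemID = data["itemID"]
        self.Attribute1 = data["Attribute1"]
        self.Attribute2 = data["Attribute2"]

    def to_dict(self):
        # Hypothetical helper: rebuild the dict this item was created from.
        return {
            "itemID": self.itemID,
            "Attribute1": self.Attribute1,
            "Attribute2": self.Attribute2,
        }


class Request:
    def __init__(self, data):
        self.ID = data["ID"]
        # Build a real list so the items can be read more than once.
        self.items = [Items(item) for item in data["items"]]

    def to_dict(self):
        # Hypothetical helper: rebuild the nested row structure.
        return {"ID": self.ID, "items": [item.to_dict() for item in self.items]}


# Sample values taken from the ID == 1 rows of dfPre in the question.
source = {
    "ID": 1,
    "items": [
        {"itemID": 11, "Attribute1": 53, "Attribute2": 8},
        {"itemID": 12, "Attribute1": 54, "Attribute2": 7},
        {"itemID": 16, "Attribute1": 51, "Attribute2": 11},
    ],
}

request = Request(source)
exported = request.to_dict()
print(exported == source)  # True: the round trip preserves the data
```

How the exported dict is turned back into a Spark row depends on the environment and is left out of this sketch.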
