dask数据帧读取实木复合地板架构差异

阿波斯托洛斯

我执行以下操作:

import dask.dataframe as dd
from dask.distributed import Client
client = Client()

raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

该数据集摘自Mathew Rocklin所做的演示,并被用作dask数据框演示。然后我尝试使用pyarrow将其写入镶木地板

raw_data_df.to_parquet(path='dataset/parquet/2015.parquet/') # only pyarrow is installed

尝试读回:

raw_data_df = dd.read_parquet(path='dataset/parquet/2015.parquet/')

我收到以下错误:

ValueError: Schema in dataset/parquet/2015.parquet//part.192.parquet was different. 

VendorID: double
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
pickup_longitude: double
pickup_latitude: double
RateCodeID: int64
store_and_fwd_flag: binary
dropoff_longitude: double
dropoff_latitude: double
payment_type: double
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
metadata
--------
{'pandas': '{"pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "VendorID", "name": "VendorID", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tpep_pickup_datetime", "name": "tpep_pickup_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "tpep_dropoff_datetime", "name": "tpep_dropoff_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "passenger_count", "name": "passenger_count", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "trip_distance", "name": "trip_distance", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_longitude", "name": "pickup_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_latitude", "name": "pickup_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "RateCodeID", "name": "RateCodeID", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "field_name": "store_and_fwd_flag", "name": "store_and_fwd_flag", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "dropoff_longitude", "name": "dropoff_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "dropoff_latitude", "name": "dropoff_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "payment_type", "name": "payment_type", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "fare_amount", "name": "fare_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "extra", "name": "extra", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "mta_tax", "name": "mta_tax", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tip_amount", "name": "tip_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tolls_amount", "name": "tolls_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "improvement_surcharge", "name": "improvement_surcharge", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "total_amount", "name": "total_amount", "numpy_type": "float64", "pandas_type": "float64"}], "column_indexes": []}'}
vs

VendorID: double
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
pickup_longitude: double
pickup_latitude: double
RateCodeID: double
store_and_fwd_flag: binary
dropoff_longitude: double
dropoff_latitude: double
payment_type: double
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
metadata
--------
{'pandas': '{"pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "VendorID", "name": "VendorID", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tpep_pickup_datetime", "name": "tpep_pickup_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "tpep_dropoff_datetime", "name": "tpep_dropoff_datetime", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}, {"metadata": null, "field_name": "passenger_count", "name": "passenger_count", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "trip_distance", "name": "trip_distance", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_longitude", "name": "pickup_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "pickup_latitude", "name": "pickup_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "RateCodeID", "name": "RateCodeID", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "store_and_fwd_flag", "name": "store_and_fwd_flag", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "dropoff_longitude", "name": "dropoff_longitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "dropoff_latitude", "name": "dropoff_latitude", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "payment_type", "name": "payment_type", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "fare_amount", "name": "fare_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "extra", "name": "extra", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "mta_tax", "name": "mta_tax", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tip_amount", "name": "tip_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "tolls_amount", "name": "tolls_amount", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "improvement_surcharge", "name": "improvement_surcharge", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "total_amount", "name": "total_amount", "numpy_type": "float64", "pandas_type": "float64"}], "column_indexes": []}'}

但是看起来它们看起来一样。对确定原因有帮助吗?

勤工俭学

以下两个numpy规格不同意

{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'int64', 'pandas_type': 'int64'}

RateCodeID: int64 


{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'float64', 'pandas_type': 'float64'}

RateCodeID: double

(仔细地看!)

我建议您在加载时为该列提供dtype,或astype在写入之前使用它们来强制它们为浮点型。

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章