How to import a public dataset into a Google Cloud bucket

Kaustubh Mulay

I am working with a dataset that contains information about 311 calls in the United States. This dataset is publicly available in BigQuery, and I would like to copy it directly into my bucket. However, since I am new to Google Cloud, I have no idea how to do this.

Here is a screenshot of the dataset's public location on Google Cloud:

[Screenshot showing the available datasets]

I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly, without having to download the 12 GB file and upload it again through my VM instance?

itroulli

If you select the 311_service_requests table from the list on the left, an "Export" button will appear:

[Screenshot of the BigQuery Export button]

Then you can select Export to GCS, choose your bucket, type a filename, pick a format (CSV or JSON), and tick the checkbox if you want the exported file to be compressed (GZIP).

However, there are some limitations in BigQuery exports. These are the ones from the documentation that apply to your case (a scripted alternative to the console export is sketched right after the list):

  • You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
  • When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
  • You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.
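
If you prefer to run the export programmatically instead of through the console, a minimal sketch with the BigQuery Python client library could look like this (the bucket name is a placeholder, and the wildcard in the destination URI lets BigQuery split the export into multiple files, which is needed here because the table is larger than 1 GB):

from google.cloud import bigquery

client = bigquery.Client()

# Fully qualified name of the public source table.
source_table = "bigquery-public-data.new_york_311.311_service_requests"

# The wildcard lets BigQuery shard an export that exceeds the 1 GB per-file limit.
destination_uri = "gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
)

extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export job to finish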

EDIT:

A simple way to merge the output files together is the gsutil compose command. However, if you do this, the header row with the column names will appear multiple times in the resulting file, because it is present in every file extracted from BigQuery.

To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:

bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv
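
To check how many files the export produced, you can list the shards first, e.g.:

gsutil ls gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv

Keep in mind that a single gsutil compose call accepts at most 32 source objects, so if the export produces more shards than that you would have to compose them in batches.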

and then create the composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv

Now the all_data.csv file contains no headers at all. If you still need the column names to appear in the first row, you have to create another CSV file with the column names and compose it together with the data file. You can do this manually by pasting the following (the column names of the 311_service_requests table) into a new file:

unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location

or with the following short Python script (useful for tables with so many columns that doing it by hand is impractical), which queries the column names of the table and writes them into a CSV file:

from google.cloud import bigquery

client = bigquery.Client()

# Query the column names from INFORMATION_SCHEMA, ordered so that they
# match the column order of the exported table data.
query = """
    SELECT column_name
    FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = '311_service_requests'
    ORDER BY ordinal_position
"""
query_job = client.query(query)

# Collect the column names and write them as a single CSV header row.
columns = []
for row in query_job:
    columns.append(row["column_name"])
with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)

Note that for the above script to run you need to have the BigQuery Python Client library installed:

pip install --upgrade google-cloud-bigquery 

Upload the headers.csv file to your bucket:

gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv

And now you are ready to create the final composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv

Or, if you want the headers anyway, you can skip creating the first composite and build the final one from all the sources at once:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
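
To sanity-check the result, one option is to print the first few hundred bytes of the composed object and verify that the header row comes first (gsutil cat with a byte range):

gsutil cat -r 0-500 gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv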
