How to import a public dataset into a Google Cloud bucket

Kaustubh Mulay

I am working with a dataset that contains information about 311 calls in the United States. This dataset is publicly available in BigQuery, and I would like to copy it directly into my bucket. However, since I am new to Google Cloud, I have no idea how to do this.

Here is a screenshot of the dataset's public location on Google Cloud:

[Screenshot showing the available datasets]

I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly, without having to download the 12 GB file and upload it again through my VM instance?

itroulli

If you select the 311_service_requests table from the list on the left, an "Export" button will appear:

[Screenshot of the BigQuery Export button]

Then you can select Export to GCS, choose your bucket, type a filename, pick a format (CSV or JSON), and tick the checkbox if you want the exported file to be compressed (GZIP).

However, there are some limitations in BigQuery exports. These are the ones from the documentation that apply to your case (a scripted alternative to the console export is sketched right after the list):

  • You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
  • When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
  • You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.
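
If you prefer to run the export programmatically instead of through the console, a minimal sketch with the BigQuery Python client library could look like this (the bucket name is a placeholder, and the wildcard in the destination URI lets BigQuery split the export into multiple files, which is needed here because the table is larger than 1 GB):

from google.cloud import bigquery

client = bigquery.Client()

# Fully qualified name of the public source table.
source_table = "bigquery-public-data.new_york_311.311_service_requests"

# The wildcard lets BigQuery shard an export that exceeds the 1 GB per-file limit.
destination_uri = "gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
)

extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export job to finish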

EDIT:

A simple way to merge the output files together is the gsutil compose command. However, if you do this, the header row with the column names will appear multiple times in the resulting file, because it is present in every file extracted from BigQuery.

To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:

bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv
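
To check how many files the export produced, you can list the shards first, e.g.:

gsutil ls gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv

Keep in mind that a single gsutil compose call accepts at most 32 source objects, so if the export produces more shards than that you would have to compose them in batches.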

and then create the composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv

Now the all_data.csv file contains no headers at all. If you still need the column names to appear in the first row, you have to create another CSV file with the column names and compose it together with the data file. You can do this manually by pasting the following (the column names of the 311_service_requests table) into a new file:

unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location

or with the following short Python script (useful for tables with so many columns that doing it by hand is impractical), which queries the column names of the table and writes them into a CSV file:

from google.cloud import bigquery

client = bigquery.Client()

# Query the column names from INFORMATION_SCHEMA, ordered so that they
# match the column order of the exported table data.
query = """
    SELECT column_name
    FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = '311_service_requests'
    ORDER BY ordinal_position
"""
query_job = client.query(query)

# Collect the column names and write them as a single CSV header row.
columns = []
for row in query_job:
    columns.append(row["column_name"])
with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)

Note that for the above script to run you need to have the BigQuery Python Client library installed:

pip install --upgrade google-cloud-bigquery 

Upload the headers.csv file to your bucket:

gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv

And now you are ready to create the final composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv

Or, if you want the headers anyway, you can skip creating the first composite and build the final one from all the sources at once:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
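
To sanity-check the result, one option is to print the first few hundred bytes of the composed object and verify that the header row comes first (gsutil cat with a byte range):

gsutil cat -r 0-500 gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv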
