I have set up an Airflow workflow that ingests files from S3 into Google Cloud Storage and then runs a series of SQL queries to create new tables in BigQuery. At the end of the workflow I need to push the output of one final BigQuery table to Google Cloud Storage and from there to S3.
I have cracked the transfer of the BigQuery table to Google Cloud Storage with no issues using the BigQueryToCloudStorageOperator. However, the transfer from Google Cloud Storage to S3 seems to be a less trodden route, and I have been unable to find a solution I can automate in my Airflow workflow.
I am aware of rsync, which comes as part of gsutil, and have gotten it working (see the post Exporting data from Google Cloud Storage to Amazon S3), but I am unable to add this into my workflow.
I have a dockerised Airflow container running on a Compute Engine instance.
Would really appreciate help solving this problem.
Many thanks!
We are also using rsync to move data between S3 and GCS.
You first need to get a bash script working, something like gsutil -m rsync -d -r gs://bucket/key s3://bucket/key.
For S3 you also need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables.
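A minimal sketch of such a script, assuming gsutil is installed on the Airflow worker; the bucket names and credential values below are hypothetical placeholders:

```shell
#!/usr/bin/env bash
# Placeholder credentials -- in practice inject these from a secrets store
# or Airflow Variables rather than hard-coding them.
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

# -m  run transfers in parallel
# -r  recurse through the source prefix
# -d  delete destination objects missing from the source (mirror semantics)
gsutil -m rsync -d -r gs://example-gcs-bucket/output s3://example-s3-bucket/output
```

Note that -d makes the destination an exact mirror of the source, deleting any extra objects on the S3 side; drop it if you only want to copy new or changed files.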
Then define your BashOperator and put it in your DAG file:
rsync_yesterday = BashOperator(
    task_id='rsync_task_' + table,
    bash_command='your rsync script',
    dag=dag,
)
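Tying the two steps together, one possible shape for the task is to build the command string and the credentials in Python and hand them to the operator (bucket names and credential values here are hypothetical placeholders; the BashOperator's env parameter passes the variables to the shell so gsutil can authenticate against S3):

```python
# Build the rsync command and the environment the task needs.
# Bucket names and credential values are placeholders, not real resources.
gcs_path = "gs://example-gcs-bucket/output"
s3_path = "s3://example-s3-bucket/output"

# -m parallel transfers, -r recursive, -d mirror (delete extra objects)
rsync_command = f"gsutil -m rsync -d -r {gcs_path} {s3_path}"

# gsutil reads these from the environment when the destination is s3://
aws_env = {
    "AWS_ACCESS_KEY_ID": "your-access-key-id",
    "AWS_SECRET_ACCESS_KEY": "your-secret-access-key",
}
```

These would then feed the task definition, e.g. BashOperator(task_id='rsync_gcs_to_s3', bash_command=rsync_command, env=aws_env, dag=dag), keeping the credentials out of the command string itself.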