Read a file from GCS in Apache Beam

rish0097

I need to read a file from a GCS bucket. I know I'll have to use the GCS API/client libraries, but I cannot find any example related to it.

I have been referring to this link in the GCS documentation: GCS Client Libraries, but couldn't really make a dent. If anybody can provide an example, that would really help. Thanks.

jkff

OK. If you want to simply read files from GCS as regular files rather than as a PCollection, and you are having trouble with the GCS Java client libraries, you can also use the Apache Beam FileSystems API:

First, make sure your pom.xml has a Maven dependency on beam-sdks-java-extensions-google-cloud-platform-core, which contains the implementation of the gs:// filesystem:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-extensions-google-cloud-platform-core</artifactId>
  <!-- add a <version> matching your Beam SDK, unless it is managed by a BOM -->
</dependency>

Then set up the FileSystems API (it is set up by default in all pipelines, but if you're using it outside a pipeline, you need to do it manually):

PipelineOptions options = PipelineOptionsFactory.create();
// ...Optionally fill in options such as GCP credentials...
// (see GcpOptions class)
FileSystems.setDefaultPipelineOptions(options);

Then you can use it:

// Closing the stream (via try-with-resources) also closes the channel.
ReadableByteChannel chan = FileSystems.open(FileSystems.matchNewResource(
    "gs://path/to/your/file", false /* isDirectory */));
try (InputStream stream = Channels.newInputStream(chan)) {
  // Use regular Java utilities to work with the input stream.
}
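As one concrete example of "regular Java utilities", here is a minimal sketch of draining such a channel into a String. The class and method names (`GcsReadExample`, `readToString`) are made up for illustration; the helper accepts any ReadableByteChannel, so the same code works whether the channel came from FileSystems.open on a gs:// path or from anywhere else:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public class GcsReadExample {

  // Drains the channel into a UTF-8 String. Closing the stream
  // (via try-with-resources) also closes the underlying channel.
  static String readToString(ReadableByteChannel chan) throws IOException {
    try (InputStream stream = Channels.newInputStream(chan)) {
      return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  // With the Beam FileSystems API set up as above, usage would look like
  // (hypothetical path):
  //
  //   String contents = readToString(FileSystems.open(
  //       FileSystems.matchNewResource("gs://path/to/your/file", false)));
}
```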

