Create an Apache Beam pipeline that reads from Google Pub/Sub

theShadow89

I am trying to create a streaming pipeline with Apache Beam that reads sentences from Google Pub/Sub and writes the individual words to a BigQuery table.

I am using Apache Beam version 0.6.0.

Following the examples, I came up with this:

public class StreamingWordExtract {

    /**
     * A DoFn that tokenizes lines of text into individual words.
     */
    static class ExtractWords extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String[] words = ((String) c.element()).split("[^a-zA-Z']+");
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    /**
     * A DoFn that uppercases a word.
     */
    static class Uppercase extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(c.element().toUpperCase());
        }
    }

    /**
     * A DoFn that converts a word into a BigQuery TableRow.
     */
    static class StringToRowConverter extends DoFn<String, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(new TableRow().set("string_field", c.element()));
        }

        static TableSchema getSchema() {
            return new TableSchema().setFields(new ArrayList<TableFieldSchema>() {
                // Compose the list of TableFieldSchema from tableSchema.
                {
                    add(new TableFieldSchema().setName("string_field").setType("STRING"));
                }
            });
        }
    }

    private interface StreamingWordExtractOptions extends ExampleBigQueryTableOptions, ExamplePubsubTopicOptions {
        @Description("Input file to inject to Pub/Sub topic")
        @Default.String("gs://dataflow-samples/shakespeare/kinglear.txt")
        String getInputFile();

        void setInputFile(String value);
    }

    public static void main(String[] args) {
        StreamingWordExtractOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(StreamingWordExtractOptions.class);

        options.setBigQuerySchema(StringToRowConverter.getSchema());

        Pipeline p = Pipeline.create(options);

        String tableSpec = new StringBuilder()
                .append(options.getProject()).append(":")
                .append(options.getBigQueryDataset()).append(".")
                .append(options.getBigQueryTable())
                .toString();

        p.apply(PubsubIO.read().topic(options.getPubsubTopic()))
                .apply(ParDo.of(new ExtractWords()))
                .apply(ParDo.of(new StringToRowConverter()))
                .apply(BigQueryIO.Write.to(tableSpec)
                        .withSchema(StringToRowConverter.getSchema())
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        PipelineResult result = p.run();
    }
}

I get a compile error near:

apply(ParDo.of(new ExtractWords()))

because the previous apply does not return a String but an Object.

I suppose the problem is the type returned by PubsubIO.read().topic(options.getPubsubTopic()): it is PTransform<PBegin, PCollection<T>> instead of PTransform<PBegin, PCollection<String>>.

What is the correct way to read from Google Pub/Sub using Apache Beam?

Davor Bonaci

You are hitting a recent backwards-incompatible change in Beam -- sorry about that!

Starting with Apache Beam version 0.5.0, PubsubIO.Read and PubsubIO.Write need to be instantiated using PubsubIO.<T>read() and PubsubIO.<T>write() instead of the static factory methods such as PubsubIO.Read.topic(String).

Specifying a coder via .withCoder(Coder) for the output type is required for Read. Specifying a coder for the input type, or specifying a format function via .withAttributes(SimpleFunction<T, PubsubMessage>), is required for Write.
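Applied to the pipeline above, the read would look something like the following. This is a minimal sketch against the 0.6.0 API, assuming the topic carries plain UTF-8 text so that StringUtf8Coder is the appropriate coder:

// Requires: import org.apache.beam.sdk.coders.StringUtf8Coder;
p.apply(PubsubIO.<String>read()
        .topic(options.getPubsubTopic())
        // An explicit coder for the output type is now mandatory on the read side.
        .withCoder(StringUtf8Coder.of()))
        .apply(ParDo.of(new ExtractWords()))
        .apply(ParDo.of(new StringToRowConverter()))
        // ... BigQuery write unchanged

With the explicit type parameter and coder, the transform produces a PCollection<String> rather than a raw PCollection<T>, so the downstream ParDo applications type-check again. The same pattern applies on the write side with PubsubIO.<T>write().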
