AWS Glue ETL: Filter Extract Input Based on Job Parameter

Patrick Bray

I'm new to AWS Glue ETL processing and trying to implement a job to extract data from an RDS MySQL database for a specific customer, perform some transformations, and write the results to S3.

What is the best approach to filter the data input selected from the source table? Can this be done as part of the source extract, or does it need to be a separate Filter transformation based on a specific key?

If implementing this as a Filter transformation, is there a way to make it dynamic based on job input parameters? Ideally this job will be triggered by an event as part of a user-initiated workflow.


Any help would be much appreciated. TIA

Robert Kossendey

> What is the best approach to filter the data input selected from the source table? Can this be done as part of the source extract, or does it need to be a separate Filter transformation based on a specific key?

Glue is basically managed Spark. Spark has an optimisation technique called predicate pushdown that applies to filter operations. It is very likely that Spark will push your filter directly into the read operation by modifying the read statement, so a separate Filter transformation is usually not a performance concern.

You can check whether that is happening in your case by converting the Glue DynamicFrame into a native Spark DataFrame with the .toDF() method and then calling the explain operation on the resulting DataFrame.

> If implementing this as a Filter transformation, is there a way to make it dynamic based on job input parameters? Ideally this job will be triggered by an event as part of a user-initiated workflow.

Yes, you can, but not through the visual UI of Glue Studio; you would need to modify the generated ETL script manually.
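Glue passes job parameters to the script as command-line arguments (e.g. starting the job with {"--CUSTOMER_ID": "42"} makes "--CUSTOMER_ID 42" appear in argv), and scripts normally read them with awsglue.utils.getResolvedOptions. A minimal sketch of that flow, using a hypothetical stand-in parser so it runs outside the Glue runtime; the CUSTOMER_ID parameter name is an assumption for illustration:

```python
import sys


def get_resolved_options(argv, options):
    """Hypothetical stand-in for awsglue.utils.getResolvedOptions:
    resolves each requested option from "--KEY value" pairs in argv."""
    args = {}
    for opt in options:
        idx = argv.index("--" + opt)
        args[opt] = argv[idx + 1]
    return args


# Simulate the arguments Glue would pass when the job is started
# with {"--CUSTOMER_ID": "42"}.
sys.argv.extend(["--CUSTOMER_ID", "42"])
args = get_resolved_options(sys.argv, ["CUSTOMER_ID"])

# The resolved value can then drive the filter in the script, e.g.
# (Glue pseudocode, not executed here):
#   filtered = Filter.apply(
#       frame=datasource,
#       f=lambda row: row["customer_id"] == int(args["CUSTOMER_ID"]))
print(args["CUSTOMER_ID"])  # → 42
```

An event-driven trigger (e.g. a Lambda calling the StartJobRun API) can supply these arguments per run, which is what makes the filter dynamic for a user-initiated workflow.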

