AWS Glue Job fails with connection timeout error

Semant1ka

I am new to AWS Glue. I have created a job that uses two Data Catalog tables and runs a simple SparkSQL query on top of them. The job fails on the Transform step with the following exception:

pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-1.amazonaws.com:443 [blah] failed: connect timed out;'

The VPC security group of the JDBC source (Redshift) has both inbound and outbound rules configured.

I have seen another post on SO about configuring a VPC endpoint for Glue itself, but I don't quite understand what it should look like. Should it be an interface endpoint for glue.us-east-1.amazonaws.com:443, or something else?

UPD: Autogenerated PySpark script:

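import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Note: sparkSqlQuery and SqlQuery0 are used below but are not defined in this
# excerpt; the generated script presumably defines them elsewhere.

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])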

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_conversations", transformation_ctx = "DataSource0")
## @type: DataSource
## @args: [database = "redshift_catalog", redshift_tmp_dir = TempDir, table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1"]
## @return: DataSource1
## @inputs: []
DataSource1 = glueContext.create_dynamic_frame.from_catalog(database = "redshift_catalog", redshift_tmp_dir = args["TempDir"], table_name = "analytics_mongo_raw_messages", transformation_ctx = "DataSource1")
## @type: SqlCode
## @args: [sqlAliases = {"messages": DataSource1, "conversations": DataSource0}, sqlName = SqlQuery0, transformation_ctx = "Transform0"]
## @return: Transform0
## @inputs: [dfc = DataSource1,DataSource0]
Transform0 = sparkSqlQuery(glueContext, query = SqlQuery0, mapping = {"messages": DataSource1, "conversations": DataSource0}, transformation_ctx = "Transform0")
job.commit()

Semant1ka

I was able to resolve this issue. There does indeed have to be a VPC endpoint for Glue. In addition, the Glue connection should use a private subnet with a NAT gateway; my initial subnet didn't have one.
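To check whether a given subnet can reach the Glue API at all, a quick TCP probe along these lines may help. This is only a minimal sketch: the function name can_connect and the 5-second timeout are illustrative assumptions, and it only tests whether a connection opens, not whether Glue API calls succeed. It would need to run from something in the same subnet and security group as the Glue connection (an EC2 instance, for example).

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Example (hypothetical usage): can_connect("glue.us-east-1.amazonaws.com")
```

If this returns False from inside the subnet, the job is likely to hit the same "connect timed out" error, which points at routing (no NAT gateway, no VPC endpoint) rather than at the script itself.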

Example of the VPC endpoint configuration in Terraform:

resource "aws_vpc_endpoint" "glue" {
  vpc_id            = var.vpc_id
  service_name      = var.glue_vpc_service_name # e.g. "com.amazonaws.us-east-1.glue"
  vpc_endpoint_type = "Interface"

  security_group_ids = var.security_group_ids
  subnet_ids         = var.subnet_ids

  tags = { mytag = "mytag" }
}
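
For anyone not using Terraform, roughly the same interface endpoint could be created with boto3. This is a sketch under stated assumptions, not a tested deployment: the function names glue_endpoint_params and create_glue_endpoint are hypothetical, the service name is assumed to follow the com.amazonaws.<region>.glue pattern, and PrivateDnsEnabled is set to True on the assumption that the default host name glue.us-east-1.amazonaws.com should resolve to the endpoint (which also requires DNS support and DNS hostnames to be enabled on the VPC). The Terraform resource above does not set that option.

```python
def glue_endpoint_params(vpc_id, subnet_ids, security_group_ids, region="us-east-1"):
    """Build the keyword arguments for EC2's create_vpc_endpoint call."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumption: Glue's endpoint service name follows this regional pattern.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: private DNS is wanted so the default Glue host name
        # resolves to the endpoint's private addresses.
        "PrivateDnsEnabled": True,
    }


def create_glue_endpoint(vpc_id, subnet_ids, security_group_ids, region="us-east-1"):
    """Create the endpoint; needs boto3 and AWS credentials with EC2 permissions."""
    import boto3  # imported lazily so the parameter builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    params = glue_endpoint_params(vpc_id, subnet_ids, security_group_ids, region)
    return ec2.create_vpc_endpoint(**params)
```

The security group attached to the endpoint has to allow inbound HTTPS (port 443) from the subnet the Glue connection uses, otherwise the job will still time out.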
