Is a crawler required for creating an AWS Glue job?

Cecilia

I'm learning Glue with PySpark by following this page: https://aws-dojo.com/ws8/labs/configure-crawler/.

My question is: are a crawler and a database in Lake Formation required for creating a Glue job?

I have an issue with my AWS role: I'm not authorised to create resources in Lake Formation. So I'm wondering if I can skip those steps and only create a Glue job to test my script.

For example, I only want to test my PySpark script on a single input .txt file stored in S3. Do I still need a crawler? Can I just use boto3 to create a Glue job to test the script, do some preprocessing, and write the data back to S3? Something along these lines is what I have in mind (the job name, role, and bucket are placeholders I made up):
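```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register the job; the script itself lives in S3
glue.create_job(
    Name="my-test-job",  # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueRole",  # placeholder role
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/my_script.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
)

# Kick off a run to test the script
run = glue.start_job_run(JobName="my-test-job")
print(run["JobRunId"])
```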

Balu Vyamajala

No, you don't need to create a crawler to run a Glue job.

A crawler can read multiple data sources and keep the Glue Catalog up to date. For example, when you have partitioned data in S3 and new partitions (folders) are created, you can schedule a crawler to read those new S3 partitions and update the metadata in the Glue Catalog tables.
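For instance, a crawler pointed at an S3 prefix can be created and scheduled with boto3 roughly like this (the names, role, and cron expression are illustrative, not from your setup):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-data-crawler",  # illustrative name
    Role="arn:aws:iam::123456789012:role/MyGlueRole",  # illustrative role
    DatabaseName="sales_db",  # Glue Catalog database to keep updated
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    Schedule="cron(0 1 * * ? *)",  # run daily; picks up new partitions
)
glue.start_crawler(Name="sales-data-crawler")
```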

Once the Glue Catalog is updated with the metadata, you can easily read the actual data behind those catalog tables using Glue ETL, Athena, or other processes.
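In a Glue ETL script that looks roughly like this (the database and table names are illustrative):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the data behind a Glue Catalog table; Glue resolves the
# actual S3 locations and partitions from the catalog metadata
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",  # illustrative catalog database
    table_name="sales",   # illustrative catalog table
)
dyf.printSchema()
```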

In your case, you want to read S3 files directly and write them back to S3 in a Glue job, so you don't need a crawler or the Glue Catalog. A minimal sketch of such a Glue PySpark script, assuming made-up S3 paths and a trivial preprocessing step:
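```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the raw .txt file straight from S3 -- no crawler or catalog involved
df = spark.read.text("s3://my-bucket/input/data.txt")  # placeholder path

# Whatever preprocessing you need, e.g. drop empty lines
processed = df.filter(df["value"] != "")

# Write the result back to S3
processed.write.mode("overwrite").text("s3://my-bucket/output/")  # placeholder path
```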
