AWS Glue

Serverless data integration & ETL.

Overview

AWS Glue crawls data stores to discover schemas, records them as tables in the Data Catalog, and runs Spark-based ETL jobs that transform data held in S3, RDS, or Redshift.

When to use it

  • Building data lakes
  • Schema discovery via crawlers
  • Batch ETL

Setup

  1. Create a database in the Data Catalog, then run a crawler against the S3 location to populate it with tables.
  2. Author the job in Glue Studio or as a PySpark script.
  3. Schedule the job with Glue triggers, or orchestrate it from Step Functions.
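
The setup steps above can also be scripted. The sketch below is one possible arrangement that assumes boto3's Glue client instead of the console. The helper name `setup_glue_pipeline` and every value passed to it (database, crawler, IAM role, S3 path, cron schedule) are placeholders, not values from this page. The client is passed in as an argument, so nothing touches AWS until a real `boto3.client("glue")` is supplied.

```python
def setup_glue_pipeline(glue, database, crawler, role, s3_path, job_name, schedule):
    """Create a catalog database, crawl an S3 path, and schedule an existing job.

    `glue` is expected to behave like boto3.client("glue").
    All names are caller-supplied placeholders.
    """
    # Step 1a: create the Data Catalog database that will hold the crawled tables.
    glue.create_database(DatabaseInput={"Name": database})

    # Step 1b: define a crawler over the S3 path, then start it.
    glue.create_crawler(
        Name=crawler,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler)

    # Step 3: attach a scheduled trigger to the job authored in step 2.
    glue.create_trigger(
        Name=f"{job_name}-schedule",
        Type="SCHEDULED",
        Schedule=schedule,
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
```

A call might look like `setup_glue_pipeline(boto3.client("glue"), "qa_lake", "qa-crawler", "GlueQaRole", "s3://example-bucket/raw/", "qa-etl", "cron(0 2 * * ? *)")`. The job from step 2 must already exist, and the crawler runs asynchronously, so its tables may not be in the catalog yet when the function returns.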

How to use

Run job
aws glue start-job-run --job-name qa-etl
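
The command above returns a run ID as soon as the run is queued, so a QA pipeline usually has to wait for the outcome separately. A minimal sketch of that wait, assuming boto3's `start_job_run` and `get_job_run` calls; the helper name `run_and_wait` and the 30-second poll interval are illustrative choices, not part of Glue.

```python
import time

# Job run states after which Glue does not change the state again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_and_wait(glue, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a terminal state.

    `glue` is expected to behave like boto3.client("glue").
    Returns the final JobRunState string.
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        # Not finished yet: wait before asking again.
        sleep(poll_seconds)
```

With real credentials this would be called as `run_and_wait(boto3.client("glue"), "qa-etl")`; a caller would typically fail the QA stage on anything other than `"SUCCEEDED"`.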

QA use cases

  • Run a nightly job that writes masked datasets to S3 for loading into QA databases.
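
One way to approach the masking itself is to replace sensitive values with a salted hash, so the same input always yields the same token and joins across tables still line up. The sketch below is plain Python that could be wrapped in a Spark UDF inside the Glue job; the function names, the field names in the usage note, the salt, and the 16-character token length are all hypothetical.

```python
import hashlib


def mask_value(value, salt):
    """Return a deterministic 16-hex-character token for `value`."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_record(record, sensitive_fields, salt):
    """Return a copy of `record` with the listed fields masked.

    Fields that are missing or None are left untouched.
    """
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field) is not None:
            masked[field] = mask_value(masked[field], salt)
    return masked
```

For example, `mask_record({"id": 7, "email": "a@example.com"}, ["email"], salt)` keeps `id` and replaces `email` with a token. The salt should live outside the script (a job parameter or a secret store, for instance), since anyone holding it can test guesses against the tokens.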