Infrastructure as Code for AWS Data Stacks with Terraform
Data infrastructure has a reputation for being the part of the platform that nobody wants to touch. A Glue job configuration that was manually patched eighteen months ago, an S3 bucket policy that lives only in one engineer's memory, a Redshift cluster whose parameter group settings nobody documented: these are the technical debt patterns that slow teams down and create security incidents. Infrastructure as Code (IaC) with Terraform addresses them, and this guide shows you how to apply it specifically to AWS data engineering stacks.
Why Data Infrastructure Needs IaC More Than Most
Application teams adopted IaC years ago. Data teams have been slower, partly because their infrastructure is more stateful (databases, S3 prefixes with billions of objects, Glue catalog schemas) and partly because the tooling to manage it properly in Terraform only matured recently.
The cost of neglecting IaC compounds in data platforms:
- Environment drift: Your production Glue job runs on a different worker configuration than dev, so memory errors surface only in production and cannot be reproduced elsewhere.
- Disaster recovery gaps: An accidental S3 bucket deletion or a corrupted Glue Data Catalog takes days to rebuild manually.
- Compliance failures: Lake Formation permissions granted ad-hoc during an incident never get cleaned up, leaving over-privileged access in place.
- Onboarding friction: New data engineers spend their first two weeks trying to reverse-engineer the existing infrastructure.
Terraform addresses each of these by making infrastructure declarative, version-controlled, reviewable in pull requests, and reproducible across environments.
Structuring Your Terraform Data Platform Repository
A flat Terraform configuration works for a handful of resources but breaks down as data stacks grow. Use a module-based structure that mirrors your data platform's logical layers:
infra/
├── modules/
│   ├── data-lake/        # S3 buckets, bucket policies, lifecycle rules
│   ├── glue-catalog/     # Glue databases, tables, crawlers
│   ├── glue-jobs/        # Glue ETL job definitions
│   ├── redshift/         # Cluster, parameter groups, subnet groups
│   ├── lake-formation/   # Data permissions, LF-Tags
│   ├── mwaa/             # Managed Airflow environment
│   └── iam/              # Roles and policies for data services
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── shared/
    └── backend.tf        # Remote state in S3 + DynamoDB locking
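As a rough illustration of how an environment directory might consume these modules, here is a minimal sketch of environments/dev/main.tf. The module input names (org_prefix, environment, kms_key_arn, scripts_bucket) and the scripts_bucket_name output are assumptions made for this example, not a fixed interface; adapt them to whatever your modules actually expose.

```hcl
# environments/dev/main.tf (illustrative sketch; all input and output names are hypothetical)

variable "org_prefix" {
  type        = string
  description = "Short organisation prefix used in resource names"
}

variable "kms_key_arn" {
  type        = string
  description = "KMS key used to encrypt data lake buckets"
}

module "data_lake" {
  source = "../../modules/data-lake"

  org_prefix  = var.org_prefix
  environment = "dev"
  kms_key_arn = var.kms_key_arn
}

module "glue_jobs" {
  source = "../../modules/glue-jobs"

  org_prefix  = var.org_prefix
  environment = "dev"

  # Assumes the data-lake module declares an output named scripts_bucket_name.
  scripts_bucket = module.data_lake.scripts_bucket_name
}
```

Keeping each environment as its own root module means dev, staging, and prod get separate state files while sharing the same module code.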
Store Terraform state in S3 with DynamoDB locking. Never use local state for anything beyond a quick proof of concept:
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "data-platform/prod/terraform.tfstate"
    region         = "ca-central-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ca-central-1:123456789012:key/mrk-abc123"
  }
}
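The state bucket and lock table have to exist before terraform init can use them, so they are usually created once from a small, separate bootstrap configuration. The following is a sketch of that bootstrap under the assumption that it reuses the names from the backend block above; the resource labels are hypothetical.

```hcl
# Bootstrap configuration, applied once with local state before the backend is enabled.
# Bucket and table names mirror the backend block; adjust them to your own.

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-org-terraform-state"
}

# Versioning lets you recover an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"

  # The S3 backend expects a string partition key with exactly this name.
  hash_key = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```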
Managing the S3 Data Lake Layer
S3 buckets for a data lake have more configuration surface area than most teams realise. A complete Terraform module for a data lake tier should cover bucket creation, versioning, server-side encryption, lifecycle rules, intelligent tiering, public access blocking, and replication configuration.
resource "aws_s3_bucket" "raw_zone" {
  bucket = "${var.org_prefix}-data-lake-raw-${var.environment}"
}

resource "aws_s3_bucket_versioning" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = true # Can cut KMS request volume and cost by up to 99%, critical at scale
  }
}
resource "aws_s3_bucket_lifecycle_configuration" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    filter {} # Empty filter: the rule applies to every object in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR" # S3 Glacier Instant Retrieval
    }
  }
}
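Note the bucket_key_enabled = true setting. Without bucket keys, every S3 PUT and GET on an SSE-KMS object results in a request to KMS. At data lake scale, that traffic can exceed your account's KMS request quota for cryptographic operations (the default varies by Region), causing throttling errors in high-throughput Glue jobs. With a bucket key, S3 uses a short-lived bucket-level key for most operations instead of calling KMS each time.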
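Public access blocking is on the list of things a data lake module should cover but is not shown in the example above. A minimal sketch, assuming it sits in the same module as the aws_s3_bucket.raw_zone resource defined earlier:

```hcl
# Blocks every form of public access on the raw zone bucket.
# References aws_s3_bucket.raw_zone from the example above.
resource "aws_s3_bucket_public_access_block" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  block_public_acls       = true # Reject requests that set public ACLs
  block_public_policy     = true # Reject bucket policies that grant public access
  ignore_public_acls      = true # Ignore any public ACLs that already exist
  restrict_public_buckets = true # Limit access to AWS principals and authorised users
}
```

Declaring this per bucket keeps the guarantee visible in code review even if the account-level public access block is also enabled.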
AWS Glue Resources in Terraform
The AWS provider's Glue resources have matured significantly. You can now manage Glue Data Catalog databases, tables, connections, crawlers, and ETL jobs entirely through Terraform.
resource "aws_glue_catalog_database" "processed" {
  name        = "${var.org_prefix}_processed_${var.environment}"
  description = "Processed/curated data layer for analytics consumption"

  create_table_default_permission {
    permissions = ["SELECT"]

    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}
resource "aws_glue_job" "customer_transform" {
  name         = "${var.org_prefix}-customer-transform-${var.environment}"
  role_arn     = aws_iam_role.glue_execution.arn
  glue_version = "4.0"

  command {
    script_location = "s3://${aws_s3_bucket.scripts.bucket}/jobs/customer_transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language"                     = "python"
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
    "--enable-auto-scaling"              = "true"
    "--TempDir"                          = "s3://${aws_s3_bucket.temp.bucket}/glue-temp/"
    "--conf"                             = "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }

  execution_property {
    max_concurrent_runs = 3
  }

  worker_type       = var.environment == "prod" ? "G.2X" : "G.1X"
  number_of_workers = var.environment == "prod" ? 10 : 2
}
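This pattern uses a conditional expression to size workers differently in production and non-production environments. G.1X and G.2X workers map to 1 and 2 DPUs respectively, so a non-production run is capped at 2 DPUs instead of 20, a simple but effective cost control. Because auto scaling is enabled, number_of_workers acts as the maximum rather than a fixed count.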
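Once more than two environments or more than two settings are involved, inline conditionals get hard to read. One alternative is a lookup table in locals, sketched here. The glue_sizing and sizing names are hypothetical, the numbers simply mirror the job above, and the variable block is only needed if environment is not already declared in your module.

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment name"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment value must be one of dev, staging, or prod."
  }
}

locals {
  # Hypothetical sizing table: one entry per environment that needs special treatment.
  glue_sizing = {
    prod    = { worker_type = "G.2X", number_of_workers = 10 }
    default = { worker_type = "G.1X", number_of_workers = 2 }
  }

  # Fall back to the default entry for any environment without its own row.
  sizing = lookup(local.glue_sizing, var.environment, local.glue_sizing["default"])
}

output "glue_worker_type" {
  value = local.sizing.worker_type
}

output "glue_number_of_workers" {
  value = local.sizing.number_of_workers
}
```

In the job resource you would then write worker_type = local.sizing.worker_type and number_of_workers = local.sizing.number_of_workers, and adding a staging size becomes a one-line change to the table.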
Lake Formation Permissions as Code
AWS Lake Formation is powerful but notoriously difficult to manage manually. Terraform brings discipline to LF permissions through aws_lakeformation_permissions and aws_lakeformation_data_lake_settings resources.
resource "aws_lakeformation_permissions" "analyst_select" {
  principal   = aws_iam_role.data_analyst.arn
  permissions = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.processed.name
    wildcard      = true # All current and future tables in this database
  }
}

resource "aws_lakeformation_permissions" "etl_role_write" {
  principal   = aws_iam_role.glue_execution.arn
  permissions = ["ALL"]

  database {
    name = aws_glue_catalog_database.processed.name
  }
}
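Managing Lake Formation permissions in Terraform has one important gotcha: if your account was using S3 bucket policies and IAM for data access before Lake Formation was activated, existing databases and tables may still carry the Super permission for the IAM_ALLOWED_PRINCIPALS group. While that grant is in place, access is governed by IAM alone and your Lake Formation grants do not restrict anything. The aws_lakeformation_permissions resource describes grants and requires a non-empty permissions list, so it cannot express this revocation with permissions = []. Revoke the IAM_ALLOWED_PRINCIPALS grant on existing resources directly, for example in the Lake Formation console or with the aws lakeformation revoke-permissions CLI command, and keep the default permissions for new databases and tables empty in aws_lakeformation_data_lake_settings.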
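Defaults for newly created databases and tables are controlled account-wide through aws_lakeformation_data_lake_settings. The sketch below assumes a hypothetical lake_admin_role_arn variable. It also relies on my understanding that omitting the create_database_default_permissions and create_table_default_permissions blocks leaves those defaults empty, so verify the plan output against your provider version before applying.

```hcl
variable "lake_admin_role_arn" {
  type        = string
  description = "ARN of the IAM role that administers Lake Formation (hypothetical input)"
}

resource "aws_lakeformation_data_lake_settings" "this" {
  admins = [var.lake_admin_role_arn]

  # No create_database_default_permissions or create_table_default_permissions
  # blocks are declared. The intent is that new databases and tables do not
  # receive the IAM_ALLOWED_PRINCIPALS grant automatically, so access has to be
  # granted explicitly through aws_lakeformation_permissions resources.
}
```

This resource manages settings for the whole account in a Region, so it is best declared in exactly one place rather than once per module.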
CI/CD Pipeline for Data Infrastructure
A Terraform workflow for a data platform should include automated plan previews on pull requests and gated applies to production. A minimal GitHub Actions workflow:
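name: Terraform Data Platform

on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
      # AWS credentials must be configured before init; that step is omitted here.
      - name: Terraform Init
        run: terraform init
        working-directory: infra/environments/prod
      - name: Terraform Plan
        run: terraform plan -out=tfplan
        working-directory: infra/environments/prod
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: infra/environments/prod/tfplan

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    environment: production # Requires manual approval in GitHub
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a clean runner, so repeat the setup and fetch the saved plan.
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
      - name: Terraform Init
        run: terraform init
        working-directory: infra/environments/prod
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: infra/environments/prod
      - name: Terraform Apply
        run: terraform apply tfplan
        working-directory: infra/environments/prod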
The environment: production setting ties the apply job to a GitHub environment. Once required reviewers are configured on that environment, no infrastructure change reaches production without a second set of eyes on the plan output.
Handling Stateful Resources and Drift
Data platform infrastructure has unavoidable statefulness: you cannot destroy and recreate a Redshift cluster without losing data, or a Glue Data Catalog table without losing its schema and partition metadata. Use Terraform's lifecycle rules carefully:
resource "aws_redshift_cluster" "analytics" {
  cluster_identifier = "${var.org_prefix}-analytics-${var.environment}"
  # ... configuration ...

  lifecycle {
    prevent_destroy = true # Requires explicit removal before terraform destroy works
    ignore_changes = [
      master_password, # Managed by Secrets Manager rotation
      number_of_nodes, # Allow manual scaling without triggering drift alerts
    ]
  }
}
For Glue Data Catalog tables, it is often better to manage the database and the crawler in Terraform but let the crawler manage individual table definitions. A crawler that was originally created by hand can be brought under management with terraform import once it has run successfully.
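A sketch of that split, with the crawler declared in Terraform and the tables left to the crawler. It reuses the database and execution role from the earlier examples; the processed_data_path variable is a hypothetical input standing in for wherever your processed data lives.

```hcl
variable "processed_data_path" {
  type        = string
  description = "S3 URI the crawler should scan, e.g. s3://bucket/prefix/ (hypothetical input)"
}

resource "aws_glue_crawler" "processed" {
  name          = "${var.org_prefix}-processed-crawler-${var.environment}"
  role          = aws_iam_role.glue_execution.arn
  database_name = aws_glue_catalog_database.processed.name

  s3_target {
    path = var.processed_data_path
  }

  # Let the crawler update table schemas in place, but only log deletions
  # so that a missing prefix never drops a catalog table automatically.
  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"
    delete_behavior = "LOG"
  }
}
```

Because no aws_glue_catalog_table resources are declared, Terraform never sees schema changes made by the crawler as drift.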
If you are pairing this IaC approach with an orchestration layer, see our guide on Apache Airflow on AWS with MWAA for managing your MWAA environment with Terraform alongside your Glue and S3 resources.
Common Pitfalls and How to Avoid Them
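Circular IAM dependencies: Glue jobs need IAM roles, role policies need S3 bucket ARNs, and S3 bucket policies reference role ARNs. Adding depends_on cannot break a cycle, because it only adds edges to the dependency graph. Instead, keep policies in their own resources (aws_iam_role_policy, aws_s3_bucket_policy) so that roles and buckets are created first and the policies that reference both are attached afterwards, or restructure modules along the same lines.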
Terraform state lock contention: terraform plan and terraform apply both acquire the state lock, so multiple engineers running them simultaneously against the same state will hit lock acquisition errors. Enforce that all Terraform operations run through CI/CD rather than from local machines.
Glue version pinning: Always pin glue_version in aws_glue_job resources. AWS has ended support for older Glue versions, including 2.0, and the version a job gets when none is specified is decided by the service rather than by your configuration. Unpinned versions cause unexpected runtime changes.
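S3 event notification ordering: If you have S3 event notifications triggering Lambda functions that feed Glue workflows, the aws_lambda_permission that allows S3 to invoke the function must exist before the aws_s3_bucket_notification is created. The order of blocks in a file does not matter to Terraform, and the notification references only the function ARN, so Terraform cannot infer this dependency. Declare it explicitly with depends_on.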
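The notification ordering issue can be sketched as follows. The aws_lambda_function.ingest_trigger resource and the incoming/ prefix are hypothetical, and aws_s3_bucket.raw_zone refers to the raw zone bucket defined earlier in this guide.

```hcl
# Allows the raw zone bucket to invoke the (hypothetical) ingest_trigger function.
resource "aws_lambda_permission" "allow_raw_zone" {
  statement_id  = "AllowExecutionFromRawZoneBucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ingest_trigger.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_zone.arn
}

resource "aws_s3_bucket_notification" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.ingest_trigger.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "incoming/" # Hypothetical landing prefix
  }

  # Nothing above references the permission, so the ordering has to be stated
  # explicitly: S3 validates that it may invoke the function when the
  # notification is created.
  depends_on = [aws_lambda_permission.allow_raw_zone]
}
```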
For a complete view of how Terraform fits into a modern data stack alongside dbt, Airflow, and Iceberg, see our Modern Data Stack Explained overview.
Conclusion
Terraform for AWS data infrastructure is no longer optional for teams that want reliable, auditable, and reproducible data platforms. The patterns in this guide (module-based repository structure, environment-specific configurations, Lake Formation as code, and CI/CD gating) apply whether you are managing a small analytics platform or a multi-account data mesh.
The investment in proper IaC pays dividends at disaster recovery time, during compliance audits, and every time a new engineer joins your team. Infra IT Consulting specialises in designing and implementing IaC-first data platforms on AWS. Contact us to assess your current infrastructure posture and build a roadmap to full automation.