Infrastructure as Code for AWS Data Stacks with Terraform

By Infra IT Consulting · May 27, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

Data infrastructure has a reputation for being the part of the platform that nobody wants to touch. A Glue job configuration that was manually patched eighteen months ago, an S3 bucket policy that lives only in one engineer’s memory, a Redshift cluster whose parameter group settings nobody documented — these are the technical debt patterns that slow teams down and create security incidents. Infrastructure as Code with Terraform is the solution, and this guide shows you how to apply it specifically to AWS data engineering stacks.

Why Data Infrastructure Needs IaC More Than Most

Application teams adopted IaC years ago. Data teams have been slower, partly because their infrastructure is more stateful (databases, S3 prefixes with billions of objects, Glue catalog schemas) and partly because the tooling to manage it properly in Terraform only matured recently.

The cost of neglecting IaC compounds in data platforms:

Environment drift: Your production Glue job runs on a different worker configuration than dev, producing different memory errors that only appear in production.
Disaster recovery gaps: An accidental S3 bucket deletion or a corrupted Glue Data Catalog takes days to rebuild manually.
Compliance failures: Lake Formation permissions granted ad-hoc during an incident never get cleaned up, leaving over-privileged access in place.
Onboarding friction: New data engineers spend their first two weeks trying to reverse-engineer the existing infrastructure.

Terraform addresses each of these by making infrastructure declarative, version-controlled, reviewable in pull requests, and reproducible across environments.

Structuring Your Terraform Data Platform Repository

A flat Terraform configuration works for a handful of resources but breaks down as data stacks grow. Use a module-based structure that mirrors your data platform’s logical layers:

infra/
├── modules/
│   ├── data-lake/          # S3 buckets, bucket policies, lifecycle rules
│   ├── glue-catalog/       # Glue databases, tables, crawlers
│   ├── glue-jobs/          # Glue ETL job definitions
│   ├── redshift/           # Cluster, parameter groups, subnet groups
│   ├── lake-formation/     # Data permissions, LF-Tags
│   ├── mwaa/               # Managed Airflow environment
│   └── iam/                # Roles and policies for data services
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── shared/
    └── backend.tf          # Remote state in S3 + DynamoDB locking
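
With this layout, each environment's main.tf becomes a thin composition of the shared modules. The following is a minimal sketch of what environments/dev/main.tf might look like; the module input names and the scripts_bucket_name output are assumptions made for illustration, not something the modules above are known to expose.

```hcl
# environments/dev/main.tf (illustrative sketch; input and output names are hypothetical)

module "data_lake" {
  source = "../../modules/data-lake"

  org_prefix  = var.org_prefix
  environment = "dev"
  kms_key_arn = var.kms_key_arn
}

module "glue_jobs" {
  source = "../../modules/glue-jobs"

  org_prefix  = var.org_prefix
  environment = "dev"

  # Assumes the data-lake module declares an output with this name.
  scripts_bucket = module.data_lake.scripts_bucket_name
}
```

Keeping the environment directories this small means a difference between dev and prod shows up as a changed variable value rather than a diverging copy of resource definitions.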

Store Terraform state in S3 with DynamoDB locking — never use local state for anything beyond a quick proof of concept:

terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "data-platform/prod/terraform.tfstate"
    region         = "ca-central-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ca-central-1:123456789012:key/mrk-abc123"
  }
}
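
A backend block cannot reference variables, so each environment needs its own state key. One common way to avoid copying the whole block is partial configuration: leave the environment-specific settings out of the block and pass them at init time. The sketch below assumes a hypothetical backend.hcl file kept next to each environment's main.tf.

```hcl
# environments/prod/backend.tf
# Settings shared by every environment; the state key is supplied at init time.
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    region         = "ca-central-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# environments/prod/backend.hcl (hypothetical file name) would contain:
#   key = "data-platform/prod/terraform.tfstate"
#
# and be passed with:
#   terraform init -backend-config=backend.hcl
```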

Managing the S3 Data Lake Layer

S3 buckets for a data lake have more configuration surface area than most teams realise. A complete Terraform module for a data lake tier should cover bucket creation, versioning, server-side encryption, lifecycle rules, intelligent tiering, public access blocking, and replication configuration.

resource "aws_s3_bucket" "raw_zone" {
  bucket = "${var.org_prefix}-data-lake-raw-${var.environment}"
}

resource "aws_s3_bucket_versioning" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = true  # Can cut KMS request costs by up to 99%, critical at scale
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER_IR"  # Glacier Instant Retrieval
    }
  }
}

Note the bucket_key_enabled = true setting. Without an S3 Bucket Key, each PUT and GET on an SSE-KMS object results in a request to KMS. At data lake scale that traffic can exceed the KMS request quota for cryptographic operations, which is shared across the account in each Region and differs from Region to Region (check Service Quotas for the value that applies to you), causing throttling errors in high-throughput Glue jobs.
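
The module checklist above also names public access blocking, which the snippet does not show. A minimal sketch of that piece, reusing the raw_zone bucket from the example:

```hcl
# Blocks every form of public access to the raw zone bucket.
resource "aws_s3_bucket_public_access_block" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  block_public_acls       = true  # Reject requests that set public ACLs
  block_public_policy     = true  # Reject bucket policies that grant public access
  ignore_public_acls      = true  # Ignore any public ACLs already present
  restrict_public_buckets = true  # Limit access to AWS principals and authorised users
}
```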

AWS Glue Resources in Terraform

The AWS provider’s Glue resources have matured significantly. You can now manage Glue Data Catalog databases, tables, connections, crawlers, and ETL jobs entirely through Terraform.

resource "aws_glue_catalog_database" "processed" {
  name        = "${var.org_prefix}_processed_${var.environment}"
  description = "Processed/curated data layer for analytics consumption"

  create_table_default_permission {
    permissions = ["SELECT"]
    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}

resource "aws_glue_job" "customer_transform" {
  name         = "${var.org_prefix}-customer-transform-${var.environment}"
  role_arn     = aws_iam_role.glue_execution.arn
  glue_version = "4.0"

  command {
    script_location = "s3://${aws_s3_bucket.scripts.bucket}/jobs/customer_transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language"                     = "python"
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
    "--enable-auto-scaling"              = "true"
    "--TempDir"                          = "s3://${aws_s3_bucket.temp.bucket}/glue-temp/"
    "--conf"                             = "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }

  execution_property {
    max_concurrent_runs = 3
  }

  worker_type       = var.environment == "prod" ? "G.2X" : "G.1X"
  number_of_workers = var.environment == "prod" ? 10 : 2
}

This pattern uses a conditional expression to size workers differently in production and in other environments, a simple cost control for non-production stacks. Because --enable-auto-scaling is set, number_of_workers acts as the maximum the job can scale to rather than a fixed count.
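
Once a third environment needs its own sizing, chained conditionals become hard to read. One alternative is a lookup map keyed by environment, sketched below. The local names are hypothetical, and the staging values simply mirror what the conditional above would produce for any non-production environment.

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

locals {
  # Hypothetical sizing table; adjust per workload.
  glue_sizing = {
    dev     = { worker_type = "G.1X", number_of_workers = 2 }
    staging = { worker_type = "G.1X", number_of_workers = 2 }
    prod    = { worker_type = "G.2X", number_of_workers = 10 }
  }

  sizing = local.glue_sizing[var.environment]
}

# Inside aws_glue_job.customer_transform the two conditionals would become:
#   worker_type       = local.sizing.worker_type
#   number_of_workers = local.sizing.number_of_workers
```

The validation block makes a typo in the environment name fail at plan time instead of surfacing as a missing map key.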

Lake Formation Permissions as Code

AWS Lake Formation is powerful but notoriously difficult to manage manually. Terraform brings discipline to LF permissions through aws_lakeformation_permissions and aws_lakeformation_data_lake_settings resources.

resource "aws_lakeformation_permissions" "analyst_select" {
  principal   = aws_iam_role.data_analyst.arn
  permissions = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.processed.name
    wildcard      = true  # All current and future tables in this database
  }
}

resource "aws_lakeformation_permissions" "etl_role_write" {
  principal   = aws_iam_role.glue_execution.arn
  permissions = ["ALL"]

  database {
    name = aws_glue_catalog_database.processed.name
  }
}

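Managing Lake Formation permissions in Terraform has one important gotcha: if your account was using S3 bucket policies and IAM for data access before Lake Formation was activated, existing databases and tables probably still carry the Super (ALL) permission for the “IAM_ALLOWED_PRINCIPALS” group. While that grant is in place, any principal with IAM access to the catalog and the underlying S3 data keeps that access, so your Lake Formation grants restrict nothing. Setting permissions = [] on aws_lakeformation_permissions is not a valid way to revoke it. Revoke the grant from the Lake Formation console or with aws lakeformation revoke-permissions, and use aws_lakeformation_data_lake_settings to stop new databases and tables from receiving the same default.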
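
For newly created databases and tables, the account-level defaults live in aws_lakeformation_data_lake_settings. The sketch below assumes the identity running Terraform should be a data lake administrator, and relies on the provider behaviour, as I understand it, that omitting the default-permission blocks leaves those defaults empty. Verify this against the provider version you use before relying on it.

```hcl
data "aws_caller_identity" "current" {}

# Resolves an assumed-role session ARN back to the underlying IAM role ARN.
data "aws_iam_session_context" "current" {
  arn = data.aws_caller_identity.current.arn
}

resource "aws_lakeformation_data_lake_settings" "this" {
  # Assumption: the principal running Terraform administers the data lake.
  admins = [data.aws_iam_session_context.current.issuer_arn]

  # No create_database_default_permissions or create_table_default_permissions
  # blocks are declared here, so new databases and tables should not pick up
  # the IAM_ALLOWED_PRINCIPALS default.
}
```

This only affects resources created afterwards. Grants already attached to existing databases and tables still have to be revoked separately.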

CI/CD Pipeline for Data Infrastructure

A Terraform workflow for a data platform should include automated plan previews on pull requests and gated applies to production. A minimal GitHub Actions workflow:

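name: Terraform Data Platform

on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
      # AWS credentials must be available to both jobs, for example through
      # aws-actions/configure-aws-credentials; that step is omitted here.
      - name: Terraform Init
        run: terraform init
        working-directory: infra/environments/prod
      - name: Terraform Plan
        run: terraform plan -out=tfplan
        working-directory: infra/environments/prod
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: infra/environments/prod/tfplan

  apply:
    needs: plan
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production  # Requires manual approval in GitHub
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so repeat checkout, setup, and init.
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
      - name: Terraform Init
        run: terraform init
        working-directory: infra/environments/prod
      # Fetch the exact plan produced by the plan job.
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: infra/environments/prod
      - name: Terraform Apply
        run: terraform apply tfplan
        working-directory: infra/environments/prod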

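The environment: production setting ties the apply job to a GitHub environment. Once required reviewers are configured on that environment in the repository settings, the job pauses until someone approves it, so no infrastructure change reaches production without a second set of eyes on the plan output.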

Handling Stateful Resources and Drift

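Data platform infrastructure has unavoidable statefulness: you cannot destroy and recreate a Redshift cluster without losing its data unless you restore from a snapshot, and recreating a Glue Data Catalog table discards its metadata, such as partitions and table versions. Use Terraform’s lifecycle rules carefully: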

resource "aws_redshift_cluster" "analytics" {
  cluster_identifier = "${var.org_prefix}-analytics-${var.environment}"
  # ... configuration ...

  lifecycle {
    prevent_destroy = true  # Requires explicit removal before terraform destroy works
    ignore_changes  = [
      master_password,  # Managed by Secrets Manager rotation
      number_of_nodes,  # Allow manual scaling without triggering drift alerts
    ]
  }
}

For Glue Data Catalog tables, it is often better to manage the database and the crawler in Terraform but let the crawler own the individual table definitions, keeping those tables out of Terraform state. If a crawler was created by hand, bring it under management with terraform import rather than recreating it.
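
A minimal sketch of that split follows. Terraform owns the crawler, and the tables it discovers never enter state. The processed_zone bucket is a hypothetical name, and reusing the Glue execution role for the crawler is an assumption made to keep the example short.

```hcl
resource "aws_glue_crawler" "processed" {
  name          = "${var.org_prefix}-processed-${var.environment}"
  database_name = aws_glue_catalog_database.processed.name
  role          = aws_iam_role.glue_execution.arn  # Assumption: shared with the ETL jobs

  s3_target {
    # Hypothetical bucket; point this at your processed zone.
    path = "s3://${aws_s3_bucket.processed_zone.bucket}/"
  }

  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"  # Let the crawler evolve table schemas
    delete_behavior = "LOG"                 # Never drop tables automatically
  }
}
```

Setting delete_behavior to LOG is the conservative choice here: a missing S3 prefix is recorded rather than turned into a dropped table.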

If you are pairing this IaC approach with an orchestration layer, see our guide on Apache Airflow on AWS with MWAA for managing your MWAA environment with Terraform alongside your Glue and S3 resources.

Common Pitfalls and How to Avoid Them

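Circular IAM dependencies: Glue jobs need IAM roles, role policies reference S3 bucket ARNs, and S3 bucket policies reference role ARNs. depends_on cannot fix this, because it only adds edges to the dependency graph. Break the cycle by keeping policies out of the role and bucket resources: attach them as separate aws_iam_role_policy and aws_s3_bucket_policy resources, or restructure modules so that the role and the bucket are created before anything that references both.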
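
As a sketch of the separate-resource approach: the role and the bucket carry no references to each other, and the two policy resources depend on both. The statements and policy names are illustrative only, and the role shown stands in for the glue_execution role used elsewhere in this guide.

```hcl
# The role has no reference to the bucket.
resource "aws_iam_role" "glue_execution" {
  name = "${var.org_prefix}-glue-execution-${var.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "glue.amazonaws.com" }
    }]
  })
}

# Attached separately, so the role itself does not depend on the bucket.
resource "aws_iam_role_policy" "glue_raw_zone_access" {
  name = "raw-zone-access"  # Hypothetical policy name
  role = aws_iam_role.glue_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:PutObject"]
      Resource = "${aws_s3_bucket.raw_zone.arn}/*"
    }]
  })
}

# Also a separate resource, so the bucket does not depend on the role.
resource "aws_s3_bucket_policy" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = aws_iam_role.glue_execution.arn }
      Action    = ["s3:GetObject", "s3:PutObject"]
      Resource  = "${aws_s3_bucket.raw_zone.arn}/*"
    }]
  })
}
```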

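Terraform state lock contention: Both terraform plan and terraform apply take the state lock, so engineers running them at the same time from local machines will hit lock errors, and an interrupted run can leave a stale lock behind in DynamoDB. Enforce that all Terraform operations run through CI/CD rather than from local machines.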

Glue version pinning: Always pin glue_version in aws_glue_job resources. AWS retires older Glue versions over time, and a job that omits glue_version runs on whatever default the service applies, which may not be the version you tested against.

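S3 event notification ordering: If you have S3 event notifications triggering Lambda functions that feed Glue workflows, the aws_lambda_permission that lets S3 invoke the function must exist before the aws_s3_bucket_notification is created. Terraform infers the dependency on the function from the ARN reference but not the dependency on the permission, so add depends_on to the notification resource.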
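
A minimal sketch of the notification ordering, assuming a Lambda function resource named trigger_workflow is defined elsewhere (the name is hypothetical):

```hcl
# Lets S3 invoke the function; nothing in the notification references this resource.
resource "aws_lambda_permission" "allow_raw_zone" {
  statement_id  = "AllowExecutionFromS3"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.trigger_workflow.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_zone.arn
}

resource "aws_s3_bucket_notification" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.trigger_workflow.arn
    events              = ["s3:ObjectCreated:*"]
  }

  # Explicit ordering: the permission must exist before S3 validates the target.
  depends_on = [aws_lambda_permission.allow_raw_zone]
}
```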

For a complete view of how Terraform fits into a modern data stack alongside dbt, Airflow, and Iceberg, see our Modern Data Stack Explained overview.

Conclusion

Terraform for AWS data infrastructure is no longer optional for teams that want reliable, auditable, and reproducible data platforms. The patterns in this guide — module-based repository structure, environment-specific configurations, Lake Formation as code, and CI/CD gating — apply whether you are managing a small analytics platform or a multi-account data mesh.

The investment in proper IaC pays dividends at disaster recovery time, during compliance audits, and every time a new engineer joins your team. Infra IT Consulting specialises in designing and implementing IaC-first data platforms on AWS. Contact us to assess your current infrastructure posture and build a roadmap to full automation.
