Docker for Data Engineers: Containerising ETL Jobs on AWS

By Infra IT Consulting · March 8, 2024 · 9 min read

Docker has become an essential tool for data engineers working on AWS. Containerising ETL jobs solves the classic “works on my machine” problem, makes dependency management reproducible, and enables deployment patterns — ECS Fargate, AWS Batch, Lambda container images — that simply are not available with zip-packaged code. This tutorial walks through a complete containerisation workflow: writing a Dockerfile for a Python ETL job, running it locally with docker-compose, pushing it to Amazon ECR, and defining an ECS Fargate task.

Writing a Production-Ready Dockerfile

A good Dockerfile for a data engineering ETL job prioritises small image size, explicit dependency pinning, a non-root user for security, and a clean separation between the base environment and application code.

Here is a Dockerfile for a Python ETL job that reads from S3, transforms data with pandas and pyarrow, and writes Parquet back to S3:

# Dockerfile
# Stage 1: Build stage — install dependencies with full build tools
FROM python:3.11-slim AS builder

WORKDIR /build

# Install build dependencies for packages that have C extensions
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt


# Stage 2: Runtime stage — minimal image with only what's needed
FROM python:3.11-slim AS runtime

# Security: create non-root user
RUN groupadd --gid 1001 etlgroup && \
    useradd --uid 1001 --gid etlgroup --shell /bin/bash --create-home etluser

WORKDIR /app

# Copy installed packages from builder and hand ownership to the runtime user
COPY --from=builder --chown=etluser:etlgroup /root/.local /home/etluser/.local

# Copy application code
COPY --chown=etluser:etlgroup src/ ./src/

# Switch to non-root user
USER etluser

# Ensure user-installed packages are on PATH
ENV PATH="/home/etluser/.local/bin:${PATH}"
ENV PYTHONPATH="/app"
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

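# Health check: verifies the Python environment is intact.
# Note: ECS does not act on a HEALTHCHECK baked into the image; it only
# monitors a healthCheck defined in the task definition.
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import boto3, pandas, pyarrow; print('OK')" || exit 1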

ENTRYPOINT ["python", "-m", "src.etl_job"]

The multi-stage build is important: the builder stage has gcc and build tools for compiling C extensions; the runtime stage starts fresh from python:3.11-slim and only copies the compiled packages. This keeps the final image lean — typically 300-400 MB for a pandas/pyarrow workload versus 800+ MB for a single-stage build.

Your requirements.txt should pin exact versions:

boto3==1.34.34
botocore==1.34.34
pandas==2.1.4
pyarrow==14.0.2
psycopg2-binary==2.9.9
pydantic==2.5.3
structlog==24.1.0
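
Pinning is easy to let slip as a project grows. The following is a minimal sketch of a check that could run in CI, assuming a plain requirements.txt without environment markers or URLs; the function name unpinned_requirements is hypothetical and not part of any library.

```python
import re

# Matches "name==version" with an optional extras block, e.g. "pkg[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9_,.\s-]+\])?==[A-Za-z0-9.!+_-]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    sample = "boto3==1.34.34\npandas>=2.1\npyarrow\n"
    print(unpinned_requirements(sample))  # ['pandas>=2.1', 'pyarrow']
```

An empty result means every dependency line carries an exact version. Lines with environment markers would be reported as unpinned by this simple pattern, so treat it as a starting point.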

The ETL Job Source Code

# src/etl_job.py
import os
import sys
import logging
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO
from datetime import datetime, timezone

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    stream=sys.stdout,
)
logger = logging.getLogger("etl_job")


def get_config() -> dict:
    """Load configuration from environment variables."""
    required = ["SOURCE_BUCKET", "SOURCE_PREFIX", "DEST_BUCKET", "DEST_PREFIX"]
    missing = [k for k in required if not os.environ.get(k)]
    if missing:
        raise EnvironmentError(f"Missing required environment variables: {missing}")

    return {
        "source_bucket": os.environ["SOURCE_BUCKET"],
        "source_prefix": os.environ["SOURCE_PREFIX"],
        "dest_bucket": os.environ["DEST_BUCKET"],
        "dest_prefix": os.environ["DEST_PREFIX"],
        "aws_region": os.environ.get("AWS_DEFAULT_REGION", "ca-central-1"),
    }


def list_source_files(s3_client, bucket: str, prefix: str) -> list[str]:
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(
            obj["Key"] for obj in page.get("Contents", [])
            if obj["Key"].endswith(".csv")
        )
    return keys


def read_csv_from_s3(s3_client, bucket: str, key: str) -> pd.DataFrame:
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(BytesIO(obj["Body"].read()))


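def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply business logic transformations."""
    # rename() returns a new frame, so the caller's DataFrame is not mutated
    df = df.rename(columns=lambda col: col.strip().lower().replace(" ", "_"))
    # Copy after filtering to avoid a possible SettingWithCopyWarning below
    df = df.dropna(subset=["order_id", "customer_id"]).copy()
    # Non-numeric amounts become NaN and are then replaced with 0.0
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0.0)
    df["processed_at"] = datetime.now(timezone.utc).isoformat()
    return df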


def write_parquet_to_s3(s3_client, df: pd.DataFrame, bucket: str, key: str) -> None:
    buffer = BytesIO()
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, buffer, compression="snappy")
    buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue(),
        ServerSideEncryption="AES256",
    )


def main():
    config = get_config()
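    # Put the config in the message itself; fields passed via extra= are not
    # rendered by the format string configured above
    logger.info("Starting ETL job with config: %s", config)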

    s3 = boto3.client("s3", region_name=config["aws_region"])
    source_keys = list_source_files(s3, config["source_bucket"], config["source_prefix"])

    if not source_keys:
        logger.warning("No source files found. Exiting.")
        return

    logger.info(f"Found {len(source_keys)} source files")
    frames = []
    for key in source_keys:
        df = read_csv_from_s3(s3, config["source_bucket"], key)
        frames.append(df)
        logger.info(f"Read {len(df)} rows from s3://{config['source_bucket']}/{key}")

    combined = pd.concat(frames, ignore_index=True)
    transformed = transform(combined)

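    # Use UTC so file names agree with processed_at regardless of host timezone
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    output_key = f"{config['dest_prefix'].rstrip('/')}/data_{timestamp}.parquet"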
    write_parquet_to_s3(s3, transformed, config["dest_bucket"], output_key)
    logger.info(f"Wrote {len(transformed)} rows to s3://{config['dest_bucket']}/{output_key}")


if __name__ == "__main__":
    main()

Local Development with docker-compose

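Docker Compose lets you run the ETL job locally against a real AWS environment, using your local credentials. The file below also defines an optional localstack service for offline S3 testing; note that the job only talks to localstack if its S3 client is given the localstack endpoint, which the job code shown earlier does not do by default.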

# docker-compose.yml
version: "3.9"

services:
  etl-job:
    build:
      context: .
      target: runtime
    image: etl-job:local
    environment:
      # AWS credentials — sourced from local environment, never hardcoded
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      AWS_DEFAULT_REGION: ca-central-1
      # Job configuration
      SOURCE_BUCKET: my-data-lake-dev
      SOURCE_PREFIX: raw/orders/2024/01/
      DEST_BUCKET: my-data-lake-dev
      DEST_PREFIX: processed/orders/2024/01
    # Optional: mount local source code for development iteration
    volumes:
      - ./src:/app/src:ro

  # Optional: localstack for offline S3 testing
  localstack:
    image: localstack/localstack:3.0
    ports:
      - "4566:4566"
    environment:
      SERVICES: s3
      AWS_DEFAULT_REGION: ca-central-1
    volumes:
      - localstack-data:/var/lib/localstack

volumes:
  localstack-data:

Run the job locally:

# Build the image
docker build -t etl-job:local .

# Run against real AWS (credentials from environment)
docker compose run etl-job

# Or run directly with docker
docker run --rm \
  -e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  -e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  -e AWS_DEFAULT_REGION="ca-central-1" \
  -e SOURCE_BUCKET="my-data-lake-dev" \
  -e SOURCE_PREFIX="raw/orders/2024/01/" \
  -e DEST_BUCKET="my-data-lake-dev" \
  -e DEST_PREFIX="processed/orders/2024/01" \
  etl-job:local

Pushing to Amazon ECR

Amazon Elastic Container Registry (ECR) is the natural home for Docker images in an AWS workflow. It integrates with ECS, Lambda, and Batch through IAM, so there are no Docker Hub credentials to manage.

# Set your variables
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION="ca-central-1"
REPO_NAME="etl-jobs/orders-processor"
IMAGE_TAG="1.0.0"

# Create the ECR repository (once)
aws ecr create-repository \
  --repository-name "$REPO_NAME" \
  --region "$AWS_REGION" \
  --image-scanning-configuration scanOnPush=true \
  --encryption-configuration encryptionType=AES256

# Authenticate Docker to ECR
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS \
    --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Build, tag, and push
docker build -t "$REPO_NAME:$IMAGE_TAG" .

docker tag "$REPO_NAME:$IMAGE_TAG" \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:${IMAGE_TAG}"

docker push \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:${IMAGE_TAG}"

# Also tag as latest
docker tag "$REPO_NAME:$IMAGE_TAG" \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest"

docker push \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest"

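Enable ECR image scanning (scanOnPush=true) so that every pushed image is checked for known CVEs automatically. The repository-level setting used above is ECR basic scanning; enhanced scanning, which is backed by Amazon Inspector, is configured separately at the registry level. Set up lifecycle policies to expire untagged images after 30 days to control storage costs.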
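
A lifecycle policy is a small JSON document attached to the repository. The sketch below generates one that expires untagged images 30 days after they were pushed. The function name untagged_expiry_policy is hypothetical; the field names follow the ECR lifecycle policy format.

```python
import json


def untagged_expiry_policy(days: int = 30) -> str:
    """Return an ECR lifecycle policy document that expires untagged images."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images older than {days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": days,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(untagged_expiry_policy(30))
```

The printed document can be saved to a file and applied with aws ecr put-lifecycle-policy, passing it via --lifecycle-policy-text, or managed through infrastructure as code.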

ECS Fargate Task Definition

Fargate is the serverless compute option for ECS: you define CPU and memory at the task level, and AWS provisions and manages the underlying compute. This is ideal for batch ETL jobs that run on a schedule.

{
  "family": "orders-etl-processor",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/etl-task-role",
  "containerDefinitions": [
    {
      "name": "etl-processor",
      "image": "123456789012.dkr.ecr.ca-central-1.amazonaws.com/etl-jobs/orders-processor:1.0.0",
      "essential": true,
      "environment": [
        {"name": "AWS_DEFAULT_REGION", "value": "ca-central-1"},
        {"name": "SOURCE_BUCKET",      "value": "my-data-lake-prod"},
        {"name": "DEST_BUCKET",        "value": "my-data-lake-prod"},
        {"name": "DEST_PREFIX",        "value": "processed/orders"}
      ],
      "secrets": [
        {
          "name": "SOURCE_PREFIX",
          "valueFrom": "arn:aws:ssm:ca-central-1:123456789012:parameter/etl/source-prefix"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/orders-etl-processor",
          "awslogs-region": "ca-central-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "stopTimeout": 120
    }
  ]
}
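
The cpu and memory values in a Fargate task definition cannot be chosen freely; each CPU size supports a specific set of memory sizes. The sketch below encodes the combinations for Linux tasks as I understand them from the AWS documentation, so treat the table as an assumption to confirm against the current docs. The names FARGATE_SIZES and is_valid_fargate_size are hypothetical.

```python
# CPU units -> allowed memory values in MiB (assumption: verify against AWS docs).
FARGATE_SIZES = {
    256: {512, 1024, 2048},
    512: set(range(1024, 4096 + 1, 1024)),
    1024: set(range(2048, 8192 + 1, 1024)),
    2048: set(range(4096, 16384 + 1, 1024)),
    4096: set(range(8192, 30720 + 1, 1024)),
    8192: set(range(16384, 61440 + 1, 4096)),
    16384: set(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu: str, memory: str) -> bool:
    """Check a task definition's cpu/memory strings against the table above."""
    try:
        return int(memory) in FARGATE_SIZES.get(int(cpu), set())
    except ValueError:
        # Non-numeric input such as "1 vCPU" is not handled by this sketch
        return False


if __name__ == "__main__":
    print(is_valid_fargate_size("1024", "2048"))  # True
    print(is_valid_fargate_size("1024", "1024"))  # False
```

A check like this can run before registering a task definition, which surfaces an invalid pairing earlier than the ECS API would.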

Key design decisions in this task definition:

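taskRoleArn grants the application code in the container permissions to access S3 and other AWS services. It uses IAM roles, not access keys: the AWS SDK retrieves temporary credentials automatically from the ECS container credentials endpoint. Never pass AWS_ACCESS_KEY_ID as an environment variable in production ECS tasks.
executionRoleArn is used by ECS itself to pull the image from ECR, write to CloudWatch Logs, and fetch the parameters referenced under secrets, so this role (not the task role) needs permission to read them, for example ssm:GetParameters.
secrets pulls configuration from SSM Parameter Store at container startup and injects it into the container as an environment variable, so the value never appears in plaintext in the task definition. SOURCE_PREFIX is used here to show the mechanism; in practice, reserve secrets for sensitive values such as database passwords.
CloudWatch Logs via the awslogs log driver captures all stdout/stderr output centrally.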
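
To make the task role concrete, here is a minimal sketch of a least-privilege policy for this job: list and read under the source prefix, write under the destination prefix. The function name etl_task_role_policy and the statement IDs are hypothetical, and a real policy may need additional permissions (for example KMS, if the buckets use customer-managed keys).

```python
import json


def etl_task_role_policy(
    source_bucket: str, source_prefix: str, dest_bucket: str, dest_prefix: str
) -> dict:
    """Return an IAM policy document scoped to the job's S3 prefixes."""
    src = source_prefix.strip("/")
    dst = dest_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListSourcePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{source_bucket}",
                # Restrict listing to keys under the source prefix
                "Condition": {"StringLike": {"s3:prefix": [f"{src}/*"]}},
            },
            {
                "Sid": "ReadSourceObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{source_bucket}/{src}/*",
            },
            {
                "Sid": "WriteDestObjects",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{dest_bucket}/{dst}/*",
            },
        ],
    }


if __name__ == "__main__":
    policy = etl_task_role_policy(
        "my-data-lake-prod", "raw/orders/", "my-data-lake-prod", "processed/orders"
    )
    print(json.dumps(policy, indent=2))
```

Note that s3:ListBucket applies to the bucket ARN while the object actions apply to object ARNs, which is why the statements are split.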

To run the task manually or trigger it from a Lambda or Step Functions orchestrator:

aws ecs run-task \
  --cluster prod-data-cluster \
  --task-definition orders-etl-processor:3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-xyz789],assignPublicIp=DISABLED}" \
  --overrides '{
    "containerOverrides": [{
      "name": "etl-processor",
      "environment": [
        {"name": "SOURCE_PREFIX", "value": "raw/orders/2024/03/"}
      ]
    }]
  }'
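
When an orchestrator triggers the task each month, the overrides JSON has to be generated and not typed by hand. The sketch below builds the same structure as the command above for a given month. The function name monthly_overrides is hypothetical, and the raw/orders/YYYY/MM/ layout is taken from the examples in this article. Because the task definition declares SOURCE_PREFIX under secrets, it is worth verifying in a test run that an environment override of the same name takes effect.

```python
import json
from datetime import date


def monthly_overrides(run_month: date, container_name: str = "etl-processor") -> str:
    """Return the JSON string expected by `aws ecs run-task --overrides`."""
    source_prefix = f"raw/orders/{run_month:%Y/%m}/"
    overrides = {
        "containerOverrides": [
            {
                "name": container_name,
                "environment": [
                    {"name": "SOURCE_PREFIX", "value": source_prefix},
                ],
            }
        ]
    }
    return json.dumps(overrides)


if __name__ == "__main__":
    print(monthly_overrides(date(2024, 3, 1)))
```

The same dictionary, before serialisation, matches the shape of the overrides argument accepted by the ECS RunTask API, so it can be reused from a Lambda function.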

For building the full infrastructure as code — ECR repository, ECS cluster, task definition, IAM roles, and CloudWatch log groups — see our guide to Terraform for AWS Data Stacks, which covers this pattern with complete HCL examples.

Conclusion

Containerising ETL jobs with Docker and deploying to ECS Fargate gives data engineering teams reproducibility, portability, and a clean separation between infrastructure and application code. The pattern scales from a single job running nightly to dozens of concurrent pipeline stages, all managed through the same ECS API.

Combined with CI/CD workflows using GitHub Actions to build and push images automatically, containerised ETL becomes a fully automated, auditable deployment process where every production run uses an immutable, tested image.

If you are modernising a legacy ETL infrastructure or building a new containerised data platform on AWS, reach out to Infra IT Consulting. We help Canadian organisations design and deploy production-grade data pipelines on AWS.
