Docker for Data Engineers: Containerising ETL Jobs on AWS
Docker has become an essential tool for data engineers working on AWS. Containerising ETL jobs solves the classic "works on my machine" problem, makes dependency management reproducible, and enables deployment patterns (ECS Fargate, AWS Batch, Lambda container images) that simply are not available with zip-packaged code. This tutorial walks through a complete containerisation workflow: writing a Dockerfile for a Python ETL job, running it locally with Docker Compose, pushing it to Amazon ECR, and defining an ECS Fargate task.
Writing a Production-Ready Dockerfile
A good Dockerfile for a data engineering ETL job prioritises four things: small image size, explicit dependency pinning, a non-root user for security, and a clean separation between the base environment and application code.
Here is a Dockerfile for a Python ETL job that reads from S3, transforms data with pandas and pyarrow, and writes Parquet back to S3:
# Dockerfile
# Stage 1: build stage, installs dependencies with full build tools
FROM python:3.11-slim AS builder
WORKDIR /build
# Install build dependencies for packages that have C extensions
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
        libpq-dev \
    && rm -rf /var/lib/apt/lists/*
# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Stage 2: runtime stage, a minimal image with only what's needed
FROM python:3.11-slim AS runtime
# Security: create non-root user
RUN groupadd --gid 1001 etlgroup && \
    useradd --uid 1001 --gid etlgroup --shell /bin/bash --create-home etluser
WORKDIR /app
# Copy installed packages from builder
COPY --from=builder /root/.local /home/etluser/.local
# Copy application code
COPY --chown=etluser:etlgroup src/ ./src/
# Switch to non-root user
USER etluser
# Ensure user-installed packages are on PATH
ENV PATH="/home/etluser/.local/bin:${PATH}"
ENV PYTHONPATH="/app"
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# Health check: verifies the Python environment is intact
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import boto3, pandas, pyarrow; print('OK')" || exit 1
ENTRYPOINT ["python", "-m", "src.etl_job"]
The multi-stage build is important: the builder stage has gcc and build tools for compiling C extensions; the runtime stage starts fresh from python:3.11-slim and only copies the compiled packages. This keeps the final image lean: typically 300-400 MB for a pandas/pyarrow workload versus 800+ MB for a single-stage build.
Your requirements.txt should pin exact versions:
boto3==1.34.34
botocore==1.34.34
pandas==2.1.4
pyarrow==14.0.2
psycopg2-binary==2.9.9
pydantic==2.5.3
structlog==24.1.0
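Exact pinning is easy to erode over time, so it can help to check it mechanically, for example in CI before the image is built. The following is a minimal sketch, not part of the ETL job itself: the function name unpinned_requirements is hypothetical, and the pattern deliberately handles only simple name==version lines (lines with environment markers or URLs would be reported as unpinned).

```python
import re

# Matches "name==version", allowing extras such as "package[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,\-]+\])?==[A-Za-z0-9_.!+\-]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


sample = """\
boto3==1.34.34
pandas>=2.0
pyarrow
structlog==24.1.0  # logging
"""
print(unpinned_requirements(sample))
# ['pandas>=2.0', 'pyarrow']
```

An empty list means every dependency line carries an exact version, which is what the requirements.txt above is aiming for.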
The ETL Job Source Code
# src/etl_job.py
import os
import sys
import logging
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO
from datetime import datetime, timezone

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    stream=sys.stdout,
)
logger = logging.getLogger("etl_job")


def get_config() -> dict:
    """Load configuration from environment variables."""
    required = ["SOURCE_BUCKET", "SOURCE_PREFIX", "DEST_BUCKET", "DEST_PREFIX"]
    missing = [k for k in required if not os.environ.get(k)]
    if missing:
        raise EnvironmentError(f"Missing required environment variables: {missing}")
    return {
        "source_bucket": os.environ["SOURCE_BUCKET"],
        "source_prefix": os.environ["SOURCE_PREFIX"],
        "dest_bucket": os.environ["DEST_BUCKET"],
        "dest_prefix": os.environ["DEST_PREFIX"],
        "aws_region": os.environ.get("AWS_DEFAULT_REGION", "ca-central-1"),
    }


def list_source_files(s3_client, bucket: str, prefix: str) -> list[str]:
    """Return the keys of all CSV objects under the prefix (paginated)."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(
            obj["Key"] for obj in page.get("Contents", [])
            if obj["Key"].endswith(".csv")
        )
    return keys


def read_csv_from_s3(s3_client, bucket: str, key: str) -> pd.DataFrame:
    """Download one CSV object and parse it into a DataFrame."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(BytesIO(obj["Body"].read()))


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply business logic transformations."""
    # rename() returns a new frame, so the caller's DataFrame is not mutated
    df = df.rename(columns=lambda col: col.strip().lower().replace(" ", "_"))
    # copy() after filtering avoids pandas' SettingWithCopyWarning below
    df = df.dropna(subset=["order_id", "customer_id"]).copy()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0.0)
    df["processed_at"] = datetime.now(timezone.utc).isoformat()
    return df


def write_parquet_to_s3(s3_client, df: pd.DataFrame, bucket: str, key: str) -> None:
    """Serialise the DataFrame to Snappy-compressed Parquet and upload it."""
    buffer = BytesIO()
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, buffer, compression="snappy")
    buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue(),
        ServerSideEncryption="AES256",
    )


def main():
    config = get_config()
    logger.info("Starting ETL job", extra=config)
    s3 = boto3.client("s3", region_name=config["aws_region"])
    source_keys = list_source_files(s3, config["source_bucket"], config["source_prefix"])
    if not source_keys:
        logger.warning("No source files found. Exiting.")
        return
    logger.info(f"Found {len(source_keys)} source files")
    frames = []
    for key in source_keys:
        df = read_csv_from_s3(s3, config["source_bucket"], key)
        frames.append(df)
        logger.info(f"Read {len(df)} rows from s3://{config['source_bucket']}/{key}")
    combined = pd.concat(frames, ignore_index=True)
    transformed = transform(combined)
    # Use UTC (matching processed_at) and tolerate a trailing slash in the prefix
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    output_key = f"{config['dest_prefix'].rstrip('/')}/data_{timestamp}.parquet"
    write_parquet_to_s3(s3, transformed, config["dest_bucket"], output_key)
    logger.info(f"Wrote {len(transformed)} rows to s3://{config['dest_bucket']}/{output_key}")


if __name__ == "__main__":
    main()
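The output key in main() is assembled from DEST_PREFIX and a timestamp, which makes it awkward to test and sensitive to whether the prefix ends in a slash. One possible refinement, sketched here under the assumption that keys should always be stamped in UTC, is to move that logic into a small pure function. The name build_output_key is hypothetical and does not appear in the job above.

```python
from datetime import datetime, timezone


def build_output_key(dest_prefix: str, now: datetime) -> str:
    """Join a prefix and a UTC-timestamped file name into an S3 key."""
    prefix = dest_prefix.strip("/")  # tolerate "a/b", "a/b/", or "/a/b"
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d_%H%M%S")
    name = f"data_{stamp}.parquet"
    return f"{prefix}/{name}" if prefix else name


# A fixed clock makes the result deterministic, which is handy in unit tests.
fixed = datetime(2024, 1, 15, 8, 30, 0, tzinfo=timezone.utc)
print(build_output_key("processed/orders/", fixed))
# processed/orders/data_20240115_083000.parquet
```

Passing the clock in as an argument, rather than calling datetime.now() inside the function, is the design choice that makes the helper testable without mocking.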
Local Development with docker-compose
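Docker Compose lets you run the ETL job locally against a real AWS environment (using your local credentials) or against LocalStack for fully offline testing. Note that the job as written talks to real S3 by default; running it against LocalStack additionally requires pointing the S3 client at the LocalStack endpoint (http://localstack:4566 inside the Compose network).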
# docker-compose.yml
version: "3.9"
services:
  etl-job:
    build:
      context: .
      target: runtime
    image: etl-job:local
    environment:
      # AWS credentials: sourced from local environment, never hardcoded
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      AWS_DEFAULT_REGION: ca-central-1
      # Job configuration
      SOURCE_BUCKET: my-data-lake-dev
      SOURCE_PREFIX: raw/orders/2024/01/
      DEST_BUCKET: my-data-lake-dev
      DEST_PREFIX: processed/orders/2024/01
    # Optional: mount local source code for development iteration
    volumes:
      - ./src:/app/src:ro
  # Optional: localstack for offline S3 testing
  localstack:
    image: localstack/localstack:3.0
    ports:
      - "4566:4566"
    environment:
      SERVICES: s3
      AWS_DEFAULT_REGION: ca-central-1
    volumes:
      - localstack-data:/var/lib/localstack
volumes:
  localstack-data:
Run the job locally:
# Build the image
docker build -t etl-job:local .
# Run against real AWS (credentials from environment)
docker compose run etl-job
# Or run directly with docker
docker run --rm \
  -e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  -e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  -e AWS_DEFAULT_REGION="ca-central-1" \
  -e SOURCE_BUCKET="my-data-lake-dev" \
  -e SOURCE_PREFIX="raw/orders/2024/01/" \
  -e DEST_BUCKET="my-data-lake-dev" \
  -e DEST_PREFIX="processed/orders" \
  etl-job:local
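The Compose file defines a localstack service, but the job creates its S3 client without any endpoint override, so it would still call real AWS. A minimal sketch of one way to bridge that gap follows. It assumes a hypothetical environment variable named S3_ENDPOINT_URL (not used anywhere in the original job) that is set only for local runs, and it relies on the endpoint_url argument that boto3.client accepts.

```python
def s3_client_kwargs(env: dict) -> dict:
    """Build keyword arguments for boto3.client("s3", ...)."""
    kwargs = {"region_name": env.get("AWS_DEFAULT_REGION", "ca-central-1")}
    endpoint = env.get("S3_ENDPOINT_URL")  # hypothetical variable name
    if endpoint:
        # Inside the Compose network the service name resolves as a hostname,
        # e.g. http://localstack:4566
        kwargs["endpoint_url"] = endpoint
    return kwargs


print(s3_client_kwargs({"S3_ENDPOINT_URL": "http://localstack:4566"}))
# In the job this would be used as:
#   s3 = boto3.client("s3", **s3_client_kwargs(os.environ))
```

With the variable unset, the function returns only the region, so production behaviour is unchanged. Depending on the SDK's S3 addressing style, LocalStack may need further client configuration; that is outside the scope of this sketch.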
Pushing to Amazon ECR
Amazon Elastic Container Registry (ECR) is the natural home for Docker images in an AWS workflow. It integrates with ECS, Lambda, and Batch without needing to manage Docker Hub credentials.
# Set your variables
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION="ca-central-1"
REPO_NAME="etl-jobs/orders-processor"
IMAGE_TAG="1.0.0"
# Create the ECR repository (once)
aws ecr create-repository \
  --repository-name "$REPO_NAME" \
  --region "$AWS_REGION" \
  --image-scanning-configuration scanOnPush=true \
  --encryption-configuration encryptionType=AES256
# Authenticate Docker to ECR
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS \
    --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
# Build, tag, and push
docker build -t "$REPO_NAME:$IMAGE_TAG" .
docker tag "$REPO_NAME:$IMAGE_TAG" \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:${IMAGE_TAG}"
docker push \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:${IMAGE_TAG}"
# Also tag as latest
docker tag "$REPO_NAME:$IMAGE_TAG" \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest"
docker push \
  "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest"
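Enable ECR image scanning (scanOnPush=true) so that each image is checked for known CVEs automatically on every push. This repository-level flag turns on basic scanning; enhanced scanning, which is backed by Amazon Inspector, is configured separately at the registry level. Set up lifecycle policies to expire untagged images after 30 days to control storage costs.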
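The 30-day expiry for untagged images can be expressed as an ECR lifecycle policy document. The sketch below generates that JSON; the function name untagged_expiry_policy is hypothetical, while the rule fields (tagStatus, countType, countUnit, countNumber, and the expire action) follow the ECR lifecycle policy format.

```python
import json


def untagged_expiry_policy(days: int = 30) -> str:
    """Return ECR lifecycle policy text that expires untagged images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images after {days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": days,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)


print(untagged_expiry_policy(30))
```

The printed JSON can be saved to a file and applied with aws ecr put-lifecycle-policy, passing the repository name and the file via --lifecycle-policy-text.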
ECS Fargate Task Definition
Fargate is the serverless compute option for ECS: you define CPU and memory, and AWS provisions and manages the underlying hosts. This is ideal for batch ETL jobs that run on a schedule.
{
  "family": "orders-etl-processor",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/etl-task-role",
  "containerDefinitions": [
    {
      "name": "etl-processor",
      "image": "123456789012.dkr.ecr.ca-central-1.amazonaws.com/etl-jobs/orders-processor:1.0.0",
      "essential": true,
      "environment": [
        {"name": "AWS_DEFAULT_REGION", "value": "ca-central-1"},
        {"name": "SOURCE_BUCKET", "value": "my-data-lake-prod"},
        {"name": "DEST_BUCKET", "value": "my-data-lake-prod"},
        {"name": "DEST_PREFIX", "value": "processed/orders"}
      ],
      "secrets": [
        {
          "name": "SOURCE_PREFIX",
          "valueFrom": "arn:aws:ssm:ca-central-1:123456789012:parameter/etl/source-prefix"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/orders-etl-processor",
          "awslogs-region": "ca-central-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "stopTimeout": 120
    }
  ]
}
Key design decisions in this task definition:
- taskRoleArn grants the container permissions to access S3, SSM, and other AWS services. It uses IAM roles, not access keys: the container retrieves temporary credentials automatically from the credentials endpoint that ECS provides to the task. Never pass AWS_ACCESS_KEY_ID as an environment variable in production ECS tasks.
- secrets pulls sensitive configuration from SSM Parameter Store at container startup and injects it into the container as an environment variable. This keeps the plaintext value out of the task definition. ECS resolves these references using the execution role (executionRoleArn), so that role needs permission to read the parameter.
- CloudWatch Logs via the awslogs log driver captures all stdout/stderr output centrally.
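The split between the two roles is easier to see as policy documents. The sketch below is illustrative only: the function names are hypothetical, and the statements are a plausible least-privilege starting point for this job rather than the policies of any real account. The task role covers what the ETL code does at runtime, while the execution role covers what ECS does before the container starts, including resolving the secrets entry.

```python
import json


def task_role_policy(source_bucket: str, dest_bucket: str) -> dict:
    """Permissions the running ETL code needs (attached to taskRoleArn)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{source_bucket}"},
            {"Effect": "Allow", "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{source_bucket}/*"},
            {"Effect": "Allow", "Action": "s3:PutObject",
             "Resource": f"arn:aws:s3:::{dest_bucket}/*"},
        ],
    }


def execution_role_secrets_policy(parameter_arn: str) -> dict:
    """Permission ECS needs to resolve the "secrets" entry at startup."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "ssm:GetParameters",
             "Resource": parameter_arn},
        ],
    }


print(json.dumps(task_role_policy("my-data-lake-prod", "my-data-lake-prod"), indent=2))
print(json.dumps(execution_role_secrets_policy(
    "arn:aws:ssm:ca-central-1:123456789012:parameter/etl/source-prefix"), indent=2))
```

If the parameter is encrypted with a customer-managed KMS key, the execution role would also need permission to decrypt with that key; that is omitted here for brevity.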
To run the task manually or trigger it from a Lambda or Step Functions orchestrator:
aws ecs run-task \
  --cluster prod-data-cluster \
  --task-definition orders-etl-processor:3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-xyz789],assignPublicIp=DISABLED}" \
  --overrides '{
    "containerOverrides": [{
      "name": "etl-processor",
      "environment": [
        {"name": "SOURCE_PREFIX", "value": "raw/orders/2024/03/"}
      ]
    }]
  }'
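When the trigger is a Lambda function rather than the CLI, the same request can be built as a dictionary and passed to the run_task method of the boto3 ECS client. The sketch below only constructs the request so it can be inspected and tested without AWS access. The helper names monthly_prefix and run_task_request are hypothetical, and the cluster, subnet, and security group values are the placeholders from the CLI command above.

```python
from datetime import date


def monthly_prefix(day: date) -> str:
    """Derive the raw-data prefix for the month containing `day`."""
    return f"raw/orders/{day.year}/{day.month:02d}/"


def run_task_request(source_prefix: str) -> dict:
    """Build keyword arguments for the ECS run_task call."""
    return {
        "cluster": "prod-data-cluster",
        "taskDefinition": "orders-etl-processor:3",
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-abc123"],
                "securityGroups": ["sg-xyz789"],
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": "etl-processor",
                    "environment": [
                        {"name": "SOURCE_PREFIX", "value": source_prefix}
                    ],
                }
            ]
        },
    }


request = run_task_request(monthly_prefix(date(2024, 3, 1)))
print(request["overrides"]["containerOverrides"][0]["environment"])
# Inside a Lambda handler this would be sent with:
#   boto3.client("ecs").run_task(**request)
```

Keeping the request construction separate from the API call means the scheduling logic can be unit tested on its own.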
For building the full infrastructure as code (ECR repository, ECS cluster, task definition, IAM roles, and CloudWatch log groups), see our guide to Terraform for AWS Data Stacks, which covers this pattern with complete HCL examples.
Conclusion
Containerising ETL jobs with Docker and deploying to ECS Fargate gives data engineering teams reproducibility, portability, and a clean separation between infrastructure and application code. The pattern scales from a single job running nightly to dozens of concurrent pipeline stages, all managed through the same ECS API.
Combined with CI/CD workflows using GitHub Actions to build and push images automatically, containerised ETL becomes a fully automated, auditable deployment process where every production run uses an immutable, tested image.
If you are modernising a legacy ETL infrastructure or building a new containerised data platform on AWS, reach out to Infra IT Consulting. We help Canadian organisations design and deploy production-grade data pipelines on AWS.