Running Apache Airflow on AWS with MWAA

By Infra IT Consulting · April 29, 2024 · 10 min read

Apache Airflow has become the dominant workflow orchestration platform for data engineering teams. Its Python-based DAG authoring model, rich ecosystem of providers, and web-based monitoring UI make it well-suited for managing complex, dependency-driven data pipelines. The challenge has always been operations: running a production Airflow deployment requires managing the scheduler, workers, webserver, metadata database, and the network infrastructure that connects them — a significant operational burden for a team whose core expertise is data engineering rather than Kubernetes.

Amazon Managed Workflows for Apache Airflow (MWAA) is AWS’s managed Airflow service. It handles the infrastructure, patching, scaling, and high availability of the Airflow control plane while giving you full control over DAG authoring, configuration, and integration with other AWS services. This post covers how to run MWAA effectively in production, including the configuration decisions that matter, the IAM model, and patterns for integrating with the AWS data stack.

MWAA Architecture and What AWS Manages

An MWAA environment consists of:

Airflow Scheduler: Monitors DAGs, triggers tasks, and manages the task queue. MWAA runs two schedulers by default on Airflow 2.x (the count is configurable), giving active-active high availability.
Airflow Workers: Execute tasks. MWAA uses AWS Fargate containers for workers, scaling automatically based on queue depth.
Airflow Webserver: The UI for monitoring and managing DAG runs. MWAA runs this in a load-balanced configuration.
Metadata Database: Amazon Aurora PostgreSQL managed by AWS — you cannot access it directly.
DAG storage: An S3 bucket you provide, from which MWAA syncs DAG files automatically.

AWS manages patching, scaling the worker fleet, rotating credentials, and replacing failed components. You are responsible for DAG correctness, dependency management (requirements.txt), IAM permissions, and VPC network configuration.

The MWAA environment runs in your VPC. You must provide at least two private subnets in different Availability Zones. Workers and the scheduler are not publicly accessible — the webserver can be optionally exposed through a private or public endpoint. For most enterprise deployments, the webserver endpoint should be private and accessed via a VPN or AWS Client VPN.

Creating an MWAA Environment with Terraform

Infrastructure-as-code is strongly recommended for MWAA environments. The configuration surface is large enough that manual console setup is error-prone and difficult to reproduce. Here is a representative Terraform configuration:

resource "aws_mwaa_environment" "data_platform" {
  name              = "data-platform-airflow"
  airflow_version   = "2.8.1"
  environment_class = "mw1.medium"
  max_workers       = 10
  min_workers       = 1
  schedulers        = 2

  dag_s3_path              = "dags/"
  requirements_s3_path     = "requirements.txt"
  plugins_s3_path          = "plugins.zip"
  source_bucket_arn        = aws_s3_bucket.mwaa_storage.arn

  execution_role_arn = aws_iam_role.mwaa_execution.arn

  network_configuration {
    security_group_ids = [aws_security_group.mwaa.id]
    subnet_ids         = [
      aws_subnet.private_a.id,
      aws_subnet.private_b.id
    ]
  }

  webserver_access_mode = "PRIVATE_ONLY"

  logging_configuration {
    dag_processing_logs {
      enabled   = true
      log_level = "WARNING"
    }
    scheduler_logs {
      enabled   = true
      log_level = "WARNING"
    }
    task_logs {
      enabled   = true
      log_level = "INFO"
    }
    webserver_logs {
      enabled   = true
      log_level = "ERROR"
    }
    worker_logs {
      enabled   = true
      log_level = "WARNING"
    }
  }

  airflow_configuration_options = {
    "core.load_examples"               = "false"
    "core.parallelism"                 = "64"
    "core.max_active_tasks_per_dag"    = "16"
    "scheduler.dag_dir_list_interval"  = "30"
    "webserver.warn_deployment_exposure" = "false"
  }
}

Environment class choices: mw1.small (1 vCPU, 2 GB), mw1.medium (2 vCPU, 4 GB), and mw1.large (4 vCPU, 8 GB). The class sizes the scheduler, the webserver, and each worker container, and it sets how many tasks a single worker runs concurrently by default (5, 10, and 20 respectively). The number of Fargate workers scales independently between min_workers and max_workers. For most production platforms processing under 1,000 DAG runs per day, mw1.medium is appropriate. Upgrade to mw1.large if you see scheduler lag (tasks staying in the “scheduled” state for more than a few seconds).
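The interaction between environment class, max_workers, and core.parallelism can be sketched as simple arithmetic. This is a rough estimate, assuming MWAA's default per-worker concurrency of 5, 10, and 20 tasks for the three classes and Airflow's documented behaviour that core.parallelism applies per scheduler; the function name is illustrative and not part of any MWAA API.

```python
# Default concurrent tasks per worker for each environment class.
# Assumption: MWAA defaults; overriding celery.worker_autoscale changes them.
DEFAULT_TASKS_PER_WORKER = {"mw1.small": 5, "mw1.medium": 10, "mw1.large": 20}


def max_concurrent_tasks(environment_class: str, max_workers: int,
                         parallelism: int, schedulers: int) -> int:
    """Estimate the upper bound on simultaneously running tasks."""
    # Capacity of the worker fleet when fully scaled out.
    worker_capacity = DEFAULT_TASKS_PER_WORKER[environment_class] * max_workers
    # core.parallelism limits running task instances per scheduler.
    scheduler_cap = parallelism * schedulers
    return min(worker_capacity, scheduler_cap)


if __name__ == "__main__":
    # Values from the Terraform configuration above.
    print(max_concurrent_tasks("mw1.medium", max_workers=10, parallelism=64, schedulers=2))
```

With the configuration above, the worker fleet (10 workers at 10 tasks each) is the binding limit rather than core.parallelism, so raising max_workers is what would add capacity.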

IAM: The Most Common Source of MWAA Problems

MWAA uses a single execution role that must be granted permission to everything Airflow needs: reading DAGs from S3, writing logs to CloudWatch, publishing metrics, and invoking any AWS services your DAGs interact with. The IAM model is “one role governs all” — there is no per-DAG IAM isolation in MWAA.

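A baseline execution role policy is shown below. The policy AWS documents for MWAA also grants access to the environment's Celery queues in SQS (airflow-celery-*), cloudwatch:PutMetricData, and, for encrypted environments, the KMS key, so add those statements when adapting it: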

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "airflow:PublishMetrics"
      ],
      "Resource": "arn:aws:airflow:ca-central-1:123456789012:environment/data-platform-airflow"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject*",
        "s3:GetBucket*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::my-mwaa-bucket",
        "arn:aws:s3:::my-mwaa-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:CreateLogGroup",
        "logs:PutLogEvents",
        "logs:GetLogEvents",
        "logs:GetLogRecord",
        "logs:GetLogGroupFields",
        "logs:GetQueryResults"
      ],
      "Resource": "arn:aws:logs:ca-central-1:123456789012:log-group:airflow-data-platform-airflow-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun",
        "glue:GetJobRuns",
        "glue:BatchStopJobRun"
      ],
      "Resource": "*"
    }
  ]
}

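Add permissions for each AWS service your DAGs interact with: Step Functions (states:StartExecution, states:DescribeExecution), EMR (only the elasticmapreduce actions your DAGs actually call, rather than elasticmapreduce:*), Lambda (lambda:InvokeFunction), and so on. Because every DAG shares this one role, scope each Resource to specific ARNs wherever the service allows it.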
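One way to keep these additions scoped is to generate the extra statements with a small helper that refuses wildcards. The sketch below is illustrative only: the helper name, the state machine name orders-subflow, and the function name orders-notify are hypothetical, and the account ID is a placeholder.

```python
import json


def allow_statement(actions: list[str], resources: list[str]) -> dict:
    """Return one Allow statement, rejecting wildcard actions and resources."""
    if any(a == "*" or a.endswith(":*") for a in actions):
        raise ValueError("list explicit actions instead of a wildcard")
    if not resources or any(r == "*" for r in resources):
        raise ValueError("scope the statement to specific resource ARNs")
    return {"Effect": "Allow", "Action": list(actions), "Resource": list(resources)}


# Hypothetical ARNs for illustration; substitute your own account, region, and names.
STATES = "arn:aws:states:ca-central-1:123456789012"
extra_statements = [
    # StartExecution is authorised against the state machine ARN.
    allow_statement(["states:StartExecution"], [f"{STATES}:stateMachine:orders-subflow"]),
    # DescribeExecution is authorised against execution ARNs of that state machine.
    allow_statement(["states:DescribeExecution"], [f"{STATES}:execution:orders-subflow:*"]),
    allow_statement(
        ["lambda:InvokeFunction"],
        ["arn:aws:lambda:ca-central-1:123456789012:function:orders-notify"],
    ),
]

if __name__ == "__main__":
    # Paste the output into the "Statement" array of the execution role policy.
    print(json.dumps(extra_statements, indent=2))
```

Rejecting wildcards in the helper is a design choice for this sketch: with a single shared role, a broad grant added for one DAG is available to every DAG in the environment.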

Writing DAGs for AWS Data Workflows

MWAA includes the apache-airflow-providers-amazon package, which provides operators, sensors, and hooks for most major AWS services. A production-style DAG orchestrating an S3-to-Glue-to-Redshift pipeline:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator

default_args = {
    "owner": "data-platform",
    "depends_on_past": False,
    "email_on_failure": True,
    "email": ["data-alerts@company.com"],
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=2)
}

with DAG(
    dag_id="daily_orders_pipeline",
    default_args=default_args,
    description="Daily orders ETL: S3 raw → Glue transform → Redshift",
    schedule="0 6 * * *",  # 6 AM UTC daily; "schedule" replaces the deprecated schedule_interval (Airflow 2.4+)
    start_date=datetime(2024, 4, 1),
    catchup=False,
    tags=["orders", "daily", "production"],
    max_active_runs=1
) as dag:

    # Wait for upstream file to land in S3
    wait_for_raw_file = S3KeySensor(
        task_id="wait_for_raw_orders",
        bucket_name="my-raw-data-bucket",
        bucket_key="incoming/orders/{{ ds }}/orders_{{ ds_nodash }}.parquet",
        poke_interval=60,       # Check every 60 seconds
        timeout=3600,           # Fail after 1 hour
        mode="reschedule"       # Release worker slot while waiting
    )

    # Run Glue transformation job
    transform_orders = GlueJobOperator(
        task_id="transform_orders",
        job_name="orders-daily-transform",
        script_args={
            "--run_date": "{{ ds }}",
            "--source_prefix": "incoming/orders/{{ ds }}/",
            "--target_prefix": "processed/orders/year={{ macros.ds_format(ds, '%Y-%m-%d', '%Y') }}/month={{ macros.ds_format(ds, '%Y-%m-%d', '%m') }}/day={{ macros.ds_format(ds, '%Y-%m-%d', '%d') }}/"
        },
        aws_conn_id="aws_default",
        wait_for_completion=True,
        num_of_dpus=5
    )

    # Update Glue catalog partitions
    refresh_catalog = GlueCrawlerOperator(
        task_id="refresh_glue_catalog",
        config={"Name": "orders-processed-crawler"},
        aws_conn_id="aws_default",
        wait_for_completion=True
    )

    # Load summary into Redshift
    load_redshift_summary = RedshiftDataOperator(
        task_id="load_redshift_summary",
        cluster_identifier="analytics-cluster",
        database="analytics",
        db_user="airflow_user",
        # A list of statements is submitted as one batch, which the
        # Redshift Data API runs as a single transaction.
        sql=[
            "DELETE FROM orders_daily_summary WHERE order_date = '{{ ds }}'",
            """
            INSERT INTO orders_daily_summary
            SELECT
                order_date,
                region,
                COUNT(*) AS order_count,
                SUM(total_cad) AS revenue_cad
            FROM orders_external.orders
            WHERE order_date = '{{ ds }}'
            GROUP BY order_date, region
            """,
        ],
        wait_for_completion=True,
        aws_conn_id="aws_default"
    )

    wait_for_raw_file >> transform_orders >> refresh_catalog >> load_redshift_summary

The mode="reschedule" on the S3KeySensor is important. The default poke mode holds a worker slot for the entire sensor duration, consuming Fargate capacity even while idle. reschedule mode releases the worker slot between checks, allowing the worker to process other tasks — significantly improving MWAA resource utilisation when sensors have long wait periods.
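The difference between the two modes can be put in numbers with a back-of-the-envelope calculation. The sketch below assumes each individual check occupies a worker slot for a fixed, short time (the 2-second figure is a hypothetical value, not a measurement) and counts one check at the start plus one per poke_interval until the file lands.

```python
def slot_seconds(wait_seconds: int, poke_interval: int, check_seconds: int, mode: str) -> int:
    """Worker-slot time one sensor run occupies before its condition is met."""
    if mode == "poke":
        # The slot is held for the whole wait, including the sleeps between checks.
        return wait_seconds
    if mode == "reschedule":
        # One check at the start, then one per interval; the slot is free in between.
        checks = wait_seconds // poke_interval + 1
        return checks * check_seconds
    raise ValueError(f"unknown sensor mode: {mode}")


if __name__ == "__main__":
    # File lands 45 minutes after the sensor starts; poke_interval=60 as in the DAG above.
    print(slot_seconds(2700, 60, 2, "poke"))        # 2700
    print(slot_seconds(2700, 60, 2, "reschedule"))  # 92
```

Under these assumptions the sensor ties up a slot for 45 minutes in poke mode and for roughly a minute and a half in reschedule mode. Real reschedule runs also pay task start-up overhead on every check, so very short poke intervals erode the benefit.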

Managing Dependencies and Plugin Packages

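MWAA installs Python packages from a requirements.txt file you upload to S3. Uploading a new version does nothing by itself: you must update the environment to point at the new file version, and that update takes the environment through a rollout that typically lasts 20-30 minutes, so dependency changes require planning.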

Best practices for MWAA dependency management:

Pin all versions: apache-airflow-providers-amazon==8.19.0 rather than apache-airflow-providers-amazon>=8. Unpinned requirements cause non-deterministic environment states.
Test locally: Use aws-mwaa-local-runner (the official Docker-based local environment) to verify requirements work before deploying.
Minimise dependencies: Each additional package increases environment update time and the risk of version conflicts. Prefer Airflow provider packages over installing full frameworks (boto3 is already included; avoid installing pandas or numpy unless genuinely needed in DAG code).
Custom operators in plugins: The plugins.zip mechanism allows you to package custom Airflow operators without adding them to the main Python environment.
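The pinning rule is easy to enforce automatically before a file ever reaches S3. The following is a minimal sketch of such a check, suitable for a CI step; the function name is hypothetical, and the pattern only recognises simple name==version lines (with optional extras), so treat it as a starting point rather than a full requirements parser.

```python
import re

# A requirement counts as pinned when it is "name==version", optionally with extras.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the requirement lines that are not pinned to an exact version."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            # Skip blank lines and pip options such as --constraint.
            continue
        if not PINNED.match(line.replace(" ", "")):
            problems.append(line)
    return problems


if __name__ == "__main__":
    sample = """
    # providers
    apache-airflow-providers-amazon==8.19.0
    requests>=2
    pandas
    """
    print(unpinned_requirements(sample))  # ['requests>=2', 'pandas']
```

Failing the build when this list is non-empty catches unpinned packages early, before they cost a 20-30 minute environment update.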

Monitoring MWAA in Production

MWAA publishes metrics to CloudWatch under the AmazonMWAA namespace. Critical metrics to alert on:

QueuedTasks: Number of tasks waiting for a worker. If consistently non-zero, increase max_workers or reduce task parallelism.
RunningTasks: Active task count. Alert if this exceeds 80% of your worker capacity for extended periods.
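SchedulerHeartbeat: A counter of scheduler check-ins, so it should report a steady positive value. Missing data points or a sum of zero over an evaluation period indicate that the scheduler has stopped.
TaskInstanceFailures: The count of failed task instances (Airflow's ti_failures counter). Alert immediately on any non-zero value for production DAGs.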
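As an illustration of turning one of these metrics into an alert, the sketch below builds the parameters for a CloudWatch alarm on QueuedTasks and passes them to boto3's put_metric_alarm. The alarm name, the SNS topic ARN, and the thresholds are hypothetical, and the Function=Executor dimension is an assumption to verify against the metric as it appears in your CloudWatch console.

```python
def queued_tasks_alarm(environment_name: str, sns_topic_arn: str,
                       periods: int = 3) -> dict:
    """Build put_metric_alarm parameters for a persistently non-empty task queue."""
    return {
        "AlarmName": f"{environment_name}-queued-tasks",
        "Namespace": "AmazonMWAA",
        "MetricName": "QueuedTasks",
        # Assumption: confirm these dimensions in the CloudWatch console.
        "Dimensions": [
            {"Name": "Function", "Value": "Executor"},
            {"Name": "Environment", "Value": environment_name},
        ],
        # Minimum > 0 means the queue never drained during the period.
        "Statistic": "Minimum",
        "Period": 300,
        "EvaluationPeriods": periods,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm. Requires boto3 and cloudwatch:PutMetricAlarm."""
    import boto3  # imported lazily so the builder can be tested without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    params = queued_tasks_alarm(
        "data-platform-airflow",
        "arn:aws:sns:ca-central-1:123456789012:data-alerts",  # hypothetical topic
    )
    print(params["AlarmName"])
```

Using the Minimum statistic over three five-minute periods encodes "consistently non-zero": a brief spike in queued tasks while workers scale out will not fire the alarm.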

CloudWatch Logs receive Airflow logs for all components. Task logs are the most useful for debugging failed runs — they capture the complete stdout/stderr of each task execution and are accessible from the Airflow UI or directly in CloudWatch.

For integrating MWAA with other orchestration layers, AWS Step Functions for data pipelines can trigger MWAA DAGs via the Airflow REST API, or MWAA DAGs can invoke Step Functions state machines for complex sub-workflows. The two tools complement each other well in large data platforms.
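For the first direction, an external caller (for example a Lambda function invoked from a state machine) needs a way to start a DAG run. The sketch below uses MWAA's CLI token endpoint, which is a different interface from the Airflow REST API mentioned above: boto3's create_cli_token returns a short-lived token and the webserver hostname, and a POST to /aws_mwaa/cli runs an Airflow CLI command. The function names are hypothetical, and with a private webserver the caller must run inside the VPC.

```python
import json
import urllib.request


def build_trigger_request(hostname: str, cli_token: str, dag_id: str) -> urllib.request.Request:
    """Build the POST request that runs `dags trigger <dag_id>` on the MWAA CLI endpoint."""
    return urllib.request.Request(
        url=f"https://{hostname}/aws_mwaa/cli",
        data=f"dags trigger {dag_id}".encode("utf-8"),
        headers={"Authorization": f"Bearer {cli_token}", "Content-Type": "text/plain"},
        method="POST",
    )


def trigger_dag(environment_name: str, dag_id: str) -> dict:
    """Fetch a CLI token and trigger the DAG. Requires boto3 and airflow:CreateCliToken."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    token = boto3.client("mwaa").create_cli_token(Name=environment_name)
    request = build_trigger_request(token["WebServerHostname"], token["CliToken"], dag_id)
    with urllib.request.urlopen(request, timeout=30) as response:
        # The response body is JSON whose stdout and stderr fields are base64-encoded.
        return json.loads(response.read())


if __name__ == "__main__":
    print(trigger_dag("data-platform-airflow", "daily_orders_pipeline"))
```

The caller's IAM role, not the MWAA execution role, needs permission to create the CLI token, which keeps the ability to trigger DAGs separate from what the DAGs themselves may do.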

MWAA vs. Self-Managed Airflow on EKS

MWAA’s primary trade-off is cost versus control. MWAA pricing starts at approximately $0.49/hour for mw1.small environments, with additional workers and schedulers billed per hour on top of the base environment rate. An mw1.medium environment with 3 average workers costs roughly $500-700 CAD per month.
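A quick way to sanity-check a budget is to turn the hourly rates into a monthly figure. The sketch below assumes the bill is an hourly environment rate plus an hourly rate for each additional worker; the $0.49 environment rate is the figure quoted above, while the worker rate in the second example is a placeholder to replace with the current price for your region and class.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def monthly_cost(environment_rate: float, worker_rate: float,
                 avg_additional_workers: float) -> float:
    """Estimate the monthly charge in the currency of the hourly rates."""
    hourly = environment_rate + worker_rate * avg_additional_workers
    return round(hourly * HOURS_PER_MONTH, 2)


if __name__ == "__main__":
    # Base mw1.small environment only, at the rate quoted above.
    print(monthly_cost(0.49, 0.0, 0))    # 357.7
    # Same environment averaging two additional workers at a placeholder rate.
    print(monthly_cost(0.49, 0.055, 2))  # 438.0
```

The estimate excludes metadata database storage, CloudWatch, and S3 charges, and it is in the currency of the input rates, so convert before comparing against a CAD budget.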

Self-managed Airflow on EKS can achieve lower per-unit costs at scale but requires maintaining Kubernetes infrastructure, handling Airflow upgrades, managing the metadata database, and building high-availability configurations. For teams without dedicated platform engineering capacity, the operational overhead of self-managed Airflow typically exceeds the cost savings within the first six months.

The modern data stack guide covers where MWAA fits in a complete AWS analytics architecture, including its relationship with dbt, Fivetran, and other common data stack components.

Conclusion

MWAA delivers production-ready Apache Airflow without the infrastructure burden that has historically made self-managed Airflow deployments expensive to build and maintain. Its native integration with IAM, VPC networking, CloudWatch, and the full suite of AWS data services makes it a natural fit for teams already invested in the AWS ecosystem.

The configuration surface is larger than some managed services — getting IAM, networking, and dependency management right takes careful attention the first time — but once established, MWAA provides a stable, scalable orchestration layer that lets data engineering teams focus on pipeline logic rather than infrastructure operations.

If your team is evaluating MWAA for a new data platform or looking to migrate from a self-managed Airflow deployment, contact Infra IT Consulting for an architecture review and implementation support.
