Orchestrating Data Pipelines with AWS Step Functions

By Infra IT Consulting · 8 min read

Modern data pipelines are rarely linear. A typical production ETL workflow might fan out across a dozen AWS services, require conditional branching based on data validation results, and demand precise retry logic when downstream APIs throttle or a service returns a transient error. Gluing these steps together with cron jobs and shell scripts works until it doesn’t, and when it fails at 2 AM, the absence of observability makes recovery brutal. AWS Step Functions was designed specifically to solve this class of problem, yet for data engineering teams running workloads on AWS, it remains one of the most underutilised tools in the stack.

What Step Functions Actually Does for Data Teams

Step Functions is a serverless orchestration service that lets you model workflows as state machines using Amazon States Language (ASL), a JSON-based specification (some AWS tooling also accepts the same structure written in YAML). Each “state” in the machine represents a task, a choice, a wait, a parallel branch, a map (fan-out) operation, or a simple pass, succeed, or fail step. The service handles execution history, retries, timeouts, and error catching natively, without you writing any of that logic in application code.

For data engineering, this matters because pipeline orchestration logic is notoriously difficult to maintain when embedded in Lambda functions or Glue job scripts. Step Functions externalises that logic into a declarative, version-controlled definition that you can view visually in the AWS Console, redrive from the point of failure when a Standard Workflow execution fails, and test independently of the compute layer.

Step Functions offers two workflow types: Standard Workflows and Express Workflows. Standard Workflows are durable, can run for up to one year, retain execution history for 90 days after an execution ends, and charge per state transition, making them ideal for long-running ETL jobs where you need a full audit trail. Express Workflows are designed for high-throughput, short-duration pipelines (at most five minutes per execution) and charge per invocation and duration, which suits streaming micro-batch scenarios or lightweight transformation chains.

A Real ETL Orchestration Pattern

Consider a daily ingestion pipeline that pulls data from an external REST API, lands it in S3, runs an AWS Glue job to transform and partition the data, validates row counts against the previous day, and then notifies a downstream team via Amazon SNS. Here is a condensed Step Functions definition illustrating the key structure:

{
  "Comment": "Daily API Ingestion Pipeline",
  "StartAt": "IngestFromAPI",
  "States": {
    "IngestFromAPI": {
      "Type": "Task",
      "Comment": "OutputPath unwraps the Lambda response so later states see the function's own return value",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "api-ingestor",
        "Payload.$": "$"
      },
      "OutputPath": "$.Payload",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "States.TaskFailed"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure"
        }
      ],
      "Next": "RunGlueTransform"
    },
    "RunGlueTransform": {
      "Type": "Task",
      "Comment": "ResultPath keeps the ingestion output and stores the Glue job run details beside it",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "daily-transform-job",
        "Arguments": {
          "--source_date.$": "$.ingestion_date"
        }
      },
      "ResultPath": "$.glue_result",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure"
        }
      ],
      "Next": "ValidateRowCounts"
    },
    "ValidateRowCounts": {
      "Type": "Task",
      "Comment": "The validator is expected to return an object containing validation_passed",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "row-count-validator",
        "Payload.$": "$"
      },
      "OutputPath": "$.Payload",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "NotifyFailure"
        }
      ],
      "Next": "CheckValidation"
    },
    "CheckValidation": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.validation_passed",
          "BooleanEquals": true,
          "Next": "NotifySuccess"
        }
      ],
      "Default": "NotifyFailure"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:ca-central-1:123456789012:pipeline-alerts",
        "Message": "Daily pipeline completed successfully."
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:ca-central-1:123456789012:pipeline-alerts",
        "Message": "Pipeline failed. Check execution history."
      },
      "End": true
    }
  }
}

This definition does several things that would otherwise take a substantial amount of application code to replicate: exponential backoff retries on the Lambda ingestion step, automatic synchronous waiting on the Glue job (the .sync resource suffix pauses the execution until the Glue job run completes), conditional branching based on validation output, and a notification on the success path and on every failure path that is routed to NotifyFailure. A Task state without a Catch block fails the whole execution without sending a notification, so each task that can fail needs its own Catch.
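To make the retry behaviour concrete, here is a minimal sketch of how the wait times grow under the Retry settings used on the ingestion step. The helper name retry_wait_schedule is hypothetical, and the sketch ignores optional Retry fields such as a maximum delay or jitter.

```python
def retry_wait_schedule(interval_seconds, max_attempts, backoff_rate):
    """Return the wait in seconds before each retry attempt.

    The first retry waits interval_seconds; each later retry multiplies
    the previous wait by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


# Values taken from the IngestFromAPI Retry block.
schedule = retry_wait_schedule(interval_seconds=30, max_attempts=3, backoff_rate=2.0)
print(schedule)       # [30.0, 60.0, 120.0]
print(sum(schedule))  # 210.0 seconds of waiting before the Catch block takes over
```

With these settings the ingestion step can spend three and a half minutes waiting between attempts, which is worth factoring into any deadline the pipeline has to meet.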

Parallel Execution and Map States for Fan-Out Workloads

One of Step Functions’ most powerful features for data engineering is the Map state, which lets you iterate over an array of items and process each one in parallel — up to the concurrency limit you specify. This is useful when you need to process multiple partitions, regions, or data sources simultaneously.

Imagine a pipeline that receives a list of S3 object keys and must run a transformation Lambda on each. With a Map state, you define the transformation logic once and Step Functions fans it out automatically, collecting all results back into a single output before proceeding. You control the MaxConcurrency parameter to avoid overwhelming downstream services or hitting Lambda concurrency limits in your AWS account.
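The fan-out described here might look like the following sketch, which builds the Map state as a Python dictionary and prints it as ASL JSON. The function name transform-object, the input field object_keys, the result field transform_results, and the follow-on state AggregateResults are all hypothetical names chosen for illustration.

```python
import json


def build_map_state(function_name, items_path, max_concurrency, next_state):
    """Build an inline Map state that invokes one Lambda per array item."""
    return {
        "Type": "Map",
        "ItemsPath": items_path,             # array to iterate over
        "MaxConcurrency": max_concurrency,   # cap on parallel iterations
        # Wrap each array element in an object before passing it on.
        "ItemSelector": {"object_key.$": "$$.Map.Item.Value"},
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "TransformObject",
            "States": {
                "TransformObject": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": function_name,
                        "Payload.$": "$",
                    },
                    "OutputPath": "$.Payload",
                    "End": True,
                }
            },
        },
        # Collect the per-item results next to the original input.
        "ResultPath": "$.transform_results",
        "Next": next_state,
    }


map_state = build_map_state("transform-object", "$.object_keys", 10, "AggregateResults")
print(json.dumps(map_state, indent=2))
```

Keeping MaxConcurrency as a parameter makes it easy to tune the fan-out per environment without touching the transformation logic.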

The Parallel state serves a different purpose: it runs a fixed set of independent branches simultaneously. A common pattern is running data quality checks and schema validation in parallel, then proceeding only when both complete successfully. The step then takes roughly as long as its slowest branch rather than the sum of all branches, which can cut pipeline latency noticeably compared to sequential execution.
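As a rough sketch of that pattern, the snippet below assembles a Parallel state with two branches. The Lambda names data-quality-checker and schema-validator, the result field checks, and the follow-on state EvaluateChecks are hypothetical.

```python
import json


def lambda_branch(state_name, function_name):
    """Build a single-state branch that invokes one Lambda function."""
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": function_name,
                    "Payload.$": "$",
                },
                "OutputPath": "$.Payload",
                "End": True,
            }
        },
    }


parallel_state = {
    "Type": "Parallel",
    "Branches": [
        lambda_branch("RunDataQualityChecks", "data-quality-checker"),
        lambda_branch("ValidateSchema", "schema-validator"),
    ],
    # The output is an array with one entry per branch, in branch order.
    "ResultPath": "$.checks",
    "Next": "EvaluateChecks",
}

print(json.dumps(parallel_state, indent=2))
```

If either branch fails and the failure is not handled inside the branch, the whole Parallel state fails, so a Catch on the Parallel state itself covers both checks at once.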

Error Handling and Observability

Step Functions provides granular error catching at the state level. Task, Parallel, and Map states can each declare a Catch block that routes to a specific error-handling state based on the error type. AWS-defined error names like States.Timeout, States.TaskFailed, and Lambda.ServiceException, together with the custom error names your own code raises, allow you to distinguish between infrastructure failures and business logic failures and handle them differently: retrying transient issues while immediately alerting on data quality failures.
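The routing idea can be illustrated with a deliberately simplified model of how a Catch list is scanned in order. This is a sketch, not the service's actual matching logic: it only handles exact error names and the States.ALL wildcard, and the error name DataQualityError and the states AlertDataOwners and NotifyOnCall are hypothetical.

```python
def next_state_for_error(catchers, error_name):
    """Return the Next state of the first catcher matching error_name.

    Simplified model: a catcher matches on an exact name or on States.ALL.
    Returns None when nothing matches, in which case the execution fails.
    """
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return catcher["Next"]
    return None


# Specific catchers come first; the States.ALL fallback comes last.
catchers = [
    {"ErrorEquals": ["DataQualityError"], "Next": "AlertDataOwners"},
    {"ErrorEquals": ["States.Timeout"], "Next": "NotifyOnCall"},
    {"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"},
]

print(next_state_for_error(catchers, "DataQualityError"))         # AlertDataOwners
print(next_state_for_error(catchers, "Lambda.ServiceException"))  # NotifyFailure
```

Ordering matters: placing the States.ALL catcher first would swallow every error before the more specific routes were considered.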

Execution history is stored and queryable for Standard Workflows, and is retained for 90 days after an execution ends. Each state transition, input, output, and error is recorded, giving you a complete audit trail. This integrates with Amazon CloudWatch for metrics and alarms, and you can push execution events to CloudWatch Logs for log-based alerting. AWS X-Ray integration provides tracing across the state machine and the X-Ray-enabled services it invokes, such as Lambda functions.

For teams operating pipelines across multiple environments, Step Functions integrates cleanly with Terraform for infrastructure as code, allowing state machine definitions to be versioned, reviewed, and deployed through CI/CD pipelines just like any other infrastructure resource.

Step Functions vs. Managed Airflow (MWAA)

The honest answer is that both tools have a place. Apache Airflow on AWS with MWAA is a better fit when your team already writes Python DAGs, you need complex scheduling with calendar-aware triggers, or you require the rich ecosystem of Airflow providers for third-party integrations. Step Functions wins when you want zero infrastructure overhead, native AWS service integrations without custom operators, and usage-based billing that scales to zero when pipelines are idle.

For many Canadian data teams running AWS-native stacks, Step Functions is the lower-friction choice for orchestrating Glue, Lambda, and EMR workloads. MWAA makes more sense when you’re migrating from on-premises Airflow or when your pipelines involve non-AWS targets like Salesforce, Snowflake, or dbt Cloud.

A practical middle ground: use Step Functions to orchestrate the AWS-native compute layer (Glue jobs, Lambda transforms, EMR steps) and trigger those state machines from an Airflow DAG when you need the scheduling and sensor ecosystem that Airflow provides. The two tools compose well rather than compete.

Practical Considerations and Limits

Step Functions Standard Workflows support up to 25,000 events in a single execution's history. Each state generates several events, so the practical ceiling on state transitions is lower than that figure. This is more than enough for typical ETL pipelines but becomes a constraint for long loops or very large inline Map iterations. The maximum input or output payload for a state is 262,144 bytes (256 KiB) of UTF-8 encoded text; for larger payloads, the pattern is to pass S3 references rather than inline data between states.
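One way to apply the S3 reference pattern is sketched below: measure the serialised payload and hand back either the data itself or a pointer to where it was stored. The upload callable stands in for a real S3 write, and the bucket name, key names, field names, and the 200 KiB safety margin are assumptions made for this example.

```python
import json

# Stay below the 256 KiB quota to leave headroom for other state fields.
SAFE_LIMIT_BYTES = 200 * 1024


def to_state_payload(record, upload, bucket, key, limit=SAFE_LIMIT_BYTES):
    """Return the record inline if small enough, else an S3 reference.

    upload(bucket, key, body) is a caller-supplied function that stores
    the serialised body, for example by writing it to S3.
    """
    body = json.dumps(record)
    size = len(body.encode("utf-8"))  # the quota counts UTF-8 bytes
    if size <= limit:
        return {"inline": record}
    upload(bucket, key, body)
    return {"s3_ref": {"bucket": bucket, "key": key, "size_bytes": size}}


# In-memory stand-in for S3 so the sketch runs without AWS access.
uploaded = {}


def fake_upload(bucket, key, body):
    uploaded[(bucket, key)] = body


small = to_state_payload({"rows": 3}, fake_upload, "my-bucket", "runs/small.json")
large = to_state_payload({"blob": "x" * 300_000}, fake_upload, "my-bucket", "runs/large.json")
print(small)                  # {'inline': {'rows': 3}}
print(sorted(large.keys()))   # ['s3_ref']
```

Downstream states then receive only the bucket and key and read the object themselves, which keeps the state payload small no matter how large the dataset grows.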

Pricing for Standard Workflows is $0.025 per 1,000 state transitions. A pipeline with 20 states running daily makes about 600 transitions in a 30-day month, which costs roughly $0.015 per month (about $0.18 per year), negligible compared to the Glue and compute costs it orchestrates. Express Workflows cost $1.00 per million executions plus duration charges, making them extremely cost-effective for high-frequency micro-batch scenarios.
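The Standard Workflow arithmetic is easy to check. The sketch below assumes one transition per state, no retries, and the per-1,000-transition rate quoted above; the function name is hypothetical.

```python
def standard_workflow_cost(transitions_per_run, runs, price_per_1000=0.025):
    """Estimate Standard Workflow cost in USD for a number of runs."""
    return transitions_per_run * runs * price_per_1000 / 1000


monthly = standard_workflow_cost(transitions_per_run=20, runs=30)
yearly = standard_workflow_cost(transitions_per_run=20, runs=365)
print(f"{monthly:.3f}")  # 0.015
print(f"{yearly:.4f}")   # 0.1825
```

Retries and Map iterations add transitions, so a pipeline that fans out over thousands of items should be estimated from its real transition count rather than its state count.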

Conclusion

AWS Step Functions brings production-grade reliability to data pipeline orchestration without requiring you to build and maintain a workflow engine yourself. Its native integrations with Glue, Lambda, EMR, SNS, and dozens of other AWS services mean you can wire together sophisticated multi-step pipelines entirely in configuration. Built-in retry logic, error catching, parallel execution, and complete execution history eliminate entire categories of operational burden that plague hand-rolled pipeline code.

If your team is building or scaling data pipelines on AWS and spending engineering time on orchestration glue code rather than data transformation logic, Step Functions is worth a serious evaluation. The DataOps practices that make pipelines maintainable and reliable become significantly easier to implement when your orchestration layer gives you visibility and control by default.

Ready to modernise your pipeline orchestration? Contact Infra IT Consulting to discuss how we design and implement Step Functions workflows for data teams across Canada, the UK, and Africa.
