
Using Amazon EventBridge in Data Engineering Workflows

By Infra IT Consulting · April 8, 2024 · 8 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

The traditional model of scheduling data pipelines with cron is simple to understand and trivially easy to implement. It is also fundamentally misaligned with how data actually arrives in most production systems. Data does not arrive at exactly 2:00 AM every night; it arrives when upstream systems produce it, when APIs respond, when files land in S3, when transactions complete. Scheduling pipelines based on time rather than data arrival creates two persistent problems: pipelines that run before the data is ready (producing incomplete or empty results) and pipelines that wait longer than necessary (adding unnecessary latency to downstream consumers).

Amazon EventBridge is the serverless event bus that enables data teams to build truly event-driven pipelines on AWS. It routes events between producers and consumers based on rules you define, without requiring custom polling infrastructure or tight coupling between pipeline components. For data engineering, EventBridge is the connective tissue that makes it possible to trigger Glue jobs when S3 files land, kick off Step Functions executions when database records change, and coordinate across multiple AWS accounts in enterprise data mesh architectures.

EventBridge Fundamentals for Data Engineers

EventBridge operates on three core concepts: event buses, rules, and targets.

An event bus is a stream of events. AWS provides a default event bus that receives events from AWS services (S3, Glue, RDS, Step Functions, and many others), and you can create custom event buses for your application-generated events. In a data mesh architecture, cross-account event buses allow a data product team to publish events that other domain teams subscribe to independently.

Rules are pattern-matching filters that evaluate incoming events and determine which targets to invoke. Rules use JSON pattern matching against the event structure — you can match on event source, event type, and any field within the event detail object. Rules support content-based filtering, meaning you can route events based on the actual values in the event payload, not just the event type.
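To make the matching semantics concrete, here is a small local model of how a pattern is compared against an event. It is a simplified sketch, not EventBridge's implementation: it covers only exact-value lists, nested objects, and the prefix operator, and the function names are hypothetical. Something like it can be handy for unit-testing patterns before deploying a rule.

```python
from typing import Any


def _matches_one(alternative: Any, actual: Any) -> bool:
    """Compare one alternative from a pattern list against the event value."""
    if isinstance(alternative, dict) and "prefix" in alternative:
        return isinstance(actual, str) and actual.startswith(alternative["prefix"])
    return alternative == actual


def matches(pattern: dict, event: dict) -> bool:
    """Return True if `event` satisfies `pattern` (simplified subset only)."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        actual = event[field]
        if isinstance(expected, dict):
            # Nested pattern: descend into the corresponding object in the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_matches_one(alt, actual) for alt in expected):
            # A list holds alternatives; at least one of them must match.
            return False
    return True


# Illustrative pattern and event (values are made up for the example).
example_pattern = {
    "source": ["aws.s3"],
    "detail": {"object": {"key": [{"prefix": "incoming/orders/"}]}},
}
example_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"object": {"key": "incoming/orders/2024-04-08.csv"}},
}
```

Fields that the pattern does not mention (detail-type in this example) are ignored, which is why a narrow pattern can still match a large event.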

Targets are the AWS services or endpoints that EventBridge invokes when a rule matches. For data engineering, the most relevant targets are AWS Lambda, AWS Step Functions (state machines), AWS Glue (workflow execution), Amazon SQS, and Amazon Kinesis Data Streams.

Triggering Glue Jobs on S3 File Arrival

The most common EventBridge pattern in data engineering is S3 event-driven job triggering. When a file lands in an S3 prefix, an EventBridge rule matches the Object Created event and starts the appropriate Step Functions workflow or Glue workflow, which in turn runs the Glue job. EventBridge rules can target Glue workflows and Step Functions state machines, but not an individual Glue job directly. This replaces S3 event notifications directly to Lambda with a more flexible routing layer that can fan out to multiple targets and apply content-based filtering.

First, enable EventBridge notifications on your S3 bucket (this is a bucket-level configuration distinct from the older S3 event notifications feature):

aws s3api put-bucket-notification-configuration \
  --bucket my-raw-data-bucket \
  --notification-configuration '{"EventBridgeConfiguration": {}}'
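One caveat worth handling in code: this API call replaces the bucket's entire notification configuration, so a bucket that already has SQS, SNS, or Lambda notifications would lose them. The sketch below reads the current configuration and adds the EventBridge setting to it. The function names are hypothetical, and boto3 is imported lazily so the merge logic can be exercised without AWS credentials.

```python
def with_eventbridge_enabled(existing: dict) -> dict:
    """Return a notification configuration that keeps existing entries and adds EventBridge."""
    # The response from get_bucket_notification_configuration carries a
    # ResponseMetadata key that is not part of the configuration itself.
    merged = {k: v for k, v in existing.items() if k != "ResponseMetadata"}
    merged["EventBridgeConfiguration"] = {}
    return merged


def enable_eventbridge(bucket: str) -> None:
    """Turn on EventBridge delivery for `bucket` without dropping other notifications."""
    import boto3  # lazy import: only needed when actually calling AWS

    s3 = boto3.client("s3")
    current = s3.get_bucket_notification_configuration(Bucket=bucket)
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=with_eventbridge_enabled(current),
    )
```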

Then create an EventBridge rule that matches S3 object creation events for a specific prefix:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["my-raw-data-bucket"]
    },
    "object": {
      "key": [{"prefix": "incoming/orders/"}]
    }
  }
}

This rule fires whenever any object is created under the incoming/orders/ prefix. The target can be a Step Functions state machine that validates the file, runs a Glue transformation job, and notifies downstream consumers:

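{
  "Id": "orders-pipeline",
  "Arn": "arn:aws:states:ca-central-1:123456789012:stateMachine:orders-pipeline",
  "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeStepFunctionsRole",
  "InputTransformer": {
    "InputPathsMap": {
      "bucket": "$.detail.bucket.name",
      "key": "$.detail.object.key",
      "size": "$.detail.object.size"
    },
    "InputTemplate": "{\"bucket\": <bucket>, \"key\": <key>, \"size\": <size>}"
  }
}

The InputTransformer extracts values from the incoming event with JSONPath expressions (InputPathsMap) and assembles them into the state machine's input (InputTemplate). The .$ suffix familiar from Step Functions state definitions is not valid in an EventBridge target, whose Input field accepts only a constant JSON string. With the transformer in place, your pipeline knows exactly which file triggered it, without any polling or file listing in your pipeline code.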
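For teams that prefer to create the rule and target from code, the following is a minimal sketch using boto3's put_rule and put_targets. The rule name and target Id are hypothetical, the ARNs are passed in by the caller, and the target uses an input transformer to hand the bucket, key, and size to the state machine. The request builders are kept separate from the AWS calls so they can be inspected or tested offline.

```python
import json

RULE_NAME = "orders-file-arrival"  # hypothetical rule name

EVENT_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["my-raw-data-bucket"]},
        "object": {"key": [{"prefix": "incoming/orders/"}]},
    },
}


def build_rule_request() -> dict:
    """Keyword arguments for events.put_rule; EventPattern must be a JSON string."""
    return {
        "Name": RULE_NAME,
        "EventPattern": json.dumps(EVENT_PATTERN),
        "State": "ENABLED",
    }


def build_targets_request(state_machine_arn: str, role_arn: str) -> dict:
    """Keyword arguments for events.put_targets with an input transformer."""
    return {
        "Rule": RULE_NAME,
        "Targets": [
            {
                "Id": "orders-pipeline",  # hypothetical target id
                "Arn": state_machine_arn,
                "RoleArn": role_arn,
                "InputTransformer": {
                    "InputPathsMap": {
                        "bucket": "$.detail.bucket.name",
                        "key": "$.detail.object.key",
                        "size": "$.detail.object.size",
                    },
                    "InputTemplate": '{"bucket": <bucket>, "key": <key>, "size": <size>}',
                },
            }
        ],
    }


def deploy(state_machine_arn: str, role_arn: str) -> None:
    """Create or update the rule and attach the state machine as its target."""
    import boto3  # lazy import: only needed when actually calling AWS

    events = boto3.client("events")
    events.put_rule(**build_rule_request())
    result = events.put_targets(**build_targets_request(state_machine_arn, role_arn))
    if result["FailedEntryCount"] > 0:
        raise RuntimeError(f"Could not attach target: {result['FailedEntries']}")
```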

Custom Events for Application-to-Pipeline Integration

EventBridge is not limited to AWS service events. Your application code can publish custom events to the default event bus or a custom event bus, enabling real-time pipeline triggers based on application state.

Consider an e-commerce platform where a batch of orders is confirmed by the payment processor. The payment service publishes a custom event:

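import boto3
import json
from datetime import datetime, timezone

client = boto3.client('events', region_name='ca-central-1')

# Detail must be a JSON string, so the payload is serialised before sending.
response = client.put_events(
    Entries=[
        {
            'EventBusName': 'data-platform-events',
            'Source': 'com.company.payments',
            'DetailType': 'OrderBatchConfirmed',
            'Detail': json.dumps({
                'batch_id': 'batch_20240408_001',
                'order_count': 1247,
                'total_value_cad': 89432.50,
                'currency': 'CAD',
                'confirmed_at': datetime.now(timezone.utc).isoformat()
            })
        }
    ]
)

# put_events can return successfully while rejecting individual entries,
# so check the per-entry results instead of relying on exceptions alone.
if response['FailedEntryCount'] > 0:
    failed = [entry for entry in response['Entries'] if 'ErrorCode' in entry]
    raise RuntimeError(f'EventBridge rejected {len(failed)} event(s): {failed}')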

An EventBridge rule on the data-platform-events bus matches this event and triggers a Lambda function that:

Writes the batch metadata to a DynamoDB tracking table
Triggers an AWS Glue job to process the order batch
Publishes a BatchProcessingStarted event for downstream consumers

This pattern decouples the payment service from the data platform entirely. The payment service does not know or care whether the data platform is running Glue, Lambda, or Spark — it publishes an event and moves on.
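The three steps the Lambda function performs could be sketched as follows. The table name, Glue job name, job argument, and the com.company.dataplatform source are assumptions made for illustration. The AWS clients are passed in as parameters so the core logic can be tested with stand-ins.

```python
import json
from decimal import Decimal

TRACKING_TABLE = "batch-tracking"      # hypothetical DynamoDB table
GLUE_JOB_NAME = "process-order-batch"  # hypothetical Glue job
EVENT_BUS = "data-platform-events"


def handle_batch_confirmed(event: dict, table, glue, events) -> str:
    """Track the batch, start the Glue job, and announce that processing began."""
    detail = event["detail"]
    batch_id = detail["batch_id"]

    # 1. Record the batch. The DynamoDB resource API rejects Python floats,
    #    so the monetary value is converted to Decimal first.
    table.put_item(
        Item={
            "batch_id": batch_id,
            "order_count": detail["order_count"],
            "total_value_cad": Decimal(str(detail["total_value_cad"])),
            "status": "PROCESSING",
        }
    )

    # 2. Start the Glue job, passing the batch id as a job argument.
    run = glue.start_job_run(JobName=GLUE_JOB_NAME, Arguments={"--batch_id": batch_id})

    # 3. Tell downstream consumers that processing has started.
    events.put_events(
        Entries=[
            {
                "EventBusName": EVENT_BUS,
                "Source": "com.company.dataplatform",  # hypothetical source name
                "DetailType": "BatchProcessingStarted",
                "Detail": json.dumps(
                    {"batch_id": batch_id, "glue_job_run_id": run["JobRunId"]}
                ),
            }
        ]
    )
    return run["JobRunId"]


def lambda_handler(event, context):
    """Lambda entry point: wires real AWS clients into the handler above."""
    import boto3  # lazy import: only needed when running inside Lambda

    table = boto3.resource("dynamodb").Table(TRACKING_TABLE)
    return handle_batch_confirmed(
        event, table, boto3.client("glue"), boto3.client("events")
    )
```

Because EventBridge can deliver an event more than once, a production version would likely make the DynamoDB write conditional on the batch_id not already existing.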

EventBridge Pipes for Streamlined Source-to-Target Connections

EventBridge Pipes (launched in 2022) simplifies point-to-point event routing by providing a managed, enrichment-capable connection between a source and a target without requiring separate SQS queues or Lambda functions as intermediaries. For data engineering, Pipes can connect:

DynamoDB Streams → Kinesis Data Streams: Capture DynamoDB change events and stream them to Kinesis for real-time processing without a custom Lambda consumer
SQS → Step Functions: Route file processing events from SQS directly to a Step Functions workflow with optional Lambda enrichment
Kinesis → EventBridge custom bus: Fan out Kinesis events to multiple EventBridge targets based on content filtering

A Pipe includes an optional filter (equivalent to an EventBridge rule pattern), an optional enrichment Lambda function that adds context to the event before delivery, and a target that receives the processed event. The enrichment step is particularly useful in data pipelines where you need to look up metadata (partition scheme, schema version, data owner) before routing to the appropriate downstream processor.
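As an illustration of the enrichment step, here is a sketch of an enrichment Lambda for a Pipe with an SQS source. It assumes each message body is a JSON object with a key field naming the S3 object, and it uses a hard-coded lookup table in place of a real metadata store; both are assumptions for the example. Pipes invokes the enrichment with a batch (a list) of source records, and whatever the function returns is what the target receives.

```python
import json

# Hypothetical metadata lookup; in practice this might be a DynamoDB table
# or a data catalog query.
DATASET_METADATA = {
    "incoming/orders/": {
        "schema_version": "v3",
        "data_owner": "orders-team",
        "partition_scheme": "dt",
    },
}


def lookup_metadata(key: str) -> dict:
    """Return metadata for the first registered prefix that the key starts with."""
    for prefix, metadata in DATASET_METADATA.items():
        if key.startswith(prefix):
            return metadata
    return {}


def lambda_handler(records, context):
    """Enrich each SQS record in the batch and return one item per record."""
    enriched = []
    for record in records:
        body = json.loads(record["body"])  # SQS delivers the body as a string
        enriched.append({**body, "metadata": lookup_metadata(body["key"])})
    return enriched
```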

Schema Registry for Event Governance

EventBridge includes a Schema Registry that automatically discovers and documents the schema of events flowing through your event buses. For data engineering teams, this provides:

Schema discovery: EventBridge samples events and infers JSON schemas automatically, with schemas versioned in the registry
Code binding generation: Generate typed Python, Java, or TypeScript code from event schemas, eliminating manual schema maintenance
Schema validation: Validate event payloads against registered schemas in your publisher code before sending

In a data mesh context, the Schema Registry becomes the contract between data producers and consumers. A producer team registers the schema for their OrderBatchConfirmed event; consumer teams generate typed bindings from the registry and are notified when schemas change. This governance layer prevents the schema drift that commonly breaks event-driven pipelines.
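Publisher-side validation might look like the sketch below. The schema here is a hand-written stand-in for one that would normally be fetched from the registry, and the checker covers only required fields and basic types. It is not a full JSON Schema implementation; a dedicated library such as jsonschema would be the usual choice in production.

```python
# Hand-written stand-in for the registered OrderBatchConfirmed schema.
ORDER_BATCH_CONFIRMED_SCHEMA = {
    "required": ["batch_id", "order_count", "total_value_cad", "currency", "confirmed_at"],
    "properties": {
        "batch_id": {"type": "string"},
        "order_count": {"type": "integer"},
        "total_value_cad": {"type": "number"},
        "currency": {"type": "string"},
        "confirmed_at": {"type": "string"},
    },
}

_PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float)}


def validation_errors(detail: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the payload passes."""
    errors = [
        f"missing required field: {name}"
        for name in schema["required"]
        if name not in detail
    ]
    for name, spec in schema["properties"].items():
        if name not in detail:
            continue
        value = detail[name]
        expected = _PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
    return errors
```

A publisher would call validation_errors on the detail payload and refuse to send the event if the list is non-empty, so malformed events never reach the bus.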

Cross-Account Event Routing for Data Mesh Architectures

Enterprise data platforms frequently span multiple AWS accounts — a data mesh pattern where each domain owns its infrastructure and publishes data events to a central event bus. EventBridge supports cross-account event routing with a resource-based policy on the receiving event bus.

The central data platform account creates a custom event bus and grants specific source accounts permission to publish:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDomainAccounts",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::111111111111:root",
          "arn:aws:iam::222222222222:root"
        ]
      },
      "Action": "events:PutEvents",
      "Resource": "arn:aws:events:ca-central-1:999999999999:event-bus/central-data-platform"
    }
  ]
}

Domain account applications then publish events directly to the central bus ARN. Rules on the central bus route events to the appropriate processing pipelines based on the Source field (which, by convention, includes the domain identifier: com.company.orders, com.company.inventory, etc.).
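From a domain account, publishing to the central bus could be sketched like this. The bus ARN uses a placeholder account ID, and the helper names are hypothetical. The entry builder follows the com.company.<domain> source convention described above, and routing_pattern shows the matching rule pattern the central account might use for one domain.

```python
import json

# Placeholder ARN for the central bus; substitute the real account id.
CENTRAL_BUS_ARN = (
    "arn:aws:events:ca-central-1:999999999999:event-bus/central-data-platform"
)


def build_entry(domain: str, detail_type: str, detail: dict) -> dict:
    """Build a put_events entry addressed to the central bus by ARN."""
    return {
        "EventBusName": CENTRAL_BUS_ARN,  # an ARN is accepted in place of a name
        "Source": f"com.company.{domain}",
        "DetailType": detail_type,
        "Detail": json.dumps(detail),
    }


def routing_pattern(domain: str) -> dict:
    """Event pattern for a central-bus rule that selects one domain's events."""
    return {"source": [f"com.company.{domain}"]}


def publish(entry: dict) -> None:
    """Send one entry and fail loudly if EventBridge rejects it."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("events", region_name="ca-central-1")
    response = client.put_events(Entries=[entry])
    if response["FailedEntryCount"] > 0:
        raise RuntimeError(f"Central bus rejected the event: {response['Entries']}")
```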

This pattern pairs naturally with implementing a data mesh on AWS, where domain teams need to publish data events without coupling to centralised platform infrastructure.

EventBridge vs. SNS and SQS for Pipeline Triggering

EventBridge, SNS, and SQS all move events between producers and consumers, but they serve different roles. EventBridge is designed for content-based routing with structured JSON events and schema governance. SNS is designed for fan-out notifications where all subscribers receive all messages (with basic filter policies). SQS is designed for durable queuing where consumers pull messages at their own pace.

For pipeline triggering use cases:

Use EventBridge when you need content-based routing, schema registry integration, or cross-account event distribution
Use SQS when you need durable buffering to handle bursts of file arrival events that your pipeline can’t process immediately
Use SNS when you need simple fan-out to multiple Lambda functions or HTTP endpoints without content-based filtering

Many production data architectures use all three: EventBridge routes S3 events to an SQS queue, a Lambda function polls SQS and processes batches, and the Lambda publishes completion events back to EventBridge for downstream notification.
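The SQS-consuming Lambda in that architecture could be sketched as follows. It assumes the EventBridge rule delivers the full S3 event to the queue without an input transformer, so each message body is the event JSON, and that the event source mapping has partial batch responses (ReportBatchItemFailures) enabled. The process_file function is a placeholder for the real work.

```python
import json


def process_file(bucket: str, key: str) -> None:
    """Placeholder for the real processing; rejects obviously bad input."""
    if not bucket or not key:
        raise ValueError("bucket and key are required")


def lambda_handler(event, context):
    """Process each SQS record and report only the failed ones for retry."""
    failures = []
    for record in event["Records"]:
        try:
            eb_event = json.loads(record["body"])  # the EventBridge event as JSON
            detail = eb_event["detail"]
            process_file(detail["bucket"]["name"], detail["object"]["key"])
        except Exception:
            # Catch broadly so one bad message does not fail the whole batch;
            # SQS redelivers only the messages listed here.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list marks the whole batch as processed, while listed messages go back to the queue and eventually to a dead-letter queue if one is configured.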

The decoupling pattern with SNS and SQS covers the SQS/SNS side of this architecture in detail.

Conclusion

Amazon EventBridge enables data engineering teams to move beyond scheduled polling to truly reactive, event-driven pipelines. Files trigger jobs immediately upon arrival. Application events start processing without polling delays. Cross-account data sharing becomes a matter of publishing and subscribing rather than building bespoke integration APIs. The Schema Registry provides the governance layer that makes event-driven architectures maintainable as they scale.

The architectural shift from cron-scheduled to event-driven pipelines reduces latency, eliminates empty job runs, and decouples pipeline components in ways that make each component independently testable and deployable.

If you are building or modernising a data platform on AWS and want to evaluate whether an event-driven architecture is the right fit for your workload patterns, contact Infra IT Consulting for an architecture review.
