Decoupling Data Pipelines with AWS SNS and SQS
Tightly coupled data pipelines are fragile. When component A writes directly to component B (a Lambda function directly invoking another Lambda, a Glue job directly calling a downstream API, a batch process directly inserting into a database), you introduce a synchronous dependency that propagates failures instantly and makes independent scaling impossible. If component B is temporarily unavailable, component A fails. If component B is slow, component A blocks. If component A produces data faster than component B can consume it, records are lost.
The solution is decoupling: inserting an intermediary between producers and consumers that absorbs rate mismatches, isolates failures, and enables independent evolution of each component. Amazon SQS and Amazon SNS are the two foundational AWS services for this pattern. Understanding when and how to use each, and how to combine them effectively in data engineering architectures, is a core skill for building production-grade pipelines on AWS.
SQS vs. SNS: The Fundamental Distinction
Amazon SQS is a durable message queue. Producers write messages to a queue; consumers poll the queue and process messages at their own pace. Each message is handled by one consumer at a time: standard queues guarantee at-least-once delivery (so occasional duplicates are possible), while FIFO queues add ordering and exactly-once processing. SQS provides durable buffering: if your consumer is down for an hour, messages accumulate in the queue and are processed when the consumer recovers.
Amazon SNS is a pub/sub notification service. Publishers write messages to a topic; SNS fans out the message to all subscribers simultaneously. Each subscriber receives a copy of every message. SNS does not buffer messages for consumers to read later: it pushes each message to every subscriber, and a subscriber that stays unavailable beyond the delivery retries misses the message. Subscribing an SQS queue closes that gap, because the queue buffers its copy of the fan-out.
The practical implication: use SQS for point-to-point buffering with load levelling. Use SNS for fan-out to multiple independent consumers. Combine both (the SNS-to-SQS fan-out pattern) when you need fan-out with durable delivery to each consumer.
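To make the difference in delivery semantics concrete, here is a toy in-memory model. It is only an illustration: `ToyQueue` and `ToyTopic` are hypothetical classes, not AWS APIs, and they ignore retries, visibility timeouts, and persistence.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for an SQS queue: messages wait until a consumer takes them."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Each message is handed to a single consumer and then removed.
        return self._messages.popleft() if self._messages else None

    def depth(self):
        return len(self._messages)


class ToyTopic:
    """In-memory stand-in for an SNS topic: every subscriber gets its own copy."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, deliver):
        # `deliver` is any callable that accepts a message body.
        self._subscribers.append(deliver)

    def publish(self, body):
        for deliver in self._subscribers:
            deliver(body)


# Fan-out with buffering: each consumer owns a queue subscribed to the topic.
topic = ToyTopic()
transform_queue, catalog_queue = ToyQueue(), ToyQueue()
topic.subscribe(transform_queue.send)
topic.subscribe(catalog_queue.send)

topic.publish("incoming/orders/2024-01-01.parquet")

# Draining one queue leaves the other consumer's copy untouched.
first = transform_queue.receive()
print(first, transform_queue.depth(), catalog_queue.depth())
```

One publish produces one buffered copy per queue, and each consumer drains its own copy independently.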
The SNS-to-SQS Fan-Out Pattern for Data Pipelines
The canonical decoupling pattern in AWS data architectures combines SNS and SQS: a single event publisher writes to an SNS topic, and multiple SQS queues subscribe to the topic. Each SQS queue feeds a different downstream consumer. When a new S3 file lands, the ingestion system publishes one event to SNS, and multiple independent consumers (a transformation job, a monitoring system, a notification service) each receive their own copy via their dedicated SQS queue.
S3 Object Created → SNS Topic
  ├─ SQS Queue A → Lambda: Run Glue transformation
  ├─ SQS Queue B → Lambda: Update data catalog metadata
  └─ SQS Queue C → Lambda: Send Slack notification
The benefit is that each consumer operates completely independently. If the Glue transformation Lambda is throttled and messages back up in Queue A, Queue B and Queue C are unaffected. If you add a fourth consumer later, you add a new SQS subscription to the SNS topic without changing the producer or any existing consumer.
Configuring this with Terraform:
resource "aws_sns_topic" "file_arrival_events" {
  name = "data-file-arrival-events"
}

resource "aws_sqs_queue" "transform_queue" {
  name                       = "file-transform-queue"
  visibility_timeout_seconds = 300   # Must be >= Lambda timeout
  message_retention_seconds  = 86400 # 24 hours
  receive_wait_time_seconds  = 20    # Long polling

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.transform_dlq.arn
    maxReceiveCount     = 3
  })
}

resource "aws_sqs_queue" "transform_dlq" {
  name                      = "file-transform-dlq"
  message_retention_seconds = 1209600 # 14 days
}

resource "aws_sns_topic_subscription" "transform_subscription" {
  topic_arn = aws_sns_topic.file_arrival_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.transform_queue.arn

  filter_policy = jsonencode({
    file_type = ["parquet", "csv"]
    region    = ["ontario", "alberta", "bc"]
  })
}

# Allow SNS to write to SQS
resource "aws_sqs_queue_policy" "transform_queue_policy" {
  queue_url = aws_sqs_queue.transform_queue.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { Service = "sns.amazonaws.com" }
        Action    = "sqs:SendMessage"
        Resource  = aws_sqs_queue.transform_queue.arn
        Condition = {
          ArnEquals = {
            "aws:SourceArn" = aws_sns_topic.file_arrival_events.arn
          }
        }
      }
    ]
  })
}
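The filter_policy on the SNS subscription is important: it ensures that Queue A only receives events matching the specified criteria (Parquet or CSV files from specific regions), rather than every event published to the topic. By default, SNS evaluates filter policies against message attributes, so the publisher must set file_type and region as attributes on each message (or the subscription must set filter_policy_scope to MessageBody to match on the payload instead). This content-based routing at the SNS level spares consumers from receiving irrelevant messages that they would otherwise have to filter in application code.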
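For the filter policy to match, the publisher has to attach the attributes it filters on. The following is a minimal sketch of the publisher side, assuming attribute-based filtering. The helper names `build_publish_kwargs` and `matches_filter_policy`, the message payload shape, and the idea of deriving file_type from the key's extension are all assumptions for illustration. The local matcher covers only exact string matching, not the full SNS filter policy syntax.

```python
import json
import posixpath


def build_publish_kwargs(topic_arn, bucket, key, region):
    """Build keyword arguments for an SNS publish call with filterable attributes."""
    # Assumption: the file type is taken from the object key's extension.
    extension = posixpath.splitext(key)[1].lstrip(".").lower()
    file_type = extension or "unknown"
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps({"bucket": bucket, "key": key}),
        "MessageAttributes": {
            "file_type": {"DataType": "String", "StringValue": file_type},
            "region": {"DataType": "String", "StringValue": region},
        },
    }


def matches_filter_policy(filter_policy, message_attributes):
    """Simplified local check: every policy key must match one allowed string value."""
    for name, allowed_values in filter_policy.items():
        attribute = message_attributes.get(name)
        if attribute is None or attribute["StringValue"] not in allowed_values:
            return False
    return True


policy = {"file_type": ["parquet", "csv"], "region": ["ontario", "alberta", "bc"]}
kwargs = build_publish_kwargs(
    "arn:aws:sns:ca-central-1:123456789012:data-file-arrival-events",  # placeholder ARN
    "example-bucket",
    "incoming/orders/part-0001.parquet",
    "ontario",
)
print(matches_filter_policy(policy, kwargs["MessageAttributes"]))
```

With boto3 available, the dictionary can be passed straight through as `boto3.client("sns").publish(**kwargs)`. The local matcher is only a convenience for unit tests; SNS itself performs the real filtering.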
S3 Event Notifications with SQS for File-Triggered Pipelines
The most common data engineering use of SQS is receiving S3 event notifications and triggering downstream processing. When files land in S3, S3 publishes an event notification to SQS; a Lambda function (or an EC2 worker) polls the queue, processes each event, and triggers the appropriate Glue job or Step Functions workflow.
import json
from urllib.parse import unquote_plus

import boto3


def lambda_handler(event, context):
    """
    Lambda triggered by SQS. Each SQS record contains an S3 event notification.
    """
    glue_client = boto3.client('glue', region_name='ca-central-1')
    failed_messages = []

    for record in event['Records']:
        try:
            # SQS message body contains the S3 event
            s3_event = json.loads(record['body'])

            # Handle SNS-wrapped messages (SNS-to-SQS delivery adds an extra envelope)
            if 'Message' in s3_event:
                s3_event = json.loads(s3_event['Message'])

            for s3_record in s3_event.get('Records', []):
                bucket = s3_record['s3']['bucket']['name']
                # Object keys arrive URL-encoded in S3 event notifications
                key = unquote_plus(s3_record['s3']['object']['key'])
                size_bytes = s3_record['s3']['object']['size']

                # Skip empty files (S3 sometimes fires events for 0-byte markers)
                if size_bytes == 0:
                    continue

                # Determine which Glue job to trigger based on key prefix
                if key.startswith('incoming/orders/'):
                    job_name = 'orders-transform'
                elif key.startswith('incoming/customers/'):
                    job_name = 'customers-transform'
                else:
                    print(f"No handler for key: {key}")
                    continue

                # Start the Glue job
                response = glue_client.start_job_run(
                    JobName=job_name,
                    Arguments={
                        '--source_bucket': bucket,
                        '--source_key': key,
                        '--file_size': str(size_bytes)
                    }
                )
                print(f"Started {job_name} run: {response['JobRunId']} for {key}")

        except Exception as e:
            print(f"Failed to process SQS record: {str(e)}")
            # Return message ID to SQS batch failure report
            failed_messages.append({"itemIdentifier": record['messageId']})

    # Report partial batch failures: only failed messages will be retried
    return {"batchItemFailures": failed_messages}
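The batchItemFailures return pattern is critical for production SQS Lambda integrations. Without it, if any message in a batch fails, the entire batch is retried, including messages that were processed successfully. With partial batch failure reporting, only the failed messages are retried, preventing duplicate processing of successful messages. The response is honoured only when ReportBatchItemFailures is enabled in the event source mapping's function response types; without that setting, Lambda ignores the return value and, because the handler catches its own exceptions, treats the whole batch as successful.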
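The partial-failure pattern can be isolated from the Glue-specific logic so that it is easy to test without AWS. The sketch below assumes a hypothetical `process_record` callable supplied by the caller. The names `process_batch` and `fail_on_bad_json` are illustrative and are not part of any AWS SDK.

```python
import json


def process_batch(event, process_record):
    """Run process_record on every SQS record and report only the failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(record)
        except Exception as exc:  # broad on purpose: any failure should trigger a retry
            print(f"Record {record['messageId']} failed: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def fail_on_bad_json(record):
    """Hypothetical processor: raises if the message body is not valid JSON."""
    json.loads(record["body"])


sample_event = {
    "Records": [
        {"messageId": "m-1", "body": '{"ok": true}'},
        {"messageId": "m-2", "body": "not json"},
    ]
}
result = process_batch(sample_event, fail_on_bad_json)
print(result)  # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
```

Only the second message is reported, so only that message returns to the queue for another attempt.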
Visibility Timeout: The Most Misunderstood SQS Setting
The visibility timeout is the period during which a received message is hidden from other consumers while it is being processed. If processing completes within the timeout, the consumer deletes the message. If processing does not complete (because the consumer crashed or exceeded the timeout), the message becomes visible again and is re-delivered for retry.
The most common SQS misconfiguration is setting a visibility timeout shorter than the actual processing time. If your Lambda function processes a message in 4 minutes but the visibility timeout is 30 seconds, SQS will re-deliver the message (because it hasn't been deleted) while the original Lambda invocation is still processing it. You end up with multiple concurrent executions processing the same file, typically causing corrupted output or Glue job conflicts.
Rules of thumb:
- Set the SQS visibility timeout to at least 6x the maximum expected processing time
- For Lambda triggers, the visibility timeout must be at least the Lambda function timeout; AWS recommends at least 6x the function timeout, which leaves room for retries when invocations are throttled
- For Glue job triggers, the SQS message should remain invisible until the Glue job completes: either use a Step Functions wrapper that deletes the message after confirmation, or set a very long visibility timeout (up to the 12-hour maximum)
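The rules of thumb above can be turned into a small helper for reviewing queue settings. This is a sketch under stated assumptions. The function names are hypothetical. The 6x multiplier follows the AWS guidance for Lambda triggers, and adding the batching window is an assumption based on the event source mapping's MaximumBatchingWindowInSeconds setting.

```python
SQS_MAX_VISIBILITY_TIMEOUT_SECONDS = 12 * 60 * 60  # SQS upper limit: 12 hours


def recommended_visibility_timeout(function_timeout_seconds,
                                   batch_window_seconds=0,
                                   multiplier=6):
    """Return a visibility timeout (seconds) following the 6x rule of thumb."""
    if function_timeout_seconds <= 0:
        raise ValueError("function_timeout_seconds must be positive")
    candidate = multiplier * function_timeout_seconds + batch_window_seconds
    # Never exceed what SQS accepts.
    return min(candidate, SQS_MAX_VISIBILITY_TIMEOUT_SECONDS)


def is_visibility_timeout_safe(visibility_timeout_seconds, function_timeout_seconds):
    """Minimum requirement: the message stays hidden for a full function run."""
    return visibility_timeout_seconds >= function_timeout_seconds


print(recommended_visibility_timeout(50))      # 300
print(recommended_visibility_timeout(300))     # 1800
print(is_visibility_timeout_safe(30, 240))     # False
```

Read this way, a 300-second visibility timeout satisfies the 6x guideline only for functions with a timeout of 50 seconds or less. A function that is allowed to run for 5 minutes would call for a 30-minute visibility timeout.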
Dead-Letter Queues for Operational Resilience
A dead-letter queue (DLQ) captures messages that have been received and failed processing more than maxReceiveCount times. Without a DLQ, poison messages (malformed records, files that consistently cause processing errors) keep cycling through your queue until the retention period expires, consuming processing capacity and, in FIFO queues, blocking the messages behind them.
Configure DLQ alerting in CloudWatch:
# CloudWatch alarm on DLQ depth
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='ca-central-1')

cloudwatch.put_metric_alarm(
    AlarmName='transform-dlq-depth',
    AlarmDescription='Messages in transform DLQ require investigation',
    MetricName='ApproximateNumberOfMessagesVisible',
    Namespace='AWS/SQS',
    Statistic='Maximum',
    Dimensions=[{
        'Name': 'QueueName',
        'Value': 'file-transform-dlq'
    }],
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    # Placeholder account ID (AWS account IDs have 12 digits)
    AlarmActions=['arn:aws:sns:ca-central-1:123456789012:ops-alerts'],
    TreatMissingData='notBreaching'
)
DLQ messages should never be silently discarded. Set the DLQ message retention to 14 days (the maximum) to give your team time to investigate and replay messages after fixing the underlying issue.
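Once the underlying issue is fixed, the messages can be replayed with the SQS redrive API. The sketch below wraps the `start_message_move_task` call. The wrapper name `start_dlq_redrive` and its parameters are assumptions for illustration, and the client is passed in so the function can be exercised with a stub.

```python
def start_dlq_redrive(sqs_client, dlq_arn, destination_arn=None, max_per_second=None):
    """Start moving messages out of a DLQ and return the task handle.

    sqs_client is expected to behave like boto3.client("sqs").
    If destination_arn is omitted, SQS sends each message back to the
    queue it originally came from.
    """
    kwargs = {"SourceArn": dlq_arn}
    if destination_arn is not None:
        kwargs["DestinationArn"] = destination_arn
    if max_per_second is not None:
        # Throttle the replay so the consumer is not flooded.
        kwargs["MaxNumberOfMessagesPerSecond"] = max_per_second
    response = sqs_client.start_message_move_task(**kwargs)
    return response["TaskHandle"]


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   sqs = boto3.client("sqs", region_name="ca-central-1")
#   start_dlq_redrive(
#       sqs,
#       "arn:aws:sqs:ca-central-1:123456789012:file-transform-dlq",
#       max_per_second=10,
#   )
```

Capping the replay rate is a deliberate choice here: a burst of replayed messages competes with live traffic for the same consumer capacity.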
SQS vs. EventBridge for Pipeline Triggering
SQS and Amazon EventBridge both move events between pipeline components, but serve different roles. SQS is best when:
- You need durable buffering to absorb rate mismatches between producers and consumers
- Processing order matters (FIFO queues)
- You need at-least-once delivery with controllable retry behaviour
- A single consumer processes each message
EventBridge is better when:
- You need content-based routing to multiple different consumers based on event attributes
- You want a schema registry and event discovery
- You need cross-account event distribution
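- You need near-real-time routing rather than buffering (EventBridge pushes events to targets and retries failed deliveries for a limited time, but it is not a queue that consumers drain at their own pace)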
In practice, the two services compose well. EventBridge routes events to SQS queues, which buffer the events for consumer processing. The DataOps practices that enable reliable pipeline operations often involve both: EventBridge for intelligent routing and SQS for durable delivery.
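As a sketch of that composition, the helper below builds the arguments for an EventBridge rule that matches S3 "Object Created" events under a key prefix and targets an SQS queue. It makes several assumptions: the bucket has EventBridge notifications enabled, the queue policy allows events.amazonaws.com to send messages, and the rule name, target ID, bucket name, and function names are hypothetical.

```python
import json


def s3_object_created_pattern(bucket, key_prefix):
    """Event pattern matching S3 object-created events under a key prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


def build_rule_and_target(rule_name, bucket, key_prefix, queue_arn):
    """Return keyword arguments for the EventBridge put_rule and put_targets calls."""
    put_rule_kwargs = {
        "Name": rule_name,
        "EventPattern": json.dumps(s3_object_created_pattern(bucket, key_prefix)),
        "State": "ENABLED",
    }
    put_targets_kwargs = {
        "Rule": rule_name,
        "Targets": [{"Id": "transform-queue", "Arn": queue_arn}],
    }
    return put_rule_kwargs, put_targets_kwargs


rule_kwargs, target_kwargs = build_rule_and_target(
    "orders-file-arrival",                                          # hypothetical rule name
    "example-bucket",                                               # hypothetical bucket
    "incoming/orders/",
    "arn:aws:sqs:ca-central-1:123456789012:file-transform-queue",   # placeholder ARN
)
print(rule_kwargs["EventPattern"])
```

With boto3, the two dictionaries would be passed to `events.put_rule(**rule_kwargs)` and `events.put_targets(**target_kwargs)` on a `boto3.client("events")` client. EventBridge then handles the prefix-based routing, and the queue handles buffering and retries.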
Conclusion
SQS and SNS are foundational building blocks that transform brittle, tightly-coupled data pipelines into resilient, independently scalable systems. The SNS fan-out pattern enables a single data event to trigger multiple independent consumers without coupling them. SQS dead-letter queues and visibility timeout configuration ensure that processing failures are captured and recoverable rather than silently lost. Partial batch failure reporting on Lambda-SQS integrations prevents duplicate processing while enabling clean retry semantics.
These are not advanced architectural patterns; they are table stakes for production data engineering on AWS. Pipelines built without them will fail in predictable ways as data volumes grow and failure scenarios multiply.
If you are designing or reviewing a data pipeline architecture on AWS and want an independent assessment of your decoupling and resilience strategy, contact Infra IT Consulting for an architecture review.