Applying the AWS Well-Architected Framework to Data Workloads
The AWS Well-Architected Framework is one of the most practical architectural reference documents available to cloud engineering teams. Its five original pillars (Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization) provide a structured vocabulary for evaluating trade-offs and catching design debt before it becomes operational debt; a sixth pillar, Sustainability, was added in 2021. Yet most data engineering teams encounter the Framework only during a formal Well-Architected Review, typically triggered by a compliance requirement or a post-incident review.
This post applies the five original pillars specifically to data workloads: the patterns, pitfalls, and concrete recommendations that matter most for data lakes, data pipelines, and analytical systems on AWS.
Pillar 1: Operational Excellence for Data Teams
Operational Excellence in data engineering centres on observability, runbook-driven operations, and automated response to failure. The Framework’s guidance on “performing operations as code” is particularly relevant: data pipelines are code, and their operational procedures should be too.
Pipeline observability. Every data pipeline should emit structured logs and metrics that make failures diagnosable without manual log trawling. With job metrics enabled, AWS Glue publishes job-level metrics to Amazon CloudWatch, including elapsed time, bytes read and written, executor counts and memory usage, and failed task counts. AWS Step Functions provide execution history and visual debugging for orchestrated workflows. Set up CloudWatch Alarms on job failure and anomalous execution duration.
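One common complement to a metric alarm is an Amazon EventBridge rule that matches Glue job state-change events and forwards them to a notification target. The sketch below only builds the event pattern; the job name is hypothetical, and attaching the pattern to a rule and a target (for example an SNS topic) is left out.

```python
import json


def glue_failure_event_pattern(job_names):
    """Build an EventBridge event pattern matching unsuccessful Glue job runs."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            # Only alert on the jobs this team owns.
            "jobName": sorted(job_names),
            # Every terminal state other than SUCCEEDED.
            "state": ["FAILED", "TIMEOUT", "STOPPED"],
        },
    }


# "transactions-daily" is a hypothetical job name used for illustration.
pattern = glue_failure_event_pattern(["transactions-daily"])
print(json.dumps(pattern, indent=2))
```

Matching on state-change events catches failures that never emit a metric datapoint, such as a job that fails during startup.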
Idempotent pipeline design. Data pipelines fail. The question is not whether a pipeline will fail mid-run, but whether you can safely re-run it after failure without duplicating data or corrupting state. Design every pipeline stage to be idempotent: if you write Parquet files to S3 partitioned by date, a re-run that overwrites the same S3 prefix produces the same result as the first run. Avoid patterns that append to existing files or accumulate state in external tables without idempotent overwrite logic.
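A minimal sketch of the overwrite pattern described above, using an in-memory dictionary as a stand-in for an S3 bucket. The function names, the dt= partition layout, and the file naming scheme are assumptions for illustration; a real job would use its engine's partition-overwrite mode or the S3 API.

```python
from datetime import date


def partition_prefix(base, run_date):
    """Return the prefix that holds all output for one run date."""
    return f"{base}/dt={run_date.isoformat()}/"


def write_partition(store, base, run_date, rows, rows_per_file=2):
    """Replace one date partition in full, so re-runs converge on the same state."""
    prefix = partition_prefix(base, run_date)
    # Drop whatever a previous (possibly partial) run left under this prefix.
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]
    # Deterministic file names: the same input always produces the same keys.
    for i in range(0, len(rows), rows_per_file):
        key = f"{prefix}part-{i // rows_per_file:05d}.parquet"
        store[key] = list(rows[i:i + rows_per_file])
    return prefix


store = {}  # stands in for an S3 bucket: key -> object contents
rows = [{"id": 1}, {"id": 2}, {"id": 3}]

write_partition(store, "processed/transactions", date(2024, 1, 15), rows)
first_run = dict(store)

# Simulate a retry after failure: the second run leaves the store unchanged.
write_partition(store, "processed/transactions", date(2024, 1, 15), rows)
print(store == first_run)
```

The two properties that matter are that each run owns a whole prefix and that file names do not depend on run time or a random suffix.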
Runbooks in version control. Every scheduled job should have a runbook that describes its purpose, its dependencies, its SLA, and the steps to diagnose and resolve common failures. Store runbooks alongside the pipeline code in the same Git repository. When a job fails at 3am, the on-call engineer should be able to find the runbook in the same place as the code.
Pillar 2: Security for Data Workloads
Security in data engineering has a specific character: data is the asset, and the attack surface extends across S3 buckets, Glue jobs, Redshift clusters, RDS instances, and the IAM roles that connect them. The Framework’s security pillar emphasises least-privilege access, encryption at rest and in transit, and audit logging.
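Least-privilege IAM for pipeline roles. Every Glue job, Lambda function, and EMR cluster should run under a dedicated IAM role with only the permissions it needs. Avoid sharing execution roles across pipelines: a compromise of one pipeline’s role should not grant access to unrelated data. A Glue job that reads from one S3 prefix and writes to another needs s3:GetObject on the source prefix and s3:PutObject on the destination prefix. In practice it also needs s3:ListBucket on the bucket to enumerate the source objects (ideally restricted to the source prefix with an s3:prefix condition), and s3:DeleteObject on the destination prefix if re-runs replace earlier output.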
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListSourcePrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-data-lake",
      "Condition": {
        "StringLike": { "s3:prefix": "raw/transactions/*" }
      }
    },
    {
      "Sid": "ReadSource",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-data-lake/raw/transactions/*"
    },
    {
      "Sid": "WriteProcessed",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-data-lake/processed/transactions/*"
    }
  ]
}
AWS Lake Formation for column-level access control. If your data lake contains sensitive columns — PII, financial data, health information — Lake Formation provides column-level access control that S3 bucket policies cannot achieve. Lake Formation lets you grant access to specific columns within a Glue Catalog table, so an analyst can query a customer table without seeing the sin_number or date_of_birth columns.
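As a rough illustration of a column-level grant, the sketch below builds the request for the Lake Formation GrantPermissions API, excluding the two sensitive columns named above. The account ID, role name, and database name are hypothetical, and the function only constructs the request; the actual call is shown as a comment because it needs AWS credentials.

```python
def column_filtered_grant(principal_arn, database, table, hidden_columns):
    """Build a GrantPermissions request: SELECT on every column except the hidden ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant all columns, minus the explicit exclusions.
                "ColumnWildcard": {"ExcludedColumnNames": list(hidden_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and catalog names, for illustration only.
grant = column_filtered_grant(
    principal_arn="arn:aws:iam::111122223333:role/analyst",
    database="sales",
    table="customer",
    hidden_columns=["sin_number", "date_of_birth"],
)

# With credentials configured, the request could be sent with:
#   boto3.client("lakeformation").grant_permissions(**grant)
print(grant["Resource"]["TableWithColumns"]["ColumnWildcard"])
```

Excluding columns rather than listing allowed ones means newly added columns become visible by default; teams with stricter requirements may prefer an explicit ColumnNames allow-list instead.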
Encryption at rest and in transit. S3 server-side encryption with AWS KMS (SSE-KMS) is the standard for sensitive data. Use customer managed KMS keys rather than AWS managed keys if you need control over the key policy, including the ability to disable the key and cut off access for regulatory compliance. Redshift supports encryption at rest using KMS or an HSM. Enforce SSL/TLS for all JDBC connections: Redshift, RDS, and Aurora all support mandatory SSL through parameter group settings (for example, require_ssl on Redshift and rds.force_ssl on RDS for PostgreSQL).
VPC-based data infrastructure. Run Redshift clusters, RDS instances, and EMR clusters inside a VPC with no public endpoints. Use VPC endpoints for S3 access to keep data transfer within the AWS network. Glue jobs can run inside a VPC using Glue connections configured with VPC and subnet parameters.
Pillar 3: Reliability for Data Pipelines
Reliability for data workloads means pipelines that complete successfully within their SLA, data that arrives correctly and on time, and failure modes that are contained and recoverable.
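Multi-AZ for stateful data stores. Any Redshift cluster, RDS instance, or Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) domain that serves production analytics should be deployed in a Multi-AZ configuration. The additional cost (roughly 2x for a synchronous standby) is warranted for workloads with defined availability SLAs. Redshift RA3 nodes use Redshift Managed Storage, which keeps data in S3 and so stores it redundantly across Availability Zones; that protects the data, but the cluster’s compute still needs a Multi-AZ deployment to ride out an AZ outage.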
S3 as the reliability anchor. S3 is designed for 11 nines of data durability (99.999999999%), which makes it a more dependable home for long-term data retention than a database volume or instance storage. Design pipelines so that intermediate and final datasets are always persisted to S3, never to ephemeral compute-local storage. This enables re-processing from any checkpoint without data loss.
Dead-letter queues for event-driven pipelines. Pipelines that process events from SQS, SNS, or Kinesis should configure dead-letter queues (DLQs) to capture messages that fail processing repeatedly. Without a DLQ, failed messages are either lost or cause infinite retry loops. Monitor DLQ depth as a leading indicator of pipeline health.
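For SQS, a dead-letter queue is attached through the source queue's RedrivePolicy attribute. The sketch below builds that attribute; the queue ARN and the retry limit of 5 are assumptions for illustration, and the attribute would be passed to the SQS CreateQueue or SetQueueAttributes call.

```python
import json


def queue_attributes_with_dlq(dlq_arn, max_receive_count=5):
    """Build SQS queue attributes that route repeatedly failing messages to a DLQ."""
    return {
        # SQS expects the redrive policy as a JSON-encoded string.
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                # Receives allowed before SQS moves the message to the DLQ.
                "maxReceiveCount": max_receive_count,
            }
        )
    }


# Hypothetical DLQ ARN, for illustration only.
attrs = queue_attributes_with_dlq(
    "arn:aws:sqs:ca-central-1:111122223333:transactions-dlq"
)

# With credentials configured, the attributes could be applied with:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
print(attrs["RedrivePolicy"])
```

A low maxReceiveCount surfaces poison messages quickly, while a higher one tolerates transient downstream failures; the right value depends on how the consumer retries.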
Pillar 4: Performance Efficiency
Performance efficiency for data workloads means choosing the right service and instance type for each job, and tuning access patterns to minimise wasted compute.
Columnar storage and compression. Store analytical datasets in Parquet or ORC format on S3. Columnar formats allow query engines (Athena, Redshift Spectrum, Spark) to read only the columns required by a query, dramatically reducing I/O. Use Snappy compression for interactive query workloads (fast decompression) and Gzip/Zstandard for archival data (better compression ratio). A raw CSV file commonly compresses to 20–30% of its original size in Snappy Parquet.
Partitioning strategy. Partition S3 data by the columns most commonly used in query predicates — typically date/time for event data and region or entity for reference data. Proper partitioning allows query engines to skip irrelevant partitions entirely (partition pruning). A poorly-partitioned table with 10 TB of data may scan the full 10 TB for a date-range query; the same data correctly partitioned by year/month/day scans only the relevant partitions.
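To make partition pruning concrete, the sketch below lists the Hive-style year/month/day prefixes a date-range query would need to read. The bucket path and helper names are assumptions; a query engine does this internally using the table's partition metadata.

```python
from datetime import date, timedelta


def day_partition(base, d):
    """Return the Hive-style prefix holding one day of data."""
    return f"{base}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"


def partitions_for_range(base, start, end):
    """Prefixes a date-range query must read, inclusive of both ends."""
    days = (end - start).days
    return [day_partition(base, start + timedelta(days=n)) for n in range(days + 1)]


# Hypothetical table location, for illustration only.
prefixes = partitions_for_range(
    "s3://my-data-lake/processed/transactions",
    date(2024, 2, 28),
    date(2024, 3, 1),
)
for p in prefixes:
    print(p)
```

A three-day query touches three prefixes regardless of how many years of history the table holds, which is where the I/O saving comes from.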
Right-sizing EMR and Glue clusters. Over-provisioned compute is wasted money; under-provisioned compute causes job failures and SLA misses. Use AWS Glue job metrics (driver and executor memory utilisation, shuffle write bytes) to identify jobs that are using less than 50% of provisioned DPUs; these are candidates for downsizing. For EMR, use the Spark History Server, available through EMR’s persistent application user interfaces, to review task-level resource utilisation across the cluster.
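The 50% rule of thumb above can be applied mechanically once utilisation figures have been collected. The sketch below assumes the peak and provisioned DPU numbers have already been gathered from job metrics; the job names and figures are made up for illustration.

```python
def downsizing_candidates(job_stats, threshold=0.5):
    """Return jobs whose peak DPU use is below `threshold` of what is provisioned.

    job_stats maps job name -> (peak DPUs used, DPUs provisioned).
    """
    candidates = []
    for name, (used, provisioned) in sorted(job_stats.items()):
        # Skip jobs with no provisioned capacity recorded to avoid dividing by zero.
        if provisioned > 0 and used / provisioned < threshold:
            candidates.append(name)
    return candidates


# Illustrative figures only, not measurements.
stats = {
    "transactions-daily": (4, 10),
    "customers-hourly": (8, 10),
}
candidates = downsizing_candidates(stats)
print(candidates)
```

Peak rather than average utilisation is the safer input here, since a job sized to its average will fail at its peak.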
For more on selecting between Glue and Spark for specific workload types, see our comparison in AWS Glue vs. Apache Spark.
Pillar 5: Cost Optimization
Cost optimisation for data workloads is a continuous practice, not a one-time configuration. The AWS pricing model rewards right-sizing, commitment discounts, and intelligent storage tiering.
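S3 Intelligent-Tiering for data lakes. For data lake buckets where access patterns are unpredictable, S3 Intelligent-Tiering automatically moves objects between access tiers (Frequent Access, Infrequent Access, and Archive Instant Access, plus optional asynchronous archive tiers) based on observed access patterns, with no retrieval fees. It does charge a small per-object monitoring and automation fee, and objects smaller than 128 KB are never moved out of the Frequent Access tier, so the benefit is concentrated in buckets dominated by larger objects. For buckets with objects > 128 KB and variable access patterns, Intelligent-Tiering typically delivers 20–40% storage cost reduction compared to S3 Standard. See S3 Storage Cost Management for a full tiering analysis.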
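One way to move existing objects into Intelligent-Tiering is a lifecycle rule that transitions them on creation, filtered to objects above the 128 KB threshold. The sketch below builds such a configuration for the S3 PutBucketLifecycleConfiguration API; the rule ID and prefix are hypothetical.

```python
# Objects at or below this size gain nothing from automatic tiering.
MIN_TIERED_BYTES = 128 * 1024


def intelligent_tiering_lifecycle(prefix):
    """Build a lifecycle configuration moving larger objects to Intelligent-Tiering."""
    return {
        "Rules": [
            {
                "ID": "to-intelligent-tiering",  # hypothetical rule name
                "Status": "Enabled",
                # Combine a prefix filter with a minimum object size.
                "Filter": {
                    "And": {
                        "Prefix": prefix,
                        "ObjectSizeGreaterThan": MIN_TIERED_BYTES,
                    }
                },
                # Days=0 transitions objects as soon as the rule is evaluated.
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    }


config = intelligent_tiering_lifecycle("processed/")

# With credentials configured, the configuration could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=config)
print(config["Rules"][0]["Filter"])
```

New objects can also be written directly with the INTELLIGENT_TIERING storage class, which avoids the lifecycle transition request charge.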
Reserved Instances and Savings Plans for stable workloads. Redshift Reserved Nodes offer up to 75% savings over on-demand pricing for workloads with predictable, sustained usage. Compute Savings Plans cover EC2, Fargate, and Lambda usage, with discounts that vary by service and commitment term. Commit to Reserved Nodes only after you have a stable Redshift cluster configuration; do not commit to a node type and count before understanding your workload’s actual size requirements.
Spot Instances for batch EMR jobs. EMR task nodes can run on EC2 Spot Instances at a 60–80% discount versus on-demand pricing. Spot Instances can be interrupted with two minutes’ notice, but well-designed Spark jobs tolerate interruption: YARN reschedules the lost containers and Spark retries the failed tasks, recomputing lost partitions where needed. Run core and master nodes on On-Demand Instances to maintain cluster stability.
Conducting a Well-Architected Review for Your Data Platform
AWS offers free Well-Architected Reviews through AWS Partner Network members. A review typically takes 2–4 hours and produces a prioritised list of High Risk Issues (HRIs) and Medium Risk Issues (MRIs). For data platforms, common HRIs include missing S3 bucket versioning on critical data, Redshift clusters with public endpoints, Glue jobs without CloudWatch alerting, and IAM roles with overly broad S3 permissions.
The Review is most valuable when conducted against a specific workload — a defined data pipeline, a Redshift cluster, or a data lake — rather than an entire AWS account. This scoping produces actionable findings rather than generic recommendations.
For organisations that have recently migrated workloads to AWS, a Well-Architected Review 6–12 months after migration typically uncovers significant cost optimisation and security improvements that were deferred during the migration programme. If you are planning a migration and want to build Well-Architected principles into the design from the start, see our On-Premises to AWS Migration guide.
Conclusion
The AWS Well-Architected Framework gives data engineering teams a shared language for architectural decisions and a structured approach to technical debt remediation. Applying it to data workloads — with specific attention to least-privilege IAM, S3 data durability patterns, columnar storage, and consumption-based cost optimisation — produces systems that are more reliable, more secure, and significantly cheaper to operate.
Infra IT Consulting conducts Well-Architected Reviews for data platforms across Canada, the UK, and Africa. If you would like a structured assessment of your AWS data infrastructure, contact us to arrange a review.