Rightsizing AWS Data Workloads: A Practical Guide
Overprovisioned infrastructure is the most common and most correctable source of wasted AWS spend for data engineering teams. Unlike Spot Instance adoption or Reserved Instance purchasing, which require workflow changes or financial commitments, rightsizing requires only accurate measurement and the willingness to act on what the data shows.
The challenge for data teams is that rightsizing AWS data workloads is not the same as rightsizing general-purpose EC2 instances. Redshift, EMR, Glue, and RDS each have distinct performance models, distinct metrics, and distinct rightsizing actions. This guide covers each in turn, with the specific metrics and thresholds that identify overprovisioning.
Rightsizing Redshift Clusters
Redshift clusters are among the most commonly overprovisioned resources in a data engineering AWS estate. The typical pattern: a cluster is sized for a peak load that was projected during initial deployment, the peak never fully materialised, and the cluster has been running at 20–40% utilisation ever since.
Key metrics for Redshift rightsizing:
- CPUUtilization (CloudWatch): a sustained average below 25% for the rolling 30-day period suggests overprovisioning. Peaks are expected; it is the sustained average that matters.
- PercentageDiskSpaceUsed: if consistently below 40% across all nodes, the storage allocation is oversized. For RA3 nodes, managed storage scales separately from compute, so this metric is less relevant for RA3.
- WLMQueueWaitTime: a WLM queue wait time consistently near zero combined with low CPU utilisation is the clearest signal of overprovisioning. If jobs are not waiting for resources and the cluster is not busy, it is too large.
- QueryDuration: if query durations are not bound by cluster size (i.e., queries are waiting on I/O or network, not CPU), adding more nodes will not improve performance.
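The signals above can be combined into a simple check. This is a minimal sketch, assuming the 30-day averages have already been pulled from CloudWatch. The class and function names are hypothetical, and the one-second queue-wait cut-off is an illustrative stand-in for "consistently near zero", not an AWS-defined threshold.

```python
from dataclasses import dataclass


@dataclass
class RedshiftUtilisation:
    """30-day averages for one cluster (assumed to come from CloudWatch)."""
    avg_cpu_percent: float
    avg_disk_used_percent: float
    avg_wlm_queue_wait_seconds: float
    is_ra3: bool


def redshift_overprovisioning_signals(u: RedshiftUtilisation,
                                      queue_wait_cutoff_seconds: float = 1.0) -> list:
    """Return the overprovisioning signals that fire for this cluster.

    The 25% CPU and 40% disk thresholds come from the list above; the
    queue-wait cut-off is a hypothetical value for "near zero".
    """
    signals = []
    low_cpu = u.avg_cpu_percent < 25
    if low_cpu:
        signals.append("low sustained CPU")
    # Disk usage is skipped for RA3, where managed storage scales separately.
    if not u.is_ra3 and u.avg_disk_used_percent < 40:
        signals.append("low disk usage")
    # Idle queues only count as a signal when CPU is also low.
    if low_cpu and u.avg_wlm_queue_wait_seconds < queue_wait_cutoff_seconds:
        signals.append("idle WLM queues with low CPU")
    return signals
```

A cluster that returns both the CPU and the WLM signal is the strongest candidate for a node-count review.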
Rightsizing actions for Redshift:
For ra3.4xlarge clusters, evaluate the minimum node count that can serve peak concurrency while maintaining acceptable query performance. A two-node ra3.4xlarge cluster costs approximately $3,000 CAD/month at on-demand pricing, so scaling down from four nodes to two saves roughly $3,000 CAD/month. A 1-year Reserved Node commitment on the remaining nodes adds a further 40–60% saving on top of the rightsizing.
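Run the following system query to summarise actual cluster query load and inform rightsizing decisions. STL tables retain only a limited window of log history (days, not weeks), so covering the full 30-day period requires periodically copying STL_QUERY rows into a permanent table and running the query against that table instead: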
-- Redshift: hourly query load summary for rightsizing analysis
SELECT
    DATE_TRUNC('hour', starttime) AS hour,
    COUNT(*) AS query_count,
    AVG(DATEDIFF(millisecond, starttime, endtime)) / 1000.0 AS avg_duration_seconds,
    MAX(DATEDIFF(millisecond, starttime, endtime)) / 1000.0 AS max_duration_seconds,
    SUM(CASE WHEN aborted = 1 THEN 1 ELSE 0 END) AS aborted_queries
FROM STL_QUERY
WHERE starttime >= DATEADD(day, -30, GETDATE())
  AND userid > 1 -- exclude system queries
GROUP BY 1
ORDER BY 1;
Combine this with CloudWatch metrics to correlate query load with CPU utilisation. If the peak query hour shows 15 concurrent queries and 45% CPU utilisation on a 4-node cluster, then assuming load scales roughly linearly with node count, a 2-node cluster at that peak would reach approximately 90% CPU. That is acceptable for bursts, given that off-peak load will be lower.
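The projection in that example is simple proportional arithmetic. A minimal sketch, assuming CPU load spreads evenly across nodes (a simplification, since real workloads rarely scale perfectly linearly); the function name is illustrative:

```python
def projected_cpu_percent(current_cpu_percent: float,
                          current_nodes: int,
                          target_nodes: int) -> float:
    """Estimate peak CPU after a resize, assuming load spreads evenly across nodes."""
    if current_nodes <= 0 or target_nodes <= 0:
        raise ValueError("node counts must be positive")
    return current_cpu_percent * current_nodes / target_nodes


# The example from the text: 45% peak CPU on 4 nodes, resized to 2 nodes.
print(projected_cpu_percent(45, 4, 2))  # 90.0
```

A projected value above 100% means the smaller cluster could not serve the observed peak at all, which rules out that node count before any testing.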
For deeper Redshift cost optimisation techniques beyond node count, see our dedicated post on Redshift Cost Tuning.
Rightsizing EMR Clusters
EMR cluster rightsizing is more dynamic than Redshift because EMR clusters are often transient: created per job run rather than maintained continuously. For long-running EMR clusters (clusters that run 24/7 serving interactive Spark workloads), the rightsizing approach mirrors Redshift.
For transient per-job clusters, the question is different: is the cluster appropriately sized for the specific job it runs?
Key metrics for transient EMR job rightsizing:
- YARN Memory Available (MB) at job completion: if a significant amount of YARN memory remains unused when the job completes, the cluster has excess capacity. Export YARN ResourceManager metrics via CloudWatch or the Ganglia history to quantify this.
- Container pending time: if YARN containers are never queued waiting for resources, there are more resources than the job requires.
- Task executor memory utilisation: access through the Spark History Server. If executors are using 40% of their allocated memory on average, the executor memory setting (spark.executor.memory) is oversized relative to data volumes.
- Job completion time at different cluster sizes: run the job at 60% of the current cluster size and measure completion time. If completion time increases by less than 30%, the current cluster is significantly oversized.
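The reduced-size test in the last bullet can be scored with two small helpers. This is a sketch with hypothetical function names. The 30% slowdown threshold comes from the text, while the node-minutes comparison is an added illustration of why a modest slowdown still saves money.

```python
def emr_cluster_oversized(baseline_minutes: float,
                          reduced_minutes: float,
                          max_slowdown: float = 0.30) -> bool:
    """True if the reduced-size run slowed down by less than max_slowdown."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    slowdown = (reduced_minutes - baseline_minutes) / baseline_minutes
    return slowdown < max_slowdown


def relative_node_minutes(size_fraction: float,
                          baseline_minutes: float,
                          reduced_minutes: float) -> float:
    """Node-minutes of the reduced run as a fraction of the baseline run."""
    return size_fraction * reduced_minutes / baseline_minutes


# A job that takes 40 minutes at full size and 48 minutes at 60% of the nodes:
print(emr_cluster_oversized(40, 48))                  # True (20% slower)
print(round(relative_node_minutes(0.6, 40, 48), 2))   # 0.72
```

In this example the smaller cluster consumes about 72% of the original node-minutes, so the job runs slightly longer but costs less.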
A practical rightsizing approach for transient EMR clusters: use EMR managed scaling with Instance Fleets to let EMR dynamically adjust the number of task nodes during a job run. Configure a minimum of 2 task nodes and a maximum of 20, and let EMR scale based on YARN pending containers. This removes the rightsizing question for variable workloads, because the cluster self-adjusts.
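One way to express these limits is an EMR managed scaling policy, the scaling mechanism that supports instance fleets. The sketch below builds the policy and applies it with boto3's `put_managed_scaling_policy`. The helper names are hypothetical. It also assumes the 2 and 20 limits are applied to total fleet capacity units, because managed scaling limits cover the whole cluster (core plus task) rather than task nodes alone.

```python
def build_managed_scaling_policy(min_units: int = 2, max_units: int = 20) -> dict:
    """Build a managed scaling policy for a cluster that uses instance fleets."""
    if min_units > max_units:
        raise ValueError("min_units must not exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            # Assumption: limits bound total cluster capacity, not task nodes only.
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


def apply_managed_scaling(cluster_id: str, policy: dict) -> None:
    """Attach the policy to a running cluster (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    emr = boto3.client("emr")
    emr.put_managed_scaling_policy(ClusterId=cluster_id, ManagedScalingPolicy=policy)
```

Calling `apply_managed_scaling("j-XXXXXXXX", build_managed_scaling_policy())` with a real cluster ID would attach the policy; the ID shown here is a placeholder.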
Rightsizing AWS Glue Jobs
Glue job rightsizing focuses on two parameters: the number of workers and the worker type (G.025X, G.1X, G.2X, G.4X, G.8X). Both are configured at job creation and affect cost directly. Each G.1X worker is 1 DPU and each G.2X worker is 2 DPUs, so at $0.44/DPU-hour a 10-worker G.1X job costs $0.44 × 1 × 10 = $4.40/hour, and a 10-worker G.2X job costs $0.44 × 2 × 10 = $8.80/hour.
Metrics for Glue rightsizing:
- glue.driver.jvm.heap.usage (CloudWatch): if driver heap usage is consistently below 50%, the driver memory is oversized. Consider using a smaller worker type.
- glue.ALL.jvm.heap.usage (CloudWatch): average across all executors. Below 50% consistently suggests executor over-allocation.
- glue.driver.BlockManager.disk.diskSpaceUsed_MB: if disk-based shuffling is occurring (non-zero), memory is insufficient; if it is always zero, memory may be excessive.
- Actual job duration vs. DPU allocation: if a 10-worker G.2X job (2 DPUs per worker) finishes in 8 minutes, the cost is $0.44 × 2 × 10 × (8/60) ≈ $1.17. Reducing to 5 workers may stretch the runtime to 14 minutes but cost $0.44 × 2 × 5 × (14/60) ≈ $1.03, which is slightly cheaper with no SLA impact.
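The cost arithmetic in the last bullet generalises to any worker type. A minimal sketch, assuming the $0.44/DPU-hour rate quoted in the text and the standard DPU mapping for Glue worker types (G.1X is 1 DPU, G.2X is 2 DPUs, and so on); it ignores minimum billing durations, and the function name is illustrative:

```python
DPU_HOUR_RATE = 0.44  # rate quoted in the text; check current pricing for your region

# DPUs consumed by one worker of each type.
DPUS_PER_WORKER = {"G.025X": 0.25, "G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def glue_run_cost(worker_type: str, workers: int, minutes: float,
                  rate: float = DPU_HOUR_RATE) -> float:
    """Cost of one job run: rate x DPUs per worker x workers x hours."""
    return rate * DPUS_PER_WORKER[worker_type] * workers * minutes / 60


# The two runs compared in the text:
print(round(glue_run_cost("G.2X", 10, 8), 2))   # 1.17
print(round(glue_run_cost("G.2X", 5, 14), 2))   # 1.03
```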
A structured rightsizing test for a Glue job:
# Accept worker settings as job arguments so each rightsizing test run logs its configuration.
# Note: these arguments only label the run; the actual allocation is set by the
# NumberOfWorkers and WorkerType settings on the job or job run.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'num_workers',  # Pass as --num_workers; vary across test runs
    'worker_type',  # Pass as --worker_type; e.g. G.1X vs G.2X
])

# Log configuration for post-hoc analysis
print(f"Running with {args['num_workers']} workers of type {args['worker_type']}")
Run the same job with 5, 10, and 20 workers and record duration and cost for each. For most batch transformation jobs, the cost-optimal worker count sits at the point of diminishing returns on parallelism, where job duration decreases less than linearly with added workers.
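Once the test runs are recorded, picking the winner is a small selection problem: the cheapest run that still meets the SLA. The sketch below uses hypothetical names, and the 20-worker duration in the example is invented for illustration. It assumes the $0.44/DPU-hour rate from the text.

```python
from typing import List, NamedTuple, Optional


class WorkerTrial(NamedTuple):
    """One recorded test run of the job."""
    workers: int
    minutes: float


def cheapest_trial_within_sla(trials: List[WorkerTrial],
                              sla_minutes: float,
                              dpus_per_worker: float = 1.0,
                              rate: float = 0.44) -> Optional[WorkerTrial]:
    """Return the lowest-cost trial whose duration meets the SLA, or None."""
    eligible = [t for t in trials if t.minutes <= sla_minutes]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda t: rate * dpus_per_worker * t.workers * t.minutes / 60,
    )


# Illustrative results for 5, 10, and 20 workers (durations are made up).
trials = [WorkerTrial(5, 14), WorkerTrial(10, 8), WorkerTrial(20, 6)]
print(cheapest_trial_within_sla(trials, sla_minutes=15))  # WorkerTrial(workers=5, minutes=14)
print(cheapest_trial_within_sla(trials, sla_minutes=10))  # WorkerTrial(workers=10, minutes=8)
```

Tightening the SLA from 15 to 10 minutes moves the answer from 5 workers to 10, which shows why the SLA needs to be fixed before the test runs are compared.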
Rightsizing RDS Instances for Data Workloads
RDS instances serving data engineering workloads (source databases for DMS replication, operational data stores feeding pipelines) are commonly overprovisioned because they were sized for OLTP peak loads that differ from the read-heavy, bulk-extract patterns of ETL.
Key metrics for RDS rightsizing:
- CPUUtilization (CloudWatch): a sustained average below 30% for a 4-week period suggests the instance class is oversized
- FreeableMemory: if consistently above 25% of total instance memory, the instance may have excess RAM
- DatabaseConnections: if average connections are well below the instance's maximum (which varies by instance size), the instance is not connection-constrained
- ReadIOPS / WriteIOPS vs. provisioned IOPS: for gp3 or io1 storage, if actual IOPS are consistently below 50% of provisioned IOPS, reduce the provisioned IOPS allocation
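The numeric thresholds above translate directly into a screening function. A minimal sketch, assuming the 4-week averages are already available; the function name and flag labels are hypothetical, and the connection check is omitted because the text gives no numeric threshold for it.

```python
from typing import List, Optional


def rds_oversizing_flags(avg_cpu_percent: float,
                         avg_freeable_memory_gib: float,
                         total_memory_gib: float,
                         avg_iops: float,
                         provisioned_iops: Optional[float] = None) -> List[str]:
    """Return which of the CPU, memory, and IOPS thresholds an instance trips."""
    flags = []
    if avg_cpu_percent < 30:
        flags.append("cpu")
    if avg_freeable_memory_gib / total_memory_gib > 0.25:
        flags.append("memory")
    # The IOPS check only applies when IOPS are explicitly provisioned.
    if provisioned_iops and avg_iops / provisioned_iops < 0.5:
        flags.append("iops")
    return flags


# Example: 12% CPU, 20 of 64 GiB freeable, 1,500 of 6,000 provisioned IOPS used.
print(rds_oversizing_flags(12, 20, 64, 1500, 6000))  # ['cpu', 'memory', 'iops']
```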
AWS Compute Optimizer provides specific RDS rightsizing recommendations based on CloudWatch metric history. Enable Compute Optimizer in your AWS account and review its recommendations for RDS instances monthly.
For a write-heavy source database that has been scaled up to handle ETL read load, consider using an RDS read replica for ETL extraction instead of reading from the primary. The replica can be a smaller, cheaper instance class since it only needs to handle the read load from your DMS or JDBC extraction jobs.
Building a Rightsizing Process Into Your Operations
Rightsizing is most effective as a recurring operational process rather than a one-time project. Implement a monthly rightsizing review:
- Pull CloudWatch utilisation metrics for all Redshift clusters, running EMR clusters, and RDS instances for the past 30 days
- Review AWS Compute Optimizer recommendations, moving to a weekly review where workloads change quickly
- For Glue jobs with over $500/month in costs, review the job metrics and test with reduced worker counts
- Document rightsizing changes with before/after cost tracking to demonstrate savings
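The first step of the review, pulling 30 days of utilisation, can be scripted against CloudWatch. The sketch below builds the request for boto3's `get_metric_statistics` and averages the hourly datapoints. The helper names are hypothetical, and it assumes the standard `AWS/Redshift` and `AWS/RDS` namespaces with their `ClusterIdentifier` and `DBInstanceIdentifier` dimensions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cpu_metric_request(namespace: str, dimension_name: str, resource_id: str,
                       days: int = 30, now: Optional[datetime] = None) -> dict:
    """Build get_metric_statistics parameters for hourly CPUUtilization."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": dimension_name, "Value": resource_id}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 3600,  # hourly: 720 datapoints for 30 days
        "Statistics": ["Average", "Maximum"],
    }


def average_cpu(request: dict) -> Optional[float]:
    """Fetch the datapoints and return their mean (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works offline

    cloudwatch = boto3.client("cloudwatch")
    points = cloudwatch.get_metric_statistics(**request)["Datapoints"]
    if not points:
        return None
    return sum(p["Average"] for p in points) / len(points)
```

For an RDS instance the call would look like `average_cpu(cpu_metric_request("AWS/RDS", "DBInstanceIdentifier", "my-instance"))`, where the instance name is a placeholder.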
Pairing rightsizing with a FinOps culture, where cost efficiency is a shared team responsibility, produces the most sustained results. For guidance on building that culture, see FinOps for Data Engineering. For rightsizing combined with commitment purchasing strategy, see Reserved Instances vs. Savings Plans for Data Workloads to understand which resources should be rightsized before being reserved.
Expected Savings from a Rightsizing Programme
Data engineering teams that complete a structured rightsizing assessment typically find:
- Redshift clusters: 20–40% of clusters are oversized by at least one node tier, representing $1,000–$8,000 CAD/month per cluster in avoidable spend
- Glue jobs: 30–50% of production jobs are running with 30–50% more workers than needed for their data volume and SLA
- RDS instances: 25–35% of RDS instances supporting data workloads are one instance class larger than needed
Across a typical mid-size data platform, a rightsizing programme conducted over 4–6 weeks identifies $15,000–$60,000 CAD/year in avoidable spend, with no reduction in pipeline reliability or analytical performance.
Conclusion
Rightsizing is the highest-ROI cost optimisation activity for most data engineering teams because the savings are immediate, the risk is low, and the analysis requires only a few days of focused effort. The key is building a systematic measurement approach for each AWS data service and acting on what the metrics show.
Infra IT Consulting conducts rightsizing assessments for data engineering teams across Canada, the UK, and Africa. If your AWS data infrastructure costs are growing faster than your data volumes, contact us for a rightsizing and cost optimisation review.