7 Proven Ways to Cut AWS Data Pipeline Costs Without Losing Performance
AWS data pipeline costs have a way of growing quietly. A Glue job that runs in 10 minutes on day one grows to 45 minutes six months later as the dataset doubles. An Athena query that scanned 2 GB in January scans 80 GB in August because nobody partitioned the table correctly. An EMR cluster provisioned for peak load runs at 15% utilisation for 20 hours a day because the auto-scaling policy was never tuned. By the time the cloud bill lands on a CTO’s desk, the engineering team is debugging production incidents and cost review falls to the bottom of the backlog.
This post covers seven practical, proven optimisation strategies that Infra IT Consulting has implemented across real AWS data platforms. Each strategy includes specific numbers, trade-offs, and implementation guidance — not generic advice.
1. Fix Your S3 Data Layout and Partitioning Before Everything Else
Inefficient S3 partitioning is the single most common source of unnecessary cost on AWS data platforms. When Athena or Redshift Spectrum scans data, you pay per terabyte scanned. A query that scans your entire 10 TB events table because there’s no partition pruning costs roughly $50 per execution at standard Athena pricing. The same query on a properly partitioned table might scan 50 GB and cost $0.25.
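The arithmetic behind those two figures can be sketched in a few lines. This is an illustration only: it assumes the $5 per TB on-demand rate quoted in this post, uses decimal terabytes, and ignores Athena's 10 MB per-query minimum. The function name is hypothetical.

```python
ATHENA_USD_PER_TB = 5.0  # on-demand rate quoted in this post (assumption: your region matches)


def athena_scan_cost(bytes_scanned: float, usd_per_tb: float = ATHENA_USD_PER_TB) -> float:
    """Approximate on-demand cost in USD of a single query, given bytes scanned."""
    terabytes = bytes_scanned / 1e12  # decimal TB keeps the arithmetic easy to follow
    return terabytes * usd_per_tb


full_scan = athena_scan_cost(10e12)   # unpartitioned 10 TB table
pruned_scan = athena_scan_cost(50e9)  # 50 GB left after partition pruning
```

Multiply either figure by the number of times the query runs per month to see what a dashboard refresh schedule actually costs.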
The S3 data partitioning strategies that work in production follow a clear principle: partition by the columns most frequently used in query predicates, at a granularity that keeps partition count manageable. For time-series data, year/month/day partitioning is standard. For multi-tenant data, leading with a tenant or region partition before the date often yields better pruning.
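As a rough sketch of that layout, the helper below builds a Hive-style key prefix with an optional leading tenant partition. The function name, bucket, and partition key names (tenant, year, month, day) are illustrative assumptions, not a required convention.

```python
from datetime import date
from typing import Optional


def partition_prefix(base: str, event_date: date, tenant: Optional[str] = None) -> str:
    """Build a Hive-style prefix such as .../tenant=acme/year=2024/month=03/day=07/."""
    parts = [base.rstrip("/")]
    if tenant is not None:
        # Lead with the tenant so tenant-filtered queries prune before the date keys.
        parts.append(f"tenant={tenant}")
    parts.append(f"year={event_date.year:04d}")
    parts.append(f"month={event_date.month:02d}")
    parts.append(f"day={event_date.day:02d}")
    return "/".join(parts) + "/"


prefix = partition_prefix("s3://example-bucket/events", date(2024, 3, 7), tenant="acme")
```

Zero-padded month and day values keep prefixes sortable, which makes range predicates on the partition columns easier to reason about.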
Beyond partitioning, file format and compression matter enormously. Converting CSV or JSON raw data to Parquet or ORC with Snappy compression typically reduces storage costs by 60–80% and reduces Athena scan costs proportionally. A single Glue job that converts your raw landing zone to Parquet pays for itself within days on any dataset over a few hundred gigabytes.
File sizing is the third dimension. Parquet files below 64 MB create excessive S3 request overhead and metadata management costs. Files above 512 MB limit parallelism in Spark and Athena. Target 128–256 MB Parquet files for most analytical workloads.
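One way to hit that range is to derive the number of output files from the dataset size before writing. The sketch below assumes a 192 MB target (the midpoint of the 128–256 MB range) and a hypothetical function name. Compressed Parquet output size is only an estimate until you measure it on your own data.

```python
import math

DEFAULT_TARGET_BYTES = 192 * 1024**2  # assumed midpoint of the 128-256 MB range


def target_file_count(dataset_bytes: int, target_file_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Number of output files needed so each lands near the target size."""
    return max(1, math.ceil(dataset_bytes / target_file_bytes))


files_for_50_gib = target_file_count(50 * 1024**3)
```

In a Spark job the result would typically be passed to `df.repartition(n)` before the write, so that each partition produces one file of roughly the target size.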
2. Right-Size and Auto-Scale Your EMR Clusters
EMR clusters are frequently over-provisioned because the person who wrote the initial cluster specification estimated conservatively, and nobody revisited the sizing after deployment. A 10-node m5.4xlarge cluster running at 15% CPU utilisation is wasting roughly 85% of its compute budget.
Start by collecting two weeks of CloudWatch metrics for your existing clusters: CPU utilisation, memory utilisation, YARN container pending time, and Spark executor active time. If average CPU utilisation is below 30%, your cluster is over-provisioned. If YARN shows persistent pending containers, it is under-provisioned. Most production clusters land in the “over-provisioned” category.
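The decision rule above can be written down as a small classifier over exported metric samples. This is a sketch under stated assumptions: the 30% CPU floor comes from the text, while treating "persistent" as pending containers in more than half of the samples is our own illustrative threshold, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def classify_cluster(
    cpu_percent: Sequence[float],
    pending_containers: Sequence[int],
    cpu_floor: float = 30.0,
    pending_fraction: float = 0.5,
) -> str:
    """Classify a cluster from two weeks of CPU and YARN pending-container samples."""
    share_pending = sum(1 for p in pending_containers if p > 0) / len(pending_containers)
    if share_pending > pending_fraction:
        return "under-provisioned"  # work is regularly queued waiting for capacity
    if mean(cpu_percent) < cpu_floor:
        return "over-provisioned"   # paying for capacity that mostly sits idle
    return "right-sized"
```

Pending containers are checked first because a cluster can show low average CPU and still starve jobs during short peaks.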
For clusters that run batch jobs with predictable runtimes, Instance Fleets with Spot Instances can cut compute costs by 60–80% compared to On-Demand pricing. The trade-off is Spot interruption risk. Mitigate this by using multiple instance types in your Fleet configuration (mix m5.4xlarge, m5a.4xlarge, and m5d.4xlarge to give the Spot market more options), enabling graceful decommission in your YARN configuration, and designing your Spark jobs with checkpointing so they can restart from an intermediate state.
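A minimal sketch of such a fleet definition follows, shaped like one entry of the `InstanceFleets` list that boto3's EMR `run_job_flow` accepts under `Instances`. The fleet name, the choice of a TASK fleet, the 10-minute timeout, and the fallback to On-Demand are assumptions for illustration. Adjust them to your own interruption tolerance.

```python
def task_spot_fleet(target_spot_units: int) -> dict:
    """Build a Spot-only task fleet that diversifies across three similar instance types."""
    instance_types = ["m5.4xlarge", "m5a.4xlarge", "m5d.4xlarge"]
    return {
        "Name": "task-spot-fleet",      # hypothetical name
        "InstanceFleetType": "TASK",    # task nodes hold no HDFS data, so losing one is cheaper
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": target_spot_units,
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


fleet = task_spot_fleet(8)
```

Keeping the primary and core fleets On-Demand while only task capacity runs on Spot is a common way to limit the blast radius of an interruption.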
A separate comparison, "EMR Serverless vs. EMR on EC2", covers the two modes in detail, but for cost purposes: EMR Serverless eliminates the cost of idle cluster time entirely, billing only for vCPU-seconds and GB-seconds of actual compute used. For bursty workloads with significant idle time between jobs, this can reduce EMR costs by 40–60%.
3. Optimise AWS Glue Job Configuration and DPU Allocation
AWS Glue charges per Data Processing Unit (DPU) per second, with a 1-minute minimum per job run. A Glue job configured at 10 DPUs that actually needs 3 DPUs is paying for idle capacity on every run. Glue’s auto-scaling feature (available from Glue 3.0) dynamically adjusts DPU allocation based on actual workload, which eliminates most manual right-sizing work for variable-complexity jobs.
For fixed-complexity jobs, enable the Glue Job Metrics feature and examine executor utilisation in the Spark UI. If executors are consistently below 50% CPU utilisation, reduce the job’s number of workers (NumberOfWorkers in the Glue API, --number-of-workers in the AWS CLI). If shuffle read/write is consuming most of your job time, increasing worker count is justified. The optimal configuration is typically one worker per 5–10 GB of data processed per run, starting from a baseline of G.1X workers (4 vCPUs, 16 GB) rather than G.2X unless you have confirmed memory pressure.
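The sizing rule of thumb translates into a one-line estimate. The sketch below assumes 7.5 GB per worker (the midpoint of the 5–10 GB range) and a floor of two workers. Both defaults and the function name are illustrative, and the result is a starting point to validate against job metrics, not a guarantee.

```python
import math


def estimate_workers(gb_per_run: float, gb_per_worker: float = 7.5, minimum: int = 2) -> int:
    """Starting worker count from the one-worker-per-5-to-10-GB rule of thumb."""
    return max(minimum, math.ceil(gb_per_run / gb_per_worker))


baseline = estimate_workers(60)  # a job that processes about 60 GB per run
```

Running the estimate at both ends of the range (5 and 10 GB per worker) gives a sensible band to test within.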
Glue job bookmarks deserve specific attention. When enabled, bookmarks track which S3 objects have been processed, enabling incremental processing. A job processing daily incremental data with bookmarks enabled processes gigabytes per run; the same job without bookmarks reprocesses terabytes. The cost difference is multiplicative over time.
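As a sketch, the settings discussed in this section can be collected into the keyword arguments for boto3's Glue `create_job` call. The function name, Glue version, and parameter values are assumptions for illustration. The three `DefaultArguments` keys are the standard Glue job parameters for bookmarks, auto-scaling, and job metrics.

```python
def glue_job_settings(name: str, role_arn: str, script_location: str, max_workers: int) -> dict:
    """Keyword arguments for glue_client.create_job(**settings)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",            # assumption: any version from 3.0 supports auto-scaling
        "WorkerType": "G.1X",
        "NumberOfWorkers": max_workers,  # acts as the upper bound when auto-scaling is on
        "DefaultArguments": {
            "--job-bookmark-option": "job-bookmark-enable",
            "--enable-auto-scaling": "true",
            "--enable-metrics": "true",
        },
    }


settings = glue_job_settings(
    "daily-events-to-parquet",                       # hypothetical job name
    "arn:aws:iam::123456789012:role/example-glue",   # placeholder role ARN
    "s3://example-bucket/scripts/job.py",            # placeholder script path
    max_workers=8,
)
```

Bookmarks only take effect when the job script sets a `transformation_ctx` on its sources and calls `job.commit()` at the end, so enabling the parameter alone is not enough.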
4. Use Athena Workgroups to Enforce Query Cost Controls
Athena charges $5 per TB scanned. Without controls, a single poorly-written query from a business analyst or an automated report can scan your entire data lake and generate a $500 bill in under a minute. Athena Workgroups provide the governance controls to prevent this.
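Configure workgroups with a per-query data scanned limit. A limit of 10 GB per query prevents runaway scans while still supporting most legitimate analytical queries. For teams that don’t require large-scale processing, also set a workgroup-wide data usage control of 1 TB per day. The per-query limit cancels any query that exceeds the threshold rather than letting it complete. The workgroup-wide control raises an alert through Amazon SNS when the threshold is crossed and does not block queries by itself, so pair it with an automated action if you need a hard stop. Cancelled queries are a trade-off in analyst convenience, but one that most organisations prefer to an unconstrained cloud bill.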
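A per-query limit like this can be expressed as the `Configuration` argument of boto3's Athena `create_work_group` call. The sketch below only builds the dictionary. The function name and output location are placeholders, and treating 1 GB as 1000^3 bytes is an assumption you may want to change.

```python
def workgroup_configuration(per_query_limit_gb: int, output_location: str) -> dict:
    """Configuration dict for athena_client.create_work_group(Name=..., Configuration=...)."""
    return {
        "ResultConfiguration": {"OutputLocation": output_location},
        # Stop individual users from overriding the workgroup settings client-side.
        "EnforceWorkGroupConfiguration": True,
        # Publish per-query metrics so scanned volume can be alarmed on in CloudWatch.
        "PublishCloudWatchMetricsEnabled": True,
        # Queries that scan more than this many bytes are cancelled.
        "BytesScannedCutoffPerQuery": per_query_limit_gb * 1000**3,
    }


config = workgroup_configuration(10, "s3://example-bucket/athena-results/")
```

Separate workgroups per team, each with its own cutoff, let you give data engineering a higher ceiling than ad hoc analyst queries.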
Enable query result reuse for the queries that run in your Athena workgroups. Identical queries executed within the configured maximum age (60 minutes by default, configurable up to 7 days) return cached results without scanning data, reducing both cost and latency. For dashboards and reports that refresh hourly with unchanged underlying data, this can eliminate 80–90% of scanning costs. Reuse is scoped to the workgroup, so results are shared across users running the same query, provided they have access to the underlying tables and the query result location.
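Result reuse is requested per query through the `ResultReuseConfiguration` argument of boto3's Athena `start_query_execution`. The helper below is a sketch that only assembles those arguments. The function name, workgroup name, and the 60-minute default are illustrative assumptions.

```python
def reuse_enabled_query(sql: str, workgroup: str, max_age_minutes: int = 60) -> dict:
    """Keyword arguments for athena_client.start_query_execution(**kwargs)."""
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


kwargs = reuse_enabled_query("SELECT count(*) FROM events", "analytics-team")
```

Choose the maximum age to match how often the underlying data changes: an hourly dashboard over daily-loaded data can tolerate a much longer window than one over streaming ingest.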
5. Implement S3 Intelligent-Tiering and Lifecycle Policies
S3 Standard storage costs $0.023 per GB/month in us-east-1 (approximately $0.025 in ca-central-1). For a 100 TB data lake, that is roughly $2,300 per month at the us-east-1 rate (about $2,500 in ca-central-1) in storage costs alone. S3 Intelligent-Tiering automatically moves objects between access tiers based on observed access patterns, reducing storage costs by 40–68% for data with unknown or variable access patterns.
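The storage arithmetic is easy to check. This sketch uses decimal units (1 TB = 1,000 GB) and the first-tier per-GB rates quoted above. It ignores request charges and volume discounts, and the function name is hypothetical.

```python
def monthly_storage_cost(terabytes: float, usd_per_gb_month: float) -> float:
    """Approximate monthly S3 storage cost in USD, using 1 TB = 1,000 GB."""
    return terabytes * 1000 * usd_per_gb_month


us_east_1 = monthly_storage_cost(100, 0.023)      # about 2,300 USD per month
ca_central_1 = monthly_storage_cost(100, 0.025)   # about 2,500 USD per month
```

Applying the 40–68% Intelligent-Tiering range to the us-east-1 figure gives a rough band of $740 to $1,380 per month for the same data.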
For data with predictable access patterns, explicit lifecycle policies are more cost-effective than Intelligent-Tiering (which has a small per-object monitoring charge). Raw landing zone data that is processed within 30 days and never accessed again should transition to S3 Standard-IA after 30 days, S3 Glacier Instant Retrieval after 90 days, and S3 Glacier Deep Archive after 365 days. Processed analytical data that is actively queried for 12 months and then archived should move to S3 Standard-IA after 60 days.
A concrete lifecycle policy for a raw ingestion bucket:
{
  "Rules": [
    {
      "ID": "raw-data-tiering",
      "Filter": {"Prefix": "raw/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ]
    }
  ]
}
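To sanity-check a policy like the one above before applying it, it can help to ask which storage class an object of a given age would be in. The sketch below mirrors the three transitions from the policy. The function name is hypothetical, and real transitions are applied asynchronously by S3, so actual timing can lag the configured day counts.

```python
from typing import List, Tuple

# (minimum age in days, storage class), mirroring the policy above
RAW_TRANSITIONS: List[Tuple[int, str]] = [
    (30, "STANDARD_IA"),
    (90, "GLACIER_IR"),
    (365, "DEEP_ARCHIVE"),
]


def storage_class_at(age_days: int, transitions: List[Tuple[int, str]] = RAW_TRANSITIONS) -> str:
    """Return the storage class an object of the given age would have reached."""
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current
```

Running this over an inventory of object ages gives a quick estimate of how much data would sit in each class once the policy has fully taken effect.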
6. Tune Redshift Spectrum and Eliminate Redundant Data Movement
Redshift Spectrum charges $5 per TB scanned from S3, identical to Athena’s pricing. Teams that load data into Redshift from S3 and then also run Spectrum queries against the same S3 data are paying twice for storage and moving data unnecessarily. Audit your Redshift workload for queries that could run against S3 via Spectrum rather than requiring data loaded into Redshift managed storage.
A separate write-up, "Redshift Spectrum federated queries", explains the pattern in detail, but the cost principle is: keep hot analytical data in Redshift managed storage for sub-second query performance, keep warm and cold data in S3 queryable via Spectrum, and never load data into Redshift that doesn’t benefit from Redshift’s columnar compression and distribution optimisation.
Redshift managed storage (RMS) bills by the GB-month used. Identify tables that are queried fewer than once per week and evaluate whether they belong in Redshift at all. Moving infrequently accessed data to S3/Parquet and querying via Spectrum typically reduces Redshift storage costs by 30–50% on mature platforms where historical data has accumulated.
7. Implement Infrastructure-as-Code Cost Guardrails
The most durable cost optimisation is architectural: prevent over-provisioning from occurring in the first place by building cost constraints into your infrastructure templates. Terraform for AWS data stacks enables you to parameterise resource sizes and enforce review gates for expensive configuration changes.
Specific guardrails worth implementing:
- AWS Budgets alerts: Set alerts at 50%, 80%, and 100% of monthly budget per service, with automated SNS notifications to engineering leads.
- Glue job DPU limits: Define a maximum DPU count in your Terraform module and require a PR review to increase it above the default.
- Cost allocation tags: Tag every Glue job, EMR cluster, and S3 bucket with team, pipeline, and environment tags. Without tags, cost attribution is guesswork. With tags, you can identify which pipeline doubled its cost in a given month.
- Athena per-workgroup budgets: Use CloudWatch alarms on the Athena ProcessedBytes metric to alert when a workgroup’s scanned volume approaches its budget, alongside the BytesScannedCutoffPerQuery workgroup setting that caps individual queries.
- Scheduled auto-shutdown: For development EMR clusters or Glue Dev Endpoints, use Lambda functions triggered by EventBridge schedules to terminate resources outside business hours.
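The scheduled auto-shutdown guardrail could look roughly like the function below, which a Lambda handler would call with `boto3.client("emr")`. It is a sketch under stated assumptions: the function name and the environment=dev tag convention are illustrative, and pagination of `list_clusters` is omitted for brevity.

```python
from typing import List

ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def terminate_tagged_clusters(emr_client, tag_key: str = "environment", tag_value: str = "dev") -> List[str]:
    """Terminate active EMR clusters carrying the given tag and return their IDs."""
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    to_stop = []
    for summary in response.get("Clusters", []):
        # Tags are not included in the list response, so fetch each cluster's details.
        detail = emr_client.describe_cluster(ClusterId=summary["Id"])
        tags = {t["Key"]: t["Value"] for t in detail["Cluster"].get("Tags", [])}
        if tags.get(tag_key) == tag_value:
            to_stop.append(summary["Id"])
    if to_stop:
        emr_client.terminate_job_flows(JobFlowIds=to_stop)
    return to_stop
```

Passing the client in as an argument keeps the function easy to test with a stub, and relying on the cost allocation tags from the list above means the same tagging discipline drives both attribution and shutdown.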
Conclusion
AWS data pipeline costs are controllable without sacrificing performance or capability. The seven strategies in this post — efficient partitioning, right-sized EMR clusters, optimised Glue configurations, Athena query controls, S3 lifecycle management, Spectrum/Redshift balance, and IaC guardrails — address the most common sources of excess spend in real-world data platforms.
The key insight is that most cost problems are architectural rather than operational. Fixing a partitioning scheme or converting to Parquet delivers permanent, compounding savings on every subsequent query and job run. One-time optimisation efforts pay dividends for years.
Cost optimisation is most effective when embedded into platform engineering from the beginning, not bolted on after bills become alarming. If your AWS data platform costs are growing faster than your data volumes, contact Infra IT Consulting for a cost assessment and optimisation roadmap tailored to your specific workloads.