AWS Cost Optimisation for Data Teams: 10 Tactics That Work
Data infrastructure is one of the fastest-growing cost centres on AWS bills. The elasticity that makes cloud analytics powerful (the ability to spin up large Glue jobs, run Athena queries across terabytes, and scale Redshift clusters on demand) also makes it easy to accumulate spend that stays invisible until the end-of-month invoice arrives. For Canadian organisations, the CAD/USD exchange rate adds a further multiplier to every dollar of inefficiency.
This post covers ten cost optimisation tactics that data engineering teams can implement without reducing analytical capability or reliability. Unlike generic AWS cost guidance, these tactics are specific to data engineering workload patterns: Glue ETL, Athena querying, Redshift clusters, S3 storage, and the Lambda functions and Step Functions that orchestrate it all.
1. Right-Size Glue Job DPUs with Profiling Before Commitment
AWS Glue charges per DPU-hour (DPU stands for Data Processing Unit). Each DPU provides 4 vCPUs and 16 GB of memory and costs approximately $0.44 USD per hour. The default Glue job configuration allocates 10 DPUs, yet many jobs run perfectly well on 2-4 DPUs. Running a job on 10 DPUs when 3 suffice wastes up to 70% of your Glue spend on that job, provided the runtime does not grow much on the smaller configuration.
Profile before you fix: enable Glue job metrics in CloudWatch and look at actual CPU and memory utilisation across the DPU fleet during job execution. A job where the worker nodes show 15% CPU utilisation at peak is severely over-provisioned.
For small to medium jobs, try the --number-of-workers and --worker-type parameters:
# AWS CLI to create a right-sized Glue job
aws glue create-job \
  --name sales-daily-transform \
  --role AWSGlueServiceRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://scripts/transform.py"}' \
  --number-of-workers 4 \
  --worker-type G.1X \
  --glue-version 4.0
G.1X workers (1 DPU each: 4 vCPUs, 16 GB) are appropriate for most jobs. G.2X workers (2 DPUs each) are needed for memory-intensive operations like large joins or wide pivots. The G.025X worker type (0.25 DPU) is intended for low-volume streaming jobs; very lightweight Python shell jobs are sized separately, at 0.0625 or 1 DPU.
Typical savings from right-sizing: 40-60% of Glue ETL spend.
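The arithmetic behind a right-sizing decision can be sketched in a few lines. This is a rough estimate only: it assumes the $0.44 per DPU-hour rate quoted above, and the job durations in the example are hypothetical numbers you would replace with figures from your own CloudWatch metrics.

```python
GLUE_USD_PER_DPU_HOUR = 0.44  # rate quoted in this post; check your region's price list


def glue_job_cost_usd(dpus: float, runtime_hours: float,
                      rate: float = GLUE_USD_PER_DPU_HOUR) -> float:
    """Cost of one Glue job run: DPUs x hours x price per DPU-hour."""
    return dpus * runtime_hours * rate


def right_sizing_saving(current_dpus: float, current_hours: float,
                        new_dpus: float, new_hours: float) -> float:
    """Fraction of spend saved by moving to the smaller configuration."""
    before = glue_job_cost_usd(current_dpus, current_hours)
    after = glue_job_cost_usd(new_dpus, new_hours)
    return 1 - after / before


if __name__ == "__main__":
    # Hypothetical job: 30 minutes on 10 DPUs versus 40 minutes on 3 DPUs.
    saving = right_sizing_saving(10, 0.5, 3, 40 / 60)
    print(f"Estimated saving: {saving:.0%}")
```

The second pair of arguments matters: a job that slows down on fewer workers saves less than the raw DPU ratio suggests, so measure the new runtime before committing.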
2. Replace Glue with Athena for SQL-Only Transformations
Many Glue jobs do nothing that cannot be done with Athena SQL: they join tables, filter rows, aggregate, and write results. Athena charges per TB of data scanned ($5 USD/TB), not for execution time. For transformations where the SQL scans compressed, partitioned Parquet data, Athena is frequently 70-90% cheaper than equivalent Glue execution.
Migrating a Glue job to an Athena CTAS (Create Table As Select) statement:
-- Glue job replaced by Athena CTAS
CREATE TABLE processed.sales_daily
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['sale_date'],
    external_location = 's3://data/processed/sales_daily/'
)
AS
SELECT
    order_id,
    customer_id,
    product_id,
    SUM(line_total) AS gross_revenue,
    COUNT(*) AS line_items,
    DATE(order_timestamp) AS sale_date  -- partition column must come last
FROM raw.orders
WHERE DATE(order_timestamp) = DATE '{{ds}}'  -- date filter; prunes partitions only if raw.orders is partitioned on this date
GROUP BY 1, 2, 3, 6;
This CTAS runs in Athena for a fraction of the cost of a Glue job doing the same transformation, assuming the source data is partitioned (so Athena prunes to only the target date's data) and stored in Parquet (a compressed columnar format that dramatically reduces scanned bytes).
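To check whether a given job is a good candidate, it helps to compare the two pricing models side by side. The sketch below uses the $5/TB and $0.44/DPU-hour rates quoted in this post. Treating a terabyte as 1024^4 bytes and applying Athena's 10 MB per-query minimum are assumptions you should confirm against the current pricing page, and the workload numbers are hypothetical.

```python
ATHENA_USD_PER_TB = 5.00
GLUE_USD_PER_DPU_HOUR = 0.44
BYTES_PER_TB = 1024 ** 4            # assumption: binary terabyte
ATHENA_MIN_BYTES = 10 * 1024 ** 2   # assumption: 10 MB minimum billed per query


def athena_query_cost_usd(bytes_scanned: int) -> float:
    """Athena bills on bytes scanned, with a per-query minimum."""
    billable = max(bytes_scanned, ATHENA_MIN_BYTES)
    return billable / BYTES_PER_TB * ATHENA_USD_PER_TB


def glue_job_cost_usd(dpus: float, runtime_hours: float) -> float:
    """Glue bills on DPUs multiplied by runtime."""
    return dpus * runtime_hours * GLUE_USD_PER_DPU_HOUR


if __name__ == "__main__":
    # Hypothetical daily transform: 20 GB of Parquet scanned in Athena,
    # versus a 15-minute Glue run on 4 DPUs.
    athena = athena_query_cost_usd(20 * 1024 ** 3)
    glue = glue_job_cost_usd(4, 0.25)
    print(f"Athena: ${athena:.4f}  Glue: ${glue:.4f}")
```

The comparison flips when the query scans uncompressed or unpartitioned data, because bytes scanned, not runtime, drives the Athena side.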
3. Enforce Athena Partition Pruning with Partition Projection
Teams that use Amazon Athena without partition projection often discover that their "cheap" Athena queries are scanning entire datasets rather than specific partitions. A query that filters on event_date = '2024-01-19' should scan only that day's data, but without partition projection or a properly configured Glue catalog, Athena may scan all partitions to find matching rows.
Partition projection is a Glue Data Catalog table property that tells Athena to compute partition paths rather than looking them up in the catalog:
# Set partition projection via boto3 when updating a Glue catalog table.
# Note: update_table replaces the whole table definition, so TableInput must
# carry the complete StorageDescriptor, not just the projection parameters.
import boto3

glue_client = boto3.client('glue')
glue_client.update_table(
    DatabaseName='raw',
    TableInput={
        'Name': 'events',
        'StorageDescriptor': {
            'Location': 's3://data/raw/events/',
            'Columns': [...]  # existing column definitions go here
        },
        'PartitionKeys': [{'Name': 'event_date', 'Type': 'date'}],
        'Parameters': {
            'projection.enabled': 'true',
            'projection.event_date.type': 'date',
            'projection.event_date.range': '2023-01-01,NOW',
            'projection.event_date.format': 'yyyy-MM-dd',
            'projection.event_date.interval': '1',
            'projection.event_date.interval.unit': 'DAYS',
            'storage.location.template': 's3://data/raw/events/${event_date}/'
        }
    }
)
With partition projection configured, Athena never looks up partitions in the catalog. It calculates the S3 path directly from the query filter. This removes the partition-metadata lookup latency on queries against tables with thousands of partitions and ensures your WHERE clause on a date column actually prunes storage.
4. Use Redshift Serverless for Variable Workloads
Many organisations run provisioned Redshift clusters sized for peak analytical demand, which typically occurs for two to four hours per day during morning reporting. For the remaining 20 to 22 hours, the cluster sits largely idle at full cost. A 4-node dc2.8xlarge cluster costs roughly $19 USD per hour at on-demand list prices (about $4.80 per node-hour in us-east-1) whether anyone is running queries or not.
Amazon Redshift Serverless charges per RPU-second consumed, with no charge during idle periods (defined as no queries running for 6 consecutive minutes). For workloads with significant idle time, Serverless is substantially cheaper. For continuously loaded workloads running 18+ hours per day, provisioned instances with Reserved pricing may still win.
To evaluate: export your Redshift query history from stl_query and analyse the distribution of query activity across the day. STL tables retain only a few days of history, so collect at least one representative week. If fewer than 40% of hours show query activity, Serverless is likely cheaper. If over 70% of hours show continuous activity, provisioned clusters with reserved pricing are more cost-effective.
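One way to turn exported query history into that percentage is sketched below. It assumes you have already loaded the query start times (for example, the starttime column of stl_query) into Python datetime objects; the function names are illustrative, and the thresholds are the 40% and 70% rules of thumb from this post. It counts an hour as active if at least one query started in it, which understates activity for queries that run across hour boundaries.

```python
from datetime import datetime
from typing import Iterable


def active_hour_fraction(query_start_times: Iterable[datetime],
                         period_days: int) -> float:
    """Fraction of clock hours in the period in which at least one query started."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    active_hours = {
        ts.replace(minute=0, second=0, microsecond=0) for ts in query_start_times
    }
    return len(active_hours) / (period_days * 24)


def recommend(fraction: float) -> str:
    """Apply the rule-of-thumb thresholds from this post."""
    if fraction < 0.40:
        return "serverless likely cheaper"
    if fraction > 0.70:
        return "provisioned with reserved pricing likely cheaper"
    return "inconclusive: model both options"


if __name__ == "__main__":
    # Hypothetical sample: three queries on one day, two in the same hour.
    sample = [
        datetime(2024, 1, 19, 8, 5),
        datetime(2024, 1, 19, 8, 40),
        datetime(2024, 1, 19, 9, 1),
    ]
    fraction = active_hour_fraction(sample, period_days=1)
    print(f"{fraction:.1%} of hours active: {recommend(fraction)}")
```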
See Redshift Cost Tuning for a detailed treatment of provisioned vs. serverless economics and the specific tuning parameters that affect Redshift spend.
5. Implement S3 Intelligent-Tiering for All Data Lake Objects
Amazon S3 Standard storage costs approximately $0.025 USD/GB/month in ca-central-1. Objects older than 90 days in a data lake are typically accessed far less often than recently ingested data, and can be moved automatically to lower-cost storage tiers using S3 Intelligent-Tiering.
S3 Intelligent-Tiering monitors object access patterns and automatically moves objects between Frequent Access ($0.025/GB), Infrequent Access ($0.0138/GB), and Archive Instant Access ($0.004/GB) tiers. For raw data that is heavily queried when fresh but rarely queried after 60-90 days, Intelligent-Tiering can reduce storage costs by 40-60% with zero change to how the data is accessed.
# Move objects under raw/ to Intelligent-Tiering via a bucket lifecycle rule.
# The rule covers existing and new objects under the prefix. Note that this
# call replaces any lifecycle configuration already set on the bucket.
import boto3

s3 = boto3.client('s3', region_name='ca-central-1')
s3.put_bucket_lifecycle_configuration(
    Bucket='your-data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'intelligent-tiering-raw',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'raw/'},  # widen or add rules to cover other prefixes
                'Transitions': [
                    {
                        'Days': 0,  # transition as soon as the rule is next evaluated
                        'StorageClass': 'INTELLIGENT_TIERING'
                    }
                ]
            }
        ]
    }
)
Objects smaller than 128 KB can be stored in Intelligent-Tiering but are never monitored or moved to a cheaper tier, so they are always billed at the Frequent Access rate. For data lakes that produce many small files, the small-file problem (addressed in tactic 7) therefore also affects your Intelligent-Tiering economics.
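A quick way to sanity-check the 40-60% figure for your own lake is to price a tier mix by hand. The sketch below uses the per-GB rates quoted in this post. The split across tiers is a hypothetical example, and the calculation ignores the per-object monitoring fee that Intelligent-Tiering charges, so treat the result as an upper bound on the saving.

```python
# Per-GB monthly rates quoted in this post (USD); verify against current pricing.
TIER_USD_PER_GB_MONTH = {
    "frequent": 0.025,
    "infrequent": 0.0138,
    "archive_instant": 0.004,
}


def monthly_storage_cost_usd(gb_by_tier: dict) -> float:
    """Sum storage cost across Intelligent-Tiering access tiers."""
    return sum(TIER_USD_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())


def saving_vs_all_frequent(gb_by_tier: dict) -> float:
    """Fraction saved compared with keeping every GB in the Frequent Access tier."""
    total_gb = sum(gb_by_tier.values())
    baseline = total_gb * TIER_USD_PER_GB_MONTH["frequent"]
    return 1 - monthly_storage_cost_usd(gb_by_tier) / baseline


if __name__ == "__main__":
    # Hypothetical 10,000 GB lake: 20% hot, 30% infrequent, 50% archive instant.
    mix = {"frequent": 2000, "infrequent": 3000, "archive_instant": 5000}
    print(f"${monthly_storage_cost_usd(mix):.2f}/month, "
          f"saving {saving_vs_all_frequent(mix):.0%}")
```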
6. Set Budget Alerts Before New Services Go to Production
The cheapest AWS cost optimisation tactic is also the most commonly skipped: setting up AWS Budgets alerts before new data services are deployed. A $500/month alert on a new Glue development workload catches the developer who left a long-running job running over a weekend before it becomes a $2,000 surprise.
Configure budgets at three levels: account-level (total monthly spend), service-level (individual budgets for Glue, Athena, Redshift, S3), and project/team-level (using cost allocation tags). Use AWS Cost Anomaly Detection for machine-learning-based detection of unusual spend patterns that rule-based budgets would miss.
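As a minimal sketch of the account-level case, the function below builds the request for a monthly cost budget with an email alert. The account ID, budget name, and email address are placeholders, and the 80% threshold is an assumption rather than a recommendation from this post. The function only assembles a dictionary, so it can be reviewed or unit-tested without AWS credentials.

```python
def build_budget_request(account_id: str, name: str, monthly_limit_usd: float,
                         alert_email: str, threshold_pct: float = 80.0) -> dict:
    """Assemble keyword arguments for the AWS Budgets create_budget call."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": f"{monthly_limit_usd:.2f}", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                # Alert when actual spend passes threshold_pct of the limit.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_pct,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": alert_email}
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    request = build_budget_request(
        "123456789012", "glue-dev-monthly", 500, "data-team@example.com"
    )
    print(request["Budget"])
```

With boto3 installed and credentials configured, the request can be submitted with boto3.client("budgets").create_budget(**request). Service-level and tag-level budgets follow the same shape with cost filters added.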
7. Compact Small Files in Your Data Lake
Streaming pipelines and frequent small-batch loads produce many small S3 files. A table with 50,000 files of 1 MB each is dramatically more expensive to query than the same 50 GB stored in 500 files of 100 MB each, even though the total data size is identical. Athena charges the same per TB scanned, but queries against fragmented tables incur higher per-query overhead from S3 list operations and parallel file reads.
A weekly Glue compaction job consolidates small files:
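# Glue compaction job pattern
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Partition to compact, passed as the job argument --target_date
target_date = getResolvedOptions(sys.argv, ['target_date'])['target_date']
source_path = f's3://data/raw/events/event_date={target_date}/'
# Write to a staging prefix (illustrative path): overwriting the path Spark is
# still reading from can delete the input files before they have been read.
staging_path = f's3://data/raw/events_compacted/event_date={target_date}/'

# Read existing partition
df = spark.read.parquet(source_path)

# Repartition to target file size (~128MB per file)
target_file_size_mb = 128
# estimate_partition_size_mb is a helper you supply, e.g. by summing S3 object sizes
total_size_mb = estimate_partition_size_mb(target_date)
num_partitions = max(1, int(total_size_mb / target_file_size_mb))
df.repartition(num_partitions) \
    .write \
    .mode('overwrite') \
    .parquet(staging_path)
# After validating row counts, swap the staging output in for the original files.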
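Compaction reduces Athena query times by 30-70% for commonly fragmented partitions. Because Athena bills on bytes scanned, the cost reduction comes mainly from fewer S3 requests per query rather than from the Athena charge itself.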
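The compaction job needs a partition count derived from the total size of the files being merged. One way to compute it is sketched below, assuming you can obtain the object sizes in bytes (for example, from the Size field of an S3 listing). The function name is illustrative and not part of any AWS library.

```python
import math
from typing import Iterable


def plan_output_files(file_sizes_bytes: Iterable[int],
                      target_file_mb: int = 128) -> int:
    """Number of output files needed to land near the target file size."""
    total_mb = sum(file_sizes_bytes) / (1024 * 1024)
    # Round up so no output file exceeds the target; always write at least one.
    return max(1, math.ceil(total_mb / target_file_mb))


if __name__ == "__main__":
    # The fragmented table from the example above: 50,000 files of 1 MB each.
    one_mb = 1024 * 1024
    print(plan_output_files([one_mb] * 50_000))
```

Rounding up rather than truncating keeps every file at or below the target size, at the cost of one slightly smaller file.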
8. Use Step Functions Express Workflows for High-Frequency Orchestration
AWS Step Functions has two workflow types with very different pricing. Standard Workflows charge per state transition ($0.025 per 1,000 transitions). Express Workflows charge per request ($1.00 per million) plus execution duration (about $0.0000167 per GB-second of memory). For orchestration of frequent, short-duration pipelines (running every minute or every 5 minutes), Express Workflows are dramatically cheaper. A pipeline that runs every minute, taking 30 seconds and executing 10 state transitions, makes about 43,200 executions and 432,000 transitions in a 30-day month: roughly $10.80/month on Standard vs. about $1.40/month on Express at the minimum 64 MB memory allocation, before any free tier.
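The comparison for a pipeline that runs every minute can be worked through directly. The sketch below assumes us-east-1 list prices ($0.025 per 1,000 Standard transitions, $1.00 per million Express requests, $0.00001667 per GB-second of Express duration), a 30-day month, and the minimum 64 MB Express memory allocation. It ignores the free tier, so check the numbers against the current pricing page before relying on them.

```python
STANDARD_USD_PER_TRANSITION = 0.025 / 1000
EXPRESS_USD_PER_REQUEST = 1.00 / 1_000_000
EXPRESS_USD_PER_GB_SECOND = 0.00001667  # assumed first-tier duration price


def standard_monthly_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard Workflows: pay per state transition."""
    return executions * transitions_per_execution * STANDARD_USD_PER_TRANSITION


def express_monthly_cost(executions: int, duration_seconds: float,
                         memory_gb: float = 0.0625) -> float:
    """Express Workflows: pay per request plus GB-seconds of duration."""
    request_charge = executions * EXPRESS_USD_PER_REQUEST
    duration_charge = (
        executions * duration_seconds * memory_gb * EXPRESS_USD_PER_GB_SECOND
    )
    return request_charge + duration_charge


if __name__ == "__main__":
    executions = 60 * 24 * 30  # one run per minute for 30 days
    standard = standard_monthly_cost(executions, transitions_per_execution=10)
    express = express_monthly_cost(executions, duration_seconds=30)
    print(f"Standard: ${standard:.2f}/month  Express: ${express:.2f}/month")
```

Express Workflows also cap execution time at five minutes, so the comparison only applies to pipelines short enough to qualify.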
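9. Pause Redshift Development Clusters When Idle
Provisioned (non-serverless) Redshift clusters for development and staging environments should be paused whenever they are not in use. A development Redshift cluster running 24/7 costs roughly $500-1,500 CAD per month depending on node type. Redshift does not pause a provisioned cluster on inactivity by itself, but it supports pause and resume through the console, the API, and scheduled actions; while a cluster is paused, compute billing stops and you pay only for storage. With a scheduled pause outside business hours (or custom automation that pauses after 30 minutes of inactivity), a typical development cluster that is actively used during business hours and idle overnight costs 60-70% less.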
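A quick estimate of the saving from keeping a development cluster paused outside working hours is sketched below. The working pattern (10 hours a day, 22 days a month) is a hypothetical example, 730 is the usual average number of hours in a month, and the calculation ignores the storage charges that continue while a cluster is paused.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def paused_saving_fraction(active_hours_per_day: float,
                           active_days_per_month: int) -> float:
    """Fraction of compute cost avoided by pausing outside active hours."""
    running_hours = active_hours_per_day * active_days_per_month
    if running_hours > HOURS_PER_MONTH:
        raise ValueError("active hours exceed the hours in a month")
    return 1 - running_hours / HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical pattern: cluster running 10 hours a day on 22 working days.
    print(f"Compute saving: {paused_saving_fraction(10, 22):.0%}")
```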
10. Audit and Retire Unused Data Assets
The most overlooked source of cost in mature data platforms is data assets that are no longer used: S3 data that no pipeline reads, Glue catalog tables with no recent queries, Redshift tables with no access in 90+ days, and Lambda functions that have not been invoked in months.
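Use Athena query history and Redshift stl_scan to identify tables with no access in 90 days. Both sources retain limited history (Athena keeps about 45 days; Redshift STL tables only a few days), so export them to a longer-lived store on a schedule to cover the full window. Cross-reference the results with S3 storage size. Archive confirmed-unused data to S3 Glacier or delete it entirely. Establish a quarterly data asset review as a standard DataOps practice to prevent unused assets from accumulating.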
A systematic audit of a two-year-old data lake typically identifies 20-35% of storage that is either duplicate, no longer consumed, or already available in a compressed processed form that makes the raw layer redundant.
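The 90-day check described above can be sketched as a small filter over whatever access log you maintain. The sketch assumes you have already built a mapping from asset name to last access date (with None for assets that have never been read); the asset names and function name are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def find_stale_assets(last_access_by_asset: Dict[str, Optional[date]],
                      today: date, max_idle_days: int = 90) -> List[str]:
    """Return assets never accessed, or not accessed within max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, last_access in last_access_by_asset.items()
        if last_access is None or last_access < cutoff
    )


if __name__ == "__main__":
    # Hypothetical access log for three tables.
    access_log = {
        "raw.events": date(2024, 6, 1),
        "raw.legacy_clicks": date(2023, 11, 15),
        "staging.tmp_export": None,
    }
    print(find_stale_assets(access_log, today=date(2024, 6, 10)))
```

Treat the output as a candidate list for review, not a deletion list: an asset read once a quarter or once a year can look stale in a 90-day window.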
Starting Your Optimisation Programme
None of these tactics requires significant architectural change. They are operational improvements that can be implemented incrementally. A reasonable starting point is to spend one sprint with a data engineer focused on tactics 1 (Glue right-sizing), 5 (Intelligent-Tiering), and 6 (Budget alerts). These three alone typically reduce monthly data platform spend by 25-35% in organisations that have not previously focused on cost optimisation.
Infra IT Consulting works with Canadian and international data engineering teams to identify and capture AWS cost savings without compromising platform reliability or analytical capability. Reach out to discuss a cost optimisation assessment for your environment.