Using AWS Spot Instances for Cost-Effective Data Processing
EC2 Spot Instances are AWS's mechanism for selling spare compute capacity at discounts that routinely reach 70–90% below on-demand prices. For data engineering teams running batch workloads (nightly ETL jobs, large-scale data transformations, ML training runs, and ad-hoc analytical queries), Spot Instances represent one of the largest single levers for compute cost reduction available in AWS.
The catch is that Spot Instances can be interrupted. AWS can reclaim capacity with two minutes' notice when demand for that instance type in that Availability Zone increases. This interruption risk is what puts teams off Spot, but for most data processing workloads, interruption handling is simpler than it appears, and the resulting 60–80% cost savings more than justify the engineering effort.
Understanding Spot Interruption Risk
Spot interruption rates vary by instance family, size, and Availability Zone. AWS publishes Spot Instance Advisor data showing how often each instance type was interrupted over the trailing month. Historically, interruption rates for large batch-friendly instance types like m5.4xlarge, r5.2xlarge, and c5.9xlarge run at less than 5% per month in most regions, meaning that any given instance has a far better than 95% chance of completing a one-hour job without interruption.
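To put a rough number on that claim, here is a back-of-the-envelope sketch. It assumes interruptions are independent and spread evenly across a 30-day month, which is a simplification: the Spot Instance Advisor reports bucketed historical frequencies, not a per-hour hazard rate. The function name and the 30-day month are assumptions made for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def survival_probability(monthly_interruption_rate, job_hours, instances=1):
    """Estimate the chance that every instance runs for job_hours uninterrupted.

    Assumes a constant, independent interruption hazard derived from the
    monthly rate. This is an illustrative model, not an AWS-published formula.
    """
    if not 0.0 <= monthly_interruption_rate < 1.0:
        raise ValueError("monthly_interruption_rate must be in [0, 1)")
    instance_hours = job_hours * instances
    return (1.0 - monthly_interruption_rate) ** (instance_hours / HOURS_PER_MONTH)


if __name__ == "__main__":
    # One instance, one-hour job, 5% monthly interruption rate
    print(f"{survival_probability(0.05, job_hours=1):.5f}")
    # Ten instances, three-hour job: chance that none is interrupted
    print(f"{survival_probability(0.05, job_hours=3, instances=10):.5f}")
```

Under these assumptions a short batch job is very unlikely to see an interruption, but the estimate is only as good as the constant-hazard assumption; real interruptions tend to cluster when capacity tightens.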
The risk profile changes significantly with instance family choice:
- Latest-generation, popular instance types (m5.xlarge, c5.2xlarge): higher interruption risk because demand is high
- Older-generation or less common sizes (m4.10xlarge, r4.16xlarge): lower interruption risk because spare capacity is more abundant
- Graviton instances (m6g, r6g, c6g): often the best combination of spot availability, low interruption rate, and price
The practical implication: diversify across multiple instance types and multiple Availability Zones. An EMR cluster or auto-scaling group that is configured to use only m5.xlarge in us-east-1a will be interrupted more frequently than one that accepts m5.xlarge, m5a.xlarge, m4.xlarge, and m5.large across three Availability Zones.
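On EMR, one way to express this diversification is an instance fleet that lists several instance types. The sketch below only builds the fleet definition as a plain dictionary in the shape accepted under `Instances.InstanceFleets` by boto3's `run_job_flow`; it makes no AWS calls. The helper name, the fleet name, and the vCPU-based weights are assumptions for illustration. Availability Zone diversification would be handled separately, by listing subnets from several zones when the cluster is launched.

```python
def build_task_fleet(weighted_types, target_spot_capacity, timeout_minutes=20):
    """Build a TASK instance fleet that spreads Spot requests across types.

    weighted_types maps an instance type to its weighted capacity
    (here, vCPU count, so a .large counts for half an .xlarge).
    """
    if not weighted_types:
        raise ValueError("at least one instance type is required")
    return {
        "Name": "spot-task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": weight}
            for itype, weight in weighted_types.items()
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # Fall back to on-demand if Spot capacity is not found in time
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


if __name__ == "__main__":
    fleet = build_task_fleet(
        {"m5.xlarge": 4, "m5a.xlarge": 4, "m4.xlarge": 4, "m5.large": 2},
        target_spot_capacity=40,  # 40 vCPUs of Spot task capacity
    )
    print([c["InstanceType"] for c in fleet["InstanceTypeConfigs"]])
```

Weighting by vCPU lets EMR fill the target with whichever mix of sizes is available, rather than treating an m5.large as equivalent to an m5.xlarge.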
Spot Instances for EMR: The Core Pattern
Amazon EMR has native Spot Instance integration with a well-tested pattern for mixed on-demand and spot clusters:
- Master node: always on-demand. The YARN ResourceManager and HDFS NameNode (for EMR with HDFS) run on the master. Master interruption terminates the cluster.
- Core nodes: on-demand or reserved. Core nodes in EMR run the HDFS DataNode for any HDFS storage (though for S3-based workloads, core nodes do not hold critical data). Running core nodes on-demand provides cluster stability.
- Task nodes: Spot. Task nodes are pure compute β they run YARN containers but hold no HDFS data. They can be added and removed dynamically, and their interruption simply causes YARN to reschedule their tasks on surviving nodes.
This mixed configuration gives you Spot savings on the majority of the cluster compute while maintaining stability through on-demand master and core nodes.
Example EMR cluster configuration using Terraform:
resource "aws_emr_cluster" "data_processing" {
  name          = "batch-data-processing"
  release_label = "emr-6.15.0"
  applications  = ["Spark"]
  service_role  = aws_iam_role.emr_service_role.arn

  ec2_attributes {
    subnet_id                         = var.subnet_id
    instance_profile                  = aws_iam_instance_profile.emr_profile.arn
    emr_managed_master_security_group = aws_security_group.emr_master.id
    emr_managed_slave_security_group  = aws_security_group.emr_slave.id
  }

  master_instance_group {
    instance_type = "m5.xlarge" # On-demand for stability
  }

  core_instance_group {
    instance_type  = "m5.2xlarge" # On-demand core nodes
    instance_count = 2
  }

  # Spark properties such as spark.speculation go in the spark-defaults classification
  configurations_json = jsonencode([{
    Classification = "spark-defaults",
    Properties = {
      "spark.speculation" = "true" # Enable speculative execution for Spot resilience
    }
  }])
}

# Task nodes are attached through a separate instance group resource;
# aws_emr_cluster itself has no task_instance_group block.
resource "aws_emr_instance_group" "task" {
  name           = "spot-task-nodes"
  cluster_id     = aws_emr_cluster.data_processing.id
  instance_type  = "m5.4xlarge"
  instance_count = 10
  bid_price      = "0.50" # Max bid; actual price is current Spot price

  ebs_config {
    size                 = 100
    type                 = "gp3"
    volumes_per_instance = 1
  }
}
With this configuration, a cluster with 2 on-demand m5.2xlarge core nodes and 10 Spot m5.4xlarge task nodes pays on-demand rates only for the core and master. An m5.4xlarge costs about $0.768/hour on-demand at us-east-1 list prices (other regions, including ca-central-1, are priced somewhat differently) and approximately $0.16–$0.23/hour on Spot, a 70–79% saving on task compute.
Spot Instances for AWS Glue: Job Flex Execution
AWS Glue introduced Flex execution for jobs where latency is less critical than cost. Glue Flex jobs use Spot Instances for the Spark executors, offering up to 34% cost reduction compared to standard Glue jobs. Flex jobs are appropriate for:
- Nightly batch ETL jobs with flexible completion windows
- Data quality scans and profiling jobs
- Historical backfills where speed is less important than cost
Flex execution is configured at the job level in Glue Studio or via the API:
# Boto3: create Glue job with Flex execution
import boto3

client = boto3.client('glue', region_name='ca-central-1')

response = client.create_job(
    Name='nightly-transaction-etl',
    Role='arn:aws:iam::123456789012:role/GlueRole',  # placeholder 12-digit account ID
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/transaction_etl.py',
        'PythonVersion': '3'
    },
    ExecutionProperty={
        'MaxConcurrentRuns': 1
    },
    ExecutionClass='FLEX',  # <-- Flex execution uses Spot
    NumberOfWorkers=20,
    WorkerType='G.1X',
    GlueVersion='4.0'
)
Note that Flex jobs cannot be used for streaming jobs, jobs with a timeout under 960 minutes, or jobs in Development endpoints.
Designing for Interruption Tolerance
The key to reliable Spot-based data processing is designing pipelines that can tolerate and recover from interruption. Several patterns make this achievable.
Checkpoint and resume. Spark Structured Streaming and Spark's checkpoint mechanism write progress state to S3. If a Spot interruption kills the cluster mid-way through a large Spark job, a restarted cluster can resume from the last checkpoint rather than reprocessing from the beginning.
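The same resume idea can be illustrated outside Spark. The sketch below is not Spark's checkpoint mechanism; it is a simplified stand-in that records completed partitions in a manifest file and skips them on restart. The function name is hypothetical, and a local JSON file is used in place of an S3 object to keep the example self-contained.

```python
import json
from pathlib import Path


def run_with_checkpoints(partitions, process, manifest_path):
    """Process each partition once, recording progress after every success."""
    manifest = Path(manifest_path)
    done = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    for partition in partitions:
        if partition in done:
            continue  # finished before the interruption; skip on resume
        process(partition)  # may raise if the node is reclaimed mid-run
        done.add(partition)
        manifest.write_text(json.dumps(sorted(done)))
    return done
```

If the process dies part-way through, calling `run_with_checkpoints` again with the same manifest path reprocesses only the partitions that had not finished. This works cleanly only when `process` is idempotent for a partition that was cut off mid-write.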
Speculative execution. Spark's speculative execution (spark.speculation = true) launches duplicate copies of slow tasks on other executors. If a Spot interruption kills a task that already has a speculative copy, that copy is running on a surviving node and can finish the work; tasks without one are retried by Spark's normal task rescheduling. This reduces the performance impact of individual instance interruptions.
Output to S3 with atomic writes. Never write pipeline output to HDFS or local disk; always write to S3. Use partition-overwrite mode (spark.sql.sources.partitionOverwriteMode = dynamic) so that jobs that are interrupted and restarted overwrite only the partitions they were processing, rather than partially overwriting and corrupting existing data.
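The two Spark settings mentioned so far can be applied together at submit time. The following is a small sketch that assembles a `spark-submit` argument list with both properties set; the helper name and the constant are assumptions for illustration, and the script path is only a placeholder.

```python
SPOT_FRIENDLY_CONF = {
    "spark.speculation": "true",
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
}


def spark_submit_command(script, extra_conf=None):
    """Return a spark-submit argument list with the Spot-friendly settings."""
    conf = {**SPOT_FRIENDLY_CONF, **(extra_conf or {})}
    cmd = ["spark-submit"]
    for key in sorted(conf):
        cmd += ["--conf", f"{key}={conf[key]}"]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Placeholder script location
    print(" ".join(spark_submit_command("s3://my-bucket/scripts/transaction_etl.py")))
```

Keeping the settings in one dictionary makes it easy to reuse them across jobs, or to pass the same values as an EMR configuration instead of command-line flags.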
Spot interruption notices via EventBridge. AWS sends an EC2 Spot interruption notice 2 minutes before reclamation via EC2 instance metadata and via EventBridge events. Configure an EventBridge rule to trigger a Lambda function that flags the impending interruption, allowing graceful task shutdown and state flushing if your pipeline supports it.
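A minimal sketch of the Lambda side might look like the following. The event pattern matches the `EC2 Spot Instance Interruption Warning` events that EC2 publishes to EventBridge; what the function does with the notice (here it only logs and returns it) is a placeholder, and a real pipeline would substitute its own shutdown or state-flushing logic.

```python
import json

# Event pattern to attach to the EventBridge rule
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def lambda_handler(event, context):
    """Extract the affected instance from a Spot interruption warning."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return {"handled": False}
    detail = event.get("detail", {})
    notice = {
        "handled": True,
        "instance_id": detail.get("instance-id"),
        "action": detail.get("instance-action"),
        "region": event.get("region"),
        "time": event.get("time"),
    }
    # Placeholder action: log the notice so it appears in CloudWatch Logs
    print(json.dumps(notice))
    return notice
```

From here the handler could, for example, write a marker that the job polls, or tell the cluster manager to decommission the node before it is reclaimed.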
What Workloads Are Not Good Spot Candidates
Spot Instances are not universally appropriate. Avoid Spot for:
- Latency-sensitive pipelines with hard SLAs. If a pipeline must complete by 6am every morning with zero tolerance for delay, build it on on-demand instances and consider Spot only for jobs with flexible windows.
- Long-running streaming jobs. Kinesis consumers, Kafka consumers, and Spark Structured Streaming jobs on Spot need careful checkpoint design. Interruption of a streaming job that has not checkpointed recently can cause data reprocessing and consumer group lag.
- Stateful jobs without interruption handling. If the job cannot be safely restarted from the point of interruption, Spot interruptions cause full reruns. Whether a full rerun is acceptable depends on the job's duration and the frequency of interruptions.
Cost Modelling: Spot vs. On-Demand for a Realistic Workload
Consider a nightly batch job that processes 200 GB of transaction data using a 10-node EMR cluster (1 on-demand m5.xlarge master, 2 on-demand m5.2xlarge core nodes, 7 Spot m5.4xlarge task nodes), running for 3 hours per night.
| Component | Instances | On-Demand $/hr | Spot $/hr (est.) | Daily Cost |
|---|---|---|---|---|
| Master (m5.xlarge) | 1 | $0.192 | n/a | $0.58 |
| Core (m5.2xlarge) | 2 | $0.384 | n/a | $2.30 |
| Task (m5.4xlarge) | 7 | $0.768 | $0.185 | $3.89 |
| Total (with Spot tasks) | | | | $6.77/night |
| Total (all on-demand) | | | | $19.01/night |
Annualised, the Spot-based cluster costs approximately $2,470 versus $6,940 for an all-on-demand equivalent, a saving of roughly $4,470/year on a single job. Across a portfolio of 20–30 batch jobs, Spot Instance savings commonly reach $80,000–$150,000 CAD per year for mid-size data teams.
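The arithmetic behind a model like this is easy to reproduce and adapt. The sketch below recomputes nightly and annual totals from the per-instance hourly rates listed in the table; the class and function names are illustrative, and the rates are the estimates quoted above rather than live prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeGroup:
    name: str
    count: int
    on_demand_rate: float               # $/hour per instance
    spot_rate: Optional[float] = None   # $/hour per instance, if run on Spot


CLUSTER = [
    NodeGroup("master (m5.xlarge)", 1, 0.192),
    NodeGroup("core (m5.2xlarge)", 2, 0.384),
    NodeGroup("task (m5.4xlarge)", 7, 0.768, spot_rate=0.185),
]


def nightly_cost(groups, hours, use_spot):
    """Sum count * rate * hours, using the Spot rate where one is defined."""
    total = 0.0
    for group in groups:
        on_spot = use_spot and group.spot_rate is not None
        rate = group.spot_rate if on_spot else group.on_demand_rate
        total += group.count * rate * hours
    return total


if __name__ == "__main__":
    spot = nightly_cost(CLUSTER, hours=3, use_spot=True)
    on_demand = nightly_cost(CLUSTER, hours=3, use_spot=False)
    print(f"Spot tasks:    ${spot:.2f}/night, ${spot * 365:,.0f}/year")
    print(f"All on-demand: ${on_demand:.2f}/night, ${on_demand * 365:,.0f}/year")
    print(f"Saving:        {100 * (1 - spot / on_demand):.0f}%")
```

Swapping in your own node counts, runtimes, and current Spot prices gives a quick first estimate before committing to an architecture change.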
For a broader view of cost optimisation levers across the AWS data stack, see AWS Cost Optimisation for Data Teams and our analysis of FinOps for Data Engineering.
Conclusion
AWS Spot Instances offer the most aggressive compute cost savings available in AWS, and for data engineering batch workloads, the interruption risk is manageable with straightforward engineering practices. The combination of mixed on-demand/Spot EMR clusters, Glue Flex execution, Spark speculative execution, and S3-backed checkpointing gives you the economics of Spot with an acceptable reliability profile.
Infra IT Consulting helps data engineering teams across Canada, the UK, and Africa design cost-optimised pipeline architectures that leverage Spot Instances effectively. If your batch compute costs are higher than they should be, contact us to discuss a cost architecture review.