Using AWS Spot Instances for Cost-Effective Data Processing

By Infra IT Consulting · March 31, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

EC2 Spot Instances are AWS’s mechanism for selling spare compute capacity at discounts that routinely reach 70–90% below on-demand prices. For data engineering teams running batch workloads — nightly ETL jobs, large-scale data transformations, ML training runs, and ad-hoc analytical queries — Spot Instances represent one of the largest single levers for compute cost reduction available in AWS.

The catch is that Spot Instances can be interrupted. AWS can reclaim capacity with two minutes' notice when demand for that instance type in that Availability Zone increases. This interruption risk is what puts teams off Spot, but for most data processing workloads interruption handling is simpler than it appears, and savings on this scale more than justify the engineering effort.

Understanding Spot Interruption Risk

Spot interruption rates vary by instance family, size, and Availability Zone. The AWS Spot Instance Advisor reports the frequency of interruption for each instance type and Region over the trailing month, grouped into bands (<5%, 5–10%, and so on). Historically, large batch-friendly instance types like m5.4xlarge, r5.2xlarge, and c5.9xlarge have fallen in the under-5% band in most regions. Because that figure covers a whole month, the chance of any given instance being interrupted during a single one-hour job is far smaller still, so such a job has well over a 95% chance of completing without interruption.
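To make the link between a monthly interruption rate and the risk to a single job concrete, here is a rough sketch of the arithmetic. It assumes interruptions arrive at a constant rate through the month and that instances fail independently. Neither assumption is guaranteed in practice, since reclamations tend to cluster when capacity tightens, so treat the result as an order-of-magnitude guide. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average month length, an approximation


def survival_probability(monthly_interruption_rate: float, job_hours: float) -> float:
    """Chance one instance survives job_hours, assuming a constant hazard rate."""
    monthly_survival = 1.0 - monthly_interruption_rate
    return monthly_survival ** (job_hours / HOURS_PER_MONTH)


def all_survive(monthly_interruption_rate: float, job_hours: float, nodes: int) -> float:
    """Chance that none of `nodes` independent instances is interrupted."""
    return survival_probability(monthly_interruption_rate, job_hours) ** nodes


if __name__ == "__main__":
    # A 5% monthly rate applied to a one-hour job on a single instance
    print(f"1 node, 1 h:   {survival_probability(0.05, 1):.5f}")
    # The same rate applied to ten task nodes over a three-hour job
    print(f"10 nodes, 3 h: {all_survive(0.05, 3, 10):.5f}")
```

The second figure is the more useful one for cluster sizing: the more Spot nodes a job uses and the longer it runs, the more likely it is that at least one of them is reclaimed.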

The risk profile changes significantly with instance family choice:

Latest-generation, popular instance types (m5.xlarge, c5.2xlarge) — higher interruption risk because demand is high
Older-generation or less common sizes (m4.10xlarge, r4.16xlarge) — lower interruption risk because spare capacity is more abundant
Graviton instances (m6g, r6g, c6g) — often the best combination of spot availability, low interruption rate, and price

The practical implication: diversify across multiple instance types and multiple Availability Zones. An EMR cluster or auto-scaling group that is configured to use only m5.xlarge in us-east-1a will be interrupted more frequently than one that accepts m5.xlarge, m5a.xlarge, m4.xlarge, and m5.large across three Availability Zones.
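One way to express this diversification on EMR is an instance fleet, which lets a single task fleet draw from several instance types. The sketch below only builds the fleet definition as a plain dictionary, using field names from the EMR RunJobFlow API. The fleet name, the 20-minute timeout, the vCPU-based weights, and the choice of allocation strategy are illustrative assumptions, not recommendations from the original configuration.

```python
from typing import Any


def build_task_fleet(weights: dict[str, int], target_spot_units: int) -> dict[str, Any]:
    """Build an EMR task instance fleet spec that spreads Spot across several types.

    `weights` maps instance type to weighted capacity (here, vCPUs per instance).
    """
    return {
        "Name": "spot-task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_units,
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": weight}
            for itype, weight in weights.items()
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 20,            # assumed value
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # fall back if Spot is unavailable
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


if __name__ == "__main__":
    # The four instance types named in the text, weighted by vCPU count
    fleet = build_task_fleet(
        {"m5.xlarge": 4, "m5a.xlarge": 4, "m4.xlarge": 4, "m5.large": 2},
        target_spot_units=40,
    )
    print([c["InstanceType"] for c in fleet["InstanceTypeConfigs"]])
```

A fleet like this would be passed in the `InstanceFleets` list of a `run_job_flow` call, with several subnets supplied via `Ec2SubnetIds`. EMR then picks one of those subnets, and therefore one Availability Zone, at launch.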

Spot Instances for EMR: The Core Pattern

Amazon EMR has native Spot Instance integration with a well-tested pattern for mixed on-demand and spot clusters:

Master node: always on-demand. The YARN ResourceManager and HDFS NameNode (for EMR with HDFS) run on the master. Master interruption terminates the cluster.
Core nodes: on-demand or reserved. Core nodes in EMR run the HDFS DataNode for any HDFS storage (though for S3-based workloads, core nodes do not hold critical data). Running core nodes on-demand provides cluster stability.
Task nodes: Spot. Task nodes are pure compute — they run YARN containers but hold no HDFS data. They can be added and removed dynamically, and their interruption simply causes YARN to reschedule their tasks on surviving nodes.

This mixed configuration gives you Spot savings on the majority of the cluster compute while maintaining stability through on-demand master and core nodes.

Example EMR cluster configuration using Terraform:

resource "aws_emr_cluster" "data_processing" {
  name          = "batch-data-processing"
  release_label = "emr-6.15.0"
  applications  = ["Spark"]
  service_role  = aws_iam_role.emr_service_role.arn

  ec2_attributes {
    subnet_id                         = var.subnet_id
    instance_profile                  = aws_iam_instance_profile.emr_profile.arn
    emr_managed_master_security_group = aws_security_group.emr_master.id
    emr_managed_slave_security_group  = aws_security_group.emr_slave.id
  }

  master_instance_group {
    instance_type = "m5.xlarge"  # On-demand for stability (no bid_price set)
  }

  core_instance_group {
    instance_type  = "m5.2xlarge"  # On-demand core nodes
    instance_count = 2
  }

  # Spark properties go in the "spark-defaults" classification
  configurations_json = jsonencode([{
    Classification = "spark-defaults"
    Properties = {
      "spark.speculation" = "true"  # Enable speculative execution for Spot resilience
    }
  }])
}

# Task nodes are declared as a separate instance group attached to the cluster;
# aws_emr_cluster itself only accepts master and core instance group blocks.
resource "aws_emr_instance_group" "spot_tasks" {
  name           = "spot-task-nodes"
  cluster_id     = aws_emr_cluster.data_processing.id
  instance_type  = "m5.4xlarge"
  instance_count = 10
  bid_price      = "0.50"  # Max bid; actual price is current Spot price

  ebs_config {
    size                 = 100
    type                 = "gp3"
    volumes_per_instance = 1
  }
}

With this configuration, a cluster with 2 on-demand m5.2xlarge core nodes and 10 Spot m5.4xlarge task nodes pays on-demand rates only for the core and master nodes. An m5.4xlarge costs about $0.768/hour on-demand at us-east-1 list prices (other regions, including ca-central-1, are priced somewhat differently, so check current pricing) and approximately $0.16–$0.23/hour on Spot, a 70–79% saving on task compute.

Spot Instances for AWS Glue: Job Flex Execution

AWS Glue introduced Flex execution for jobs where latency is less critical than cost. Glue Flex jobs use Spot Instances for the Spark executors, offering up to 34% cost reduction compared to standard Glue jobs. Flex jobs are appropriate for:

Nightly batch ETL jobs with flexible completion windows
Data quality scans and profiling jobs
Historical backfills where speed is less important than cost

Flex execution is configured at the job level in Glue Studio or via the API:

# Boto3: create Glue job with Flex execution
import boto3

client = boto3.client('glue', region_name='ca-central-1')

response = client.create_job(
    Name='nightly-transaction-etl',
    Role='arn:aws:iam::123456789012:role/GlueRole',  # placeholder 12-digit account ID
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/transaction_etl.py',
        'PythonVersion': '3'
    },
    ExecutionProperty={
        'MaxConcurrentRuns': 1
    },
    ExecutionClass='FLEX',       # <-- Flex execution uses Spot
    NumberOfWorkers=20,
    WorkerType='G.1X',
    GlueVersion='4.0'
)

Note that Flex is available only for Spark (glueetl) jobs on Glue 3.0 or later. It cannot be used for streaming jobs, jobs with a timeout under 960 minutes, or work run on development endpoints.

Designing for Interruption Tolerance

The key to reliable Spot-based data processing is designing pipelines that can tolerate and recover from interruption. Several patterns make this achievable.

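Checkpoint and resume. Spark Structured Streaming writes its progress (offsets and state) to a checkpoint location on S3, so a restarted query resumes from the last committed batch. Batch Spark jobs do not resume automatically in the same way. The practical equivalent is to split a long job into stages that each persist their output to S3, so that a cluster restarted after a Spot interruption can skip the stages already completed rather than reprocessing from the beginning.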
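For batch pipelines, stage-level resume can be as simple as a completion marker per stage. The following is a minimal sketch of that idea, not a description of any built-in Spark or EMR feature. A local directory stands in for an S3 prefix so the example runs anywhere, and the function name, stage names, and `._SUCCESS` marker suffix are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_with_resume(stages: dict[str, Callable[[], None]], marker_dir: Path) -> list[str]:
    """Run stages in order, skipping any that already left a completion marker."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, stage in stages.items():
        marker = marker_dir / f"{name}._SUCCESS"
        if marker.exists():
            continue  # finished before the interruption, so do not redo it
        stage()
        marker.touch()  # written only after the stage completes
        executed.append(name)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stages = {"extract": lambda: None, "transform": lambda: None}
        print(run_with_resume(stages, Path(tmp)))  # first run executes both stages
        print(run_with_resume(stages, Path(tmp)))  # rerun finds markers and skips both
```

Writing the marker only after the stage returns is the important design choice: a stage cut short by an interruption leaves no marker and is rerun in full on the next attempt.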

Speculative execution. Spark's speculative execution (spark.speculation = true) launches duplicate copies of slow tasks on other executors. Spark reschedules tasks lost to a Spot interruption regardless of this setting. What speculation adds is protection against stragglers, for example tasks slowed on a node that is being reclaimed or on the reduced cluster afterwards, which limits the performance impact of individual instance interruptions.

Output to S3 with atomic writes. Never write pipeline output to HDFS or local disk — always write to S3. Use partition-overwrite mode (spark.sql.sources.partitionOverwriteMode = dynamic) so that jobs that are interrupted and restarted overwrite only the partitions they were processing, rather than partially overwriting and corrupting existing data.
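The Spark settings named in the last two patterns can be collected in one place and passed to spark-submit. This is a sketch only: the quantile and multiplier values are illustrative rather than tuned recommendations, and the helper function is hypothetical. The property names themselves are standard Spark configuration keys.

```python
SPOT_SPARK_CONF = {
    "spark.speculation": "true",
    "spark.speculation.quantile": "0.9",   # illustrative: wait until 90% of tasks finish
    "spark.speculation.multiplier": "2",   # illustrative: "slow" means 2x the median
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
}


def spark_submit_args(script: str, conf: dict[str, str]) -> list[str]:
    """Render a conf dict as a spark-submit argument list."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


if __name__ == "__main__":
    # Script path reused from the Glue example; substitute your own
    print(" ".join(spark_submit_args("s3://my-bucket/scripts/transaction_etl.py", SPOT_SPARK_CONF)))
```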

Spot interruption notices via EventBridge. AWS sends an EC2 Spot interruption notice 2 minutes before reclamation via EC2 instance metadata and via EventBridge events. Configure an EventBridge rule to trigger a Lambda function that flags the impending interruption, allowing graceful task shutdown and state flushing if your pipeline supports it.
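A minimal sketch of the EventBridge side might look like the following. The event pattern matches the interruption warning that EC2 publishes, and the handler only extracts and logs the affected instance. What to do next (draining a node, flushing state) depends on the pipeline and is left out. The handler name and the shape of the returned dictionary are assumptions for illustration, and the sample event uses placeholder identifiers.

```python
import json

# Event pattern for an EventBridge rule that matches Spot interruption warnings
SPOT_WARNING_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def handler(event: dict, context: object = None) -> dict:
    """Lambda entry point: record which instance is about to be reclaimed."""
    detail = event.get("detail", {})
    notice = {
        "instance_id": detail.get("instance-id"),
        "action": detail.get("instance-action"),
        "notified_at": event.get("time"),
    }
    print(json.dumps(notice))  # ends up in CloudWatch Logs when run in Lambda
    return notice


if __name__ == "__main__":
    sample_event = {
        "detail-type": "EC2 Spot Instance Interruption Warning",
        "source": "aws.ec2",
        "time": "2024-03-31T02:15:00Z",
        "detail": {"instance-id": "i-1234567890abcdef0", "instance-action": "terminate"},
    }
    handler(sample_event)
```

The pattern would be supplied as the `EventPattern` of a rule (serialised with `json.dumps`), with the Lambda function registered as the rule's target.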

What Workloads Are Not Good Spot Candidates

Spot Instances are not universally appropriate. Avoid Spot for:

Latency-sensitive pipelines with hard SLAs. If a pipeline must complete by 6am every morning with zero tolerance for delay, build it on on-demand instances and consider Spot only for jobs with flexible windows.
Long-running streaming jobs. Kinesis consumers, Kafka consumers, and Spark Structured Streaming jobs on Spot need careful checkpoint design. Interruption of a streaming job that has not checkpointed recently can cause data reprocessing and consumer group lag.
Stateful jobs without interruption handling. If the job cannot be safely restarted from the point of interruption, Spot interruptions cause full reruns. Whether a full rerun is acceptable depends on the job’s duration and the frequency of interruptions.

Cost Modelling: Spot vs. On-Demand for a Realistic Workload

Consider a nightly batch job that processes 200 GB of transaction data on a 10-node EMR cluster (1 on-demand m5.xlarge master, 2 on-demand m5.2xlarge core nodes, and 7 Spot m5.4xlarge task nodes), running for 3 hours per night.

Component	Instances	On-Demand $/hr	Spot $/hr (est.)	Daily Cost (3 hrs)
Master (m5.xlarge)	1	$0.192	—	$0.58
Core (m5.2xlarge)	2	$0.384	—	$2.30
Task (m5.4xlarge)	7	$0.768	$0.185	$3.89
Total (with Spot tasks)				$6.77/night
Total (all on-demand)				$19.01/night

Annualised, the Spot-based cluster costs approximately $2,470 versus $6,940 for an all-on-demand equivalent (the seven task nodes alone would cost $16.13 per night on-demand), a saving of about $4,470/year on a single job. Across a portfolio of 20–30 batch jobs, Spot Instance savings commonly reach $80,000–$150,000 CAD per year for mid-size data teams.
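The arithmetic behind this comparison is easy to rerun with your own rates. The sketch below recomputes the nightly and annual totals from the hourly prices in the table. It assumes a 365-night year and a fixed 3-hour runtime, and it ignores EMR service fees, EBS, and any reruns caused by interruptions.

```python
HOURS_PER_NIGHT = 3
NIGHTS_PER_YEAR = 365

# (role, instance count, on-demand $/hr, Spot $/hr or None when always on-demand)
NODES = [
    ("master m5.xlarge", 1, 0.192, None),
    ("core m5.2xlarge", 2, 0.384, None),
    ("task m5.4xlarge", 7, 0.768, 0.185),
]


def nightly_cost(use_spot: bool) -> float:
    """Total cluster cost for one night, optionally using Spot where a rate exists."""
    total = 0.0
    for _role, count, on_demand, spot in NODES:
        rate = spot if (use_spot and spot is not None) else on_demand
        total += count * rate * HOURS_PER_NIGHT
    return total


if __name__ == "__main__":
    spot, on_demand = nightly_cost(True), nightly_cost(False)
    print(f"With Spot tasks: ${spot:.2f}/night, ${spot * NIGHTS_PER_YEAR:,.0f}/year")
    print(f"All on-demand:   ${on_demand:.2f}/night, ${on_demand * NIGHTS_PER_YEAR:,.0f}/year")
    print(f"Annual saving:   ${(on_demand - spot) * NIGHTS_PER_YEAR:,.0f}")
```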

For a broader view of cost optimisation levers across the AWS data stack, see AWS Cost Optimisation for Data Teams and our analysis of FinOps for Data Engineering.

Conclusion

AWS Spot Instances offer the most aggressive compute cost savings available in AWS, and for data engineering batch workloads, the interruption risk is manageable with straightforward engineering practices. The combination of mixed on-demand/Spot EMR clusters, Glue Flex execution, Spark speculative execution, and S3-backed checkpointing gives you the economics of Spot with an acceptable reliability profile.

Infra IT Consulting helps data engineering teams across Canada, the UK, and Africa design cost-optimised pipeline architectures that leverage Spot Instances effectively. If your batch compute costs are higher than they should be, contact us to discuss a cost architecture review.
