Migrating from On-Prem Hadoop to AWS: Lessons from the Field
Hadoop had a decade in the sun. Between 2008 and 2018, clusters running HDFS, YARN, Hive, Pig, and Spark were the default answer for organisations that needed to process data at scale on commodity hardware. Today, most of those clusters are ageing liabilities: underutilised nodes sitting on data centre floors, supported by a shrinking pool of Hadoop specialists, and accumulating maintenance debt that grows faster than the business value they deliver.
The migration from on-premises Hadoop to AWS is one of the most technically complex cloud journeys a data team can undertake. This post captures the lessons we have gathered from field migrations: what works, what does not, and where projects go sideways.
The Hadoop Estate Is Rarely What It Appears
Before a single byte moves to AWS, every Hadoop migration needs a thorough estate assessment. In our experience, organisations consistently underestimate three things about their clusters.
How much data is actually used. HDFS clusters frequently contain petabytes of data that has not been queried in 12 to 36 months. One client we worked with had 480 TB in HDFS; after a 90-day access log analysis, roughly 310 TB had not been touched since the original ingest. Migrating that cold data to S3 Glacier Instant Retrieval rather than S3 Standard cut their storage bill by 60% compared to a naive lift.
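As a rough sketch of that kind of access analysis, the function below totals hot and cold bytes from a tab-separated inventory. The inventory format (path, size in bytes, last access time as epoch seconds) and the function name are assumptions for illustration; how you derive last-access times, for example from HDFS audit logs, depends on your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical inventory format, one file per line:
#   path <TAB> size_bytes <TAB> last_access_epoch_seconds
# Reads the inventory on stdin and prints cold/hot byte totals.
summarise_cold_data() {
    local cutoff_epoch="$1"   # files last accessed before this are "cold"
    awk -F'\t' -v cutoff="$cutoff_epoch" '
        ($3 + 0) < (cutoff + 0) { cold += $2; next }
                                { hot  += $2 }
        END {
            total = cold + hot
            pct = (total > 0) ? 100 * cold / total : 0
            # %.0f keeps large byte counts intact across awk implementations
            printf "cold_bytes=%.0f hot_bytes=%.0f cold_pct=%.1f\n", cold, hot, pct
        }
    '
}
```

For example, `summarise_cold_data 1700000000 < inventory.tsv` treats anything last read before that epoch timestamp as cold.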
How many jobs are running. YARN application logs often reveal hundreds of scheduled jobs, many of which were written for one-off analyses and then never removed from cron or Oozie. A proper job inventory, built by exporting YARN history server logs and correlating them with Oozie coordinator schedules, typically reveals that 20 to 40% of registered jobs have not run successfully in over a year.
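The correlation step can be sketched as a set difference between two lists. This assumes you have already exported one file of registered job names and one file of job names with a successful run in the look-back window; both file formats and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# List registered jobs that have no successful run on record.
#   $1: file with one registered job name per line (e.g. from Oozie coordinators)
#   $2: file with one job name per successful run (e.g. from YARN history)
find_stale_jobs() {
    local registered="$1" succeeded="$2"
    # First pass remembers every job that succeeded; second pass prints each
    # registered job once if it was never seen. Comparing FILENAME (rather
    # than NR == FNR) keeps this correct when the success list is empty.
    awk 'FILENAME == ARGV[1] { ok[$0] = 1; next }
         !($0 in ok) && !seen[$0]++' "$succeeded" "$registered"
}
```

Calling `find_stale_jobs registered.txt succeeded.txt` prints the candidates for retirement, which still need a human review before anything is deleted.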
What the actual compute utilisation is. Most on-premises Hadoop clusters are sized for peak load and sit at 15 to 30% average CPU utilisation. That hardware cost is fixed whether the cluster is idle or saturated. Amazon EMR and EMR Serverless charge only for the capacity you run, which is the central economic argument for migration, but only if you are honest about the baseline.
Mapping Hadoop Components to AWS Services
The Hadoop ecosystem maps to AWS services with reasonable fidelity, but the mapping is not always one-to-one, and some mappings require architectural changes rather than simple substitution.
| Hadoop Component | AWS Equivalent | Notes |
|---|---|---|
| HDFS | Amazon S3 | Fundamental architectural shift; see below |
| YARN | EMR (managed YARN) or EMR Serverless | EMR Serverless eliminates cluster management |
| Hive | AWS Glue Data Catalog + Athena or Redshift Spectrum | Metastore-compatible |
| Spark | EMR Spark or AWS Glue | Glue for simpler jobs; EMR for fine-grained tuning |
| HBase | Amazon DynamoDB, or HBase on EMR with S3 storage | DynamoDB requires data model redesign |
| Kafka | Amazon MSK (Managed Streaming for Apache Kafka) | Near-direct replacement |
| Oozie | AWS Step Functions or Amazon MWAA | Workflow orchestration |
| Ranger / Kerberos | AWS IAM + Lake Formation | More granular but different model |
The HDFS-to-S3 transition is the most consequential change in the entire migration. HDFS is a distributed block store tightly coupled with compute. Hadoop jobs assume data locality: the compute node reads data from local disks where possible, minimising network I/O. S3 is an object store accessed over HTTP. The implications:
- Spark jobs need to be tuned differently. Input partition sizing that worked in HDFS often performs poorly against S3 because S3 LIST operations and small-file access patterns carry different latency characteristics.
- The small files problem is more acute on S3. Hundreds of thousands of small Parquet or ORC files that worked adequately on HDFS will cause significant performance degradation on S3 without compaction strategies.
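- S3 does not have a rename operation; a "rename" is a copy followed by a delete. Hive-style staging-then-rename patterns used in HDFS jobs become slow and non-atomic on S3 unless you use committers designed for object stores, such as the S3A committers or the EMRFS S3-optimized committer.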
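To find where compaction is needed, one option is to count small objects per partition prefix. The sketch below assumes a tab-separated listing of object size and key, which you would produce from an S3 inventory or listing of your own; the 128 MiB default threshold and the function name are illustrative choices, not recommendations from AWS.

```shell
#!/usr/bin/env bash
# Report prefixes that contain small objects.
# Input on stdin (hypothetical format): size_bytes <TAB> object_key
# Output: prefix <TAB> small_object_count <TAB> total_object_count
small_file_report() {
    local threshold_bytes="${1:-134217728}"   # 128 MiB, illustrative default
    awk -F'\t' -v limit="$threshold_bytes" '
        {
            prefix = $2
            sub("/[^/]*$", "", prefix)        # drop the file name, keep the prefix
            total[prefix]++
            if (($1 + 0) < (limit + 0)) small[prefix]++
        }
        END {
            for (p in total)
                if (small[p] > 0)
                    printf "%s\t%d\t%d\n", p, small[p], total[p]
        }
    ' | LC_ALL=C sort
}
```

Prefixes with a high small-object count relative to their total are the first candidates for a compaction job.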
Migration Phases: A Practical Sequence
Migrations that succeed follow a phased approach that manages risk while building momentum.
Phase 1: Cold data migration (weeks 1 to 4). Export aged HDFS data (last accessed more than 12 months ago) to S3 using S3DistCp (s3-dist-cp) or AWS DataSync with an HDFS source location. This immediately demonstrates cloud cost savings and gives the team experience with S3 access patterns before any workloads move.
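A cautious way to start Phase 1 is to generate the copy commands first and review them before running anything. The sketch below only prints one s3-dist-cp invocation per cold directory; the bucket name, directory list, and function name are placeholders, and mirroring the HDFS path under the bucket is just one possible layout.

```shell
#!/usr/bin/env bash
# Print (do not run) one s3-dist-cp command per HDFS directory read from stdin.
#   $1: destination bucket name (placeholder)
plan_cold_copy() {
    local bucket="$1" hdfs_dir
    while IFS= read -r hdfs_dir; do
        [ -n "$hdfs_dir" ] || continue        # skip blank lines
        # An absolute path such as /data/archive becomes hdfs:///data/archive
        printf 's3-dist-cp --src hdfs://%s --dest s3://%s%s\n' \
            "$hdfs_dir" "$bucket" "$hdfs_dir"
    done
}
```

For example, `plan_cold_copy example-bucket < cold_dirs.txt > copy_plan.sh` writes a plan that can be reviewed and then executed on the cluster where s3-dist-cp is installed.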
Phase 2: Reference and lookup data (weeks 3 to 6). Migrate static reference tables and lookup data to S3 and register them in the Glue Data Catalog. Set up Athena queries against this data. This validates the Catalog setup and gives analysts a preview of the AWS query environment.
Phase 3: Non-critical batch jobs (weeks 5 to 12). Identify batch jobs with the lowest criticality, such as weekly aggregations, archived report generation, and data quality scans. Rewrite or port these to run on EMR or Glue against S3. Run them in parallel with the Hadoop originals for 2 to 4 weeks to validate output fidelity.
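One simple fidelity check for the parallel run is an order-insensitive fingerprint of each job's output. The sketch below assumes both outputs have been exported as text files with one record per line (for example CSV extracts); it does not compare Parquet or ORC files directly, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Print "<row_count> <checksum>" for a text file, ignoring row order.
output_fingerprint() {
    local rows sum
    rows="$(wc -l < "$1" | tr -d ' ')"
    # Sorting first makes the checksum independent of the order rows were written
    sum="$(LC_ALL=C sort "$1" | cksum | awk '{ print $1 }')"
    printf '%s %s\n' "$rows" "$sum"
}

# Succeed (exit 0) only when both files contain the same set of rows.
outputs_match() {
    [ "$(output_fingerprint "$1")" = "$(output_fingerprint "$2")" ]
}
```

A nightly comparison could then be as short as `outputs_match hadoop_extract.csv emr_extract.csv || echo "MISMATCH"`, with mismatches investigated row by row.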
Phase 4: Core production pipelines (weeks 10 to 20). Migrate the business-critical pipelines. This phase requires the most engineering time, thorough testing, and a defined cutover plan with rollback procedures. Refer to the AWS Data Migration Checklist for a cutover template.
Phase 5: Cluster decommission. Once all jobs run successfully on AWS for a defined stability period, schedule cluster decommission. Do not keep the cluster running indefinitely as a fallback: the cost and operational overhead erode your migration savings.
Hive Metastore Migration
Migrating the Hive Metastore is often underestimated. The metastore contains table definitions, partition metadata, schema versions, and statistics that production jobs depend on. AWS Glue Data Catalog is Hive Metastore compatible, and EMR, Athena, and Glue jobs can all use it in place of a self-managed metastore, but there are differences in behaviour that require testing.
A practical approach uses the AWS CLI or SDK to script table creation from exported Hive DDL:
    # Export Hive DDL for every table in the existing metastore
    hive -e "SHOW TABLES IN my_database" | while read -r table; do
        hive -e "SHOW CREATE TABLE my_database.${table};" >> hive_ddl_export.sql
    done

    # After review and modification for S3 paths, apply via the AWS CLI
    # or use AWS Schema Conversion Tool for automated DDL translation.
    # table_definition.json holds the Glue TableInput structure for one table.
    aws glue create-table \
        --database-name my_database \
        --table-input file://table_definition.json
Test all table definitions against real queries in Athena before cutting over Spark jobs. Partition pruning behaviour and data type handling differ between Hive and Athena in subtle ways that surface only under production query patterns.
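The "modification for S3 paths" step in the export above can be partly scripted. The sketch below rewrites quoted HDFS location prefixes in exported DDL to an S3 prefix using literal (non-regex) matching; the NameNode address, bucket name, and function name are placeholders, and the output still needs manual review before it is applied.

```shell
#!/usr/bin/env bash
# Rewrite quoted HDFS path prefixes to an S3 prefix in DDL read from stdin.
#   $1: HDFS prefix to replace, e.g. hdfs://namenode:8020/warehouse (placeholder)
#   $2: S3 prefix to use instead, e.g. s3://example-bucket/warehouse (placeholder)
rewrite_ddl_locations() {
    local hdfs_prefix="$1" s3_prefix="$2"
    # The leading quote restricts the rewrite to quoted path literals.
    awk -v from="'$hdfs_prefix" -v to="'$s3_prefix" '
        {
            line = $0; out = ""
            # index() does a literal search, so dots in host names are not wildcards
            while ((i = index(line, from)) > 0) {
                out  = out substr(line, 1, i - 1) to
                line = substr(line, i + length(from))
            }
            print out line
        }
    '
}
```

For example, `rewrite_ddl_locations 'hdfs://namenode:8020/warehouse' 's3://example-bucket/warehouse' < hive_ddl_export.sql > glue_ddl_draft.sql` produces a draft to review.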
EMR Serverless vs. EMR on EC2: Choosing the Right Target
This is the decision most teams spend the most time on. EMR on EC2 gives you full control over cluster configuration, instance types, Spark settings, and networking. EMR Serverless abstracts all of that and charges per vCPU-second and GB-second of memory used.
For Hadoop migrations, EMR on EC2 is usually the right first target. It most closely mirrors the existing YARN cluster model and allows direct application of Spark configuration parameters tuned for existing jobs. Teams that try to jump straight to EMR Serverless without first validating their Spark code runs correctly in the AWS environment often run into configuration compatibility issues that are harder to debug in a serverless context.
Once jobs are stable on EMR on EC2, evaluate migration to EMR Serverless for jobs where the auto-scaling benefits outweigh the loss of configuration control. A full comparison of these options is covered in our post on EMR Serverless vs. EMR on EC2.
The Cost Reality After Migration
A well-executed Hadoop-to-AWS migration typically achieves 35 to 55% infrastructure cost reduction when comparing the full cost of on-premises operations (hardware amortisation, power, cooling, data centre space, maintenance contracts, and staff time) against the AWS equivalent.
The per-run EMR cost for a job consuming 20 vCores for 2 hours on m5.xlarge spot instances is approximately $1.40 CAD. Running that job daily comes to roughly $42 CAD/month (30 runs). The proportional share of an equivalent on-premises Hadoop node (assuming full cost allocation including data centre overhead) typically runs $120 to $180/month per comparable compute unit.
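The arithmetic behind these figures can be checked with a short sketch. It assumes 30 runs per month, 4 vCPUs per m5.xlarge, and one YARN vCore per vCPU; the function names are illustrative, and real spot prices vary by region and over time.

```shell
#!/usr/bin/env bash
# Monthly cost of a job from its per-run cost and number of runs.
monthly_job_cost() {
    awk -v c="$1" -v n="$2" 'BEGIN { printf "%.2f\n", c * n }'
}

# Implied price per instance-hour for a single run.
#   $1 per-run cost, $2 vCores used, $3 vCPUs per instance, $4 hours
implied_instance_hour_rate() {
    awk -v c="$1" -v v="$2" -v p="$3" -v h="$4" \
        'BEGIN { printf "%.3f\n", c / ((v / p) * h) }'
}

monthly_job_cost 1.40 30                # prints 42.00
implied_instance_hour_rate 1.40 20 4 2  # prints 0.140
```

Under those assumptions the job uses 5 instances for 2 hours, so the quoted per-run figure implies about $0.14 CAD per instance-hour including the EMR charge.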
Where migration cost models go wrong is in forgetting data egress, Glue Catalog API call costs at scale, and the engineering time required to retune Spark jobs for the S3 access pattern. Budget for a 3 to 6 month performance tuning phase post-migration. For more on ongoing cost management, see AWS Cost Optimisation for Data Teams.
Conclusion
Migrating from on-premises Hadoop to AWS is a multi-month engineering programme, not a weekend project. The organisations that do it successfully invest in a rigorous estate assessment, a phased migration approach, and genuine Spark/AWS skills development for their engineering teams. The reward is substantial: elastic compute that scales to zero, managed service layers that eliminate cluster operations, and a dramatically simpler long-term cost model.
Infra IT Consulting specialises in Hadoop migration programmes for data teams across Canada, the UK, and Africa. If you are planning a migration or struggling with a stalled one, contact us to discuss a structured migration assessment.