
Migrating from On-Prem Hadoop to AWS: Lessons from the Field

By Infra IT Consulting · 10 min read

Hadoop had a decade in the sun. Between 2008 and 2018, clusters running HDFS, YARN, Hive, Pig, and Spark were the default answer for organisations that needed to process data at scale on commodity hardware. Today, most of those clusters are ageing liabilities: underutilised nodes sitting on data centre floors, supported by a shrinking pool of Hadoop specialists, and accumulating maintenance debt that grows faster than the business value they deliver.

The migration from on-premises Hadoop to AWS is one of the most technically complex cloud journeys a data team can undertake. This post captures the lessons we have gathered from field migrations: what works, what does not, and where projects go sideways.

The Hadoop Estate Is Rarely What It Appears

Before a single byte moves to AWS, every Hadoop migration needs a thorough estate assessment. In our experience, organisations consistently underestimate three things about their clusters.

How much data is actually used. HDFS clusters frequently contain petabytes of data that has not been queried in 12–36 months. One client we worked with had 480 TB in HDFS; after a 90-day access log analysis, roughly 310 TB had not been touched since the original ingest. Migrating that cold data to S3 Glacier Instant Retrieval rather than S3 Standard cut their storage bill by 60% compared to a naive lift.
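As a rough illustration of the cold-data sizing step, the sketch below sums total and cold bytes from a listing. It assumes you have already exported a tab-separated file of path, size in bytes, and last-access time in epoch seconds; that file format and the function name `cold_summary` are our own assumptions, and how you produce the listing depends on whether access times are tracked on the NameNode or reconstructed from audit logs.

```bash
# cold_summary NOW_EPOCH MAX_AGE_DAYS < listing.tsv
# Input lines (hypothetical export format): path<TAB>size_bytes<TAB>last_access_epoch
# Prints total bytes, bytes not accessed within MAX_AGE_DAYS, and the cold share.
cold_summary() {
  awk -F'\t' -v now="$1" -v days="$2" '
    BEGIN { cutoff = now - days * 86400 }
    NF >= 3 {
      total += $2
      if ($3 + 0 < cutoff) cold += $2   # last access is older than the cutoff
    }
    END {
      pct = (total > 0) ? 100 * cold / total : 0
      # %.0f avoids integer overflow in some awk builds for very large byte counts
      printf "total_bytes=%.0f cold_bytes=%.0f cold_pct=%.1f\n", total, cold, pct
    }'
}
```

A call such as `cold_summary "$(date +%s)" 90 < listing.tsv` would report the share of data untouched in the last 90 days. For the client above, 310 TB out of 480 TB works out to roughly 65% cold.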

How many jobs are running. YARN application logs often reveal hundreds of scheduled jobs, many of which were written for one-off analyses and then never removed from cron or Oozie. A proper job inventory (exporting YARN history server logs and correlating them with Oozie coordinator schedules) typically reveals that 20–40% of registered jobs have not run successfully in over a year.
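The correlation step can be sketched in a few lines once both exports exist. The example assumes two plain-text files with one job name per line: the scheduled jobs pulled from Oozie coordinator definitions, and the jobs that finished successfully in YARN history over the past year. Both file formats and the function name `stale_jobs` are hypothetical.

```bash
# stale_jobs SCHEDULED_FILE SUCCEEDED_FILE
# Prints each scheduled job name that never appears in the succeeded list.
# Duplicate names in the scheduled file are reported once.
stale_jobs() {
  awk '
    FILENAME == ARGV[1] { ok[$0] = 1; next }   # first file: jobs with a successful run
    !($0 in ok) && !seen[$0]++                 # second file: scheduled but never succeeded
  ' "$2" "$1"
}
```

The output is a candidate list for retirement, not a verdict: a job that runs once a year for regulatory reporting would also show up here, so each entry still needs an owner to confirm it.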

What the actual compute utilisation is. Most on-premises Hadoop clusters are sized for peak load and sit at 15–30% average CPU utilisation. That hardware cost is fixed whether the cluster is idle or saturated. Amazon EMR and EMR Serverless charge only for what you use, which is the central economic argument for migration, but only if you are honest about the baseline.

Mapping Hadoop Components to AWS Services

The Hadoop ecosystem maps to AWS services with reasonable fidelity, but the mapping is not always one-to-one, and some mappings require architectural changes rather than simple substitution.

| Hadoop Component | AWS Equivalent | Notes |
| --- | --- | --- |
| HDFS | Amazon S3 | Fundamental architectural shift; see below |
| YARN | EMR (managed YARN) or EMR Serverless | EMR Serverless eliminates cluster management |
| Hive | AWS Glue Data Catalog + Athena or Redshift Spectrum | Metastore-compatible |
| Spark | EMR Spark or AWS Glue | Glue for simpler jobs; EMR for fine-grained tuning |
| HBase | Amazon DynamoDB | Requires data model redesign |
| Kafka | Amazon MSK (Managed Kafka) | Near-direct replacement |
| Oozie | AWS Step Functions or Amazon MWAA | Workflow orchestration |
| Ranger / Kerberos | AWS IAM + Lake Formation | More granular but different model |

The HDFS-to-S3 transition is the most consequential change in the entire migration. HDFS is a distributed block store tightly coupled with compute. Hadoop jobs assume data locality: the compute node reads data from local disks where possible, minimising network I/O. S3 is an object store accessed over HTTP. The implications:

  • Spark jobs need to be tuned differently. Input partition sizing that worked in HDFS often performs poorly against S3 because S3 LIST operations and small-file access patterns carry different latency characteristics.
  • The small files problem is more acute on S3. Hundreds of thousands of small Parquet or ORC files that worked adequately on HDFS will cause significant performance degradation on S3 without compaction strategies.
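  • S3 does not have a native rename operation: a rename is a copy followed by a delete, which is slow and not atomic. Hive-style staging-then-rename patterns used in HDFS jobs become slow and can leave partial output on S3 unless you use committers designed for object stores, such as the S3A committers or the EMRFS S3-optimized committer on EMR.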
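To gauge how exposed a dataset is to the small files problem, a quick count over an object listing is often enough. The sketch below reads the output format of `aws s3 ls --recursive` (date, time, size in bytes, key) on standard input. The function name `small_file_report` and the 128 MiB threshold in the usage note are assumptions for illustration, not a recommendation for every workload.

```bash
# small_file_report THRESHOLD_BYTES
# Reads `aws s3 ls --recursive` style lines on stdin:
#   <date> <time> <size_bytes> <key>
# Prints how many objects were seen and how many fall below the threshold.
small_file_report() {
  awk -v limit="$1" '
    NF >= 4 {
      files++
      if ($3 + 0 < limit + 0) small++   # compare sizes numerically
    }
    END { printf "files=%d small=%d\n", files, small }'
}
```

Usage would look like `aws s3 ls s3://example-bucket/warehouse/orders/ --recursive | small_file_report 134217728`, where the bucket and prefix are placeholders. A high ratio of small objects is the signal to schedule compaction before moving the jobs that read the table.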

Migration Phases: A Practical Sequence

Migrations that succeed follow a phased approach that manages risk while building momentum.

Phase 1: Cold data migration (weeks 1–4). Export aged HDFS data (last accessed more than 12 months ago) to S3 using s3-dist-cp or AWS DataSync with an HDFS source location. This immediately demonstrates cloud cost savings and gives the team experience with S3 access patterns before any workloads move.
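One way to make the cold-data copy reviewable is to generate the copy commands from a list of directories rather than typing them by hand. The sketch below uses plain Apache DistCp (`hadoop distcp -update`) with the `s3a://` connector, which is an assumption about the tooling available on the source cluster; S3DistCp or DataSync would need different invocations. The function name `plan_cold_copy`, the NameNode address, and the bucket are hypothetical.

```bash
# plan_cold_copy NAMENODE_URI DEST_S3A_PREFIX < cold_dirs.txt
# Reads absolute HDFS directory paths on stdin (one per line) and prints,
# without running it, one DistCp command per directory. The HDFS layout is
# mirrored under the destination prefix.
plan_cold_copy() {
  local namenode dest
  namenode="$1"
  dest="${2%/}"                      # drop a trailing slash to avoid "//" in keys
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue        # skip blank lines
    printf 'hadoop distcp -update %s%s %s%s\n' "$namenode" "$dir" "$dest" "$dir"
  done
}
```

Writing the plan to a file first, for example `plan_cold_copy hdfs://nn1:8020 s3a://example-bucket/cold < cold_dirs.txt > copy_plan.sh`, lets someone review the source and target paths before any data moves. The `-update` flag makes reruns skip files that already exist unchanged at the target.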

Phase 2: Reference and lookup data (weeks 3–6). Migrate static reference tables and lookup data to S3 and register them in the Glue Data Catalog. Set up Athena queries against this data. This validates the Catalog setup and gives analysts a preview of the AWS query environment.

Phase 3: Non-critical batch jobs (weeks 5–12). Identify batch jobs with the lowest criticality, such as weekly aggregations, archived report generation, and data quality scans. Rewrite or port these to run on EMR or Glue against S3. Run them in parallel with the Hadoop originals for 2–4 weeks to validate output fidelity.
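For the parallel-run period, a cheap first check on output fidelity is to compare exported results from both systems while ignoring row order. The sketch below assumes each job's output has been exported to a local text file (for example CSV), one row per line; the function name `same_rows` and that export step are our assumptions.

```bash
# same_rows FILE_A FILE_B
# Exits 0 when both files contain exactly the same lines, ignoring order;
# exits non-zero otherwise. Rows are sorted bytewise, then checksummed.
same_rows() {
  local a b
  a="$(LC_ALL=C sort "$1" | cksum)"   # cksum prints "<crc> <byte count>" for stdin
  b="$(LC_ALL=C sort "$2" | cksum)"
  [ "$a" = "$b" ]
}
```

This is a coarse comparison: `cksum` is a CRC rather than a cryptographic hash, and harmless formatting differences (float precision, timestamp formats, null representation) will register as mismatches. In practice it works best as a gate that tells you which outputs need a closer row-level diff.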

Phase 4: Core production pipelines (weeks 10–20). Migrate the business-critical pipelines. This phase requires the most engineering time, thorough testing, and a defined cutover plan with rollback procedures. Refer to the AWS Data Migration Checklist for a cutover template.

Phase 5: Cluster decommission. Once all jobs run successfully on AWS for a defined stability period, schedule cluster decommission. Do not keep the cluster running indefinitely as a fallback: the cost and operational overhead erode your migration savings.

Hive Metastore Migration

Migrating the Hive Metastore is often underestimated. The metastore contains table definitions, partition metadata, schema versions, and statistics that production jobs depend on. AWS Glue Data Catalog can act as a Hive-compatible metastore for EMR, Athena, and Glue jobs, but it is reached through the Glue API (EMR ships a client that plugs into Hive and Spark) rather than a Hive Thrift endpoint, and there are differences in behaviour that require testing.

A practical approach uses the AWS CLI (the aws glue commands) or an SDK to script table creation from exported Hive DDL:

# Export Hive DDL from the existing metastore, one statement per table
hive -S -e "SHOW TABLES IN my_database" | while IFS= read -r table; do
  # SHOW CREATE TABLE prints no trailing semicolon, so append one
  hive -S -e "SHOW CREATE TABLE my_database.${table};" >> hive_ddl_export.sql
  echo ";" >> hive_ddl_export.sql
done

# After reviewing the DDL and changing LOCATION paths to S3, create each table
# in Glue from a TableInput JSON document (or use AWS Schema Conversion Tool
# for automated DDL translation)
aws glue create-table \
  --database-name my_database \
  --table-input file://table_definition.json

Test all table definitions against real queries in Athena before cutting over Spark jobs. Partition pruning behaviour and data type handling differ between Hive and Athena in subtle ways that surface only under production query patterns.
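The "modification for S3 paths" step mentioned above is mostly a prefix swap on the quoted LOCATION values in the exported DDL. The sketch below does that with sed. The function name `rewrite_locations`, the NameNode address, and the bucket are hypothetical, and the sketch assumes neither prefix contains a `|` or `&` character, since both are special in the sed expression.

```bash
# rewrite_locations HDFS_PREFIX S3_PREFIX < hive_ddl_export.sql
# Replaces the given HDFS prefix with the S3 prefix wherever it opens a
# single-quoted string, which is how SHOW CREATE TABLE prints LOCATION values.
rewrite_locations() {
  sed "s|'$1|'$2|g"
}
```

For example, `rewrite_locations 'hdfs://nn1:8020/warehouse' 's3://example-bucket/warehouse' < hive_ddl_export.sql > glue_ddl.sql` keeps the directory structure below the warehouse root intact. The rewritten DDL still needs a manual pass for SerDe properties and table parameters that reference the old cluster.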

EMR Serverless vs. EMR on EC2: Choosing the Right Target

This is the decision most teams spend the most time on. EMR on EC2 gives you full control over cluster configuration, instance types, Spark settings, and networking. EMR Serverless abstracts all of that and charges per vCPU-second and GB-second of memory used.

For Hadoop migrations, EMR on EC2 is usually the right first target. It most closely mirrors the existing YARN cluster model and allows direct application of Spark configuration parameters tuned for existing jobs. Teams that try to jump straight to EMR Serverless without first validating their Spark code runs correctly in the AWS environment often run into configuration compatibility issues that are harder to debug in a serverless context.

Once jobs are stable on EMR on EC2, evaluate migration to EMR Serverless for jobs where the auto-scaling benefits outweigh the loss of configuration control. A full comparison of these options is covered in our post on EMR Serverless vs. EMR on EC2.

The Cost Reality After Migration

A well-executed Hadoop-to-AWS migration typically achieves 35–55% infrastructure cost reduction when comparing the full cost of on-premises operations (hardware amortisation, power, cooling, data centre space, maintenance contracts, and staff time) against the AWS equivalent.

The per-run EMR cost for a job consuming 20 vCores for 2 hours on m5.xlarge spot instances is approximately $1.40 CAD. Running that job daily is $42/month. The proportional share of an equivalent on-premises Hadoop node (assuming full cost allocation including data centre overhead) typically runs $120–$180/month per comparable compute unit.
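The arithmetic behind those figures can be written down so it is easy to rerun with your own rates. An m5.xlarge has 4 vCPUs, so 20 vCores means 5 instances, or 10 instance-hours for a 2-hour job. The $0.14 CAD per instance-hour used in the usage note is simply the blended rate implied by the $1.40 figure above (spot price plus EMR surcharge), not a quoted AWS price, and the function name `emr_job_cost` is hypothetical.

```bash
# emr_job_cost VCORES HOURS VCPUS_PER_INSTANCE RATE_PER_INSTANCE_HOUR RUNS_PER_MONTH
# Prints the instance count, the cost of one run, and the monthly cost.
emr_job_cost() {
  awk -v vcores="$1" -v hours="$2" -v per="$3" -v rate="$4" -v runs="$5" 'BEGIN {
    instances = int((vcores + per - 1) / per)   # round up to whole instances
    per_run = instances * hours * rate
    printf "instances=%d per_run=%.2f monthly=%.2f\n", instances, per_run, per_run * runs
  }'
}
```

Running `emr_job_cost 20 2 4 0.14 30` reproduces the numbers in the text: 5 instances, $1.40 per run, $42.00 for 30 daily runs. Spot prices move, so treat the rate as an input you refresh rather than a constant.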

Where migration cost models go wrong is in forgetting data egress, Glue Catalog API call costs at scale, and the engineering time required to retune Spark jobs for the S3 access pattern. Budget for a 3–6 month performance tuning phase post-migration. For more on ongoing cost management, see AWS Cost Optimisation for Data Teams.

Conclusion

Migrating from on-premises Hadoop to AWS is a multi-month engineering programme, not a weekend project. The organisations that do it successfully invest in a rigorous estate assessment, a phased migration approach, and genuine Spark/AWS skills development for their engineering teams. The reward is substantial: elastic compute that scales to zero, managed service layers that eliminate cluster operations, and a dramatically simpler long-term cost model.

Infra IT Consulting specialises in Hadoop migration programmes for data teams across Canada, the UK, and Africa. If you are planning a migration or struggling with a stalled one, contact us to discuss a structured migration assessment.
