Infra IT Consulting logo Infra ITC
Cloud Migration & Cost Optimization etlmigrationglue

Modernising Legacy ETL: From SSIS and Informatica to AWS Glue

By Infra IT Consulting Β· Β· 9 min read

If your data team is still running SQL Server Integration Services packages on ageing Windows servers, or paying six-figure Informatica PowerCenter licence renewals, you are not alone. A significant share of enterprise data infrastructure in Canada, the UK, and across Africa still depends on ETL tooling that was designed before cloud storage cost a fraction of a cent per gigabyte. The business case for modernising is straightforward. The technical path is not.

This post walks through the real decisions, trade-offs, and migration steps involved in moving from SSIS or Informatica to AWS Glue β€” based on the patterns we see repeatedly at Infra IT Consulting when working with data teams at mid-market and enterprise clients.

Why Legacy ETL Becomes a Liability

SSIS and Informatica PowerCenter were excellent products for their era. They were built for on-premises data warehouses, predictable batch windows, and ETL teams with deep platform-specific expertise. They start to fail the modern data team in three concrete ways.

Scalability ceiling. SSIS runs on a single Windows server. When a transformation job needs more memory or CPU, you either provision a bigger machine or split the job manually. AWS Glue runs on Apache Spark under the hood, which means horizontal scaling is automatic. A job that chokes on 500 GB of data on an SSIS server can run on Glue across a distributed cluster sized precisely for that load.

Licence cost and vendor dependency. Informatica PowerCenter licences commonly run $100,000–$400,000 CAD per year for mid-size deployments, depending on connector count and compute units. SQL Server Enterprise Edition β€” frequently required to run SSIS at scale β€” adds another $15,000–$50,000 per core. AWS Glue charges $0.44 per DPU-hour with no upfront licences, no renewal negotiations, and no per-connector fees for the core service.

Operational overhead. Legacy ETL tools need patching, Windows Server maintenance, ODBC driver management, and specialist staff who know the tooling. Every hour spent keeping the platform alive is an hour not spent building pipelines.

Understanding What AWS Glue Actually Is

Before migrating, teams need a clear-eyed view of what Glue is and is not. Glue is a managed Apache Spark and Python Shell environment. When you write a Glue job, you are writing PySpark or Scala with some AWS-specific extensions (DynamicFrame, GlueContext). This is a significant skill shift for teams whose engineers have spent years dragging-and-dropping SSIS components.

AWS Glue also includes:

  • Glue Data Catalog β€” a managed Hive metastore compatible with Athena, EMR, and Redshift Spectrum
  • Glue Crawlers β€” automated schema inference from S3, JDBC sources, and DynamoDB
  • Glue Studio β€” a visual ETL designer that generates PySpark code, easing the transition for teams coming from visual tooling
  • Glue DataBrew β€” a no-code data preparation tool suited for analysts

The Catalog is arguably as valuable as the job execution engine. If you are building a data lake on S3, the Glue Catalog becomes the central schema registry that ties together Athena ad-hoc queries, EMR Spark jobs, and Redshift Spectrum external tables.

The Migration Assessment: What to Inventory First

A successful migration starts with a brutal inventory of your existing ETL estate. Many organisations have hundreds of SSIS packages or Informatica workflows that have accumulated over a decade, and a significant fraction are either broken, redundant, or executing on a schedule nobody remembers approving.

Run this assessment before writing a single line of Glue code:

1. Classify by execution frequency. Which jobs run daily? Weekly? Quarterly? Jobs that run quarterly are lower priority for migration but are also often the most complex β€” they were last touched years ago and lack documentation.

2. Map data sources and targets. Build a complete list of source databases, flat file locations, and target systems. On-premises SQL Server, Oracle, SAP BFTP drops, mainframe flat files, and FTP-delivered CSVs all require different connectivity strategies in AWS.

3. Identify transformation logic complexity. Some SSIS packages are essentially SQL SELECT ... INSERT statements wrapped in a package. Others contain hundreds of derived column transformations, conditional splits, and fuzzy lookup components. The former translate easily; the latter require careful engineering.

4. Flag hard dependencies. Look for packages that call COM objects, run Windows batch scripts, or depend on local file system paths. These are migration blockers that need resolving before you can lift anything to AWS.

A practical checklist for this phase is available in our AWS Data Migration Checklist.

A Concrete Migration Example: SSIS to Glue

Here is a simplified but representative example. An SSIS package reads transactional data from SQL Server, applies currency conversion using a lookup table, and writes the result to a SQL Server data warehouse. In AWS, the equivalent pattern uses RDS (or a retained on-premises SQL Server over Direct Connect), Glue for the transformation, and Redshift as the warehouse target.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import col, when

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read transactions from source JDBC (SQL Server or RDS)
transactions = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="transactions"
)

# Read FX lookup from S3 (was a Lookup transform in SSIS)
fx_rates = spark.read.parquet("s3://my-bucket/reference/fx_rates/")

# Convert to Spark DataFrame for join logic
tx_df = transactions.toDF()
joined = tx_df.join(fx_rates, on="currency_code", how="left")

# Apply derived column logic (equivalent to SSIS Derived Column)
result = joined.withColumn(
    "amount_cad",
    col("amount_local") * col("rate_to_cad")
)

# Write to Redshift via Glue connection
output = DynamicFrame.fromDF(result, glueContext, "output")
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=output,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "fact.transactions_cad", "database": "analytics"},
    redshift_tmp_dir="s3://my-bucket/tmp/glue/"
)

job.commit()

This replaces roughly 40 SSIS components with approximately 50 lines of readable Python. The Glue job is version-controlled in Git, deployable via Terraform or CDK, and scales horizontally without touching a server.

Handling the Hard Cases: Informatica Mappings

Informatica PowerCenter mappings tend to be more complex than their SSIS equivalents. They frequently use expression transformations with embedded business logic, router transformations for conditional branching, and normaliser/denormaliser components for hierarchical data. None of these have direct Glue equivalents β€” they translate to PySpark operations.

The most effective migration strategy we have used is a strangler fig approach: rather than migrating the entire Informatica estate in one programme, identify new pipelines or major pipeline modifications as the trigger to write in Glue. Over 12–18 months, the Informatica dependency shrinks organically while the team builds Glue proficiency on lower-risk work.

For jobs that must be migrated immediately β€” typically those tied to decommissioned servers β€” Informatica’s own export-to-code features, combined with AWS’s schema conversion tool concepts, can accelerate the initial translation. Expect to spend 2–4 weeks of engineering time per complex mapping in a direct migration effort.

Cost Modelling: What to Expect

Glue pricing is consumption-based. A standard Glue job uses G.1X workers at $0.44/DPU-hour. A job that processes 50 GB of data and runs for 15 minutes using 10 workers costs approximately $1.10 per run. Running that job twice daily costs roughly $66/month.

The equivalent SSIS workload, running on a dedicated Windows Server 2019 instance (say, a db.r5.2xlarge at ~$0.48/hour), costs $345/month before storage, SQL Server licensing, and maintenance overhead. The Glue job is not always cheaper for every pattern β€” long-running jobs with large clusters can become expensive β€” but for bursty, variable-volume batch workloads, the consumption model wins consistently.

Factor in the elimination of Informatica licence costs and the server reduction, and a well-planned migration to Glue commonly delivers 40–70% total cost reduction on ETL infrastructure within 18 months. See our deeper analysis in AWS Cost Optimisation for Data Teams for benchmarking methodology.

Conclusion

Modernising legacy ETL is not a lift-and-shift exercise. It is a deliberate re-engineering effort that requires inventorying what you have, making honest assessments of complexity, and building team capability in PySpark alongside the migration work. The reward is a cloud-native data pipeline estate that scales on demand, eliminates licence lock-in, and integrates natively with the rest of the AWS data ecosystem.

Infra IT Consulting has delivered ETL modernisation programmes for clients across Canada, the UK, and Africa β€” from single-team migrations of 30 SSIS packages to multi-year Informatica decommission programmes. If your team is ready to start planning, contact us to discuss a migration assessment tailored to your environment.

Related posts