Data Analytics & BI data-warehousemigrationredshift

Modernising Legacy Data Warehouses on AWS

By Infra IT Consulting · February 23, 2024 · 10 min read

Content on this site is AI-assisted and personally reviewed by Hazem. Learn more

Most organisations running legacy data warehouses did not choose obsolescence — they chose the best available technology at the time. Teradata, Oracle, SQL Server Analysis Services, and even early-generation Redshift clusters were legitimate choices in their era. The problem is that business data volumes have grown, query patterns have diversified, and the economics of cloud-native analytics have improved dramatically. What once cost seven figures in hardware and licensing can now run on Amazon Redshift for a fraction of the price, with better performance and far less operational burden.

Modernising a legacy data warehouse is not a simple lift-and-shift. It requires careful assessment, schema translation, ETL pipeline migration, and a cutover strategy that keeps the business running throughout. This guide walks through the key phases and decisions.

Phase 1: Assessment and Discovery

Before touching a single table, you need a clear picture of what you have. A thorough assessment covers:

Inventory of objects — How many schemas, tables, views, stored procedures, and user-defined functions exist? Tools like AWS Schema Conversion Tool (SCT) can connect to Oracle, SQL Server, Teradata, and Netezza and produce an automated compatibility report showing which objects can be converted automatically and which require manual intervention.

Query workload analysis — Which queries run most frequently? Which consume the most resources? Identifying the top 20 queries by compute cost often reveals the schemas and transformations that matter most to the business. On SQL Server, sys.dm_exec_query_stats provides this; on Teradata, DBQL logs serve the same purpose.

Data volumes and growth rates — Table sizes determine storage costs and influence the choice between Redshift RA3 nodes (which separate compute and storage) and a pure serverless approach with Redshift Serverless.

Upstream and downstream dependencies — Which ETL processes feed the warehouse? Which BI tools, reports, and applications query it? A migration that breaks the finance team’s month-end report on day one will permanently damage trust in the project.

AWS SCT produces a migration assessment report that assigns a complexity rating to each object. A typical legacy Teradata environment might see 60-70% of objects auto-converted, 20-30% requiring minor manual adjustments, and 5-10% requiring substantial rewriting — usually stored procedures with vendor-specific syntax or features.

Phase 2: Schema Translation and Data Modelling Decisions

Schema translation is rarely just syntax substitution. It is also an opportunity — and sometimes a necessity — to revisit data modelling decisions made years ago under different constraints.

From star schema to Redshift-optimised tables — Redshift is a columnar, massively parallel database. Distribution keys control how rows are spread across compute nodes; sort keys control the order in which data is stored on disk, enabling zone map pruning. Choosing these correctly for your query patterns is the single biggest performance lever available.

A typical Redshift table definition for a fact table might look like:

CREATE TABLE fact_sales (
    sale_id         BIGINT      NOT NULL,
    sale_date       DATE        NOT NULL,
    customer_id     INT         NOT NULL,
    product_id      INT         NOT NULL,
    store_id        INT         NOT NULL,
    quantity        INT         NOT NULL,
    unit_price_cad  DECIMAL(10,2),
    discount_pct    DECIMAL(5,2),
    net_revenue_cad DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date, customer_id)
ENCODE AUTO;

ENCODE AUTO allows Redshift to choose column compression automatically — generally the right choice for new tables since it adapts as data is loaded.

Stored procedure translation — This is where most migration effort concentrates. Teradata BTEQ scripts, Oracle PL/SQL packages, and SQL Server T-SQL stored procedures each have idioms with no direct Redshift equivalent. AWS SCT handles straightforward cases, but complex cursor-based logic, dynamic SQL, and vendor-specific functions often require rewriting. In practice, we often take this as an opportunity to move transformation logic out of stored procedures entirely and into dbt models — making the code version-controlled, tested, and easier to maintain. See our post on dbt on AWS for how this works in practice.

Phase 3: ETL Pipeline Migration

Migrating the warehouse schema is only half the job. You also need to migrate the pipelines that populate it.

Homogeneous source systems — If your source systems are relational databases (Oracle, SQL Server, PostgreSQL), AWS Database Migration Service (DMS) can replicate them continuously to S3 or directly to Redshift using Change Data Capture. DMS handles the initial full load and then streams ongoing changes, enabling near-real-time replication with minimal latency.

Heterogeneous and legacy ETL — Many legacy warehouses are fed by Informatica, DataStage, SSIS, or Ab Initio pipelines. AWS offers the Glue Studio visual interface as a drop-in replacement for simple Informatica mappings. More complex pipelines often benefit from a full rewrite in AWS Glue (PySpark) or dbt, using the migration as an opportunity to document what the pipeline actually does.

Incremental loading patterns — Legacy ETL jobs often use fragile watermark-based incremental loads that break when source data is updated out of order. Redshift MERGE (available since 2022) enables robust upsert patterns:

MERGE INTO fact_sales
USING stage_sales AS src
ON fact_sales.sale_id = src.sale_id
WHEN MATCHED THEN UPDATE SET
    net_revenue_cad = src.net_revenue_cad,
    quantity = src.quantity
WHEN NOT MATCHED THEN INSERT VALUES (
    src.sale_id, src.sale_date, src.customer_id,
    src.product_id, src.store_id, src.quantity,
    src.unit_price_cad, src.discount_pct, src.net_revenue_cad
);

Phase 4: Validation and Parallel Running

Never cut over without a validation phase. Row counts and aggregate checksums alone are insufficient — you need query-level output comparison:

Identify the 50-100 most important reports and queries.
Run them against both the legacy warehouse and Redshift simultaneously.
Compare outputs programmatically — exact match for deterministic queries, tolerance-based for floating-point aggregations.
Investigate every discrepancy before proceeding.

Most discrepancies fall into a few categories: timezone handling differences, rounding behaviour, NULL semantics, or incremental load gaps. Catching these in validation is far less costly than fielding stakeholder complaints post-cutover.

During parallel running, both systems receive data from the ETL pipelines simultaneously. This increases load on the source systems temporarily but gives you confidence that the new warehouse is current and accurate before the switch.

Phase 5: Cutover and Decommission

A well-planned cutover is low-drama. Key steps:

Communication — Notify all consumers (BI tools, downstream pipelines, report schedulers) of the new connection strings in advance. For tools like Tableau or Power BI connecting directly to the warehouse, this may require updating dozens of data source definitions.
DNS / CNAME cutover — Where possible, abstract the connection endpoint behind a DNS alias so the cutover is invisible to consumers.
Rollback plan — Keep the legacy warehouse in read-only mode for two to four weeks post-cutover. This provides a safety net without the cost of continuing to write to two systems.
Decommission — Once the business has validated the new system through at least one full reporting cycle (typically month-end), decommission the legacy hardware or service.

Cost Considerations for Canadian Organisations

Canadian organisations choosing between Redshift provisioned clusters and Redshift Serverless should note that the ca-central-1 (Montreal) region offers both options. Serverless is ideal for intermittent query workloads; provisioned clusters with Reserved Instance pricing are more cost-effective for consistent, high-throughput workloads. For organisations subject to provincial privacy regulations, keeping data in ca-central-1 or ca-west-1 (Calgary) simplifies compliance posture significantly.

Conclusion

Modernising a legacy data warehouse is a significant undertaking, but the operational and economic benefits — reduced licensing costs, elastic scaling, managed infrastructure, and access to the broader AWS ecosystem — make it worthwhile for nearly every organisation still running on-premises or legacy cloud warehouses.

The projects that succeed share a common trait: they invest in the assessment and planning phases rather than rushing to migrate. Understanding what you have, what your query patterns demand, and what your stakeholders depend on is the foundation of a successful cutover.

Infra IT Consulting has guided data warehouse migrations across Teradata, Oracle, and SQL Server source systems to Amazon Redshift for clients in Canada, the UK, and Africa. Contact us to discuss your migration — we offer a no-obligation assessment to help you understand the scope and timeline before you commit.

Related reading:

Data Analytics & BI

Talk to our team →

Modernising Legacy Data Warehouses on AWS

Phase 1: Assessment and Discovery

Phase 2: Schema Translation and Data Modelling Decisions

Phase 3: ETL Pipeline Migration

Phase 4: Validation and Parallel Running

Phase 5: Cutover and Decommission

Cost Considerations for Canadian Organisations

Conclusion

Related posts

Designing KPI Dashboards That Data Engineers Will Actually Maintain

Building Real-Time Dashboards with Kinesis and QuickSight

QuickSight vs. Tableau vs. Power BI: An Honest Comparison for AWS Shops

Phase 1: Assessment and Discovery

Phase 2: Schema Translation and Data Modelling Decisions

Phase 3: ETL Pipeline Migration

Phase 4: Validation and Parallel Running

Phase 5: Cutover and Decommission

Cost Considerations for Canadian Organisations

Conclusion

Related posts

Designing KPI Dashboards That Data Engineers Will Actually Maintain

Building Real-Time Dashboards with Kinesis and QuickSight

QuickSight vs. Tableau vs. Power BI: An Honest Comparison for AWS Shops

We value your privacy