Lakehouse Architecture on AWS: Combining the Best of Lakes and Warehouses

By Infra IT Consulting · January 16, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

For years, data architecture teams faced a binary choice: build a data lake (cheap, flexible, hard to manage) or build a data warehouse (fast, reliable, expensive and rigid). Data lakes promised unlimited scale and schema flexibility but delivered governance nightmares — data swamps with no discoverability, no ACID guarantees, and unreliable query performance. Data warehouses delivered reliability and performance but at high cost, with tightly coupled storage and compute, and limited support for the semi-structured and unstructured data that modern organisations generate at scale.

The lakehouse architecture resolves this false dichotomy. By adding a transactional table format layer on top of object storage, a lakehouse provides ACID transactions, schema enforcement, time travel, and efficient query performance — directly on data stored in S3 — without requiring a proprietary warehouse engine.

What Makes a Lakehouse Different from a Data Lake

A traditional data lake is, at its core, a collection of files in S3. You write Parquet files, define an Athena table over them, and query. This works, but it has fundamental limitations:

No ACID transactions. Plain files on S3 have no commit protocol, so two writers hitting the same partition at the same time can leave duplicate or conflicting files behind, with no automatic rollback.
No efficient updates or deletes. Updating a single record means rewriting every file that contains it, often the whole partition. GDPR right-to-erasure requests become expensive operations.
No partition evolution. Changing partition strategy requires a full table rewrite.
Unreliable reads during writes. A consumer query that runs while a writer is mid-update may see some of the new files but not others.

Open table formats such as Apache Iceberg and Delta Lake address these problems by adding a metadata layer on top of Parquet files in S3. This metadata layer tracks which files constitute the current state of a table, which files have been deleted or superseded, and the complete schema history. Iceberg in particular also records partition history, so the partition spec can change without rewriting existing data. The result is a table abstraction on S3 that behaves like a database table: consistent, transactional, and efficiently queryable.
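To make the metadata idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's or Delta Lake's actual implementation: the `ToyTable` class, its method names, and the file names are all hypothetical. It shows two properties described above: readers always see one complete snapshot, and a writer whose base snapshot is stale is rejected instead of silently overwriting another writer's work.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer committed first."""


@dataclass
class ToyTable:
    # Each snapshot is an immutable tuple of data file paths.
    # Version 0 is the empty table.
    snapshots: list = field(default_factory=lambda: [()])

    def current_version(self) -> int:
        return len(self.snapshots) - 1

    def read(self, version=None) -> tuple:
        """Return the file list of one complete snapshot, never a partial write."""
        if version is None:
            version = self.current_version()
        return self.snapshots[version]

    def commit(self, base_version: int, added=(), removed=()) -> int:
        """Publish a new snapshot only if nobody committed since base_version."""
        if base_version != self.current_version():
            raise CommitConflict(
                f"table moved from v{base_version} to v{self.current_version()}"
            )
        kept = tuple(f for f in self.snapshots[base_version] if f not in removed)
        self.snapshots.append(kept + tuple(added))
        return self.current_version()


table = ToyTable()
v1 = table.commit(base_version=0, added=["part-0001.parquet"])

# Two writers both start from v1; only the first commit succeeds.
table.commit(base_version=v1, added=["part-0002.parquet"])
try:
    table.commit(base_version=v1, added=["part-0003.parquet"])
    conflict_detected = False
except CommitConflict:
    conflict_detected = True

print(conflict_detected)  # True
print(table.read())       # ('part-0001.parquet', 'part-0002.parquet')
```

Real table formats typically retry a rejected commit against the new table state rather than simply failing, but the compare-then-publish step is the core of the guarantee.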

Apache Iceberg vs. Delta Lake on AWS

Both Apache Iceberg and Delta Lake are production-ready open table formats supported on AWS. The choice between them matters less than it used to, since both are mature and both integrate with the major AWS query engines, but there are differences worth understanding.

Apache Iceberg is the AWS-preferred format. AWS has invested heavily in Iceberg support across its services: Amazon Athena queries and writes Iceberg tables natively, Amazon EMR supports Iceberg via the Iceberg library, AWS Glue 4.0 has native Iceberg support, and Amazon Redshift Spectrum can query Iceberg tables registered in the Glue Data Catalog. For new AWS-native lakehouse implementations, Iceberg is the default recommendation.

Delta Lake is the Databricks-originated format and is excellent if your organisation uses Databricks on AWS. It has slightly more mature support for streaming use cases (Delta Live Tables, streaming merge operations), and its Change Data Feed feature makes CDC pipelines straightforward. If your team has existing Databricks expertise or workloads, Delta Lake is a natural fit.

A practical rule of thumb: if your primary query engines are Athena, Glue, and Redshift Spectrum, use Iceberg. If you run Databricks on AWS or have significant PySpark workloads, either Delta Lake or Iceberg works well.

Building an Iceberg Lakehouse on AWS

Here is a concrete architecture for a production Iceberg lakehouse on AWS:

Storage: Amazon S3 with separate buckets (or prefixes) for raw landing data, for the Iceberg warehouse location that holds curated tables, and for Athena query results.

Catalogue: AWS Glue Data Catalog as the Iceberg catalogue. All Iceberg tables are registered here, making them queryable by Athena, EMR, and Redshift Spectrum without additional configuration.

Ingestion and transformation: AWS Glue 4.0 jobs for batch ELT. Glue 4.0’s native Iceberg support, enabled by setting the --datalake-formats job parameter to iceberg, means you can write Glue jobs that perform upserts, deletes, and schema evolution on Iceberg tables without custom library management.

Query: Amazon Athena for ad-hoc queries and transformation SQL; Redshift Spectrum for joining Iceberg tables with Redshift-managed data; optionally Amazon EMR for large-scale Spark processing.

Performing an upsert into an existing Iceberg table in AWS Glue (PySpark):

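from awsglue.context import GlueContext
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Iceberg settings go on the SparkConf so they are in place before the
# session is created. The job also needs --datalake-formats iceberg.
conf = SparkConf()

# SQL extensions enable MERGE INTO and the CALL maintenance procedures
conf.set(
    "spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)

# Configure Iceberg catalogue to use Glue Data Catalog
conf.set(
    "spark.sql.catalog.glue_catalog",
    "org.apache.iceberg.spark.SparkCatalog"
)
conf.set(
    "spark.sql.catalog.glue_catalog.warehouse",
    "s3://your-data-lake/warehouse/"
)
conf.set(
    "spark.sql.catalog.glue_catalog.catalog-impl",
    "org.apache.iceberg.aws.glue.GlueCatalog"
)
conf.set(
    "spark.sql.catalog.glue_catalog.io-impl",
    "org.apache.iceberg.aws.s3.S3FileIO"
)

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session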

# Read incremental source data
incremental_df = spark.read.parquet(
    "s3://your-data-lake/raw/customers/date=2024-01-16/"
)

# Register as temp view for SQL merge
incremental_df.createOrReplaceTempView("source_customers")

# MERGE INTO Iceberg table — handles upserts atomically
spark.sql("""
    MERGE INTO glue_catalog.curated.dim_customers AS target
    USING source_customers AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
        UPDATE SET
            email = source.email,
            name = source.name,
            plan_type = source.plan_type,
            updated_at = source.updated_at
    WHEN NOT MATCHED THEN
        INSERT (customer_id, email, name, plan_type, created_at, updated_at)
        VALUES (source.customer_id, source.email, source.name,
                source.plan_type, source.created_at, source.updated_at)
""")

This MERGE operation is ACID-compliant: if the Glue job fails halfway through, the Iceberg metadata layer ensures the table remains in its pre-merge state. No partial updates, no corrupted partitions.
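The MERGE above assumes that glue_catalog.curated.dim_customers already exists. A minimal sketch of creating it from the same Glue job might look like the following. The column names come from the MERGE statement, but the column types are assumptions for illustration, and `ensure_dim_customers` is a hypothetical helper name.

```python
# Column types are assumed for illustration; adjust them to the real source schema.
CREATE_DIM_CUSTOMERS_SQL = """
    CREATE TABLE IF NOT EXISTS glue_catalog.curated.dim_customers (
        customer_id string,
        email       string,
        name        string,
        plan_type   string,
        created_at  timestamp,
        updated_at  timestamp
    )
    USING iceberg
"""


def ensure_dim_customers(spark) -> None:
    """Create the target table if it is missing.

    `spark` is a session already configured with the glue_catalog
    Iceberg catalogue, as in the job above.
    """
    spark.sql(CREATE_DIM_CUSTOMERS_SQL)
```

Calling `ensure_dim_customers(spark)` before the MERGE keeps the job idempotent: the first run creates the table, and later runs leave it untouched.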

Time Travel and Data Auditing

One of the most operationally valuable features of Iceberg is time travel: the ability to query the state of a table at any past point in time for which a snapshot is still retained. This is built into the metadata layer with no additional engineering effort.

In Amazon Athena:

-- Query customer table as it appeared on January 1st 2024 (UTC)
SELECT *
FROM curated.dim_customers
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
WHERE customer_id = 'CUST-12345';

-- Query using a specific snapshot ID (visible in table history)
SELECT *
FROM curated.dim_customers
FOR VERSION AS OF 8463729301928374;

-- View full table history
SELECT *
FROM "curated"."dim_customers$history";

Time travel is invaluable for data auditing, debugging pipeline issues, and implementing point-in-time recovery without maintaining separate snapshot infrastructure. For financial data with audit requirements, this capability partially substitutes for the separate history tables you would otherwise need to maintain.
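Conceptually, a timestamp-based time travel query resolves to the latest snapshot committed at or before the requested time. The sketch below illustrates that lookup in plain Python. The snapshot IDs and commit times are made up, and `snapshot_as_of` is a hypothetical helper, not part of any Iceberg or Athena API.

```python
from datetime import datetime, timezone

# Hypothetical snapshot history: (committed_at, snapshot_id) pairs in commit order.
history = [
    (datetime(2023, 12, 20, tzinfo=timezone.utc), 1001),
    (datetime(2023, 12, 28, tzinfo=timezone.utc), 1002),
    (datetime(2024, 1, 5, tzinfo=timezone.utc), 1003),
]


def snapshot_as_of(history, as_of):
    """Return the ID of the latest snapshot committed at or before `as_of`."""
    candidates = [snap_id for committed_at, snap_id in history if committed_at <= as_of]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    return candidates[-1]


selected = snapshot_as_of(history, datetime(2024, 1, 1, tzinfo=timezone.utc))
print(selected)  # 1002
```

The error branch matters in practice: once old snapshots are expired, they drop out of the history, and a query for a time before the oldest retained snapshot can no longer be answered.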

Schema Evolution Without Downtime

Iceberg supports safe schema evolution operations that are not possible with plain Parquet on S3:

Add column — safe, new column appears as null in historical files
Rename column — safe, tracked in metadata without file rewrite
Drop column — safe, column disappears from subsequent reads without affecting historical data
Widen type — safe for compatible widening (INT to BIGINT, FLOAT to DOUBLE)

This is a significant operational advantage. In a traditional data lake, adding a column requires updating the Athena table definition and potentially rewriting Parquet files for schema consistency. In Iceberg, ALTER TABLE ADD COLUMN is a metadata-only operation: it touches no data files, completes quickly, and is backward-compatible.
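These operations can be metadata-only because Iceberg identifies columns by a stable field ID rather than by name or position. The following toy model in plain Python sketches the idea. The dictionaries and the `read_row` helper are hypothetical stand-ins for the table schema and a stored Parquet row, not Iceberg data structures.

```python
# Schema maps column names to stable field IDs; data files store values by ID.
schema = {"customer_id": 1, "email": 2, "name": 3}

# A row from a file written before any schema change.
old_file_row = {1: "CUST-12345", 2: "a@example.com", 3: "Ada"}


def read_row(row_by_id, schema):
    """Project a stored row through the current schema; missing IDs read as None."""
    return {col: row_by_id.get(field_id) for col, field_id in schema.items()}


# Rename: the name changes, the ID does not, so the old file needs no rewrite.
schema["full_name"] = schema.pop("name")
# Add: a new ID that no old file contains, so old rows read as None (null).
schema["plan_type"] = 4
# Drop: the mapping is removed; the bytes stay in the file but are never projected.
del schema["email"]

print(read_row(old_file_row, schema))
# {'customer_id': 'CUST-12345', 'full_name': 'Ada', 'plan_type': None}
```

The old file is never modified in any of the three steps; only the name-to-ID mapping changes.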

Connecting the Lakehouse to the Warehouse Layer

The lakehouse architecture does not replace Amazon Redshift; it complements it. For a hybrid lakehouse-warehouse setup:

Iceberg on S3 stores the full raw and curated datasets — potentially petabytes — at low cost
Redshift Spectrum queries Iceberg tables registered in the Glue catalog, joining them with Redshift-managed tables as needed
Redshift materialized views over Spectrum queries cache frequently-accessed aggregations for fast dashboard performance

This pattern is discussed in the modern data stack guide and aligns with the architecture detailed in building a data lake on S3. The lakehouse layer sits between these two: more structured than a raw data lake, more open and scalable than a pure data warehouse.

Compaction and Maintenance

Iceberg tables require periodic maintenance to remain performant. High-frequency writes produce many small Parquet files, which are expensive to read in aggregate. Iceberg provides compaction operations that merge small files into larger ones without changing the table’s logical content:

# Run in a periodic Glue maintenance job
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'curated.dim_customers',
        options => map('target-file-size-bytes', '134217728')  -- 128 MB target
    )
""")

# Expire old snapshots to control S3 storage costs
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'curated.dim_customers',
        older_than => TIMESTAMP '2024-01-09 00:00:00',
        retain_last => 10
    )
""")

Schedule these maintenance jobs weekly or daily depending on write frequency, using AWS Glue Triggers or MWAA Airflow DAGs.
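In a scheduled job, the older_than cutoff is usually computed relative to the run time rather than hard-coded. A minimal sketch follows. `expire_snapshots_sql` is a hypothetical helper, and the seven-day retention is an assumption chosen to match the example above.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots_sql(table, now, retention_days=7, retain_last=10):
    """Build the expire_snapshots call with a cutoff relative to `now`."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "CALL glue_catalog.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


sql = expire_snapshots_sql(
    "curated.dim_customers",
    datetime(2024, 1, 16, tzinfo=timezone.utc),
)
print(sql)
# CALL glue_catalog.system.expire_snapshots(table => 'curated.dim_customers', older_than => TIMESTAMP '2024-01-09 00:00:00', retain_last => 10)
```

Inside the Glue job this would be run as `spark.sql(expire_snapshots_sql("curated.dim_customers", datetime.now(timezone.utc)))`. Note that the retention window also bounds how far back time travel queries can reach.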

Conclusion

The lakehouse architecture on AWS — anchored by Apache Iceberg on S3 with Glue as the catalogue and Athena and Redshift Spectrum as query engines — delivers the cost efficiency of a data lake with the reliability and governance of a data warehouse. ACID transactions, time travel, schema evolution, and efficient upserts are now first-class features of S3-based storage.

For organisations building or modernising their data architecture, the lakehouse pattern is increasingly the right default. It reduces the technical debt of running a lake and a warehouse as two disconnected systems while providing capabilities that neither delivers alone.

If you are designing a lakehouse architecture on AWS and want expert guidance on Iceberg configuration, query engine selection, and migration from an existing architecture, contact Infra IT Consulting for a technical consultation.
