
Apache Iceberg with AWS Glue: The Modern Table Format Explained

By Infra IT Consulting · March 25, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

The data lakehouse pattern promises the best of both worlds: the scalability and cost-efficiency of an object store with the reliability and query performance of a data warehouse. Achieving that promise requires a table format that sits between your storage layer and your compute engines, providing transactional semantics, efficient metadata management, and consistent reads. Apache Iceberg has emerged as the format of choice for AWS-native data teams, and its integration with AWS Glue, Athena, and S3 has matured to the point where it is a viable production choice for organisations of all sizes.

This post explains what Iceberg actually does, how it integrates with AWS Glue and the broader AWS analytics stack, and the architectural decisions you need to make when adopting it.

What Makes Iceberg Different from Parquet and Hive Tables

Traditional Hive-style tables on S3 store metadata in a Hive Metastore (or the AWS Glue Data Catalog) as a mapping from partition values to S3 directory paths. Query engines list those directories, filter by partition, and then scan the files within. This approach has fundamental limitations: partition scheme changes require full rewrites, concurrent writers can corrupt the table state, and listing millions of small files is slow enough to dominate query latency.

Apache Iceberg replaces the directory-listing model with a tree of metadata files. At the root is a table metadata file that records the table's snapshots and identifies the current one. Each snapshot points to a manifest list, which references manifest files, which in turn reference individual data files along with their column-level statistics. This structure allows query engines to prune not just by partition but by column-level min/max statistics, which can greatly reduce the data scanned per query.
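The pruning step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Iceberg's implementation: the DataFile class, the plan_scan helper, and the file paths are hypothetical stand-ins for what a manifest entry records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    # Per-column (lower_bound, upper_bound), as a manifest entry would record.
    bounds: dict


def may_contain(file: DataFile, column: str, lo, hi) -> bool:
    """True if the file's [min, max] range for `column` overlaps [lo, hi]."""
    if column not in file.bounds:
        return True  # no statistics: the file cannot be skipped safely
    fmin, fmax = file.bounds[column]
    return fmin <= hi and fmax >= lo


def plan_scan(files, column, lo, hi):
    """Return the paths of files that might hold rows with lo <= column <= hi."""
    return [f.path for f in files if may_contain(f, column, lo, hi)]


files = [
    DataFile("data/a.parquet", {"amount": (0, 99)}),
    DataFile("data/b.parquet", {"amount": (100, 499)}),
    DataFile("data/c.parquet", {"amount": (500, 2000)}),
    DataFile("data/d.parquet", {}),  # written without statistics
]

selected = plan_scan(files, "amount", 150, 300)
print(selected)
```

Only the file whose range overlaps the predicate, plus the file with no statistics, survives planning. The other two are never opened.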

The key architectural differences from Hive tables are:

Snapshot isolation: Every commit creates a new snapshot. Readers hold a reference to a specific snapshot and see a consistent view regardless of concurrent writes.
Hidden partitioning: Partitioning is a physical implementation detail defined in the table's partition spec, not something query authors must know about. You query WHERE event_date = '2024-03-25' without writing WHERE year=2024 AND month=3 AND day=25.
Partition evolution: You can change the partitioning scheme of an existing table without rewriting data. Old data retains its original partitioning; new data uses the updated scheme. The query engine handles both transparently.
Schema evolution: Adding, renaming, reordering, or widening columns is a metadata-only operation that requires no data file rewrites.
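To make the snapshot isolation point concrete, here is a toy in-memory model in plain Python. The ToyTable class and its methods are invented for illustration and do not correspond to any Iceberg API. The sketch assumes commits are append-only and that publishing a commit is a single pointer swap.

```python
class ToyTable:
    """In-memory stand-in for a table whose commits produce immutable snapshots."""

    def __init__(self):
        # snapshot id -> immutable tuple of data file paths
        self._snapshots = {0: ()}
        self.current_id = 0

    def commit_append(self, new_files):
        """Create a new snapshot that adds `new_files` to the current one."""
        new_id = self.current_id + 1
        self._snapshots[new_id] = self._snapshots[self.current_id] + tuple(new_files)
        self.current_id = new_id  # a single pointer swap publishes the commit
        return new_id

    def files_at(self, snapshot_id):
        return self._snapshots[snapshot_id]


table = ToyTable()
table.commit_append(["data/f1.parquet"])

pinned = table.current_id                 # a reader pins snapshot 1
table.commit_append(["data/f2.parquet"])  # a writer commits snapshot 2 meanwhile

reader_view = table.files_at(pinned)
latest_view = table.files_at(table.current_id)
print(reader_view, latest_view)
```

The reader that pinned snapshot 1 keeps seeing one file even after the second commit, because existing snapshots are never modified in place.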

Setting Up Iceberg Tables in AWS Glue

AWS Glue natively supports Iceberg as of Glue version 3.0, with full support from Glue 4.0. You enable it on a job by setting the --datalake-formats job parameter to iceberg. You can then create and manage Iceberg tables directly using PySpark in Glue jobs, with the Glue Data Catalog acting as the Iceberg catalog.

Here is a complete example of creating an Iceberg table and performing an upsert in a Glue ETL job:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Configure Iceberg with the Glue catalog. The settings are applied before the
# SparkContext is created because spark.sql.extensions is a static setting that
# cannot be changed on a running session, and MERGE INTO needs the extensions.
conf = SparkConf()
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog",
         "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.warehouse",
         "s3://my-data-lake/iceberg/")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl",
         "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl",
         "org.apache.iceberg.aws.s3.S3FileIO")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Create an Iceberg table with hidden partitioning
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.analytics.customer_events (
        event_id STRING,
        customer_id STRING,
        event_type STRING,
        event_timestamp TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_timestamp))
    LOCATION 's3://my-data-lake/iceberg/analytics/customer_events'
    TBLPROPERTIES (
        'write.target-file-size-bytes' = '134217728',
        'write.parquet.compression-codec' = 'snappy'
    )
""")

# MERGE (upsert) new events into the table.
# new_events must already exist as a temporary view or table with the same
# columns, for example via df.createOrReplaceTempView("new_events") on the
# DataFrame that holds the incoming batch.
spark.sql("""
    MERGE INTO glue_catalog.analytics.customer_events AS target
    USING new_events AS source
    ON target.event_id = source.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

job.commit()

The PARTITIONED BY (days(event_timestamp)) clause uses Iceberg’s transform functions to derive the partition value from the timestamp column automatically. Query authors never reference partition columns directly — Athena and Spark both push predicates on event_timestamp through the transform to identify the relevant partitions.
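As a rough illustration of what the days transform does, the sketch below maps a timestamp to a count of whole days since the Unix epoch and projects a timestamp range onto the day partitions it can touch. The function names are made up for this example, and the sketch assumes UTC timestamps. It is not the engine's actual planning code.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts: datetime) -> int:
    """Whole days between the Unix epoch and `ts` (UTC)."""
    return (ts - EPOCH).days


def partitions_for_range(start: datetime, end: datetime) -> range:
    """Day partitions that can hold rows with start <= event_timestamp <= end."""
    return range(days_transform(start), days_transform(end) + 1)


start = datetime(2024, 3, 25, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 3, 26, 12, 0, tzinfo=timezone.utc)

wanted = partitions_for_range(start, end)
print(list(wanted))
```

Because the transform is monotonic, a range predicate on the timestamp becomes a range of partition values, so the query author only ever filters on event_timestamp.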

Querying Iceberg Tables with Amazon Athena

Athena supports Iceberg tables natively from Athena engine version 3. No special configuration is required beyond ensuring the table is registered in the Glue Data Catalog. Queries look identical to standard SQL:

-- Current snapshot query
SELECT customer_id, COUNT(*) AS event_count
FROM analytics.customer_events
WHERE event_timestamp >= TIMESTAMP '2024-03-01 00:00:00'
GROUP BY customer_id
ORDER BY event_count DESC;

-- Time travel: query the table as of a specific snapshot
SELECT * FROM analytics.customer_events
FOR VERSION AS OF 5483947382910
LIMIT 100;

-- Time travel: query as of a specific timestamp
SELECT * FROM analytics.customer_events
FOR TIMESTAMP AS OF TIMESTAMP '2024-03-10 12:00:00 UTC'
WHERE customer_id = 'cust_8827';

Athena’s integration with Iceberg’s column statistics means that well-maintained Iceberg tables (those compacted regularly, for example with Athena’s OPTIMIZE statement) often significantly outperform equivalent Parquet/Hive tables for filtered analytical queries, even without changes to query syntax. The statistics allow Athena to skip entire data files based on column min/max values, reducing data scanned and therefore both query latency and cost.
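The timestamp form of time travel comes down to choosing the latest snapshot committed at or before the requested time. A minimal sketch of that lookup in plain Python follows. The snapshot ids and commit times are invented, and snapshot_as_of is a hypothetical helper rather than an Athena or Iceberg function.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# (commit time, snapshot id), sorted by commit time. Values are made up.
snapshot_log = [
    (datetime(2024, 3, 1, 8, 0, tzinfo=timezone.utc), 101),
    (datetime(2024, 3, 9, 23, 30, tzinfo=timezone.utc), 102),
    (datetime(2024, 3, 12, 6, 15, tzinfo=timezone.utc), 103),
]


def snapshot_as_of(log, ts):
    """Return the id of the latest snapshot committed at or before `ts`."""
    commit_times = [t for t, _ in log]
    idx = bisect_right(commit_times, ts)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[idx - 1][1]


resolved = snapshot_as_of(
    snapshot_log, datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
)
print(resolved)
```

A request for a time earlier than the first commit has no valid answer, which is why the helper raises instead of returning the oldest snapshot.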

Partition Evolution: A Practical Example

One of Iceberg’s most operationally valuable features is partition evolution. Suppose you originally partitioned your events table by month (months(event_timestamp)) but, as the table grows, you need daily partitions for better query performance. With a Hive table, this change requires a full data rewrite. With Iceberg, it is a metadata operation, shown here as Spark SQL (the statements require Iceberg’s Spark SQL extensions):

ALTER TABLE glue_catalog.analytics.customer_events
DROP PARTITION FIELD months(event_timestamp);

ALTER TABLE glue_catalog.analytics.customer_events
ADD PARTITION FIELD days(event_timestamp);

After this change, new writes use daily partitioning while old data retains monthly partitions. Query engines handle both partitioning schemes transparently when planning queries — the change is entirely backward compatible with existing data and existing queries.
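One way to picture how a planner copes with two partitioning schemes at once: each data file remembers the spec it was written under, and the query range is projected through that file's own transform. The sketch below is a simplified model with hypothetical names (SPECS, plan, the file paths), and it assumes date-only predicates.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def months_transform(d: date) -> int:
    """Whole months since January 1970."""
    return (d.year - 1970) * 12 + (d.month - 1)


def days_transform(d: date) -> int:
    """Whole days since 1970-01-01."""
    return (d - EPOCH).days


# spec id -> transform used by files written under that spec
SPECS = {0: months_transform, 1: days_transform}

# (path, spec id, partition value): a hypothetical file listing
files = [
    ("data/2024-01.parquet", 0, months_transform(date(2024, 1, 1))),
    ("data/2024-02.parquet", 0, months_transform(date(2024, 2, 1))),
    ("data/2024-03-24.parquet", 1, days_transform(date(2024, 3, 24))),
    ("data/2024-03-25.parquet", 1, days_transform(date(2024, 3, 25))),
]


def plan(files, start: date, end: date):
    """Keep files whose partition value lies in the range projected through their own spec."""
    selected = []
    for path, spec_id, value in files:
        transform = SPECS[spec_id]
        if transform(start) <= value <= transform(end):
            selected.append(path)
    return selected


matched = plan(files, date(2024, 2, 15), date(2024, 3, 24))
print(matched)
```

The monthly file for February is kept because part of that month falls in the range, while the daily files are pruned at day granularity.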

Compaction and Maintenance

Iceberg tables accumulate small files over time, especially with frequent incremental writes or streaming ingestion patterns. The AWS Glue Data Catalog offers managed automatic compaction for Iceberg tables. You can also run compaction yourself from a Spark job using Iceberg's maintenance procedures:

# Compact small files using Iceberg's rewrite_data_files procedure
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'analytics.customer_events',
        strategy => 'sort',
        sort_order => 'zorder(customer_id, event_type)',
        options => map(
            'target-file-size-bytes', '134217728',
            'min-file-size-bytes', '33554432'
        )
    )
""")

# Also expire old snapshots to control metadata growth:
# remove snapshots older than the timestamp, but always keep the last 100
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'analytics.customer_events',
        older_than => TIMESTAMP '2024-03-18 00:00:00',
        retain_last => 100
    )
""")

Z-order sorting during compaction is particularly effective for tables with multiple high-cardinality filter columns, as it co-locates related data physically even without a strict single-column sort order. This is a significant improvement over the single-column sort keys available in traditional columnar formats.
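The core idea behind Z-ordering is bit interleaving: the bits of two or more column values are woven into one sort key so that rows close together in every dimension tend to land close together in the file layout. The sketch below shows this for two small non-negative integers. It is illustrative only; handling of strings and other types is more involved and is not covered here.

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (3, 3), (0, 2)]

# Sorting by the interleaved key walks the grid in a Z-shaped curve.
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)
```

The four points in the 2x2 corner of the grid sort next to each other, which is what lets min/max statistics on both columns stay tight within each file.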

Iceberg vs. Delta Lake for AWS Teams

The choice between Iceberg and Delta Lake is contextual rather than absolute. Delta Lake on AWS has a more mature Spark-native ecosystem with better MERGE performance benchmarks at extremely large scales. Iceberg has broader native support across AWS services — Athena, EMR, and Glue all support Iceberg natively without external connectors, and AWS has publicly committed to Iceberg as a strategic format.

For teams building AWS-native data platforms where Athena is a primary query interface, Iceberg’s native Athena support is a compelling advantage. For teams running heavy Spark workloads with complex MERGE patterns on tables exceeding hundreds of terabytes, Delta Lake’s mature optimistic concurrency implementation may edge out Iceberg in raw performance.

Both formats store data in open Parquet files and both support time travel, schema evolution, and ACID transactions. The lakehouse architecture on AWS you build will be well-served by either, and the operational patterns around maintenance, monitoring, and incremental processing transfer between the two.

Real-World Architectural Considerations

When adopting Iceberg in production, several operational decisions matter beyond the initial table creation:

Catalog location: Using the Glue Data Catalog as the Iceberg catalog enables Athena access and keeps table discovery centralised. The alternative — using an Iceberg REST catalog or Hadoop catalog — provides more portability but requires additional infrastructure.

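File size targets: A target file size of 128 MB, as set through write.target-file-size-bytes in the table example earlier, is appropriate for most workloads. Note that Iceberg's own default for this property is 512 MB, so set it explicitly if you want 128 MB. For tables with very high concurrency writes, smaller targets reduce write latency at the cost of more compaction work. For tables primarily read by Athena, larger files (256 MB) reduce the per-file overhead of S3 requests.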

Snapshot retention: Retaining too few snapshots limits your time travel window; retaining too many increases metadata overhead. A 7-day retention with 500 minimum snapshots is a reasonable starting point for most operational tables.

IAM permissions: The Glue job execution role needs read/write access to both the S3 data location and the Glue Data Catalog. Ensure your Lake Formation permissions (if enabled) grant column-level access appropriately for Athena query users.
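The snapshot retention trade-off above combines two rules: an age cutoff and a minimum number of snapshots to keep regardless of age. A small plain-Python sketch of that selection logic follows. The function name and sample data are hypothetical, and the numbers are shrunk so the result is easy to follow.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshots, now, max_age, min_to_keep):
    """snapshots: list of (snapshot_id, commit_time). Returns ids eligible for expiry."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    # The most recent `min_to_keep` snapshots are protected whatever their age.
    protected = {sid for sid, _ in newest_first[:min_to_keep]}
    cutoff = now - max_age
    return [
        sid for sid, ts in newest_first
        if ts < cutoff and sid not in protected
    ]


now = datetime(2024, 3, 25, tzinfo=timezone.utc)
# One snapshot per day for the 10 days before `now`; the id equals the age in days.
snapshots = [(i, now - timedelta(days=i)) for i in range(1, 11)]

expired = snapshots_to_expire(snapshots, now, timedelta(days=7), min_to_keep=8)
print(expired)
```

Snapshot 8 is older than seven days but survives because it is among the eight most recent, so only snapshots 9 and 10 are selected. On a table with few commits, the minimum-count rule dominates. On a busy table, the age cutoff does.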

Conclusion

Apache Iceberg with AWS Glue represents a significant architectural improvement over traditional Parquet/Hive table formats for data teams building on S3. The combination of ACID transactions, hidden partitioning, schema and partition evolution, and native Athena support makes it a practical choice for production data lakehouses — not a research project.

The integration with AWS-managed services like Glue and Athena means your team can adopt Iceberg incrementally, migrating tables from Parquet as business requirements justify the operational change rather than committing to a full platform rewrite upfront.

If your team is evaluating open table formats for a new data lakehouse or planning a migration from a traditional data warehouse, contact Infra IT Consulting to discuss the right approach for your specific data volumes, query patterns, and organisational constraints.
