AWS Data Engineering glue-streamingkafkakinesis

AWS Glue Streaming ETL: Processing Kafka and Kinesis Data

By Infra IT Consulting · May 6, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem. Learn more

Batch ETL pipelines have a fundamental latency floor: data is not available for analysis until the next scheduled run, which might be hours away. For many analytical use cases — operational dashboards, near-real-time fraud detection, SLA monitoring, and customer-facing data products — hourly or daily batch latency is acceptable. For a growing category of use cases, it is not. AWS Glue Streaming ETL bridges the gap between batch and real-time processing by running Spark Structured Streaming jobs inside the managed Glue environment, consuming from Kinesis Data Streams or Apache Kafka (including Amazon MSK) and writing to S3, Redshift, or other destinations continuously.

Understanding Glue Streaming ETL requires understanding what it is and what it is not. It is not a true real-time streaming system like Apache Flink — it operates in micro-batches with a configurable window size (typically 100 seconds for Kinesis). It is a managed, serverless Structured Streaming job with built-in checkpointing, auto-scaling, and integration with the Glue Data Catalog. For teams that need sub-second latency, Kinesis Data Analytics (Managed Apache Flink) is the right tool. For teams that need minutes-level latency with the operational simplicity of Glue, Glue Streaming ETL is a strong choice.

Architecture Overview

A Glue Streaming ETL job runs as a long-lived Spark Structured Streaming application. Unlike batch Glue jobs that start, process, and stop, a streaming job runs continuously and must be designed for durability and fault tolerance.

The key components:

Source: Kinesis Data Stream or Kafka topic (MSK or self-managed Kafka reachable from your VPC)
Glue Streaming Job: A long-running Structured Streaming application with job bookmarks replaced by Spark checkpointing
Checkpoint store: An S3 location where Spark writes checkpoints to enable restart from the last committed offset
Sink: S3 (most common), Redshift, Kinesis, or DynamoDB

The micro-batch model means Glue collects records for a window duration (configurable via windowSize), processes the micro-batch as a Spark DataFrame, and commits the results to the sink. Checkpoints track the stream position so that if the job fails and restarts, it resumes from exactly where it left off without reprocessing or losing data.

Consuming from Kinesis Data Streams

Here is a complete Glue Streaming ETL script consuming from Kinesis, applying transformations, and writing partitioned Parquet to S3:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

args = getResolvedOptions(
    sys.argv,
    ['JOB_NAME', 'stream_name', 'checkpoint_location', 'output_path']
)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Define the schema of incoming JSON events
event_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("currency", StringType(), True),
    StructField("event_time", LongType(), True),  # Unix epoch milliseconds
    StructField("metadata", StringType(), True)
])

# Create the Kinesis source using Glue's DataSource
kinesis_options = {
    "streamARN": f"arn:aws:kinesis:ca-central-1:123456789:stream/{args['stream_name']}",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "false",
    "classification": "json"
}

kinesis_stream = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options=kinesis_options,
    transformation_ctx="kinesis_stream"
)

def process_batch(data_frame, batch_id):
    """Process each micro-batch of records from Kinesis."""
    if data_frame.count() == 0:
        return

    # Parse the JSON payload from the Kinesis record
    parsed_df = data_frame.select(
        F.from_json(F.col("$json$data_infer_schema$_1"), event_schema).alias("data")
    ).select("data.*")

    # Transform: convert epoch to timestamp, derive partition columns
    transformed_df = parsed_df \
        .withColumn("event_timestamp",
                    F.to_timestamp(F.col("event_time") / 1000)) \
        .withColumn("event_date", F.to_date("event_timestamp")) \
        .withColumn("event_hour", F.hour("event_timestamp")) \
        .filter(F.col("event_id").isNotNull()) \
        .filter(F.col("amount") > 0) \
        .drop("event_time")

    # Write to partitioned S3 as Parquet
    transformed_df.write \
        .format("parquet") \
        .mode("append") \
        .partitionBy("event_date", "event_hour") \
        .save(args['output_path'])

    print(f"Batch {batch_id}: Wrote {data_frame.count()} records to S3")

# Run the streaming job with checkpointing
glueContext.forEachBatch(
    frame=kinesis_stream,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": args['checkpoint_location'],
        "batchMaxRetries": 3
    }
)

job.commit()

The forEachBatch function is Glue’s wrapper around Spark Structured Streaming’s foreachBatch sink. It handles the checkpoint management and retry logic automatically. The windowSize parameter controls how long Glue accumulates records before triggering a micro-batch — larger windows reduce the number of small files written to S3 but increase end-to-end latency.

Consuming from Apache Kafka (MSK)

For Kafka sources, the connection configuration changes but the processing model is identical:

kafka_options = {
    "connectionName": "msk-connection",  # Glue connection to your MSK cluster
    "topicName": "transactions",
    "startingOffsets": "earliest",
    "inferSchema": "false"
}

kafka_stream = glueContext.create_data_frame.from_options(
    connection_type="kafka",
    connection_options=kafka_options,
    transformation_ctx="kafka_stream"
)

The connectionName references a Glue Network Connection that provides the MSK broker addresses and VPC configuration. For MSK clusters with IAM authentication, the Glue job execution role needs kafka-cluster:* permissions on the cluster ARN. For SASL/SCRAM authentication, store credentials in AWS Secrets Manager and reference them in the Glue connection configuration.

One important consideration with Kafka sources: Glue Streaming jobs read from all partitions of the specified topic. Ensure your Glue DPU allocation scales with the Kafka topic partition count — each partition requires independent consumer capacity, and undersized jobs will show increasing consumer lag as records accumulate faster than they are processed.

Handling Schema Evolution

Streaming ETL introduces a challenging schema evolution problem: your Kinesis or Kafka producers may begin sending new fields at any time, and your streaming job must handle both old and new record formats simultaneously. Defining an explicit schema in your Glue job (as shown above) rejects records that don’t match — useful for data quality but fragile when schemas change.

A more resilient pattern uses schema-on-read with explicit handling of missing fields:

def process_batch(data_frame, batch_id):
    # Parse with permissive mode — malformed records go to a dead letter path
    parsed_df = data_frame.select(
        F.from_json(
            F.col("$json$data_infer_schema$_1"),
            event_schema,
            {"mode": "PERMISSIVE", "columnNameOfCorruptRecord": "_corrupt_record"}
        ).alias("data")
    ).select("data.*")

    # Separate valid records from malformed ones
    valid_df = parsed_df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
    dead_letter_df = parsed_df.filter(F.col("_corrupt_record").isNotNull())

    # Write dead letter records for manual review
    if dead_letter_df.count() > 0:
        dead_letter_df.write \
            .format("json") \
            .mode("append") \
            .save("s3://my-bucket/dead-letter/events/")

    # Handle optional new fields with safe column access
    if "new_field" in valid_df.columns:
        valid_df = valid_df.withColumn("new_field_processed",
                                       F.col("new_field").cast("string"))
    else:
        valid_df = valid_df.withColumn("new_field_processed", F.lit(None).cast("string"))

    valid_df.write.format("parquet").mode("append") \
        .partitionBy("event_date").save(args['output_path'])

The dead letter pattern ensures that malformed records are captured for investigation rather than silently dropped, which is essential for maintaining trust in streaming pipelines.

Managing Small Files in Streaming S3 Sinks

Every micro-batch writes a new set of Parquet files to S3. With a 100-second window, a Glue streaming job writes roughly 864 micro-batch files per day per partition. For a table partitioned by hour, that is up to 20,000+ small files per day — small files that degrade Athena query performance and increase S3 API costs.

The solution is periodic compaction. A separate batch Glue job or AWS Lambda function runs on a schedule (e.g., hourly) to compact the small files written by the streaming job:

# Compaction job that runs hourly on the previous hour's data
compaction_hour = (datetime.utcnow() - timedelta(hours=1)).strftime("%Y-%m-%d/%H")
input_path = f"s3://my-bucket/events/event_date={compaction_hour[:10]}/event_hour={compaction_hour[11:]}/"

df = spark.read.parquet(input_path)
df.coalesce(4).write.format("parquet").mode("overwrite").save(input_path)

This compacts each hour’s worth of streaming output into 4 files (approximately 128 MB each for typical event volumes), dramatically improving downstream query performance without stopping the streaming job.

For teams using Delta Lake on AWS or Apache Iceberg as the streaming sink format, compaction is managed by the table format’s built-in OPTIMIZE command rather than requiring a separate compaction job.

Monitoring Glue Streaming Jobs

Glue Streaming jobs emit CloudWatch metrics under the Glue namespace:

glue.driver.streaming.numRecords: Records processed per micro-batch — alert if this drops to zero for an extended period
glue.driver.streaming.numBatches: Micro-batches processed — should be consistent over time
glue.ALL.s3.filesystem.write_bytes: Data written to S3 per interval — useful for detecting processing slowdowns

The most important operational metric is consumer lag — the difference between the latest offset in your Kinesis stream or Kafka topic and the offset your Glue job has consumed. For Kinesis, monitor GetRecords.IteratorAgeMilliseconds in the Kinesis CloudWatch namespace. For Kafka/MSK, consumer group lag is available through the MSK Broker metrics or Kafka client tooling. Growing consumer lag indicates that your Glue job cannot keep up with the incoming data rate and needs more DPUs.

Real-time data streaming with Kinesis covers the Kinesis stream configuration side of this architecture — including shard sizing and producer patterns that affect what your Glue streaming job sees.

Conclusion

AWS Glue Streaming ETL occupies a productive niche in the AWS data engineering toolkit: managed, serverless near-real-time processing with the familiar Spark programming model and native integration with Kinesis, MSK, Glue Data Catalog, and S3. For teams that need minutes-level data freshness without the operational complexity of a dedicated streaming platform, it is a compelling option.

The key success factors are appropriate window sizing for your latency requirements, robust dead-letter handling for schema evolution, a compaction strategy for the small files that streaming inevitably produces, and consumer lag monitoring to detect processing capacity issues before they become visible to downstream consumers.

If your data platform needs to move from batch to near-real-time processing and you’re evaluating the right architecture for your latency, cost, and operational requirements, contact Infra IT Consulting for a streaming architecture assessment.

AWS Data Engineering

Talk to our team →

AWS Glue Streaming ETL: Processing Kafka and Kinesis Data

Architecture Overview

Consuming from Kinesis Data Streams

Consuming from Apache Kafka (MSK)

Handling Schema Evolution

Managing Small Files in Streaming S3 Sinks

Monitoring Glue Streaming Jobs

Conclusion

Related posts

Using AWS DMS for Zero-Downtime Database Migrations

7 Proven Ways to Cut AWS Data Pipeline Costs Without Losing Performance

Amazon Redshift vs. Athena: Choosing the Right Query Engine

Architecture Overview

Consuming from Kinesis Data Streams

Consuming from Apache Kafka (MSK)

Handling Schema Evolution

Managing Small Files in Streaming S3 Sinks

Monitoring Glue Streaming Jobs

Conclusion

Related posts

Using AWS DMS for Zero-Downtime Database Migrations

7 Proven Ways to Cut AWS Data Pipeline Costs Without Losing Performance

Amazon Redshift vs. Athena: Choosing the Right Query Engine

We value your privacy