Using AWS Lambda for Lightweight ETL Transformations

By Infra IT Consulting · 8 min read

AWS Glue, EMR, and Step Functions get most of the attention in data engineering discussions. AWS Lambda, by contrast, is often dismissed as “just for web applications.” This is a mistake. Lambda is one of the most cost-efficient and operationally simple tools available for a well-defined category of ETL work, and understanding where it fits — and where it absolutely does not — makes you a more effective data platform designer.

Where Lambda Fits in the ETL Landscape

Lambda is a poor fit for large-scale data processing. The limits are not negotiable: a 15-minute maximum execution duration, 10 GB maximum memory, 512 MB of ephemeral /tmp storage by default (configurable up to 10 GB), and no built-in distributed compute. Trying to process a 200 GB Parquet file in a single Lambda invocation is an architectural error, not a configuration problem.

Lambda is an excellent fit for:

  • Event-driven micro-transformations: A new file lands in S3, Lambda validates its schema, applies a lightweight normalization, and writes it to the processed prefix — all within seconds and at near-zero cost for infrequent workloads.
  • Fan-out coordination: Lambda receives an SQS message listing partitions to process and invokes downstream Glue jobs or Step Functions state machines.
  • API-sourced data ingestion: Polling external REST APIs, paginating through results, and writing raw responses to S3 for downstream processing.
  • Data routing and enrichment: Adding metadata, looking up reference data from DynamoDB or ElastiCache, and routing records to different S3 prefixes based on content.
  • Lightweight file format conversion: Converting small JSON or CSV files (< 500 MB) to Parquet without spinning up a Glue cluster.

The key principle: Lambda excels when the work per invocation is bounded, fast, and does not require distributed compute. When data volumes grow beyond what fits in Lambda’s constraints, hand off to Glue, EMR, or Athena.

S3-Triggered ETL: The Core Pattern

The most common Lambda ETL pattern is an S3 event trigger. A file arrives in the raw S3 prefix, Lambda fires, validates and transforms the file, and writes the result to the processed prefix.

import io
import json
import logging
from urllib.parse import unquote_plus

import boto3
import pandas as pd

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3 = boto3.client('s3')

REQUIRED_COLUMNS = {'customer_id', 'event_type', 'timestamp', 'value'}
OUTPUT_BUCKET = 'my-org-data-lake-processed'

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # S3 event notifications URL-encode object keys (a space arrives as '+')
        key = unquote_plus(record['s3']['object']['key'])
        size_bytes = record['s3']['object']['size']

        logger.info(json.dumps({
            "event": "file_received",
            "bucket": bucket,
            "key": key,
            "size_bytes": size_bytes
        }))

        # Guard against files too large for Lambda
        if size_bytes > 400 * 1024 * 1024:  # 400 MB threshold
            trigger_glue_job(bucket, key)
            continue  # keep handling any remaining records in this event

        process_file(bucket, key)

def process_file(bucket: str, key: str):
    response = s3.get_object(Bucket=bucket, Key=key)
    raw_bytes = response['Body'].read()
    
    df = pd.read_json(io.BytesIO(raw_bytes), lines=True)
    
    # Validate schema
    missing_cols = REQUIRED_COLUMNS - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")
    
    # Normalize: parse timestamps, drop nulls on key fields, type cast
    df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    df = df.dropna(subset=['customer_id', 'event_type', 'timestamp'])
    
    # Derive partition columns
    df['year'] = df['timestamp'].dt.year
    df['month'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    
    # Write as Parquet, partitioned by date
    for (year, month, day), partition_df in df.groupby(['year', 'month', 'day']):
        buffer = io.BytesIO()
        partition_df.drop(columns=['year', 'month', 'day']).to_parquet(
            buffer, engine='pyarrow', compression='snappy', index=False
        )
        buffer.seek(0)
        
        output_key = f"events/year={year}/month={month:02d}/day={day:02d}/{key.split('/')[-1]}.parquet"
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=output_key, Body=buffer.getvalue())
        
        logger.info(json.dumps({
            "event": "partition_written",
            "output_key": output_key,
            "records": len(partition_df)
        }))

def trigger_glue_job(bucket: str, key: str):
    """Hand off large files to Glue rather than processing in Lambda."""
    glue = boto3.client('glue')
    glue.start_job_run(
        JobName='my-large-file-processor',
        Arguments={
            '--source_bucket': bucket,
            '--source_key': key,
        }
    )
    logger.info(json.dumps({"event": "handed_off_to_glue", "key": key}))

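This function illustrates several production practices: structured JSON logging, a size guard that hands off large files to Glue, partition-aware output writing, and basic handling of bad data (a schema check that fails fast, coercion of unparseable values to nulls, and dropping rows with null key fields).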
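One detail worth isolating is that S3 event notifications deliver object keys URL-encoded, so a handler has to decode them before calling get_object. The following is a minimal, stdlib-only sketch of that step; the helper name `extract_s3_objects` and the sample event are hypothetical and only illustrate the shape of the data.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str, int]]:
    """Return (bucket, decoded_key, size_bytes) for each record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Keys arrive URL-encoded: a space becomes '+', other characters '%XX'.
        key = unquote_plus(s3_info["object"]["key"])
        size = s3_info["object"].get("size", 0)
        objects.append((s3_info["bucket"]["name"], key, size))
    return objects


# Hypothetical sample event containing an encoded space and colon in the key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "raw-bucket"},
                "object": {"key": "events/2024/my+file%3A1.json", "size": 1024},
            }
        }
    ]
}

print(extract_s3_objects(sample_event))
```

Keeping the parsing in a pure function like this also makes it easy to unit test without any AWS access.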

Packaging Python Dependencies in Lambda

The biggest operational challenge with Python Lambda functions for data work is dependency management. pandas, pyarrow, and numpy together exceed Lambda’s 50 MB limit on zipped deployment packages uploaded directly, so they have to be shipped in a Lambda Layer or a container image.

Option 1: Lambda Layers — Package dependencies separately and attach the layer to your function. AWS publishes an AWSSDKPandas (formerly AWS Data Wrangler) layer that includes pandas, pyarrow, and boto3 pre-built for Lambda’s Python 3.12 runtime. This is the fastest path for most teams:

arn:aws:lambda:ca-central-1:336392948345:layer:AWSSDKPandas-Python312:13

Always pin the layer version ARN in your Terraform or SAM configuration. The :latest alias does not exist for Lambda layers — you must specify a version number. Check the AWS SDK for Pandas GitHub releases for the latest version in your region.
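If layer ARNs are kept in configuration files, a small check can catch unpinned values before they reach a deployment. This is only a sketch: the helper name `parse_layer_version_arn` is hypothetical, and the pattern assumes the standard `aws` partition (not `aws-cn` or `aws-us-gov`).

```python
import re

# Pattern for a layer *version* ARN; the trailing numeric version is mandatory.
# Assumption: standard "aws" partition only.
LAYER_VERSION_ARN = re.compile(
    r"^arn:aws:lambda:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):"
    r"layer:(?P<name>[A-Za-z0-9_-]+):(?P<version>\d+)$"
)


def parse_layer_version_arn(arn: str) -> dict:
    """Split a pinned layer version ARN into its parts, or raise ValueError."""
    match = LAYER_VERSION_ARN.match(arn)
    if match is None:
        raise ValueError(f"Not a pinned layer version ARN: {arn!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts


print(parse_layer_version_arn(
    "arn:aws:lambda:ca-central-1:336392948345:layer:AWSSDKPandas-Python312:13"
))
```

An ARN ending in `:latest`, or one with no version suffix at all, fails the check with a ValueError.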

Option 2: Container Images — Package your function as a Docker container (up to 10 GB). This gives you full control over the runtime environment and avoids the 250 MB unzipped size limit that applies to a function and all of its layers combined. Container images are the right choice when you need unusual libraries (geospatial, scientific computing) or specific dependency versions not available in the managed layers.

FROM public.ecr.aws/lambda/python:3.12

COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

COPY src/ ${LAMBDA_TASK_ROOT}/
CMD ["handler.handler"]

SQS-Based Fan-Out for High-Throughput Ingestion

For event-driven pipelines where many files arrive simultaneously, direct S3 → Lambda triggers can cause throttling when Lambda’s concurrency limit (1,000 per region by default, adjustable via quota increase request) is reached. The solution is an SQS buffer:

Routing S3 → SQS → Lambda with a batch window lets you cap concurrency explicitly through the Lambda event source mapping’s MaximumConcurrency setting (available since January 2023, from 2 to 1,000), while the queue’s visibility timeout controls how long a failed batch stays hidden before it is retried:

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn                   = aws_sqs_queue.file_processing.arn
  function_name                      = aws_lambda_function.etl_processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 30
  scaling_config {
    maximum_concurrency = 50  # Limit concurrent invocations
  }
}

Setting maximum_concurrency on the SQS event source mapping prevents Lambda from scaling to its account-level limit and overwhelming downstream services like DynamoDB or external APIs.
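Putting SQS in front of the function also changes the event shape: each SQS record carries the original S3 notification as a JSON string in its body. The sketch below shows one way a handler might unwrap it and report failures per message. It assumes the event source mapping has partial batch responses enabled (`function_response_types = ["ReportBatchItemFailures"]`, which is not part of the Terraform above), and `process_file` here is a hypothetical stand-in for the real transformation.

```python
import json
from urllib.parse import unquote_plus


def process_file(bucket: str, key: str) -> None:
    """Hypothetical stand-in for the real transformation."""
    print(f"processing s3://{bucket}/{key}")


def handler(event, context, process=process_file):
    """Handle S3 notifications delivered through SQS, one SQS message at a time."""
    failures = []
    for message in event.get("Records", []):
        try:
            body = json.loads(message["body"])
            # S3 sends an s3:TestEvent without "Records" when notifications
            # are first configured; .get() lets it pass through as a no-op.
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = unquote_plus(record["s3"]["object"]["key"])
                process(bucket, key)
        except Exception:
            # Report only this message as failed so that it alone is retried.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Without partial batch responses, a single bad file would cause the whole batch of up to 10 messages to be redelivered, so successfully processed files would be processed again.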

Cost Model: When Lambda Beats Glue

Lambda pricing: $0.0000166667 per GB-second + $0.20 per million requests.

A 1 GB Lambda function running for 10 seconds costs $0.000167 per invocation, or about $167 per million invocations.

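AWS Glue pricing: $0.44 per DPU-hour (a G.1X worker is 1 DPU). With a 10-minute minimum billing increment, one DPU costs at minimum $0.073 per run (0.44 × 10/60). Note that the 10-minute minimum applies to Glue 0.9 and 1.0; Glue 2.0 and later bill per second with a 1-minute minimum, and Spark jobs require at least 2 DPUs, so check the current pricing page for your job type before relying on this figure.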
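The per-run figures quoted above can be reproduced with a few lines of arithmetic. This sketch uses only the prices stated in this section and the 10-minute, single-DPU Glue run; the function names are illustrative, and actual prices vary by region and over time.

```python
# Prices as quoted in this section (assumed, not fetched from AWS).
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_REQUEST = 0.20 / 1_000_000
GLUE_PRICE_PER_DPU_HOUR = 0.44


def lambda_cost(memory_gb: float, duration_s: float, invocations: int = 1) -> float:
    """Compute charge plus request charge for a number of invocations."""
    compute = memory_gb * duration_s * LAMBDA_PRICE_PER_GB_SECOND
    return invocations * (compute + LAMBDA_PRICE_PER_REQUEST)


def glue_cost(dpus: float, billed_minutes: float) -> float:
    """Charge for one Glue run billed for the given number of minutes."""
    return dpus * (billed_minutes / 60) * GLUE_PRICE_PER_DPU_HOUR


per_invocation = lambda_cost(memory_gb=1, duration_s=10)
per_glue_run = glue_cost(dpus=1, billed_minutes=10)

print(f"Lambda, 1 GB x 10 s:  ${per_invocation:.6f}")
print(f"Glue, 1 DPU x 10 min: ${per_glue_run:.4f}")
print(f"Lambda invocations per Glue-run cost: {per_glue_run / per_invocation:.0f}")
```

Under these assumptions, one minimum-billed Glue run costs about as much as 440 ten-second Lambda invocations, so Glue only wins on price when each run does the work of several hundred such invocations.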

For workloads with infrequent, small files: Lambda is far cheaper. For workloads with large files or complex transformations requiring multiple workers: Glue is the right tool and Lambda is the wrong one.

The break-even analysis: if your per-file processing takes less than 15 minutes and fits within 10 GB of memory, Lambda will be cheaper for volumes under roughly 50,000 files per day processed by a single-worker Glue job. Above that volume or complexity, the Glue investment pays off. Our comparison of AWS Glue vs. Apache Spark covers the Glue economics in more detail.

Error Handling and Dead Letter Queues

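Lambda’s retry behaviour for S3 triggers is important to understand: S3 invokes Lambda asynchronously, and a failed asynchronous invocation is retried twice (after roughly one minute, then roughly two minutes) before the event is discarded. When the function is instead driven by the SQS event source mapping described earlier, retries are governed by the queue’s visibility timeout and redrive policy rather than by the function’s own DLQ. For direct S3 triggers, configure a Dead Letter Queue (DLQ) on the function to capture failed events: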

resource "aws_lambda_function" "etl_processor" {
  function_name = "etl-file-processor"
  # ...
  
  dead_letter_config {
    target_arn = aws_sqs_queue.etl_dlq.arn
  }
  
  reserved_concurrent_executions = 100  # Prevent runaway scaling
}

resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name          = "etl-dlq-messages-visible"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Sum"
  threshold           = 0
  dimensions = {
    QueueName = aws_sqs_queue.etl_dlq.name
  }
  alarm_actions = [aws_sns_topic.data_engineering_alerts.arn]
}

Any message in the DLQ is a processing failure that needs investigation. Alarm on it immediately.

Connecting Lambda ETL to Your Broader Data Stack

Lambda works best as the entry point or coordination layer of a larger pipeline, not as a standalone processing engine. Common integration patterns:

  • Lambda → S3 → Glue Crawler → Athena: Lambda writes normalized Parquet; Glue Crawler updates the Data Catalog; Athena queries the result.
  • Lambda → DynamoDB Streams → Lambda → S3: Capture change data, buffer in DynamoDB, write micro-batches to S3 for downstream analytics.
  • Lambda → Step Functions: Lambda validates an incoming file and starts a Step Functions workflow that orchestrates Glue, Athena CTAS, and SNS notifications.
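As an illustration of the last pattern, a validating Lambda might start the workflow along these lines. This is a sketch under assumptions: the function name, the input payload shape, and the hash-based execution name are all hypothetical choices, and the client is passed in (for example `boto3.client("stepfunctions")`) so the function can be exercised with a stub.

```python
import hashlib
import json


def start_ingest_workflow(sfn_client, state_machine_arn: str, bucket: str, key: str) -> str:
    """Start one Step Functions execution for a validated file; return its ARN."""
    # Deterministic name (an assumption of this sketch) so that a redelivered
    # event for the same object maps to the same execution name.
    digest = hashlib.sha256(f"{bucket}/{key}".encode("utf-8")).hexdigest()[:32]
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"ingest-{digest}",
        input=json.dumps({"bucket": bucket, "key": key}),
    )
    return response["executionArn"]


# Inside a Lambda handler, after validation succeeds (names are placeholders):
#   sfn = boto3.client("stepfunctions")
#   start_ingest_workflow(sfn, STATE_MACHINE_ARN, bucket, key)
```

Passing only the bucket and key as input keeps the payload small and leaves the heavy lifting to the Glue and Athena steps in the workflow.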

For teams building out a data lake on S3, Lambda-based ingestion functions are the natural entry point for streaming and event-driven source data before it enters the structured lake layers.

Conclusion

AWS Lambda is a legitimate and cost-effective ETL tool for the right class of problems: event-driven, bounded in size, and requiring fast response times rather than high throughput on large datasets. The teams that get the most out of Lambda in data pipelines are those who understand its constraints precisely and design their architectures to work within them — including handing off to Glue or EMR when the workload grows beyond Lambda’s boundaries.

Infra IT Consulting designs serverless-first data pipelines that scale intelligently, using Lambda, Glue, and EMR in their appropriate contexts. Contact us to discuss how serverless ETL can reduce your infrastructure costs and operational overhead.
