
Monitoring and Alerting for AWS Glue Jobs in Production

By Infra IT Consulting · 9 min read

Running AWS Glue jobs in production without robust monitoring is like flying without instruments. The job might be succeeding, or it might be silently processing zero records, running for eight hours instead of twenty minutes, or writing corrupted output to S3. Without the right telemetry in place, your data consumers discover the problem before your engineering team does. This guide walks through a production-grade monitoring and alerting stack for AWS Glue, covering CloudWatch metrics, structured logging, EventBridge rules, and data quality guardrails.

Understanding Glue's Native Observability Surface

AWS Glue emits metrics and logs to Amazon CloudWatch automatically when you enable the right job parameters. Without explicit configuration, you get almost nothing useful. With it, you get a rich stream of operational telemetry.

Enable these default arguments on every Glue job:

default_arguments = {
    "--enable-metrics": "true",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-continuous-log-filter": "true",       # Filters out Spark INFO noise
    "--continuous-log-logGroup": "/aws-glue/jobs",  # Centralised log group
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": f"s3://{monitoring_bucket}/spark-ui-logs/",
}

The --enable-continuous-cloudwatch-log argument is critical. Without it, Glue only pushes logs after the job completes (or fails). With continuous logging, you get log output in near-real-time, which makes debugging a stuck job possible without waiting for it to time out.

The --enable-continuous-log-filter argument is equally important. Spark generates enormous volumes of INFO-level log output. Without filtering, a single Glue job can produce millions of log events per hour, which can cost hundreds of dollars per month in CloudWatch Logs ingestion fees and buries the messages you need for debugging under noise.
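Because these flags are easy to forget on a new job, it can help to check job definitions against the expected set. The following is a minimal sketch of such a check, not an official tool: the function name and the choice of which flags count as required are assumptions made for illustration.

```python
# Flags this guide recommends on every job. Values are strings because
# Glue passes default arguments as string key/value pairs.
REQUIRED_MONITORING_ARGS = {
    "--enable-metrics": "true",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-continuous-log-filter": "true",
}


def missing_monitoring_args(default_arguments):
    """Return the required flags that are absent or not set to the expected value."""
    return sorted(
        key
        for key, expected in REQUIRED_MONITORING_ARGS.items()
        if str(default_arguments.get(key, "")).lower() != expected
    )


# Example: a job definition that forgot the log filter.
job_args = {
    "--enable-metrics": "true",
    "--enable-continuous-cloudwatch-log": "true",
}
print(missing_monitoring_args(job_args))  # ['--enable-continuous-log-filter']
```

In practice the dictionary would come from the job definition itself, for example the DefaultArguments field returned by the Glue get_job API call, and the check could run in CI or on a schedule.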

Key CloudWatch Metrics to Monitor

Glue publishes metrics under the Glue namespace with a JobName dimension. The most operationally significant metrics are:

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors | Auto-scaling demand | Sustained at maximum for > 30 min indicates undersizing |
| glue.driver.aggregate.numFailedTasks | Task-level failures | > 0 for critical jobs |
| glue.driver.aggregate.recordsRead | Records ingested | Below expected baseline (data quality signal) |
| glue.driver.aggregate.bytesRead | Bytes processed | Compare against historical baseline |
| glue.ALL.jvm.heap.usage | Memory pressure | > 0.85 (85%) sustained |
| glue.driver.ExecutorAllocationManager.executors.numberAllExecutors | Active executor count | Unexpected drops mid-job |

The recordsRead metric is particularly valuable. A job that completes successfully but processed 0 records when it normally processes 5 million is not a successful job; it is a silent data failure. Without this check, downstream dashboards and reports show stale data, and you find out from a business stakeholder rather than your monitoring stack.
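One way to act on the recordsRead signal is a CloudWatch alarm that fires when the daily total falls below a floor. The sketch below only builds the keyword arguments for the CloudWatch put_metric_alarm call so that it runs without AWS credentials. The alarm name, the one-day period, and the one-million-record floor are illustrative assumptions, and the dimension set (JobName, JobRunId, Type) should be checked against the metrics your jobs actually emit.

```python
def records_read_alarm(job_name, min_records, period_seconds=86400):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"glue-{job_name}-low-records-read",  # hypothetical naming scheme
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.recordsRead",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "count"},
        ],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(min_records),
        "ComparisonOperator": "LessThanThreshold",
        # A period with no datapoints at all is treated as a breach too.
        "TreatMissingData": "breaching",
    }


alarm = records_read_alarm("my-critical-daily-job", min_records=1_000_000)

# With AWS credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Treating missing data as breaching is a deliberate choice here: a job that never ran emits no recordsRead datapoints, and that case should alert as loudly as a run that read too little.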

Building a CloudWatch Dashboard

A production Glue monitoring dashboard should show job execution history, duration trends, record throughput, and memory utilisation. Use Terraform or CloudFormation to manage the dashboard as code:

resource "aws_cloudwatch_dashboard" "glue_operations" {
  dashboard_name = "GlueProductionJobs"

  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          # Title matches the metric plotted: driver JVM heap, in bytes
          title  = "Driver JVM Heap Used (bytes)"
          region = var.aws_region # assumed variable, defined elsewhere
          period = 300
          metrics = [
            for job_name in var.critical_glue_jobs : [
              "Glue", "glue.driver.jvm.heap.used",
              "JobName", job_name,
              "JobRunId", "ALL",
              "Type", "gauge"
            ]
          ]
        }
      },
      {
        type = "metric"
        properties = {
          title  = "Records Read - Last 24h"
          region = var.aws_region # assumed variable, defined elsewhere
          stat   = "Sum"
          metrics = [
            for job_name in var.critical_glue_jobs : [
              "Glue", "glue.driver.aggregate.recordsRead",
              "JobName", job_name,
              "JobRunId", "ALL",
              "Type", "count"
            ]
          ]
        }
      }
    ]
  })
}

EventBridge Rules for Job State Changes

CloudWatch Metrics tell you about resource utilisation during a job run. EventBridge tells you about job lifecycle events: when a job starts, succeeds, fails, or times out. These are two different monitoring planes and you need both.

Glue publishes state change events to EventBridge automatically. Create a rule to capture failures and timeouts:

resource "aws_cloudwatch_event_rule" "glue_job_failure" {
  name        = "glue-job-failure-alert"
  description = "Capture Glue job failures and timeouts"

  event_pattern = jsonencode({
    source      = ["aws.glue"]
    detail-type = ["Glue Job State Change"]
    detail = {
      state = ["FAILED", "TIMEOUT", "STOPPED", "ERROR"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.glue_job_failure.name
  target_id = "GlueFailureSNS"
  arn       = aws_sns_topic.data_engineering_alerts.arn

  input_transformer {
    input_paths = {
      jobName   = "$.detail.jobName"
      state     = "$.detail.state"
      runId     = "$.detail.jobRunId"
      errorMsg  = "$.detail.message"
    }
    input_template = "\"Glue job <jobName> entered state <state>. Run ID: <runId>. Error: <errorMsg>\""
  }
}

The input_transformer extracts the job name, run ID, and error message from the raw EventBridge event and formats them into a human-readable notification. Without this, your on-call engineer receives a raw JSON blob at 2am.
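If the rule targets a Lambda function instead of SNS (for example, to post into a chat tool), the same formatting can be done in Python. This is a minimal sketch under that assumption: the function name is hypothetical, the sample run ID and error message are made up, and only the detail fields already referenced by the input_transformer are used.

```python
def format_glue_alert(event):
    """Render a Glue Job State Change event as a one-line alert."""
    detail = event.get("detail", {})
    return (
        f"Glue job {detail.get('jobName', 'unknown')} entered state "
        f"{detail.get('state', 'unknown')}. "
        f"Run ID: {detail.get('jobRunId', 'unknown')}. "
        f"Error: {detail.get('message', 'n/a')}"
    )


# Made-up sample event carrying the same detail fields the rule extracts.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {
        "jobName": "my-critical-daily-job",
        "state": "FAILED",
        "jobRunId": "jr_0123456789abcdef",
        "message": "Command failed with exit code 1",
    },
}
print(format_glue_alert(sample_event))
```

Using .get with defaults keeps the alert readable even when an event arrives without one of the fields, which is preferable to the formatter itself failing during an incident.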

For critical jobs, also create a rule that fires if a job has not started by a certain time. This catches upstream failures in your orchestration layer, such as an MWAA DAG that failed to trigger the Glue job at all:

# Lambda function triggered by a scheduled EventBridge rule
import os

import boto3
from datetime import datetime, timedelta, timezone

def handler(event, context):
    glue = boto3.client('glue')
    cutoff = datetime.now(timezone.utc) - timedelta(hours=2)

    job_name = 'my-critical-daily-job'
    # Only the latest run is needed; this relies on the most recent run being returned first
    runs = glue.get_job_runs(JobName=job_name, MaxResults=1)['JobRuns']

    if not runs or runs[0]['StartedOn'] < cutoff:
        # Job has not run in the expected window, so trigger an alert
        sns = boto3.client('sns')
        sns.publish(
            TopicArn=os.environ['ALERT_TOPIC_ARN'],
            Subject=f'ALERT: Glue job {job_name} has not run today',
            Message=f'Expected start time exceeded. Last run: {runs[0]["StartedOn"] if runs else "Never"}'
        )

Structured Logging in Glue Job Scripts

Beyond Glue's native telemetry, your job scripts should emit structured log events to CloudWatch Logs. Structured (JSON) logs are machine-parseable, which lets you create CloudWatch Logs Insights queries and metric filters on your application-level events.

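import json
import logging
import sys
from datetime import datetime, timezone

from awsglue.utils import getResolvedOptions

# JOB_NAME is requested explicitly; JOB_RUN_ID is expected to be supplied
# by the Glue runtime alongside it, so it is read defensively below
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_event(event_type: str, **kwargs):
    """Emit a structured log event to CloudWatch Logs."""
    event = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        "job_name": args['JOB_NAME'],
        "run_id": args.get('JOB_RUN_ID', 'unknown'),
        **kwargs
    }
    logger.info(json.dumps(event))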

# In your job logic:
log_event("records_loaded",
    source_table="raw.customers",
    records_read=source_df.count(),
    partition_date="2024-06-01"
)

log_event("transform_complete",
    records_written=output_df.count(),
    destination="s3://my-bucket/processed/customers/",
    duration_seconds=42.3
)
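The duration_seconds value above is hard-coded for illustration. One way to measure it is a small context manager that times a stage and emits a single structured event when the stage ends, whether it succeeded or raised. This is a sketch, not part of any Glue library: timed_stage, its emit parameter, and the status field are names introduced here.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("glue_job")


@contextmanager
def timed_stage(event_type, emit=None, **fields):
    """Time a block of work and emit one structured event when it finishes."""
    emit = emit or (lambda payload: logger.info(json.dumps(payload)))
    start = time.monotonic()
    status = "ok"
    try:
        yield fields  # callers can add fields such as record counts
    except Exception:
        status = "error"
        raise
    finally:
        emit({
            "event_type": event_type,
            "status": status,
            "duration_seconds": round(time.monotonic() - start, 3),
            **fields,
        })


# Usage: events are collected in a list here so the example runs anywhere.
events = []
with timed_stage("transform_complete", emit=events.append,
                 destination="s3://my-bucket/processed/customers/") as stage:
    stage["records_written"] = 3  # stand-in for output_df.count()
```

In a real job the default emitter would log the JSON line, and the status field lets a Logs Insights query separate slow-but-successful stages from failed ones.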

With structured logs in place, you can create a CloudWatch Logs Insights query to detect jobs with unexpectedly low record counts:

fields @timestamp, job_name, records_read
| filter event_type = "records_loaded"
| filter records_read < 1000
| sort @timestamp desc
| limit 50

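Save this query for use during investigations. To alert automatically, one common approach is to create a CloudWatch Logs metric filter that matches the same condition and put a CloudWatch alarm on the resulting metric, so that any matching event triggers a notification.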
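A metric filter is one way to turn the low-record-count condition into something an alarm can watch. The sketch below builds the keyword arguments for the CloudWatch Logs put_metric_filter call without contacting AWS. The filter name, metric name, and Custom/GlueJobs namespace are hypothetical, and the JSON filter pattern only matches log events whose message is a bare JSON object, so it assumes the logger is not adding a text prefix to each line.

```python
def low_record_count_filter(log_group, threshold=1000):
    """Build keyword arguments for CloudWatch Logs put_metric_filter."""
    return {
        "logGroupName": log_group,
        "filterName": "glue-low-record-count",  # hypothetical name
        "filterPattern": (
            '{ ($.event_type = "records_loaded") && '
            f"($.records_read < {threshold}) }}"
        ),
        "metricTransformations": [{
            "metricName": "LowRecordCountEvents",  # hypothetical name
            "metricNamespace": "Custom/GlueJobs",  # hypothetical namespace
            "metricValue": "1",  # count one per matching log event
        }],
    }


metric_filter = low_record_count_filter("/aws-glue/jobs")
print(metric_filter["filterPattern"])

# With AWS credentials configured, this could be applied with:
#   boto3.client("logs").put_metric_filter(**metric_filter)
```

An alarm on the sum of LowRecordCountEvents being greater than zero would then fire whenever a job logs a records_loaded event below the threshold.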

Data Quality Monitoring with AWS Glue Data Quality

AWS Glue Data Quality, launched in 2023, lets you define declarative data quality rules that run as part of your Glue job. This turns quality checks from an afterthought into a first-class monitoring concern:

from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

# Define rules in DQDL (Data Quality Definition Language)
ruleset = """
Rules = [
    RowCount > 10000,
    IsComplete "customer_id",
    IsUnique "customer_id",
    ColumnValues "country_code" in ["CA", "US", "GB", "NG", "ZA"],
    ColumnValues "order_amount" between 0 and 1000000,
    Completeness "email" > 0.95
]
"""

result = EvaluateDataQuality.apply(
    frame=source_dyf,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "customer_data_quality",
        # Needed for the results to appear as CloudWatch metrics
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": "s3://my-monitoring-bucket/dq-results/"
    }
)

# Fail the job if any rule is violated
if result.select_fields(["Outcome"]).toDF().filter("Outcome = 'Failed'").count() > 0:
    raise Exception("Data quality check failed, aborting job")

Glue Data Quality can publish results to CloudWatch, allowing you to build dashboards and alarms on data quality scores over time. This is especially important for pipelines that feed financial reports or customer-facing applications.
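The check above aborts on any failed rule. If only some rules should stop the job while others merely warn, the outcomes can be split before deciding. The following is a sketch under stated assumptions: each outcome row is taken to expose Rule and Outcome fields, and deciding criticality by the column a rule mentions is a convention invented for this example, not something Glue Data Quality provides.

```python
def failed_rules(outcomes, critical_columns):
    """Split failed rule outcomes into critical and non-critical lists."""
    critical, warnings = [], []
    for row in outcomes:
        if row["Outcome"] != "Failed":
            continue
        # A rule is critical if it mentions one of the critical columns in quotes.
        is_critical = any(f'"{col}"' in row["Rule"] for col in critical_columns)
        (critical if is_critical else warnings).append(row["Rule"])
    return critical, warnings


# Made-up outcomes shaped like rows from the rule-outcome frame.
outcomes = [
    {"Rule": 'IsComplete "customer_id"', "Outcome": "Passed"},
    {"Rule": 'IsUnique "customer_id"', "Outcome": "Failed"},
    {"Rule": 'Completeness "email" > 0.95', "Outcome": "Failed"},
]
critical, warnings = failed_rules(outcomes, critical_columns=["customer_id"])
print(critical)  # ['IsUnique "customer_id"']
print(warnings)  # ['Completeness "email" > 0.95']
```

Inside a job, the rows could be collected from the result frame as dictionaries (for example with toDF().collect() and Row.asDict()). The job would then raise only when the critical list is non-empty and log the warnings as structured events.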

Alerting Runbooks and On-Call Hygiene

Technical monitoring is only valuable if alerts are actionable. Every alert that fires should have a corresponding runbook. For Glue jobs, a minimal runbook entry should answer:

  1. What does this job do and what depends on it?
  2. Where are the CloudWatch Logs and Spark UI for the failed run?
  3. What are the top three causes of failure for this job, and how do you diagnose each?
  4. What is the recovery procedure (re-trigger from specific partition, full backfill, notify downstream)?
  5. Who is the secondary contact if the primary on-call engineer cannot resolve it?

Store runbooks in your internal wiki and link to them from your CloudWatch alarm descriptions. This sounds obvious, but it is often missing from data platform monitoring setups.
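Linking a runbook from an alarm can be as simple as composing the alarm description from a few fixed fields. This sketch assumes a hypothetical wiki URL and contact address, and it enforces the 1024-character limit that CloudWatch places on alarm descriptions.

```python
MAX_DESCRIPTION_LENGTH = 1024  # CloudWatch limit for AlarmDescription


def alarm_description(summary, runbook_url, secondary_contact):
    """Compose an alarm description that points the on-call engineer at the runbook."""
    text = (
        f"{summary}\n"
        f"Runbook: {runbook_url}\n"
        f"Secondary contact: {secondary_contact}"
    )
    if len(text) > MAX_DESCRIPTION_LENGTH:
        raise ValueError(f"description is {len(text)} chars, limit is {MAX_DESCRIPTION_LENGTH}")
    return text


description = alarm_description(
    summary="Daily customer load read fewer records than expected.",
    runbook_url="https://wiki.example.com/runbooks/my-critical-daily-job",  # hypothetical URL
    secondary_contact="data-platform-oncall@example.com",  # hypothetical contact
)
print(description)
```

The resulting string can be passed as the AlarmDescription when the alarm is created, so the link travels with every notification the alarm sends.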

For broader context on how monitoring fits into a complete data platform design, see our post on Implementing a Data Mesh on AWS, which covers observability requirements for federated data products.

You may also want to review our comparison of AWS Glue vs. Apache Spark if you are evaluating whether Glue is the right execution engine for workloads where deep observability is a requirement.

Conclusion

A production Glue monitoring stack has three layers: resource utilisation metrics in CloudWatch, job lifecycle events via EventBridge, and application-level structured logs from your job scripts. Adding Glue Data Quality checks as a fourth layer turns your monitoring from reactive (detecting failures) to proactive (detecting bad data before it reaches consumers).

The patterns in this guide apply regardless of whether you run ten Glue jobs or two hundred. The investment in monitoring pays for itself the first time it catches a silent failure before your data consumers do. Infra IT Consulting helps data teams build production-grade monitoring for their AWS data platforms. Reach out to discuss your observability requirements.
