Data Architecture & Strategy · Tags: sla, freshness, reliability

Data Freshness and SLAs: Engineering Pipelines That Hit Their Targets

By Infra IT Consulting · April 30, 2024 · 8 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

A data pipeline that loads successfully but delivers stale data is a pipeline that is failing its users. Yet most data engineering teams spend far more effort ensuring pipelines complete without errors than ensuring they complete within the time window that business decisions actually depend on. Data freshness is an operational contract — and when that contract is implicit rather than documented, it is almost always violated before anyone realises it exists.

This post covers how to define data freshness SLAs, instrument pipelines to measure and enforce them, and build the alerting and escalation mechanisms that make SLA commitments real rather than aspirational.

Defining What Freshness Actually Means

Before you can measure data freshness, you need a precise definition. In practice, “freshness” conflates several distinct concepts that should be tracked separately:

Source latency: the delay between an event occurring in the operational system and that event appearing in the source extract. For near-real-time sources, this might be seconds (Kinesis Firehose delivery windows). For batch extracts, this might be 24 hours.

Pipeline latency: the time between data arriving in the raw landing zone (typically Amazon S3) and that data being available in the serving layer (Amazon Redshift, Athena, or an API). This is the portion your data engineering team directly controls.

Propagation latency: the time between a record being updated in the serving layer and that update being reflected in cached BI dashboards or downstream derived tables. This is often the most invisible source of staleness from a user perspective.
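The three latencies above can be tracked as separate numbers if each record (or batch) carries a timestamp for every hand-off. The following is a minimal sketch, not a prescribed schema: the class name and its four timestamp fields are hypothetical, and it simplifies by treating "appeared in the source extract" and "arrived in the landing zone" as the same moment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RecordTimeline:
    """Hypothetical timestamps for one record as it moves through the platform."""
    event_time: datetime    # event occurred in the operational system
    extracted_at: datetime  # event appeared in the source extract / landing zone
    served_at: datetime     # record became available in the serving layer
    visible_at: datetime    # record showed up in a dashboard or derived table

    @property
    def source_latency(self) -> timedelta:
        return self.extracted_at - self.event_time

    @property
    def pipeline_latency(self) -> timedelta:
        return self.served_at - self.extracted_at

    @property
    def propagation_latency(self) -> timedelta:
        return self.visible_at - self.served_at

    @property
    def end_to_end(self) -> timedelta:
        # What the user experiences: the sum of the three components.
        return self.visible_at - self.event_time
```

Reporting the components separately shows which team owns a given delay: only the middle one is under the direct control of the data engineering team.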

A well-formed SLA statement addresses all three:

“The daily sales summary in the analytics mart will reflect all transactions processed by 11:59 PM the prior day, and will be available in Redshift and QuickSight by 7:00 AM local time with ≥ 99.5% daily success rate.”

This definition is testable. It specifies the data completeness expectation (transactions up to midnight), the availability deadline (7 AM), and the reliability target (99.5%). Each element can be monitored independently.
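One way to keep such a statement testable is to hold it as data rather than prose. The sketch below is illustrative only: `FreshnessSla` and its fields are hypothetical names, timestamps are naive local times, and `complete_through` is assumed to be the exclusive end of the interval that the load claims to cover.

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import Optional


@dataclass(frozen=True)
class FreshnessSla:
    dataset: str
    deadline: time              # availability deadline, local time
    target_success_rate: float  # e.g. 0.995 for 99.5% of days

    def met_on(
        self,
        day: date,
        available_at: Optional[datetime],
        complete_through: Optional[datetime],
    ) -> bool:
        """True if the load for `day` was both complete and on time."""
        if available_at is None or complete_through is None:
            return False  # nothing delivered counts as a miss
        cutoff = datetime.combine(day, time.min)  # midnight that starts `day`
        due = datetime.combine(day, self.deadline)
        return complete_through >= cutoff and available_at <= due


sales_sla = FreshnessSla("daily_sales_summary", time(7, 0), 0.995)
```

Completeness and timeliness are checked independently, so a load that arrives on time but stops short of midnight is still recorded as a miss.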

Instrumenting Pipelines for Freshness Measurement

The foundation of freshness monitoring is metadata capture. Every pipeline run should emit a structured record containing at minimum: the pipeline name, the data interval it processed, the time processing started, the time it completed, and the row count of the output. On AWS, AWS Step Functions makes this straightforward — Step Functions execution history is queryable via API, and you can push custom metrics to Amazon CloudWatch at each execution stage.

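import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch', region_name='ca-central-1')

def emit_freshness_metric(pipeline_name: str, data_interval_end: datetime, rows_written: int):
    """Publish data lag and row count for one completed pipeline run."""
    # A naive datetime cannot be subtracted from an aware one, so require an aware input.
    if data_interval_end.tzinfo is None:
        raise ValueError('data_interval_end must be timezone-aware')

    now = datetime.now(timezone.utc)
    # Lag = minutes between the end of the processed interval and this run finishing.
    lag_minutes = (now - data_interval_end).total_seconds() / 60

    cloudwatch.put_metric_data(
        Namespace='DataPlatform/Freshness',
        MetricData=[
            {
                'MetricName': 'DataLagMinutes',
                'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}],
                'Value': lag_minutes,
                'Unit': 'None',  # CloudWatch has no minutes unit; the metric name carries it
                'Timestamp': now
            },
            {
                'MetricName': 'RowsWritten',
                'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}],
                'Value': rows_written,
                'Unit': 'Count',
                'Timestamp': now
            }
        ]
    )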

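With DataLagMinutes emitted as a CloudWatch metric, you can create a CloudWatch Alarm that triggers when a pipeline’s output exceeds its SLA lag threshold. For a pipeline with a 7 AM delivery commitment on data through midnight, the lag at the deadline is 420 minutes (7 hours), so the alarm threshold should be 420 or lower; a tighter value such as 360 flags runs that finish with less than an hour of the window to spare. Because the metric is only emitted when a run completes, a stalled pipeline publishes nothing, so configure the alarm to treat missing data as breaching to catch that case as well.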
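An alarm on this metric could be defined along the following lines. This is a sketch under stated assumptions: the alarm name, the one-day period, the `Maximum` statistic, and the SNS topic ARN are illustrative choices rather than values from a real deployment, and the threshold is left as a parameter. The settings are built in a plain function so they can be inspected without AWS credentials.

```python
from typing import Any


def build_lag_alarm(pipeline_name: str, threshold_minutes: float, sns_topic_arn: str) -> dict[str, Any]:
    """Keyword arguments for CloudWatch put_metric_alarm (illustrative values)."""
    return {
        "AlarmName": f"freshness-lag-{pipeline_name}",  # hypothetical naming scheme
        "Namespace": "DataPlatform/Freshness",
        "MetricName": "DataLagMinutes",
        "Dimensions": [{"Name": "Pipeline", "Value": pipeline_name}],
        "Statistic": "Maximum",
        "Period": 86400,          # one daily run, so evaluate one day at a time
        "EvaluationPeriods": 1,
        "Threshold": threshold_minutes,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # no data point at all is treated as a failure
        "AlarmActions": [sns_topic_arn],
    }


def create_lag_alarm(pipeline_name: str, threshold_minutes: float, sns_topic_arn: str) -> None:
    """Create or update the alarm; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so build_lag_alarm stays usable without it

    cloudwatch = boto3.client("cloudwatch", region_name="ca-central-1")
    cloudwatch.put_metric_alarm(**build_lag_alarm(pipeline_name, threshold_minutes, sns_topic_arn))
```

With a one-day period, a pipeline that never completes is only detected once a full day passes without a data point, so a separate schedule-based check is still useful for catching stalls before the deadline.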

Using dbt Tests as SLA Gates

If your transformation layer uses dbt, which is common among teams following modern data stack patterns, the recency test from the dbt_utils package provides a declarative way to encode freshness SLAs directly in your project.

# models/marts/schema.yml
models:
  - name: fact_sales_daily
    description: "Daily sales summary. SLA: available by 07:00 CAT/EST with data through prior midnight."
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: loaded_at
          interval: 25  # alert if newest record is older than 25 hours
    columns:
      - name: sale_date
        tests:
          - not_null
          - dbt_utils.not_constant

Running dbt test --select fact_sales_daily in your post-load step turns this SLA into an automated gate. If the test fails, the Step Functions workflow can branch to an alerting state rather than proceeding to downstream transformations that would distribute stale data further into your platform.
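A post-load task could wrap that command and return a small status document for a Step Functions Choice state to branch on. The sketch below assumes the dbt CLI is installed where the task runs; the function names and the `gate` field are hypothetical, and the command runner is injectable so the logic can be exercised without dbt.

```python
import subprocess
from typing import Callable, Sequence


def default_runner(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code (dbt exits non-zero when a test fails)."""
    return subprocess.run(list(cmd), check=False).returncode


def run_freshness_gate(
    model: str,
    runner: Callable[[Sequence[str]], int] = default_runner,
) -> dict:
    """Run dbt tests for one model and summarise the outcome for the workflow."""
    cmd = ["dbt", "test", "--select", model]
    exit_code = runner(cmd)
    return {
        "model": model,
        "gate": "PASS" if exit_code == 0 else "FAIL",
        "exit_code": exit_code,
    }
```

A Choice state can then compare the `gate` field against `"FAIL"` and route to the alerting branch instead of the downstream transformations.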

This pattern is especially valuable for DataOps practices — it encodes quality expectations as code, version-controls them alongside the transformation logic, and makes SLA compliance visible in the dbt test results that every engineer on the team can inspect.

Designing for SLA Recovery: Backfill and Catch-Up Patterns

SLA violations are inevitable. The difference between a mature pipeline and an immature one is not that the mature pipeline never misses an SLA — it is that when a miss occurs, recovery is fast and predictable.

Backfill capability is the most important recovery mechanism. Every pipeline should be designed with the assumption that it will need to re-process one or more historical data intervals. On AWS, this means:

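Idempotent writes: Using INSERT OVERWRITE (Spark SQL on AWS Glue), deleting and rewriting the target partition (Athena, which supports INSERT INTO but not INSERT OVERWRITE), or DELETE/INSERT patterns (Redshift) so that re-running a pipeline for a historical interval overwrites rather than duplicates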
Partitioned S3 layouts: Organising raw and processed data by date partition so a backfill can target specific date directories without scanning the full dataset
Step Functions Map states: Enabling parallel backfill execution across multiple date intervals when you need to recover a week’s worth of data quickly
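The three practices above fit together when planning a backfill: enumerate the dates, point each run at its own partition, and replace that day's rows in one transaction. The following sketch makes several assumptions: the `dt=YYYY-MM-DD` partition naming, the bucket path, and the table names are hypothetical, the `sale_date` column is borrowed from the dbt example, and table names are assumed to come from trusted configuration rather than user input.

```python
from datetime import date, timedelta


def backfill_dates(start: date, end: date) -> list[date]:
    """Every date from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def partition_prefix(base: str, day: date) -> str:
    """S3 prefix for one date partition, e.g. .../dt=2024-04-28/."""
    return f"{base.rstrip('/')}/dt={day.isoformat()}/"


def delete_insert_sql(target: str, staging: str, day: date) -> list[str]:
    """Redshift-style DELETE/INSERT for one day; re-running it yields the same rows."""
    d = day.isoformat()
    return [
        "BEGIN;",
        f"DELETE FROM {target} WHERE sale_date = '{d}';",
        f"INSERT INTO {target} SELECT * FROM {staging} WHERE sale_date = '{d}';",
        "COMMIT;",
    ]


def map_state_items(base: str, start: date, end: date) -> list[dict]:
    """One input item per date, suitable as the item list for a Map state."""
    return [
        {"date": d.isoformat(), "prefix": partition_prefix(base, d)}
        for d in backfill_dates(start, end)
    ]
```

Because each item touches only its own partition and its own day's rows, the items can run in parallel without interfering with one another.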

# Airflow 2.x catchup pattern for SLA recovery
from airflow import DAG
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=10),
    'retry_exponential_backoff': True,
    'email_on_retry': False,
    'email_on_failure': True,
    # Task SLAs are measured from the run's scheduled start (1 AM UTC here),
    # so 6 hours corresponds to a 7 AM delivery deadline, assuming it is also in UTC.
    'sla': timedelta(hours=6)
}

dag = DAG(
    'sales_daily_etl',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),  # placeholder; catchup creates runs from this date
    schedule_interval='0 1 * * *',  # 1 AM UTC start
    catchup=True,  # create runs for any missed intervals (SLA recovery backfills)
    max_active_runs=3
)

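Setting catchup=True tells Apache Airflow (running on Amazon MWAA) to create runs for any schedule intervals that were missed, and passing sla through default_args applies it to every task in the DAG. Airflow 2.x records SLA misses for scheduled runs and can notify by email or through an sla_miss_callback on the DAG, which is the hook for forwarding misses to your monitoring infrastructure.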

Alerting Hierarchies That Actually Get Action

Alerting on data freshness failures is only useful if the right people receive the right information at the right time. A common anti-pattern is routing all pipeline alerts to a shared Slack channel where alert fatigue gradually trains the team to ignore notifications. A more effective hierarchy:

P1: SLA breach confirmed (data not available by committed time). Page the on-call data engineer via an Amazon SNS integration with PagerDuty or Opsgenie. Include the pipeline name, expected delivery time, current status, and a link to the CloudWatch Logs Insights query for the failing execution.

P2: SLA at risk (pipeline running 30+ minutes behind schedule with less than 90 minutes until the SLA deadline). Send to the team Slack channel with context. No page. Engineers can assess and intervene if needed.

P3: Anomalous row counts (output rows deviate by more than 20% from the 30-day average for this pipeline and interval). Post to an automated data quality channel. Low urgency, but logged for review.
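The tiers above reduce to a small decision function. This is a sketch only: the function name and parameters are hypothetical, the thresholds are the ones quoted in the tiers, and in practice the inputs would come from the scheduler and from the metrics the pipeline already emits.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_alert(
    now: datetime,
    sla_deadline: datetime,
    delivered: bool,
    minutes_behind_schedule: float,
    rows_written: Optional[int] = None,
    avg_rows_30d: Optional[float] = None,
) -> Optional[str]:
    """Return 'P1', 'P2', 'P3', or None when nothing needs attention."""
    # P1: the deadline has passed and the data is still not available.
    if not delivered and now >= sla_deadline:
        return "P1"
    # P2: running 30+ minutes late with under 90 minutes left before the deadline.
    if (
        not delivered
        and minutes_behind_schedule >= 30
        and sla_deadline - now < timedelta(minutes=90)
    ):
        return "P2"
    # P3: row count deviates by more than 20% from the 30-day average.
    if rows_written is not None and avg_rows_30d:
        if abs(rows_written - avg_rows_30d) / avg_rows_30d > 0.20:
            return "P3"
    return None
```

Checking the tiers in order of severity means a run that is both late and anomalous is reported once, at the higher priority.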

Amazon EventBridge makes it straightforward to route Step Functions execution failures and CloudWatch alarm state changes to different SNS topics by matching event patterns, for example on the state machine ARN or a severity prefix in the alarm name, giving you this tiered routing without custom infrastructure.

Building a Freshness Dashboard

Operational visibility into freshness across all pipelines is as important as the alerting. A CloudWatch dashboard with the following panels gives data engineering leads and data consumers a shared view of platform health:

Last successful load time per pipeline (CloudWatch metric math using the most recent RowsWritten > 0 timestamp)
Current data lag per pipeline (the DataLagMinutes metric trending over 24 hours)
SLA compliance rate over the rolling 30-day window (percentage of days where lag was within threshold)
Row count anomaly detector (CloudWatch Anomaly Detection band on the RowsWritten metric)
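The compliance-rate panel needs a number computed outside the raw metric stream. One possible calculation is sketched below, assuming a mapping from each day to that day's final lag in minutes has already been pulled from the metrics; the function name is hypothetical, and days with no recorded run are counted as misses.

```python
from datetime import date, timedelta
from typing import Mapping, Optional


def sla_compliance_rate(
    daily_lag_minutes: Mapping[date, Optional[float]],
    threshold_minutes: float,
    as_of: date,
    window_days: int = 30,
) -> float:
    """Fraction of days in the rolling window whose lag stayed within threshold."""
    met = 0
    for offset in range(window_days):
        day = as_of - timedelta(days=offset)
        lag = daily_lag_minutes.get(day)
        # A missing or null lag means nothing was delivered that day: a miss.
        if lag is not None and lag <= threshold_minutes:
            met += 1
    return met / window_days
```

The result can be compared directly with the reliability target in the SLA statement, for example 0.995 for a 99.5% commitment.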

Exposing this dashboard to data consumers — analysts, product managers, downstream engineering teams — changes the nature of SLA conversations. Instead of data teams defending themselves in post-incident reviews, consumers can see real-time platform status and make informed decisions about when to trust the data they are looking at.

From Pipelines to Platform Reliability

Data freshness SLAs are ultimately about organisational trust. When business users know that a specific dataset will be available by a specific time with high reliability, they can build workflows and decisions around that commitment. When freshness is unpredictable, the response is usually to either distrust the data or to build redundant manual processes that the data platform was supposed to replace.

Engineering teams that treat freshness as a first-class operational metric — not an afterthought — tend to discover that most SLA violations trace back to a small number of chronic root causes: under-resourced pipeline execution windows, missing retry logic on transient failures, or upstream source systems that silently delay extracts. Addressing those root causes systematically delivers more reliability than any amount of alerting sophistication.

For organisations building or maturing their AWS data platforms, Infra IT Consulting helps define, instrument, and operationalise data freshness SLAs across the full pipeline stack. Contact us to discuss what a reliability programme looks like for your specific environment.
