
Data Contracts: The Key to Reliable Data Pipelines

By Infra IT Consulting · February 20, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

Silent data failures are the most expensive kind. A producer team renames a column in their Postgres table, the downstream Glue job silently reads nulls, and a week later a finance director questions why revenue figures dropped 40% last Tuesday. The root cause takes three days to trace. The fix takes an hour. The organisational trust damage takes months to repair.

Data contracts are the engineering practice that prevents this. They are explicit, versioned agreements between data producers and data consumers that define the schema, semantics, quality constraints, and SLA of a dataset — and they are enforced programmatically, not by convention.

What a Data Contract Actually Contains

A data contract is more than a schema definition. A complete contract covers five dimensions:

Schema: Field names, types, nullability, and nesting structure. This is the minimum that most teams capture, but it is insufficient on its own.

Semantics: What each field means in business terms. order_status with values confirmed, shipped, delivered is a semantic definition. Without it, a consumer team might filter for status = 'complete' and get zero rows because the producer uses delivered.

Quality constraints: Acceptable ranges, uniqueness requirements, referential integrity rules. order_total must be greater than zero. customer_id must exist in the customers table. created_at must not be more than 24 hours in the future.

SLA: When the data will be available. If your ETL pipeline refreshes at 06:00 UTC, downstream jobs should not be scheduled to run at 05:45 UTC — but without an explicit SLA, they often are.

Compatibility policy: How the producer will signal changes. Will they provide 30 days' notice for breaking changes? Will they maintain backward-compatible versions for 90 days? Without this, downstream teams have no planning horizon.
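Of the five dimensions, the SLA is the easiest to check mechanically. The following is a minimal sketch, assuming the producer states an availability time in UTC plus a tolerated delay; the function names and the `grace` parameter are illustrative and not part of any AWS API.

```python
from datetime import date, datetime, time, timedelta


def earliest_safe_start(available_at: time, grace: timedelta) -> time:
    """Return the earliest time of day a consumer can safely start.

    available_at: availability time (UTC) promised by the producer.
    grace: tolerated lateness (a hypothetical contract field).
    Assumes the result stays within the same day (no wrap past midnight).
    """
    # Anchor on an arbitrary date so that time arithmetic is possible.
    anchor = datetime.combine(date(2024, 1, 1), available_at)
    return (anchor + grace).time()


def violates_sla(job_start: time, available_at: time, grace: timedelta) -> bool:
    """True if the job is scheduled before the data is guaranteed to exist."""
    return job_start < earliest_safe_start(available_at, grace)


# The scenario from the text: data refreshes at 06:00 UTC, job runs at 05:45 UTC.
too_early = violates_sla(time(5, 45), time(6, 0), timedelta(minutes=30))
```

A check like this could run in CI whenever a consumer changes its schedule, so the mismatch is caught before the first empty read.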

A minimal contract in YAML form looks like this:

contract:
  id: orders-v2
  owner: platform-team@company.com
  consumers:
    - analytics-team
    - finance-reporting
  schema:
    - name: order_id
      type: string
      nullable: false
      description: "UUID v4, globally unique"
    - name: order_total
      type: decimal(12,2)
      nullable: false
      constraints:
        - min: 0.01
        - max: 999999.99
    - name: order_status
      type: string
      nullable: false
      enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
    - name: created_at
      type: timestamp
      nullable: false
      timezone: UTC
  sla:
    availability: "06:00 UTC daily"
    latency_p99: "30 minutes"
  compatibility: BACKWARD_COMPATIBLE
  changelog:
    - version: "2.0"
      date: "2024-01-15"
      changes: "Added order_status enum constraint, deprecated legacy_flag field"
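To show how such a contract becomes executable, here is a minimal sketch of a record validator in plain Python. It assumes the schema section above has already been parsed into a list of dictionaries (for example with a YAML parser). The `validate_record` helper is hypothetical, and it checks only nullability, enum membership, and min/max constraints, not field types.

```python
# Schema section of the contract, as it might look after parsing the YAML.
ORDERS_SCHEMA = [
    {"name": "order_id", "type": "string", "nullable": False},
    {
        "name": "order_total",
        "type": "decimal(12,2)",
        "nullable": False,
        "constraints": [{"min": 0.01}, {"max": 999999.99}],
    },
    {
        "name": "order_status",
        "type": "string",
        "nullable": False,
        "enum": ["pending", "confirmed", "shipped", "delivered", "cancelled"],
    },
    {"name": "created_at", "type": "timestamp", "nullable": False},
]


def validate_record(record: dict, schema: list) -> list:
    """Return a list of human-readable violations (empty if the record passes)."""
    violations = []
    for field in schema:
        name = field["name"]
        value = record.get(name)
        if value is None:
            # Missing keys and explicit nulls are treated the same way.
            if not field.get("nullable", True):
                violations.append(f"{name}: null or missing but declared non-nullable")
            continue
        if "enum" in field and value not in field["enum"]:
            violations.append(f"{name}: {value!r} not in allowed values")
        for constraint in field.get("constraints", []):
            if "min" in constraint and value < constraint["min"]:
                violations.append(f"{name}: {value} below minimum {constraint['min']}")
            if "max" in constraint and value > constraint["max"]:
                violations.append(f"{name}: {value} above maximum {constraint['max']}")
    return violations
```

Returning every violation, rather than stopping at the first, gives the producer a complete picture of what a bad record got wrong.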

Implementing Contracts on AWS with Schema Registry

AWS Glue Schema Registry is the natural enforcement layer for Kafka and Kinesis-based pipelines. It stores Avro, JSON Schema, or Protobuf schemas, enforces compatibility modes (BACKWARD, FORWARD, FULL), and integrates directly with Amazon MSK and Amazon Kinesis Data Streams producers and consumers.

When a producer attempts to publish a message that violates the registered schema, the Schema Registry client raises an exception before the message reaches the stream. No silent failures, no downstream surprises.

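# Module paths and class names depend on the Glue Schema Registry client
# library in use; confirm them against the version you install.
from aws_glue_schema_registry.serde.avro import AvroSerializer
from aws_glue_schema_registry.utils.aws_schema_registry_utils import AWSSchemaRegistryUtils

# SchemaRegistryException and ORDER_SCHEMA (the Avro schema for orders-v2)
# are assumed to be imported or defined elsewhere.


class ContractViolationError(Exception):
    """Raised when a record fails validation against the registered schema."""


# Producer-side serialisation with schema enforcement
serializer = AvroSerializer(
    region_name="ca-central-1",
    registry_name="production-registry",
    schema_name="orders-v2",
    compatibility_mode="BACKWARD",  # a registry mode: BACKWARD, FORWARD, FULL
    auto_registration=False,  # Never auto-register in production
)


def publish_order(order: dict, producer):
    """Serialise an order against the registered schema, then publish it."""
    try:
        serialised = serializer.serialize(order, schema_definition=ORDER_SCHEMA)
        producer.put_record(
            StreamName="orders-stream",
            Data=serialised,
            PartitionKey=order["order_id"],
        )
    except SchemaRegistryException as e:
        # Contract violation: alert, do not silently drop
        raise ContractViolationError(
            f"Order {order['order_id']} failed schema validation: {e}"
        ) from e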

For batch pipelines using AWS Glue, schema enforcement happens at the Data Catalog level. Glue crawlers can detect column additions, deletions, or type changes before your downstream jobs run. Configure the SchemaChangePolicy of each crawler with UpdateBehavior set to LOG to record changes for alerting without altering the catalog, or to UPDATE_IN_DATABASE paired with downstream notification via Amazon EventBridge.
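The crawler settings described above can be sketched with boto3 as follows. The crawler name, role ARN, database, and S3 path are placeholders, and the EventBridge pattern is an assumption about the event shape that should be checked against the EventBridge documentation for Glue before use.

```python
def build_crawler_config(update_behavior: str = "LOG") -> dict:
    """Build keyword arguments for the Glue CreateCrawler API call.

    All resource names below are placeholders for illustration.
    """
    if update_behavior not in ("LOG", "UPDATE_IN_DATABASE"):
        raise ValueError(f"Unsupported UpdateBehavior: {update_behavior}")
    return {
        "Name": "orders-contract-crawler",
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "DatabaseName": "orders_db",
        "Targets": {"S3Targets": [{"Path": "s3://example-data-lake/orders/"}]},
        "SchemaChangePolicy": {
            "UpdateBehavior": update_behavior,  # LOG = record only, do not alter the catalog
            "DeleteBehavior": "LOG",
        },
    }


# Assumed EventBridge pattern for table changes in the Data Catalog.
TABLE_CHANGE_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Data Catalog Table State Change"],
    "detail": {"databaseName": ["orders_db"]},
}


def create_crawler(config: dict) -> None:
    """Create the crawler; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the config can be built without AWS access

    boto3.client("glue").create_crawler(**config)
```

Keeping the configuration in a pure function makes the policy reviewable and testable without touching an AWS account.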

Enforcing Contracts in Batch Pipelines with Great Expectations

For data teams that need richer quality constraint enforcement beyond schema validation, Great Expectations (GX) integrates well into AWS Glue jobs and AWS Lambda functions. An Expectation Suite is effectively the quality section of your data contract expressed in executable form.

import great_expectations as gx


class ContractBreachError(Exception):
    """Raised when a dataset fails its contract's quality expectations."""


# Uses the legacy (pre-1.0) GX API; newer releases define suites differently
context = gx.get_context()

# Define the contract's quality expectations
suite = context.add_expectation_suite("orders_contract_v2")
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "order_id"}
    )
)
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "order_status",
            "value_set": ["pending", "confirmed", "shipped", "delivered", "cancelled"]
        }
    )
)
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "order_total", "min_value": 0.01, "max_value": 999999.99}
    )
)

# Run validation and halt pipeline on contract breach
# (the remaining checkpoint arguments are elided)
checkpoint = context.add_checkpoint(name="orders_daily_check", ...)
results = checkpoint.run()
if not results.success:
    raise ContractBreachError("Orders dataset failed contract validation: pipeline halted")

Results can be published to Amazon S3 and surfaced via a data observability dashboard, giving both producers and consumers visibility into contract health over time. This ties directly into the broader Data Governance Framework your organisation needs to treat data as a reliable product.

Versioning and Breaking Change Management

The hardest part of data contracts is not writing them — it is managing what happens when they need to change. Three principles make this tractable:

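Additive changes are non-breaking. Adding a new nullable column or an optional field does not break existing consumers, and these changes can be deployed without consumer coordination. Adding a new enum value needs more care: a consumer that validates or branches on the known set of values can fail on an unfamiliar one, so announce it even though the schema itself stays compatible.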

Rename and type changes are always breaking. Even renaming customer_id to client_id breaks every downstream query. Treat renames as deprecations: add the new field, maintain the old field for a defined period (30–90 days is standard), then remove it after consumers have migrated.

Maintain a version registry. Store contract versions in a dedicated S3 path or a DynamoDB table. When a consumer subscribes to a dataset, they pin to a contract version. When a breaking change is required, the producer publishes version 3 while maintaining version 2 until all consumers have migrated.

s3://data-platform-contracts/
  orders/
    v1/contract.yaml  (deprecated 2024-01-15, sunset 2024-03-15)
    v2/contract.yaml  (active)
    v3/contract.yaml  (in development — consumers notified 2024-02-01)
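The first two principles above can be turned into an automated check that runs whenever a new contract version is proposed. The sketch below is illustrative: `classify_changes` is a hypothetical helper, it compares only field names, types, and nullability, and it assumes that a new non-nullable field counts as breaking (the text only states that nullable additions are safe).

```python
def classify_changes(old_fields: list, new_fields: list) -> dict:
    """Compare two schema versions and sort changes into breaking / non-breaking."""
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}
    breaking, non_breaking = [], []

    for name, field in old.items():
        if name not in new:
            # A rename shows up as a removal plus an addition.
            breaking.append(f"removed or renamed: {name}")
        elif new[name]["type"] != field["type"]:
            breaking.append(
                f"type changed: {name} {field['type']} -> {new[name]['type']}"
            )

    for name, field in new.items():
        if name not in old:
            if field.get("nullable", False):
                non_breaking.append(f"added nullable field: {name}")
            else:
                # Assumption: a new required field is treated as breaking.
                breaking.append(f"added non-nullable field: {name}")

    return {"breaking": breaking, "non_breaking": non_breaking}
```

If the `breaking` list is non-empty, the change would require a new major contract version and a migration window rather than an in-place update.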

A migration notification sent to consuming teams 30 days before a version sunset, with a link to the diff between versions, is the minimum operational process. Automating this via EventBridge and SES — triggered when a contract’s sunset_date is 30 days away — turns a manual process into a reliable workflow.
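The date logic behind such a trigger is small. Here is a minimal sketch, assuming each contract record carries an `id` and an optional `sunset_date` (the record shape is hypothetical), and that the function runs once a day, for example from a scheduled Lambda. Sending the email through SES is left out.

```python
from datetime import date, timedelta

NOTICE_PERIOD = timedelta(days=30)


def contracts_needing_notice(contracts: list, today: date) -> list:
    """Return ids of contracts whose sunset date is exactly one notice period away."""
    return [
        c["id"]
        for c in contracts
        if c.get("sunset_date") is not None
        and c["sunset_date"] - today == NOTICE_PERIOD
    ]


registry = [
    {"id": "orders-v1", "sunset_date": date(2024, 3, 15)},
    {"id": "orders-v2", "sunset_date": None},  # active, no sunset planned
]
due = contracts_needing_notice(registry, today=date(2024, 2, 14))
```

Matching on exactly 30 days keeps the notification from repeating daily, at the cost of missing it if a scheduled run is skipped; a production version might instead record which notices have already been sent.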

Connecting Contracts to Data Lineage and Observability

Data contracts are most powerful when they feed into your lineage and observability layer. When a contract validation fails, you want to know immediately which downstream pipelines are affected — not after a business user notices a number that looks wrong.

Lineage metadata from AWS Glue, surfaced through Amazon DataZone (itself generally available since late 2023), can help trace a contract breach upstream to its source and downstream to its consumers. Combined with Amazon CloudWatch alarms on contract validation Lambda functions, you can build an alert chain that fires within minutes of a breach and surfaces the affected dashboards automatically.

This is a core component of mature DataOps Practices — treating data pipelines with the same observability investment that software engineering teams apply to application services.
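Finding the affected consumers is a graph traversal over lineage metadata. The sketch below uses a hard-coded, hypothetical lineage graph rather than any AWS API; in practice the edges would come from whatever lineage store you use.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the datasets or dashboards listed.
LINEAGE = {
    "orders": ["orders_enriched", "finance_daily"],
    "orders_enriched": ["revenue_dashboard"],
    "finance_daily": ["revenue_dashboard", "board_report"],
}


def downstream_of(dataset: str, lineage: dict) -> list:
    """Return every asset reachable downstream of `dataset`, in breadth-first order."""
    seen, order = {dataset}, []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # skip assets reached through an earlier path
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream_of("orders", LINEAGE)
```

Breadth-first order lists the nearest consumers first, which is usually the order in which teams need to be alerted.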

Organisational Adoption: Starting Small

The most common failure mode for data contract initiatives is scope creep at launch. Teams try to contract every dataset simultaneously, the effort becomes overwhelming, and the initiative stalls.

Start with two or three datasets that cause the most downstream pain, typically high-traffic, frequently changed tables that feed customer-facing dashboards. Write contracts for those, enforce them in staging first, and demonstrate the reduction in incidents. Let the success case pull adoption from other teams rather than mandating it top-down.

A good sequence:

1. Identify the five most-broken pipelines from the last quarter’s incident log
2. Write contracts for those source datasets (schema + quality constraints minimum)
3. Enforce validation in a staging pipeline using GX or Glue quality checks
4. Present the incident reduction to stakeholders after 60 days
5. Expand to ten more datasets, now with tooling and patterns established
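The first step in this sequence is a simple count over the incident log. Here is a minimal sketch, assuming the log can be exported as a list of records with a `pipeline` field (the record shape and pipeline names are hypothetical):

```python
from collections import Counter


def most_broken_pipelines(incidents: list, top_n: int = 5) -> list:
    """Rank pipelines by number of incidents, most incidents first."""
    counts = Counter(incident["pipeline"] for incident in incidents)
    return counts.most_common(top_n)


incident_log = [
    {"pipeline": "orders_daily", "date": "2024-01-09"},
    {"pipeline": "orders_daily", "date": "2024-01-23"},
    {"pipeline": "orders_daily", "date": "2024-02-06"},
    {"pipeline": "customer_sync", "date": "2024-01-12"},
    {"pipeline": "customer_sync", "date": "2024-02-01"},
    {"pipeline": "inventory_feed", "date": "2024-01-30"},
]
ranking = most_broken_pipelines(incident_log, top_n=2)
```

The same counts, taken again after 60 days of enforcement, give the before-and-after comparison for the stakeholder review.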

Conclusion

Data contracts shift the conversation from “why did the pipeline break?” to “who is responsible for this dataset and what did they commit to?” That shift in accountability is as valuable as the technical enforcement mechanism. When producers own their contract and consumers subscribe to it explicitly, silent failures become contract breaches — visible, attributable, and fixable.

Building reliable data pipelines requires reliable data at every upstream boundary. Contracts provide that reliability programmatically rather than by hope.

If you are designing a contract framework for your AWS data platform or need help integrating schema enforcement into existing Glue and Kinesis pipelines, contact the Infra IT Consulting team. We work with data engineering organisations in Canada, the UK, and Africa to build the kind of pipeline reliability that lets teams ship with confidence.
