
DataOps: Applying DevOps Principles to Data Engineering

By Infra IT Consulting · March 26, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

Software engineering teams have spent two decades learning how to ship code reliably: automated testing, continuous integration, deployment pipelines, observability, and on-call runbooks. Data engineering teams, for much of that same period, deployed pipelines manually, tested them informally (or not at all), and learned about failures when business users noticed wrong numbers.

DataOps is the discipline of applying software engineering rigour to data pipelines. It is not a product you buy; it is a set of practices that, when adopted consistently, reduce pipeline failures, accelerate delivery, and make data platforms trustworthy. On AWS, the combination of AWS CodePipeline, AWS Glue, Amazon Redshift, dbt, and CloudWatch gives you a full DataOps stack without buying a dedicated DataOps product.

The Core DataOps Principles Applied to Data Engineering

The DevOps movement was driven by four key metrics popularised by the DORA research program: deployment frequency, lead time for changes, mean time to restore, and change failure rate. DataOps applies these same metrics to data pipeline delivery.

Deployment frequency: How often does your team deploy changes to production pipelines? Teams with manual deployment processes typically deploy weekly or less — meaning bugs live in production for days before they are noticed and fixed. DataOps teams with automated pipelines deploy multiple times per day, with each change small enough that the blast radius of any failure is minimal.

Lead time for changes: How long does it take from committing a dbt model change to that change being live in production? Without CI/CD, the answer is often days — the change sits in a PR, gets manually reviewed, is manually deployed by someone with production access, and is manually validated. With DataOps, the answer is hours.

Mean time to restore: When a pipeline fails, how long until data is flowing correctly again? Without observability, teams often do not know a pipeline failed until a business user reports it, so the restore clock starts hours or days late. With proper monitoring and alerting, failures are detected within minutes of occurrence and the fix can begin immediately.

Change failure rate: What percentage of pipeline deployments cause a downstream quality issue? Without automated testing, this can be surprisingly high — even small dbt model changes can silently break downstream aggregations if no tests catch the regression.
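As a rough illustration of how these four metrics could be computed from a deployment log, here is a minimal Python sketch. The `Deployment` record shape and its field names are assumptions made for this example, not something a particular tool produces; in practice the data would come from your CI/CD system and incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production pipeline deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it went live in production
    caused_incident: bool = False           # did it trigger a data quality incident?
    restored_at: Optional[datetime] = None  # when correct data was flowing again


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a reporting window of `window_days`."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.caused_incident]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no incident in the window has been restored yet
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```

The median is used for lead time so that one unusually slow change does not dominate the figure; a mean would work too if that is what your team already reports.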

Building the CI/CD Pipeline for Data

A production DataOps CI/CD pipeline for an AWS data platform has the following stages:

Developer commits → Git (CodeCommit or GitHub)
    ↓
CI Stage (AWS CodeBuild):
    1. Lint SQL (SQLFluff)
    2. Run dbt compile (validate Jinja, refs, and project structure)
    3. Run unit tests (dbt test on test data)
    4. Run Great Expectations checks on sample data
    5. Check Terraform plan for infrastructure changes
    ↓ (on PR merge)
Staging Deployment (AWS CodePipeline):
    1. Deploy Glue job changes to staging environment
    2. Run dbt against staging Redshift cluster
    3. Run integration tests (end-to-end pipeline validation)
    4. Run data quality gate (row counts, null checks, freshness)
    ↓ (on approval or automated gate pass)
Production Deployment:
    1. Deploy Glue job changes via Terraform
    2. Run dbt in production (--select state:modified+ with --state to rebuild only what changed)
    3. Trigger downstream job validations
    4. Update CloudWatch dashboard
    5. Notify stakeholders via SNS

The key discipline is that no human manually deploys to production. Every production change goes through the pipeline. This sounds obvious, but most data teams have at least one person who “just updates it quickly in the console” when there is a bug — which bypasses testing, bypasses audit trails, and creates the conditions for a much larger incident later.
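The data quality gate in the staging stage (row counts, null checks, freshness) can be sketched as a small script whose exit code decides whether the pipeline proceeds. This is a minimal sketch under stated assumptions: the `TableStats` shape, the function names, and the default thresholds are hypothetical, and the statistics themselves would come from queries against the staging warehouse.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableStats:
    """Measurements collected from a staging table (hypothetical shape)."""
    name: str
    row_count: int
    null_key_count: int        # rows where the primary key is NULL
    latest_record: datetime    # most recent created_at, timezone-aware


def gate_failures(stats: TableStats, min_rows: int, max_age: timedelta,
                  now: datetime) -> list[str]:
    """Return human-readable gate violations; an empty list means pass."""
    problems = []
    if stats.row_count < min_rows:
        problems.append(f"{stats.name}: {stats.row_count} rows, expected >= {min_rows}")
    if stats.null_key_count > 0:
        problems.append(f"{stats.name}: {stats.null_key_count} NULL keys")
    if now - stats.latest_record > max_age:
        problems.append(f"{stats.name}: newest record is older than {max_age}")
    return problems


def run_gate(all_stats, min_rows=1, max_age=timedelta(hours=26), now=None) -> int:
    """Return a process exit code: 0 lets the pipeline proceed, 1 blocks it."""
    now = now or datetime.now(timezone.utc)
    problems = [p for s in all_stats for p in gate_failures(s, min_rows, max_age, now)]
    for p in problems:
        print(f"GATE FAIL: {p}")
    return 1 if problems else 0
```

Calling `sys.exit(run_gate(stats))` at the end of the build step is one way to wire this in, since a non-zero exit code fails a CodeBuild command and stops the stage.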

dbt Testing as the Quality Gate

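dbt’s built-in testing framework is the most accessible entry point for data quality automation. dbt ships four generic tests (unique, not_null, accepted_values, and relationships), and every model should use each of them wherever it applies; packages such as dbt_utils add range, recency, and expression checks: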

# schema.yml: data quality contract expressed as dbt tests
# The dbt_utils.* tests require the dbt-utils package in packages.yml.
version: 2

models:
  - name: orders_daily
    description: "Daily order aggregations, the primary analytics table"
    # Table-level tests live under the same model entry; dbt rejects a
    # second entry with the same model name.
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 26  # Fail if no data in last 26 hours
      - dbt_utils.expression_is_true:
          # Row-level check; passes vacuously on an empty table
          expression: "row_count > 0"
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique
          - not_null
          - dbt_utils.at_least_one  # Table must not be empty
      - name: customer_id
        description: "Foreign key to customers table"
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: order_total
        description: "Order value in CAD, must be positive"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0.01
              max_value: 999999.99
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']

These tests run in the CI pipeline before every merge and in the production pipeline after every dbt run. A failing test blocks the deployment and triggers an alert — the pipeline does not silently produce empty or incorrect tables.
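dbt already exits with a non-zero code when a test fails, which is what blocks the deployment. To build the alert itself, one option is to read the run_results.json artifact that dbt writes to its target directory. The sketch below assumes the standard artifact layout (a top-level `results` list whose entries carry `unique_id`, `status`, and `message`); the function names are illustrative.

```python
import json
from pathlib import Path

# Statuses dbt records for nodes that should stop a deployment.
BLOCKING_STATUSES = {"fail", "error"}


def blocking_results(run_results: dict) -> list[dict]:
    """Return the entries of a parsed run_results.json that failed or errored."""
    return [
        {"unique_id": r["unique_id"], "status": r["status"], "message": r.get("message")}
        for r in run_results.get("results", [])
        if r.get("status") in BLOCKING_STATUSES
    ]


def check_artifact(path: str = "target/run_results.json") -> int:
    """Read dbt's artifact, print each failure, and return a CI exit code."""
    failures = blocking_results(json.loads(Path(path).read_text()))
    for f in failures:
        print(f"{f['status'].upper()}: {f['unique_id']} ({f['message']})")
    return 1 if failures else 0
```

The list returned by `blocking_results` is a convenient payload for an SNS notification, because it names exactly which tests failed rather than only reporting that the run did.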

For more sophisticated quality checks — statistical anomaly detection, referential integrity across datasets not managed by dbt — Great Expectations integrates with the CI pipeline via its checkpoint mechanism. Results are written to S3 and surfaced in a data quality dashboard via Amazon QuickSight.

Infrastructure as Code for Data Resources

Every data infrastructure resource should be defined in code — Terraform, AWS CDK, or CloudFormation. This applies to Glue jobs, Glue crawlers, Redshift clusters, Lake Formation permissions, S3 bucket policies, and EventBridge rules. Resources created manually in the AWS console have no version history, no peer review process, and no automated test coverage.

A Glue job defined in Terraform:

resource "aws_glue_job" "orders_transform" {
  name              = "orders-transform-${var.environment}"
  role_arn          = aws_iam_role.glue_execution.arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = var.environment == "production" ? 10 : 2

  command {
    name            = "glueetl"
    script_location = "s3://${aws_s3_bucket.scripts.bucket}/glue/orders_transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-bookmark-option"                = "job-bookmark-enable"
    "--enable-metrics"                     = "true"
    "--enable-continuous-cloudwatch-log"   = "true"
    "--enable-glue-datacatalog"            = "true"
    "--TempDir"                            = "s3://${aws_s3_bucket.temp.bucket}/glue-temp/"
    "--additional-python-modules"          = "great_expectations==0.18.0"
  }

  execution_property {
    max_concurrent_runs = 1
  }

  tags = {
    Environment = var.environment
    Team        = "data-platform"
    CostCenter  = "engineering"
  }
}

When Glue job parameters change, the change is reviewed in a PR, validated by a terraform plan in CI, and applied via terraform apply in the deployment pipeline. No console access required, no undocumented changes, full audit trail in version control.
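The terraform plan check in CI can go beyond "the plan succeeded". One possible sketch: export the plan as JSON (`terraform plan -out=tfplan`, then `terraform show -json tfplan > plan.json`) and flag any resource the plan would delete or replace, so a reviewer has to acknowledge it explicitly. The function names and the policy of flagging every delete are assumptions for illustration.

```python
import json
from pathlib import Path

# In Terraform's JSON plan, a replace appears as ["delete", "create"]
# or ["create", "delete"], so checking for "delete" covers both cases.
DESTRUCTIVE = {"delete"}


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        if actions & DESTRUCTIVE:
            flagged.append(rc["address"])
    return flagged


def load_plan(path: str = "plan.json") -> dict:
    """Load the output of `terraform show -json <planfile>`."""
    return json.loads(Path(path).read_text())
```

For data resources this matters more than for stateless services: replacing a bucket or a Redshift resource can mean losing data, not just a restart.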

Monitoring and Observability for Data Pipelines

The most common DataOps gap in organisations that have adopted CI/CD is observability. Teams deploy reliably but do not monitor production pipelines with the same rigour applied to application services.

A production data pipeline monitoring stack on AWS should include:

AWS Glue job metrics via CloudWatch: glue.driver.aggregate.recordsRead, glue.driver.aggregate.bytesWritten, and glue.ALL.jvm.heap.used expose job-level health (job metrics must be enabled, as with --enable-metrics). Set alarms on glue.ALL.s3.filesystem.write_bytes dropping below baseline to catch jobs that complete without error but process no records and silently produce empty output.

dbt run metadata via the dbt metadata API or run_results.json: Track model execution time trends. A model that normally runs in 90 seconds and suddenly takes 12 minutes is signalling a query performance regression, often caused by a poorly chosen sort or distribution key (Redshift has no conventional indexes), a partition pruning failure, or data skew introduced by an upstream change.

Data freshness alarms via custom Lambda + CloudWatch: A Lambda function that queries the most recent created_at timestamp in critical tables and publishes a DataFreshness metric. An alarm that fires when a table has not been updated within its expected SLA window catches pipeline failures before business users notice stale data.

Pipeline execution dashboard via CloudWatch or Amazon QuickSight: A unified view of all pipeline stages, their last run status, run duration trend, and data volume processed. This dashboard is the first screen an on-call data engineer opens when investigating an incident.
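The execution-time tracking described for dbt can be sketched as a comparison of the latest run against each model's recent median. This is a minimal sketch: the 3x threshold, the minimum of five historical runs, and the function names are assumptions, and where the history is stored (S3, a warehouse table) is left open.

```python
from statistics import median


def execution_times(run_results: dict) -> dict[str, float]:
    """Extract per-node execution_time (seconds) from a parsed run_results.json."""
    return {r["unique_id"]: r["execution_time"] for r in run_results.get("results", [])}


def slow_models(history: dict[str, list[float]], latest: dict[str, float],
                factor: float = 3.0, min_runs: int = 5) -> dict[str, tuple[float, float]]:
    """Flag models whose latest run took more than `factor` times their median.

    Returns {unique_id: (baseline_seconds, latest_seconds)}.
    """
    flagged = {}
    for model, seconds in latest.items():
        past = history.get(model, [])
        if len(past) < min_runs:
            continue  # not enough history for a stable baseline
        baseline = median(past)
        if seconds > factor * baseline:
            flagged[model] = (baseline, seconds)
    return flagged
```

With a history of roughly 90-second runs, a 12-minute run (720 seconds) exceeds the 3x threshold and is flagged, while ordinary run-to-run variation is not.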

This monitoring posture connects directly to the lineage and observability patterns described in Data Lineage on AWS — when an alert fires, lineage tells you which upstream change caused it.
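The freshness Lambda described above could look roughly like the following. Only the metric name DataFreshness comes from the text; the namespace, the `Table` dimension, the choice of seconds as the unit, and the `fetch_latest` callable (which would run a `SELECT MAX(created_at)` against the warehouse) are assumptions for this sketch. The boto3 import is deferred so the pure logic can be exercised without AWS credentials.

```python
from datetime import datetime, timezone


def freshness_metric(table: str, latest_created_at: datetime, now: datetime) -> dict:
    """Build one CloudWatch MetricData entry: seconds since the newest row."""
    age_seconds = (now - latest_created_at).total_seconds()
    return {
        "MetricName": "DataFreshness",
        "Dimensions": [{"Name": "Table", "Value": table}],
        "Value": age_seconds,
        "Unit": "Seconds",
    }


def collect(tables, fetch_latest, now=None) -> list[dict]:
    """Build metrics for every table; `fetch_latest(table)` returns a datetime."""
    now = now or datetime.now(timezone.utc)
    return [freshness_metric(t, fetch_latest(t), now) for t in tables]


def publish(metrics: list[dict], namespace: str = "DataPlatform") -> None:
    """Send the metrics to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the functions above run without AWS
    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=metrics)
```

A Lambda handler would call `publish(collect(tables, fetch_latest))` on a schedule, and a CloudWatch alarm on DataFreshness per table would then fire when the value exceeds that table's SLA window in seconds.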

Environment Management: The Often-Skipped Foundation

DataOps requires at minimum two environments: staging and production. Ideally three: development, staging, and production. This is where many data teams resist, because maintaining separate Redshift clusters or Glue environments has real cost.

The cost-effective approach on AWS uses Redshift Serverless for development and staging environments, paying for compute capacity only while queries run rather than for an always-on cluster. Production uses provisioned Redshift (or Redshift Serverless if query patterns are sufficiently spiky). Glue uses the same script code across environments, with environment-specific parameters passed via the --ENV argument.
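One way a shared Glue script might resolve its environment from the --ENV argument is sketched below. Inside a Glue job the usual helper is `awsglue.utils.getResolvedOptions(sys.argv, ["ENV"])`; this sketch uses argparse instead so it runs anywhere. The `SETTINGS` table, its keys, and the bucket paths are hypothetical.

```python
import argparse

# Hypothetical per-environment settings; names and values are illustrative.
SETTINGS = {
    "dev":        {"database": "dev_analytics",     "output_prefix": "s3://example-dev/orders/"},
    "staging":    {"database": "staging_analytics", "output_prefix": "s3://example-staging/orders/"},
    "production": {"database": "prod_analytics",    "output_prefix": "s3://example-prod/orders/"},
}


def resolve_settings(argv: list[str]) -> dict:
    """Pick the settings block named by the --ENV job argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ENV", required=True, choices=sorted(SETTINGS))
    # parse_known_args ignores the other arguments Glue passes to the script
    args, _unknown = parser.parse_known_args(argv)
    return {"env": args.ENV, **SETTINGS[args.ENV]}
```

Restricting --ENV to a fixed set of choices means a typo in the job definition fails the run immediately instead of silently writing to the wrong location.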

dbt’s environment management uses separate profiles.yml targets that point to different Redshift databases:

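# profiles.yml
# Credentials are read from environment variables; the variable names
# (REDSHIFT_USER, REDSHIFT_PASSWORD) are illustrative, not a dbt convention.
data_platform:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "dev.cluster.ca-central-1.redshift.amazonaws.com"
      port: 5439
      user: "{{ env_var('REDSHIFT_USER') }}"
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      database: dev_analytics
      schema: "dbt_{{ env_var('DBT_USER', 'developer') }}"  # Isolated per developer
    staging:
      type: redshift
      host: "staging.cluster.ca-central-1.redshift.amazonaws.com"
      port: 5439
      user: "{{ env_var('REDSHIFT_USER') }}"
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      database: staging_analytics
      schema: analytics
    prod:
      type: redshift
      host: "prod.cluster.ca-central-1.redshift.amazonaws.com"
      port: 5439
      user: "{{ env_var('REDSHIFT_USER') }}"
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      database: prod_analytics
      schema: analytics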

Each developer gets an isolated schema in the development environment — no shared mutable state between team members. CI runs against staging. Production deployments use the prod target.

Starting Your DataOps Journey

The most common mistake when adopting DataOps is trying to implement everything simultaneously. The pragmatic sequence:

Week 1–2: Add dbt unique and not_null tests to your five most critical models. Run them manually after each dbt run.
Week 3–4: Set up a basic CodePipeline or GitHub Actions workflow that runs dbt test on every PR. Block merges on test failures.
Month 2: Add CloudWatch alarms for Glue job failures and data freshness on top-five tables.
Month 3: Move all Glue job definitions to Terraform. No more console modifications.
Month 4–6: Add staging environment. Require all changes to pass staging validation before production deployment.

This sequence delivers value at each step and builds the team’s confidence in automated processes before adding more complexity. The Data Platform Maturity Model provides a broader framework for understanding where DataOps practices fit in your organisation’s overall data capability evolution.

Conclusion

DataOps is not a technology investment — it is a practice investment. The tools (CodePipeline, dbt, CloudWatch, Terraform) are available, well-documented, and do not require specialised licences. What requires investment is the discipline to treat data pipelines with the same engineering rigour as application software: tested, version-controlled, automatically deployed, and monitored in production.

Organisations that make this investment consistently report fewer production incidents, shorter recovery times when incidents do occur, and higher stakeholder confidence in data-driven decisions.

If your data engineering team is building pipeline reliability practices or modernising an existing DataOps setup on AWS, reach out to the Infra IT Consulting team. We help data teams in Canada, the UK, and Africa build the engineering practices that make data platforms production-grade.
