DataOps: Applying DevOps Principles to Data Engineering
Software engineering teams have spent two decades learning how to ship code reliably: automated testing, continuous integration, deployment pipelines, observability, and on-call runbooks. Data engineering teams, for much of that same period, deployed pipelines manually, tested them informally (or not at all), and learned about failures when business users noticed wrong numbers.
DataOps is the discipline of applying software engineering rigour to data pipelines. It is not a product you buy; it is a set of practices that, when adopted consistently, reduce pipeline failures, accelerate delivery, and make data platforms trustworthy. On AWS, the combination of AWS CodePipeline, AWS Glue, Amazon Redshift, dbt, and CloudWatch covers the full DataOps stack without additional specialised licences.
The Core DataOps Principles Applied to Data Engineering
DevOps performance is commonly measured with four key metrics popularised by the DORA research program: deployment frequency, lead time for changes, mean time to restore, and change failure rate. DataOps applies these same metrics to data pipeline delivery.
Deployment frequency: How often does your team deploy changes to production pipelines? Teams with manual deployment processes typically deploy weekly or less — meaning bugs live in production for days before they are noticed and fixed. DataOps teams with automated pipelines deploy multiple times per day, with each change small enough that the blast radius of any failure is minimal.
Lead time for changes: How long does it take from committing a dbt model change to that change being live in production? Without CI/CD, the answer is often days — the change sits in a PR, gets manually reviewed, is manually deployed by someone with production access, and is manually validated. With DataOps, the answer is hours.
Mean time to restore: When a pipeline fails, how long until data is flowing correctly again? Without observability, teams often do not know a pipeline failed until a business user reports it. With proper monitoring and alerting, failures are detected within minutes of occurrence.
Change failure rate: What percentage of pipeline deployments cause a downstream quality issue? Without automated testing, this can be surprisingly high — even small dbt model changes can silently break downstream aggregations if no tests catch the regression.
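The four metrics above can be computed from a simple log of deployments. The following is a minimal sketch, not a standard implementation: the `Deployment` record and its field names are hypothetical, and time to restore is measured from deployment rather than from detection, which is an assumption you may want to change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production pipeline deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    caused_incident: bool = False           # did it cause a downstream quality issue?
    restored_at: Optional[datetime] = None  # when correct data was flowing again


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise the four metrics over a reporting window of period_days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.caused_incident]
    # Assumption: restore time is counted from the faulty deployment.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


# Illustrative data only: four deployments over a two-day window.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), caused_incident=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = dora_metrics(sample, period_days=2)
```

Tracking these numbers before and after introducing CI/CD gives the team a baseline to measure improvement against.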
Building the CI/CD Pipeline for Data
A production DataOps CI/CD pipeline for an AWS data platform has the following stages:
Developer commits → Git (CodeCommit or GitHub)
        ↓
CI Stage (AWS CodeBuild):
    1. Lint SQL (SQLFluff)
    2. Run dbt compile (validate syntax)
    3. Run unit tests (dbt test on test data)
    4. Run Great Expectations checks on sample data
    5. Check Terraform plan for infrastructure changes
        ↓ (on PR merge)
Staging Deployment (AWS CodePipeline):
    1. Deploy Glue job changes to staging environment
    2. Run dbt against staging Redshift cluster
    3. Run integration tests (end-to-end pipeline validation)
    4. Run data quality gate (row counts, null checks, freshness)
        ↓ (on approval or automated gate pass)
Production Deployment:
    1. Deploy Glue job changes via Terraform
    2. Run dbt in production (with --defer for efficiency)
    3. Trigger downstream job validations
    4. Update CloudWatch dashboard
    5. Notify stakeholders via SNS
The key discipline is that no human manually deploys to production. Every production change goes through the pipeline. This sounds obvious, but most data teams have at least one person who “just updates it quickly in the console” when there is a bug — which bypasses testing, bypasses audit trails, and creates the conditions for a much larger incident later.
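The data quality gate in the staging stage (row counts, null checks, freshness) can be as small as one function that returns a list of failures. This is a rough sketch under stated assumptions: `TableStats` and `quality_gate` are hypothetical names, and the statistics are assumed to have been collected already by a query against the staging warehouse.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableStats:
    """Hypothetical summary of one table, gathered by a prior SQL query."""
    row_count: int
    null_key_count: int          # rows where the primary key column is NULL
    latest_record_at: datetime   # newest timestamp found in the table


def quality_gate(stats: TableStats, min_rows: int,
                 max_age: timedelta, now: datetime) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    if stats.row_count < min_rows:
        failures.append(f"row count {stats.row_count} below minimum {min_rows}")
    if stats.null_key_count > 0:
        failures.append(f"{stats.null_key_count} rows have a NULL key")
    if now - stats.latest_record_at > max_age:
        failures.append("latest record is older than the freshness limit")
    return failures


# Illustrative inputs only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
healthy = TableStats(1000, 0, now - timedelta(hours=1))
broken = TableStats(0, 3, now - timedelta(days=2))

healthy_failures = quality_gate(healthy, 1, timedelta(hours=26), now)
broken_failures = quality_gate(broken, 1, timedelta(hours=26), now)
```

In a pipeline step, a non-empty failure list would typically be printed and turned into a non-zero exit code so that the deployment stops before production.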
dbt Testing as the Quality Gate
dbt’s built-in testing framework is the most accessible entry point for data quality automation. dbt ships four generic tests (unique, not_null, accepted_values, and relationships), and every model should apply the ones that fit its columns:
# schema.yml: data quality contract expressed as dbt tests
# (the dbt_utils.* tests require the dbt_utils package in packages.yml)
version: 2

models:
  - name: orders_daily
    description: "Daily order aggregations, primary analytics table"
    tests:
      # Table-level tests
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 26  # Alert if no data in last 26 hours
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique
          - not_null
          - dbt_utils.at_least_one  # Table must not be empty
      - name: customer_id
        description: "Foreign key to customers table"
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: order_total
        description: "Order value in CAD, must be positive"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0.01
              max_value: 999999.99
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
These tests run in the CI pipeline before every merge and in the production pipeline after every dbt run. A failing test blocks the deployment and triggers an alert — the pipeline does not silently produce empty or incorrect tables.
For more sophisticated quality checks — statistical anomaly detection, referential integrity across datasets not managed by dbt — Great Expectations integrates with the CI pipeline via its checkpoint mechanism. Results are written to S3 and surfaced in a data quality dashboard via Amazon QuickSight.
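To make "statistical anomaly detection" concrete, here is a plain-Python illustration of the idea using a z-score on daily row counts. It does not use the Great Expectations API; the function name, the threshold of 3, and the sample counts are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's value if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat history: any change is unusual
    return abs(today - mu) / sigma > threshold


# Illustrative daily row counts for one table.
recent_counts = [1000, 1020, 980, 1010, 990]
normal_day = is_anomalous(recent_counts, 1005)
suspicious_day = is_anomalous(recent_counts, 400)
```

A check like this catches the case where a table is not empty but has lost most of its rows, which fixed thresholds tend to miss.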
Infrastructure as Code for Data Resources
Every data infrastructure resource should be defined in code — Terraform, AWS CDK, or CloudFormation. This applies to Glue jobs, Glue crawlers, Redshift clusters, Lake Formation permissions, S3 bucket policies, and EventBridge rules. Resources created manually in the AWS console have no version history, no peer review process, and no automated test coverage.
A Glue job defined in Terraform:
resource "aws_glue_job" "orders_transform" {
  name              = "orders-transform-${var.environment}"
  role_arn          = aws_iam_role.glue_execution.arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = var.environment == "production" ? 10 : 2

  command {
    name            = "glueetl"
    script_location = "s3://${aws_s3_bucket.scripts.bucket}/glue/orders_transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-bookmark-option"              = "job-bookmark-enable"
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
    "--enable-glue-datacatalog"          = "true"
    "--TempDir"                          = "s3://${aws_s3_bucket.temp.bucket}/glue-temp/"
    "--additional-python-modules"        = "great_expectations==0.18.0"
  }

  execution_property {
    max_concurrent_runs = 1
  }

  tags = {
    Environment = var.environment
    Team        = "data-platform"
    CostCenter  = "engineering"
  }
}
When Glue job parameters change, the change is reviewed in a PR, validated by a terraform plan in CI, and applied via terraform apply in the deployment pipeline. No console access required, no undocumented changes, full audit trail in version control.
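One way to validate a terraform plan in CI is to export it as JSON (`terraform show -json`) and fail the build when it would delete or replace a resource without explicit sign-off. The sketch below assumes that policy; the function name and the sample plan are illustrative, and only the `resource_changes`, `address`, and `change.actions` fields of the plan JSON are used.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as both "delete" and "create" in the actions list.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Trimmed, illustrative plan output; a real plan contains many more fields.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_glue_job.orders_transform",
     "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.temp",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.glue_execution",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

flagged = destructive_changes(sample_plan)
```

In-place updates to a Glue job pass through, while the bucket replacement is surfaced for a human to approve.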
Monitoring and Observability for Data Pipelines
The most common DataOps gap in organisations that have adopted CI/CD is observability. Teams deploy reliably but do not monitor production pipelines with the same rigour applied to application services.
A production data pipeline monitoring stack on AWS should include:
AWS Glue job metrics via CloudWatch: glue.driver.aggregate.recordsRead, glue.driver.aggregate.recordsWritten, and glue.ALL.jvm.heap.used expose job-level health. Set alarms on glue.ALL.s3.filesystem.write_bytes dropping below baseline to catch jobs that complete without error but write no output. Confirm the exact metric names against the Glue documentation for the Glue version you run.
dbt run metadata via the dbt metadata API or run_results.json: Track model execution time trends. A model that normally runs in 90 seconds and suddenly takes 12 minutes is signalling a query performance regression, often caused by an ineffective sort or distribution key (Redshift has no conventional indexes), stale table statistics, a partition pruning failure, or data skew introduced by an upstream change.
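A simple way to watch execution time trends is to compare each result in run_results.json against a stored baseline. The following sketch assumes the baseline is a plain mapping from model unique_id to typical seconds, which is a hypothetical store you would need to maintain yourself; the function name and the factor of 3 are also assumptions.

```python
def slow_models(run_results: dict, baseline_seconds: dict[str, float],
                factor: float = 3.0) -> list[tuple[str, float, float]]:
    """Return (unique_id, baseline, actual) for models slower than factor x baseline."""
    flagged = []
    for result in run_results.get("results", []):
        uid = result["unique_id"]
        elapsed = result["execution_time"]
        baseline = baseline_seconds.get(uid)
        # Models without a baseline are skipped rather than flagged.
        if baseline and elapsed > factor * baseline:
            flagged.append((uid, baseline, elapsed))
    return flagged


# Trimmed, illustrative run_results.json content.
run_results = {
    "results": [
        {"unique_id": "model.data_platform.orders_daily",
         "status": "success", "execution_time": 720.0},
        {"unique_id": "model.data_platform.customers",
         "status": "success", "execution_time": 31.0},
    ]
}
baseline = {
    "model.data_platform.orders_daily": 90.0,
    "model.data_platform.customers": 30.0,
}

regressions = slow_models(run_results, baseline)
```

Here the 90-second model that took 12 minutes is flagged, while normal run-to-run variation is ignored.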
Data freshness alarms via custom Lambda + CloudWatch: A Lambda function that queries the most recent created_at timestamp in critical tables and publishes a DataFreshness metric. An alarm that fires when a table has not been updated within its expected SLA window catches pipeline failures before business users notice stale data.
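The core of such a Lambda function might look like the sketch below. It assumes the newest created_at value has already been fetched from the warehouse (that query is left out), and the namespace and dimension name are hypothetical choices. The metric is reported in seconds because CloudWatch has no minutes unit.

```python
from datetime import datetime, timezone


def freshness_metric(table: str, latest_created_at: datetime, now: datetime) -> dict:
    """Build one CloudWatch metric datum: seconds since the newest row was created."""
    age_seconds = (now - latest_created_at).total_seconds()
    return {
        "MetricName": "DataFreshness",
        "Dimensions": [{"Name": "TableName", "Value": table}],
        "Value": age_seconds,
        "Unit": "Seconds",
    }


def publish(metric: dict, namespace: str = "DataPlatform/Freshness") -> None:
    """Send the datum to CloudWatch; needs AWS credentials, so it is not called here."""
    import boto3  # imported lazily so freshness_metric stays testable without AWS
    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[metric])


# Illustrative values only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 2, 10, 30, tzinfo=timezone.utc)
datum = freshness_metric("orders_daily", latest, now)
```

A CloudWatch alarm on this metric, with a threshold equal to the table's SLA window in seconds, then fires whenever the table goes stale.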
Pipeline execution dashboard via CloudWatch or Amazon QuickSight: A unified view of all pipeline stages, their last run status, run duration trend, and data volume processed. This dashboard is the first screen an on-call data engineer opens when investigating an incident.
This monitoring posture connects directly to the lineage and observability patterns described in Data Lineage on AWS — when an alert fires, lineage tells you which upstream change caused it.
Environment Management: The Often-Skipped Foundation
DataOps requires at minimum two environments: staging and production. Ideally three: development, staging, and production. This is where many data teams resist, because maintaining separate Redshift clusters or Glue environments has real cost.
The cost-effective approach on AWS uses Redshift Serverless for development and staging environments, paying only for queries executed rather than maintaining running clusters. Production uses provisioned Redshift (or Redshift Serverless if query patterns are sufficiently spiky). Glue uses the same script code across environments, with environment-specific parameters passed via the --ENV argument.
dbt’s environment management uses separate profiles.yml targets that point to different Redshift databases:
# profiles.yml
data_platform:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "dev.cluster.ca-central-1.redshift.amazonaws.com"
      database: dev_analytics
      schema: "dbt_{{ env_var('DBT_USER', 'developer') }}"  # Isolated per developer
    staging:
      type: redshift
      host: "staging.cluster.ca-central-1.redshift.amazonaws.com"
      database: staging_analytics
      schema: analytics
    prod:
      type: redshift
      host: "prod.cluster.ca-central-1.redshift.amazonaws.com"
      database: prod_analytics
      schema: analytics
Each developer gets an isolated schema in the development environment — no shared mutable state between team members. CI runs against staging. Production deployments use the prod target.
Starting Your DataOps Journey
The most common mistake when adopting DataOps is trying to implement everything simultaneously. The pragmatic sequence:
- Week 1–2: Add dbt unique and not_null tests to your five most critical models. Run them manually after each dbt run.
- Week 3–4: Set up a basic CodePipeline or GitHub Actions workflow that runs dbt test on every PR. Block merges on test failures.
- Month 2: Add CloudWatch alarms for Glue job failures and data freshness on top-five tables.
- Month 3: Move all Glue job definitions to Terraform. No more console modifications.
- Month 4–6: Add staging environment. Require all changes to pass staging validation before production deployment.
This sequence delivers value at each step and builds the team’s confidence in automated processes before adding more complexity. The Data Platform Maturity Model provides a broader framework for understanding where DataOps practices fit in your organisation’s overall data capability evolution.
Conclusion
DataOps is not a technology investment — it is a practice investment. The tools (CodePipeline, dbt, CloudWatch, Terraform) are available, well-documented, and do not require specialised licences. What requires investment is the discipline to treat data pipelines with the same engineering rigour as application software: tested, version-controlled, automatically deployed, and monitored in production.
Organisations that make this investment consistently report fewer production incidents, shorter recovery times when incidents do occur, and higher stakeholder confidence in data-driven decisions.
If your data engineering team is building pipeline reliability practices or modernising an existing DataOps setup on AWS, reach out to the Infra IT Consulting team. We help data teams in Canada, the UK, and Africa build the engineering practices that make data platforms production-grade.