Data Lineage on AWS: Tracking Data from Source to Dashboard
When a revenue dashboard shows a number that does not match the finance team’s spreadsheet, the first question is almost always the hardest: where did this number come from, and what touched it on the way? Without data lineage, answering that question means manually tracing through Glue job logs, Redshift query history, and Athena execution records — a process that can take hours or days. With lineage, the answer is a few clicks.
Data lineage is the capability to trace a data asset’s full journey: from the source system where it was created, through every transformation pipeline that processed it, to the final table or dashboard where it is consumed. On AWS, building this capability requires combining several services and potentially an open standard — OpenLineage — that makes lineage metadata portable across tools.
Why Lineage Is a Data Engineering Priority, Not a Governance Luxury
Teams often treat lineage as a compliance or governance concern — something the data steward worries about, not the engineer building the pipeline. This framing underestimates its operational value.
Faster incident resolution. When a pipeline produces incorrect output, lineage tells you immediately which upstream source changed and which downstream consumers are affected. Mean time to resolution (MTTR) for data incidents at organisations with mature lineage is typically 60–80% lower than at organisations without it.
Safe schema changes. Before renaming a column or changing a table’s schema, lineage shows you every downstream job, query, and dashboard that reads from that table. You can notify affected teams and coordinate migration rather than discovering breakages after deployment.
Impact analysis for infrastructure changes. When evaluating whether to migrate a source system or change a Glue job’s output format, lineage provides the full blast radius of that change — across every consumer, including ones added by other teams months ago.
Regulatory compliance. Under PIPEDA, Quebec’s Law 25, and GDPR, organisations must be able to demonstrate where personal data flows and how it is processed. Automated lineage is more reliable and auditable than manually maintained data flow documentation.
AWS Native Lineage Capabilities
AWS provides lineage tracking through two main channels: AWS Glue Data Catalog column-level lineage for Glue-managed assets, and Amazon DataZone for broader organisational lineage across accounts and domains.
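AWS Glue Data Catalog lineage (described here for Glue version 4.0+) records column-level provenance for Glue ETL jobs. This walkthrough sets the --enable-glue-datacatalog job parameter, which makes the Data Catalog the job’s Spark Hive metastore, and sets --datalake-formats where the job uses Hudi, Delta Lake, or Iceberg tables. Neither parameter is a dedicated lineage switch, and the settings that turn lineage capture on vary by Glue version and by whether Amazon DataZone is in use, so confirm them against the current AWS documentation. When a Glue job reads from Table A and writes to Table B, the recorded lineage shows which columns in Table B derive from which columns in Table A.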
# Glue job with lineage tracking enabled via job parameters
# Set these in the Glue job configuration:
#   --enable-glue-datacatalog: true
#   --enable-continuous-cloudwatch-log: true
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Glue automatically records lineage for DynamicFrame operations
source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="orders_raw"
)

# Transformation: lineage tracks column derivations
transformed = source.select_fields(
    ["order_id", "customer_id", "order_total", "created_at"]
).rename_field("order_total", "revenue")

# Lineage recorded: raw_db.orders_raw → analytics_db.orders_daily
glueContext.write_dynamic_frame.from_catalog(
    frame=transformed,
    database="analytics_db",
    table_name="orders_daily"
)

job.commit()
The resulting lineage is queryable via the AWS Glue API and visible in the Glue Console under the Data Catalog table’s “Lineage” tab. Column-level lineage shows that analytics_db.orders_daily.revenue derives from raw_db.orders_raw.order_total.
Amazon DataZone extends lineage to the organisational level — across AWS accounts, data domains, and consumer personas. DataZone’s lineage graph shows not just technical pipeline lineage but business lineage: which data products (published in a DataZone domain) consume from which other products, and who has subscribed to those products. This is the layer where compliance officers and data stewards operate.
OpenLineage: The Standard That Makes Lineage Portable
AWS-native lineage works well for Glue-to-Glue and Glue-to-Redshift flows, but most production data platforms also include dbt transformations, Airflow orchestration, Redshift stored procedures, and custom Python jobs. Capturing lineage across all of these requires a standard that all tools can emit — and that standard is OpenLineage.
OpenLineage is an open specification (an LF AI & Data Foundation project) that defines a common event schema for lineage metadata. Tools that support it, including dbt, Apache Airflow, Spark, and Flink, emit RunEvent messages to an OpenLineage backend whenever a job starts, completes, or fails.
Marquez is the reference OpenLineage backend, deployable on Amazon ECS or EKS. It collects OpenLineage events from all your tools and presents a unified lineage graph via its UI and REST API.
Example OpenLineage RunEvent emitted by dbt when a model completes:
{
  "eventType": "COMPLETE",
  "eventTime": "2024-03-12T06:15:00Z",
  "run": {
    "runId": "550e8400-e29b-41d4-a716-446655440000"
  },
  "job": {
    "namespace": "dbt-redshift",
    "name": "analytics.orders_daily"
  },
  "inputs": [
    {
      "namespace": "redshift://cluster.ca-central-1",
      "name": "raw.orders_raw"
    },
    {
      "namespace": "redshift://cluster.ca-central-1",
      "name": "raw.customers"
    }
  ],
  "outputs": [
    {
      "namespace": "redshift://cluster.ca-central-1",
      "name": "analytics.orders_daily",
      "facets": {
        "schema": {
          "fields": [
            {"name": "order_id", "type": "VARCHAR"},
            {"name": "revenue", "type": "DECIMAL(12,2)"}
          ]
        }
      }
    }
  ]
}
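Configuring dbt to emit OpenLineage events requires installing the openlineage-dbt package, setting the OPENLINEAGE_URL environment variable to your Marquez instance, and invoking dbt through the dbt-ol wrapper (for example, dbt-ol run in place of dbt run). There is no OpenLineage adapter to set in profiles.yml; the wrapper reads the artifacts that dbt writes and turns them into events. From that point forward, every run started through dbt-ol records its inputs, outputs, and schema facets without additional engineering work.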
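For custom Python jobs that have no ready-made integration, the same kind of event can be built and sent by hand. The following is a minimal sketch, not a reference client: it assumes Marquez accepts events on its /api/v1/lineage endpoint, and the producer URI, the schemaURL value, and the function names are illustrative choices that should be checked against the spec version your backend expects.

```python
import json
import os
import urllib.request
import uuid
from datetime import datetime, timezone

# Placeholder values (assumptions): replace with your own producer URI and
# the RunEvent schema URL for the OpenLineage spec version you target.
PRODUCER = "https://example.com/custom-python-job"
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Return an OpenLineage RunEvent as a plain dict.

    inputs and outputs are lists of (namespace, name) tuples.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


def post_event(event, base_url=None):
    """POST the event to the backend and return the HTTP status code."""
    base_url = base_url or os.environ["OPENLINEAGE_URL"]
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Build (but do not send) an event that mirrors the dbt example above.
REDSHIFT = "redshift://cluster.ca-central-1"
event = build_run_event(
    "COMPLETE",
    "dbt-redshift",
    "analytics.orders_daily",
    inputs=[(REDSHIFT, "raw.orders_raw"), (REDSHIFT, "raw.customers")],
    outputs=[(REDSHIFT, "analytics.orders_daily")],
)
```

Calling post_event(event) sends it to the URL in OPENLINEAGE_URL. A real job would normally emit a START event first and reuse the same runId for the matching COMPLETE or FAIL event.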
Building a Lineage Architecture on AWS
A production lineage architecture for an AWS data platform looks like this:
Data Sources (RDS, Kafka, SaaS APIs)
↓
AWS Glue ETL Jobs (emit lineage to Glue Data Catalog)
↓
Amazon S3 (raw zone → curated zone)
↓
dbt on Amazon Redshift (emit OpenLineage events to Marquez on ECS)
↓
Amazon Redshift (analytics layer)
↓
Amazon QuickSight (dashboards)
Lineage metadata flows:
Glue Data Catalog → lineage API → Marquez (via custom Lambda bridge)
dbt → OpenLineage → Marquez directly
Airflow DAGs → OpenLineage → Marquez via openlineage-airflow package (apache-airflow-providers-openlineage on Airflow 2.7+)
Marquez (on Amazon ECS, backed by Amazon RDS PostgreSQL)
↑
Unified lineage graph, queryable via REST API and UI
↑
Amazon DataZone (subscribes to Marquez for cross-domain lineage)
The Lambda bridge between Glue Data Catalog and Marquez is a small function that EventBridge invokes when a Glue job completes. It queries the Glue lineage API for that run and translates Glue’s lineage format into OpenLineage events. This keeps your lineage graph unified in Marquez even though Glue uses its own format internally.
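The translation half of such a bridge might look like the sketch below. It is a simplification under stated assumptions: rather than calling a Glue lineage API, it looks up each job's datasets in a hand-maintained table (JOB_DATASETS, hypothetical), and the job name, namespaces, producer URI, and schemaURL are placeholders. The input is the "Glue Job State Change" event that EventBridge delivers, of which only the time, jobName, state, and jobRunId fields are used.

```python
import uuid

# Map Glue job run states to OpenLineage event types.
STATE_TO_EVENT_TYPE = {
    "SUCCEEDED": "COMPLETE",
    "FAILED": "FAIL",
    "TIMEOUT": "FAIL",
    "STOPPED": "ABORT",
}

# Hypothetical, hand-maintained mapping from Glue job name to its datasets.
# A fuller bridge would derive this from lineage metadata instead.
JOB_DATASETS = {
    "orders_daily_job": {
        "inputs": [("glue://raw_db", "orders_raw")],
        "outputs": [("glue://analytics_db", "orders_daily")],
    }
}

PRODUCER = "https://example.com/glue-lineage-bridge"  # placeholder
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def glue_event_to_openlineage(event):
    """Translate a Glue Job State Change event into a RunEvent dict.

    Returns None for states or jobs the bridge does not know about.
    """
    detail = event["detail"]
    event_type = STATE_TO_EVENT_TYPE.get(detail["state"])
    datasets = JOB_DATASETS.get(detail["jobName"])
    if event_type is None or datasets is None:
        return None
    # OpenLineage run IDs are UUIDs, so derive a stable one from the Glue run ID.
    run_id = str(uuid.uuid5(uuid.NAMESPACE_URL, "glue-job-run/" + detail["jobRunId"]))
    return {
        "eventType": event_type,
        "eventTime": event["time"],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "aws-glue", "name": detail["jobName"]},
        "inputs": [{"namespace": ns, "name": n} for ns, n in datasets["inputs"]],
        "outputs": [{"namespace": ns, "name": n} for ns, n in datasets["outputs"]],
    }


def handler(event, context):
    """Lambda entry point; forwarding the event to Marquez is left out here."""
    return glue_event_to_openlineage(event)


sample_event = {
    "time": "2024-03-12T06:15:00Z",
    "detail": {"jobName": "orders_daily_job", "state": "SUCCEEDED", "jobRunId": "jr_example"},
}
translated = handler(sample_event, None)
```

Deriving the run ID with uuid5 means a retried delivery of the same EventBridge event maps to the same OpenLineage run, so the backend does not see duplicate runs.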
Column-Level Lineage: The Detail That Matters for Compliance
Table-level lineage tells you that Table B was built from Tables A and C. Column-level lineage tells you that Table B.revenue derives from Table A.order_total multiplied by Table A.exchange_rate. For compliance purposes — particularly when demonstrating to auditors how a specific personal data field is processed — column-level lineage is the required granularity.
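dbt supports column documentation through column definitions in schema YAML files, and running dbt docs generate writes the resolved column names and types to catalog.json. The OpenLineage integration can use that metadata to populate the schema facet (shown above) on each dataset sent to Marquez. Note that the schema facet lists a dataset’s columns; column-to-column derivations are carried by the separate OpenLineage columnLineage facet, so check that your integration version emits it before relying on Marquez for column-level lineage from dbt.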
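To make the column-level case concrete, the sketch below builds an OpenLineage columnLineage dataset facet for the revenue example, where revenue derives from order_total and exchange_rate. The helper name and the _producer and _schemaURL values are placeholders; the fields and inputFields layout follows the facet's published shape as understood here, so verify it against the spec version you use.

```python
def column_lineage_facet(mapping, producer, schema_url):
    """Build a columnLineage facet.

    mapping: {output_column: [(namespace, dataset_name, input_column), ...]}
    """
    fields = {}
    for output_column, sources in mapping.items():
        fields[output_column] = {
            "inputFields": [
                {"namespace": ns, "name": dataset, "field": column}
                for ns, dataset, column in sources
            ]
        }
    return {"_producer": producer, "_schemaURL": schema_url, "fields": fields}


REDSHIFT = "redshift://cluster.ca-central-1"

facet = column_lineage_facet(
    {
        "revenue": [
            (REDSHIFT, "raw.orders_raw", "order_total"),
            (REDSHIFT, "raw.orders_raw", "exchange_rate"),
        ]
    },
    producer="https://example.com/lineage-sketch",                  # placeholder
    schema_url="https://example.com/ColumnLineageDatasetFacet.json",  # placeholder
)

# The facet is attached to the output dataset of a RunEvent.
output_dataset = {
    "namespace": REDSHIFT,
    "name": "analytics.orders_daily",
    "facets": {"columnLineage": facet},
}
```

An auditor's question such as "which source fields feed revenue?" then becomes a lookup of output_dataset["facets"]["columnLineage"]["fields"]["revenue"].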
Lake Formation’s column-level access controls work alongside lineage to create a complete governance picture: lineage shows where data came from, Lake Formation controls who can see it. For teams building a comprehensive governance posture, this combination is described in detail in AWS Lake Formation Best Practices.
Operationalising Lineage: Making It Useful Day to Day
Lineage metadata is only valuable if engineers and data stewards actually use it. Three operational practices turn lineage from a compliance artefact into a daily tool:
Lineage-aware change review. Before merging a PR that modifies a source table schema or a transformation job, run a lineage impact query that lists all downstream consumers. Make this a required step in your PR checklist. A simple script hitting the Marquez API can produce this output automatically as a PR comment via GitHub Actions.
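A lineage impact script of this kind can be quite small. The sketch below assumes the lineage graph has already been fetched from Marquez (its lineage endpoint takes a nodeId and a depth) and that the response is a "graph" list whose nodes carry an id and outEdges with origin and destination, which is the shape assumed here rather than a guaranteed contract. The function names and the sample node IDs are illustrative.

```python
from collections import deque


def downstream_nodes(lineage, start_id):
    """Return every node reachable from start_id via outEdges, in BFS order."""
    out_edges = {
        node["id"]: [edge["destination"] for edge in node.get("outEdges", [])]
        for node in lineage["graph"]
    }
    seen = {start_id}
    ordered = []
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for nxt in out_edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                ordered.append(nxt)
                queue.append(nxt)
    return ordered


def format_pr_comment(start_id, nodes):
    """Render the impact list as a short Markdown comment."""
    if not nodes:
        return f"No downstream consumers found for `{start_id}`."
    lines = [f"Downstream consumers of `{start_id}`:"]
    lines += [f"- `{node}`" for node in nodes]
    return "\n".join(lines)


# Hand-written sample in the assumed response shape.
RAW = "dataset:redshift:raw.orders_raw"
JOB = "job:dbt-redshift:analytics.orders_daily"
DAILY = "dataset:redshift:analytics.orders_daily"

sample_lineage = {
    "graph": [
        {"id": RAW, "type": "DATASET", "inEdges": [],
         "outEdges": [{"origin": RAW, "destination": JOB}]},
        {"id": JOB, "type": "JOB",
         "inEdges": [{"origin": RAW, "destination": JOB}],
         "outEdges": [{"origin": JOB, "destination": DAILY}]},
        {"id": DAILY, "type": "DATASET",
         "inEdges": [{"origin": JOB, "destination": DAILY}], "outEdges": []},
    ]
}

impacted = downstream_nodes(sample_lineage, RAW)
comment = format_pr_comment(RAW, impacted)
```

In a GitHub Actions workflow, the comment string would be posted to the pull request, and the job could fail when impacted contains datasets owned by another team.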
Incident runbooks that start with lineage. When a data quality alert fires, the first step in the runbook should be: “Open the lineage graph for the affected table and identify which upstream job produced the current partition.” This becomes a 30-second step with Marquez rather than a manual search through CloudWatch logs.
Data contract validation linked to lineage nodes. When a lineage node (a table or dataset) has an associated data contract, validation results should be visible in the lineage graph. Marquez supports custom run facets that can carry contract validation status — failed validations appear as annotations on the lineage node, giving consumers immediate visibility into upstream quality issues.
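One way to carry contract status on a run is a custom run facet, sketched below. The facet key acme_dataContractValidation, its fields, and the _producer and _schemaURL values are all hypothetical; OpenLineage only asks that custom facets carry those two underscore-prefixed keys and use a name that will not collide with standard facets.

```python
def contract_validation_facet(contract_name, failed_checks, producer, schema_url):
    """Build a custom run facet summarising a data contract validation."""
    failed = list(failed_checks)
    return {
        "acme_dataContractValidation": {  # hypothetical facet name
            "_producer": producer,
            "_schemaURL": schema_url,
            "contract": contract_name,
            "status": "FAILED" if failed else "PASSED",
            "failedChecks": failed,
        }
    }


# The facet goes under run.facets in the RunEvent for the validating job.
run = {
    "runId": "550e8400-e29b-41d4-a716-446655440000",
    "facets": contract_validation_facet(
        contract_name="orders_daily_v1",
        failed_checks=["revenue is not null"],
        producer="https://example.com/contract-validator",              # placeholder
        schema_url="https://example.com/DataContractValidationFacet.json",  # placeholder
    ),
}
```

Because the facet travels with the run, a consumer looking at the dataset's latest run can see which checks failed without opening the validation tool itself.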
Conclusion
Data lineage on AWS is not a single-service capability — it requires combining AWS Glue Data Catalog lineage for Glue-native jobs, OpenLineage emission from dbt and Airflow, and a unified backend (Marquez or Amazon DataZone) that brings the full graph together. The engineering investment to set this up is measured in weeks, not months, and the operational return — faster incident resolution, safe schema changes, and defensible compliance documentation — pays back quickly.
Organisations that invest in lineage early in their data platform build avoid the painful process of reverse-engineering data flows from logs and tribal knowledge after an incident. Organisations that add lineage to mature platforms consistently report that the visibility it provides changes how their engineering teams approach changes to shared datasets.
If you are designing or retrofitting a lineage capability for your AWS data platform, get in touch with the Infra IT Consulting team. We work with data engineering teams in Canada, the UK, and Africa to build governance and observability infrastructure that makes data platforms trustworthy at scale.