
Building a Data Governance Framework That Actually Works

By Infra IT Consulting · 9 min read

Data governance is one of those initiatives that almost every organisation knows it needs and very few implement successfully. The failure mode is predictable: a governance committee is formed, a policy document is written, roles and responsibilities are defined on paper, and then nothing changes in practice. Data engineers keep building pipelines the way they always have. Analysts keep creating shadow copies of data in their personal S3 buckets. The governance framework exists as a document rather than a functioning system.

The difference between governance that works and governance that gathers dust is enforcement. Policies that rely on people voluntarily following documented guidelines will fail. Policies that are encoded into infrastructure (access controls, automated quality checks, pipeline validation rules) are followed because the system enforces them. This guide focuses on building that kind of governance.

What Data Governance Actually Needs to Do

Effective data governance achieves four things:

  1. Access control: the right people can access the data they need; the wrong people cannot access data they should not
  2. Data quality: data assets meet defined standards for completeness, accuracy, and freshness; degradations are detected and resolved quickly
  3. Lineage and auditability: you can trace where any piece of data came from, how it was transformed, and who has accessed it
  4. Data discoverability: people who need data can find it without asking the data team for help

Governance frameworks that address all four are genuinely useful. Frameworks that address only policies and access control but ignore quality, lineage, and discoverability solve less than half the problem.

Data Classification as the Foundation

Before you can govern data, you need to classify it. Classification answers the question: what kind of data is this, and what handling requirements does that impose? A practical classification scheme for most organisations has four tiers:

  • Public: data that could be shared externally without risk (aggregated, anonymised analytics; publicly available reference data)
  • Internal: data appropriate for any authenticated employee but not for external parties (company-wide metrics, internal documentation)
  • Confidential: data requiring role-based access within the organisation (customer PII, financial detail, HR data)
  • Restricted: data with regulatory handling requirements (payment card data, health information, data subject to GDPR/PIPEDA/FCA requirements)

Classification must be applied at the dataset level and enforced technically. In AWS, this means tagging S3 objects and Glue Data Catalog tables with a classification label, and using AWS Lake Formation to enforce access based on that classification.
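As a minimal sketch of what "enforced technically" can look like, the snippet below builds the tag payloads for an S3 object and a Glue catalog table from a single tier definition. It assumes Lake Formation LF-tags are the mechanism used on the catalog side; the tag key data-classification and the helper names are illustrative, not part of any standard. The s3 and lf arguments are expected to be boto3 clients for "s3" and "lakeformation".

```python
from enum import Enum


class Classification(Enum):
    """The four tiers described above, from least to most sensitive."""
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


# Hypothetical tag key; use whatever your tagging standard prescribes.
TAG_KEY = "data-classification"


def s3_tagging(tier: Classification) -> dict:
    """Build the Tagging argument for s3.put_object_tagging."""
    return {"TagSet": [{"Key": TAG_KEY, "Value": tier.value}]}


def lf_tags(tier: Classification) -> list:
    """Build the LFTags argument for lakeformation.add_lf_tags_to_resource."""
    return [{"TagKey": TAG_KEY, "TagValues": [tier.value]}]


def tag_dataset(s3, lf, bucket, key, database, table, tier):
    """Apply the same label to an S3 object and its catalog table.

    Assumes the LF-tag key and value already exist
    (created beforehand with lf.create_lf_tag).
    """
    s3.put_object_tagging(Bucket=bucket, Key=key, Tagging=s3_tagging(tier))
    lf.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        LFTags=lf_tags(tier),
    )
```

Keeping the tier in one enum means the S3 tag and the catalog tag cannot drift apart, which is the property access policies depend on.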

Amazon Macie automates PII discovery in S3: it scans your buckets and identifies objects containing personally identifiable information such as names, email addresses, SINs, credit card numbers, and passport numbers. Running Macie on your data lake surfaces unclassified PII that should be elevated to Confidential or Restricted status. This is particularly important for raw ingestion buckets, where developers sometimes land full API responses or database exports without reviewing the contents.
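A one-time Macie scan of the raw ingestion buckets can be started through the macie2 API. The sketch below only builds the request; the helper name, job name, and bucket names are illustrative, and it assumes Macie is already enabled in the account.

```python
import uuid


def macie_job_request(account_id: str, buckets: list, name: str) -> dict:
    """Keyword arguments for macie2.create_classification_job (one-time scan)."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ]
        },
    }
```

With a boto3 client this would be submitted as boto3.client("macie2").create_classification_job(**macie_job_request("123456789012", ["raw-ingest"], "raw-pii-scan")), where the account ID and bucket name are placeholders.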

AWS Lake Formation: Technical Policy Enforcement

AWS Lake Formation is the governance control plane for AWS data lakes. It sits above S3 and the Glue Data Catalog and provides column-level, row-level, and table-level access control for Athena, Redshift Spectrum, EMR, and Glue queries. Without Lake Formation, access is controlled by IAM policies, S3 bucket policies, and Glue catalog permissions, which are powerful but coarse-grained and difficult to audit centrally.

Lake Formation’s key capabilities for governance:

Table and column permissions. Grant or revoke access to specific Glue catalog tables (and specific columns within those tables) for individual IAM users and roles, or for federated (SAML or IAM Identity Center) users and groups. A column exclusion means that an analyst’s Athena query against a table simply does not return the excluded column, even if the underlying Parquet file contains it.

# Granting column-level access via Lake Formation (boto3)
import boto3

lf = boto3.client('lakeformation', region_name='ca-central-1')

# Grant an analyst role access to customer table but exclude PII columns
lf.grant_permissions(
    # 123456789012 is a placeholder; AWS account IDs are 12 digits
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/AnalystRole'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'curated',
            'Name': 'dim_customers',
            'ColumnWildcard': {
                'ExcludedColumnNames': ['email', 'phone', 'sin_hash', 'date_of_birth']
            }
        }
    },
    Permissions=['SELECT']
)

Row-level security. Lake Formation data filters restrict which rows a principal can see. A regional manager can query the sales fact table but see only rows for their region. This is configured as a data filter and attached to the table grant.
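The regional-manager case can be sketched with a Lake Formation data cells filter. The column name region, the filter naming scheme, and both helper functions are assumptions made for illustration; lf is expected to be a boto3 "lakeformation" client.

```python
def region_filter(catalog_id: str, database: str, table: str, region: str) -> dict:
    """Build the TableData argument for lakeformation.create_data_cells_filter."""
    if not region.isalnum():
        # Keep arbitrary text out of the filter expression.
        raise ValueError(f"unexpected region value: {region!r}")
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"region_{region.lower()}",
        "RowFilter": {"FilterExpression": f"region = '{region}'"},
        "ColumnWildcard": {},  # no excluded columns; only rows are filtered
    }


def grant_filtered_select(lf, role_arn: str, table_data: dict) -> None:
    """Create the filter, then grant SELECT on the filter rather than the table."""
    lf.create_data_cells_filter(TableData=table_data)
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": role_arn},
        Resource={
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        Permissions=["SELECT"],
    )
```

Granting on the filter instead of the table is what keeps the principal from seeing rows outside the expression.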

Access audit via CloudTrail. Every Lake Formation data access is logged in AWS CloudTrail: who queried which table, when, from which IP, and via which query engine. This audit log is the foundation for regulatory compliance reporting and insider threat detection.
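A rough sketch of turning that log into an access summary follows. It assumes events were fetched with cloudtrail.lookup_events, whose results carry the raw record as a JSON string under CloudTrailEvent; which events actually appear depends on how your trail is configured. The function names are illustrative.

```python
import json
from collections import Counter


def lookup_request(start, end) -> dict:
    """Keyword arguments for cloudtrail.lookup_events, limited to Lake Formation."""
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        "StartTime": start,
        "EndTime": end,
    }


def access_by_principal(events) -> Counter:
    """Count events per (caller ARN, event name) pair."""
    counts = Counter()
    for event in events:
        record = json.loads(event["CloudTrailEvent"])
        arn = record.get("userIdentity", {}).get("arn", "unknown")
        counts[(arn, record.get("eventName", "unknown"))] += 1
    return counts
```

For long retention or large volumes, querying the trail's S3 delivery with Athena is likely a better fit than lookup_events, which only covers recent management events.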

Data Quality Policy and Automated Enforcement

A governance framework without data quality enforcement is incomplete. Data quality policy should specify, for each classified data tier, the minimum acceptable thresholds for:

  • Completeness: maximum allowable null rate for required fields
  • Uniqueness: uniqueness constraints on primary keys and business identifiers
  • Freshness: maximum age of data at query time
  • Consistency: referential integrity between related datasets
  • Validity: domain constraints (dates in valid ranges, categorical fields within allowed values)
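To make per-tier thresholds concrete, here is a small engine-agnostic sketch covering three of the dimensions above (completeness, uniqueness, freshness). The threshold values, tier names in the mapping, and function names are hypothetical; real limits belong in your policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class QualityPolicy:
    max_null_rate: float  # completeness: allowed share of nulls in a required field
    max_age: timedelta    # freshness: allowed age of the latest load


# Hypothetical thresholds per classification tier.
POLICIES = {
    "internal": QualityPolicy(max_null_rate=0.05, max_age=timedelta(hours=48)),
    "restricted": QualityPolicy(max_null_rate=0.0, max_age=timedelta(hours=24)),
}


def check_batch(rows, key, required, loaded_at, policy, now=None):
    """Return the list of violated dimensions for one batch of dict rows."""
    now = now or datetime.now(timezone.utc)
    failures = []
    total = len(rows)
    nulls = sum(1 for row in rows if row.get(required) is None)
    if total and nulls / total > policy.max_null_rate:
        failures.append("completeness")
    keys = [row.get(key) for row in rows]
    if len(set(keys)) != len(keys):
        failures.append("uniqueness")
    if now - loaded_at > policy.max_age:
        failures.append("freshness")
    return failures
```

Expressing the policy as data rather than prose is what lets the same thresholds be checked on every run.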

These thresholds are implemented as automated checks in the pipeline, not manual reviews. AWS Glue Data Quality provides a rule-based framework for defining and running quality checks within Glue jobs:

# AWS Glue Data Quality ruleset definition
RULESET = """
Rules = [
    IsComplete "customer_id",
    IsUnique "customer_id",
    IsComplete "email",
    ColumnValues "plan_type" in ["free", "starter", "pro", "enterprise"],
    ColumnValues "created_at" > "2015-01-01",
    RowCount > 10000
]
"""

When these rules run as part of a Glue pipeline and fail, the job can be configured to halt and alert rather than write bad data to the curated layer. This “fail fast” approach (stopping bad data at the pipeline boundary rather than letting it propagate) is the most effective quality enforcement mechanism.
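The halt-and-alert behaviour can be sketched independently of any particular engine. In this illustration, results is a mapping from rule text to a pass/fail flag, and write_fn and alert_fn are hypothetical callables supplied by the pipeline; none of these names come from Glue itself.

```python
class QualityGateError(RuntimeError):
    """Raised when a batch fails its quality rules and must not be written."""


def enforce_quality_gate(results, write_fn, alert_fn):
    """Write the batch only if every rule passed; otherwise alert and halt."""
    failed = sorted(rule for rule, passed in results.items() if not passed)
    if failed:
        alert_fn(failed)  # notify before halting so the failure is visible
        raise QualityGateError(f"{len(failed)} rule(s) failed: {', '.join(failed)}")
    write_fn()
```

Raising an exception, rather than returning a status, makes the job run fail visibly in the orchestrator instead of relying on a caller to check a flag.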

For teams using dbt, dbt tests provide equivalent quality enforcement at the transformation layer, with test results written to a central results table that feeds a data reliability dashboard. The data as a product framework details how to structure SLAs and quality monitoring as part of a broader data product ownership model.

Data Lineage: Understanding Data Provenance

Data lineage answers: where did this data come from, and how was it transformed before it reached me? Without lineage, debugging a data quality issue requires manually tracing back through pipeline code to find the source. With lineage, you can navigate the dependency graph from any dataset to its origins in a few clicks.

AWS Glue provides automatic lineage tracking for Glue jobs: it records which source tables were read and which target tables were written in each job run, building a lineage graph in the Glue console. For more detailed lineage that includes column-level transformations, tools like OpenLineage (integrated with dbt, Airflow, and Spark) provide a standardised lineage event model that can be stored in S3 and visualised in tools like Marquez or DataHub.

Practical lineage requirements for a governance framework:

  • Every pipeline job records its source and target datasets and its job run ID
  • Every dataset includes a created_by_job_id column linking each record to the job run that produced it
  • A central lineage store (S3 + Athena or a dedicated lineage tool) is queryable by data stewards
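The requirements above can be sketched as a small lineage record plus a row-stamping helper. This is a simplified, home-grown shape rather than the OpenLineage event model; the class, field names other than created_by_job_id, and the JSON-lines output format are assumptions.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """One job run: what it read, what it wrote, and its run ID."""
    job_name: str
    sources: list
    targets: list
    job_run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialise as one JSON line, suitable for an S3 prefix queried by Athena."""
        return json.dumps(asdict(self), sort_keys=True)


def stamp_rows(rows, job_run_id):
    """Return copies of the rows with created_by_job_id set to this run."""
    return [{**row, "created_by_job_id": job_run_id} for row in rows]
```

Because every output row carries the run ID, a steward can join a suspicious record back to the lineage record that lists its sources.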

Governance Roles and Responsibilities

Technical enforcement handles the “what”; organisational roles handle the “who”. A minimal governance operating model includes:

Data owners: typically senior business leaders who are accountable for the accuracy and appropriate use of data in their domain. The VP of Finance is the data owner for financial data. They approve who gets access to restricted financial datasets.

Data stewards: practitioners (data engineers, senior analysts) who manage the day-to-day quality and compliance of specific datasets. They triage quality alerts, manage access requests, and maintain data documentation.

Data consumers: analysts, engineers, and business users who use data in their work. They are responsible for using data within its classification constraints and reporting quality issues.

Platform team: the data engineering team that maintains the infrastructure: Glue pipelines, Lake Formation policies, quality check frameworks, and the data catalogue.

This model connects to the data catalog best practices post, which covers how to make datasets discoverable and documented as part of a governance-first approach.

Measuring Governance Maturity

Governance programs need metrics to demonstrate progress and identify gaps. Track:

  • Coverage rate: what percentage of datasets in the catalogue have an assigned owner, classification, and quality SLA?
  • Quality pass rate: what percentage of pipeline runs pass all defined quality checks?
  • Mean time to detect: how quickly are quality issues detected after they occur?
  • Mean time to resolve: how quickly are detected quality issues resolved?
  • Access request SLA: how long does it take to provision legitimate data access requests?
  • Audit readiness: can you produce a complete access log for any dataset for any time period within 24 hours?
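Several of these metrics reduce to simple arithmetic over catalogue and incident records. The sketch below assumes hypothetical record shapes (dicts with owner, classification, quality_sla, passed, and timestamp fields); adapt the field names to whatever your catalogue and incident tracker export.

```python
from statistics import mean


def coverage_rate(datasets) -> float:
    """Share of datasets with owner, classification, and quality SLA all set."""
    required = ("owner", "classification", "quality_sla")
    if not datasets:
        return 0.0
    covered = sum(1 for d in datasets if all(d.get(f) for f in required))
    return covered / len(datasets)


def quality_pass_rate(runs) -> float:
    """Share of pipeline runs whose quality checks all passed."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r["passed"]) / len(runs)


def mean_hours(incidents, start_field: str, end_field: str) -> float:
    """Mean elapsed hours between two datetime fields across incidents."""
    return mean(
        (i[end_field] - i[start_field]).total_seconds() / 3600 for i in incidents
    )
```

Mean time to detect is then mean_hours(incidents, "occurred_at", "detected_at") and mean time to resolve is mean_hours(incidents, "detected_at", "resolved_at"), with those field names again being assumptions.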

Conclusion

A data governance framework that works is primarily a technical systems problem, not a policy-writing exercise. Access controls must be machine-enforced via Lake Formation. Quality standards must be validated automatically in pipelines. Lineage must be captured at the infrastructure level. Policies that depend on human compliance will fail; policies encoded into infrastructure will hold.

Building governance that meets regulatory requirements (PIPEDA in Canada, UK GDPR, various African national data protection laws) while enabling analysts to work productively is a balance that requires careful architecture. If your organisation needs to establish or improve its data governance program on AWS, contact Infra IT Consulting for an assessment and implementation roadmap.
