
Data Catalog Best Practices: Making Data Discoverable at Scale

By Infra IT Consulting · January 30, 2024 · 8 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

Every data team eventually reaches the same inflection point: the data lake has grown large enough that new joiners cannot find what they need, analysts duplicate work because they cannot tell whether a dataset already exists, and engineers spend hours answering questions like “what does this column mean?” and “where does this number come from?” A data catalogue is the solution — but only if it is built and maintained in a way that people actually use.

Most data catalogue initiatives fail for the same reason: they rely on manual metadata entry that quickly falls out of date. An empty catalogue with excellent search is useless; a catalogue full of stale, inaccurate descriptions is worse than useless, because it actively misleads people. The best data catalogues are largely automated — metadata is captured from the pipeline itself — and supplemented by a lightweight human curation process.

What a Data Catalog Actually Needs to Provide

Before evaluating tools, be clear about what problems you are solving. A data catalogue needs to provide:

Discovery — I need data about X; where does that data live in our data lake?
Understanding — I found a dataset; what does it contain, and what do the columns mean in business terms?
Trust — is this dataset reliable, who owns it, when was it last updated, and does it meet quality standards?
Access — how do I get access to this dataset if I don’t have it?
Lineage — where does this data come from, and what downstream assets depend on it?

A catalogue that provides discovery and understanding but not trust and lineage is half a catalogue. Analysts will find data but not know whether to rely on it, which is often worse than not finding it at all.
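As a rough illustration of the five needs above, the sketch below models a catalogue entry and reports which needs it cannot yet serve. The `CatalogEntry` class, its field names, and `missing_capabilities` are hypothetical and not taken from any particular catalogue tool.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalogue record; field names are illustrative only."""
    name: str
    description: str = ""                                        # discovery
    column_docs: dict[str, str] = field(default_factory=dict)    # understanding
    owner: str = ""                                              # trust
    last_updated: str = ""                                       # trust (ISO 8601)
    access_instructions: str = ""                                # access
    upstream: list[str] = field(default_factory=list)            # lineage
    downstream: list[str] = field(default_factory=list)          # lineage


def missing_capabilities(entry: CatalogEntry) -> list[str]:
    """Return the needs this entry cannot yet serve, in the order listed above."""
    gaps = []
    if not entry.description:
        gaps.append("discovery")
    if not entry.column_docs:
        gaps.append("understanding")
    if not (entry.owner and entry.last_updated):
        gaps.append("trust")
    if not entry.access_instructions:
        gaps.append("access")
    if not (entry.upstream or entry.downstream):
        gaps.append("lineage")
    return gaps


entry = CatalogEntry(
    name="dim_customers",
    description="Canonical customer dimension. One row per customer.",
    column_docs={"customer_id": "Surrogate integer key."},
)
print(missing_capabilities(entry))  # ['trust', 'access', 'lineage']
```

An entry like this one is the "half a catalogue" case: findable and readable, but with nothing to tell an analyst whether to rely on it.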

AWS Glue Data Catalog: The Foundation Layer

AWS Glue Data Catalog is the metadata repository for your AWS data lake. It stores table schemas, partition information, data types, and location metadata for data in S3 that is queried through Athena, Redshift Spectrum, or Glue jobs. It is compatible with the Apache Hive metastore, so engines that expect a Hive metastore, such as Spark or Hive on EMR, can be configured to use it instead.

Glue Crawlers automate schema discovery: they scan S3 buckets or database connections, infer schemas from Parquet, JSON, or CSV files, and register tables in the Data Catalog without manual intervention. This is the foundation of automated metadata capture: as long as Crawlers run on a schedule or are triggered by the pipeline, new datasets and schema changes in S3 reach the catalogue without anyone entering them by hand.
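A minimal sketch of defining a scheduled Crawler is shown below. The builder function is plain Python; the keys it produces follow the Glue `CreateCrawler` request shape as exposed by boto3's `create_crawler`. The crawler name, role ARN, database, bucket path, and schedule are placeholder assumptions, and the actual API call is kept in a separate function because it needs boto3 and AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": schedule,  # Glue cron syntax, evaluated in UTC
        "SchemaChangePolicy": {
            # Apply detected schema changes to the catalogue table...
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # ...but only log objects that disappeared, rather than dropping tables.
            "DeleteBehavior": "LOG",
        },
    }


def create_crawler(request):
    """Send the request to Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no dependencies

    boto3.client("glue").create_crawler(**request)


# All values below are placeholders, not real resources.
request = build_crawler_request(
    name="orders-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="raw",
    s3_path="s3://example-data-lake/raw/orders/",
    schedule="cron(0 5 * * ? *)",  # daily at 05:00 UTC
)
```

Keeping the request as data makes it easy to review in version control before anything is created in the account.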

However, Glue Data Catalog has significant limitations as a standalone catalogue:

No business-friendly search. The Glue console is designed for engineers, not analysts. There is no full-text search across table descriptions, no tagging interface, and no concept of data products or logical groupings.
No rich column descriptions. Glue supports table-level and column-level descriptions, but they are short strings with no formatting, examples, or lineage context.
No quality or SLA metadata. There is nowhere to express that a table is refreshed daily by 07:00 UTC, has a 99.9% completeness SLA, or failed its quality check last Tuesday.
No access request workflow. A user who discovers a restricted dataset in Glue has no way to request access from within the catalogue.

For small data teams with a limited number of datasets, Glue Data Catalog supplemented by dbt docs is often sufficient. For larger organisations, a purpose-built catalogue tool running on top of Glue is needed.

dbt Docs as a Lightweight Catalogue

If your transformation layer uses dbt, dbt’s built-in documentation system is a powerful, low-overhead addition to Glue Data Catalog. dbt docs generates a static website from your dbt project that includes:

Full schema documentation for every model (table), including column descriptions from schema.yml
A lineage DAG showing how each model is derived from its upstream sources
The tests defined on each model and column, showing which quality checks apply (pass/fail results come from dbt test runs, not from the docs site itself)
Model metadata including tags and custom meta fields such as owner

The dbt docs site is automatically kept in sync with the data models — if a developer adds a column to a model and documents it in schema.yml, it appears in the docs at the next dbt docs generate run. This is far more reliable than manually maintained documentation.

A well-documented dbt schema.yml file is the key investment:

models:
  - name: dim_customers
    description: >
      Canonical customer dimension. One row per customer.
      Source of truth for customer attributes across all analytical models.
      Do not join directly to the production CRM database; use this table instead.
    meta:
      owner: "customer-data-team"
      slack_channel: "#data-customer"
      sla: "Updated daily by 07:00 UTC"
      classification: "Confidential — contains PII"
      data_product: true
    columns:
      - name: customer_id
        description: "Surrogate integer key. Never null. Stable across updates."
        tests:
          - not_null
          - unique
      - name: email
        description: >
          Customer email address as provided at registration.
          May be null for legacy accounts created before email was required.
          PII — do not export without data owner approval.
        # No not_null test: the description allows nulls for legacy accounts.
      - name: plan_type
        description: "Current subscription plan. One of: free, starter, pro, enterprise."
        tests:
          - accepted_values:
              values: ['free', 'starter', 'pro', 'enterprise']
      - name: ltv_cad
        description: >
          Cumulative revenue attributed to this customer in CAD, calculated from
          fact_payments. Updated on each dbt run. Used for customer segmentation
          and churn analysis. See /blog/cohort-analysis-sql-aws for usage patterns.

The description fields are where you add genuine business context — not just “customer ID” but what it is, why it exists, and how to use it correctly. This is the metadata that saves analysts hours of investigation.
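One way to keep description coverage from slipping is to check it in CI. The sketch below walks the `nodes` section of a dbt `manifest.json` (written to the `target/` directory by dbt) and lists models and declared columns with empty descriptions. The function name and the sample data are illustrative assumptions; columns that are not declared in schema.yml at all do not appear in the manifest, so this check will not catch them.

```python
import json
from pathlib import Path


def undocumented(manifest: dict) -> list[str]:
    """List models and declared columns whose description is empty."""
    gaps = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        model = node["name"]
        if not (node.get("description") or "").strip():
            gaps.append(model)
        for col_name, col in node.get("columns", {}).items():
            if not (col.get("description") or "").strip():
                gaps.append(f"{model}.{col_name}")
    return gaps


# Hand-written sample in the manifest's shape; a real file has many more keys.
sample = {
    "nodes": {
        "model.analytics.dim_customers": {
            "resource_type": "model",
            "name": "dim_customers",
            "description": "Canonical customer dimension.",
            "columns": {
                "customer_id": {"name": "customer_id", "description": "Surrogate key."},
                "ltv_cad": {"name": "ltv_cad", "description": ""},
            },
        },
        "model.analytics.stg_orders": {
            "resource_type": "model",
            "name": "stg_orders",
            "description": "",
            "columns": {},
        },
        "test.analytics.not_null_dim_customers_customer_id": {
            "resource_type": "test",
            "name": "not_null_dim_customers_customer_id",
            "description": "",
            "columns": {},
        },
    }
}
print(undocumented(sample))  # ['dim_customers.ltv_cad', 'stg_orders']
```

Against a real project, the input would be `json.loads(Path("target/manifest.json").read_text())`, and a non-empty result could fail the build.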

Choosing a Catalogue Tool for Larger Organisations

When dbt docs is insufficient — typically when you have more than 20–30 data producers, mixed source systems beyond dbt, or a need for access request workflows — a dedicated catalogue tool is warranted. The leading options for AWS-native deployments:

AWS Glue Data Catalog + third-party UI. Tools like Atlan, Alation, or DataHub can run on or alongside AWS and ingest metadata from Glue Data Catalog, which remains the technical metastore underneath. They add the search, lineage visualisation, data quality integration, and business glossary capabilities that Glue’s native UI lacks. DataHub is open source and can be self-hosted on ECS or Kubernetes.

Amazon DataZone is AWS’s purpose-built data governance and catalogue service, announced in late 2022 and generally available since October 2023. It provides a business-friendly portal for data discovery, subscription-based access requests, and integration with Glue, Redshift, S3, and Athena. For organisations looking for a fully managed, AWS-native catalogue without open-source self-hosting overhead, DataZone is the emerging default recommendation.

Regardless of tool choice, the architecture principle is the same: automated metadata capture at the infrastructure layer, supplemented by human-authored business context.
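The principle above can be sketched as a merge of two inputs: a technical schema captured automatically, and a set of human-written descriptions. The dictionary shapes and the `merge_metadata` function are assumptions made for illustration, not the format of Glue or any catalogue tool.

```python
def merge_metadata(technical: dict, business: dict) -> dict:
    """Combine an automatically captured schema with curated descriptions.

    Column names and types always come from the technical side, so the entry
    cannot drift from the real table. Descriptions come from the business side.
    """
    curated = business.get("columns", {})
    columns = [
        {
            "name": col["name"],
            "type": col["type"],
            "description": curated.get(col["name"], ""),
        }
        for col in technical["columns"]
    ]
    # Descriptions for columns that no longer exist signal stale curation.
    live_names = {col["name"] for col in technical["columns"]}
    orphaned = sorted(set(curated) - live_names)
    return {
        "table": technical["table"],
        "description": business.get("description", ""),
        "columns": columns,
        "orphaned_descriptions": orphaned,
    }


technical = {
    "table": "dim_customers",
    "columns": [
        {"name": "customer_id", "type": "bigint"},
        {"name": "email", "type": "string"},
        {"name": "plan_type", "type": "string"},
    ],
}
business = {
    "description": "Canonical customer dimension. One row per customer.",
    "columns": {
        "customer_id": "Surrogate integer key. Never null.",
        "email": "Customer email address as provided at registration.",
        "legacy_segment": "Segment code from the old CRM.",
    },
}
entry = merge_metadata(technical, business)
print(entry["orphaned_descriptions"])  # ['legacy_segment']
```

Here `plan_type` surfaces with an empty description and `legacy_segment` is reported as orphaned, which gives the curation process a concrete worklist.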

Metadata That Actually Gets Used

The difference between a catalogue that analysts consult daily and one that they ignore is the quality of the metadata. Metadata that gets used has these characteristics:

Written for consumers, not producers. A table description written by the engineer who built the pipeline reads like: “Fact table joining orders_raw to customers_dim via customer_uuid”. A description written for a data consumer reads like: “One row per paid order. Use this table to calculate revenue by product, geography, or customer segment. For returns and refunds, see fact_refunds which joins on order_id.”

Includes examples. Column descriptions that include example values (“e.g., ‘CA-ON’, ‘UK-ENG’, ‘NG-LA’”) are dramatically more useful than abstract descriptions. Sample queries in the table description show consumers how to start without writing queries from scratch.

Acknowledges limitations. A catalogue entry that says “Historical data before 2021-03-15 may have gaps due to a migration from the legacy system. Verify totals against finance system of record for pre-2021 periods” is more trustworthy than one that says nothing about data quality history.

Is searchable by business term. A business glossary that maps business terms (“recurring revenue”, “active customer”, “churn rate”) to the specific tables and columns that implement them — including the business rule — is one of the highest-value catalogue additions you can make.
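A business glossary can start as a very small data structure. The sketch below maps a term to the table, column, and business rule that implement it. Every table name, column name, and rule here is a made-up example; real definitions have to come from the business owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    term: str
    definition: str
    table: str    # where the term is implemented
    column: str
    rule: str     # the business rule, stated so an analyst can reproduce it


# Hypothetical entries for illustration only.
GLOSSARY = [
    GlossaryTerm(
        term="active customer",
        definition="A customer with at least one paid order in the last 90 days.",
        table="dim_customers",
        column="is_active",
        rule="max(order_date) >= current_date - 90 days",
    ),
    GlossaryTerm(
        term="churn rate",
        definition="Share of customers active last month who are not active this month.",
        table="fact_customer_monthly",
        column="churned_flag",
        rule="sum(churned_flag) / count(*) per calendar month",
    ),
]


def lookup(query: str, glossary=GLOSSARY) -> list[GlossaryTerm]:
    """Case-insensitive substring match on the business term."""
    needle = query.strip().lower()
    return [t for t in glossary if needle in t.term.lower()]


for hit in lookup("Churn"):
    print(f"{hit.term}: {hit.table}.{hit.column} ({hit.rule})")
```

Storing the rule next to the location is the point: it lets a consumer confirm that "churn rate" in a dashboard means the same thing as "churn rate" in the table.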

Keeping the Catalogue Current

A data catalogue that is accurate today but stale in six months is a broken catalogue. Maintaining currency requires:

Automated schema sync. Glue Crawlers run on schedule; dbt docs regenerates on every pipeline run. Schema changes are captured automatically.
Pipeline metadata emission. Every Glue or Airflow job emits a completion event to a central metadata log: job name, run ID, tables read and written, row counts, quality check results. This keeps the “last updated” and “quality status” metadata current without manual intervention.
Stewardship reviews. Quarterly review by data stewards of the 20 most-queried datasets: are descriptions still accurate? Are example queries still valid? Have SLAs changed?
Ownership accountability. When a dataset’s owner leaves the organisation, their datasets are flagged for reassignment. An unowned dataset is a governance risk.
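The pipeline metadata emission described above can be sketched with an append-only log of JSON lines. The event fields mirror the ones listed (job name, run ID, tables read and written, row counts, quality check results); the class, function names, and the in-memory sink are assumptions, and a real deployment would write to a durable store instead.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class JobCompletionEvent:
    job_name: str
    run_id: str
    tables_read: list[str]
    tables_written: list[str]
    row_counts: dict[str, int]
    quality_checks: dict[str, bool]
    completed_at: str = field(default_factory=_utc_now)


def emit(event: JobCompletionEvent, sink) -> None:
    """Append one event as a JSON line to any file-like sink."""
    sink.write(json.dumps(asdict(event)) + "\n")


def latest_event_for(log_text: str, table: str):
    """Return the most recent event that wrote to `table`, or None."""
    latest = None
    for line in log_text.splitlines():
        event = json.loads(line)
        if table in event["tables_written"]:
            latest = event  # the log is append-only, so later lines are newer
    return latest


log = io.StringIO()  # stand-in for the central metadata log
emit(JobCompletionEvent(
    job_name="build_dim_customers", run_id="run-001",
    tables_read=["raw.customers"], tables_written=["analytics.dim_customers"],
    row_counts={"analytics.dim_customers": 1200},
    quality_checks={"not_null_customer_id": True, "unique_customer_id": True},
), log)
emit(JobCompletionEvent(
    job_name="build_dim_customers", run_id="run-002",
    tables_read=["raw.customers"], tables_written=["analytics.dim_customers"],
    row_counts={"analytics.dim_customers": 1215},
    quality_checks={"not_null_customer_id": True, "unique_customer_id": False},
), log)

status = latest_event_for(log.getvalue(), "analytics.dim_customers")
print(status["run_id"], all(status["quality_checks"].values()))  # run-002 False
```

The catalogue can then derive "last updated" from `completed_at` and "quality status" from `quality_checks` without anyone editing the entry by hand.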

Connecting Catalogue to Governance

The data catalogue and the data governance framework are complementary. The catalogue makes data discoverable; governance controls access and ensures quality. The post on building a data governance framework covers the access control and quality enforcement side; the catalogue is where policies are made visible to consumers — data classification labels, access request processes, and quality SLAs should all be surfaced in the catalogue entry for each dataset.

A well-implemented catalogue entry for a restricted dataset should tell a potential consumer: what the data contains, who owns it, how to request access, what the expected processing time is, and what the approved use cases are. This replaces informal Slack messages with a self-service process that is auditable and consistent.

Conclusion

A data catalogue that works is not a project you complete once; it is an operational practice you maintain continuously. The technical foundation — Glue Data Catalog, dbt docs or a purpose-built tool, automated metadata capture — is the easy part. The cultural commitment to writing quality descriptions, reviewing metadata regularly, and routing access requests through the catalogue is harder, but it is what separates catalogues that improve data productivity from ones that gather dust.

If your organisation needs to implement a data catalogue that teams will actually use — from technical architecture to metadata standards and stewardship processes — contact Infra IT Consulting. We have built catalogues for data teams across Canada, the UK, and Africa and can help you avoid the common pitfalls.
