
Data as a Product: Building Internal Data Products That Teams Actually Use

By Infra IT Consulting · 9 min read

Most data teams have experienced the same frustrating pattern. An analyst builds a dashboard in response to a stakeholder request. The stakeholder uses it for a few weeks, then discovers that the numbers do not match another report they received from a different team. They lose trust in the data. The dashboard is abandoned. The analyst builds another one for the next request, and the cycle repeats. Six months later, the organisation has a dozen dashboards that contradict each other, nobody knows which to trust, and data is blamed rather than the process that created it.

The root cause is almost never a technical problem. It is an organisational problem: data was treated as a by-product of engineering work rather than a product in its own right, with an owner, defined consumers, quality standards, and a support model. The Data as a Product framework addresses exactly this.

What “Data as a Product” Actually Means

The phrase comes from the data mesh architecture pattern introduced by Zhamak Dehghani, but the principle applies whether or not your organisation adopts the full data mesh model. Treating data as a product means applying the same discipline to internal data assets that a software product team applies to a customer-facing product:

  • Defined ownership — a specific team is responsible for a data asset, not just whoever built it originally
  • Known consumers — the team understands who uses the data, for what purpose, and what their requirements are
  • Quality standards and SLAs — freshness, completeness, and accuracy are measurable and monitored, and the owning team is accountable when they degrade
  • Discoverability — consumers can find the data asset, understand what it contains, and access it without asking the owning team for help
  • Versioning and change management — schema changes are communicated in advance, backward compatibility is maintained or migration paths are provided

This is a significant shift for most data teams, who are accustomed to building pipelines and dashboards on request and moving to the next project. It requires investment in tooling, process, and organisational culture.

The Internal Data Product Taxonomy

Not all data assets are products. A raw S3 dump of a third-party API response is not a data product — it is raw material. A staging table that only exists to feed a downstream model is an implementation detail. A data product is an asset that is consumed directly by humans or systems outside the team that owns it, and that carries a commitment to quality and availability.

Internal data products typically fall into three categories:

Source-aligned products — canonical representations of a specific business domain’s data, owned by the team closest to that domain. The Customer team owns the dim_customers table and the fact_customer_events table. The Finance team owns fact_journal_entries. These are the authoritative sources that other teams join against.

Aggregate products — derived datasets that combine multiple source-aligned products to answer cross-domain questions. A fact_order_revenue_attribution table that joins orders, customers, marketing touchpoints, and financial data is owned by a central analytics team because no single domain team has full context.

Analytical products — dashboards, reports, and notebooks that present data to decision-makers. These are products too, with defined consumers, refresh schedules, and quality standards.

Defining and Enforcing Data Product SLAs

A data product without a defined SLA will eventually fail its consumers, and nobody will be accountable when it does. SLAs for internal data products should specify:

  • Freshness — how old can the data be when a consumer queries it? (e.g., “updated by 07:00 UTC daily”)
  • Completeness — what percentage of expected records must be present? (e.g., “no more than 0.1% null values in required fields”)
  • Accuracy — how is correctness defined and measured? (e.g., “revenue totals reconcile to within 0.5% of finance system of record”)
  • Availability — what is the expected uptime of the serving layer?
  • Incident response — how quickly will the owning team acknowledge and resolve a data quality issue?

On AWS, SLA monitoring can be implemented through a combination of AWS Glue Data Quality checks, custom Lambda functions that run validation queries against Amazon Athena or Amazon Redshift after each pipeline run, and Amazon CloudWatch alarms that trigger when quality metrics fall below thresholds. The results are written to a central data quality table that feeds a reliability dashboard visible to data consumers and data product owners alike.
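
The SLA dimensions above can be reduced to a handful of comparisons once the raw measurements are available. The following is a minimal Python sketch of that evaluation step only, of the kind a validation function might run after querying the warehouse. The names `SlaThresholds` and `evaluate_sla`, and all the sample figures, are hypothetical and chosen for illustration; the queries that would produce the inputs are not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SlaThresholds:
    """Hypothetical SLA thresholds for one data product."""
    max_age: timedelta         # freshness: maximum allowed data age
    max_null_fraction: float   # completeness: 0.001 means 0.1%
    max_revenue_drift: float   # accuracy: 0.005 means 0.5%


def evaluate_sla(
    thresholds: SlaThresholds,
    last_loaded_at: datetime,
    now: datetime,
    null_count: int,
    row_count: int,
    product_revenue: float,
    finance_revenue: float,
) -> dict[str, bool]:
    """Return a pass/fail flag for each SLA dimension."""
    # An empty table counts as fully incomplete rather than dividing by zero.
    null_fraction = null_count / row_count if row_count else 1.0
    # Relative difference against the finance system of record.
    drift = (
        abs(product_revenue - finance_revenue) / abs(finance_revenue)
        if finance_revenue
        else float("inf")
    )
    return {
        "freshness": now - last_loaded_at <= thresholds.max_age,
        "completeness": null_fraction <= thresholds.max_null_fraction,
        "accuracy": drift <= thresholds.max_revenue_drift,
    }


# Illustrative values only: the last load finished 24.5 hours before the check.
thresholds = SlaThresholds(
    max_age=timedelta(hours=24),
    max_null_fraction=0.001,
    max_revenue_drift=0.005,
)
result = evaluate_sla(
    thresholds,
    last_loaded_at=datetime(2024, 1, 1, 6, 30, tzinfo=timezone.utc),
    now=datetime(2024, 1, 2, 7, 0, tzinfo=timezone.utc),
    null_count=5,
    row_count=10_000,
    product_revenue=100_400.0,
    finance_revenue=100_000.0,
)
failed = [name for name, ok in result.items() if not ok]
print(failed)  # ['freshness']
```

Returning one flag per dimension, rather than a single pass/fail, lets the reliability dashboard show which commitment was missed.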

A simple quality check pattern using dbt tests (running on Athena or Redshift):

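# dbt schema.yml for dim_customers data product
# The dbt_expectations.* tests below require the dbt_expectations package (declared in packages.yml)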
models:
  - name: dim_customers
    description: "Canonical customer dimension. Owner: Customer Data Team. SLA: updated by 07:00 UTC daily, 99.9% completeness on customer_id and email."
    meta:
      owner: "customer-data-team@company.com"
      sla_freshness_hours: 24
      data_product: true
    columns:
      - name: customer_id
        description: "Surrogate key. Never null."
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
      - name: first_order_date
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: "date '2015-01-01'"
              max_value: "current_date"
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 10000
          max_value: 10000000

When these tests run as part of the pipeline, failures are written to a central quality log and trigger alerts to the owning team — before consumers discover the problem themselves.

Making Data Products Discoverable

A data product that nobody can find is worthless. Discoverability requires a data catalogue that is accurate, searchable, and actually used. AWS Glue Data Catalog provides the metadata foundation — table schemas, partition information, and crawler-maintained freshness — but it lacks the human-readable context that makes a catalogue genuinely useful.

Practical discoverability requires layering additional context on top of the technical metadata:

  • Business descriptions — what does this table represent in business terms, not technical terms?
  • Usage examples — sample queries that show common consumption patterns
  • Lineage — what sources feed this product, and what products consume it?
  • Data classification — is this PII? Confidential? Freely shareable?
  • Contact information — who owns this, and how do I reach them if something is wrong?

Tools like AWS Glue Data Catalog combined with dbt docs, or a third-party catalogue like Atlan or DataHub deployed on AWS, provide this richer metadata layer. The post on data catalog best practices covers the specific AWS configuration and tooling choices in detail.
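
One lightweight way to keep this context from going missing is to treat it as required fields on a catalogue entry and report the gaps. The sketch below is a hypothetical illustration in Python: `CatalogueEntry` and `missing_context` are invented names, not part of AWS Glue Data Catalog, dbt, Atlan, or DataHub, and the field set simply mirrors the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class CatalogueEntry:
    """Hypothetical human-readable context layered on technical metadata."""
    table_name: str
    business_description: str = ""
    usage_examples: list[str] = field(default_factory=list)
    upstream_sources: list[str] = field(default_factory=list)  # lineage
    classification: str = ""   # e.g. "PII", "Confidential", "Public"
    owner_contact: str = ""


def missing_context(entry: CatalogueEntry) -> list[str]:
    """List the fields that are still empty for this entry."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


entry = CatalogueEntry(
    table_name="dim_customers",
    business_description="One row per customer, current attributes only.",
    upstream_sources=["crm_export"],  # illustrative source name
    classification="PII",
    owner_contact="customer-data-team@company.com",
)
print(missing_context(entry))  # ['usage_examples']
```

A check like this could run in CI so that a data product cannot be published while its entry is incomplete.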

Ownership Without Bureaucracy

The biggest risk in implementing a Data as a Product model is creating bureaucracy that slows down the data team without improving quality. Ownership should be lightweight in process, heavy in accountability. A few principles that work in practice:

Ownership follows domain knowledge, not data location. The team that understands the business logic owns the data product, even if the data physically lives in a shared data lake. The logistics team owns delivery event data because they understand what each event type means; the central data team should not be the owner just because they built the pipeline.

Owners are empowered, not just accountable. An owning team must have the technical ability to fix quality issues when they occur. If they have to raise a ticket with a central platform team to update a Glue job, ownership is nominal. Give domain teams the tools and permissions to manage their own pipelines within guardrails set by the platform team.

Automate the contract. Data contracts (machine-readable specifications of a data product’s schema, quality thresholds, and SLA commitments) can be validated automatically at pipeline runtime. When a source system changes in a way that breaks a downstream data product’s contract, validation fails loudly instead of silently passing corrupted data to downstream consumers. See the post on data contracts for implementation patterns on AWS.
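
As a rough illustration of failing loudly, the sketch below validates a batch of rows against a contract before it is published. It is a minimal Python example under stated assumptions: `DataContract`, `ContractViolation`, and `validate_batch` are hypothetical names, the contract covers only required columns, their types, and a minimum row count, and real contract tooling would cover far more.

```python
from dataclasses import dataclass


@dataclass
class DataContract:
    """Hypothetical machine-readable contract for one data product."""
    required_columns: dict[str, type]  # column name -> expected Python type
    min_rows: int


class ContractViolation(Exception):
    """Raised when a batch does not satisfy its contract."""


def validate_batch(contract: DataContract, rows: list[dict]) -> None:
    """Raise ContractViolation listing every problem found in the batch."""
    problems = []
    if len(rows) < contract.min_rows:
        problems.append(f"expected at least {contract.min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for column, expected in contract.required_columns.items():
            value = row.get(column)
            if value is None:
                problems.append(f"row {i}: {column} is missing")
            elif not isinstance(value, expected):
                problems.append(
                    f"row {i}: {column} is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    if problems:
        raise ContractViolation("; ".join(problems))


contract = DataContract(required_columns={"customer_id": int, "email": str}, min_rows=1)

# A conforming batch passes silently.
validate_batch(contract, [{"customer_id": 1, "email": "a@example.com"}])

# The source system starts sending customer_id as a string: the run stops here.
try:
    validate_batch(contract, [{"customer_id": "2", "email": "b@example.com"}])
except ContractViolation as exc:
    violation = str(exc)
    print(violation)  # row 0: customer_id is str, expected int
```

Collecting every problem before raising gives the owning team one complete error report instead of a fix-and-rerun loop.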

Measuring Data Product Health

How do you know if your Data as a Product initiative is working? Track these metrics:

  • Time to trust — how long does it take for a new consumer to go from discovering a data product to using it confidently in a production context?
  • Incident rate — how many data quality incidents per month, broken down by owning team?
  • Mean time to resolution — how long from incident detection to resolution?
  • Consumer satisfaction — periodic surveys of data consumers on quality, freshness, and discoverability
  • Self-service ratio — what percentage of data questions are answered without direct involvement from the data team?

These metrics require investment in tooling and process, but they give an objective basis for assessing whether treating data as a product is improving outcomes for the organisation.
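
Several of these metrics are simple arithmetic once incidents and data questions are logged consistently. The sketch below shows one possible calculation in Python; the `Incident` record, the function names, and all the sample numbers are hypothetical and assume incidents are logged with a detection and a resolution timestamp.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical record from a central data quality incident log."""
    owning_team: str
    detected_at: datetime
    resolved_at: datetime


def incidents_by_team(incidents: list[Incident]) -> Counter:
    """Incident count broken down by owning team."""
    return Counter(i.owning_team for i in incidents)


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average time from detection to resolution."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def self_service_ratio(self_served: int, total_questions: int) -> float:
    """Share of data questions answered without the data team."""
    return self_served / total_questions if total_questions else 0.0


start = datetime(2024, 3, 1, 9, 0)
incidents = [
    Incident("customer-data", start, start + timedelta(hours=2)),
    Incident("customer-data", start, start + timedelta(hours=4)),
    Incident("finance", start, start + timedelta(hours=6)),
]
print(incidents_by_team(incidents)["customer-data"])  # 2
print(mean_time_to_resolution(incidents))             # 4:00:00
print(self_service_ratio(30, 40))                     # 0.75
```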

Conclusion

Treating data as a product is not a technology initiative; it is an organisational design decision backed by technology. The technical components — SLA monitoring, data contracts, catalogue tooling, quality checks — are the easier part. The harder part is convincing domain teams to accept ownership of data quality, and data teams to shift from order-takers to platform providers.

The payoff is significant: organisations that successfully implement Data as a Product models tend to see higher data trust, faster time-to-insight, and far fewer political battles over whose numbers are correct. It is the foundation of genuine data democratisation.

If your organisation is struggling with data quality, trust, or discoverability, contact Infra IT Consulting for a data product maturity assessment and a practical roadmap to improvement.
