Infra IT Consulting logo Infra ITC
Data Architecture & Strategy star-schemadata-vaultmodelling

Star Schema vs. Data Vault: Picking the Right Modelling Approach

By Infra IT Consulting Β· Β· 9 min read

The choice of data modelling methodology shapes every downstream decision in your analytics platform β€” query performance, change management costs, auditability, and how quickly your team can respond to new business requirements. Two approaches dominate enterprise data warehousing: the dimensional model (most commonly implemented as star schema) and Data Vault 2.0. Both are mature, both are production-proven at scale, and both are capable of running effectively on Amazon Redshift and the broader AWS analytics stack.

What they are not is interchangeable. Picking the wrong one for your context creates technical debt that compounds over years. This post lays out the real trade-offs so engineering leaders and data architects can make a deliberate decision rather than defaulting to whichever pattern someone on the team learned at their last job.

What Star Schema Actually Optimises For

Ralph Kimball’s dimensional modelling approach, formalised in the 1990s, was designed around a specific problem: making analytical queries intuitive and fast for business users. The star schema achieves this through denormalisation. A central fact table β€” storing measurable events like sales transactions, web sessions, or insurance claims β€” is surrounded by dimension tables that describe the who, what, where, and when of those events.

On Amazon Redshift, a well-constructed star schema with appropriate distribution keys, sort keys, and column encoding routinely delivers sub-second query times on billions of rows. The join structure is predictable: one fact table, a small number of dimension lookups, and the query engine can plan efficiently. BI tools like Amazon QuickSight, Tableau, and Looker are purpose-built to query star schemas. Most generate SQL that assumes this structure.

-- Typical star schema query on Redshift
SELECT
    d_product.category,
    d_date.fiscal_quarter,
    d_geography.province,
    SUM(f_sales.gross_revenue) AS total_revenue,
    COUNT(DISTINCT f_sales.customer_key) AS unique_customers
FROM analytics.fact_sales f_sales
JOIN analytics.dim_product d_product ON f_sales.product_key = d_product.product_key
JOIN analytics.dim_date d_date ON f_sales.date_key = d_date.date_key
JOIN analytics.dim_geography d_geography ON f_sales.geo_key = d_geography.geo_key
WHERE d_date.fiscal_year = 2024
  AND d_geography.country = 'Canada'
GROUP BY 1, 2, 3
ORDER BY total_revenue DESC;

This query is readable, performant, and maps directly to how business analysts think about reporting. That is the core strength of dimensional modelling.

The limitation is adaptability. When business definitions change β€” and they always do β€” star schemas require structural changes that ripple through ETL pipelines, BI layer semantic models, and downstream reports. Adding a new attribute to a dimension or splitting a fact table is a coordinated effort across multiple layers. Historically tracking changes (slowly changing dimensions, or SCDs) adds complexity that many teams underestimate at design time.

What Data Vault Optimises For

Data Vault 2.0, developed by Dan Linstedt, makes a different trade-off. It prioritises auditability, flexibility in the face of changing source systems, and the ability to integrate data from many heterogeneous sources without upfront agreement on business rules. The modelling pattern uses three core entity types:

  • Hubs: store unique business keys (e.g., customer ID, product SKU) with a load timestamp and record source. No descriptive attributes.
  • Links: store relationships between hubs (e.g., a customer placing an order), also with load timestamps and record sources.
  • Satellites: store the descriptive context for hubs and links (e.g., customer name, address, order status), partitioned by source system and change rate.

Every record in a Data Vault carries a load timestamp and a record source. This makes the raw vault a complete, auditable historical archive β€” you can reconstruct the state of any entity as seen by any source system at any point in time. For organisations subject to FINTRAC reporting, PIPEDA compliance, or GDPR-equivalent obligations, this audit trail is a significant regulatory advantage.

The flexibility benefit is real. Adding a new source system that provides additional attributes about an existing customer hub means adding a new satellite. Existing hubs, links, and satellites are untouched. There is no ETL refactoring, no coordination with downstream teams, no regression risk to existing reports. This architectural agility is why Data Vault adoption tends to be highest in financial services, insurance, and telecommunications β€” industries where source system churn is constant and regulatory scrutiny is high.

The cost is query complexity. Business intelligence on top of a raw vault requires an information mart layer β€” typically implemented as star schema views or materialised tables derived from the vault. Without this mart layer, analysts face joins across hubs, links, and multiple satellites just to answer basic questions. The double-architecture (raw vault plus information mart) means more engineering surface area to maintain.

When to Choose Star Schema

Star schema is the right choice when:

  • Your data sources are relatively stable and well-understood
  • Your primary consumers are BI analysts and business users, not data scientists or operational systems
  • You need to deliver reporting value quickly β€” dimensional modelling has a shorter time-to-value than Data Vault for most reporting use cases
  • Your team has SQL expertise but limited experience with Data Vault modelling patterns
  • Your organisation runs a modern data stack with dbt, where Kimball-style dimensional modelling integrates naturally

A well-run star schema implementation on Amazon Redshift with dbt for transformation management is a proven, high-performance architecture. Using Redshift’s materialised views to pre-aggregate common report dimensions can further compress query times for large fact tables.

When to Choose Data Vault

Data Vault earns its complexity premium when:

  • You are integrating five or more source systems with different business key conventions
  • Source systems change frequently β€” acquisitions, legacy replacements, vendor swaps
  • Regulatory requirements demand a full audit trail of when data was loaded and from where
  • You want to separate the integration layer (raw vault) from the business interpretation layer (information marts) so that changing business rules does not require reloading historical data
  • Your data platform maturity is high enough to support the tooling and process overhead that Data Vault requires

On AWS, Data Vault implementations commonly use AWS Glue for the Hub/Link/Satellite load patterns, Amazon S3 as the underlying storage layer, and Amazon Redshift as the query engine. Tools like dbt-vault (now AutomateDV) provide pre-built macros for the standard load patterns, significantly reducing the boilerplate involved in implementing the methodology.

The Hybrid Reality: Raw Vault Plus Information Marts

In practice, most mature data platforms that choose Data Vault do not abandon dimensional modelling entirely. They use Data Vault for the integration and historical archiving layer, then derive star-schema-style information marts from it for BI consumption. This architecture separates concerns cleanly:

  • The raw vault absorbs the chaos of integration β€” schema changes, source system oddities, late-arriving data β€” without propagating that chaos to analysts
  • The information mart layer enforces business rules, conforms dimensions across sources, and delivers the query performance that BI tools need
  • The transformation logic that converts raw vault to information mart is where business definitions live β€” and when those definitions change, only the mart derivation logic changes, not the underlying vault

This hybrid approach represents the highest-maturity pattern but also the highest engineering investment. It is appropriate for organisations with dedicated data engineering teams and significant data integration complexity. For a 10-person startup or a single-domain analytics use case, a well-implemented star schema in Redshift with dbt will outperform a poorly maintained Data Vault every time.

Practical Recommendations for AWS-Based Teams

For teams building on Amazon Redshift, the distribution strategy amplifies the impact of your modelling choice. Star schema fact tables should use DISTKEY on the most common join key and SORTKEY on the date column used in range filters. Dimension tables that are small enough should use DISTSTYLE ALL to eliminate network shuffles on joins.

For Data Vault on Redshift, Hub and Link tables benefit from DISTSTYLE KEY on the business key hash, ensuring that records for the same entity land on the same compute node. Satellite tables should be co-located with their parent Hub or Link using matching DISTKEY values.

Whichever approach you choose, invest in documenting the business rules that drove your design decisions. The most common failure mode in data modelling is not picking the wrong pattern β€” it is failing to record why certain choices were made, leaving the next team to guess at intent when requirements change.

Choosing Your Path Forward

Star schema and Data Vault are both legitimate answers to data warehousing β€” they answer different questions. If your primary question is β€œhow do we deliver reliable, fast reporting to business users,” dimensional modelling is the more direct path. If your primary question is β€œhow do we integrate many changing sources with full auditability,” Data Vault provides structural advantages that are difficult to retrofit later.

Most organisations benefit from a conversation with an experienced data architect before committing to either approach at scale. The modelling decision is far less reversible than it appears at the start of a project. Infra IT Consulting works with Canadian and international organisations to design and implement data warehousing strategies that fit both current requirements and realistic growth trajectories. Reach out to discuss your architecture before locking in a direction.

Related posts