Master Data Management on AWS: Strategies and Tools

By Infra IT Consulting · March 19, 2024 · 10 min read

Most data quality problems trace back to the same root cause: the same real-world entity (a customer, a product, a supplier) exists under multiple representations across multiple systems. Your CRM records the customer as “Acme Corp.” Your ERP records it as “ACME Corporation Ltd.” Your e-commerce platform records it as “acme-corp.” When your analytics team tries to calculate customer lifetime value, they are working with three separate entities that are actually one. The result is an inflated customer count, revenue fragmented across three partial records, broken attribution, and reports that no one trusts.

Master Data Management (MDM) is the practice of creating a single authoritative record — a “golden record” — for each entity, then making that record available to all consuming systems. On AWS, a combination of AWS Entity Resolution, AWS Lake Formation, Amazon Redshift, and well-designed data pipeline patterns gives you the building blocks for a production-grade MDM capability without requiring a dedicated MDM appliance from a legacy vendor.

The Four MDM Patterns and When to Use Each

MDM implementations fall into four architectural patterns, and choosing the wrong one creates more problems than it solves:

Registry MDM (virtual): A central registry holds the authoritative identifier and the mapping to source system IDs, but all attribute data remains in source systems. Queries join through the registry at read time. This pattern has the lowest implementation cost and the lowest data duplication, but it creates query latency and dependency on source system availability for every analytical query.

Consolidation MDM: Source system data is periodically ingested into a central MDM repository. The MDM system performs matching and creates golden records. Golden records are not pushed back to source systems — they exist only in the MDM layer for analytical use. This is the most common pattern for analytics-focused MDM on AWS and is what most data engineering teams should start with.

Coexistence MDM: Golden records are created centrally and distributed back to source systems, which can maintain their own local copies. Requires bidirectional integration with every source system and sophisticated conflict resolution when source systems modify their copies.

Centralised MDM: A single MDM system becomes the authoritative record for all domains. All source systems read from and write to the MDM system. Operationally the cleanest pattern but requires significant organisational change management and system integration work. Rarely appropriate outside of greenfield implementations.

For most AWS data platform teams, Consolidation MDM is the right starting point. You build it incrementally, you do not need to modify source systems, and the golden records produced are immediately usable by your analytics layer.
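To make the difference between the registry and consolidation patterns concrete, here is a minimal in-memory sketch. The record IDs, field values, and the "first non-null wins" merge are all hypothetical stand-ins; a real consolidation step would apply explicit survivorship rules.

```python
# Hypothetical in-memory stand-ins for two source systems.
CRM = {"C-17": {"company_name": "Acme Corp.", "email": "ap@acme.example"}}
ERP = {"E-904": {"company_name": "ACME Corporation Ltd.", "email": None}}
SOURCES = {"crm": CRM, "erp": ERP}

# Registry pattern: the hub stores only the golden ID and the source-system keys.
REGISTRY = {"G-1": [("crm", "C-17"), ("erp", "E-904")]}


def registry_read(golden_id):
    """Registry MDM: fetch attributes from every source system at read time."""
    return [SOURCES[system][record_id] for system, record_id in REGISTRY[golden_id]]


def consolidate(golden_id):
    """Consolidation MDM: copy attributes into the hub ahead of any read."""
    merged = {}
    for record in registry_read(golden_id):
        for field, value in record.items():
            # Placeholder rule: keep the first non-null value seen.
            if merged.get(field) is None:
                merged[field] = value
    return merged


# Materialised golden records; later reads never touch the source systems.
GOLDEN = {"G-1": consolidate("G-1")}
```

In the registry version every read depends on the source systems being reachable, which is the latency and availability cost described above. The consolidated copy trades that for periodic refresh work.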

AWS Entity Resolution: The Core Matching Engine

AWS Entity Resolution (generally available since 2023) is the service that makes Consolidation MDM on AWS tractable without custom ML engineering. It provides probabilistic and rule-based entity matching across datasets stored in S3 or the Glue Data Catalog, producing match groups that represent the same real-world entity across sources.

Entity Resolution supports two matching workflows:

Rule-based matching: Define deterministic rules using field comparisons. Exact email match OR (exact company name match AND same postal code) → same entity. Predictable, auditable, but misses variants like “Acme Corp” vs “ACME Corporation Ltd.”

ML-based matching: AWS Entity Resolution uses a preconfigured, AWS-managed model to score record pairs and group likely matches, returning a confidence level alongside each match. Catches name variants, abbreviations, typos, and format differences that rule-based matching misses. You do not supply labelled training examples; match quality depends mainly on how cleanly the input fields (name, email, phone, address) are mapped in the schema.
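The example rule above (exact email match OR exact company name plus postal code) can be sketched in plain Python to show what deterministic matching does and where it falls short. This illustrates the logic only, not how the service implements it. The records and field names are hypothetical, and the pairwise comparison is quadratic, so it suits small samples only.

```python
from collections import defaultdict


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def rule_match(a, b):
    """Exact email match OR (exact company name AND same postal code)."""
    same_email = a["email"] is not None and a["email"] == b["email"]
    same_company_and_postal = (
        a["company_name"] is not None
        and a["company_name"] == b["company_name"]
        and a["postal_code"] == b["postal_code"]
    )
    return same_email or same_company_and_postal


def match_groups(records):
    """Group record IDs that are linked directly or transitively by the rule."""
    parent = list(range(len(records)))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if rule_match(records[i], records[j]):
                parent[find(parent, i)] = find(parent, j)
    groups = defaultdict(list)
    for i, record in enumerate(records):
        groups[find(parent, i)].append(record["id"])
    return sorted(groups.values())


RECORDS = [
    {"id": "crm-1", "email": "ap@acme.example", "company_name": "Acme Corp.", "postal_code": "M5V 2T6"},
    {"id": "erp-7", "email": "ap@acme.example", "company_name": "ACME Corporation Ltd.", "postal_code": "M5V 2T6"},
    {"id": "web-3", "email": None, "company_name": "acme-corp", "postal_code": "M5V 2T6"},
    {"id": "crm-2", "email": None, "company_name": "Globex", "postal_code": "H2Y 1C6"},
    {"id": "erp-9", "email": None, "company_name": "Globex", "postal_code": "H2Y 1C6"},
]

GROUPS = match_groups(RECORDS)
```

With this sample, the CRM and ERP Acme records link through their shared email, but the e-commerce record stays in a group of its own because neither clause of the rule fires. That is the kind of variant ML-based matching is meant to catch.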

import boto3

er_client = boto3.client('entityresolution', region_name='ca-central-1')

# Create a matching workflow for customer entity resolution
response = er_client.create_matching_workflow(
    workflowName='customer-golden-record-matching',
    description='Match customer records across CRM, ERP, and e-commerce',
    inputSourceConfig=[
        {
            'inputSourceARN': 'arn:aws:glue:ca-central-1:123456789:table/raw_db/crm_customers',
            'schemaName': 'customer-schema',
            'applyNormalization': True
        },
        {
            'inputSourceARN': 'arn:aws:glue:ca-central-1:123456789:table/raw_db/erp_accounts',
            'schemaName': 'customer-schema',
            'applyNormalization': True
        },
        {
            'inputSourceARN': 'arn:aws:glue:ca-central-1:123456789:table/raw_db/ecommerce_users',
            'schemaName': 'customer-schema',
            'applyNormalization': True
        }
    ],
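    outputSourceConfig=[
        {
            'outputS3Path': 's3://data-lake-prod/mdm/customer-matches/',
            'KMSArn': 'arn:aws:kms:ca-central-1:123456789:key/...',
            # 'output' is required; the names must match fields defined in
            # 'customer-schema'. The two below are assumed for illustration.
            'output': [
                {'name': 'email'},
                {'name': 'company_name'}
            ],
            'applyNormalization': False
        }
    ],
    resolutionTechniques={
        # ML matching uses the AWS-managed model; no endpoint is configured
        'resolutionType': 'ML_MATCHING'
    },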
    roleArn='arn:aws:iam::123456789:role/EntityResolutionRole'
)

# Start the matching job
er_client.start_matching_job(workflowName='customer-golden-record-matching')

The output is a match group file in S3: each row maps a source record ID (from CRM, ERP, or e-commerce) to a matchId that represents the golden entity. Three rows from three different systems with the same matchId are the same customer.
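As a rough sketch of how that output might be consumed, the snippet below groups rows by match ID. The column names (`InputSourceARN`, `RecordId`, `MatchID`), the record IDs, and the convention of deriving the source system from the Glue table name are assumptions for illustration; check the header of your own output files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of a match output file (column names are assumed).
SAMPLE_OUTPUT = """InputSourceARN,RecordId,MatchID
arn:aws:glue:ca-central-1:123456789:table/raw_db/crm_customers,C-17,m-001
arn:aws:glue:ca-central-1:123456789:table/raw_db/erp_accounts,E-904,m-001
arn:aws:glue:ca-central-1:123456789:table/raw_db/ecommerce_users,U-5512,m-001
arn:aws:glue:ca-central-1:123456789:table/raw_db/crm_customers,C-18,m-002
"""


def source_system(arn):
    """Derive a short source label from the Glue table name, e.g. crm_customers -> crm."""
    table_name = arn.rsplit("/", 1)[-1]
    return table_name.split("_", 1)[0]


def load_match_groups(text):
    """Return {match_id: [(source_system, record_id), ...]}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["MatchID"]].append(
            (source_system(row["InputSourceARN"]), row["RecordId"])
        )
    return dict(groups)


MATCH_GROUPS = load_match_groups(SAMPLE_OUTPUT)
```

Here `m-001` collects one record from each of the three systems, while `m-002` is a singleton: a CRM record with no counterpart elsewhere.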

Building the Golden Record Layer in Amazon Redshift

The match group output from Entity Resolution is the foundation. The golden record itself — the single authoritative representation of the entity — requires a consolidation step where you decide which source system provides each attribute value when multiple sources disagree.

This is the survivorship logic problem, and it needs explicit business rules:

Email address: prefer CRM over ERP over e-commerce (CRM is more carefully maintained)
Company name: prefer ERP (finance system, most legally accurate)
Phone number: prefer most recently updated record across all sources
Postal code: prefer ERP over CRM (billing address is more accurate than contact address)
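It can help to prototype rules like these in a few lines of Python and walk business owners through concrete cases before committing them to SQL. The sketch below encodes the four rules above. The sample values are hypothetical, and where the rules do not rank a source (for example, e-commerce for company name), placing it last is an assumption.

```python
from datetime import date

# Source order per attribute, most preferred first.
EMAIL_PRIORITY = ["crm", "erp", "ecommerce"]
NAME_PRIORITY = ["erp", "crm", "ecommerce"]    # only "prefer ERP" is given; the rest is assumed
POSTAL_PRIORITY = ["erp", "crm", "ecommerce"]  # e-commerce last is assumed


def by_priority(records, field, priority):
    """Return the non-empty value from the most preferred source."""
    candidates = [r for r in records if r.get(field)]
    candidates.sort(key=lambda r: priority.index(r["source"]))
    return candidates[0][field] if candidates else None


def most_recent(records, field):
    """Return the non-empty value from the most recently updated record."""
    candidates = [r for r in records if r.get(field)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["updated_at"])[field]


def golden_record(records):
    """Apply the survivorship rules to the records of one match group."""
    return {
        "email": by_priority(records, "email", EMAIL_PRIORITY),
        "company_name": by_priority(records, "company_name", NAME_PRIORITY),
        "phone": most_recent(records, "phone"),
        "postal_code": by_priority(records, "postal_code", POSTAL_PRIORITY),
    }


MATCH_GROUP = [
    {"source": "crm", "email": "ap@acme.example", "company_name": "Acme Corp.",
     "phone": "416-555-0100", "postal_code": "M5V 2T6", "updated_at": date(2024, 1, 10)},
    {"source": "erp", "email": "billing@acme.example", "company_name": "ACME Corporation Ltd.",
     "phone": "416-555-0199", "postal_code": "M5V 3A8", "updated_at": date(2024, 2, 20)},
    {"source": "ecommerce", "email": None, "company_name": "acme-corp",
     "phone": "416-555-0142", "postal_code": None, "updated_at": date(2024, 3, 1)},
]

GOLDEN = golden_record(MATCH_GROUP)
```

For this group the email survives from CRM, the company name and postal code from ERP, and the phone number from the e-commerce record because it was updated last.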

-- Golden record consolidation in Amazon Redshift
-- Uses match groups from Entity Resolution output
-- Note: RAW is a reserved word in Redshift, so the "raw" schema name is quoted.

CREATE TABLE mdm.customer_golden_record AS
WITH match_groups AS (
    -- Match output loaded from S3 via Redshift COPY or Redshift Spectrum
    SELECT match_id, source_system, source_record_id
    FROM mdm_staging.entity_resolution_output
),
all_sources AS (
    -- One row per matched source record. Priorities encode the survivorship
    -- rules (1 = preferred). The e-commerce source would be a third branch.
    SELECT mg.match_id, crm.email, crm.company_name, crm.phone, crm.postal_code,
           crm.updated_at, 1 AS email_priority, 2 AS name_priority, 2 AS postal_priority
    FROM match_groups mg
    JOIN "raw".crm_customers crm
      ON mg.source_record_id = crm.customer_id AND mg.source_system = 'crm'
    UNION ALL
    SELECT mg.match_id, erp.email, erp.company_name, erp.phone, erp.postal_code,
           erp.updated_at, 2 AS email_priority, 1 AS name_priority, 1 AS postal_priority
    FROM match_groups mg
    JOIN "raw".erp_accounts erp
      ON mg.source_record_id = erp.account_id AND mg.source_system = 'erp'
),
-- Survivorship: pick best attribute per golden entity
ranked_email AS (
    SELECT match_id, email,
           ROW_NUMBER() OVER (PARTITION BY match_id ORDER BY email_priority, updated_at DESC) AS rn
    FROM all_sources WHERE email IS NOT NULL
),
ranked_name AS (
    SELECT match_id, company_name,
           ROW_NUMBER() OVER (PARTITION BY match_id ORDER BY name_priority, updated_at DESC) AS rn
    FROM all_sources WHERE company_name IS NOT NULL
),
ranked_phone AS (
    -- Phone: most recently updated record wins, regardless of source
    SELECT match_id, phone,
           ROW_NUMBER() OVER (PARTITION BY match_id ORDER BY updated_at DESC) AS rn
    FROM all_sources WHERE phone IS NOT NULL
),
ranked_postal AS (
    SELECT match_id, postal_code,
           ROW_NUMBER() OVER (PARTITION BY match_id ORDER BY postal_priority, updated_at DESC) AS rn
    FROM all_sources WHERE postal_code IS NOT NULL
),
entities AS (
    SELECT DISTINCT match_id FROM all_sources
)
SELECT
    e.match_id AS golden_customer_id,
    em.email,
    nm.company_name,
    ph.phone,
    pc.postal_code,
    CURRENT_TIMESTAMP AS golden_record_created_at
FROM entities e
-- LEFT JOINs keep entities that are missing one or more attributes
LEFT JOIN ranked_email  em ON e.match_id = em.match_id AND em.rn = 1
LEFT JOIN ranked_name   nm ON e.match_id = nm.match_id AND nm.rn = 1
LEFT JOIN ranked_phone  ph ON e.match_id = ph.match_id AND ph.rn = 1
LEFT JOIN ranked_postal pc ON e.match_id = pc.match_id AND pc.rn = 1;

The resulting mdm.customer_golden_record table is the MDM layer’s output. All analytical queries that need a customer-level view join through this table using the golden_customer_id. Source-system-specific queries still use their native IDs; the match group table maps each native ID to its golden_customer_id, so either side can be translated to the other.

Domain-Driven MDM: Scoping What You Manage

One of the most common MDM project failures is trying to master every entity domain simultaneously. Customer, product, supplier, location, and employee MDM have different matching complexity, different stakeholder ownership, and different data quality starting points.

Start with the domain that causes the most analytical pain. For most B2B organisations, that is customer MDM — because customer lifetime value, retention analysis, and account health reporting all break when customer records are fragmented. For B2C companies with product catalogues, product MDM (matching SKUs across different vendor representations) is often the more urgent priority.

A phased domain approach:

Phase 1 (months 1–3): Customer MDM using AWS Entity Resolution. Prove the pattern, establish survivorship logic, demonstrate improved CLV calculation accuracy.
Phase 2 (months 4–6): Product MDM. More complex — requires hierarchical category matching and supplier ID mapping — but the Phase 1 infrastructure (Entity Resolution workflow, Redshift golden record pattern) reuses directly.
Phase 3 (months 7–12): Location and supplier MDM. Typically lower urgency but enables more sophisticated supply chain and geographic analytics.

Governance Integration with Lake Formation

Golden records are high-value, sensitive data. A customer golden record contains the merged identity of a real person or organisation, which in Canadian and UK regulatory contexts is personal data under PIPEDA, Law 25, and UK GDPR. Lake Formation’s column-level access control should protect the golden record table with the same rigour applied to source system data.

Lake Formation data filters allow you to restrict which columns of the golden record table are accessible to different consumer roles:

Analytics engineers: full access including all identifier fields
BI tool service account: access to non-PII aggregation columns only
Data science team: access to behavioural attributes, not to email/phone
External auditors: read access with all PII columns excluded via Lake Formation tag-based access control (Lake Formation hides restricted columns rather than masking their values)
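As one possible shape for the data science role above, the sketch below builds a Lake Formation data cells filter that exposes all rows but excludes the contact columns. It assumes the golden record table is registered in the Glue Data Catalog as `mdm.customer_golden_record`; the filter name and the list of PII columns are hypothetical. The boto3 call is wrapped in a function that is never invoked here, so the payload can be inspected without AWS credentials.

```python
PII_COLUMNS = ["email", "phone"]  # assumed PII columns on the golden record table


def data_filter_request(catalog_id, filter_name, excluded_columns):
    """Build the TableData payload for a filter that hides the given columns."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": "mdm",
        "TableName": "customer_golden_record",
        "Name": filter_name,
        "RowFilter": {"AllRowsWildcard": {}},  # no row-level restriction
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
    }


def create_filter(table_data):
    """Create the filter in Lake Formation (requires AWS credentials)."""
    import boto3  # imported lazily so the payload builder works offline

    client = boto3.client("lakeformation", region_name="ca-central-1")
    return client.create_data_cells_filter(TableData=table_data)


REQUEST = data_filter_request("123456789", "data-science-no-contact", PII_COLUMNS)
```

Creating the filter grants nothing by itself. The consuming role still needs SELECT granted on the filter, for example with the Lake Formation `grant_permissions` call.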

This governance posture connects directly to the broader Data Governance Framework that makes MDM trustworthy rather than just technically correct. A golden record that is not governed is a governance liability — a single point where identity data is concentrated without proportionate access controls.

Monitoring MDM Health

MDM is not a one-time project — it is an ongoing operational capability. Three metrics indicate whether your MDM layer is healthy:

Match rate: What percentage of source records are successfully matched to a golden entity? A match rate below 80% for customer records suggests either data quality problems in source systems or matching rules that are too conservative.

Singleton rate: What percentage of golden entities have only one source record? A high singleton rate (above 30%) may indicate that matching rules are too strict and are creating false separations.

Golden record staleness: How often is the golden record layer refreshed relative to source system change frequency? If your CRM updates customer email addresses daily but your MDM refreshes weekly, the golden record layer is stale for analytical purposes.
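A minimal sketch of how the three metrics could be computed from the match group table follows. It takes one reading of the definitions: a source record counts as "matched" when its golden entity contains at least one other record. Adjust that if your organisation defines match rate differently. The sample rows and timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def mdm_health(match_rows, last_refresh, now):
    """match_rows: iterable of (match_id, source_record_id) pairs."""
    group_sizes = Counter(match_id for match_id, _ in match_rows)
    total_records = sum(group_sizes.values())
    singletons = sum(1 for size in group_sizes.values() if size == 1)
    if total_records == 0:
        raise ValueError("no match rows supplied")
    return {
        # Share of source records grouped with at least one other record.
        "match_rate": (total_records - singletons) / total_records,
        # Share of golden entities backed by a single source record.
        "singleton_rate": singletons / len(group_sizes),
        # Hours since the golden record layer was last rebuilt.
        "staleness_hours": (now - last_refresh).total_seconds() / 3600,
    }


ROWS = [
    ("m-001", "C-17"), ("m-001", "E-904"), ("m-001", "U-5512"),
    ("m-002", "C-18"), ("m-002", "E-911"),
    ("m-003", "C-19"),
    ("m-004", "U-7001"),
]

HEALTH = mdm_health(
    ROWS,
    last_refresh=datetime(2024, 3, 18, 0, 0),
    now=datetime(2024, 3, 19, 12, 0),
)
```

With this sample, five of seven source records share an entity with another record, two of the four golden entities are singletons, and the layer is 36 hours old.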

These metrics belong in a CloudWatch dashboard, with alarms configured for threshold breaches. An MDM pipeline that silently degrades match quality is worse than no MDM — it produces confident but wrong answers. The observability practices described in DataOps Practices apply directly to MDM pipelines: automated quality gates, freshness monitoring, and pipeline health dashboards make degradation visible before it affects consumers.
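One way to wire the match-rate threshold into CloudWatch is sketched below. The namespace, metric name, and alarm name are hypothetical, and the 80% threshold is the figure quoted above. Treating missing data as breaching is a design assumption: a pipeline that stops publishing should alarm rather than stay quiet. The AWS calls sit in a function that is not invoked here.

```python
NAMESPACE = "MDM/Customer"  # hypothetical custom namespace
METRIC_NAME = "MatchRate"   # hypothetical metric name


def match_rate_alarm(threshold_percent=80.0):
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": "mdm-customer-match-rate-low",
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Minimum",
        "Period": 86400,  # evaluate one day at a time
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # silence counts as failure
    }


def publish(match_rate_percent):
    """Publish the metric and create or update the alarm (requires AWS credentials)."""
    import boto3  # imported lazily so the alarm definition can be built offline

    cloudwatch = boto3.client("cloudwatch", region_name="ca-central-1")
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[
            {"MetricName": METRIC_NAME, "Value": match_rate_percent, "Unit": "Percent"}
        ],
    )
    cloudwatch.put_metric_alarm(**match_rate_alarm())


ALARM = match_rate_alarm()
```

The same pattern extends to singleton rate and staleness with their own thresholds and comparison operators.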

Conclusion

Master Data Management on AWS is now more accessible than it has ever been, thanks to AWS Entity Resolution handling the ML-powered matching problem that previously required specialised vendor software. The remaining work — survivorship logic, golden record modelling, governance integration, and operational monitoring — is standard data engineering, deployable with Redshift, Glue, and Lake Formation.

The organisations that benefit most from MDM are those that have already invested in data quality at the pipeline level — where source data is clean enough for matching to work reliably — and have clear ownership of entity domains at the business level. MDM without business ownership of the golden record is a technical solution to a business problem, and it rarely sticks.

If you are designing an MDM capability for your AWS data platform or need help evaluating whether AWS Entity Resolution fits your matching requirements, contact the Infra IT Consulting team. We work with data-intensive organisations in Canada, the UK, and Africa to build identity resolution and master data infrastructure that makes analytics trustworthy.
