Data Analytics & BI dbtanalytics-engineeringredshift

dbt on AWS: Transforming Raw Data into Analytics-Ready Models

By Infra IT Consulting · February 16, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem. Learn more

Raw data is rarely analytics-ready. Whether it comes from a SaaS application, a transactional database, or a streaming pipeline, it almost always requires cleaning, joining, and restructuring before it can answer real business questions. For years, that transformation work lived in sprawling, poorly documented SQL scripts or overloaded ETL pipelines. dbt (data build tool) changes that equation entirely — and when paired with AWS services like Amazon Redshift and Amazon Athena, it becomes the backbone of a genuinely modern analytics stack.

This post explores how dbt fits into an AWS data architecture, how to set it up with both Redshift and Athena, and what analytics engineering looks like in practice.

What Is Analytics Engineering and Why Does It Matter?

Analytics engineering sits at the intersection of data engineering and data analysis. Analytics engineers own the transformation layer: they take raw, ingested data and produce clean, well-documented, tested models that analysts and business stakeholders can trust and use without help.

Before dbt, transformations often happened in one of two places: inside ETL tools (AWS Glue, Informatica, Talend) or in ad-hoc SQL files with no version control, no testing, and no documentation. Both approaches create fragility. When a source schema changes or a business definition shifts, downstream breakage is discovered late — usually by an angry stakeholder looking at a wrong dashboard number.

dbt solves this by treating SQL transformations as software. Every model is a .sql file checked into Git. Dependencies between models are declared explicitly, enabling dbt to build a directed acyclic graph (DAG) and execute models in the correct order. Tests are defined in YAML and run automatically. Documentation is generated from the same codebase. The result is a transformation layer that is reproducible, testable, and maintainable.

Setting Up dbt with Amazon Redshift

Amazon Redshift is dbt’s most mature AWS integration. The dbt-redshift adapter wraps psycopg2 and supports all of Redshift’s performance features, including distribution keys, sort keys, and late-binding views.

A typical profiles.yml for Redshift looks like this:

my_project:
  target: dev
  outputs:
    dev:
      type: redshift
      host: my-cluster.abc123.ca-central-1.redshift.amazonaws.com
      port: 5439
      user: analytics_engineer
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_dev
      threads: 4
      ra3_node: true

Setting ra3_node: true enables cross-database queries if you use Redshift’s data sharing feature. In a production setup, you would separate dev and prod schemas so that engineers can iterate without affecting dashboards.

For performance-critical models, dbt lets you configure Redshift-specific settings per model:

{{ config(
    materialized='table',
    dist='customer_id',
    sort=['event_date', 'customer_id'],
    sort_style='compound'
) }}

SELECT
    o.customer_id,
    o.order_date::date AS event_date,
    SUM(o.amount_usd) AS total_spend_usd,
    COUNT(o.order_id) AS order_count
FROM {{ ref('stg_orders') }} o
GROUP BY 1, 2

Specifying a distribution key on customer_id tells Redshift to co-locate rows with the same customer across slices, which dramatically reduces data shuffling when you join this table to other customer-grain models.

Setting Up dbt with Amazon Athena

For teams running a data lake architecture on Amazon S3, dbt-athena (the community adapter dbt-athena-community) allows you to run dbt models directly against S3 data using Athena’s serverless SQL engine. This is particularly compelling when you want transformation without managing a Redshift cluster.

my_lake_project:
  target: dev
  outputs:
    dev:
      type: athena
      region_name: ca-central-1
      s3_staging_dir: s3://my-analytics-bucket/athena-results/
      database: analytics_dev
      schema: transformed
      threads: 4
      work_group: primary

Athena models materialise as tables or views in the AWS Glue Data Catalog, stored as Parquet or ORC files on S3. For large models, Iceberg table format is increasingly popular because it supports MERGE operations — enabling dbt incremental models to work correctly without full table rebuilds.

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    table_type='iceberg',
    format='parquet'
) }}

SELECT
    order_id,
    customer_id,
    status,
    amount_usd,
    updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

This incremental pattern is central to production dbt workflows — processing only new or changed records rather than reprocessing the entire dataset on every run.

Project Structure and the Medallion Architecture

A well-structured dbt project maps naturally onto the medallion architecture common in AWS data lakes:

models/
├── staging/          # Bronze → Silver: light cleaning, type casting, renaming
│   ├── stg_orders.sql
│   ├── stg_customers.sql
│   └── schema.yml
├── intermediate/     # Silver: joins, business logic
│   ├── int_customer_orders.sql
│   └── schema.yml
└── marts/            # Gold: business-facing aggregations
    ├── finance/
    │   └── fct_revenue_daily.sql
    └── marketing/
        └── fct_campaign_performance.sql

Staging models do minimal work: rename columns to a consistent convention, cast types, and filter out test or deleted records. Intermediate models join entities together and apply business rules. Mart models produce the aggregations and metrics that power dashboards and reports.

This structure also clarifies ownership: staging models are owned by the data engineering team and track source systems closely; mart models are owned collaboratively with business stakeholders who define what metrics mean.

Testing and Documentation

dbt ships with four built-in generic tests: unique, not_null, accepted_values, and relationships. You declare them in YAML alongside your model:

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Running dbt test after every model build catches data quality issues at the transformation layer, before bad data reaches dashboards. In a CI/CD pipeline on AWS CodePipeline or GitHub Actions, these tests gate deployment to production.

Documentation is generated with dbt docs generate and served via dbt docs serve. On AWS, many teams host the generated static site in S3 with CloudFront, giving the whole organisation a searchable data catalogue at zero marginal cost.

Orchestration on AWS

dbt runs are typically scheduled via an orchestrator. The most common patterns on AWS are:

Amazon MWAA (Managed Airflow) — mature, flexible, integrates with DbtTaskGroup from the Astronomer Cosmos library
AWS Step Functions — lower overhead for simpler pipelines
dbt Cloud — fully managed option that handles scheduling, CI, and IDE in one product

For most mid-sized organisations, MWAA with Cosmos offers the best balance of flexibility and operational simplicity. You get Airflow’s dependency management and retry logic while dbt handles the SQL graph.

Conclusion

dbt brings software engineering discipline to the analytics transformation layer — version control, testing, documentation, and modularity that were previously absent from most SQL workflows. When deployed on AWS using Redshift or Athena, it completes the modern data stack: S3 as storage, Glue or Kinesis for ingestion, dbt for transformation, and QuickSight or another BI tool for visualisation.

If you are building or modernising a data platform and want to implement dbt correctly from the start, or if you have inherited a tangled transformation layer and need to bring it under control, Infra IT Consulting can help. We have delivered dbt implementations across Redshift, Athena, and Snowflake for clients in Canada, the UK, and Africa. Get in touch to discuss your project.

Related reading:

Data Analytics & BI

Talk to our team →

dbt on AWS: Transforming Raw Data into Analytics-Ready Models

What Is Analytics Engineering and Why Does It Matter?

Setting Up dbt with Amazon Redshift

Setting Up dbt with Amazon Athena

Project Structure and the Medallion Architecture

Testing and Documentation

Orchestration on AWS

Conclusion

Related posts

Data as a Product: Building Internal Data Products That Teams Actually Use

Building Real-Time Dashboards with Kinesis and QuickSight

Geospatial Analytics on AWS: Tools and Patterns

What Is Analytics Engineering and Why Does It Matter?

Setting Up dbt with Amazon Redshift

Setting Up dbt with Amazon Athena

Project Structure and the Medallion Architecture

Testing and Documentation

Orchestration on AWS

Conclusion

Related posts

Data as a Product: Building Internal Data Products That Teams Actually Use

Building Real-Time Dashboards with Kinesis and QuickSight

Geospatial Analytics on AWS: Tools and Patterns

We value your privacy