Infra IT Consulting logo Infra ITC
Tech Tutorials & How-Tos dbttutorialredshift

dbt 101 for AWS Data Engineers: Your First Transformation Project

By Infra IT Consulting ยท ยท 10 min read

dbt (data build tool) has reshaped how data teams think about transformations. Instead of writing stored procedures, maintaining fragile ETL scripts, or building transformation layers manually in Glue, dbt lets you express transformations as SQL SELECT statements, manage them in Git, test them automatically, and generate documentation that stays in sync with the code. This tutorial walks through a complete first project against Amazon Redshift: installation, connection setup, your first model, source definitions, schema tests, and running the full pipeline.

Prerequisites

You need:

  • An Amazon Redshift cluster or Redshift Serverless workgroup (the free trial is sufficient for this tutorial)
  • A database user with CREATE SCHEMA and SELECT permissions
  • Python 3.9+ installed locally
  • Familiarity with basic SQL

For Redshift Serverless, note the endpoint format: <workgroup-name>.<account-id>.<region>.redshift-serverless.amazonaws.com.

Installing dbt-redshift

dbt uses adapter packages that pair the dbt core framework with a specific database connector. For Redshift:

# Create a virtual environment (strongly recommended)
python -m venv dbt-env
source dbt-env/bin/activate   # On Windows: dbt-env\Scripts\activate

# Install the Redshift adapter (includes dbt-core)
pip install dbt-redshift==1.7.4

# Verify installation
dbt --version

You should see output like:

Core:
  - installed: 1.7.4
  - latest:    1.7.4 - Up to date!

Plugins:
  - redshift: 1.7.4 - Up to date!

Initialising Your Project

# Create a new dbt project
dbt init my_analytics_project

# Answer the prompts:
# - Which database adapter? redshift
# - The rest you can fill in interactively, or edit profiles.yml afterwards

This creates the project structure:

my_analytics_project/
โ”œโ”€โ”€ analyses/
โ”œโ”€โ”€ dbt_project.yml
โ”œโ”€โ”€ macros/
โ”œโ”€โ”€ models/
โ”‚   โ””โ”€โ”€ example/
โ”œโ”€โ”€ seeds/
โ”œโ”€โ”€ snapshots/
โ””โ”€โ”€ tests/

Configuring profiles.yml

The profiles.yml file lives in ~/.dbt/profiles.yml (outside the project directory, so it is not committed to Git). It stores database credentials per environment.

# ~/.dbt/profiles.yml
my_analytics_project:
  target: dev
  outputs:
    dev:
      type: redshift
      host: my-cluster.abc123.ca-central-1.redshift.amazonaws.com
      port: 5439
      user: dbt_dev_user
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_dev_yourname   # Each developer gets their own schema
      threads: 4
      keepalives_idle: 240
      connect_timeout: 30
      ra3_node: true             # Set true for RA3 nodes or Serverless

    prod:
      type: redshift
      host: my-cluster.abc123.ca-central-1.redshift.amazonaws.com
      port: 5439
      user: dbt_prod_user
      password: "{{ env_var('REDSHIFT_PASSWORD_PROD') }}"
      dbname: analytics
      schema: analytics          # Production models land in the analytics schema
      threads: 8
      ra3_node: true

The {{ env_var('REDSHIFT_PASSWORD') }} syntax pulls the password from an environment variable rather than hardcoding it. Export it before running dbt:

export REDSHIFT_PASSWORD="your-password-here"

Test the connection:

cd my_analytics_project
dbt debug

All checks should pass. If you see a connection error, verify your security group allows inbound traffic on port 5439 from your IP.

Defining Sources

Before writing transformation models, declare the raw tables you depend on in a sources.yml file. This gives dbt the metadata it needs to generate lineage, run source freshness checks, and allow you to use the source() function in models.

# models/staging/sources.yml
version: 2

sources:
  - name: raw_ecommerce
    description: "Raw transactional data loaded from the application database via AWS DMS"
    database: analytics
    schema: raw
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _loaded_at

    tables:
      - name: orders
        description: "One row per customer order"
        columns:
          - name: order_id
            description: "Primary key โ€” unique identifier for each order"
            tests:
              - unique
              - not_null
          - name: customer_id
            tests:
              - not_null
          - name: order_date
            tests:
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ["pending", "confirmed", "shipped", "delivered", "cancelled"]

      - name: customers
        description: "One row per registered customer"
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null

      - name: order_items
        description: "One row per line item within an order"
        columns:
          - name: order_item_id
            tests:
              - unique
              - not_null
          - name: order_id
            tests:
              - not_null
              - relationships:
                  to: source('raw_ecommerce', 'orders')
                  field: order_id

Writing Your First Model

dbt models are SQL files where the filename becomes the table or view name. Start with a staging model that cleans the raw orders data:

-- models/staging/stg_orders.sql
-- Config block: this model materialises as a view in dev, table in prod
{{
  config(
    materialized = 'view',
    schema = 'staging'
  )
}}

WITH source AS (
    -- Use the source() function to reference declared sources
    SELECT * FROM {{ source('raw_ecommerce', 'orders') }}
),

cleaned AS (
    SELECT
        order_id::VARCHAR(50)                           AS order_id,
        customer_id::VARCHAR(50)                        AS customer_id,
        order_date::DATE                                AS order_date,
        LOWER(TRIM(status))                             AS status,
        COALESCE(total_amount, 0.00)::DECIMAL(12, 2)   AS total_amount,
        COALESCE(discount_amount, 0.00)::DECIMAL(12, 2) AS discount_amount,
        total_amount - COALESCE(discount_amount, 0)     AS net_revenue,
        created_at::TIMESTAMP                           AS created_at,
        _loaded_at                                      AS _loaded_at
    FROM source
    WHERE order_id IS NOT NULL
      AND customer_id IS NOT NULL
      AND order_date >= '2020-01-01'::DATE
)

SELECT * FROM cleaned

Now write a marts-level model that uses ref() to depend on the staging model:

-- models/marts/fct_orders.sql
{{
  config(
    materialized = 'table',
    schema = 'marts',
    sort = 'order_date',
    dist = 'customer_id'
  )
}}

WITH orders AS (
    -- ref() creates a dependency and resolves to the correct schema at runtime
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

order_items AS (
    SELECT * FROM {{ ref('stg_order_items') }}
),

order_metrics AS (
    SELECT
        order_id,
        COUNT(*)            AS line_item_count,
        SUM(quantity)       AS total_units,
        SUM(unit_price * quantity) AS gross_merchandise_value
    FROM order_items
    GROUP BY order_id
)

SELECT
    o.order_id,
    o.customer_id,
    c.customer_name,
    c.customer_segment,
    o.order_date,
    o.status,
    o.net_revenue,
    o.discount_amount,
    m.line_item_count,
    m.total_units,
    m.gross_merchandise_value,
    o.created_at
FROM orders o
LEFT JOIN customers c   USING (customer_id)
LEFT JOIN order_metrics m USING (order_id)

Adding Model-Level Tests in schema.yml

Tests validate your models after they run. Define them in a schema.yml file alongside the models:

# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Fact table containing one row per customer order with enriched metrics"
    columns:
      - name: order_id
        description: "Primary key"
        tests:
          - unique
          - not_null

      - name: customer_id
        tests:
          - not_null

      - name: order_date
        tests:
          - not_null

      - name: status
        tests:
          - not_null
          - accepted_values:
              values: ["pending", "confirmed", "shipped", "delivered", "cancelled"]

      - name: net_revenue
        tests:
          - not_null

  - name: stg_orders
    description: "Cleaned and typed staging layer for raw orders"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: net_revenue
        tests:
          - not_null

Configuring dbt_project.yml

The dbt_project.yml file controls project-wide settings, including default materialisation strategies per folder:

# dbt_project.yml
name: my_analytics_project
version: "1.0.0"
config-version: 2

profile: my_analytics_project

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  my_analytics_project:
    staging:
      +materialized: view
      +schema: staging
    marts:
      +materialized: table
      +schema: marts
      +sort_type: compound

Running dbt: The Core Commands

# Install dbt packages defined in packages.yml
dbt deps

# Run all models
dbt run

# Run only staging models
dbt run --select staging

# Run a specific model and all its downstream dependents
dbt run --select fct_orders+

# Run tests for all models
dbt test

# Run a specific test
dbt test --select fct_orders

# Run models AND tests in one command (recommended for CI)
dbt build

# Check source freshness
dbt source freshness

# Generate and serve documentation
dbt docs generate
dbt docs serve  # Opens browser at localhost:8080

# Compile models without running (useful for debugging SQL)
dbt compile --select fct_orders

The dbt build command is the most useful for pipelines โ€” it runs models, seeds, snapshots, and tests in DAG order, stopping if any test fails before downstream models execute.

Connecting dbt to the Broader AWS Stack

For production dbt deployments on AWS, there are two common patterns:

AWS MWAA (Airflow): Use the DbtCloudOperator or the BashOperator to run dbt run and dbt test as Airflow tasks. This gives you dependency management, retry logic, and alerting within your existing orchestration layer.

GitHub Actions: Run dbt build on every pull request to catch regressions before they reach production. See CI/CD for Data Pipelines with GitHub Actions for a complete workflow example.

For a deeper dive into the dbt + Redshift combination โ€” including incremental models, snapshots, and the modern data stack architecture โ€” see our post on dbt on AWS.

Conclusion

dbt transforms SQL from a one-off scripting language into a properly engineered application: version-controlled, tested, documented, and deployable. This tutorial covered the core workflow โ€” installation, source definitions, staging and marts models, schema tests, and the essential dbt commands. From here, explore incremental models (for large tables), snapshots (for slowly changing dimensions), and macros (for reusable SQL logic).

If your team is adopting dbt and needs architectural guidance on the Redshift data model, test coverage strategy, or production deployment setup, contact Infra IT Consulting. We help Canadian data teams implement the modern analytics engineering stack on AWS.

Related posts