dbt 101 for AWS Data Engineers: Your First Transformation Project
dbt (data build tool) has reshaped how data teams think about transformations. Instead of writing stored procedures, maintaining fragile ETL scripts, or building transformation layers manually in Glue, dbt lets you express transformations as SQL SELECT statements, manage them in Git, test them automatically, and generate documentation that stays in sync with the code. This tutorial walks through a complete first project against Amazon Redshift: installation, connection setup, your first model, source definitions, schema tests, and running the full pipeline.
Prerequisites
You need:
- An Amazon Redshift cluster or Redshift Serverless workgroup (the free trial is sufficient for this tutorial)
- A database user with CREATE SCHEMA and SELECT permissions
- Python 3.9+ installed locally
- Familiarity with basic SQL
For Redshift Serverless, note the endpoint format: <workgroup-name>.<account-id>.<region>.redshift-serverless.amazonaws.com.
Installing dbt-redshift
dbt uses adapter packages that pair the dbt core framework with a specific database connector. For Redshift:
# Create a virtual environment (strongly recommended)
python -m venv dbt-env
source dbt-env/bin/activate # On Windows: dbt-env\Scripts\activate
# Install the Redshift adapter (includes dbt-core)
pip install dbt-redshift==1.7.4
# Verify installation
dbt --version
You should see output like:
Core:
- installed: 1.7.4
- latest: 1.7.4 - Up to date!
Plugins:
- redshift: 1.7.4 - Up to date!
Initialising Your Project
# Create a new dbt project
dbt init my_analytics_project
# Answer the prompts:
# - Which database adapter? redshift
# - The rest you can fill in interactively, or edit profiles.yml afterwards
This creates the project structure:
my_analytics_project/
โโโ analyses/
โโโ dbt_project.yml
โโโ macros/
โโโ models/
โ โโโ example/
โโโ seeds/
โโโ snapshots/
โโโ tests/
Configuring profiles.yml
The profiles.yml file lives in ~/.dbt/profiles.yml (outside the project directory, so it is not committed to Git). It stores database credentials per environment.
# ~/.dbt/profiles.yml
my_analytics_project:
target: dev
outputs:
dev:
type: redshift
host: my-cluster.abc123.ca-central-1.redshift.amazonaws.com
port: 5439
user: dbt_dev_user
password: "{{ env_var('REDSHIFT_PASSWORD') }}"
dbname: analytics
schema: dbt_dev_yourname # Each developer gets their own schema
threads: 4
keepalives_idle: 240
connect_timeout: 30
ra3_node: true # Set true for RA3 nodes or Serverless
prod:
type: redshift
host: my-cluster.abc123.ca-central-1.redshift.amazonaws.com
port: 5439
user: dbt_prod_user
password: "{{ env_var('REDSHIFT_PASSWORD_PROD') }}"
dbname: analytics
schema: analytics # Production models land in the analytics schema
threads: 8
ra3_node: true
The {{ env_var('REDSHIFT_PASSWORD') }} syntax pulls the password from an environment variable rather than hardcoding it. Export it before running dbt:
export REDSHIFT_PASSWORD="your-password-here"
Test the connection:
cd my_analytics_project
dbt debug
All checks should pass. If you see a connection error, verify your security group allows inbound traffic on port 5439 from your IP.
Defining Sources
Before writing transformation models, declare the raw tables you depend on in a sources.yml file. This gives dbt the metadata it needs to generate lineage, run source freshness checks, and allow you to use the source() function in models.
# models/staging/sources.yml
version: 2
sources:
- name: raw_ecommerce
description: "Raw transactional data loaded from the application database via AWS DMS"
database: analytics
schema: raw
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
loaded_at_field: _loaded_at
tables:
- name: orders
description: "One row per customer order"
columns:
- name: order_id
description: "Primary key โ unique identifier for each order"
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- name: order_date
tests:
- not_null
- name: status
tests:
- accepted_values:
values: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
- name: customers
description: "One row per registered customer"
columns:
- name: customer_id
tests:
- unique
- not_null
- name: order_items
description: "One row per line item within an order"
columns:
- name: order_item_id
tests:
- unique
- not_null
- name: order_id
tests:
- not_null
- relationships:
to: source('raw_ecommerce', 'orders')
field: order_id
Writing Your First Model
dbt models are SQL files where the filename becomes the table or view name. Start with a staging model that cleans the raw orders data:
-- models/staging/stg_orders.sql
-- Config block: this model materialises as a view in dev, table in prod
{{
config(
materialized = 'view',
schema = 'staging'
)
}}
WITH source AS (
-- Use the source() function to reference declared sources
SELECT * FROM {{ source('raw_ecommerce', 'orders') }}
),
cleaned AS (
SELECT
order_id::VARCHAR(50) AS order_id,
customer_id::VARCHAR(50) AS customer_id,
order_date::DATE AS order_date,
LOWER(TRIM(status)) AS status,
COALESCE(total_amount, 0.00)::DECIMAL(12, 2) AS total_amount,
COALESCE(discount_amount, 0.00)::DECIMAL(12, 2) AS discount_amount,
total_amount - COALESCE(discount_amount, 0) AS net_revenue,
created_at::TIMESTAMP AS created_at,
_loaded_at AS _loaded_at
FROM source
WHERE order_id IS NOT NULL
AND customer_id IS NOT NULL
AND order_date >= '2020-01-01'::DATE
)
SELECT * FROM cleaned
Now write a marts-level model that uses ref() to depend on the staging model:
-- models/marts/fct_orders.sql
{{
config(
materialized = 'table',
schema = 'marts',
sort = 'order_date',
dist = 'customer_id'
)
}}
WITH orders AS (
-- ref() creates a dependency and resolves to the correct schema at runtime
SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
SELECT * FROM {{ ref('stg_customers') }}
),
order_items AS (
SELECT * FROM {{ ref('stg_order_items') }}
),
order_metrics AS (
SELECT
order_id,
COUNT(*) AS line_item_count,
SUM(quantity) AS total_units,
SUM(unit_price * quantity) AS gross_merchandise_value
FROM order_items
GROUP BY order_id
)
SELECT
o.order_id,
o.customer_id,
c.customer_name,
c.customer_segment,
o.order_date,
o.status,
o.net_revenue,
o.discount_amount,
m.line_item_count,
m.total_units,
m.gross_merchandise_value,
o.created_at
FROM orders o
LEFT JOIN customers c USING (customer_id)
LEFT JOIN order_metrics m USING (order_id)
Adding Model-Level Tests in schema.yml
Tests validate your models after they run. Define them in a schema.yml file alongside the models:
# models/marts/schema.yml
version: 2
models:
- name: fct_orders
description: "Fact table containing one row per customer order with enriched metrics"
columns:
- name: order_id
description: "Primary key"
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- name: order_date
tests:
- not_null
- name: status
tests:
- not_null
- accepted_values:
values: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
- name: net_revenue
tests:
- not_null
- name: stg_orders
description: "Cleaned and typed staging layer for raw orders"
columns:
- name: order_id
tests:
- unique
- not_null
- name: net_revenue
tests:
- not_null
Configuring dbt_project.yml
The dbt_project.yml file controls project-wide settings, including default materialisation strategies per folder:
# dbt_project.yml
name: my_analytics_project
version: "1.0.0"
config-version: 2
profile: my_analytics_project
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
- "target"
- "dbt_packages"
models:
my_analytics_project:
staging:
+materialized: view
+schema: staging
marts:
+materialized: table
+schema: marts
+sort_type: compound
Running dbt: The Core Commands
# Install dbt packages defined in packages.yml
dbt deps
# Run all models
dbt run
# Run only staging models
dbt run --select staging
# Run a specific model and all its downstream dependents
dbt run --select fct_orders+
# Run tests for all models
dbt test
# Run a specific test
dbt test --select fct_orders
# Run models AND tests in one command (recommended for CI)
dbt build
# Check source freshness
dbt source freshness
# Generate and serve documentation
dbt docs generate
dbt docs serve # Opens browser at localhost:8080
# Compile models without running (useful for debugging SQL)
dbt compile --select fct_orders
The dbt build command is the most useful for pipelines โ it runs models, seeds, snapshots, and tests in DAG order, stopping if any test fails before downstream models execute.
Connecting dbt to the Broader AWS Stack
For production dbt deployments on AWS, there are two common patterns:
AWS MWAA (Airflow): Use the DbtCloudOperator or the BashOperator to run dbt run and dbt test as Airflow tasks. This gives you dependency management, retry logic, and alerting within your existing orchestration layer.
GitHub Actions: Run dbt build on every pull request to catch regressions before they reach production. See CI/CD for Data Pipelines with GitHub Actions for a complete workflow example.
For a deeper dive into the dbt + Redshift combination โ including incremental models, snapshots, and the modern data stack architecture โ see our post on dbt on AWS.
Conclusion
dbt transforms SQL from a one-off scripting language into a properly engineered application: version-controlled, tested, documented, and deployable. This tutorial covered the core workflow โ installation, source definitions, staging and marts models, schema tests, and the essential dbt commands. From here, explore incremental models (for large tables), snapshots (for slowly changing dimensions), and macros (for reusable SQL logic).
If your team is adopting dbt and needs architectural guidance on the Redshift data model, test coverage strategy, or production deployment setup, contact Infra IT Consulting. We help Canadian data teams implement the modern analytics engineering stack on AWS.
Related posts
Book a free 30-minute consultation to discuss your data engineering and analytics needs.
Talk to our team โ