CI/CD for Data Pipelines with GitHub Actions
Continuous integration and deployment is table stakes in software engineering, but adoption in data teams has historically been uneven. Data pipelines get deployed manually, SQL transformations are edited directly in production, and Glue job scripts are uploaded by hand. The result is brittle infrastructure, environment drift, and incidents that are hard to reproduce.
GitHub Actions makes CI/CD for data pipelines genuinely accessible. It integrates directly with your code repository, supports AWS CLI and SDK operations natively, and costs nothing for small-to-medium teams using public or private repositories within the free tier limits. This tutorial shows four concrete workflows: running dbt tests, deploying Glue jobs, triggering Step Functions, and linting SQL with sqlfluff.
Repository Structure
Organise your data repository so that the CI/CD workflows can target logical components:
data-platform/
โโโ .github/
โ โโโ workflows/
โ โโโ dbt-ci.yml
โ โโโ glue-deploy.yml
โ โโโ step-functions-trigger.yml
โ โโโ sql-lint.yml
โโโ dbt/
โ โโโ models/
โ โโโ tests/
โ โโโ dbt_project.yml
โ โโโ profiles.yml
โโโ glue/
โ โโโ jobs/
โ โโโ ingest_raw.py
โ โโโ transform_orders.py
โโโ step_functions/
โ โโโ pipeline_definition.json
โโโ .sqlfluff
Secrets are stored in GitHub repository secrets (Settings โ Secrets and variables โ Actions) and never in code. At minimum, you need AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION.
Workflow 1: Running dbt Tests on Pull Requests
This workflow runs whenever a pull request modifies dbt models or tests. It sets up a slim dbt environment, connects to Redshift using a CI-specific schema, runs the changed models, and executes all tests.
# .github/workflows/dbt-ci.yml
name: dbt CI
on:
pull_request:
paths:
- "dbt/**"
branches:
- main
- develop
env:
DBT_PROFILES_DIR: ./dbt
jobs:
dbt-test:
name: Run dbt Tests
runs-on: ubuntu-latest
environment: ci
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install dbt-redshift
run: pip install dbt-redshift==1.7.*
- name: Create dbt profiles for CI
env:
REDSHIFT_HOST: ${{ secrets.REDSHIFT_HOST }}
REDSHIFT_USER: ${{ secrets.REDSHIFT_USER }}
REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
REDSHIFT_DB: ${{ secrets.REDSHIFT_DB }}
REDSHIFT_PORT: 5439
run: |
mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml << EOF
data_platform:
target: ci
outputs:
ci:
type: redshift
host: "${REDSHIFT_HOST}"
user: "${REDSHIFT_USER}"
password: "${REDSHIFT_PASSWORD}"
port: ${REDSHIFT_PORT}
dbname: "${REDSHIFT_DB}"
schema: "ci_${{ github.event.pull_request.number }}"
threads: 4
EOF
- name: dbt debug (verify connection)
working-directory: ./dbt
run: dbt debug
- name: dbt deps
working-directory: ./dbt
run: dbt deps
- name: dbt build (run + test)
working-directory: ./dbt
run: dbt build --target ci
- name: Clean up CI schema
if: always()
working-directory: ./dbt
env:
REDSHIFT_HOST: ${{ secrets.REDSHIFT_HOST }}
REDSHIFT_USER: ${{ secrets.REDSHIFT_USER }}
REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
run: |
python - <<'PYEOF'
import psycopg2, os
conn = psycopg2.connect(
host=os.environ["REDSHIFT_HOST"],
user=os.environ["REDSHIFT_USER"],
password=os.environ["REDSHIFT_PASSWORD"],
dbname="analytics",
port=5439
)
conn.autocommit = True
with conn.cursor() as cur:
schema = f"ci_{os.environ.get('PR_NUMBER', 'unknown')}"
cur.execute(f"DROP SCHEMA IF EXISTS {schema} CASCADE")
PYEOF
env:
PR_NUMBER: ${{ github.event.pull_request.number }}
The CI schema โ named after the PR number โ gives each pull request an isolated namespace. The cleanup step drops it on workflow completion regardless of whether the tests passed or failed.
Workflow 2: Deploying AWS Glue Jobs
This workflow deploys Glue job scripts to S3 and updates the Glue job definition whenever changes are pushed to the main branch. It supports multiple jobs via a matrix strategy.
# .github/workflows/glue-deploy.yml
name: Deploy Glue Jobs
on:
push:
branches:
- main
paths:
- "glue/jobs/**"
jobs:
deploy-glue:
name: Deploy ${{ matrix.job_name }}
runs-on: ubuntu-latest
environment: production
strategy:
matrix:
include:
- job_name: ingest_raw
script_path: glue/jobs/ingest_raw.py
glue_job_name: prod-ingest-raw
- job_name: transform_orders
script_path: glue/jobs/transform_orders.py
glue_job_name: prod-transform-orders
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ca-central-1
- name: Upload script to S3
run: |
aws s3 cp ${{ matrix.script_path }} \
s3://${{ secrets.GLUE_SCRIPTS_BUCKET }}/jobs/${{ matrix.job_name }}/script.py \
--sse AES256
- name: Update Glue job definition
run: |
aws glue update-job \
--job-name "${{ matrix.glue_job_name }}" \
--job-update '{
"Command": {
"Name": "glueetl",
"ScriptLocation": "s3://${{ secrets.GLUE_SCRIPTS_BUCKET }}/jobs/${{ matrix.job_name }}/script.py",
"PythonVersion": "3"
},
"GlueVersion": "4.0",
"NumberOfWorkers": 10,
"WorkerType": "G.1X",
"Timeout": 120,
"DefaultArguments": {
"--job-bookmark-option": "job-bookmark-enable",
"--enable-metrics": "",
"--enable-continuous-cloudwatch-log": "true",
"--TempDir": "s3://${{ secrets.GLUE_TEMP_BUCKET }}/temp/"
}
}'
- name: Verify job update
run: |
aws glue get-job --job-name "${{ matrix.glue_job_name }}" \
--query 'Job.{Name: Name, GlueVersion: GlueVersion, ModifiedOn: LastModifiedOn}' \
--output table
This pattern keeps Glue job scripts in version control โ the S3 upload and update-job call happen automatically on every merge to main. Teams no longer upload scripts manually or lose track of which version is running in production.
Workflow 3: Triggering AWS Step Functions
For pipelines orchestrated by Step Functions, you may want to trigger a test execution on deployment or on a schedule:
# .github/workflows/step-functions-trigger.yml
name: Trigger Pipeline Execution
on:
workflow_dispatch:
inputs:
environment:
description: "Target environment (staging/production)"
required: true
default: "staging"
type: choice
options:
- staging
- production
execution_date:
description: "Execution date (YYYY-MM-DD)"
required: false
default: ""
schedule:
- cron: "0 6 * * *" # Run daily at 06:00 UTC
jobs:
trigger-pipeline:
name: Start Step Functions Execution
runs-on: ubuntu-latest
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ca-central-1
- name: Set execution date
id: set-date
run: |
if [ -n "${{ github.event.inputs.execution_date }}" ]; then
echo "exec_date=${{ github.event.inputs.execution_date }}" >> "$GITHUB_OUTPUT"
else
echo "exec_date=$(date -u +%Y-%m-%d)" >> "$GITHUB_OUTPUT"
fi
- name: Start Step Functions execution
id: start-execution
run: |
ENV="${{ github.event.inputs.environment || 'production' }}"
STATE_MACHINE_ARN="${{ secrets.STATE_MACHINE_ARN }}"
EXECUTION_ARN=$(aws stepfunctions start-execution \
--state-machine-arn "$STATE_MACHINE_ARN" \
--name "github-actions-${{ github.run_id }}" \
--input "{\"execution_date\": \"${{ steps.set-date.outputs.exec_date }}\", \"environment\": \"$ENV\"}" \
--query 'executionArn' \
--output text)
echo "execution_arn=$EXECUTION_ARN" >> "$GITHUB_OUTPUT"
echo "Execution started: $EXECUTION_ARN"
- name: Wait for execution to complete
run: |
EXECUTION_ARN="${{ steps.start-execution.outputs.execution_arn }}"
MAX_WAIT=1800 # 30 minutes
ELAPSED=0
INTERVAL=30
while [ $ELAPSED -lt $MAX_WAIT ]; do
STATUS=$(aws stepfunctions describe-execution \
--execution-arn "$EXECUTION_ARN" \
--query 'status' \
--output text)
echo "Status: $STATUS (elapsed: ${ELAPSED}s)"
if [ "$STATUS" = "SUCCEEDED" ]; then
echo "Pipeline execution succeeded."
exit 0
elif [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "TIMED_OUT" ] || [ "$STATUS" = "ABORTED" ]; then
echo "Pipeline execution failed with status: $STATUS"
exit 1
fi
sleep $INTERVAL
ELAPSED=$((ELAPSED + INTERVAL))
done
echo "Timed out waiting for execution to complete."
exit 1
The workflow_dispatch trigger allows manual runs from the GitHub UI with parameters. The schedule trigger runs it automatically each morning. Combining both gives you automated daily runs with the ability to trigger on-demand reruns when needed.
Workflow 4: SQL Linting with sqlfluff
Consistent SQL style and syntax checking catches bugs before they reach production. sqlfluff is a configurable SQL linter that understands dialect-specific syntax โ including Athena/Trino and Redshift.
# .github/workflows/sql-lint.yml
name: SQL Lint
on:
pull_request:
paths:
- "dbt/models/**/*.sql"
- "dbt/tests/**/*.sql"
jobs:
sqlfluff:
name: Lint SQL files
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install sqlfluff
run: pip install sqlfluff sqlfluff-templater-dbt dbt-redshift==1.7.*
- name: Run sqlfluff lint
run: |
sqlfluff lint dbt/models/ \
--dialect redshift \
--templater dbt \
--format github-annotation \
--annotation-level warning \
--processes 4
- name: Run sqlfluff fix (check mode)
run: |
sqlfluff fix dbt/models/ \
--dialect redshift \
--templater dbt \
--check \
--no-safety
Your .sqlfluff configuration file controls which rules apply:
# .sqlfluff
[sqlfluff]
templater = dbt
dialect = redshift
max_line_length = 120
exclude_rules = L031, L034
[sqlfluff:templater:dbt]
project_dir = ./dbt
[sqlfluff:rules:L010]
# Keywords should be uppercase
capitalisation_policy = upper
[sqlfluff:rules:L014]
# Unquoted identifiers should be lowercase
unquoted_identifiers_policy = column_aliases
Branch Protection and Environment Secrets
To enforce that all these checks must pass before merging:
- Go to Settings โ Branches โ Add rule for your
mainbranch. - Enable Require status checks to pass before merging.
- Add
dbt CI / Run dbt TestsandSQL Lint / Lint SQL filesas required status checks. - Enable Require branches to be up to date before merging.
For environment-specific secrets (separating staging and production credentials), use GitHub Environments (Settings โ Environments). Each environment can have its own secrets and required reviewers โ production deployments can be gated on manual approval before the workflow runs.
# In your deploy workflow, reference the environment:
jobs:
deploy:
environment:
name: production
url: https://your-dashboard.example.com
This setup โ combined with DataOps practices like data quality checks and automated testing โ creates a robust delivery pipeline for data products. For monitoring what happens after deployment, see Monitoring Data Pipelines with CloudWatch.
Conclusion
GitHub Actions is a practical, low-overhead path to CI/CD for data teams. The four workflows here cover the most common needs: testing transformations in isolation, deploying job scripts automatically, orchestrating pipeline execution, and enforcing SQL quality standards. Together they eliminate manual deployments, catch regressions before they reach production, and give your team confidence that every merge to main results in a tested, deployed, and running pipeline.
If your data team is ready to move from manual deployments to a proper CI/CD process, contact Infra IT Consulting. We help Canadian organisations build DataOps workflows that are reliable, auditable, and built for scale.
Related posts
Book a free 30-minute consultation to discuss your data engineering and analytics needs.
Talk to our team โ