Infra IT Consulting logo Infra ITC
Tech Tutorials & How-Tos github-actionsci-cddataops

CI/CD for Data Pipelines with GitHub Actions

By Infra IT Consulting ยท ยท 10 min read

Continuous integration and deployment is table stakes in software engineering, but adoption in data teams has historically been uneven. Data pipelines get deployed manually, SQL transformations are edited directly in production, and Glue job scripts are uploaded by hand. The result is brittle infrastructure, environment drift, and incidents that are hard to reproduce.

GitHub Actions makes CI/CD for data pipelines genuinely accessible. It integrates directly with your code repository, supports AWS CLI and SDK operations natively, and costs nothing for small-to-medium teams using public or private repositories within the free tier limits. This tutorial shows four concrete workflows: running dbt tests, deploying Glue jobs, triggering Step Functions, and linting SQL with sqlfluff.

Repository Structure

Organise your data repository so that the CI/CD workflows can target logical components:

data-platform/
โ”œโ”€โ”€ .github/
โ”‚   โ””โ”€โ”€ workflows/
โ”‚       โ”œโ”€โ”€ dbt-ci.yml
โ”‚       โ”œโ”€โ”€ glue-deploy.yml
โ”‚       โ”œโ”€โ”€ step-functions-trigger.yml
โ”‚       โ””โ”€โ”€ sql-lint.yml
โ”œโ”€โ”€ dbt/
โ”‚   โ”œโ”€โ”€ models/
โ”‚   โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ dbt_project.yml
โ”‚   โ””โ”€โ”€ profiles.yml
โ”œโ”€โ”€ glue/
โ”‚   โ””โ”€โ”€ jobs/
โ”‚       โ”œโ”€โ”€ ingest_raw.py
โ”‚       โ””โ”€โ”€ transform_orders.py
โ”œโ”€โ”€ step_functions/
โ”‚   โ””โ”€โ”€ pipeline_definition.json
โ””โ”€โ”€ .sqlfluff

Secrets are stored in GitHub repository secrets (Settings โ†’ Secrets and variables โ†’ Actions) and never in code. At minimum, you need AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION.

Workflow 1: Running dbt Tests on Pull Requests

This workflow runs whenever a pull request modifies dbt models or tests. It sets up a slim dbt environment, connects to Redshift using a CI-specific schema, runs the changed models, and executes all tests.

# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    paths:
      - "dbt/**"
    branches:
      - main
      - develop

env:
  DBT_PROFILES_DIR: ./dbt

jobs:
  dbt-test:
    name: Run dbt Tests
    runs-on: ubuntu-latest
    environment: ci

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dbt-redshift
        run: pip install dbt-redshift==1.7.*

      - name: Create dbt profiles for CI
        env:
          REDSHIFT_HOST: ${{ secrets.REDSHIFT_HOST }}
          REDSHIFT_USER: ${{ secrets.REDSHIFT_USER }}
          REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
          REDSHIFT_DB: ${{ secrets.REDSHIFT_DB }}
          REDSHIFT_PORT: 5439
        run: |
          mkdir -p ~/.dbt
          cat > ~/.dbt/profiles.yml << EOF
          data_platform:
            target: ci
            outputs:
              ci:
                type: redshift
                host: "${REDSHIFT_HOST}"
                user: "${REDSHIFT_USER}"
                password: "${REDSHIFT_PASSWORD}"
                port: ${REDSHIFT_PORT}
                dbname: "${REDSHIFT_DB}"
                schema: "ci_${{ github.event.pull_request.number }}"
                threads: 4
          EOF

      - name: dbt debug (verify connection)
        working-directory: ./dbt
        run: dbt debug

      - name: dbt deps
        working-directory: ./dbt
        run: dbt deps

      - name: dbt build (run + test)
        working-directory: ./dbt
        run: dbt build --target ci

      - name: Clean up CI schema
        if: always()
        working-directory: ./dbt
        env:
          REDSHIFT_HOST: ${{ secrets.REDSHIFT_HOST }}
          REDSHIFT_USER: ${{ secrets.REDSHIFT_USER }}
          REDSHIFT_PASSWORD: ${{ secrets.REDSHIFT_PASSWORD }}
        run: |
          python - <<'PYEOF'
          import psycopg2, os
          conn = psycopg2.connect(
              host=os.environ["REDSHIFT_HOST"],
              user=os.environ["REDSHIFT_USER"],
              password=os.environ["REDSHIFT_PASSWORD"],
              dbname="analytics",
              port=5439
          )
          conn.autocommit = True
          with conn.cursor() as cur:
              schema = f"ci_{os.environ.get('PR_NUMBER', 'unknown')}"
              cur.execute(f"DROP SCHEMA IF EXISTS {schema} CASCADE")
          PYEOF
        env:
          PR_NUMBER: ${{ github.event.pull_request.number }}

The CI schema โ€” named after the PR number โ€” gives each pull request an isolated namespace. The cleanup step drops it on workflow completion regardless of whether the tests passed or failed.

Workflow 2: Deploying AWS Glue Jobs

This workflow deploys Glue job scripts to S3 and updates the Glue job definition whenever changes are pushed to the main branch. It supports multiple jobs via a matrix strategy.

# .github/workflows/glue-deploy.yml
name: Deploy Glue Jobs

on:
  push:
    branches:
      - main
    paths:
      - "glue/jobs/**"

jobs:
  deploy-glue:
    name: Deploy ${{ matrix.job_name }}
    runs-on: ubuntu-latest
    environment: production

    strategy:
      matrix:
        include:
          - job_name: ingest_raw
            script_path: glue/jobs/ingest_raw.py
            glue_job_name: prod-ingest-raw
          - job_name: transform_orders
            script_path: glue/jobs/transform_orders.py
            glue_job_name: prod-transform-orders

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ca-central-1

      - name: Upload script to S3
        run: |
          aws s3 cp ${{ matrix.script_path }} \
            s3://${{ secrets.GLUE_SCRIPTS_BUCKET }}/jobs/${{ matrix.job_name }}/script.py \
            --sse AES256

      - name: Update Glue job definition
        run: |
          aws glue update-job \
            --job-name "${{ matrix.glue_job_name }}" \
            --job-update '{
              "Command": {
                "Name": "glueetl",
                "ScriptLocation": "s3://${{ secrets.GLUE_SCRIPTS_BUCKET }}/jobs/${{ matrix.job_name }}/script.py",
                "PythonVersion": "3"
              },
              "GlueVersion": "4.0",
              "NumberOfWorkers": 10,
              "WorkerType": "G.1X",
              "Timeout": 120,
              "DefaultArguments": {
                "--job-bookmark-option": "job-bookmark-enable",
                "--enable-metrics": "",
                "--enable-continuous-cloudwatch-log": "true",
                "--TempDir": "s3://${{ secrets.GLUE_TEMP_BUCKET }}/temp/"
              }
            }'

      - name: Verify job update
        run: |
          aws glue get-job --job-name "${{ matrix.glue_job_name }}" \
            --query 'Job.{Name: Name, GlueVersion: GlueVersion, ModifiedOn: LastModifiedOn}' \
            --output table

This pattern keeps Glue job scripts in version control โ€” the S3 upload and update-job call happen automatically on every merge to main. Teams no longer upload scripts manually or lose track of which version is running in production.

Workflow 3: Triggering AWS Step Functions

For pipelines orchestrated by Step Functions, you may want to trigger a test execution on deployment or on a schedule:

# .github/workflows/step-functions-trigger.yml
name: Trigger Pipeline Execution

on:
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment (staging/production)"
        required: true
        default: "staging"
        type: choice
        options:
          - staging
          - production
      execution_date:
        description: "Execution date (YYYY-MM-DD)"
        required: false
        default: ""

  schedule:
    - cron: "0 6 * * *"  # Run daily at 06:00 UTC

jobs:
  trigger-pipeline:
    name: Start Step Functions Execution
    runs-on: ubuntu-latest

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ca-central-1

      - name: Set execution date
        id: set-date
        run: |
          if [ -n "${{ github.event.inputs.execution_date }}" ]; then
            echo "exec_date=${{ github.event.inputs.execution_date }}" >> "$GITHUB_OUTPUT"
          else
            echo "exec_date=$(date -u +%Y-%m-%d)" >> "$GITHUB_OUTPUT"
          fi

      - name: Start Step Functions execution
        id: start-execution
        run: |
          ENV="${{ github.event.inputs.environment || 'production' }}"
          STATE_MACHINE_ARN="${{ secrets.STATE_MACHINE_ARN }}"

          EXECUTION_ARN=$(aws stepfunctions start-execution \
            --state-machine-arn "$STATE_MACHINE_ARN" \
            --name "github-actions-${{ github.run_id }}" \
            --input "{\"execution_date\": \"${{ steps.set-date.outputs.exec_date }}\", \"environment\": \"$ENV\"}" \
            --query 'executionArn' \
            --output text)

          echo "execution_arn=$EXECUTION_ARN" >> "$GITHUB_OUTPUT"
          echo "Execution started: $EXECUTION_ARN"

      - name: Wait for execution to complete
        run: |
          EXECUTION_ARN="${{ steps.start-execution.outputs.execution_arn }}"
          MAX_WAIT=1800  # 30 minutes
          ELAPSED=0
          INTERVAL=30

          while [ $ELAPSED -lt $MAX_WAIT ]; do
            STATUS=$(aws stepfunctions describe-execution \
              --execution-arn "$EXECUTION_ARN" \
              --query 'status' \
              --output text)

            echo "Status: $STATUS (elapsed: ${ELAPSED}s)"

            if [ "$STATUS" = "SUCCEEDED" ]; then
              echo "Pipeline execution succeeded."
              exit 0
            elif [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "TIMED_OUT" ] || [ "$STATUS" = "ABORTED" ]; then
              echo "Pipeline execution failed with status: $STATUS"
              exit 1
            fi

            sleep $INTERVAL
            ELAPSED=$((ELAPSED + INTERVAL))
          done

          echo "Timed out waiting for execution to complete."
          exit 1

The workflow_dispatch trigger allows manual runs from the GitHub UI with parameters. The schedule trigger runs it automatically each morning. Combining both gives you automated daily runs with the ability to trigger on-demand reruns when needed.

Workflow 4: SQL Linting with sqlfluff

Consistent SQL style and syntax checking catches bugs before they reach production. sqlfluff is a configurable SQL linter that understands dialect-specific syntax โ€” including Athena/Trino and Redshift.

# .github/workflows/sql-lint.yml
name: SQL Lint

on:
  pull_request:
    paths:
      - "dbt/models/**/*.sql"
      - "dbt/tests/**/*.sql"

jobs:
  sqlfluff:
    name: Lint SQL files
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install sqlfluff
        run: pip install sqlfluff sqlfluff-templater-dbt dbt-redshift==1.7.*

      - name: Run sqlfluff lint
        run: |
          sqlfluff lint dbt/models/ \
            --dialect redshift \
            --templater dbt \
            --format github-annotation \
            --annotation-level warning \
            --processes 4

      - name: Run sqlfluff fix (check mode)
        run: |
          sqlfluff fix dbt/models/ \
            --dialect redshift \
            --templater dbt \
            --check \
            --no-safety

Your .sqlfluff configuration file controls which rules apply:

# .sqlfluff
[sqlfluff]
templater = dbt
dialect = redshift
max_line_length = 120
exclude_rules = L031, L034

[sqlfluff:templater:dbt]
project_dir = ./dbt

[sqlfluff:rules:L010]
# Keywords should be uppercase
capitalisation_policy = upper

[sqlfluff:rules:L014]
# Unquoted identifiers should be lowercase
unquoted_identifiers_policy = column_aliases

Branch Protection and Environment Secrets

To enforce that all these checks must pass before merging:

  1. Go to Settings โ†’ Branches โ†’ Add rule for your main branch.
  2. Enable Require status checks to pass before merging.
  3. Add dbt CI / Run dbt Tests and SQL Lint / Lint SQL files as required status checks.
  4. Enable Require branches to be up to date before merging.

For environment-specific secrets (separating staging and production credentials), use GitHub Environments (Settings โ†’ Environments). Each environment can have its own secrets and required reviewers โ€” production deployments can be gated on manual approval before the workflow runs.

# In your deploy workflow, reference the environment:
jobs:
  deploy:
    environment:
      name: production
      url: https://your-dashboard.example.com

This setup โ€” combined with DataOps practices like data quality checks and automated testing โ€” creates a robust delivery pipeline for data products. For monitoring what happens after deployment, see Monitoring Data Pipelines with CloudWatch.

Conclusion

GitHub Actions is a practical, low-overhead path to CI/CD for data teams. The four workflows here cover the most common needs: testing transformations in isolation, deploying job scripts automatically, orchestrating pipeline execution, and enforcing SQL quality standards. Together they eliminate manual deployments, catch regressions before they reach production, and give your team confidence that every merge to main results in a tested, deployed, and running pipeline.

If your data team is ready to move from manual deployments to a proper CI/CD process, contact Infra IT Consulting. We help Canadian organisations build DataOps workflows that are reliable, auditable, and built for scale.

Related posts