
AWS CDK for Data Infrastructure: Type-Safe IaC for Data Teams

By Infra IT Consulting · May 17, 2024 · 11 min read


Infrastructure as code for data teams has historically meant Terraform: declarative HCL, a mature ecosystem, and broad community support. AWS CDK (Cloud Development Kit) offers a different model: you write infrastructure in a general-purpose programming language (TypeScript, JavaScript, Python, Java, C#, or Go), and CDK synthesizes it to CloudFormation. For data teams that already work in TypeScript or Python, CDK offers concrete advantages: type checking catches many misconfigurations at compile time, loops and conditionals handle repetitive resources cleanly, and the AWS Construct Library provides high-level abstractions for common patterns.

This tutorial builds a complete data infrastructure stack in TypeScript CDK: an S3 data lake with lifecycle rules, a Glue Data Catalog database and crawler, a Redshift Serverless namespace, and a Step Functions state machine that orchestrates the pipeline.
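As a rough illustration of the type-safety and looping advantages mentioned above, here is a minimal sketch in plain TypeScript with no CDK imports. The `Zone` type, the `BucketSpec` interface, and the `planBuckets` function are hypothetical names invented for this example; only the bucket naming pattern is taken from the constructs defined later in this tutorial.

```typescript
// Hypothetical illustration only: none of these names come from the CDK API.
type Zone = 'raw' | 'processed' | 'curated';

interface BucketSpec {
  zone: Zone;
  name: string;
  versioned: boolean;
}

// One loop (here, map) replaces three near-identical resource definitions.
function planBuckets(environment: string, accountId: string): BucketSpec[] {
  const zones: Zone[] = ['raw', 'processed', 'curated'];
  return zones.map((zone) => ({
    zone,
    name: `data-lake-${zone}-${environment}-${accountId}`,
    versioned: zone === 'raw', // only the raw zone keeps object versions
  }));
}

const specs = planBuckets('dev', '123456789012');
console.log(specs.map((s) => s.name));

// A typo such as the following is rejected by the compiler before any deploy:
// const bad: Zone = 'procesed';  // error: not assignable to type 'Zone'
```

The same idea carries over to real constructs: props interfaces constrain what can be passed, and ordinary loops generate repeated resources.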

Setting Up a CDK Project

# Install CDK CLI globally
npm install -g aws-cdk

# Create a new CDK TypeScript project
mkdir data-platform-cdk && cd data-platform-cdk
cdk init app --language typescript

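# cdk init already adds aws-cdk-lib and constructs (CDK v2) to package.json.
# In CDK v2 the stable service modules (S3, Glue, Redshift Serverless, Step Functions)
# all ship inside aws-cdk-lib, so no per-service @aws-cdk/* packages are needed.
# Run this only if the dependencies are missing:
npm install aws-cdk-lib constructs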

The project structure after init:

data-platform-cdk/
├── bin/
│   └── data-platform-cdk.ts    # App entry point
├── lib/
│   └── data-platform-cdk-stack.ts  # Main stack definition
├── test/
│   └── data-platform-cdk.test.ts
├── cdk.json
├── package.json
└── tsconfig.json

Defining S3 Buckets with Lifecycle Rules

A well-structured S3 data lake uses separate buckets (or prefixes) for raw, processed, and curated data, with lifecycle policies that automatically transition objects to cheaper storage tiers.

// lib/constructs/data-lake-buckets.ts
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

export interface DataLakeBucketsProps {
  environment: string;
  retentionDays: number;
}

export class DataLakeBuckets extends Construct {
  public readonly rawBucket: s3.Bucket;
  public readonly processedBucket: s3.Bucket;
  public readonly curatedBucket: s3.Bucket;

  constructor(scope: Construct, id: string, props: DataLakeBucketsProps) {
    super(scope, id);

    const { environment, retentionDays } = props;

    // Raw zone: stores unmodified ingested data
    this.rawBucket = new s3.Bucket(this, 'RawBucket', {
      bucketName: `data-lake-raw-${environment}-${cdk.Aws.ACCOUNT_ID}`,
      versioned: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      enforceSSL: true,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
      lifecycleRules: [
        {
          id: 'TransitionToIA',
          transitions: [
            {
              storageClass: s3.StorageClass.INFREQUENT_ACCESS,
              transitionAfter: cdk.Duration.days(30),
            },
            {
              storageClass: s3.StorageClass.GLACIER,
              transitionAfter: cdk.Duration.days(90),
            },
          ],
        },
        {
          id: 'ExpireOldVersions',
          noncurrentVersionExpiration: cdk.Duration.days(30),
          abortIncompleteMultipartUploadAfter: cdk.Duration.days(7),
        },
      ],
    });

    // Processed zone: cleaned and typed Parquet data
    this.processedBucket = new s3.Bucket(this, 'ProcessedBucket', {
      bucketName: `data-lake-processed-${environment}-${cdk.Aws.ACCOUNT_ID}`,
      versioned: false,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      enforceSSL: true,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
      lifecycleRules: [
        {
          id: 'TransitionToIA',
          transitions: [
            {
              storageClass: s3.StorageClass.INFREQUENT_ACCESS,
              transitionAfter: cdk.Duration.days(60),
            },
          ],
        },
        {
          id: 'AbortIncompleteUploads',
          abortIncompleteMultipartUploadAfter: cdk.Duration.days(7),
        },
      ],
    });

    // Curated zone: aggregated, business-ready datasets
    this.curatedBucket = new s3.Bucket(this, 'CuratedBucket', {
      bucketName: `data-lake-curated-${environment}-${cdk.Aws.ACCOUNT_ID}`,
      versioned: false,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      enforceSSL: true,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Output bucket ARNs for cross-stack references
    new cdk.CfnOutput(this, 'RawBucketArn',       { value: this.rawBucket.bucketArn });
    new cdk.CfnOutput(this, 'ProcessedBucketArn', { value: this.processedBucket.bucketArn });
    new cdk.CfnOutput(this, 'CuratedBucketArn',   { value: this.curatedBucket.bucketArn });
  }
}
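To make the effect of the transition rules easier to reason about, the following sketch computes which storage class an object of a given age would occupy. It is a simplified model in plain TypeScript: `Transition` and `storageClassAt` are hypothetical names, the storage class strings are plain labels, and the model ignores the fact that S3 applies lifecycle rules asynchronously.

```typescript
// Hypothetical helper for reasoning about lifecycle rules; not part of the CDK API.
interface Transition {
  storageClass: string;
  afterDays: number;
}

// Returns the storage class an object of the given age would occupy,
// assuming it starts in STANDARD and every transition whose threshold
// has been reached is applied in order of age.
function storageClassAt(ageDays: number, transitions: Transition[]): string {
  let current = 'STANDARD';
  const ordered = [...transitions].sort((a, b) => a.afterDays - b.afterDays);
  for (const t of ordered) {
    if (ageDays >= t.afterDays) {
      current = t.storageClass;
    }
  }
  return current;
}

// Thresholds copied from the raw bucket's 'TransitionToIA' rule above.
const rawTransitions: Transition[] = [
  { storageClass: 'STANDARD_IA', afterDays: 30 },
  { storageClass: 'GLACIER', afterDays: 90 },
];

for (const age of [10, 45, 120]) {
  console.log(`${age} days -> ${storageClassAt(age, rawTransitions)}`);
}
```

With the raw bucket's thresholds, this reports STANDARD at 10 days, STANDARD_IA at 45 days, and GLACIER at 120 days.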

Defining a Glue Database and Crawler

The Glue Data Catalog database and crawler define the metadata layer that makes S3 data queryable via Athena.

// lib/constructs/glue-catalog.ts
import * as cdk from 'aws-cdk-lib';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

export interface GlueCatalogProps {
  processedBucket: s3.IBucket;
  environment: string;
}

export class GlueCatalog extends Construct {
  public readonly database: glue.CfnDatabase;
  public readonly crawlerRole: iam.Role;

  constructor(scope: Construct, id: string, props: GlueCatalogProps) {
    super(scope, id);

    const { processedBucket, environment } = props;

    // Glue Data Catalog database
    this.database = new glue.CfnDatabase(this, 'Database', {
      catalogId: cdk.Aws.ACCOUNT_ID,
      databaseInput: {
        name: `data_lake_${environment}`,
        description: `Data lake catalog for ${environment} environment`,
        locationUri: `s3://${processedBucket.bucketName}/`,
      },
    });

    // IAM role for the Glue crawler
    this.crawlerRole = new iam.Role(this, 'CrawlerRole', {
      roleName: `glue-crawler-role-${environment}`,
      assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
      ],
    });

    // Grant the crawler read access to the processed bucket
    processedBucket.grantRead(this.crawlerRole);

    // Glue crawler — discovers schema from Parquet files in S3
    new glue.CfnCrawler(this, 'ProcessedDataCrawler', {
      name: `processed-data-crawler-${environment}`,
      role: this.crawlerRole.roleArn,
      // Ref on a Glue database resolves to its name and makes the crawler depend on the database
      databaseName: this.database.ref,
      description: 'Crawls processed Parquet data and updates the Glue Data Catalog',
      targets: {
        s3Targets: [
          { path: `s3://${processedBucket.bucketName}/orders/` },
          { path: `s3://${processedBucket.bucketName}/customers/` },
          { path: `s3://${processedBucket.bucketName}/products/` },
        ],
      },
      schedule: {
        scheduleExpression: 'cron(0 6 * * ? *)',  // Daily at 6 AM UTC
      },
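      schemaChangePolicy: {
        updateBehavior: 'UPDATE_IN_DATABASE',
        deleteBehavior: 'LOG',
      },
      // CRAWL_NEW_FOLDERS_ONLY is only accepted when both schema change behaviors are 'LOG',
      // so recrawl everything to keep UPDATE_IN_DATABASE
      recrawlPolicy: {
        recrawlBehavior: 'CRAWL_EVERYTHING',
      },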
      configuration: JSON.stringify({
        Version: 1.0,
        Grouping: {
          TableGroupingPolicy: 'CombineCompatibleSchemas',
        },
        CrawlerOutput: {
          Partitions: { AddOrUpdateBehavior: 'InheritFromTable' },
          Tables: { AddOrUpdateBehavior: 'MergeNewColumns' },
        },
      }),
    });

    new cdk.CfnOutput(this, 'DatabaseName', { value: `data_lake_${environment}` });
  }
}
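The three hard-coded crawler paths above are a natural candidate for a loop. The sketch below shows one way to derive the `s3Targets` array from a list of table prefixes. `S3Target` and `buildS3Targets` are hypothetical names for this example; inside the construct you would pass `processedBucket.bucketName` as the first argument.

```typescript
// Hypothetical helper: builds the crawler's s3Targets array from table prefixes.
interface S3Target {
  path: string;
}

function buildS3Targets(bucketName: string, prefixes: string[]): S3Target[] {
  return prefixes.map((prefix) => {
    // Strip stray leading/trailing slashes so every path ends in exactly one slash.
    const clean = prefix.replace(/^\/+|\/+$/g, '');
    return { path: `s3://${bucketName}/${clean}/` };
  });
}

const targets = buildS3Targets('data-lake-processed-dev-123456789012', [
  'orders',
  'customers/',
  '/products',
]);
console.log(targets);
```

Adding a new table to the crawl then becomes a one-word change to the prefix list instead of another copied line.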

Defining a Redshift Serverless Namespace

Redshift Serverless eliminates cluster management — you pay per RPU (Redshift Processing Unit) per second, with automatic scaling.

// lib/constructs/redshift-serverless.ts
import * as cdk from 'aws-cdk-lib';
import * as redshiftserverless from 'aws-cdk-lib/aws-redshiftserverless';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';

export interface RedshiftServerlessProps {
  vpc: ec2.IVpc;
  environment: string;
}

export class RedshiftServerlessCluster extends Construct {
  public readonly namespaceArn: string;
  public readonly workgroupName: string;

  constructor(scope: Construct, id: string, props: RedshiftServerlessProps) {
    super(scope, id);

    const { vpc, environment } = props;

    // Admin credentials stored in Secrets Manager
    const adminSecret = new secretsmanager.Secret(this, 'AdminSecret', {
      secretName: `redshift-serverless-admin-${environment}`,
      generateSecretString: {
        secretStringTemplate: JSON.stringify({ username: 'admin' }),
        generateStringKey: 'password',
        excludeCharacters: '"@/\\\'',
        passwordLength: 32,
      },
    });

    // IAM role for Redshift to access S3
    const redshiftRole = new iam.Role(this, 'RedshiftS3Role', {
      roleName: `redshift-s3-role-${environment}`,
      assumedBy: new iam.ServicePrincipal('redshift.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonS3ReadOnlyAccess'),
      ],
    });

    // Security group for Redshift Serverless
    const redshiftSg = new ec2.SecurityGroup(this, 'RedshiftSG', {
      vpc,
      securityGroupName: `redshift-serverless-sg-${environment}`,
      description: 'Security group for Redshift Serverless workgroup',
      allowAllOutbound: false,
    });

    // Namespace (holds the database and credentials)
    const namespace = new redshiftserverless.CfnNamespace(this, 'Namespace', {
      namespaceName: `data-platform-${environment}`,
      dbName: 'analytics',
      adminUsername: 'admin',
      adminUserPassword: adminSecret.secretValueFromJson('password').unsafeUnwrap(),
      iamRoles: [redshiftRole.roleArn],
      logExports: ['useractivitylog', 'userlog', 'connectionlog'],
    });

    // Workgroup (the compute layer — scales RPUs automatically)
    const workgroup = new redshiftserverless.CfnWorkgroup(this, 'Workgroup', {
      workgroupName: `data-platform-${environment}`,
      namespaceName: namespace.namespaceName!,
      baseCapacity: 8,           // Minimum 8 RPUs
      maxCapacity: 128,          // Maximum for auto-scaling
      enhancedVpcRouting: true,
      publiclyAccessible: false,
      subnetIds: vpc.selectSubnets({ subnetType: ec2.SubnetType.PRIVATE_ISOLATED }).subnetIds,
      securityGroupIds: [redshiftSg.securityGroupId],
      configParameters: [
        { parameterKey: 'enable_user_activity_logging', parameterValue: 'true' },
        { parameterKey: 'search_path',                  parameterValue: '$user, public, analytics' },
      ],
    });

    workgroup.addDependency(namespace);

    this.namespaceArn = namespace.attrNamespaceNamespaceArn;
    this.workgroupName = workgroup.workgroupName!;

    new cdk.CfnOutput(this, 'WorkgroupName',   { value: workgroup.workgroupName! });
    new cdk.CfnOutput(this, 'AdminSecretArn',  { value: adminSecret.secretArn });
  }
}
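Because Redshift Serverless bills by RPU and by time, a rough cost estimate is simple arithmetic. The sketch below is a back-of-the-envelope calculator, not an official pricing formula: `estimateQueryCostUsd` is a hypothetical name, and the rate of 0.40 USD per RPU-hour is a placeholder assumption. Check current AWS pricing for your region before relying on any number.

```typescript
// Hypothetical estimator; the price per RPU-hour is an input, not a quoted AWS rate.
function estimateQueryCostUsd(
  rpus: number,
  runtimeSeconds: number,
  pricePerRpuHourUsd: number,
): number {
  if (rpus <= 0 || runtimeSeconds < 0 || pricePerRpuHourUsd < 0) {
    throw new RangeError('rpus must be positive; runtime and price must be non-negative');
  }
  // Convert RPU-seconds to RPU-hours, then apply the hourly rate.
  const rpuHours = (rpus * runtimeSeconds) / 3600;
  return rpuHours * pricePerRpuHourUsd;
}

// Example: 8 RPUs (the baseCapacity above) busy for 15 minutes
// at an assumed 0.40 USD per RPU-hour.
const cost = estimateQueryCostUsd(8, 15 * 60, 0.4);
console.log(cost.toFixed(2));
```

In this example 8 RPUs for 900 seconds is 2 RPU-hours, so the estimate is 0.80 USD at the assumed rate.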

Defining a Step Functions State Machine

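This state machine orchestrates the daily ETL pipeline: it starts the Glue crawler, runs the Glue ETL job and waits for it to finish, then ends in a success or failure state. As written, the crawler task only starts the crawl and moves on without waiting for it to complete, and the dbt transformation that would follow the Glue job is not included.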

// lib/constructs/pipeline-state-machine.ts
import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

export interface PipelineStateMachineProps {
  glueCrawlerName: string;
  glueJobName: string;
  environment: string;
}

export class PipelineStateMachine extends Construct {
  public readonly stateMachine: sfn.StateMachine;

  constructor(scope: Construct, id: string, props: PipelineStateMachineProps) {
    super(scope, id);

    const { glueCrawlerName, glueJobName, environment } = props;

    // Step 1: Start the Glue crawler
    const startCrawler = new tasks.GlueStartCrawlerRun(this, 'StartCrawler', {
      crawlerName: glueCrawlerName,
      resultPath: '$.crawlerResult',
    });

    // Step 2: Run the Glue ETL job
    const runGlueJob = new tasks.GlueStartJobRun(this, 'RunGlueJob', {
      glueJobName: glueJobName,
      arguments: sfn.TaskInput.fromObject({
        '--execution_date': sfn.JsonPath.stringAt('$.execution_date'),
        '--environment': environment,
      }),
      resultPath: '$.glueJobResult',
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,  // Wait for job completion
    });

    // Step 3: Handle Glue job failure
    const jobFailed = new sfn.Fail(this, 'GlueJobFailed', {
      error: 'GlueJobFailed',
      cause: 'The Glue ETL job returned a non-success status',
    });

    // Step 4: Success state
    const pipelineSuccess = new sfn.Succeed(this, 'PipelineSuccess', {
      comment: 'Daily ETL pipeline completed successfully',
    });

    // Catch Glue job failures
    runGlueJob.addCatch(jobFailed, {
      errors: ['Glue.GlueException', 'States.TaskFailed'],
      resultPath: '$.error',
    });

    // Build the state machine definition
    const definition = startCrawler
      .next(runGlueJob)
      .next(pipelineSuccess);

    // CloudWatch log group for execution history
    const logGroup = new logs.LogGroup(this, 'StateMachineLogGroup', {
      logGroupName: `/aws/states/data-pipeline-${environment}`,
      retention: logs.RetentionDays.ONE_MONTH,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // State machine
    this.stateMachine = new sfn.StateMachine(this, 'StateMachine', {
      stateMachineName: `data-pipeline-${environment}`,
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
      timeout: cdk.Duration.hours(2),
      stateMachineType: sfn.StateMachineType.STANDARD,
      logs: {
        destination: logGroup,
        level: sfn.LogLevel.ALL,
        includeExecutionData: true,
      },
      tracingEnabled: true,
    });

    new cdk.CfnOutput(this, 'StateMachineArn', { value: this.stateMachine.stateMachineArn });
  }
}
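The RunGlueJob task reads `$.execution_date` from the execution input, so every execution must supply that field. The sketch below builds a suitable input document. `PipelineInput` and `buildExecutionInput` are hypothetical names, and the `YYYY-MM-DD` format is an assumption, since the date format expected by the Glue job is not specified here.

```typescript
// Hypothetical helper: builds the JSON input for one state machine execution.
interface PipelineInput {
  execution_date: string; // read by the RunGlueJob task via $.execution_date
}

function buildExecutionInput(runDate: Date): string {
  // toISOString() is always UTC, e.g. 2024-05-17T06:00:00.000Z; keep only the date part.
  const payload: PipelineInput = {
    execution_date: runDate.toISOString().slice(0, 10),
  };
  return JSON.stringify(payload);
}

const executionInput = buildExecutionInput(new Date(Date.UTC(2024, 4, 17, 6, 0, 0)));
console.log(executionInput);
```

The resulting string can be passed as the `--input` value of `aws stepfunctions start-execution`, or set as the input of a scheduled rule that triggers the state machine.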

Composing the Full Stack

// lib/data-platform-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';
import { DataLakeBuckets } from './constructs/data-lake-buckets';
import { GlueCatalog } from './constructs/glue-catalog';
import { RedshiftServerlessCluster } from './constructs/redshift-serverless';
import { PipelineStateMachine } from './constructs/pipeline-state-machine';

export interface DataPlatformStackProps extends cdk.StackProps {
  environment: string;
}

export class DataPlatformStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: DataPlatformStackProps) {
    super(scope, id, props);

    const { environment } = props;

    // VPC for Redshift and other private resources
    const vpc = new ec2.Vpc(this, 'DataPlatformVpc', {
      vpcName: `data-platform-vpc-${environment}`,
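      // Note: Redshift Serverless has required subnets spanning three Availability Zones;
      // if workgroup creation fails, raise this to 3 and set an explicit env on the stack
      maxAzs: 2,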
      natGateways: 1,
      subnetConfiguration: [
        { name: 'Public',          subnetType: ec2.SubnetType.PUBLIC,           cidrMask: 24 },
        { name: 'Private',         subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
        { name: 'PrivateIsolated', subnetType: ec2.SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
      ],
    });

    // S3 data lake buckets
    const dataLake = new DataLakeBuckets(this, 'DataLake', {
      environment,
      retentionDays: 365,
    });

    // Glue Data Catalog
    const glueCatalog = new GlueCatalog(this, 'GlueCatalog', {
      processedBucket: dataLake.processedBucket,
      environment,
    });

    // Redshift Serverless
    const redshift = new RedshiftServerlessCluster(this, 'Redshift', {
      vpc,
      environment,
    });

    // Step Functions pipeline orchestrator
    const pipeline = new PipelineStateMachine(this, 'Pipeline', {
      glueCrawlerName: `processed-data-crawler-${environment}`,
      glueJobName: `prod-ingest-raw`,  // Must already exist; this stack does not define the Glue job
      environment,
    });

    // Stack-level tags
    cdk.Tags.of(this).add('Environment', environment);
    cdk.Tags.of(this).add('Project', 'DataPlatform');
    cdk.Tags.of(this).add('ManagedBy', 'CDK');
  }
}

The cdk synth and cdk deploy Workflow

# Bootstrap CDK in your account/region (once per account/region)
cdk bootstrap aws://123456789012/ca-central-1

# Synthesize CloudFormation templates (validates TypeScript and generates CFN)
cdk synth --context environment=dev

# Preview what will change (like terraform plan)
cdk diff --context environment=dev

# Deploy the stack
cdk deploy DataPlatformStack --context environment=dev

# Deploy to production
cdk deploy DataPlatformStack --context environment=prod \
  --require-approval broadening  # Require approval for security group changes

# Destroy the dev stack when done
cdk destroy DataPlatformStack --context environment=dev

cdk synth writes the synthesized CloudFormation templates as JSON to the cdk.out/ directory (and, for a single-stack app, prints the template to stdout as YAML). You can inspect this output to see exactly which CloudFormation resources CDK will create, which is useful for auditing and compliance.
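The commands above pass `--context environment=...`, so the app entry point has to read and validate that value before constructing the stack. The sketch below shows one possible validation step in plain TypeScript. `ENVIRONMENTS`, `EnvironmentName`, and `resolveEnvironment` are hypothetical names, and the `staging` entry is an assumption (this tutorial only uses `dev` and `prod`). The CDK wiring is shown as comments because it depends on your project.

```typescript
// Hypothetical validation helper for the value passed via --context environment=...
const ENVIRONMENTS = ['dev', 'staging', 'prod'] as const;
type EnvironmentName = (typeof ENVIRONMENTS)[number];

function resolveEnvironment(raw: unknown): EnvironmentName {
  const match = ENVIRONMENTS.find((e) => e === raw);
  if (match === undefined) {
    throw new Error(
      `Unknown environment ${JSON.stringify(raw)}; expected one of ${ENVIRONMENTS.join(', ')}`,
    );
  }
  return match;
}

// In bin/data-platform-cdk.ts the raw value would come from the CDK app, roughly:
//   const app = new cdk.App();
//   const environment = resolveEnvironment(app.node.tryGetContext('environment'));
//   new DataPlatformStack(app, 'DataPlatformStack', { environment });
console.log(resolveEnvironment('dev'));
```

Failing fast here means a missing or mistyped `--context` value stops at synth time instead of producing resources with `undefined` in their names.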

CDK vs. Terraform for Data Teams

Dimension	AWS CDK (TypeScript)	Terraform (HCL)
Language	TypeScript, JavaScript, Python, Java, C#, Go	HCL (domain-specific)
Type safety	Full IDE autocompletion and compile-time errors	HCL validation at plan-time only
Loops / conditionals	Native language constructs	`for_each`, `count`, `dynamic` blocks
State management	CloudFormation manages state	Terraform state file (S3 + DynamoDB recommended)
Drift detection	CloudFormation stack drift detection	`terraform plan` shows drift
Multi-cloud	AWS only (CDKTF is a separate project that targets Terraform providers)	All major clouds
Ecosystem	CDK Construct Library (L2/L3 abstractions)	Terraform Registry (community modules)
Learning curve	Steeper if unfamiliar with CDK patterns	Gentler HCL syntax, but limits expressiveness
Best for	AWS-native teams comfortable with TypeScript/Python	Multi-cloud teams or existing Terraform users

For AWS-native data teams who are comfortable in TypeScript or Python, CDK offers meaningful developer experience advantages — particularly the type-safe L2 constructs that encode AWS best practices by default. For teams with existing Terraform expertise or multi-cloud requirements, Terraform remains the pragmatic choice. See our Terraform for AWS Data Stacks guide for the equivalent infrastructure in HCL.

For integrating CDK deployments into your CI/CD process, combine it with the GitHub Actions CI/CD workflows covered in our DataOps tutorial — cdk diff in PR checks and cdk deploy on merge to main.

Conclusion

AWS CDK brings software engineering practices to infrastructure definition. Type checking catches many misconfigurations before deployment. The AWS Construct Library encodes AWS best practices into reusable, composable abstractions. And the familiar language tooling (linters, formatters, test frameworks) applies directly to your infrastructure code.

The data infrastructure defined in this tutorial — S3 data lake, Glue Data Catalog, Redshift Serverless, and Step Functions orchestration — forms the backbone of a modern AWS data platform. CDK keeps that infrastructure reproducible, auditable, and maintainable as it evolves.

If your team is building or modernising a data platform on AWS and wants expert guidance on CDK architecture, infrastructure design, or cloud cost optimization, contact Infra IT Consulting. We help Canadian organisations design and deliver production-grade AWS data infrastructure.
