AWS CDK for Data Infrastructure: Type-Safe IaC for Data Teams
Infrastructure as code for data teams has historically meant Terraform: declarative HCL, a mature ecosystem, and broad community support. AWS CDK (Cloud Development Kit) offers a different model: you write infrastructure in a general-purpose programming language (TypeScript, Python, Java, C#, or Go), and CDK synthesizes it to CloudFormation. For data teams that already work in TypeScript or Python, CDK offers genuine advantages: type safety catches misconfiguration at compile time, loops and conditionals handle repetitive resources cleanly, and the AWS Construct Library provides high-level abstractions for common patterns.
This tutorial builds a complete data infrastructure stack in TypeScript CDK: an S3 data lake with lifecycle rules, a Glue Data Catalog database and crawler, a Redshift Serverless namespace, and a Step Functions state machine that orchestrates the pipeline.
Setting Up a CDK Project
# Install CDK CLI globally
npm install -g aws-cdk
# Create a new CDK TypeScript project
mkdir data-platform-cdk && cd data-platform-cdk
cdk init app --language typescript
# cdk init already adds aws-cdk-lib and constructs to package.json.
# In CDK v2 every service module used here (S3, Glue, Redshift Serverless,
# Step Functions) ships inside aws-cdk-lib, so no per-service @aws-cdk/* packages are needed.
npm install aws-cdk-lib constructs
The project structure after init:
data-platform-cdk/
├── bin/
│   └── data-platform-cdk.ts         # App entry point
├── lib/
│   └── data-platform-cdk-stack.ts   # Main stack definition
├── test/
│   └── data-platform-cdk.test.ts
├── cdk.json
├── package.json
└── tsconfig.json
Defining S3 Buckets with Lifecycle Rules
A well-structured S3 data lake uses separate buckets (or prefixes) for raw, processed, and curated data, with lifecycle policies that automatically transition objects to cheaper storage tiers.
// lib/constructs/data-lake-buckets.ts
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
export interface DataLakeBucketsProps {
environment: string;
retentionDays: number;
}
export class DataLakeBuckets extends Construct {
public readonly rawBucket: s3.Bucket;
public readonly processedBucket: s3.Bucket;
public readonly curatedBucket: s3.Bucket;
constructor(scope: Construct, id: string, props: DataLakeBucketsProps) {
super(scope, id);
const { environment, retentionDays } = props;
// Raw zone: stores unmodified ingested data
this.rawBucket = new s3.Bucket(this, 'RawBucket', {
bucketName: `data-lake-raw-${environment}-${cdk.Aws.ACCOUNT_ID}`,
versioned: true,
encryption: s3.BucketEncryption.S3_MANAGED,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
enforceSSL: true,
removalPolicy: cdk.RemovalPolicy.RETAIN,
lifecycleRules: [
{
id: 'TransitionToIA',
transitions: [
{
storageClass: s3.StorageClass.INFREQUENT_ACCESS,
transitionAfter: cdk.Duration.days(30),
},
{
storageClass: s3.StorageClass.GLACIER,
transitionAfter: cdk.Duration.days(90),
},
],
},
{
id: 'ExpireOldVersions',
noncurrentVersionExpiration: cdk.Duration.days(30),
abortIncompleteMultipartUploadAfter: cdk.Duration.days(7),
},
],
});
// Processed zone: cleaned and typed Parquet data
this.processedBucket = new s3.Bucket(this, 'ProcessedBucket', {
bucketName: `data-lake-processed-${environment}-${cdk.Aws.ACCOUNT_ID}`,
versioned: false,
encryption: s3.BucketEncryption.S3_MANAGED,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
enforceSSL: true,
removalPolicy: cdk.RemovalPolicy.RETAIN,
lifecycleRules: [
{
id: 'TransitionToIA',
transitions: [
{
storageClass: s3.StorageClass.INFREQUENT_ACCESS,
transitionAfter: cdk.Duration.days(60),
},
],
},
{
id: 'AbortIncompleteUploads',
abortIncompleteMultipartUploadAfter: cdk.Duration.days(7),
},
],
});
// Curated zone: aggregated, business-ready datasets
this.curatedBucket = new s3.Bucket(this, 'CuratedBucket', {
bucketName: `data-lake-curated-${environment}-${cdk.Aws.ACCOUNT_ID}`,
versioned: false,
encryption: s3.BucketEncryption.S3_MANAGED,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
enforceSSL: true,
removalPolicy: cdk.RemovalPolicy.RETAIN,
});
// Output bucket ARNs for cross-stack references
new cdk.CfnOutput(this, 'RawBucketArn', { value: this.rawBucket.bucketArn });
new cdk.CfnOutput(this, 'ProcessedBucketArn', { value: this.processedBucket.bucketArn });
new cdk.CfnOutput(this, 'CuratedBucketArn', { value: this.curatedBucket.bucketArn });
}
}
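The three bucket definitions above differ in only a few settings, which is the kind of repetition that loops handle well. The following is a minimal sketch, not part of the original construct: `ZoneSpec`, `ZONE_DEFAULTS`, and `zoneSpecs` are hypothetical names, and the account ID is a placeholder. It describes the zones as plain data that a construct could iterate over.

```typescript
// Hypothetical helper types: describe each data lake zone as plain data.
type Zone = 'raw' | 'processed' | 'curated';

interface ZoneSpec {
  zone: Zone;
  bucketName: string;
  versioned: boolean;
  // Days before objects move to Infrequent Access; undefined means no transition.
  transitionToIaAfterDays?: number;
}

// Values mirror the three bucket definitions in the construct above.
const ZONE_DEFAULTS: Record<Zone, { versioned: boolean; transitionToIaAfterDays?: number }> = {
  raw: { versioned: true, transitionToIaAfterDays: 30 },
  processed: { versioned: false, transitionToIaAfterDays: 60 },
  curated: { versioned: false },
};

function zoneSpecs(environment: string, accountId: string): ZoneSpec[] {
  const zones: Zone[] = ['raw', 'processed', 'curated'];
  return zones.map((zone) => ({
    zone,
    bucketName: `data-lake-${zone}-${environment}-${accountId}`,
    ...ZONE_DEFAULTS[zone],
  }));
}

// Inside a construct, a loop over these specs could replace three near-identical blocks:
//   for (const spec of zoneSpecs(environment, cdk.Aws.ACCOUNT_ID)) { new s3.Bucket(this, ...) }
const specs = zoneSpecs('dev', '123456789012');
console.log(specs.map((s) => s.bucketName));
```

Keeping the per-zone differences in one table makes it harder for a setting such as encryption or public access blocking to drift between zones, at the cost of slightly less explicit code.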
Defining a Glue Database and Crawler
The Glue Data Catalog database and crawler define the metadata layer that makes S3 data queryable via Athena.
// lib/constructs/glue-catalog.ts
import * as cdk from 'aws-cdk-lib';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
export interface GlueCatalogProps {
processedBucket: s3.IBucket;
environment: string;
}
export class GlueCatalog extends Construct {
public readonly database: glue.CfnDatabase;
public readonly crawlerRole: iam.Role;
constructor(scope: Construct, id: string, props: GlueCatalogProps) {
super(scope, id);
const { processedBucket, environment } = props;
// Glue Data Catalog database
this.database = new glue.CfnDatabase(this, 'Database', {
catalogId: cdk.Aws.ACCOUNT_ID,
databaseInput: {
name: `data_lake_${environment}`,
description: `Data lake catalog for ${environment} environment`,
locationUri: `s3://${processedBucket.bucketName}/`,
},
});
// IAM role for the Glue crawler
this.crawlerRole = new iam.Role(this, 'CrawlerRole', {
roleName: `glue-crawler-role-${environment}`,
assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
managedPolicies: [
iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
],
});
// Grant the crawler read access to the processed bucket
processedBucket.grantRead(this.crawlerRole);
// Glue crawler: discovers schema from Parquet files in S3
new glue.CfnCrawler(this, 'ProcessedDataCrawler', {
name: `processed-data-crawler-${environment}`,
role: this.crawlerRole.roleArn,
// Ref on a Glue database resolves to its name; using it makes the crawler depend on the database
databaseName: this.database.ref,
description: 'Crawls processed Parquet data and updates the Glue Data Catalog',
targets: {
s3Targets: [
{ path: `s3://${processedBucket.bucketName}/orders/` },
{ path: `s3://${processedBucket.bucketName}/customers/` },
{ path: `s3://${processedBucket.bucketName}/products/` },
],
},
schedule: {
scheduleExpression: 'cron(0 6 * * ? *)', // Daily at 6 AM UTC
},
schemaChangePolicy: {
updateBehavior: 'UPDATE_IN_DATABASE',
deleteBehavior: 'LOG',
},
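recrawlPolicy: {
// CRAWL_NEW_FOLDERS_ONLY requires both schema change behaviors to be LOG,
// so a full recrawl is used here to keep UPDATE_IN_DATABASE valid.
recrawlBehavior: 'CRAWL_EVERYTHING',
},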
configuration: JSON.stringify({
Version: 1.0,
Grouping: {
TableGroupingPolicy: 'CombineCompatibleSchemas',
},
CrawlerOutput: {
Partitions: { AddOrUpdateBehavior: 'InheritFromTable' },
Tables: { AddOrUpdateBehavior: 'MergeNewColumns' },
},
}),
});
new cdk.CfnOutput(this, 'DatabaseName', { value: `data_lake_${environment}` });
}
}
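The crawler's `s3Targets` list repeats the same path pattern for every table prefix. As a sketch only, the hypothetical helper `buildS3Targets` below (not a CDK API) generates the `{ path }` entries from a list of prefixes. The bucket name in the usage line is a placeholder; inside a construct you would pass `processedBucket.bucketName` instead.

```typescript
// Shape of one entry in the crawler's targets.s3Targets (only the field this tutorial uses).
interface S3Target {
  path: string;
}

// Hypothetical helper: turn a list of table prefixes into crawler targets.
function buildS3Targets(bucketName: string, prefixes: string[]): S3Target[] {
  return prefixes.map((prefix) => {
    // Strip stray slashes so 'orders', '/orders' and 'orders/' all yield the same path.
    const clean = prefix.replace(/^\/+|\/+$/g, '');
    if (clean.length === 0) {
      throw new Error('Crawler target prefix must not be empty');
    }
    return { path: `s3://${bucketName}/${clean}/` };
  });
}

const targets = buildS3Targets('data-lake-processed-dev-123456789012', [
  'orders',
  'customers/',
  '/products',
]);
console.log(targets);
```

Adding a new table to the crawl then becomes a one-word change to the prefix list rather than another copied template string.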
Defining a Redshift Serverless Namespace
Redshift Serverless eliminates cluster management: you pay per RPU (Redshift Processing Unit), billed per second of use, and capacity scales automatically.
// lib/constructs/redshift-serverless.ts
import * as cdk from 'aws-cdk-lib';
import * as redshiftserverless from 'aws-cdk-lib/aws-redshiftserverless';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';
export interface RedshiftServerlessProps {
vpc: ec2.IVpc;
environment: string;
}
export class RedshiftServerlessCluster extends Construct {
public readonly namespaceArn: string;
public readonly workgroupName: string;
constructor(scope: Construct, id: string, props: RedshiftServerlessProps) {
super(scope, id);
const { vpc, environment } = props;
// Admin credentials stored in Secrets Manager
const adminSecret = new secretsmanager.Secret(this, 'AdminSecret', {
secretName: `redshift-serverless-admin-${environment}`,
generateSecretString: {
secretStringTemplate: JSON.stringify({ username: 'admin' }),
generateStringKey: 'password',
excludeCharacters: '"@/\\\'',
passwordLength: 32,
},
});
// IAM role for Redshift to access S3
const redshiftRole = new iam.Role(this, 'RedshiftS3Role', {
roleName: `redshift-s3-role-${environment}`,
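// Redshift Serverless assumes the role through its own service principal
assumedBy: new iam.CompositePrincipal(
new iam.ServicePrincipal('redshift.amazonaws.com'),
new iam.ServicePrincipal('redshift-serverless.amazonaws.com'),
),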
managedPolicies: [
iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonS3ReadOnlyAccess'),
],
});
// Security group for Redshift Serverless
const redshiftSg = new ec2.SecurityGroup(this, 'RedshiftSG', {
vpc,
securityGroupName: `redshift-serverless-sg-${environment}`,
description: 'Security group for Redshift Serverless workgroup',
allowAllOutbound: false,
});
// Namespace (holds the database and credentials)
const namespace = new redshiftserverless.CfnNamespace(this, 'Namespace', {
namespaceName: `data-platform-${environment}`,
dbName: 'analytics',
adminUsername: 'admin',
adminUserPassword: adminSecret.secretValueFromJson('password').unsafeUnwrap(),
iamRoles: [redshiftRole.roleArn],
logExports: ['useractivitylog', 'userlog', 'connectionlog'],
});
// Workgroup (the compute layer, which scales RPUs automatically)
const workgroup = new redshiftserverless.CfnWorkgroup(this, 'Workgroup', {
workgroupName: `data-platform-${environment}`,
namespaceName: namespace.namespaceName!,
baseCapacity: 8, // Minimum 8 RPUs
maxCapacity: 128, // Maximum for auto-scaling
enhancedVpcRouting: true,
publiclyAccessible: false,
subnetIds: vpc.selectSubnets({ subnetType: ec2.SubnetType.PRIVATE_ISOLATED }).subnetIds,
securityGroupIds: [redshiftSg.securityGroupId],
configParameters: [
{ parameterKey: 'enable_user_activity_logging', parameterValue: 'true' },
{ parameterKey: 'search_path', parameterValue: '$user, public, analytics' },
],
});
workgroup.addDependency(namespace);
this.namespaceArn = namespace.attrNamespaceNamespaceArn;
this.workgroupName = workgroup.workgroupName!;
new cdk.CfnOutput(this, 'WorkgroupName', { value: workgroup.workgroupName! });
new cdk.CfnOutput(this, 'AdminSecretArn', { value: adminSecret.secretArn });
}
}
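Because billing is per RPU per second, the `baseCapacity` and `maxCapacity` values chosen above translate directly into cost. The sketch below shows the arithmetic only; `estimateRunCost` is a hypothetical helper, and the price is an input you supply, not a quoted AWS rate.

```typescript
// Estimate the compute cost of one workload run under per-second RPU billing.
// pricePerRpuHour is an input, not a quoted AWS price; look up the rate for your region.
function estimateRunCost(rpus: number, runtimeSeconds: number, pricePerRpuHour: number): number {
  if (rpus <= 0 || runtimeSeconds < 0 || pricePerRpuHour < 0) {
    throw new Error('rpus must be positive; runtime and price must be non-negative');
  }
  const rpuHours = (rpus * runtimeSeconds) / 3600;
  return rpuHours * pricePerRpuHour;
}

// 8 RPUs (the baseCapacity above) for 15 minutes at a placeholder rate of 0.50 per RPU-hour:
// 8 * 900 / 3600 = 2 RPU-hours, so the estimate is 1.00.
console.log(estimateRunCost(8, 900, 0.5).toFixed(2));
```

Running the same calculation with `maxCapacity` gives an upper bound for a query that scales all the way up, which is a useful sanity check before raising that limit.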
Defining a Step Functions State Machine
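This state machine orchestrates the daily ETL pipeline: it starts the Glue crawler, runs the Glue ETL job and waits for that job to finish, then ends in a success state. As written, it does not wait for the crawler to complete before the job starts, and a dbt transformation step is not part of this definition.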
// lib/constructs/pipeline-state-machine.ts
import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';
export interface PipelineStateMachineProps {
glueCrawlerName: string;
glueJobName: string;
environment: string;
}
export class PipelineStateMachine extends Construct {
public readonly stateMachine: sfn.StateMachine;
constructor(scope: Construct, id: string, props: PipelineStateMachineProps) {
super(scope, id);
const { glueCrawlerName, glueJobName, environment } = props;
// Step 1: Start the Glue crawler
const startCrawler = new tasks.GlueStartCrawlerRun(this, 'StartCrawler', {
crawlerName: glueCrawlerName,
resultPath: '$.crawlerResult',
});
// Step 2: Run the Glue ETL job
const runGlueJob = new tasks.GlueStartJobRun(this, 'RunGlueJob', {
glueJobName: glueJobName,
arguments: sfn.TaskInput.fromObject({
'--execution_date': sfn.JsonPath.stringAt('$.execution_date'),
'--environment': environment,
}),
resultPath: '$.glueJobResult',
integrationPattern: sfn.IntegrationPattern.RUN_JOB, // Wait for job completion
});
// Step 3: Handle Glue job failure
const jobFailed = new sfn.Fail(this, 'GlueJobFailed', {
error: 'GlueJobFailed',
cause: 'The Glue ETL job returned a non-success status',
});
// Step 4: Success state
const pipelineSuccess = new sfn.Succeed(this, 'PipelineSuccess', {
comment: 'Daily ETL pipeline completed successfully',
});
// Catch Glue job failures
runGlueJob.addCatch(jobFailed, {
errors: ['Glue.GlueException', 'States.TaskFailed'],
resultPath: '$.error',
});
// Build the state machine definition
const definition = startCrawler
.next(runGlueJob)
.next(pipelineSuccess);
// CloudWatch log group for execution history
const logGroup = new logs.LogGroup(this, 'StateMachineLogGroup', {
logGroupName: `/aws/states/data-pipeline-${environment}`,
retention: logs.RetentionDays.ONE_MONTH,
removalPolicy: cdk.RemovalPolicy.DESTROY,
});
// State machine
this.stateMachine = new sfn.StateMachine(this, 'StateMachine', {
stateMachineName: `data-pipeline-${environment}`,
definitionBody: sfn.DefinitionBody.fromChainable(definition),
timeout: cdk.Duration.hours(2),
stateMachineType: sfn.StateMachineType.STANDARD,
logs: {
destination: logGroup,
level: sfn.LogLevel.ALL,
includeExecutionData: true,
},
tracingEnabled: true,
});
new cdk.CfnOutput(this, 'StateMachineArn', { value: this.stateMachine.stateMachineArn });
}
}
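Starting a crawler returns as soon as the run is accepted, so waiting for it to finish needs an explicit poll loop. The sketch below builds that loop as a plain Amazon States Language (ASL) fragment so its shape is easy to see; `crawlerPollLoop` and the state names are hypothetical. In CDK the same three states would typically map to `tasks.CallAwsService`, `sfn.Choice`, and `sfn.Wait`.

```typescript
// Minimal ASL types for the three states used here.
type AslState =
  | { Type: 'Task'; Resource: string; Parameters: Record<string, string>; ResultPath: string; Next: string }
  | { Type: 'Choice'; Choices: { Variable: string; StringEquals: string; Next: string }[]; Default: string }
  | { Type: 'Wait'; Seconds: number; Next: string };

// Hypothetical helper: build a poll loop that exits to `nextState` once the crawler is READY.
function crawlerPollLoop(
  crawlerName: string,
  nextState: string,
  pollSeconds = 30,
): Record<string, AslState> {
  return {
    GetCrawlerStatus: {
      Type: 'Task',
      // AWS SDK service integration for glue:GetCrawler
      Resource: 'arn:aws:states:::aws-sdk:glue:getCrawler',
      Parameters: { Name: crawlerName },
      ResultPath: '$.crawlerStatus',
      Next: 'IsCrawlerReady',
    },
    IsCrawlerReady: {
      Type: 'Choice',
      // A crawler reports READY when it is idle, i.e. the run has finished.
      Choices: [{ Variable: '$.crawlerStatus.Crawler.State', StringEquals: 'READY', Next: nextState }],
      Default: 'WaitForCrawler',
    },
    WaitForCrawler: { Type: 'Wait', Seconds: pollSeconds, Next: 'GetCrawlerStatus' },
  };
}

console.log(JSON.stringify(crawlerPollLoop('processed-data-crawler-dev', 'RunGlueJob'), null, 2));
```

The loop relies on the state machine's overall timeout to stop polling if the crawler never returns to READY; a production version would likely also branch on a maximum attempt count.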
Composing the Full Stack
// lib/data-platform-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';
import { DataLakeBuckets } from './constructs/data-lake-buckets';
import { GlueCatalog } from './constructs/glue-catalog';
import { RedshiftServerlessCluster } from './constructs/redshift-serverless';
import { PipelineStateMachine } from './constructs/pipeline-state-machine';
export interface DataPlatformStackProps extends cdk.StackProps {
environment: string;
}
export class DataPlatformStack extends cdk.Stack {
constructor(scope: Construct, id: string, props: DataPlatformStackProps) {
super(scope, id, props);
const { environment } = props;
// VPC for Redshift and other private resources
const vpc = new ec2.Vpc(this, 'DataPlatformVpc', {
vpcName: `data-platform-vpc-${environment}`,
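// Redshift Serverless workgroups need subnets in three Availability Zones.
// Set env (account and region) on the stack; environment-agnostic stacks only get two AZs.
maxAzs: 3,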
natGateways: 1,
subnetConfiguration: [
{ name: 'Public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
{ name: 'Private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
{ name: 'PrivateIsolated', subnetType: ec2.SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
],
});
// S3 data lake buckets
const dataLake = new DataLakeBuckets(this, 'DataLake', {
environment,
retentionDays: 365,
});
// Glue Data Catalog
const glueCatalog = new GlueCatalog(this, 'GlueCatalog', {
processedBucket: dataLake.processedBucket,
environment,
});
// Redshift Serverless
const redshift = new RedshiftServerlessCluster(this, 'Redshift', {
vpc,
environment,
});
// Step Functions pipeline orchestrator
const pipeline = new PipelineStateMachine(this, 'Pipeline', {
glueCrawlerName: `processed-data-crawler-${environment}`,
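// The Glue ETL job is not defined in this tutorial; this name must match a job that already exists
glueJobName: 'prod-ingest-raw',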
environment,
});
// Stack-level tags
cdk.Tags.of(this).add('Environment', environment);
cdk.Tags.of(this).add('Project', 'DataPlatform');
cdk.Tags.of(this).add('ManagedBy', 'CDK');
}
}
The cdk synth and cdk deploy Workflow
# Bootstrap CDK in your account/region (once per account/region)
cdk bootstrap aws://123456789012/ca-central-1
# Synthesize CloudFormation templates (validates TypeScript and generates CFN)
cdk synth --context environment=dev
# Preview what will change (like terraform plan)
cdk diff --context environment=dev
# Deploy the stack
cdk deploy DataPlatformStack --context environment=dev
# Deploy to production
cdk deploy DataPlatformStack --context environment=prod \
--require-approval broadening # Require approval for security group changes
# Destroy the dev stack when done
cdk destroy DataPlatformStack --context environment=dev
cdk synth writes the synthesized CloudFormation templates as JSON to the cdk.out/ directory and prints a YAML rendering to the terminal. You can inspect this output to see exactly which CloudFormation resources CDK will create, which is useful for auditing and compliance.
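The `--context environment=dev` flag in the commands above only has an effect if the app entry point reads it. A minimal sketch of that step, assuming a hypothetical `resolveEnvironment` helper and an allowed list limited to the two environments this tutorial deploys:

```typescript
// Environments used in this tutorial's commands; extend the list for your own stages.
const ENVIRONMENTS = ['dev', 'prod'] as const;
type Environment = (typeof ENVIRONMENTS)[number];

// Hypothetical helper: validate the raw context value before it reaches any construct.
// Context values are untyped, so a typo such as "prd" would otherwise flow into resource names.
function resolveEnvironment(raw: unknown): Environment {
  const match = ENVIRONMENTS.find((env) => env === raw);
  if (match === undefined) {
    throw new Error(
      `Unknown environment ${JSON.stringify(raw)}; expected one of ${ENVIRONMENTS.join(', ')}`,
    );
  }
  return match;
}

// In bin/data-platform-cdk.ts this would be wired up roughly as:
//   const app = new cdk.App();
//   const environment = resolveEnvironment(app.node.tryGetContext('environment'));
//   new DataPlatformStack(app, 'DataPlatformStack', { environment });
console.log(resolveEnvironment('dev'));
```

Failing at synth time on a missing or misspelled context value is cheaper than discovering a stack full of resources named after `undefined`.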
CDK vs. Terraform for Data Teams
| Dimension | AWS CDK (TypeScript) | Terraform (HCL) |
|---|---|---|
| Language | TypeScript, Python, Java, Go | HCL (domain-specific) |
| Type safety | Full IDE autocompletion and compile-time errors | HCL validation at plan-time only |
| Loops / conditionals | Native language constructs | for_each, count, dynamic blocks |
| State management | CloudFormation manages state | Terraform state file (S3 + DynamoDB recommended) |
| Drift detection | CloudFormation stack drift detection | terraform plan shows drift |
| Multi-cloud | AWS only (CDKTF is a separate project that targets Terraform providers) | All major clouds |
| Ecosystem | CDK Construct Library (L2/L3 abstractions) | Terraform Registry (community modules) |
| Learning curve | Steeper if unfamiliar with CDK patterns | Gentler HCL syntax, but limits expressiveness |
| Best for | AWS-native teams comfortable with TypeScript/Python | Multi-cloud teams or existing Terraform users |
For AWS-native data teams who are comfortable in TypeScript or Python, CDK offers meaningful developer experience advantages, particularly the type-safe L2 constructs that encode AWS best practices by default. For teams with existing Terraform expertise or multi-cloud requirements, Terraform remains the pragmatic choice. See our Terraform for AWS Data Stacks guide for the equivalent infrastructure in HCL.
To integrate CDK deployments into your CI/CD process, combine them with the GitHub Actions CI/CD workflows covered in our DataOps tutorial: run cdk diff in PR checks and cdk deploy on merge to main.
Conclusion
AWS CDK brings software engineering practices to infrastructure definition. Type safety catches misconfigurations before deployment. The AWS Construct Library encodes AWS best practices into reusable, composable abstractions. And the familiar language tooling (linters, formatters, test frameworks) applies directly to your infrastructure code.
The data infrastructure defined in this tutorial (S3 data lake, Glue Data Catalog, Redshift Serverless, and Step Functions orchestration) forms the backbone of a modern AWS data platform. CDK keeps that infrastructure reproducible, auditable, and maintainable as it evolves.
If your team is building or modernising a data platform on AWS and wants expert guidance on CDK architecture, infrastructure design, or cloud cost optimization, contact Infra IT Consulting. We help Canadian organisations design and deliver production-grade AWS data infrastructure.