Using AWS DMS for Zero-Downtime Database Migrations
Database migrations are among the highest-risk operations in data engineering. Moving a production Oracle database to Amazon Aurora PostgreSQL, replicating an on-premises SQL Server instance to Amazon RDS, or streaming MySQL changes into a data lake β each of these carries real consequences if downtime exceeds the maintenance window or data integrity is compromised.
AWS Database Migration Service (AWS DMS) was built to make these migrations safer. Used correctly, it enables continuous replication with near-zero downtime cutover windows. Used incorrectly, it creates subtle data integrity issues that are harder to detect than a clean failure. This guide covers how to do it correctly.
What AWS DMS Does (and Does Not Do)
AWS DMS handles two distinct phases of a database migration:
Full load: DMS reads all existing rows from the source database and writes them to the target. This is a bulk copy β DMS does not lock the source, does not interrupt ongoing writes, and does not guarantee point-in-time consistency across tables without additional configuration.
Change Data Capture (CDC): After the full load completes, DMS switches to reading the source databaseβs transaction log (binary log for MySQL, redo log for Oracle, WAL for PostgreSQL) and applying changes to the target in near real time. This is the mechanism that enables zero-downtime migrations β the source database continues serving production traffic while DMS keeps the target synchronised.
What DMS does not do: schema conversion. DMS replicates data but assumes the target schema already exists and is compatible with the source. For heterogeneous migrations (Oracle to PostgreSQL, SQL Server to MySQL), AWS Schema Conversion Tool (SCT) handles the schema translation β stored procedures, triggers, sequences, and data type mappings. DMS then handles the data replication. Always run SCT before DMS for heterogeneous migrations.
DMS also does not replicate DDL changes (ALTER TABLE, CREATE INDEX) during CDC by default for all database combinations. If your source schema changes during migration, you need to apply the DDL to the target manually and resume replication. Monitor your migration window accordingly.
Setting Up a DMS Replication Instance
The DMS replication instance is the EC2-based compute that runs your migration tasks. Size it appropriately β it is a common mistake to under-provision here:
- For migrations up to 100 GB with moderate change volume:
dms.t3.mediumordms.t3.large - For migrations 100 GB - 1 TB or high CDC throughput:
dms.r5.largeordms.r5.xlarge - For very large databases (> 1 TB) or parallel full load of many tables:
dms.r5.2xlargeor larger
The replication instance needs network connectivity to both the source (on-premises via AWS Direct Connect or VPN, or another AWS region via VPC peering) and the target. Place it in the same VPC and Availability Zone as your target database to minimise data transfer latency and avoid cross-AZ transfer costs.
Enable Multi-AZ for the replication instance if the migration spans more than a day or two β a replication instance failure without Multi-AZ means the task restarts from its last checkpoint, potentially hours of replay from the transaction log.
Configuring Change Data Capture
CDC configuration requirements differ by source database:
MySQL / Amazon Aurora MySQL: Enable binary logging with binlog_format = ROW and binlog_row_image = FULL. Set binlog_retention_hours (or the equivalent AWS RDS parameter) to at least 24 hours β longer if your migration might pause unexpectedly. DMS needs to read binary log entries back to when it established its starting position.
-- Verify binary log settings on MySQL source
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
SHOW BINARY LOGS;
PostgreSQL / Amazon Aurora PostgreSQL: Enable logical replication: wal_level = logical, max_replication_slots >= 1, max_wal_senders >= 1. DMS creates a replication slot on the source that holds WAL segments until DMS has consumed them. Monitor replication slot lag β if DMS falls behind, WAL segments accumulate and disk space on the source can fill up, causing the PostgreSQL instance to halt all writes.
-- Monitor replication slot lag on PostgreSQL source
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
Oracle: DMS requires supplemental logging enabled at minimum for the tables being migrated. Full supplemental logging at the database level is simpler to configure but generates more redo log volume. Confirm that the DMS database user has SELECT ANY TABLE, EXECUTE on DBMS_LOGMNR, and LOGMINING (Oracle 12c+) privileges.
Microsoft SQL Server: Enable MS-CDC on the source database and each table being migrated. The DMS user needs sysadmin or equivalent permissions. SQL Server CDC uses the SQL Server Agent, which must be running and healthy throughout the migration.
Migration Task Configuration
DMS migration tasks have configuration options that significantly affect performance and reliability:
LOB (Large Object) handling: Tables with BLOB, CLOB, TEXT, or JSON columns require special handling. DMS supports three LOB modes:
Full LOB mode: DMS retrieves LOBs in separate passes after the main row data. Slowest but most accurate.Limited LOB mode: DMS truncates LOBs larger than the configured maximum size. Fast but risks data loss for large objects.Inline LOB mode: LOBs up to 100 KB are handled inline; larger ones fall back to full LOB mode. Best balance for mixed workloads.
Use Limited LOB mode with a large maximum size only if you have verified your source data has no LOBs that exceed the limit. Silent truncation of production data is worse than a slower migration.
Parallel load settings: For large tables, DMS can partition the full load across multiple threads. Configure ParallelLoadThreads and ParallelLoadBufferSize in the table mapping rules. For a table with a sequential integer primary key, DMS can split the load into N segments and load them in parallel, significantly reducing full load time.
Table mapping and transformation: DMS table mapping rules (JSON configuration) control which tables and columns are replicated, with optional transformations. You can rename tables, filter rows, convert column names from camelCase to snake_case, and include or exclude specific tables by schema and name pattern. This is where you implement the structural differences between source and target schemas that SCT did not handle.
{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "include-app-schema",
"object-locator": { "schema-name": "app", "table-name": "%" },
"rule-action": "include"
},
{
"rule-type": "transformation",
"rule-id": "2",
"rule-name": "lowercase-column-names",
"rule-action": "convert-lowercase",
"rule-target": "column",
"object-locator": { "schema-name": "%", "table-name": "%" }
}
]
}
Data Validation
DMS includes a built-in data validation feature that compares row counts and data checksums between source and target. Enable it in the task configuration with EnableValidation = true. Validation runs during and after the full load, and continues during CDC to catch any replication discrepancies.
Validation results appear in the DMS console and CloudWatch Logs. A task showing TableError or RowError in the validation summary indicates specific rows or tables with data mismatches. Investigate immediately β these are often caused by LOB truncation, unsupported data type conversions, or concurrent DDL on the source during migration.
Do not rely solely on DMS validation. Run your own application-level validation queries:
-- Compare row counts across matched date ranges
-- Source (run against source DB)
SELECT COUNT(*) FROM orders WHERE created_at >= '2024-01-01';
-- Target (run against target DB)
SELECT COUNT(*) FROM orders WHERE created_at >= '2024-01-01';
-- Compare checksum of critical columns
SELECT MD5(CAST(SUM(order_total) AS TEXT)) FROM orders WHERE created_at >= '2024-01-01';
Run these checks across multiple tables and time windows before scheduling your cutover window.
Cutover Strategy: Minimising Downtime
With DMS CDC running and the source-target replication lag under 1-2 seconds, you are ready for cutover. The minimum-downtime cutover sequence:
- Stop application writes to the source database. This is the start of your actual downtime window. Redirect the application to a maintenance page or queue incoming requests.
- Wait for DMS replication lag to reach zero. Monitor
CDCLatencySourceandCDCLatencyTargetin CloudWatch. Cutover when both are near zero (< 5 seconds) for a sustained period. - Verify final row counts and checksums on critical tables.
- Update application connection strings to point to the target database.
- Start application writes on the target.
- Stop the DMS task after confirming the application is healthy on the target.
For most migrations, this downtime window is under 5 minutes β dominated by the application restart time rather than the data migration itself.
For data lake use cases β where DMS streams CDC events into Amazon S3 or Amazon Kinesis rather than another relational database β there is no cutover in the traditional sense. The source database continues operating, and DMS continuously streams changes to the data lake on Amazon S3. This pattern supports real-time analytics with Kinesis and is commonly used to replicate operational database changes into an analytics layer without impacting the source system.
Common Mistakes to Avoid
Not testing the migration end-to-end before production: Run the complete migration β full load, CDC validation, cutover β on a staging environment with a production-size data copy. DMS behaviour with your specific source schema, data types, and volume is not always predictable from documentation alone.
Forgetting foreign key constraints on the target: During full load, DMS does not guarantee table load order. If foreign key constraints are enabled on the target, parent table rows may not exist when child rows are loaded, causing constraint violation errors. Disable foreign key checks during full load and re-enable after DMS full load completes.
Insufficient replication slot retention on PostgreSQL: If the DMS task pauses for any reason (network interruption, replication instance restart), and the PostgreSQL replication slotβs WAL has been flushed from disk before DMS reconnects, the task cannot resume from where it left off β it must restart the full load. Set max_slot_wal_keep_size appropriately and monitor slot lag continuously during the migration.
Skipping the SCT report: AWS SCT generates a migration assessment report showing the percentage of objects that can be converted automatically and which require manual remediation. Ignoring this report and discovering complex stored procedure incompatibilities during the migration is a preventable delay.
Conclusion
AWS DMS enables genuinely zero-downtime database migrations when configured correctly. The keys are proper CDC setup on the source, right-sized replication instances, careful LOB handling, thorough data validation before cutover, and a practiced cutover sequence. The most successful migrations treat DMS as one component of a larger migration plan β not a magic button β and invest time in schema conversion, staging environment testing, and rollback planning. Ready to build or optimise your AWS data infrastructure? Contact the Infra IT Consulting team for a free consultation.
Related posts
Book a free 30-minute consultation to discuss your data engineering and analytics needs.
Talk to our team β