The Modern Data Stack Explained: What It Is and When to Use It

By Infra IT Consulting · 9 min read

“Modern data stack” has become one of those terms that everyone in the data industry uses but few define the same way. Vendors use it to market their products. Consultants use it to signal technical currency. Engineers use it to describe a specific set of tools and architectural patterns that emerged from the cloud-native data shift of the late 2010s and early 2020s. For technical decision-makers trying to evaluate whether the modern data stack is right for their organisation, the noise-to-signal ratio is high.

This post cuts through the marketing. We explain what the modern data stack actually is, which AWS services and open-source tools constitute it in practice, what problems it solves well, and, critically, when it is the wrong choice.

What the Modern Data Stack Actually Is

The modern data stack is an architectural pattern, not a specific product or vendor. It emerged as a response to the limitations of traditional on-premises data warehouse architectures, where storage and compute were tightly coupled, ETL was handled by monolithic proprietary tools, and adding new data sources required months of project work.

The defining characteristics are:

  1. Cloud-native storage: data lands in object storage (Amazon S3) that is cheap, scalable, and decoupled from compute
  2. ELT over ETL: data is extracted and loaded into the analytical store first, then transformed using SQL, rather than transformed before loading
  3. SQL-first transformation: data transformation is done in SQL using tools like dbt, making it accessible to analysts and engineers without requiring Spark or Python expertise
  4. Separation of concerns: ingestion, storage, transformation, and serving are handled by separate tools with defined interfaces between them
  5. SaaS or managed services: operational overhead is minimised by using fully managed services rather than self-hosted infrastructure

On AWS, a canonical modern data stack looks like this:

Layer | Common Choices
Ingestion | Fivetran, Airbyte, AWS DMS, AWS Glue
Storage | Amazon S3 (data lake), Amazon Redshift (data warehouse)
Transformation | dbt (running against Redshift or Athena)
Orchestration | Apache Airflow on AWS MWAA, Dagster, or dbt Cloud
Semantic layer | dbt metrics, Cube.dev
BI and serving | Amazon QuickSight, Looker, Metabase
Data catalogue | AWS Glue Data Catalog, dbt docs

The ELT Shift and Why It Matters

The shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform) is the most consequential architectural change in the modern data stack, and it is worth understanding why it happened.

Traditional ETL was born in an era when storage was expensive and compute was scarce outside the data warehouse. It made sense to transform data before loading it (discard columns you did not need, aggregate records to reduce volume, enforce strict schemas) because landing the full raw data in the warehouse was too expensive.

Cloud object storage changed this calculation fundamentally. Amazon S3 Standard storage costs roughly $0.023 per GB per month in the US East (N. Virginia) region, with other regions and storage classes priced differently. Landing the full raw data from every source system costs very little by comparison, and it preserves optionality: if you decide six months later that you need a column you would previously have discarded, it is still there.
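To put that rate in perspective, here is a minimal sketch of the arithmetic. The function name and the 5 TB example volume are assumptions for illustration only; the calculation uses a single flat rate and ignores request charges, data transfer, tiered pricing, and cheaper storage classes.

```python
# Flat per-GB rate quoted in the text. Real pricing varies by region,
# storage class, and volume tier, so treat this as a rough estimate.
S3_RATE_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost_usd(size_gb, rate=S3_RATE_USD_PER_GB_MONTH):
    """Estimate the monthly storage cost of size_gb of data at a flat rate."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * rate, 2)


# Hypothetical volume: 5 TB of raw data, taking 1 TB as 1024 GB.
five_tb_cost = monthly_storage_cost_usd(5 * 1024)
```

At this rate, keeping 5 TB of raw data costs a little under $120 per month in storage, which is why discarding columns at ingestion time rarely pays for itself.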

ELT leverages this by loading everything first and transforming later, inside the analytical engine where you have full SQL expressiveness and can iterate rapidly on the transformation logic without re-ingesting source data. dbt makes this practical by providing a framework for writing modular, tested, version-controlled SQL transformations that produce reliable analytical models.
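The load-first, transform-later pattern can be sketched in a few lines. This is an illustration under stated assumptions, not how dbt or Redshift work internally: SQLite stands in for the analytical engine, and the table names (`raw_orders`, `daily_revenue`, `revenue_by_channel`), the sample rows, and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical source rows, landed untransformed. The "channel" column is not
# needed by the first model but is kept anyway, which is the point of ELT.
RAW_ORDERS = [
    (1, "alice", "2024-01-05", 120.0, "web"),
    (2, "bob", "2024-01-05", 80.0, "store"),
    (3, "alice", "2024-01-06", 40.0, "web"),
]


def load_raw(conn):
    """Extract and Load: copy source rows into a raw table without reshaping them."""
    conn.execute(
        "CREATE TABLE raw_orders ("
        "order_id INTEGER, customer TEXT, order_date TEXT, amount REAL, channel TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)", RAW_ORDERS)


def build_model(conn, name, select_sql):
    """Transform: (re)build a curated table from a SELECT over the raw data."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")
    return conn.execute(f"SELECT * FROM {name} ORDER BY 1").fetchall()


def run_elt():
    conn = sqlite3.connect(":memory:")
    load_raw(conn)
    daily = build_model(
        conn,
        "daily_revenue",
        "SELECT order_date, SUM(amount) AS revenue FROM raw_orders GROUP BY order_date",
    )
    # A later requirement: revenue by channel. No re-ingestion is needed because
    # the raw table still holds the column.
    by_channel = build_model(
        conn,
        "revenue_by_channel",
        "SELECT channel, SUM(amount) AS revenue FROM raw_orders GROUP BY channel",
    )
    conn.close()
    return daily, by_channel
```

Calling `run_elt()` builds both models from the same raw table. The second model is added after the fact without touching ingestion, which is the optionality the ELT pattern is meant to preserve.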

For a detailed treatment of dbt on AWS, see the post on dbt and AWS analytics engineering.

AWS Services in the Modern Data Stack

AWS provides first-party alternatives or complements to almost every layer of the modern data stack.

Amazon S3 is the foundation. Raw data from every source lands here, and curated data is stored here as Parquet. The S3 data lake architecture described in the post on building a data lake on S3 is the storage backbone of any modern data stack on AWS.
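One way to keep raw and curated data apart is a consistent key layout. The sketch below is a hypothetical convention, not the layout from the referenced post: the zone names, the `dt=` date partition, and the `s3_key` helper are assumptions. The `key=value` path style is the Hive-style partitioning that Athena and the Glue Data Catalog can recognise.

```python
from datetime import date

# Assumed zone names for this sketch; the text only distinguishes raw from curated.
ZONES = {"raw", "curated"}


def s3_key(zone, source, table, day, filename):
    """Build an object key of the form zone/source/table/dt=YYYY-MM-DD/filename."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{source}/{table}/dt={day.isoformat()}/{filename}"


# Hypothetical example: one day's CRM contacts, before and after curation.
raw_key = s3_key("raw", "crm", "contacts", date(2024, 1, 5), "part-0000.json")
curated_key = s3_key("curated", "crm", "contacts", date(2024, 1, 5), "part-0000.parquet")
```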

Amazon Redshift Serverless is the primary data warehouse. For the ELT pattern, Redshift is where dbt models run โ€” you load raw data into a staging schema and run dbt to produce curated models in a separate schema. Redshift Serverless auto-scales compute capacity and charges per second of compute used, making it cost-effective for organisations with variable query load.

AWS Glue serves dual purposes: as a managed ETL/ELT runner for Python and Spark workloads, and as the Data Catalog that makes S3 data queryable by Athena and Redshift Spectrum. For ingestion of complex source systems (SAP, legacy databases, APIs without managed connectors), Glue Python shell jobs or Glue ETL jobs provide a flexible, managed execution environment.

Amazon MWAA (Managed Workflows for Apache Airflow) provides orchestration. Airflow DAGs coordinate the sequence: run Glue ingestion job → trigger dbt transformation → refresh QuickSight SPICE dataset → send notification on completion or failure. MWAA removes the operational burden of running Airflow yourself.
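The dependency logic behind that sequence can be sketched without Airflow. The example below deliberately uses only the Python standard library (`graphlib`), not the Airflow API, so it shows the ordering and failure handling rather than a deployable DAG. The task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "run_glue_ingestion": set(),
    "run_dbt_transformation": {"run_glue_ingestion"},
    "refresh_quicksight_spice": {"run_dbt_transformation"},
}


def run_pipeline(tasks, notify):
    """Run tasks in dependency order; stop and notify on the first failure.

    tasks maps each task name to a zero-argument callable.
    notify is called once with a status message, on success or on failure.
    Returns the names of the tasks that completed.
    """
    completed = []
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            notify(f"FAILED at {name}: {exc}")
            return completed
        completed.append(name)
    notify("SUCCESS")
    return completed
```

In a real deployment each callable would be replaced by an Airflow operator or an equivalent, and the scheduler would handle retries and alerting. The point here is only that downstream steps, such as the SPICE refresh, should not run when an upstream step fails.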

Amazon QuickSight delivers BI. For organisations already on AWS, QuickSight’s native integration with Redshift, Athena, and S3 makes it the lowest-friction choice. Its SPICE in-memory engine provides fast query response for dashboards without maintaining a separate caching layer.

When the Modern Data Stack Is the Right Choice

The modern data stack is well-suited to organisations that:

  • Have multiple data sources that need to be unified: at least three or four source systems (CRM, ERP, marketing platforms, product database) where unification provides clear analytical value
  • Have analysts who know SQL and can participate in transformation work using dbt, reducing the engineering team’s burden
  • Need to iterate quickly on data models as business requirements evolve: the ELT pattern and dbt’s modular approach make schema changes significantly cheaper than traditional ETL
  • Want low operational overhead: using managed AWS services and SaaS tools means the data team focuses on analytics rather than infrastructure management
  • Are growing: the architecture scales well from 10 GB to 10 TB of data without fundamental redesign

A startup with a product database, Stripe for payments, and a CRM is an excellent candidate. So is a mid-market retailer with point-of-sale systems, an e-commerce platform, inventory management, and marketing platforms. For practical guidance on starting this journey, see the post on data strategy for startups.

When the Modern Data Stack Is Overkill or Wrong

The modern data stack is not always the right answer, and being clear about this is important.

You have one or two data sources. If your entire analytical need is querying your production database and one third-party data source, a simple Athena-over-S3 setup or even a well-configured business intelligence tool with a direct database connection may be sufficient. You do not need Fivetran, dbt, and Redshift to query two tables.

Your data is primarily real-time or streaming. The modern data stack is optimised for batch ELT. If your core use case is real-time fraud detection, live operations monitoring, or streaming IoT analytics, you need a streaming architecture with Kinesis, Lambda, and a time-series or OLAP store โ€” not a batch pipeline into Redshift.

You do not have SQL-literate analysts. The modern data stack’s value proposition depends on analysts participating in the transformation layer via dbt. If your analytical team is not comfortable writing and reviewing SQL, dbt becomes a black box maintained by engineers, which loses most of its benefits.

Your regulatory requirements prohibit SaaS tools. Some financial services, healthcare, and government organisations cannot use SaaS connectors due to data residency or security requirements. In this case, a fully AWS-native stack (DMS for ingestion, Glue for transformation, Redshift for serving) with no third-party SaaS in the data path may be required.
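The four warning signs above can be collected into a rough checklist. The sketch below only restates this section as code; the `Org` fields, the `assess` function, and the two-source threshold are illustrative assumptions, and no such checklist replaces a proper architecture review.

```python
from dataclasses import dataclass


@dataclass
class Org:
    """Hypothetical summary of an organisation's data situation."""
    source_count: int
    sql_literate_analysts: bool
    primarily_streaming: bool
    saas_prohibited: bool


def assess(org):
    """Return the warning signs from this section that apply; empty means none."""
    notes = []
    if org.source_count <= 2:
        notes.append("few sources: consider Athena over S3 or a direct BI connection")
    if org.primarily_streaming:
        notes.append("streaming core use case: consider a streaming architecture")
    if not org.sql_literate_analysts:
        notes.append("no SQL-literate analysts: dbt risks becoming an engineer-only black box")
    if org.saas_prohibited:
        notes.append("SaaS prohibited: use a fully AWS-native stack (DMS, Glue, Redshift)")
    return notes


# Hypothetical example: the startup profile described earlier raises no flags.
startup_notes = assess(Org(source_count=3, sql_literate_analysts=True,
                           primarily_streaming=False, saas_prohibited=False))
```

Note that the last flag constrains the tool choice rather than ruling out the pattern: the ELT approach still applies with AWS-native services in place of SaaS connectors.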

A Realistic Implementation Sequence

For organisations adopting the modern data stack on AWS, the implementation sequence matters. A common mistake is trying to implement everything at once: standing up a full Airflow orchestration environment before you even have data flowing is premature.

A pragmatic sequence:

  1. Weeks 1–2: Stand up S3 raw data lake, configure first ingestion connector (Fivetran or DMS for your highest-priority source)
  2. Weeks 3–4: Configure Redshift Serverless, load raw data, validate it matches source
  3. Weeks 5–8: Set up dbt project with staging and mart models for the first analytical use case; deliver first dashboard in QuickSight
  4. Weeks 9–12: Add second and third data sources; introduce dbt tests; configure basic CloudWatch monitoring
  5. Month 4+: Add Airflow orchestration once you have enough jobs to warrant it; introduce data catalogue

This approach delivers business value (working dashboards) within 6–8 weeks while building the foundation incrementally.
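The validation step in weeks 3–4 is easy to skip and worth automating early. A minimal sketch of one way to do it, assuming rows are available as tuples with a primary key in a known position; the `reconcile` function and the sample rows are hypothetical, and a real check would usually run as SQL against both systems and also compare column-level aggregates.

```python
def reconcile(source_rows, loaded_rows, key=0):
    """Compare source and warehouse rows by row count and primary key."""
    source_keys = {row[key] for row in source_rows}
    loaded_keys = {row[key] for row in loaded_rows}
    return {
        "row_count_matches": len(source_rows) == len(loaded_rows),
        "missing_in_warehouse": sorted(source_keys - loaded_keys),
        "unexpected_in_warehouse": sorted(loaded_keys - source_keys),
    }


# Hypothetical data: order 3 exists in the source but never reached the warehouse.
source = [(1, 120.0), (2, 80.0), (3, 40.0)]
loaded = [(1, 120.0), (2, 80.0)]
report = reconcile(source, loaded)
```

Here `report` flags a row count mismatch and lists key 3 as missing, which is the kind of discrepancy that is far cheaper to catch before dashboards are built on top of the data.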

Conclusion

The modern data stack is a powerful set of patterns and tools for organisations that need to unify multiple data sources, enable self-service analytics, and iterate quickly on data models. On AWS, it maps naturally to a set of managed services (S3, Redshift, Glue, MWAA, QuickSight) that together provide a scalable analytical platform with low operational overhead.

But it is not a universal answer. Understanding when it fits and when simpler alternatives suffice is as important as knowing how to build it.

If you are evaluating whether the modern data stack is right for your organisation, or need help designing and implementing an AWS-native analytical platform, contact Infra IT Consulting for an architecture consultation.
