
Building a Scalable Data Lake on Amazon S3: A Step-by-Step Guide

By Infra IT Consulting · 9 min read

A data lake is only as good as the decisions made in its first few weeks. Teams that skip the foundational design work (bucket structure, zone naming, partitioning strategy, cataloguing, and access control) spend the next two years working around a fragile, ungoverned store that nobody trusts. Teams that get it right build infrastructure that scales from gigabytes to petabytes without architectural rework.

This guide walks through how to design and build a production-grade data lake on Amazon S3, the storage layer that underpins virtually every modern AWS data platform. We cover the zone model, folder structure, metadata management, access control, and cost management decisions that matter most in a real implementation.

The Zone Architecture: Raw, Cleansed, and Curated

The most durable S3 data lake pattern divides data into three logical zones, each with a distinct purpose, quality level, and access policy:

Raw (Bronze) Zone: Data lands here exactly as it arrived from source systems: unmodified, with no fields renamed and no types cast. The raw zone is append-only. No job ever updates or deletes raw data; it is your audit trail and reprocessing source. Store raw data in its original format (JSON, CSV, Avro, database export) under a prefix structure that reflects the source:

s3://your-data-lake-raw/
  source=salesforce/
    entity=opportunities/
      year=2024/month=01/day=12/
        load_20240112T143000Z.json.gz

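The load timestamp in the filename, not just the partition path, makes reprocessing deterministic and avoids silent overwrite bugs: two loads that land in the same day partition get distinct object keys, so a rerun never replaces an earlier file.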
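The key layout above can be generated by a small helper. This is a minimal sketch, assuming the load time is available as a timezone-aware datetime; the function name raw_object_key and the default extension are illustrative, not part of any AWS API.

```python
from datetime import datetime, timezone


def raw_object_key(source: str, entity: str, loaded_at: datetime,
                   ext: str = "json.gz") -> str:
    """Build a raw-zone object key with Hive-style date partitions
    and the load timestamp embedded in the filename."""
    # Normalise to UTC so every key uses the same clock.
    ts = loaded_at.astimezone(timezone.utc)
    return (
        f"source={source}/entity={entity}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"load_{ts:%Y%m%dT%H%M%SZ}.{ext}"
    )


key = raw_object_key(
    "salesforce",
    "opportunities",
    datetime(2024, 1, 12, 14, 30, tzinfo=timezone.utc),
)
print(key)
```

Because the timestamp has one-second resolution, two loads of the same entity in the same second would still collide; a pipeline that can run that often would need a finer-grained suffix.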

Cleansed (Silver) Zone: ETL jobs read from raw and write to cleansed. At this stage, data is validated against a schema, null values are handled, duplicate records are resolved, and formats are standardised to Parquet with Snappy compression. The cleansed zone is the primary source for most analytical queries via Amazon Athena.

Curated (Gold) Zone: Joins across cleansed tables, aggregations, and domain-specific transformations live here. Curated datasets are optimised for specific consumption patterns (BI dashboards, ML feature stores, executive reports) and are typically much smaller in volume than cleansed data.

This three-zone model maps cleanly onto S3’s flat namespace. Using separate S3 buckets per zone (rather than prefix separation within a single bucket) is strongly recommended because it enables distinct bucket policies, S3 Lifecycle rules, and replication configurations per zone. A typical production setup uses four buckets: one each for raw, cleansed, and curated, plus a logging bucket for S3 Server Access Logs.

Choosing Your S3 Bucket and Prefix Structure

The bucket and prefix structure you choose at the start is very difficult to change later, because Athena tables, Glue crawlers, and data pipelines all hardcode these paths. Think carefully about:

Region. Canadian data subject to PIPEDA should stay in ca-central-1. UK GDPR-regulated data belongs in eu-west-2. Do not mix jurisdictions in the same bucket unless you have explicit legal clearance to do so.

Prefix partitioning. For time-series data, always partition by date at the directory level. The standard Hive-style partition format (year=YYYY/month=MM/day=DD) is recognised automatically by AWS Glue crawlers and works with Athena’s partition projection feature once projection is configured on the table. For the cleansed and curated zones, partition on the columns most frequently used in WHERE clauses: typically date and a low-cardinality dimension like region or customer segment. Avoid partitioning on high-cardinality columns such as a customer ID, which produces a very large number of tiny partitions and files.
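For tables that use partition projection, the year/month/day layout is described to Athena through table properties. The sketch below builds those properties as a plain dictionary; the property keys are Athena's documented projection settings, while the function name, bucket name, and year range are assumptions chosen for illustration.

```python
def projection_properties(location_prefix: str, first_year: int,
                          last_year: int) -> dict:
    """Athena partition projection properties for a
    year=YYYY/month=MM/day=DD layout under location_prefix."""
    base = location_prefix.rstrip("/")
    return {
        "projection.enabled": "true",
        "projection.year.type": "integer",
        "projection.year.range": f"{first_year},{last_year}",
        "projection.month.type": "integer",
        "projection.month.range": "1,12",
        "projection.month.digits": "2",  # zero-pad to match month=01
        "projection.day.type": "integer",
        "projection.day.range": "1,31",
        "projection.day.digits": "2",  # zero-pad to match day=05
        # ${year} etc. are placeholders that Athena substitutes at query time.
        "storage.location.template":
            f"{base}/year=${{year}}/month=${{month}}/day=${{day}}/",
    }


# Hypothetical bucket and table prefix.
props = projection_properties(
    "s3://your-data-lake-cleansed/opportunities/", 2020, 2030
)
for name, value in props.items():
    print(f"'{name}'='{value}'")
```

The printed pairs can be placed in the TBLPROPERTIES clause of the table definition. Projecting days 1 to 31 for every month includes dates that do not exist; Athena simply finds no objects under those prefixes. A single date-typed partition column is an alternative design that avoids this.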

For a deeper treatment of this topic, see our guide on S3 Data Partitioning Strategies That Cut Athena Query Costs.

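File sizing. Parquet files below 128 MB result in excessive file counts and slow Athena queries due to per-file overhead. Target 256 MB to 512 MB Parquet files in the cleansed and curated zones. AWS Glue’s groupFiles and groupSize options help by grouping small input files as they are read, which reduces the number of tasks and therefore the number of output files. If the output is still fragmented, repartition the data explicitly before writing or run a separate compaction job.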
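When repartitioning explicitly, the number of output files can be derived from the expected output size. A minimal sketch, assuming you can estimate the size of the Parquet output for one partition; the helper name target_file_count is hypothetical.

```python
import math

MIB = 1024 * 1024


def target_file_count(total_bytes: int, target_file_mib: int = 256) -> int:
    """Number of output files needed so each file is close to
    target_file_mib, with a minimum of one file."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / (target_file_mib * MIB)))


# Example: a day partition expected to hold about 10 GiB of Parquet output.
n_files = target_file_count(10 * 1024 * MIB)
print(n_files)

# In a Spark or Glue job the result would typically feed a repartition
# before the write, for example:
#   df.repartition(n_files).write.parquet(output_path)
```

The estimate should be based on the compressed Parquet size rather than the raw input size, since Snappy-compressed columnar output is usually much smaller than the source JSON or CSV.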

Registering Data with the AWS Glue Data Catalog

An S3 bucket full of Parquet files is not queryable until its schema is registered in a metadata catalog. The AWS Glue Data Catalog is the native choice on AWS: it integrates directly with Athena, Redshift Spectrum, EMR, and Lake Formation.

You have two ways to populate the catalog:

Glue Crawlers: Crawlers scan S3 prefixes, infer schemas from files, and create or update Glue table definitions. They work well for initial setup and for datasets where the schema evolves slowly. For production data lakes, schedule crawlers to run after each ETL job completes rather than on a fixed time interval to avoid stale metadata.

Direct API calls in your ETL job: More predictable than crawlers. At the end of your Glue or Spark ETL job, explicitly call the Glue Data Catalog API to add new partitions (batch_create_partition() on the boto3 Glue client) or update table schemas. This gives you full control over what the catalog reflects.
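The direct-API approach might look like the following sketch. It assumes the table already exists in the catalog with year, month, and day partition keys, and that new partitions reuse the table's storage descriptor with only the location changed. The helper names are illustrative; get_table and batch_create_partition are boto3 Glue client calls, and the batch size of 100 reflects the documented limit on PartitionInputList.

```python
import copy


def build_partition_inputs(table_sd: dict, table_location: str,
                           dates: list) -> list:
    """Create one PartitionInput per (year, month, day) tuple by copying
    the table's storage descriptor and pointing it at the partition path."""
    base = table_location.rstrip("/")
    inputs = []
    for year, month, day in dates:
        sd = copy.deepcopy(table_sd)  # never mutate the table's descriptor
        sd["Location"] = f"{base}/year={year}/month={month}/day={day}/"
        inputs.append({"Values": [year, month, day], "StorageDescriptor": sd})
    return inputs


def chunked(items: list, size: int = 100):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def register_partitions(database: str, table: str, dates: list) -> None:
    """Register partitions in the Glue Data Catalog (requires boto3 and
    AWS credentials; not executed in this sketch)."""
    import boto3  # imported lazily so the helpers above stay dependency-free

    glue = boto3.client("glue")
    sd = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
    for batch in chunked(build_partition_inputs(sd, sd["Location"], dates)):
        response = glue.batch_create_partition(
            DatabaseName=database, TableName=table, PartitionInputList=batch
        )
        # Per-partition failures are reported in the response, not raised.
        for error in response.get("Errors", []):
            print(error["PartitionValues"], error["ErrorDetail"])


# Hypothetical storage descriptor, reduced to the fields used here.
example_sd = {"Location": "s3://your-data-lake-cleansed/opportunities/", "Columns": []}
example_inputs = build_partition_inputs(
    example_sd, example_sd["Location"], [("2024", "01", "12")]
)
```

Partitions that already exist come back as entries in Errors rather than as exceptions, so a rerun of the same job is safe as long as those entries are treated as expected.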

Access Control with AWS Lake Formation

S3 bucket policies and IAM policies alone are coarse instruments for a multi-team data lake. AWS Lake Formation provides table-level and column-level permissions on top of the Glue Data Catalog, letting you grant an analyst access to a specific table without exposing the underlying S3 bucket.

The key Lake Formation setup steps for a new data lake:

  1. Register your S3 buckets as data lake locations in Lake Formation.
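  2. Designate a data lake administrator in Lake Formation, and grant the service role used by your ETL jobs data location permissions plus only the database and table permissions it needs, rather than administrator rights.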
  3. Create Lake Formation database-level and table-level grants for each team or IAM role that needs access.
  4. Revoke the default IAMAllowedPrincipals grant. This is the most frequently missed step: while the grant remains in place, access is governed by IAM policies alone and the data is effectively unprotected by Lake Formation.
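Registering a location, granting table access, and revoking the default grant can be scripted with boto3. The sketch below builds the request payloads as plain dictionaries and applies them in a separate function; the function names and the role ARN are hypothetical, while register_resource, grant_permissions, and revoke_permissions are Lake Formation client calls.

```python
def table_resource(database: str, table: str) -> dict:
    """Lake Formation resource descriptor for a single catalog table."""
    return {"Table": {"DatabaseName": database, "Name": table}}


def grant_table_select(principal_arn: str, database: str, table: str) -> dict:
    """Request payload granting SELECT on one table to one IAM principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": table_resource(database, table),
        "Permissions": ["SELECT"],
    }


def revoke_iam_allowed_principals(database: str, table: str) -> dict:
    """Request payload removing the default IAMAllowedPrincipals grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": table_resource(database, table),
        "Permissions": ["ALL"],
    }


def apply(bucket_arn: str, grants: list, revokes: list) -> None:
    """Send the requests to Lake Formation (requires boto3 and AWS
    credentials; not executed in this sketch)."""
    import boto3

    lf = boto3.client("lakeformation")
    # Raises an error if the location is already registered.
    lf.register_resource(ResourceArn=bucket_arn, UseServiceLinkedRole=True)
    for request in grants:
        lf.grant_permissions(**request)
    for request in revokes:
        lf.revoke_permissions(**request)


# Hypothetical role ARN, database, and table names.
grant = grant_table_select(
    "arn:aws:iam::123456789012:role/analyst", "cleansed", "opportunities"
)
revoke = revoke_iam_allowed_principals("cleansed", "opportunities")
```

Applying the grants before the revokes means existing users are not locked out in between. New databases and tables may also pick up the default grant again unless the account's default permission settings in Lake Formation are changed, which is worth verifying in your own environment.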

Column-level security is particularly valuable for PII. You can grant a data analyst access to a customer table while filtering out columns like email_address and sin_number at the Lake Formation layer, without modifying the underlying data or the Athena query.
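A column-filtered grant of this kind can be expressed with the TableWithColumns resource and an exclusion list. A minimal sketch, assuming a customer table in a database named cleansed and an analyst role; those names and the function name are hypothetical, and the payload is intended for the boto3 Lake Formation client's grant_permissions call.

```python
def column_filtered_grant(principal_arn: str, database: str, table: str,
                          excluded_columns: list) -> dict:
    """Request payload granting SELECT on every column of a table
    except the ones listed in excluded_columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {
                    "ExcludedColumnNames": list(excluded_columns)
                },
            }
        },
        "Permissions": ["SELECT"],
    }


pii_safe_grant = column_filtered_grant(
    "arn:aws:iam::123456789012:role/analyst",  # hypothetical role
    "cleansed",
    "customer",
    ["email_address", "sin_number"],
)

# To apply (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**pii_safe_grant)
```

An exclusion list keeps working when new non-sensitive columns are added, but it also exposes any new sensitive column by default. An explicit inclusion list (ColumnNames instead of ColumnWildcard) is the stricter choice where that risk matters.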

Cost Management for S3 Data Lakes

S3 data lake costs break down into three categories: storage, requests, and data transfer. At scale, storage dominates.

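Storage tiering: Use S3 Intelligent-Tiering for the raw zone. Tiering is driven by access rather than age: objects not accessed for 30 consecutive days move to the Infrequent Access tier automatically, and objects not accessed for 90 days move to the Archive Instant Access tier. AWS quotes storage savings of up to 40% and up to 68% respectively for those tiers, in exchange for a small per-object monitoring charge. For curated datasets with predictable access patterns, S3 Standard-IA or Glacier Instant Retrieval may be cheaper if you know the access frequency.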

S3 Lifecycle rules: Set hard expiration rules on the raw zone for data older than your retention policy. For a typical analytics platform, raw data older than 7 years can be deleted or archived to Glacier Deep Archive. Configure these rules at bucket creation, not after the fact.
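A lifecycle rule of this shape can be built as a dictionary and applied with the S3 client's put_bucket_lifecycle_configuration call. The sketch below assumes a 7-year retention period and an optional earlier move to Glacier Deep Archive; the rule ID, the one-year archive threshold, and the function names are illustrative assumptions.

```python
from typing import Optional


def raw_zone_lifecycle(retention_days: int = 7 * 365,
                       archive_after_days: Optional[int] = None) -> dict:
    """Lifecycle configuration that expires objects after retention_days
    and optionally moves them to Deep Archive earlier."""
    rule = {
        "ID": "raw-zone-retention",
        "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
        "Status": "Enabled",
        "Expiration": {"Days": retention_days},
        # Clean up failed multipart uploads so they do not accrue cost.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }
    if archive_after_days is not None:
        if archive_after_days >= retention_days:
            raise ValueError("archive_after_days must be less than retention_days")
        rule["Transitions"] = [
            {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"}
        ]
    return {"Rules": [rule]}


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Attach the configuration to a bucket (requires boto3 and AWS
    credentials; not executed in this sketch)."""
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = raw_zone_lifecycle(archive_after_days=365)
print(config["Rules"][0]["Expiration"])
```

Expiration permanently removes the reprocessing source for that period, so the retention value should come from your legal and compliance requirements rather than from a default.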

Request costs: Athena queries generate many S3 GET requests. Well-partitioned data and properly sized Parquet files significantly reduce the number of files Athena must scan, and therefore the number of S3 GET requests. This is another reason the partitioning decisions in the cleansed zone have direct cost implications.

A Reference Architecture

A minimal production data lake on S3 involves the following AWS services working together:

  • Amazon S3 (three buckets: raw, cleansed, curated): storage layer
  • AWS Glue (crawlers + ETL jobs): transformation and cataloguing
  • AWS Glue Data Catalog: shared metadata store
  • AWS Lake Formation: fine-grained access control
  • Amazon Athena: ad-hoc SQL queries over cleansed and curated zones
  • AWS CloudTrail + S3 Server Access Logs: audit trail for compliance

This architecture supports both near-real-time streaming ingestion with Kinesis (writing to the raw zone via Amazon Data Firehose, formerly Kinesis Data Firehose) and batch ingestion from relational databases via AWS DMS or Glue connections.

Conclusion

A well-designed data lake on Amazon S3 is the foundation everything else in your data platform depends on: your query performance, your governance posture, your cost profile, and your team’s ability to trust the data they work with. The decisions that matter most are the ones made before the first byte lands: zone design, prefix structure, cataloguing strategy, and access control. Get those right, and scaling from gigabytes to petabytes becomes an operational task rather than an architectural crisis.

Ready to build or optimise your AWS data infrastructure? Contact the Infra IT Consulting team for a free consultation.
