Infra IT Consulting logo Infra ITC
AWS Data Engineering s3athenapartitioning

S3 Data Partitioning Strategies That Cut Athena Query Costs

By Infra IT Consulting · · 8 min read

Amazon Athena charges $5 per terabyte of data scanned. On an unpartitioned table with 10 TB of Parquet data, a query filtering for last week’s records scans all 10 TB and costs $50. With good partitioning, the same query scans only the 7 days of data it needs — perhaps 200 GB — and costs $1.00. That is a 98% cost reduction for a single query, achieved purely through how data is organised in S3.

Partitioning is the highest-leverage optimisation available for Athena-based data platforms, and the decisions you make at pipeline design time determine whether your analytics costs scale linearly with data volume or stay flat as the lake grows. This guide covers the strategies that work in production — not theory, but the actual partitioning patterns and file sizing decisions that matter.

How Athena Uses Partitions

Before optimising, understand the mechanism. When Athena executes a query against a partitioned table, it reads the partition metadata from the AWS Glue Data Catalog to determine which S3 prefixes contain relevant data. If your WHERE clause includes a filter on a partition column, Athena excludes non-matching partitions entirely — those S3 files are never opened.

This is partition pruning. It only works when:

  1. The partition column appears in the Glue catalog table definition.
  2. The WHERE clause contains a filter on that column with a value Athena can evaluate at planning time (not a function call on the column).
  3. The data is actually organised in the corresponding S3 prefix structure.

If any of these conditions fail, Athena falls back to scanning all partitions. The most common mistake is filtering on a derived value: WHERE DATE_TRUNC('day', event_timestamp) = DATE '2024-02-12' will not prune partitions even if event_date is a partition column and the partition value 2024-02-12 would match — because Athena cannot evaluate the function at planning time. Always filter on the partition column directly: WHERE event_date = '2024-02-12'.

Standard Hive-Style Date Partitioning

For time-series event data — the most common pattern in data lakes — date-based Hive partitioning is the foundation:

s3://your-lake/cleansed/events/
  year=2024/month=01/day=08/
    events_0001.parquet
    events_0002.parquet
  year=2024/month=01/day=09/
    ...

The year=YYYY/month=MM/day=DD prefix format is automatically recognised by Glue crawlers and Athena partition projection. The Glue table DDL for this structure:

CREATE EXTERNAL TABLE events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  amount DECIMAL(10,2),
  event_timestamp TIMESTAMP
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://your-lake/cleansed/events/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

After adding new partitions via MSCK REPAIR TABLE events or explicit ALTER TABLE ADD PARTITION, queries can prune effectively:

-- Scans only 7 day-partitions, not the entire table
SELECT event_type, COUNT(*) 
FROM events
WHERE year = '2024' AND month = '02' AND day BETWEEN '05' AND '12'
GROUP BY event_type;

For tables that are queried almost exclusively by date range, three-level year/month/day partitioning is the right default. Avoid year/month/day/hour unless your queries regularly filter to hour-level ranges — too many partition directories degrades Glue crawler and Athena partition metadata performance.

Multi-Dimensional Partitioning

Some tables have a second high-cardinality dimension that appears frequently in WHERE clauses alongside the date filter. Common examples: region, product_category, event_type, customer_tier. Adding a second partition level captures additional pruning:

s3://your-lake/cleansed/transactions/
  region=CA/year=2024/month=02/day=12/
  region=UK/year=2024/month=02/day=12/
  region=NG/year=2024/month=02/day=12/

A query like WHERE region = 'CA' AND year = '2024' AND month = '02' now prunes by both region and date, scanning one-third of the data a date-only partition would scan.

The key constraint: the second partition dimension must be low-cardinality (fewer than a few hundred distinct values), stable (does not change over time), and genuinely used in query filters. Partitioning by user_id — a high-cardinality column — creates millions of tiny S3 prefixes, each with tiny files, destroying Athena performance and causing Glue crawler timeouts.

A practical rule: each additional partition level should reduce the data scanned by at least 30-50% for your most common queries. If a partition dimension only filters out 5% of data in typical queries, the overhead of managing it exceeds its benefit.

Partition Projection: Eliminating Catalog Overhead

For tables with thousands of daily partitions, querying the Glue catalog to list partitions becomes a measurable latency overhead. Athena partition projection eliminates this by computing expected S3 paths from a declared range, bypassing catalog lookups entirely.

Enable partition projection by setting table properties:

ALTER TABLE events SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2022,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 
    's3://your-lake/cleansed/events/year=${year}/month=${month}/day=${day}/'
);

With projection enabled, Athena generates the S3 paths directly from the query’s date filters. Partition registration steps (MSCK REPAIR TABLE, crawler runs, batch_create_partition) are no longer needed. New data landing in the expected S3 path is immediately queryable.

The operational simplification is significant: your ETL job writes to the correct S3 path, and Athena automatically discovers it without any catalog management step. This is particularly valuable for high-frequency pipelines that produce new partitions throughout the day.

File Size: The Other Half of the Equation

Partitioning determines which files Athena opens. File size determines the overhead of opening each one. Both matter.

S3 has a per-request latency of 5-50 ms depending on file size and network conditions. A table with 10,000 small files (1 MB each) in a partition requires 10,000 individual S3 GET requests to read fully. A table with 40 files (256 MB each) requires 40 GET requests for the same data volume — 250 times fewer S3 calls, translating directly to faster query execution and lower request costs.

Target file sizes: 128 MB to 512 MB per Parquet file in the cleansed and curated zones. Files below 64 MB are generally inefficient. Files above 1 GB provide diminishing returns and can be counterproductive for Athena’s internal split-processing.

Achieving target file sizes in Glue ETL:

# In your Glue PySpark job, control output file count
df.repartition(target_partition_count).write \
    .mode('append') \
    .partitionBy('year', 'month', 'day') \
    .parquet('s3://your-lake/cleansed/events/')

Calculate target_partition_count as ceil(total_data_size_bytes / target_file_size_bytes). For a daily job processing 10 GB of data targeting 256 MB files: ceil(10 * 1024 / 256) = 40 output files. Call df.repartition(40) before writing.

Compaction jobs: Streaming pipelines (Kinesis Data Firehose, for example) produce many small files — Firehose’s minimum buffer is 60 seconds, creating at least one file per buffer window per shard. Run a periodic compaction Glue job that reads recent partitions and rewrites them with the target file count. Compaction once per day on the previous day’s streaming data is a common pattern.

Columnar Format and Compression

Partitioning and file sizing are structural decisions. The choice of format and compression affects how much data Athena reads within each file.

Parquet vs. ORC: Both are columnar formats that support predicate pushdown (skipping row groups that don’t match WHERE clause filters) and column projection (reading only the columns selected in the query). Parquet is the more widely supported format across AWS services — Glue, Athena, Redshift Spectrum, and EMR all handle it natively. ORC has marginally better compression ratios for some data types. Use Parquet as the default.

Snappy vs. GZIP: Snappy is a splittable, faster-decompressing codec with moderate compression ratios (~60% of raw). GZIP achieves higher compression (~70-75% of raw) but is not splittable — Athena cannot parallelise reads across a single GZIP-compressed Parquet file. Use Snappy for query performance. Use GZIP only when storage cost is the primary concern and query latency is secondary (typically for archive/raw zone data).

Statistics and Bloom filters: Parquet files embed column statistics (min/max values per row group) that Athena uses for predicate pushdown within files. Ensure your Glue ETL jobs write statistics by using a Parquet writer that supports statistics (Spark’s built-in Parquet writer does this by default). For high-cardinality columns used in equality filters (WHERE order_id = '12345'), Parquet Bloom filters provide additional row group skipping — supported in Parquet 2.0 and configurable in Spark 3.x.

A Worked Cost Example

A retail analytics team has a 2 TB daily event table (730 TB annual) stored as unpartitioned CSV. Athena costs:

  • Current: 730 TB × $5 = $3,650/year for full-table scans
  • After converting to Parquet (8:1 compression over CSV): 91 TB logical size, but Athena still scans everything
  • After date partitioning (queries filter to 1-7 days): 91 TB × (7/365) = ~1.75 TB scanned per query × $5 = $8.75/query
  • After adding regional sub-partitioning (3 regions, queries filter to 1): ~0.58 TB scanned per query = ~$2.90/query

A common query that previously cost $50+ now costs under $3.00 — and daily ingestion into properly partitioned Parquet costs significantly less than the $3,650 annual CSV baseline. The optimisation pays for itself within weeks.

For a broader look at building the underlying storage layer, see our guide on building a scalable data lake on Amazon S3 and the accompanying treatment of Redshift vs. Athena for analytics workloads.

Conclusion

S3 data partitioning is not an advanced optimisation — it is a foundational decision that determines whether your Athena costs grow linearly with data volume or stay manageable as your lake scales. Date-based Hive partitioning, partition projection, right-sized Parquet files with Snappy compression, and compaction for streaming data collectively reduce Athena query costs by 80-99% compared to unstructured CSV baselines. These decisions are far easier to get right at pipeline design time than to retrofit later. Ready to build or optimise your AWS data infrastructure? Contact the Infra IT Consulting team for a free consultation.

Related posts