Infra IT Consulting logo Infra ITC
Tech Tutorials & How-Tos sqlwindow-functionsathena

SQL Window Functions in Amazon Athena: A Practical Tutorial

By Infra IT Consulting · · 9 min read

Window functions are the single biggest productivity multiplier in analytical SQL. They let you compute rankings, running totals, period-over-period comparisons, and session analysis in a single query — without self-joins or subqueries that bloat execution time and make code unreadable. Amazon Athena supports the full Presto/Trino window function vocabulary, which means everything covered here runs as-is on your S3 data lake.

This tutorial uses a realistic e-commerce dataset with orders, sessions, and products tables. All queries are tested against Athena and use standard ANSI SQL window function syntax.

Understanding the Window Function Anatomy

Every window function has the same structure:

function_name(expression) OVER (
    [PARTITION BY column1, column2]
    [ORDER BY column3 ASC|DESC]
    [ROWS|RANGE BETWEEN frame_start AND frame_end]
)
  • PARTITION BY divides rows into groups (like GROUP BY, but without collapsing rows)
  • ORDER BY defines the ordering within each partition
  • The frame clause (ROWS BETWEEN ...) defines which rows relative to the current row are included in the calculation

The key insight: unlike aggregate functions with GROUP BY, window functions return one row per input row. The result set keeps full detail while adding computed columns.

ROW_NUMBER, RANK, and DENSE_RANK

These three functions assign integer labels to rows based on ordering within a partition. They differ in how they handle ties.

Use case: Find the top 3 orders by revenue for each customer.

WITH ranked_orders AS (
    SELECT
        customer_id,
        order_id,
        order_date,
        revenue,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC
        ) AS rn,
        RANK() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC
        ) AS rnk,
        DENSE_RANK() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC
        ) AS dense_rnk
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
)
SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    rn,
    rnk,
    dense_rnk
FROM ranked_orders
WHERE rn <= 3
ORDER BY customer_id, rn;

The difference matters when two orders have the same revenue:

  • ROW_NUMBER: assigns unique numbers (1, 2, 3) — ties are broken arbitrarily
  • RANK: assigns the same number to ties, then skips (1, 1, 3) — there is no rank 2
  • DENSE_RANK: assigns the same number to ties, no skipping (1, 1, 2)

Use case: Identify each product category’s best-selling SKU.

SELECT
    category,
    product_id,
    product_name,
    total_units_sold
FROM (
    SELECT
        p.category,
        p.product_id,
        p.product_name,
        SUM(oi.quantity) AS total_units_sold,
        RANK() OVER (
            PARTITION BY p.category
            ORDER BY SUM(oi.quantity) DESC
        ) AS category_rank
    FROM order_items oi
    JOIN products p ON oi.product_id = p.product_id
    GROUP BY p.category, p.product_id, p.product_name
)
WHERE category_rank = 1
ORDER BY category;

LAG and LEAD: Period-Over-Period Comparisons

LAG accesses a value from a previous row; LEAD accesses a value from a future row. Both are essential for trend analysis.

Use case: Calculate week-over-week revenue change.

WITH weekly_revenue AS (
    SELECT
        DATE_TRUNC('week', order_date) AS week_start,
        SUM(revenue)                    AS total_revenue
    FROM orders
    WHERE order_date >= DATE '2023-01-01'
    GROUP BY DATE_TRUNC('week', order_date)
),
revenue_with_lag AS (
    SELECT
        week_start,
        total_revenue,
        LAG(total_revenue, 1) OVER (ORDER BY week_start)  AS prev_week_revenue,
        LAG(total_revenue, 4) OVER (ORDER BY week_start)  AS revenue_4_weeks_ago
    FROM weekly_revenue
)
SELECT
    week_start,
    total_revenue,
    prev_week_revenue,
    ROUND(
        100.0 * (total_revenue - prev_week_revenue) / NULLIF(prev_week_revenue, 0),
        2
    ) AS wow_pct_change,
    ROUND(
        100.0 * (total_revenue - revenue_4_weeks_ago) / NULLIF(revenue_4_weeks_ago, 0),
        2
    ) AS four_week_pct_change
FROM revenue_with_lag
ORDER BY week_start;

NULLIF(prev_week_revenue, 0) prevents division by zero on the first row where LAG returns NULL. This pattern is important — Athena will error on division by zero without it.

Use case: Flag orders where a customer’s purchase value dropped significantly from their last order.

SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    LAG(revenue) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_revenue,
    CASE
        WHEN revenue < 0.5 * LAG(revenue) OVER (PARTITION BY customer_id ORDER BY order_date)
        THEN true
        ELSE false
    END AS significant_drop
FROM orders
ORDER BY customer_id, order_date;

FIRST_VALUE and LAST_VALUE

These functions return the first or last value in the window frame. LAST_VALUE requires careful frame specification — the default frame ends at the current row, not the partition end.

Use case: For each order, show that customer’s first-ever order date (customer tenure).

SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    FIRST_VALUE(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS first_order_date,
    DATE_DIFF('day',
        FIRST_VALUE(order_date) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ),
        order_date
    ) AS days_since_first_order
FROM orders
ORDER BY customer_id, order_date;

Use case: Show each customer’s most recent order value alongside their historical orders.

SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    LAST_VALUE(revenue) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING  -- full partition frame
    ) AS most_recent_revenue
FROM orders
ORDER BY customer_id, order_date;

Note the frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING — without it, LAST_VALUE returns the current row’s value (because the default frame ends at the current row).

Running Totals with SUM OVER

Running aggregates are among the most common window function patterns in business analytics.

Use case: Running revenue total and cumulative percentage of annual revenue.

WITH daily_revenue AS (
    SELECT
        order_date,
        SUM(revenue) AS daily_revenue
    FROM orders
    WHERE YEAR(order_date) = 2024
    GROUP BY order_date
)
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    ROUND(
        100.0 * SUM(daily_revenue) OVER (
            ORDER BY order_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) / SUM(daily_revenue) OVER (),
        2
    ) AS cumulative_pct_of_year
FROM daily_revenue
ORDER BY order_date;

The SUM(...) OVER () with an empty OVER clause computes the grand total across all rows — a useful denominator for percentage calculations.

Use case: 7-day moving average of daily orders (rolling window).

WITH daily_orders AS (
    SELECT
        order_date,
        COUNT(*) AS order_count
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY order_date
)
SELECT
    order_date,
    order_count,
    ROUND(
        AVG(CAST(order_count AS DOUBLE)) OVER (
            ORDER BY order_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ),
        1
    ) AS seven_day_avg
FROM daily_orders
ORDER BY order_date;

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW defines a 7-row rolling window (the current row plus the 6 rows before it). Use RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW if you need date-based ranges that handle gaps in the data correctly.

Session Analysis with Window Functions

Web clickstream and event data often requires session analysis — grouping user events into sessions based on inactivity gaps.

Use case: Assign session IDs to page view events where a session ends after 30 minutes of inactivity.

WITH events_with_gap AS (
    SELECT
        user_id,
        event_time,
        page,
        DATE_DIFF('minute',
            LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
            event_time
        ) AS minutes_since_last_event
    FROM page_views
),
session_boundaries AS (
    SELECT
        user_id,
        event_time,
        page,
        minutes_since_last_event,
        CASE
            WHEN minutes_since_last_event IS NULL OR minutes_since_last_event >= 30
            THEN 1 ELSE 0
        END AS is_new_session
    FROM events_with_gap
)
SELECT
    user_id,
    event_time,
    page,
    minutes_since_last_event,
    SUM(is_new_session) OVER (
        PARTITION BY user_id
        ORDER BY event_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS session_number
FROM session_boundaries
ORDER BY user_id, event_time;

This three-step pattern — compute the gap with LAG, flag session boundaries, then accumulate with SUM OVER — is the standard approach for sessionization without a dedicated streaming engine.

Performance Tips for Window Functions in Athena

Athena charges by data scanned, and window functions can scan more data than expected if not structured carefully:

  1. Filter early. Apply WHERE conditions before the window function runs. Use CTEs or subqueries to pre-filter, then apply window functions on the reduced dataset.

  2. Use Parquet with partitioning. Partition your S3 data by date columns that appear in your WHERE clause. Athena uses partition pruning to skip irrelevant files. For background on partitioning strategy, see Athena SQL Best Practices.

  3. Avoid redundant window specifications. If you use the same OVER (PARTITION BY x ORDER BY y) clause multiple times, compute them together in a single CTE rather than repeating them in separate subqueries.

  4. Prefer ROWS over RANGE for row-based frames. RANGE requires sorting and comparison on data types, while ROWS is an integer offset — it is generally faster.

For structuring the data lake that these queries run on, see our guide to building a data lake on S3.

Conclusion

Window functions transform Athena from a simple query engine into a powerful analytical tool. The patterns covered here — ranking, period comparison with LAG/LEAD, FIRST_VALUE/LAST_VALUE for anchored lookups, running totals, and sessionization — appear constantly in production data work. Mastering them makes analytical SQL dramatically faster to write and easier to read.

If your team is building analytical capabilities on AWS and wants expert guidance on your Athena setup, data lake architecture, or SQL patterns, reach out to Infra IT Consulting. We work with Canadian organisations to build query infrastructure that performs and scales.

Related posts