Infra IT Consulting logo Infra ITC
Data Analytics & BI cohortsqlathena

Cohort Analysis in SQL with Amazon Athena

By Infra IT Consulting Β· Β· 9 min read

Cohort analysis is one of the most powerful and frequently misunderstood techniques in product and growth analytics. Done properly, it answers questions that aggregate metrics cannot: Are customers who signed up after your pricing change retaining better or worse than those who signed up before? Do users acquired through paid channels behave differently from organic users six months in? Is your product improving for new users, or are those improvements masked by a growing base of long-tenured users who already know the product?

Amazon Athena β€” querying data stored in Amazon S3 using standard SQL β€” is an excellent environment for cohort analysis. Athena’s support for window functions, date arithmetic, and arbitrary partitioning makes complex cohort queries both feasible and readable. This post walks through the key SQL patterns with complete, runnable examples.

What a Cohort Actually Is

A cohort is a group of users or customers who share a defining event within a specific time window. The most common cohort definition is the acquisition cohort: all users who signed up in a given month. But cohorts can be defined by any event:

  • Acquisition cohorts β€” users who first registered in a given calendar month
  • Feature cohorts β€” users who first used a specific feature in a given week
  • Channel cohorts β€” users acquired through a specific marketing channel in a given quarter
  • Behavioural cohorts β€” users who performed a specific in-product action (completed onboarding, made their first purchase, invited a team member)

The analysis then tracks how each cohort behaves over subsequent periods β€” typically measuring retention rate, revenue, or engagement β€” relative to how long it has been since their defining event.

Data Requirements and S3 Structure

For cohort analysis in Athena, you need at minimum:

  1. A users table with a user_id and the timestamp of the cohort-defining event (e.g., first_seen_at or registered_at)
  2. An events table or activity table with user_id, event_date, and optionally an event type and associated value

A typical S3/Athena table structure for a SaaS product:

-- Users table (partitioned by registration month for query efficiency)
CREATE EXTERNAL TABLE users (
  user_id       STRING,
  registered_at TIMESTAMP,
  plan_type     STRING,
  channel       STRING,
  country       STRING
)
PARTITIONED BY (registration_month STRING)
STORED AS PARQUET
LOCATION 's3://your-data-lake/curated/users/';

-- Activity table (partitioned by date)
CREATE EXTERNAL TABLE user_activity (
  user_id       STRING,
  activity_date DATE,
  event_type    STRING,
  revenue       DECIMAL(10,2)
)
PARTITIONED BY (activity_date STRING)
STORED AS PARQUET
LOCATION 's3://your-data-lake/curated/user_activity/';

Partitioning by date is critical for Athena cost and performance. A retention cohort query without proper partition pruning can scan years of activity data unnecessarily.

Basic Retention Cohort Query

The canonical retention cohort query follows a consistent structure: assign each user to a cohort period, then count how many return in each subsequent period.

WITH user_cohorts AS (
  -- Assign each user to their signup cohort (month granularity)
  SELECT
    user_id,
    DATE_TRUNC('month', registered_at)  AS cohort_month
  FROM users
  WHERE registered_at >= DATE '2023-01-01'
),
user_activity_months AS (
  -- For each user, find all months they were active
  SELECT DISTINCT
    user_id,
    DATE_TRUNC('month', CAST(activity_date AS TIMESTAMP)) AS active_month
  FROM user_activity
  WHERE activity_date >= DATE '2023-01-01'
),
cohort_activity AS (
  -- Join to find each user's activity relative to their cohort month
  SELECT
    c.cohort_month,
    DATE_DIFF('month', c.cohort_month, a.active_month) AS period_number,
    COUNT(DISTINCT a.user_id)                           AS active_users
  FROM user_cohorts c
  JOIN user_activity_months a USING (user_id)
  WHERE a.active_month >= c.cohort_month
  GROUP BY 1, 2
),
cohort_sizes AS (
  -- Count of users in each cohort at period 0 (acquisition)
  SELECT
    cohort_month,
    active_users AS cohort_size
  FROM cohort_activity
  WHERE period_number = 0
)
SELECT
  ca.cohort_month,
  cs.cohort_size,
  ca.period_number,
  ca.active_users,
  ROUND(
    100.0 * ca.active_users / cs.cohort_size, 1
  )                                         AS retention_rate_pct
FROM cohort_activity ca
JOIN cohort_sizes cs USING (cohort_month)
ORDER BY cohort_month, period_number;

This query returns one row per cohort-period combination: the January 2023 cohort at month 0 (100% by definition), month 1 (e.g., 42%), month 2 (e.g., 31%), and so on. Loading this output into Amazon QuickSight and pivoting on period_number produces the classic cohort retention grid.

Revenue Cohort Analysis

Retention rate tells you who comes back; revenue cohort analysis tells you how much value each cohort generates over time β€” the foundation of lifetime value (LTV) modelling.

WITH user_cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', registered_at) AS cohort_month
  FROM users
  WHERE registered_at >= DATE '2023-01-01'
),
monthly_revenue AS (
  SELECT
    user_id,
    DATE_TRUNC('month', CAST(activity_date AS TIMESTAMP)) AS revenue_month,
    SUM(revenue)                                           AS monthly_revenue
  FROM user_activity
  WHERE revenue > 0
    AND activity_date >= DATE '2023-01-01'
  GROUP BY 1, 2
),
cohort_revenue AS (
  SELECT
    c.cohort_month,
    DATE_DIFF('month', c.cohort_month, r.revenue_month) AS period_number,
    COUNT(DISTINCT r.user_id)                            AS paying_users,
    SUM(r.monthly_revenue)                               AS cohort_revenue,
    AVG(r.monthly_revenue)                               AS avg_revenue_per_payer
  FROM user_cohorts c
  JOIN monthly_revenue r USING (user_id)
  WHERE r.revenue_month >= c.cohort_month
  GROUP BY 1, 2
),
cohort_sizes AS (
  SELECT
    cohort_month,
    COUNT(*) AS cohort_size
  FROM user_cohorts
  GROUP BY 1
)
SELECT
  cr.cohort_month,
  cs.cohort_size,
  cr.period_number,
  cr.paying_users,
  cr.cohort_revenue,
  -- Cumulative revenue per acquired user β€” the LTV curve
  SUM(cr.cohort_revenue) OVER (
    PARTITION BY cr.cohort_month
    ORDER BY cr.period_number
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) / cs.cohort_size                                     AS cumulative_ltv
FROM cohort_revenue cr
JOIN cohort_sizes cs USING (cohort_month)
ORDER BY cohort_month, period_number;

The cumulative_ltv column gives you the LTV curve for each cohort β€” how much cumulative revenue, on average, each user acquired in that month has generated by period N. Comparing curves across cohorts reveals whether newer cohorts are more or less valuable, and at what period they reach payback on acquisition cost.

Behavioural Segmentation Within Cohorts

A single acquisition cohort often contains users with very different behaviours. Splitting cohorts by a behavioural attribute reveals which user types retain best β€” critical information for product and growth teams.

-- Retention broken out by whether users completed onboarding in their first week
WITH user_cohorts AS (
  SELECT
    u.user_id,
    DATE_TRUNC('month', u.registered_at) AS cohort_month,
    CASE
      WHEN EXISTS (
        SELECT 1 FROM user_activity a
        WHERE a.user_id = u.user_id
          AND a.event_type = 'onboarding_completed'
          AND a.activity_date <= DATE_ADD('day', 7, CAST(u.registered_at AS DATE))
      ) THEN 'completed_onboarding'
      ELSE 'did_not_complete_onboarding'
    END AS onboarding_segment
  FROM users u
  WHERE u.registered_at >= DATE '2023-01-01'
),
-- ... (rest of query follows same pattern as basic retention above,
--      with onboarding_segment added to GROUP BY)

This pattern extends naturally to any behavioural segmentation: plan type, acquisition channel, country, feature adoption. Athena handles these subqueries efficiently when the underlying S3 data is properly partitioned.

Avoiding Common Cohort Analysis Mistakes

Date truncation consistency. Mixing DATE_TRUNC and manual date arithmetic in the same query produces subtle off-by-one errors in period assignment. Pick one approach and use it everywhere in the query.

Cohort size denominator. The denominator for retention rate should always be the number of users in the cohort at period 0 β€” not the number active in any subsequent period. Using a variable denominator produces β€œretention” figures that can exceed 100%.

Partition pruning in Athena. Always filter on the partition column in your WHERE clause. For activity_date partitioned tables, WHERE activity_date >= '2023-01-01' pushes down to partition pruning; WHERE YEAR(activity_date) >= 2023 does not and will scan all partitions.

Choosing the right analysis tool. For more on when to use Athena versus other query engines for analytical workloads, see the Athena SQL best practices guide.

Presenting Cohort Results

The cohort grid β€” cohort months as rows, period numbers as columns, retention rate as cell values β€” is the standard presentation format. In Amazon QuickSight, you can produce this as a pivot table with conditional formatting (green for high retention, red for low) that makes degradation or improvement across cohorts immediately visible.

For teams that need cohort analysis as part of a broader product analytics capability, the data as a product post explores how to package cohort outputs as reusable internal data products that product managers and growth teams can access without running SQL themselves.

Conclusion

Cohort analysis in Amazon Athena is both powerful and cost-effective. With proper data modelling in S3, partition discipline, and well-structured window function queries, you can build retention grids, LTV curves, and behavioural segmentations that give your product and growth teams genuine insight into user behaviour over time.

If your team is building product analytics infrastructure on AWS and needs help designing cohort models, query frameworks, or BI delivery, contact Infra IT Consulting β€” we specialise in exactly this kind of high-value analytical engineering.

Related posts