Data Analytics & BI self-serviceanalyticsathena

Building Self-Service Analytics Platforms on AWS

By Infra IT Consulting · January 25, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem. Learn more

The promise of self-service analytics — where business users can answer their own data questions without filing tickets to the data team — is compelling. The execution is harder than most organisations anticipate. Too much self-service access leads to uncontrolled costs, inconsistent metrics, and data quality problems discovered in executive meetings. Too little, and the data team becomes a report factory, and the platform fails to deliver business value at the pace the organisation needs.

This guide covers how to design a self-service analytics platform on AWS that genuinely empowers business users while maintaining the cost controls and data governance that make the platform sustainable.

The Architecture: Governed Access at Multiple Layers

A production self-service analytics platform on AWS has four layers:

Data storage and cataloguing — Amazon S3 + AWS Glue Data Catalog + AWS Lake Formation
Query execution — Amazon Athena (with workgroup-based governance)
Semantic layer — Curated datasets in Amazon QuickSight or a dbt semantic layer on top of Athena
Presentation — Amazon QuickSight dashboards, Q&A interface, and parameterised reports

Each layer has a governance component. Self-service without governance is not a platform; it is a liability.

Structuring Data Zones for Self-Service

Not all data in your lake should be directly queryable by business users. Structure your S3 data lake into zones with different access levels:

Raw zone: Ingested data as-is from sources. Access restricted to data engineers only. Direct self-service access here leads to inconsistent metric definitions and schema confusion.

Processed zone: Cleaned, typed, and deduplicated data. Access for senior analysts and data scientists who understand the schemas. Athena workgroup limits apply.

Curated zone: Business-logic-applied, metric-consistent data products. The primary self-service access layer. Tables in this zone should have well-documented schemas in the Glue Data Catalog, consistent naming conventions, and stable schemas under a change management process.

Aggregated zone: Pre-aggregated summaries for dashboard consumption. SPICE import targets for QuickSight. The fastest, most controlled layer for business users.

Implement zone boundaries using Lake Formation:

import boto3

lf = boto3.client('lakeformation')

# Grant curated zone access to the Business Analysts group
lf.grant_permissions(
    Principal={
        'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/BusinessAnalystRole'
    },
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'DATABASE',
            'Expression': [
                {'TagKey': 'zone', 'TagValues': ['curated', 'aggregated']}
            ]
        }
    },
    Permissions=['DESCRIBE']
)

lf.grant_permissions(
    Principal={
        'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/BusinessAnalystRole'
    },
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [
                {'TagKey': 'zone', 'TagValues': ['curated', 'aggregated']},
                {'TagKey': 'sensitivity', 'TagValues': ['public', 'internal']}
            ]
        }
    },
    Permissions=['SELECT', 'DESCRIBE']
)

This LF-TBAC policy automatically applies to new tables in the curated and aggregated zones without per-table permission grants. New data products added to the right zone with the right tags are immediately accessible to authorised business users.

Athena Workgroups for Cost and Access Control

Amazon Athena workgroups are the primary mechanism for governing self-service SQL access. Without workgroups, a single badly-written query from a business user can scan terabytes of data and generate unexpected costs.

Create dedicated workgroups for each user group with appropriate controls:

resource "aws_athena_workgroup" "business_analysts" {
  name        = "business-analysts"
  description = "Self-service workgroup for business analysts"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true
    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/business-analysts/"
      encryption_configuration {
        encryption_option = "SSE_KMS"
        kms_key_arn       = var.kms_key_arn
      }
    }
    bytes_scanned_cutoff_per_query     = 10737418240  # 10 GB limit per query
    requester_pays_enabled             = false
    engine_version {
      selected_engine_version = "Athena engine version 3"
    }
  }
}

The bytes_scanned_cutoff_per_query setting is critical for self-service environments. Set it to 10 GB for business users — enough to run meaningful queries on curated data, but it will terminate a poorly constructed full-table scan before it generates a large cost. Data engineers can have a separate workgroup with a higher or unlimited cutoff.

Query result reuse: Enable Athena’s query result reuse feature for self-service workgroups. When multiple users run the same query within the TTL window (up to 7 days), Athena returns cached results without re-scanning data. This reduces costs dramatically for popular reports that many users run independently:

# Add to the workgroup configuration
result_configuration {
  # ...existing config...
}
# Enable in the workgroup via the console or API:
# QueryResultsS3.EnableResultReuse = true
# QueryResultsS3.ResultReuseMaxAgeInMinutes = 10080  # 7 days

Building a Semantic Layer for Business Users

Raw SQL access to the curated zone is appropriate for trained analysts. For less technical business users — regional managers, finance teams, marketing — a semantic layer hides SQL complexity and enforces metric consistency.

Option 1: QuickSight Datasets as Semantic Layer — Create well-structured QuickSight datasets from the curated Athena tables. Add calculated fields for business metrics (revenue growth rate, customer lifetime value, churn rate) at the dataset level. Business users interact with named metrics in the QuickSight UI rather than writing SQL.

The key discipline: define metrics once at the dataset level, not in individual visuals. When “Monthly Recurring Revenue” is defined in twelve different dashboards using twelve slightly different SQL expressions, you have a data credibility crisis waiting to happen.

Option 2: dbt Metrics Layer — Use dbt’s semantic layer (formerly dbt Metrics, now MetricFlow) to define business metrics as code, version-controlled and tested. dbt generates the SQL; Athena executes it; QuickSight or any downstream BI tool consumes the pre-defined metrics.

# dbt metrics definition (semantic_models/orders.yml)
semantic_models:
  - name: orders
    description: "Order transaction data from the commerce domain"
    model: ref('fct_orders')
    entities:
      - name: order
        type: primary
        expr: order_id
      - name: customer
        type: foreign
        expr: customer_id
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
      - name: country_code
        type: categorical
    measures:
      - name: order_count
        agg: count
        expr: order_id
      - name: total_revenue
        agg: sum
        expr: order_amount

metrics:
  - name: monthly_revenue
    description: "Total order revenue by month"
    type: simple
    label: "Monthly Revenue"
    type_params:
      measure: total_revenue

Our deep-dive on dbt on AWS covers the full dbt analytics engineering workflow for teams adopting this approach.

Data Catalogue and Documentation for Self-Service

Technical platform access is necessary but not sufficient for self-service analytics. Business users also need to know what data exists, what it means, and how to interpret it. Without this, they revert to Excel and email.

Invest in the AWS Glue Data Catalog as a business data catalogue, not just a technical metadata store:

import boto3

glue = boto3.client('glue')

# Update the Glue catalog table with business-friendly descriptions
glue.update_table(
    DatabaseName='curated_commerce',
    TableInput={
        'Name': 'orders',
        'Description': 'Confirmed customer orders. Updated every 4 hours. Excludes cancelled and refunded orders. Source: Commerce platform order management system.',
        'Parameters': {
            'classification': 'parquet',
            'data_owner': 'commerce-team@myorg.com',
            'update_frequency': 'every_4_hours',
            'data_quality_score': '98.7',
            'last_validated': '2024-01-24',
        },
        'StorageDescriptor': {
            # ...existing storage config...
            'Columns': [
                {
                    'Name': 'order_id',
                    'Type': 'string',
                    'Comment': 'Unique order identifier. Format: ORD-XXXXXXXXXX'
                },
                {
                    'Name': 'order_amount',
                    'Type': 'double',
                    'Comment': 'Net order value in the order currency, excluding taxes and shipping'
                },
                # ...
            ]
        }
    }
)

For organisations that want a richer data catalogue experience, Amazon DataZone provides a business-friendly portal with data discovery, access request workflows, and lineage visualisation — directly integrated with the Glue Data Catalog and Lake Formation.

Cost Governance in Self-Service Environments

Self-service analytics can generate unpredictable costs if not governed. Implement three layers of cost control:

Per-query limits: Athena workgroup bytes_scanned_cutoff_per_query (covered above).

Workgroup-level budgets: Use AWS Budgets with an Athena-specific cost filter for each workgroup. Alert the data team when monthly Athena spend for the business-analysts workgroup exceeds a threshold:

resource "aws_budgets_budget" "athena_business_analysts" {
  name         = "athena-business-analysts-monthly"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filters = {
    Service = "Amazon Athena"
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["data-team@myorg.com"]
  }
}

SPICE-first policy: Push business users toward QuickSight SPICE datasets (which have no per-query Athena cost) rather than direct Athena access. Reserve Athena direct access for analysts who have completed query training and understand the cost model.

Training and Enablement: The Human Layer

The most common failure mode in self-service analytics programmes is under-investing in training and enablement. Technology alone does not create self-service capability — humans need to know how to use it effectively.

A minimal enablement programme:

Data literacy curriculum: What is a dimension vs. a measure? How is the curated data structured? What does each metric definition mean?
Tool training: QuickSight basics for business users (2-hour workshop); Athena SQL for analysts (half-day workshop including cost awareness).
Data office hours: A weekly 30-minute slot where business users can bring questions to the data team — not for custom report requests, but for help navigating the platform.
A “where do I find X” documentation page: Updated whenever a new data product is added. Reduces repeat queries to the data team.

The governance framework that makes the platform sustainable is covered in our post on Building a Data Governance Framework, which addresses the policy, process, and tooling dimensions beyond the AWS technical layer.

Conclusion

A self-service analytics platform on AWS is achievable with the right combination of data lake structure (zoned access with Lake Formation), cost governance (Athena workgroups with scan limits), semantic layer (QuickSight datasets or dbt metrics), and business enablement. The platforms that succeed treat governance not as an obstacle to self-service but as the foundation that makes it sustainable.

Infra IT Consulting designs and implements self-service analytics platforms for organisations in Canada, the UK, and Africa — from initial architecture through to analyst training and cost optimisation. Get in touch to discuss your self-service analytics requirements.

Data Analytics & BI

Talk to our team →

Building Self-Service Analytics Platforms on AWS

The Architecture: Governed Access at Multiple Layers

Structuring Data Zones for Self-Service

Athena Workgroups for Cost and Access Control

Building a Semantic Layer for Business Users

Data Catalogue and Documentation for Self-Service

Cost Governance in Self-Service Environments

Training and Enablement: The Human Layer

Conclusion

Related posts

The Metrics Layer Explained: Headless BI and Why It Matters

Data Democratisation: Making Data Accessible Across Your Organisation

dbt on AWS: Transforming Raw Data into Analytics-Ready Models

The Architecture: Governed Access at Multiple Layers

Structuring Data Zones for Self-Service

Athena Workgroups for Cost and Access Control

Building a Semantic Layer for Business Users

Data Catalogue and Documentation for Self-Service

Cost Governance in Self-Service Environments

Training and Enablement: The Human Layer

Conclusion

Related posts

The Metrics Layer Explained: Headless BI and Why It Matters

Data Democratisation: Making Data Accessible Across Your Organisation

dbt on AWS: Transforming Raw Data into Analytics-Ready Models

We value your privacy