
API-First Data Architecture: Exposing Data as Services

By Infra IT Consulting · 9 min read

Most organisations treat data exposure as an afterthought: a reporting database that internal analysts hit directly, a shared S3 bucket that downstream teams figured out how to read, a CSV export that someone emails on Fridays. This works until it doesn’t: until a schema change in the reporting database breaks twelve downstream consumers at once, until S3 permissions rot and nobody is sure what is accessible to whom, until the Friday email doesn’t arrive and nobody knows who to call.

An API-first data architecture treats data exposure as a product engineering problem. Data assets are exposed through versioned, documented, monitored APIs rather than direct database connections or file shares. Consumers depend on a stable contract, not on implementation details. This shift has significant operational consequences, for better and for worse, and it is worth understanding both before committing to the pattern.

The Data Product Framing

The API-first approach is closely related to the data mesh concept of treating data as a product. A data product has an owner, a defined schema, a service level agreement for availability and freshness, and documentation sufficient for a new consumer to self-serve onboarding. An API is the delivery mechanism that makes these properties enforceable.
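
As a rough illustration of those properties, the sketch below models a data product contract as a small Python dataclass. The class, its field names, the example values, and the example.com documentation URL are all hypothetical; the point is only that owner, schema version, SLA, and documentation become explicit, checkable values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProductContract:
    """Hypothetical descriptor for one data product (illustrative fields only)."""
    name: str
    owner: str                   # team accountable for the product
    schema_version: str          # e.g. "v1"
    max_staleness: timedelta     # freshness SLA
    availability_target: float   # e.g. 0.999
    docs_url: str                # where a new consumer starts onboarding

    def is_fresh(self, last_updated_at: datetime, now: datetime) -> bool:
        """Return True when the data is within the freshness SLA."""
        return now - last_updated_at <= self.max_staleness


# Example values are placeholders, not taken from a real platform.
catalogue = DataProductContract(
    name="product-catalogue",
    owner="catalogue-data-team",
    schema_version="v1",
    max_staleness=timedelta(minutes=15),
    availability_target=0.999,
    docs_url="https://example.com/docs/product-catalogue",
)
```

An API layer can evaluate is_fresh on each response and surface the result to consumers, which is what makes the freshness SLA enforceable rather than aspirational.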

When data is exposed through a REST or GraphQL API rather than a direct database query, several things become possible that are difficult or impossible otherwise:

  • Schema versioning: consumers pin to /v1/ or /v2/ endpoints and are not broken by schema evolution in the underlying storage layer
  • Access control at query level: you can restrict which fields, which row-level subsets, and which aggregation granularities are accessible to which consumer, independent of storage-layer permissions
  • Usage telemetry: every API call is logged, enabling you to see which consumers call which endpoints, how often, and with what latency. This information is invisible with direct database connections
  • Throttling and quota management: you can prevent a single consumer from overwhelming the data platform with expensive queries
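
The query-level access control described in the list above can be sketched in a few lines of Python. The policy table, consumer names, and the apply_policy function are assumptions made for illustration; a real platform would load policies from a permissions store rather than a module-level dictionary.

```python
from typing import Any

# Hypothetical per-consumer policy: visible fields and permitted categories.
POLICIES: dict[str, dict[str, set[str]]] = {
    "partner-app": {
        "fields": {"product_id", "product_name", "category"},
        "categories": {"outdoor"},
    },
    "internal-pricing": {
        "fields": {"product_id", "product_name", "category", "unit_price_cad"},
        "categories": {"outdoor", "kitchen"},
    },
}


def apply_policy(consumer_id: str, rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop the rows and fields this consumer is not authorised to see."""
    policy = POLICIES.get(consumer_id)
    if policy is None:
        return []  # unknown consumers get nothing
    return [
        {key: value for key, value in row.items() if key in policy["fields"]}
        for row in rows
        if row.get("category") in policy["categories"]
    ]
```

Filtering after the query keeps the sketch short. In practice it is usually cheaper to push the same policy into the SQL (restricted column list and WHERE clause) so that unauthorised rows never leave the warehouse.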

On AWS, this architecture is typically built with Amazon API Gateway (for REST APIs), AWS AppSync (for GraphQL), AWS Lambda (for query execution and business logic), Amazon Redshift or Amazon Athena (for the underlying data), and AWS WAF and Cognito for security.

REST vs. GraphQL for Data APIs

Choosing between REST and GraphQL for a data API involves a genuine trade-off. REST is simpler to implement, easier to cache, and better supported by standard API tooling. GraphQL provides flexibility for consumers who need to request specific fields and traverse related entities without multiple round trips.

For internal data platform APIs where consumers are data engineers and analysts writing code, GraphQL’s flexibility is often worth the implementation overhead. A single GraphQL endpoint can satisfy queries from a mobile app needing lightweight summaries, a data science team pulling full historical records, and a downstream pipeline needing specific computed aggregates, without requiring separate endpoint design for each use case.

For external or partner-facing APIs where consumers may be less technically sophisticated, REST with well-documented fixed schemas is usually the better choice. The predictability of REST endpoints reduces integration support burden.

A practical middle ground for data platforms is to use REST for operational data APIs (high-frequency, low-latency reads of current state) and GraphQL for analytical data APIs (lower-frequency, flexible queries across multiple related entities).

Implementing a Data API on AWS

Here is a concrete architecture for a REST data API exposing a product catalogue with pricing and inventory, backed by Amazon Redshift:

API Gateway receives the request, validates the API key or JWT token, and routes to the appropriate Lambda function. Throttling is configured at 1000 requests per second per consumer with burst limits.
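
API Gateway applies throttling with a token bucket, and the relationship between the steady rate and the burst limit is easier to see in code. The following is a minimal sketch of that algorithm, not API Gateway's implementation; the burst size of 2000 used in the comments is an assumed value, since only the 1000 requests per second rate is specified here.

```python
class TokenBucket:
    """Minimal token bucket: `rate` tokens refill per second, capped at `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; `now` is a timestamp in seconds."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer with HTTP 429


# Assumed limits: 1000 requests/second steady state, bursts of up to 2000.
bucket = TokenBucket(rate=1000, burst=2000)
```

A consumer can briefly exceed the steady rate until the bucket empties, after which requests are admitted only as fast as tokens refill.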

Lambda function (Python) receives the event, extracts query parameters, constructs a parameterised SQL query, executes it through the Redshift Data API, and formats the response:

import boto3
import json
import os
from datetime import datetime

redshift_data = boto3.client('redshift-data', region_name='ca-central-1')

def lambda_handler(event, context):
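    consumer_id = event['requestContext']['authorizer']['principalId']
    # API Gateway sends queryStringParameters as null when the request has none
    params = event.get('queryStringParameters') or {}
    product_category = params.get('category')
    if not product_category:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'Query parameter "category" is required'})
        }

    # Row-level security: restrict to consumer's authorised categories
    # (get_consumer_permissions is a helper that is not shown here)
    allowed_categories = get_consumer_permissions(consumer_id)
    if product_category not in allowed_categories:
        return {
            'statusCode': 403,
            'body': json.dumps({'error': 'Access denied to requested category'})
        }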
    
    sql = """
        SELECT
            p.product_id,
            p.product_name,
            p.category,
            p.unit_price_cad,
            i.quantity_on_hand,
            i.warehouse_location,
            i.last_updated_at
        FROM analytics.dim_product p
        JOIN analytics.fact_inventory_current i ON p.product_key = i.product_key
        WHERE p.category = :category
          AND p.is_active = TRUE
        ORDER BY p.product_name
        LIMIT 500
    """
    
    response = redshift_data.execute_statement(
        ClusterIdentifier=os.environ['REDSHIFT_CLUSTER_ID'],
        Database=os.environ['REDSHIFT_DATABASE'],
        SecretArn=os.environ['REDSHIFT_SECRET_ARN'],
        Sql=sql,
        Parameters=[{'name': 'category', 'value': product_category}]
    )
    
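    # Poll for completion (or use async pattern for longer queries)
    result = wait_for_query(redshift_data, response['Id'])
    # format_records is assumed to return a list of dicts keyed by column name
    records = format_records(result)

    headers = {
        'Content-Type': 'application/json',
        'Cache-Control': 'max-age=300'  # 5-minute cache for stable data
    }
    # The Data API result carries no freshness metadata of its own, so this
    # assumes freshness is the latest inventory update among the returned rows
    timestamps = [r['last_updated_at'] for r in records if r.get('last_updated_at')]
    if timestamps:
        headers['X-Data-As-Of'] = max(timestamps)

    return {
        'statusCode': 200,
        'headers': headers,
        'body': json.dumps({
            'data': records,
            'count': result['TotalNumRows'],
            'api_version': 'v1'
        })
    }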

Note the X-Data-As-Of response header: it communicates data freshness to consumers, which is a critical piece of contract fulfilment in a data API.
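
The handler above calls wait_for_query and format_records without defining them. One possible shape for those two helpers is sketched below, using the Redshift Data API's describe_statement and get_statement_result calls. The timeout and polling interval are assumptions, chosen to stay under API Gateway's default integration timeout of roughly 29 seconds, and only the first page of results is read.

```python
import time


def wait_for_query(client, statement_id, timeout_s=25.0, poll_s=0.25):
    """Poll the Redshift Data API until the statement finishes, then fetch results.

    Returns the first page of get_statement_result; larger result sets
    would need to follow NextToken.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc['Status']
        if status == 'FINISHED':
            return client.get_statement_result(Id=statement_id)
        if status in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Query {status}: {desc.get('Error', 'no detail')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Query {statement_id} still {status} after {timeout_s}s')
        time.sleep(poll_s)


def format_records(result):
    """Turn Data API typed cells into a list of dicts keyed by column name."""
    columns = [col['name'] for col in result['ColumnMetadata']]
    rows = []
    for record in result['Records']:
        # Each cell is a one-key dict such as {'stringValue': 'x'} or {'isNull': True}
        values = [
            None if cell.get('isNull') else next(iter(cell.values()))
            for cell in record
        ]
        rows.append(dict(zip(columns, values)))
    return rows
```

Polling inside the Lambda keeps the API synchronous, which suits short queries. For anything that regularly approaches the timeout, returning a statement ID and letting the consumer poll a separate status endpoint is the more robust pattern.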

GraphQL with AWS AppSync for Analytical Queries

For more flexible analytical use cases, AWS AppSync with a Lambda data source provides a managed GraphQL layer. The AppSync schema defines the types and queries available:

type Product {
  productId: ID!
  productName: String!
  category: String!
  unitPriceCad: Float!
  inventory: InventoryStatus
}

type InventoryStatus {
  quantityOnHand: Int!
  warehouseLocation: String!
  lastUpdatedAt: AWSDateTime!
}

type SalesSummary {
  category: String!
  totalRevenueCad: Float!
  unitsSold: Int!
  uniqueCustomers: Int!
}

type Query {
  product(productId: ID!): Product
  productsByCategory(category: String!, limit: Int): [Product]
  salesSummary(startDate: AWSDate!, endDate: AWSDate!, groupBy: String!): [SalesSummary]
}

AppSync handles authentication (via Cognito User Pools or API keys), field-level resolvers, and caching. The resolver for salesSummary can execute a different Athena query than the resolver for product: consumers see a unified graph while the backend can mix Redshift, Athena, and DynamoDB as appropriate for each data domain.
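
A single Lambda function can back several fields by dispatching on the field being resolved. The sketch below assumes AppSync direct Lambda resolvers (no mapping template), where the event carries info.parentTypeName, info.fieldName, and arguments. The resolver functions are placeholders that return canned values; real ones would query Redshift, Athena, or DynamoDB.

```python
def resolve_product(args):
    # Placeholder: a real resolver would look the product up in Redshift or DynamoDB.
    return {
        'productId': args['productId'],
        'productName': 'Example product',
        'category': 'demo',
        'unitPriceCad': 0.0,
        'inventory': None,
    }


def resolve_sales_summary(args):
    # Placeholder: a real resolver would run an Athena query for the date range.
    return []


# Map (parent type, field) pairs from the schema to resolver functions.
RESOLVERS = {
    ('Query', 'product'): resolve_product,
    ('Query', 'salesSummary'): resolve_sales_summary,
}


def lambda_handler(event, context):
    info = event['info']
    key = (info['parentTypeName'], info['fieldName'])
    resolver = RESOLVERS.get(key)
    if resolver is None:
        raise ValueError(f'No resolver registered for {key[0]}.{key[1]}')
    return resolver(event.get('arguments') or {})
```

Keeping the dispatch table explicit makes it obvious which backend serves which field, and an unregistered field fails loudly instead of returning an empty result.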

Governance and Versioning Strategy

The most common failure mode in data API implementations is insufficient attention to versioning discipline. When you expose data through an API, you are making an implicit promise to consumers that the contract will be stable. Breaking that promise (by changing field names, dropping fields, or altering data types without a version bump) causes cascading failures in consumer applications.

Practical versioning guidance:

  • Major version in the URL path (/v1/, /v2/): breaking changes require a new major version. Maintain the previous major version for a documented deprecation period (typically 6-12 months).
  • Minor changes via field additions only: adding new optional fields to responses is non-breaking. REST consumers that ignore unknown fields, and GraphQL consumers that do not request the new fields, are unaffected.
  • Explicit deprecation headers: use Deprecation and Sunset HTTP headers to signal when a version will be retired, giving consumers advance notice.
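
The deprecation headers mentioned above can be generated by a small helper. This sketch assumes the Deprecation header takes the "@" plus Unix-seconds form used by the published specification (earlier drafts used other forms) and that Sunset is an HTTP-date. The function name and the successor URL are illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(deprecated_at: datetime, sunset_at: datetime,
                        successor_url: str) -> dict[str, str]:
    """Build response headers announcing that an API version is being retired.

    Both datetimes must be timezone-aware UTC values.
    """
    return {
        # Structured-field date: "@" followed by Unix seconds
        'Deprecation': f'@{int(deprecated_at.timestamp())}',
        # HTTP-date ending in "GMT"
        'Sunset': format_datetime(sunset_at, usegmt=True),
        # Points consumers at the replacement version
        'Link': f'<{successor_url}>; rel="successor-version"',
    }


headers = deprecation_headers(
    deprecated_at=datetime(2023, 6, 30, 23, 59, 59, tzinfo=timezone.utc),
    sunset_at=datetime(2024, 6, 30, 23, 59, 59, tzinfo=timezone.utc),
    successor_url='https://api.example.com/v2/products',  # placeholder URL
)
```

Merging these headers into every /v1/ response during the deprecation period gives consumers a machine-readable retirement date they can alert on.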

This versioning discipline integrates naturally with the data governance framework principles of data asset ownership and lifecycle management. Each API version is a distinct data asset with its own owner, SLA, and retirement plan.

Measuring API Data Product Health

An API-first architecture makes data product health measurable in ways that direct database access cannot. Standard API metrics to track in CloudWatch:

  • Request latency (p50, p95, p99): per endpoint and per consumer
  • Error rate by type (4xx authentication/authorisation errors, 5xx backend failures)
  • Cache hit ratio: for endpoints with CloudFront or API Gateway caching configured
  • Consumer call volume: to identify which consumers are your highest-value users and which might be hitting endpoints inefficiently

Setting CloudWatch Alarms on p99 latency and 5xx error rate gives you SLA-grade operational visibility into each data product. When combined with the data freshness monitoring practices for the underlying pipelines, you have end-to-end observability from source ingestion to consumer delivery.
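
A p99 latency alarm of this kind might be defined as follows. The helper only builds the arguments for CloudWatch's put_metric_alarm call; the alarm name pattern, evaluation window, threshold, and SNS topic are assumptions to adjust per data product. The ApiName and Stage dimensions apply to API Gateway REST APIs.

```python
def p99_latency_alarm(api_name: str, stage: str, threshold_ms: float,
                      sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for p99 latency on one API stage."""
    return {
        'AlarmName': f'{api_name}-{stage}-p99-latency',
        'Namespace': 'AWS/ApiGateway',
        'MetricName': 'Latency',          # reported in milliseconds
        'Dimensions': [
            {'Name': 'ApiName', 'Value': api_name},
            {'Name': 'Stage', 'Value': stage},
        ],
        'ExtendedStatistic': 'p99',       # percentiles use ExtendedStatistic, not Statistic
        'Period': 60,                     # one-minute datapoints
        'EvaluationPeriods': 5,           # assumed: breach for 5 consecutive minutes
        'Threshold': threshold_ms,
        'ComparisonOperator': 'GreaterThanThreshold',
        'TreatMissingData': 'notBreaching',
        'AlarmActions': [sns_topic_arn],
    }


# Usage, assuming boto3 and AWS credentials are available:
#   import boto3
#   boto3.client('cloudwatch').put_metric_alarm(**p99_latency_alarm(
#       'product-data-api', 'prod', 800.0, '<sns-topic-arn>'))
```

A matching alarm on the 5XXError metric, using the same dimensions with a Sum or Average statistic, covers the error-rate half of the SLA.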

When Not to Use an API-First Architecture

API-first data architecture adds engineering overhead. It is not the right choice when:

  • Consumers are only internal data engineers who need full dataset access for pipeline development: direct Redshift or Athena access is more efficient
  • Data volumes are large and query patterns are unpredictable: APIs that proxy through to analytical databases can be expensive and slow for ad-hoc heavy queries
  • Your team lacks API engineering expertise: a poorly designed data API is worse than direct database access because it adds a layer of complexity without the reliability benefits

For organisations with a mix of consumer types, a hybrid approach works well: direct warehouse access for data engineering and data science, API access for operational consumers, product teams, and external partners.

Building the Foundation for Data Products

An API-first architecture is ultimately an investment in data product discipline. It forces explicit decisions about ownership, versioning, access control, and SLAs that most data teams defer until consumers are already frustrated. Organisations that make this investment early tend to find that their data platform scales more gracefully, not because APIs are magic, but because the discipline they enforce prevents the accumulation of implicit dependencies that make platforms brittle.

Infra IT Consulting helps Canadian and international organisations design and implement data product architectures on AWS, including API layer design, Redshift and Athena query optimisation for API-latency requirements, and governance frameworks for data product lifecycle management. Contact us to explore what an API-first approach would look like for your data platform.
