E-Commerce Data Pipelines: From Click to Insight in Near Real Time

By Infra IT Consulting · May 16, 2024 · 8 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

The economics of e-commerce favour data-driven teams. A merchandising manager who can see conversion rates by product category in near real time during a promotional event can pull underperforming banners and redirect email campaigns before the sale window closes. A supply chain analyst who sees cart abandonment spikes on specific SKUs can check inventory depth and flag reorder needs before stock-outs materialise. A marketing team with funnel visibility by acquisition channel can shift spend to the channels delivering the best cost-per-acquisition before a campaign burns through its budget.

This level of operational responsiveness requires a data pipeline that closes the gap between user behaviour and business insight. The traditional approach — nightly ETL jobs loading yesterday’s activity into a data warehouse — delivers insights that are a full business day old, which is too slow for decisions that need to happen during business hours. Near-real-time analytics, with latency measured in minutes rather than hours, is the competitive differentiator for e-commerce teams willing to invest in the infrastructure to support it.

This post describes the end-to-end AWS architecture Infra IT Consulting uses for e-commerce data pipelines: Kinesis Firehose for clickstream ingestion, S3 as the analytical foundation, Glue crawlers for automated schema discovery, dbt for funnel analytics models, and QuickSight for merchandising dashboards.

Clickstream Ingestion with Kinesis Data Firehose

Every interaction a visitor has with an e-commerce site — page views, product impressions, add-to-cart events, checkout steps, search queries, and purchases — generates a clickstream event. At the scale of a mid-market Canadian e-commerce retailer doing 50,000 daily sessions, this is 2–5 million events per day. At peak promotional periods (Black Friday, Boxing Day), it can be 10–20x that volume.
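To size the ingestion layer, it helps to translate those daily volumes into per-second rates. The sketch below assumes events are spread evenly across the day, which real traffic never is, so the results are best read as lower bounds on the rate the pipeline must absorb. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_events_per_second(events_per_day: int, peak_multiplier: int = 1) -> float:
    """Average event rate, assuming an even spread of events across the day."""
    return events_per_day * peak_multiplier / SECONDS_PER_DAY


if __name__ == "__main__":
    # Daily volumes and peak multipliers taken from the figures quoted above.
    for label, daily, multiplier in [
        ("normal day, low estimate", 2_000_000, 1),
        ("normal day, high estimate", 5_000_000, 1),
        ("peak day, 20x high estimate", 5_000_000, 20),
    ]:
        rate = average_events_per_second(daily, multiplier)
        print(f"{label}: {rate:.1f} events/second on average")
```

Even the peak-day average is only on the order of a thousand events per second, although bursts within a promotional hour will run well above the daily average.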

Amazon Kinesis Data Firehose is the right ingestion service for clickstream data. Unlike Kinesis Data Streams (which requires explicit consumer management), Firehose is fully managed: you configure the source, the destination, and the buffering parameters, and AWS handles the scaling, durability, and delivery. A JavaScript tag on the e-commerce site sends events to a Kinesis Firehose delivery stream via HTTPS. The tag fires on page load, product impression, and custom events defined in Google Tag Manager or a similar tag management system.

The Firehose delivery stream buffers incoming events and delivers them to S3 in batches, using dynamic partitioning to write to paths based on event attributes: s3://your-datalake/clickstream/event_type={event_type}/event_date={date}/event_hour={hour}/. This partition structure allows downstream queries to read only the events they need — a query for product page view events on a specific date reads only from the event_type=product_view partition, not the entire event dataset.
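The mapping from an event to its partition path can be sketched in a few lines of Python. This is only an illustration of the path layout shown above, not Firehose configuration. It assumes each event carries an ISO 8601 `event_timestamp` string and that partitions are keyed on UTC, neither of which the pipeline description specifies.

```python
from datetime import datetime, timezone


def partition_prefix(event: dict, base: str = "s3://your-datalake/clickstream") -> str:
    """Build the S3 prefix an event would land under, keyed on UTC date and hour."""
    # Assumption: event_timestamp is an ISO 8601 string with a UTC offset.
    ts = datetime.fromisoformat(event["event_timestamp"]).astimezone(timezone.utc)
    return (
        f"{base}/event_type={event['event_type']}"
        f"/event_date={ts:%Y-%m-%d}/event_hour={ts:%H}/"
    )


if __name__ == "__main__":
    example = {
        "event_type": "product_view",
        "event_timestamp": "2024-05-16T14:02:05+00:00",
    }
    print(partition_prefix(example))
```

Keying on UTC keeps an event from a late-evening shopper in one well-defined partition regardless of the visitor's local time zone. A team that reports in local business time may prefer a different convention.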

Data format conversion in Firehose transforms the raw JSON events to Parquet at delivery time, validating them against a table schema defined in the Glue Data Catalog. Parquet format reduces storage costs by approximately 75% compared to raw JSON and dramatically improves query performance in Athena and Redshift Spectrum. A typical Firehose configuration for an e-commerce clickstream sets buffering hints of 128 MB or 5 minutes, whichever is reached first, so new Parquet files land in S3 at least every 5 minutes during business hours.
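Firehose flushes a buffer when either a size threshold or a time interval is reached, whichever comes first. The toy model below illustrates that rule. It is a simplification with hypothetical names: it only checks the thresholds when a record arrives, whereas the managed service also flushes on its own timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferSketch:
    """Toy model of size-or-interval buffering (not the Firehose API)."""

    max_bytes: int = 128 * 1024 * 1024  # size threshold: 128 MB
    max_seconds: float = 300.0          # interval threshold: 5 minutes
    buffered_bytes: int = 0
    opened_at: Optional[float] = None   # time the first buffered record arrived

    def add(self, record_bytes: int, now: float) -> bool:
        """Buffer one record; return True if it triggers a flush."""
        if self.opened_at is None:
            self.opened_at = now
        self.buffered_bytes += record_bytes

        size_hit = self.buffered_bytes >= self.max_bytes
        time_hit = now - self.opened_at >= self.max_seconds
        if size_hit or time_hit:
            # Flush: reset the buffer so the next record opens a new batch.
            self.buffered_bytes = 0
            self.opened_at = None
            return True
        return False
```

At low traffic the interval threshold fires first and files are small. During a promotional peak the size threshold fires first and files are delivered more often than every 5 minutes.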

For stateful, per-session use cases such as cart abandonment email triggers (send an email 1 hour after cart abandonment if no purchase follows), batch delivery to S3 is not a good fit. A separate Kinesis Data Streams consumer, implemented as a Lambda function, handles the stateful detection logic: it tracks cart events per session and schedules the trigger when the abandonment condition is met.
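The abandonment condition itself is simple to state in code. The following is a minimal in-memory sketch, assuming events carry `session_id`, `event_type`, and an epoch-seconds `ts` field (hypothetical names). A deployed Lambda consumer would keep this state in an external store rather than in local dictionaries.

```python
from typing import Dict, Iterable, List, Set


def abandoned_sessions(events: Iterable[dict], now: float, wait_seconds: int = 3600) -> List[str]:
    """Return sessions with cart activity, no purchase, and at least
    `wait_seconds` elapsed since the last add-to-cart event."""
    last_cart_ts: Dict[str, float] = {}
    purchased: Set[str] = set()

    for event in events:
        sid = event["session_id"]
        if event["event_type"] == "add_to_cart":
            # Track the most recent cart activity per session.
            last_cart_ts[sid] = max(last_cart_ts.get(sid, 0.0), event["ts"])
        elif event["event_type"] == "purchase":
            purchased.add(sid)

    return sorted(
        sid
        for sid, ts in last_cart_ts.items()
        if sid not in purchased and now - ts >= wait_seconds
    )
```

Measuring the wait from the last cart event, rather than the first, avoids emailing a shopper who is still actively adding items.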

S3 Data Lake as the Analytical Foundation

The S3 data lake aggregates clickstream events alongside two other critical e-commerce data domains: the product catalogue (exported from the e-commerce platform as daily Parquet snapshots to S3) and order transaction data (replicated from the order management system via AWS DMS). Together, these three domains provide the raw material for all merchandising and marketing analytics.

The data lake uses a three-layer architecture. The raw layer holds data exactly as it arrived — immutable, partitioned, with no transformation applied. This is the audit trail and the source of record. The conformed layer holds validated, typed, deduplicated data ready for analytical use — clickstream events with session IDs standardised, product catalogue attributes denormalised, and order transactions with cancelled orders excluded. The curated layer holds business-level aggregations and dbt models that are consumed directly by QuickSight and reporting queries.

AWS Glue crawlers automate the schema discovery process. As Firehose delivers new Parquet files to the clickstream prefix, a Glue crawler running on a 15-minute schedule scans the new files, infers the schema, and updates the Glue Data Catalog. This means new event types introduced by the development team (say, a new “wishlist_add” event) automatically become queryable in Athena and Redshift Spectrum within 15 minutes, without any manual schema registration. The crawler also detects schema changes: if the development team adds a new field to the checkout_started event, the crawler updates the catalog table definition automatically.

Glue Data Quality rules on the crawler output validate that required fields (session_id, user_id or anonymous_id, event_timestamp, device_type) are present in all records. Records failing quality checks are quarantined in a separate S3 prefix for investigation rather than silently dropped or allowed to corrupt downstream models.
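The required-field rule and the quarantine split can be expressed as follows. This is a plain-Python illustration of the logic, not Glue Data Quality rule syntax, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Fields that must be present and non-empty on every record.
REQUIRED_FIELDS = ("session_id", "event_timestamp", "device_type")


def is_valid(record: dict) -> bool:
    """A record needs all required fields plus user_id or anonymous_id."""
    has_identity = bool(record.get("user_id") or record.get("anonymous_id"))
    has_required = all(record.get(name) not in (None, "") for name in REQUIRED_FIELDS)
    return has_identity and has_required


def split_records(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate records into (valid, quarantined) without dropping any."""
    valid: List[dict] = []
    quarantined: List[dict] = []
    for record in records:
        (valid if is_valid(record) else quarantined).append(record)
    return valid, quarantined
```

Every input record ends up in exactly one of the two lists, which is the property that makes quarantining auditable: the counts always add up to the number of records received.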

For a comprehensive treatment of S3 data lake architecture, see Building a Data Lake on S3.

dbt Models for Funnel Analytics

The funnel is the central analytical model of e-commerce: how many visitors progress from each stage to the next, and where do they drop off? Understanding conversion at each funnel stage — landing, browse, product view, add to cart, checkout initiation, payment, purchase — guides both site optimisation and inventory prioritisation decisions.
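Step-to-step conversion is the number of sessions reaching a stage divided by the number reaching the stage before it. A minimal sketch follows, assuming each session is represented as the set of stages it reached. The stage identifiers are illustrative renderings of the stages listed above.

```python
from typing import Dict, Iterable, List, Set

FUNNEL_STAGES: List[str] = [
    "landing",
    "browse",
    "product_view",
    "add_to_cart",
    "checkout_initiation",
    "payment",
    "purchase",
]


def step_conversion(sessions: Iterable[Set[str]]) -> Dict[str, float]:
    """Conversion rate into each stage, relative to the preceding stage."""
    sessions = list(sessions)
    counts = [sum(1 for reached in sessions if stage in reached) for stage in FUNNEL_STAGES]

    rates: Dict[str, float] = {}
    for previous, current, stage in zip(counts, counts[1:], FUNNEL_STAGES[1:]):
        # Guard against division by zero when no session reached the prior stage.
        rates[stage] = current / previous if previous else 0.0
    return rates
```

The stage with the lowest rate is where the funnel loses the largest share of the sessions that got that far, which is usually the first place to investigate.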

dbt models running on a Redshift compute environment (using dbt’s Redshift adapter) implement the funnel analytics logic in version-controlled, testable SQL. The int_sessions model sessionises clickstream events: consecutive events from the same visitor belong to one session until 30 minutes pass without activity, at which point the next event starts a new session. Each session is assigned a session_id, and session-level attributes (first touch channel, device type, entry page, country) are derived from the events within the session. The int_funnel_steps model identifies which funnel stages each session reached. The fct_funnel model aggregates to daily conversion rates by channel, device type, and product category.
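The sessionisation rule is easiest to see outside SQL. The sketch below shows the inactivity-gap logic in Python rather than as the dbt model itself. It assumes events carry a `visitor_id` and an epoch-seconds `ts` (hypothetical field names), and it treats a gap of strictly more than 30 minutes as the start of a new session.

```python
from typing import Dict, Iterable, List

SESSION_GAP_SECONDS = 30 * 60  # 30 minutes of inactivity ends a session


def sessionise(events: Iterable[dict]) -> List[dict]:
    """Return events sorted by visitor and time, each tagged with a session_id."""
    last_seen: Dict[str, float] = {}
    session_number: Dict[str, int] = {}
    tagged: List[dict] = []

    for event in sorted(events, key=lambda e: (e["visitor_id"], e["ts"])):
        visitor = event["visitor_id"]
        is_first = visitor not in last_seen
        if is_first or event["ts"] - last_seen[visitor] > SESSION_GAP_SECONDS:
            # First event for this visitor, or the gap exceeded the threshold.
            session_number[visitor] = session_number.get(visitor, 0) + 1
        last_seen[visitor] = event["ts"]
        tagged.append({**event, "session_id": f"{visitor}-{session_number[visitor]}"})
    return tagged
```

In SQL the same idea is typically expressed with a window function that compares each event's timestamp with the previous one for the same visitor, followed by a running sum over the "new session" flags.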

The key dbt design principle for e-commerce analytics is incremental models with event deduplication. Clickstream data often contains duplicate events from client-side network retries — a purchase event might be sent twice if the client loses connectivity after the first send. The dbt incremental model deduplicates on (session_id, event_type, event_timestamp) before inserting into the conformed table, preventing double-counting of conversions.
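The deduplication key described above can be illustrated with a short Python sketch that keeps the first occurrence of each (session_id, event_type, event_timestamp) combination. It shows the logic only; the dbt model would express the same rule in SQL.

```python
from typing import Iterable, List, Set, Tuple


def deduplicate(events: Iterable[dict]) -> List[dict]:
    """Keep the first event seen for each (session_id, event_type, event_timestamp)."""
    seen: Set[Tuple[str, str, str]] = set()
    unique: List[dict] = []
    for event in events:
        key = (event["session_id"], event["event_type"], event["event_timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

This key relies on a retried event carrying the same client-side timestamp as the original send. If the tag stamps events at transmission time instead, a client-generated event ID would be a safer key.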

dbt tests validate business logic constraints: session-level conversion rates must be between 0 and 1, purchase amounts must be positive, and a session cannot reach the “payment” stage without first reaching “checkout initiation”. Test failures send alerts to the analytics engineering Slack channel via dbt Cloud’s job notification integration, catching data quality regressions before they reach merchandising dashboards.
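The three constraints can be written as plain checks. The sketch below expresses them in Python to show what each test asserts. In the project itself they would be dbt tests, and the function and stage names here are hypothetical.

```python
from typing import List, Optional, Set


def conversion_rate_in_range(rate: float) -> bool:
    """Conversion rates are proportions, so they must lie in [0, 1]."""
    return 0.0 <= rate <= 1.0


def session_violations(stages_reached: Set[str], purchase_amount: Optional[float] = None) -> List[str]:
    """List the business-logic rules a single session breaks (empty if none)."""
    problems: List[str] = []
    if "payment" in stages_reached and "checkout_initiation" not in stages_reached:
        problems.append("payment reached without checkout_initiation")
    if purchase_amount is not None and purchase_amount <= 0:
        problems.append("purchase amount is not positive")
    return problems
```

Returning a list of violations rather than a single boolean makes the resulting alert more actionable, since it names the rule that failed.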

For the full dbt-on-AWS pattern, see dbt on AWS.

QuickSight Dashboards for Merchandising Teams

The end destination for all this pipeline complexity is a dashboard that a category manager can open during a promotional event and act on immediately. Amazon QuickSight’s SPICE (Super-fast, Parallel, In-memory Calculation Engine) ingests the dbt curated layer tables from Redshift, maintaining a compressed in-memory copy that delivers sub-second dashboard load times even for datasets with tens of millions of rows.

The merchandising performance dashboard shows near-real-time conversion rates (delayed by up to 15 minutes) by product category, top-selling products by revenue and units, and add-to-cart rates by SKU, allowing category managers to spot promotional performance outliers within minutes of a campaign going live. SPICE refresh is configured for every 15 minutes during business hours and every hour overnight.

The acquisition channel dashboard shows session volume, conversion rate, and revenue per session by marketing channel (paid search, organic search, email, social, direct) — giving the digital marketing team the data to make intra-day bid adjustments during heavy promotional periods. QuickSight’s anomaly detection feature highlights channels with statistically unusual conversion rate changes, surfacing issues like a broken landing page or a misdirected campaign before they cause material revenue loss.

Row-level security in QuickSight restricts category managers to their own category’s data. A dataset rule maps user email addresses to category codes in a reference table, applying the filter automatically without requiring separate dashboards per category.
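The effect of such a rule can be pictured as a filter over the dataset. The sketch below only illustrates the semantics, since QuickSight applies the rules itself. The column names `user_email` and `category_code` are assumptions, not the actual reference table schema.

```python
from typing import Iterable, List


def visible_rows(rows: Iterable[dict], user_email: str, rules: Iterable[dict]) -> List[dict]:
    """Return only the rows whose category the user is mapped to."""
    allowed = {
        rule["category_code"] for rule in rules if rule["user_email"] == user_email
    }
    return [row for row in rows if row["category_code"] in allowed]


if __name__ == "__main__":
    rules = [
        {"user_email": "footwear.manager@example.com", "category_code": "FOOTWEAR"},
        {"user_email": "outdoor.manager@example.com", "category_code": "OUTDOOR"},
    ]
    rows = [
        {"category_code": "FOOTWEAR", "revenue": 1200},
        {"category_code": "OUTDOOR", "revenue": 800},
    ]
    print(visible_rows(rows, "footwear.manager@example.com", rules))
```

Because the mapping lives in a table, a manager who covers two categories simply gets two rows in the rules, with no change to the dashboard.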

For a detailed QuickSight configuration and deployment guide, see Amazon QuickSight Guide.

Conclusion

Near-real-time e-commerce analytics is not a luxury for large retailers — it is increasingly a baseline capability for any online business competing for customer attention during high-stakes promotional windows. The AWS pipeline described here — Kinesis Firehose for clickstream ingestion, S3 with Glue crawlers for the data lake foundation, dbt for funnel models, and QuickSight for merchandising dashboards — delivers minute-level analytical latency at a cost structure appropriate for mid-market e-commerce businesses.

Infra IT Consulting designs and builds e-commerce data pipelines for retailers across Canada and internationally. If your merchandising and marketing teams are working from yesterday’s data when they need today’s, contact us to discuss building the pipeline that closes that gap.
