Building an Insurance Data Platform on AWS
The Canadian insurance industry sits at the intersection of actuarial science, regulatory scrutiny, and increasingly sophisticated fraud. Property and casualty insurers are contending with climate-related claims volatility: catastrophic weather events in 2023 alone generated over $3.1 billion in insured losses, making it one of the costliest years on record. Life and health insurers, meanwhile, are navigating IFRS 17 implementation requirements that have fundamentally changed how liability reserves are modelled and reported. Insurance fraud in Canada also costs the industry over $3 billion annually, with a growing share enabled by organised fraud rings that exploit gaps in cross-company data sharing.
Against this backdrop, the data infrastructure that served Canadian insurers through the 2010s — policy administration systems with bolt-on reporting databases, Excel-based actuarial models, and rules-based fraud detection — is no longer adequate. This post describes how Infra IT Consulting architects data platforms for Canadian property and casualty and life and health insurers on AWS, covering actuarial pipeline design, historical claims analytics with Redshift Spectrum, fraud detection with SageMaker, and the OSFI and FSRA compliance considerations that shape every design decision.
Actuarial Modelling Data Pipelines
Actuarial teams are among the most demanding analytical users in any organisation. They work with large datasets (exposure bases spanning millions of policy-months, claims triangles going back 20+ years), sophisticated models (GLMs, GBMs, stochastic reserving models), and tight reporting deadlines (quarterly reserve filings, IFRS 17 contractual service margin calculations). Their tools of choice are R and Python, not SQL dashboards — which means the data platform must deliver data to their preferred environments efficiently, not force them into a BI tool.
The actuarial data pipeline on AWS starts with policy and claims data from the policy administration system (PAS) and claims management system (CMS). AWS Database Migration Service (DMS) replicates these operational databases to S3 in near real time using change data capture (CDC). Glue ETL jobs transform the raw replicated data into actuarial-ready datasets: exposure tables with earned premium by risk period, loss development triangles with paid and incurred amounts by accident year and development period, and experience study datasets linking claims to policyholder attributes.
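To make the loss development triangle concrete, here is a minimal sketch of the aggregation in plain Python. It is not the Glue job itself: the function name paid_triangle, the (accident_year, payment_year, amount) tuple layout, and the use of annual development periods are all assumptions chosen for illustration.

```python
from collections import defaultdict


def paid_triangle(payments):
    """Build a cumulative paid loss triangle.

    payments: iterable of (accident_year, payment_year, amount) tuples
              (a hypothetical layout, not the CMS schema).
    Returns {accident_year: [cumulative paid at development period 0, 1, ...]}.
    """
    payments = list(payments)

    # Sum incremental payments by accident year and development period.
    incremental = defaultdict(float)
    for accident_year, payment_year, amount in payments:
        dev = payment_year - accident_year  # development period in years
        if dev < 0:
            raise ValueError("payment year precedes accident year")
        incremental[(accident_year, dev)] += amount

    # The latest payment year sets the diagonal of the triangle.
    valuation_year = max(payment_year for _, payment_year, _ in payments)

    triangle = {}
    for accident_year in sorted({ay for ay, _, _ in payments}):
        running, row = 0.0, []
        for dev in range(valuation_year - accident_year + 1):
            running += incremental.get((accident_year, dev), 0.0)
            row.append(running)
        triangle[accident_year] = row
    return triangle


example_payments = [
    (2021, 2021, 100.0), (2021, 2022, 50.0), (2021, 2023, 25.0),
    (2022, 2022, 120.0), (2022, 2023, 60.0),
    (2023, 2023, 90.0),
]
triangle = paid_triangle(example_payments)
# {2021: [100.0, 150.0, 175.0], 2022: [120.0, 180.0], 2023: [90.0]}
```

Each row shortens by one development period per accident year, which is the triangular shape reserving methods work from. A production job would express the same group-and-accumulate logic in Spark and add incurred amounts alongside paid.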
These datasets land in S3 in Parquet format, registered in the Glue Data Catalog. Actuaries access them from Amazon SageMaker Studio notebooks using PyArrow or the R arrow package, reading directly from S3 without needing a database connection. For models that produce outputs to be consumed by finance or management reporting, the results are written back to S3 and picked up by Redshift for inclusion in financial reporting queries.
The IFRS 17 use case deserves specific attention. The standard requires insurers to measure insurance liabilities at a current estimate of future cash flows, which means actuarial models must run at a granularity (the group of insurance contracts, or GIC) and frequency (quarterly) that legacy systems were not designed for. The AWS architecture handles this by partitioning the actuarial dataset by GIC and running parallelised SageMaker Processing Jobs across the GIC population — reducing model runtime from multi-day batch jobs to hours.
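The fan-out over the GIC population can be sketched locally as follows. This is a stand-in under stated assumptions, not the SageMaker configuration: a thread pool plays the role of the parallel Processing Jobs, and the helper names (shard_gics, value_shard, run_valuation), the flat discount rate, and the present-value calculation are hypothetical placeholders for the real actuarial model.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gics(gic_ids, n_shards):
    """Deal GIC identifiers round-robin into at most n_shards work lists."""
    shards = [[] for _ in range(n_shards)]
    for i, gic_id in enumerate(sorted(gic_ids)):
        shards[i % n_shards].append(gic_id)
    return [shard for shard in shards if shard]  # drop empty shards


def value_shard(gic_ids, cash_flows, discount_rate):
    """Placeholder model: present value of expected cash flows per GIC.

    cash_flows[gic] lists expected amounts at the end of period 1, 2, ...
    """
    return {
        gic: sum(
            cf / (1 + discount_rate) ** t
            for t, cf in enumerate(cash_flows[gic], start=1)
        )
        for gic in gic_ids
    }


def run_valuation(cash_flows, n_shards=4, discount_rate=0.03):
    """Value every GIC, one shard per worker, and merge the results."""
    shards = shard_gics(cash_flows.keys(), n_shards)
    results = {}
    # Each worker stands in for one independent processing job.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(
            lambda shard: value_shard(shard, cash_flows, discount_rate), shards
        )
        for partial in partials:
            results.update(partial)
    return results


example_cash_flows = {"GIC-A": [100.0], "GIC-B": [110.0, 121.0]}
present_values = run_valuation(example_cash_flows, n_shards=2, discount_rate=0.10)
```

The property that matters is that each GIC is valued independently, so shards share no state and the merge is a simple dictionary update. That independence is what lets the job count scale with the GIC population.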
For related data modelling patterns, see The Modern Data Stack Explained.
Redshift Spectrum for Historical Claims Analytics
Insurance actuarial work is uniquely dependent on long historical data series. Reserve adequacy analysis requires claims development triangles going back 10–15 accident years. Rate filings with provincial regulators (the Financial Services Regulatory Authority of Ontario, or FSRA, for Ontario P&C; the AMF for Quebec) must be supported by 5–10 years of loss experience. Policyholder behaviour analysis for life insurance products requires cohort tracking over 20+ year durations.
Storing 15 years of claims data in a Redshift cluster would be expensive and largely wasteful — the majority of historical data is queried infrequently. Amazon Redshift Spectrum addresses this by querying S3-resident data directly from Redshift without loading it into the cluster. Actuaries write standard SQL joins that span both Redshift cluster tables (holding the most recent 3 years of data for fast interactive queries) and Spectrum external tables (pointing to the historical S3 archive) — transparently, without needing to know which storage tier a given accident year sits in.
A query joining 15 years of claims development data to produce a paid loss triangle across all lines of business — which would require hours in a legacy on-premises system — completes in 3–5 minutes in Redshift Spectrum because the MPP architecture parallelises the S3 scan across hundreds of nodes. Actuaries who previously ran triangles only at quarter-end because of runtime constraints can now run them ad hoc during the rate filing development cycle.
Data partitioning in S3 is critical to Spectrum performance. The historical claims archive is partitioned by line_of_business, accident_year, and accident_quarter, allowing Spectrum’s partition pruning to skip irrelevant partitions and scan only the data relevant to a given query. A query for Ontario auto bodily injury claims from accident years 2015–2020 reads only the 6 matching accident-year partitions (24 accident-quarter partitions) for that line of business. Against a 15-year archive, that is 6 of 15 accident years, a 2.5x reduction in data read before counting the other lines of business that are skipped entirely.
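The effect of partition pruning can be illustrated with a small simulation. The sketch below assumes Hive-style key=value prefixes, a hypothetical bucket name, and three made-up line-of-business codes; it only counts which partitions a filter would touch and does not call Redshift or S3.

```python
from itertools import product


def build_partitions(lines_of_business, accident_years, quarters=(1, 2, 3, 4)):
    """Enumerate every partition in a (hypothetical) claims archive."""
    return [
        {"line_of_business": lob, "accident_year": year, "accident_quarter": q}
        for lob, year, q in product(lines_of_business, accident_years, quarters)
    ]


def prune(partitions, line_of_business, year_from, year_to):
    """Keep only partitions whose keys satisfy the query predicates."""
    return [
        p for p in partitions
        if p["line_of_business"] == line_of_business
        and year_from <= p["accident_year"] <= year_to
    ]


def s3_prefix(base, partition):
    """Render a partition as a Hive-style S3 prefix."""
    return (
        f"{base}/line_of_business={partition['line_of_business']}"
        f"/accident_year={partition['accident_year']}"
        f"/accident_quarter={partition['accident_quarter']}/"
    )


# Assumed archive: 3 lines of business x 15 accident years x 4 quarters.
all_partitions = build_partitions(
    ["auto_bi", "auto_pd", "property"], range(2009, 2024)
)
scanned = prune(all_partitions, "auto_bi", 2015, 2020)
prefixes = [s3_prefix("s3://example-claims-archive", p) for p in scanned]
print(f"scanned {len(scanned)} of {len(all_partitions)} partitions")
```

In this assumed layout the query touches 24 of 180 partitions. Pruning only applies when the filter is written against the partition columns themselves, so predicates on derived expressions of the accident date would not benefit.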
SageMaker for Fraud Detection
Insurance fraud detection has a fundamental imbalance problem: fewer than 1% of claims are fraudulent, so a model that classifies everything as legitimate achieves 99% accuracy while being completely useless. The data engineering challenge is building the feature set and training infrastructure that allows a model to find the meaningful signal in a heavily imbalanced dataset.
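The accuracy trap is easy to verify with a few lines of arithmetic. The sketch below uses an illustrative population of 1,000 claims with exactly 10 fraudulent ones (a round 1% chosen for the example, not a measured rate) and a "model" that labels every claim legitimate.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Define precision and recall as 0 when their denominators are empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1] * 10 + [0] * 990   # 10 fraudulent claims out of 1,000
y_pred = [0] * 1000             # classify everything as legitimate
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out at 0.99 while recall on the fraud class is 0.0, which is why imbalanced problems are usually evaluated on precision, recall, or precision-recall curves rather than accuracy.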
The fraud detection pipeline on AWS integrates three data sources: structured claims data from the CMS (claim type, coverage, amount, claimant attributes, reporting channel), unstructured adjuster notes from S3 (processed by Amazon Comprehend to extract entities, sentiment, and key phrases), and network data derived from shared claimant attributes (attorneys, repair shops, medical providers appearing across multiple claims — a strong indicator of organised fraud rings).
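One simple way to turn shared claimant attributes into a network feature is degree centrality on a graph where two claims are linked if they share any party. The sketch below assumes that definition and a hypothetical input shape (claim ID mapped to a set of party identifiers); the actual pipeline may use richer centrality measures.

```python
from collections import defaultdict


def claim_degree_centrality(claim_parties):
    """Degree centrality of each claim in the shared-party graph.

    claim_parties: {claim_id: set of party identifiers such as a lawyer,
                    repair shop, or clinic} (hypothetical layout).
    Two claims are neighbours if they share at least one party.
    """
    # Invert the mapping: which claims does each party appear on?
    claims_by_party = defaultdict(set)
    for claim_id, parties in claim_parties.items():
        for party in parties:
            claims_by_party[party].add(claim_id)

    # Every claim on the same party is a neighbour of the others.
    neighbours = {claim_id: set() for claim_id in claim_parties}
    for claims in claims_by_party.values():
        for claim_id in claims:
            neighbours[claim_id] |= claims - {claim_id}

    # Normalise by the maximum possible number of neighbours.
    n = len(claim_parties)
    return {
        claim_id: (len(linked) / (n - 1) if n > 1 else 0.0)
        for claim_id, linked in neighbours.items()
    }


example_claims = {
    "C1": {"shop:A", "lawyer:X"},
    "C2": {"shop:A"},
    "C3": {"lawyer:X", "clinic:M"},
    "C4": {"clinic:Z"},
}
centrality = claim_degree_centrality(example_claims)
```

Here C1 links to both C2 (same repair shop) and C3 (same lawyer), so it scores highest, while C4 shares nothing and scores zero. High-volume legitimate providers will also score highly, so a feature like this is a signal for the model rather than a verdict.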
Feature engineering runs as a SageMaker Processing Job that joins these sources and produces a feature vector for each claim: claim characteristics, claimant history metrics (prior claims count, prior fraud flags, address stability), network centrality scores, and NLP features from the adjuster notes. The resulting feature dataset in S3 is used to train an XGBoost model on SageMaker, with class imbalance addressed through SMOTE oversampling of the minority (fraudulent) class during training.
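The idea behind SMOTE is to create synthetic minority samples by interpolating between a real minority sample and one of its nearest minority neighbours. The following is a simplified, standard-library sketch of that idea, not the implementation used in the pipeline and not the imbalanced-learn API: it picks base samples at random rather than iterating over every minority sample, and the function name and parameters are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by SMOTE-style interpolation.

    minority: list of numeric feature vectors for the fraud class.
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (Euclidean).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


fraud_samples = [[1.0, 5.0], [1.5, 4.0], [2.0, 6.0], [3.0, 5.5]]
new_samples = smote(fraud_samples, n_synthetic=8, k=2, seed=42)
```

Because every synthetic point lies on a segment between two real fraud samples, it stays inside the region the minority class already occupies. Oversampling should be applied to the training split only; validation and test data need to keep the real class ratio so that reported metrics reflect production conditions.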
The deployed SageMaker endpoint scores each new claim as it enters the CMS. A score above 0.7 generates a “High Suspicion” flag in the Special Investigations Unit (SIU) queue; a score between 0.4 and 0.7 generates a “Review Recommended” flag for the adjuster. Model explainability using SHAP values — available natively in SageMaker Clarify — provides the top three features driving each score, giving SIU investigators a starting point for their investigation rather than a black-box recommendation.
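The routing rule described above can be sketched as a small function. The thresholds come from the text; the treatment of the exact boundary values (0.7 falls in the review band, 0.4 is included in it), the queue names, and the function name route_claim are assumptions made for illustration.

```python
HIGH_SUSPICION_THRESHOLD = 0.7
REVIEW_THRESHOLD = 0.4


def route_claim(score, top_features=()):
    """Map a fraud score in [0, 1] to a flag and a destination queue.

    top_features: feature names ordered by contribution (for example from
                  SHAP values); only the first three are passed along.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")

    if score > HIGH_SUSPICION_THRESHOLD:
        flag, queue = "High Suspicion", "SIU"
    elif score >= REVIEW_THRESHOLD:  # boundary handling is an assumption
        flag, queue = "Review Recommended", "adjuster"
    else:
        flag, queue = None, None  # no flag raised

    return {
        "score": score,
        "flag": flag,
        "queue": queue,
        "top_features": list(top_features)[:3],
    }


decision = route_claim(
    0.82,
    ["network_centrality", "prior_claims_count", "address_changes", "claim_amount"],
)
```

Keeping the thresholds as named constants makes them easy to review and tune against SIU capacity, since lowering either one increases the volume of flagged claims.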
Model performance monitoring using SageMaker Model Monitor detects data drift: as fraud ring tactics evolve, the distribution of fraud indicators changes, and the model’s discrimination power degrades. Automated alerts when data drift exceeds a configured threshold trigger a model retraining pipeline, ensuring the fraud detection capability keeps pace with evolving fraud patterns.
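As an illustration of what a drift check measures, the sketch below computes the Population Stability Index (PSI) between a baseline feature distribution and a recent one. PSI is used here only as a familiar example metric; it is not a description of how SageMaker Model Monitor computes drift, and the 0.2 alert threshold is a common rule of thumb assumed for the example.

```python
import math


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    Bins are equal-width over the baseline range; values outside that
    range are counted in the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb alert level

baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [v + 0.3 for v in baseline_scores]

stable = psi(baseline_scores, baseline_scores)
drifted = psi(baseline_scores, shifted_scores)
needs_retraining = drifted > DRIFT_THRESHOLD
```

An unchanged distribution yields a PSI of zero, while the shifted sample lands well above the threshold and would trigger the retraining pipeline in this sketch.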
OSFI and FSRA Compliance Considerations
For federally regulated insurers (OSFI-supervised), the data platform must satisfy OSFI’s Guideline B-10 on third-party risk management and Guideline B-13 on technology and cyber risk, the same guidelines that apply to banks. The key requirements are cloud provider risk assessment, data residency in Canada for sensitive policyholder data, access control logging, and recovery capability testing. All of these are addressed by deploying in ca-central-1 with CloudTrail logging, Lake Formation access controls, and documented DR procedures tested annually.
FSRA, which supervises Ontario insurance companies and pension plans, has published guidance on model risk management that directly affects AI-based fraud detection models. FSRA expects three things: that models used in claims decisions are documented with their purpose, training data, validation results, and limitations; that model outputs inform rather than automatically determine claim outcomes; and that model performance is monitored, with the insurer able to demonstrate that the model does not produce discriminatory outcomes by protected characteristic. SageMaker Model Cards provide a structured template for this documentation, and SageMaker Clarify’s bias detection capabilities support the non-discrimination requirement.
For the broader data governance framework that supports these compliance requirements, see Data Governance Framework.
Conclusion
Canadian insurers building data platforms on AWS have an opportunity to address three persistent pain points simultaneously: slow, manual actuarial workflows, expensive historical data access, and rules-based fraud detection that cannot keep pace with sophisticated fraud rings. Amazon Redshift with Spectrum handles the historical analytics requirement without the storage cost of a traditional data warehouse. SageMaker delivers fraud detection models that outperform rules-based systems while providing the explainability that FSRA’s model risk guidance requires. Glue and S3 provide the scalable, cost-efficient foundation that supports both.
Infra IT Consulting has worked with Canadian P&C and life insurers to design and implement data platforms that deliver actuarial productivity improvements, regulatory reporting automation, and measurable reductions in fraudulent claims payouts. Contact us to discuss your insurance data platform requirements.