Building a Healthcare Data Platform on AWS Under PIPEDA
Healthcare organisations in Canada sit at the intersection of two competing pressures: the urgent need to derive clinical and operational insights from data, and an equally urgent obligation to protect that data under PIPEDA, provincial health privacy legislation (PHIPA in Ontario, HIA in Alberta, PIPA in BC), and the professional obligations of regulated health practitioners. A data platform that prioritises analytics capability over privacy controls is not just a compliance risk — it is an ethical failure.
This post describes how Infra IT Consulting designs healthcare data platforms on AWS that take both pressures seriously. We cover the technical controls required for protected health information (PHI) handling, the AWS services that implement them, and the data ingestion patterns for HL7 v2 and FHIR R4 — the two message formats that dominate Canadian clinical data exchange.
Understanding PHI Obligations in the Canadian Context
PIPEDA defines personal health information broadly: it includes any information about an individual’s physical or mental health, their health services, or their health insurance. Provincial health privacy laws add jurisdiction-specific requirements — Ontario’s PHIPA, for example, applies specifically to health information custodians (HICs) and agents, with stricter breach notification timelines and more prescriptive security requirements than PIPEDA alone.
For AWS-based platforms, the most critical compliance implication is data residency. PHI collected under provincial health legislation must remain in Canada. This means all storage, processing, and analytics must occur in AWS’s ca-central-1 (Montreal) region, with no data egress to US-based services. This affects service selection: several AWS managed services that are convenient in other contexts — including some AI services — may not offer Canadian region endpoints and therefore cannot be used for PHI workloads.
Breach notification obligations also shape architecture. PIPEDA requires notification to the Office of the Privacy Commissioner (OPC) and to affected individuals where a breach creates a real risk of significant harm. This means the platform must be capable of answering “what data was accessed, by whom, and when” within hours of a suspected breach — which requires comprehensive logging and the ability to query it rapidly.
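To make the "what, by whom, and when" question concrete, here is a minimal sketch that filters CloudTrail-style S3 data event records for object reads under a PHI prefix. It assumes S3 data events are being recorded and have already been loaded as Python dictionaries; the bucket name, prefix, and function names are hypothetical. At real volumes the same filter would more likely be expressed as a query over the logs in Athena or CloudTrail Lake.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a CloudTrail-style UTC timestamp such as 2024-03-01T14:05:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def phi_access_report(events, bucket, prefix, start, end):
    """Return sorted (when, who, what) tuples for S3 reads under a PHI prefix."""
    report = []
    for ev in events:
        # Only S3 object reads are of interest in this sketch.
        if ev.get("eventSource") != "s3.amazonaws.com" or ev.get("eventName") != "GetObject":
            continue
        params = ev.get("requestParameters") or {}
        when = parse_ts(ev["eventTime"])
        in_scope = (
            params.get("bucketName") == bucket
            and params.get("key", "").startswith(prefix)
            and start <= when <= end
        )
        if in_scope:
            who = ev.get("userIdentity", {}).get("arn", "unknown")
            report.append((when, who, params["key"]))
    return sorted(report)
```

Calling `phi_access_report(events, "phi-lake", "clinical/raw/", start, end)` with a suspected breach window returns one row per read, ordered by time, which is the shape a breach assessment typically needs.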
Encryption at Rest and In Transit with AWS KMS
Every PHI workload on AWS must use encryption in transit (TLS 1.2 minimum, TLS 1.3 preferred) and encryption at rest using AWS Key Management Service (KMS) with customer-managed keys (CMKs). The distinction between CMKs and AWS-managed keys matters for healthcare: CMKs give the organisation control over key lifecycle, rotation, and the ability to revoke access to data by deleting or disabling the key — a capability that matters for data subject deletion requests.
In practice, the encryption architecture works as follows. A CMK is created in AWS KMS with a key policy that grants encryption/decryption rights only to specific IAM roles (the ELT pipeline service role, the analytics workload role, and the security team). All S3 buckets containing PHI use SSE-KMS with this CMK. Amazon Redshift clusters use CMK-based cluster encryption. Glue job output is encrypted using the same CMK through S3 server-side encryption.
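The key policy described above might look roughly like the following sketch, which builds the policy document as a Python dictionary. The account ID, role names, and the separate key administrator role are assumptions for illustration; the administrator statement is included because a key policy needs at least one principal that can manage the key, and it deliberately omits encrypt and decrypt actions.

```python
import json


def build_phi_key_policy(account_id, admin_role, usage_roles):
    """Build a KMS key policy limiting cryptographic use to the named IAM roles."""
    def role_arn(name):
        return f"arn:aws:iam::{account_id}:role/{name}"

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Hypothetical admin role: can manage the key but not use it on data.
                "Sid": "AllowKeyAdministration",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn(admin_role)},
                "Action": [
                    "kms:Create*", "kms:Describe*", "kms:Enable*", "kms:List*",
                    "kms:Put*", "kms:Update*", "kms:Revoke*", "kms:Disable*",
                    "kms:Get*", "kms:Delete*",
                    "kms:ScheduleKeyDeletion", "kms:CancelKeyDeletion",
                ],
                "Resource": "*",
            },
            {
                # Only these roles may encrypt or decrypt PHI with the key.
                "Sid": "AllowPhiCryptographicUse",
                "Effect": "Allow",
                "Principal": {"AWS": [role_arn(name) for name in usage_roles]},
                "Action": [
                    "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
                    "kms:GenerateDataKey*", "kms:DescribeKey",
                ],
                "Resource": "*",
            },
        ],
    }


policy = build_phi_key_policy(
    "111122223333",                      # placeholder account ID
    "kms-key-admin",                     # hypothetical role name
    ["elt-pipeline-service", "analytics-workload", "security-team"],
)
policy_json = json.dumps(policy, indent=2)  # string form suitable for a key policy
```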
For data in transit, all service-to-service communication uses VPC endpoints to avoid data traversing the public internet. S3 bucket policies include an aws:SecureTransport condition that explicitly denies non-HTTPS access. RDS and Redshift connections enforce SSL, and the connection string parameter sslmode=require is mandated in all application and pipeline configurations.
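As an illustration of the bucket policy condition mentioned above, the following sketch builds a deny statement keyed on aws:SecureTransport. The bucket name is a placeholder and the function name is ours; the statement itself follows the commonly documented pattern for rejecting non-TLS requests.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build an S3 bucket policy that denies any request not made over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonTLSRequests",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" when the request is plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


bucket_policy_json = json.dumps(deny_insecure_transport_policy("example-phi-bucket"))
```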
Automatic annual key rotation is enabled on all CMKs. Rotation events are logged to CloudTrail, and a CloudWatch alarm notifies the security team if a CMK is scheduled for deletion without an approved change ticket.
VPC Isolation and Network Architecture
Healthcare data must not be reachable from the public internet. The AWS network architecture for a PHI data platform uses a multi-layer VPC design:
A private subnet hosts all compute resources — Glue workers, EMR clusters, Lambda functions processing PHI, and Redshift. No resource in this subnet has a public IP address. Outbound internet access for software updates routes through a NAT gateway in the public subnet.
VPC endpoints replace internet paths for all AWS service calls. S3 gateway endpoints, Glue interface endpoints, KMS endpoints, CloudWatch Logs endpoints, and Redshift endpoints are all provisioned within the VPC. This means data flowing between a Glue job and an S3 bucket never leaves the AWS network.
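The endpoint set described above can be sketched as a list of parameter dictionaries of the kind passed to the EC2 CreateVpcEndpoint call. The VPC ID is a placeholder, and the service names follow the usual com.amazonaws.<region>.<service> convention; treat them as assumptions to confirm against the endpoint services actually listed for the region.

```python
# Service short name mapped to endpoint type; S3 uses a gateway endpoint here,
# the rest are interface endpoints.
PHI_ENDPOINT_SERVICES = {
    "s3": "Gateway",
    "glue": "Interface",
    "kms": "Interface",
    "logs": "Interface",      # CloudWatch Logs
    "redshift": "Interface",
}


def endpoint_specs(region, vpc_id):
    """Return one CreateVpcEndpoint-style parameter dict per required service."""
    return [
        {
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "VpcEndpointType": endpoint_type,
        }
        for service, endpoint_type in PHI_ENDPOINT_SERVICES.items()
    ]


specs = endpoint_specs("ca-central-1", "vpc-0123456789abcdef0")  # placeholder VPC ID
```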
Security groups implement least-privilege network access at the resource level. The Glue job security group allows only outbound HTTPS to VPC endpoints. The Redshift security group allows inbound connections only from specific analytics tool IP ranges and from the Glue job security group. No security group has inbound rules from 0.0.0.0/0.
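The "no inbound rules from 0.0.0.0/0" rule is easy to check automatically. The sketch below scans security group descriptions shaped like the output of the EC2 DescribeSecurityGroups call and reports any rule open to the whole internet; the function name is ours, and the IPv6 check is an addition beyond what the text above states.

```python
def open_ingress_rules(security_groups):
    """Return (group_id, from_port, to_port, cidr) for each world-open inbound rule."""
    findings = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            ports = (perm.get("FromPort"), perm.get("ToPort"))
            for ip_range in perm.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    findings.append((group["GroupId"], *ports, "0.0.0.0/0"))
            # Assumption: the IPv6 equivalent should be treated the same way.
            for ip_range in perm.get("Ipv6Ranges", []):
                if ip_range.get("CidrIpv6") == "::/0":
                    findings.append((group["GroupId"], *ports, "::/0"))
    return findings
```

An empty result means the condition holds; running a check like this on a schedule gives early warning if someone adds an overly broad rule.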
AWS PrivateLink is used for any third-party SaaS tools (EHR connectors, HL7 interface engines) that need to push data into the platform, avoiding public internet exposure for the ingestion path.
Lake Formation Row-Level Security for Clinical Data
Clinical data access is inherently role-based: a cardiologist should see cardiac records, an oncologist should see oncology records, and a billing analyst should see cost data without seeing diagnoses. AWS Lake Formation’s fine-grained access controls implement this at the data layer rather than the application layer — meaning the restriction applies regardless of which tool queries the data.
Lake Formation data filters implement row-level security by defining a row filter expression that is evaluated at query time. For a multi-specialty hospital system, filters are applied to the fct_encounters table: a data filter for the CARDIOLOGY department restricts rows where department_code = 'CARD', and this filter is bound to the IAM role used by cardiology analysts. When a cardiologist logs in to Amazon QuickSight and their session assumes the cardiology analyst role, Lake Formation evaluates the filter and returns only cardiology encounter rows. This happens transparently, with no change to the query itself.
Column-level security removes PHI fields from analyst-facing views. The mrn (medical record number), dob, postal_code, and provincial_health_number columns are excluded from all roles except the regulated research access role, which requires additional approval workflow and logs every query to a dedicated audit trail table.
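The row filter and column exclusions described above can be combined in a single Lake Formation data cells filter. The sketch below builds the TableData structure accepted by the CreateDataCellsFilter API; the catalog ID, database name, and filter name are placeholders, and it assumes for illustration that the four PHI columns live on fct_encounters, which the text does not state.

```python
PHI_COLUMNS = ["mrn", "dob", "postal_code", "provincial_health_number"]


def build_department_filter(catalog_id, database, table, department_code):
    """Build a data cells filter: one department's rows, PHI columns excluded."""
    # Guard the value interpolated into the filter expression.
    if not department_code.isalnum():
        raise ValueError("department_code must be alphanumeric")
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"{department_code.lower()}_rows_no_phi",  # hypothetical naming scheme
        "RowFilter": {"FilterExpression": f"department_code = '{department_code}'"},
        "ColumnWildcard": {"ExcludedColumnNames": list(PHI_COLUMNS)},
    }


cardiology_filter = build_department_filter(
    "111122223333", "clinical", "fct_encounters", "CARD"
)
# With boto3 this dict would be passed as
# lakeformation_client.create_data_cells_filter(TableData=cardiology_filter),
# then granted to the cardiology analyst role.
```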
For related data governance patterns, see our post on data governance frameworks.
HL7 and FHIR Data Ingestion Patterns
The two dominant standards for clinical data exchange in Canada are HL7 v2 (the older, pipe-delimited message format used by most Canadian hospital information systems) and HL7 FHIR R4 (the modern RESTful standard increasingly mandated by provincial digital health agencies and pan-Canadian interoperability frameworks).
HL7 v2 ingestion typically flows through an interface engine (Mirth Connect or Rhapsody) that receives messages via MLLP from hospital systems and transforms them to JSON before delivering to Amazon Kinesis Data Streams or Amazon SQS. A Lambda function consumes these events, parses the JSON, validates required fields, and writes structured records to S3 in Parquet format using a schema consistent with the FHIR resource model. This approach handles the volume of a large acute care hospital (50,000–200,000 ADT messages per day) without overloading downstream systems.
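A minimal sketch of the Lambda step, assuming the interface engine delivers one JSON document per Kinesis record. The required field names are hypothetical, since the actual JSON shape depends on how the interface engine is configured, and the Parquet write to S3 is left out to keep the example dependency-free.

```python
import base64
import json

# Hypothetical required fields; the real list depends on the engine's JSON mapping.
REQUIRED_FIELDS = ("message_type", "patient_id", "event_timestamp")


def decode_kinesis_records(event):
    """Yield the JSON payload of each record in a Kinesis-triggered Lambda event."""
    for record in event.get("Records", []):
        # Kinesis delivers record data to Lambda base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(payload)


def missing_fields(message):
    """Return the required fields that are absent or empty in a message."""
    return [name for name in REQUIRED_FIELDS if message.get(name) in (None, "")]


def handler(event, context=None):
    """Split incoming messages into valid records and rejects with reasons."""
    valid, rejected = [], []
    for message in decode_kinesis_records(event):
        missing = missing_fields(message)
        if missing:
            rejected.append({"message": message, "missing": missing})
        else:
            valid.append(message)
    # A real pipeline would write `valid` to S3 as Parquet and route
    # `rejected` to a dead-letter location for review.
    return {"valid": valid, "rejected": rejected}
```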
FHIR R4 ingestion uses AWS HealthLake, a HIPAA-eligible managed service that provides a FHIR-compliant API endpoint, stores FHIR resources in a managed data store, and supports bulk export to S3 in NDJSON format. Because of the residency requirement described earlier, confirm that HealthLake is offered in ca-central-1 before selecting it for PHI workloads. For analytics purposes, Glue ETL jobs consume the bulk export, flatten nested FHIR resource structures (Observation resources are notably complex), and produce query-optimised Parquet files partitioned by resource type and date.
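To show what flattening involves, here is a simplified sketch that turns Observation resources from an NDJSON export into flat rows. It handles only the first coding and a valueQuantity result; real Observations can also carry components, other value types, and multiple codings, so this is a starting point under those assumptions and not a complete mapping. The output column names are ours.

```python
import json


def flatten_observation(resource):
    """Flatten one FHIR R4 Observation into a single row (first coding only)."""
    coding = (resource.get("code", {}).get("coding") or [{}])[0]
    quantity = resource.get("valueQuantity", {})
    return {
        "observation_id": resource.get("id"),
        "status": resource.get("status"),
        "patient_ref": resource.get("subject", {}).get("reference"),
        "code_system": coding.get("system"),
        "code": coding.get("code"),
        "display": coding.get("display"),
        "effective_datetime": resource.get("effectiveDateTime"),
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
    }


def flatten_ndjson(lines):
    """Flatten every Observation found in an iterable of NDJSON lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        resource = json.loads(line)
        if resource.get("resourceType") == "Observation":
            rows.append(flatten_observation(resource))
    return rows
```

In a Glue job the same mapping would typically be applied per record before writing Parquet, with the partition columns derived from the resource type and effective_datetime.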
Both pipelines land in the same S3 data lake under a clinical/raw prefix, allowing unified analytics across message sources. A Glue Data Catalog table spans both, with a source_format partition column distinguishing HL7 v2-derived and FHIR-derived records for lineage tracing.
Conclusion
Building a compliant healthcare data platform on AWS is achievable, but it requires deliberate choices at every layer: CMK encryption rather than managed keys, VPC endpoints rather than internet access, Lake Formation row-level security rather than application-layer filtering, and HL7/FHIR-aware ingestion patterns rather than generic file drops. Getting any of these wrong creates both compliance exposure and analytical dead ends that are expensive to fix later.
Infra IT Consulting works with Canadian hospitals, health authorities, and digital health companies to design and implement data platforms that meet the letter and spirit of PIPEDA and provincial health privacy law, while delivering the analytical capabilities that clinical and operational teams need. Contact us to discuss your healthcare data platform requirements.
For related reading on the broader data lake patterns underpinning this architecture, see Building a Data Lake on S3.