
AWS Lake Formation Best Practices for Data Governance

By Infra IT Consulting · January 17, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

Data governance is where many data lake projects stall. Teams build an elegant ingestion pipeline, land terabytes of data in Amazon S3, and then discover that granting access to the right people — and only the right people — requires a maze of IAM policies, bucket policies, and S3 prefix structures that no one fully understands. AWS Lake Formation was designed to solve this problem. But used incorrectly, it adds a new layer of complexity without delivering the governance clarity it promises.

This guide covers the Lake Formation concepts and configuration decisions that matter most in production, with a focus on the security and compliance requirements common to Canadian, UK, and African data teams operating under regional data protection frameworks.

Understanding the Lake Formation Permission Model

Lake Formation sits on top of the AWS Glue Data Catalog and Amazon S3. It does not replace IAM — it extends it. The permission model works as follows:

You register S3 locations with Lake Formation, making them “data lake locations.”
You define databases and tables in the Glue Data Catalog that map to those S3 locations.
You grant Lake Formation permissions (SELECT, INSERT, DELETE, DESCRIBE, ALTER) on those databases and tables to principals such as IAM users and roles. IAM groups cannot be used as Lake Formation principals.
When an IAM principal queries a table via Amazon Athena or Amazon Redshift Spectrum, Lake Formation validates whether the principal has the appropriate table and column grants before allowing the query to proceed.

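The critical insight is that Lake Formation permissions work alongside IAM permissions rather than replacing them. A principal needs both a Lake Formation grant and an IAM policy that permits the relevant Glue actions and lakeformation:GetDataAccess. For registered locations, Lake Formation vends temporary credentials to the query engine, so the principal does not need direct S3 permissions on the underlying data. If either the grant or the IAM policy is missing, access is denied.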
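To make the IAM half of that requirement concrete, here is a minimal Python sketch that builds a policy document for a read-only analyst role. The function name and the exact action list are assumptions about a typical Athena consumer, not an authoritative or exhaustive set; Athena query actions and S3 access to the query results location are deliberately left out.

```python
import json


def analyst_iam_policy():
    """Build an IAM policy document for a role that reads governed tables.

    Assumption: the role needs catalog metadata actions plus
    lakeformation:GetDataAccess, which allows Lake Formation to vend
    temporary credentials for registered locations.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CatalogMetadata",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": "*",
            },
            {
                "Sid": "LakeFormationDataAccess",
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be pasted into an IAM policy.
    print(json.dumps(analyst_iam_policy(), indent=2))
```

This policy on its own grants no data access: without a matching Lake Formation grant, queries by the role are still denied.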

This is also why the default IAMAllowedPrincipals grant is dangerous. Under Lake Formation’s default data lake settings, every new database and table receives a Super grant to IAMAllowedPrincipals, which effectively bypasses Lake Formation permission checks for any principal with IAM access to Glue. Revoke IAMAllowedPrincipals grants on production databases and their tables as soon as your explicit grants are in place.
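A sketch of that revocation using boto3 might look like the following. The helper names are hypothetical. The principal identifier IAM_ALLOWED_PRINCIPALS and the ALL permission are how this default grant appears in the Lake Formation API; the database and table names are placeholders.

```python
from typing import Optional


def revoke_iam_allowed_principals_request(database, table: Optional[str] = None):
    """Build keyword arguments for lakeformation.revoke_permissions.

    With only a database name, the request targets the database-level grant;
    with a table name, it targets that table's grant.
    """
    if table is None:
        resource = {"Database": {"Name": database}}
    else:
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": resource,
        "Permissions": ["ALL"],
    }


def apply_revocation(request):
    """Send the request. boto3 is imported here so the builder works offline."""
    import boto3

    boto3.client("lakeformation").revoke_permissions(**request)
```

Tables carry their own IAMAllowedPrincipals grants, so a database-level revocation alone is usually not enough; the same builder can be called once per table.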

Setting Up Lake Formation on a New Data Lake

The cleanest path to a governed Lake Formation setup on a data lake built on Amazon S3 follows this sequence:

Step 1: Designate Lake Formation administrators. In the Lake Formation console, add the IAM roles used by your data platform team as data lake administrators. Do not add individual user accounts — use roles that your team members assume. This keeps the admin surface manageable as team membership changes.
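The same step can be scripted. The sketch below assumes boto3 and uses a hypothetical helper that merges role ARNs into the current settings; put_data_lake_settings replaces the whole settings object, so reading the existing settings first avoids dropping other fields.

```python
def add_admin_roles(settings, role_arns):
    """Return a copy of DataLakeSettings with the given roles added as admins.

    Rejects anything that is not a role ARN, matching the advice to avoid
    individual user accounts. The input dictionary is not modified.
    """
    admins = list(settings.get("DataLakeAdmins", []))
    seen = {a["DataLakePrincipalIdentifier"] for a in admins}
    for arn in role_arns:
        if ":role/" not in arn:
            raise ValueError(f"not an IAM role ARN: {arn}")
        if arn not in seen:
            admins.append({"DataLakePrincipalIdentifier": arn})
            seen.add(arn)
    updated = dict(settings)
    updated["DataLakeAdmins"] = admins
    return updated


def apply_admin_roles(role_arns):
    """Read, merge, and write back the settings (requires AWS credentials)."""
    import boto3  # imported lazily so add_admin_roles can be tested offline

    lf = boto3.client("lakeformation")
    current = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(DataLakeSettings=add_admin_roles(current, role_arns))
```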

Step 2: Register S3 data lake locations. For each S3 bucket or prefix that contains data lake content, register it as a Lake Formation data location. Registration requires an IAM role with read/write access to the S3 path: either the Lake Formation service-linked role or a role you define. Prefer a dedicated, narrowly scoped role (for example, LakeFormationServiceRole) over an overly broad admin role.
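As an illustration, registration with a dedicated role could be scripted as below. This is a sketch assuming boto3; the bucket, prefix, and role ARN are placeholders and the helper names are hypothetical.

```python
def register_location_request(bucket, prefix, role_arn):
    """Build keyword arguments for lakeformation.register_resource.

    An empty prefix registers the whole bucket; otherwise only the prefix
    is registered. UseServiceLinkedRole is False because a dedicated role
    is supplied explicitly.
    """
    cleaned = prefix.strip("/")
    arn = f"arn:aws:s3:::{bucket}/{cleaned}" if cleaned else f"arn:aws:s3:::{bucket}"
    return {
        "ResourceArn": arn,
        "UseServiceLinkedRole": False,
        "RoleArn": role_arn,
    }


def register_location(bucket, prefix, role_arn):
    """Register the location (requires AWS credentials)."""
    import boto3  # imported lazily so the builder can be used offline

    request = register_location_request(bucket, prefix, role_arn)
    boto3.client("lakeformation").register_resource(**request)
```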

Step 3: Create databases and tables in the Glue Data Catalog. If you are starting fresh, create these through Lake Formation rather than the Glue console directly — Lake Formation then sets itself as the authoritative permission manager for those resources. For existing Glue databases, migrate them into Lake Formation governance by revoking IAMAllowedPrincipals and adding explicit grants.

Step 4: Grant permissions using the Lake Formation console or AWS CLI. For each consumer role (analyst team, BI tool service account, ML team), grant the minimum necessary permissions. A read-only analyst role typically needs SELECT on specific tables and DESCRIBE on the database. An ETL service role needs SELECT, INSERT, and ALTER on the tables it writes.
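The two role profiles described in Step 4 can be written down as data before anything is granted. The sketch below assumes boto3's grant_permissions request shape; the function names, role ARNs, and table names are hypothetical.

```python
def table_grant(principal_arn, database, table, permissions):
    """Build one grant_permissions request for a single table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def analyst_grants(role_arn, database, tables):
    """Read-only profile: DESCRIBE on the database, SELECT on named tables."""
    grants = [{
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["DESCRIBE"],
    }]
    grants += [table_grant(role_arn, database, t, ["SELECT"]) for t in tables]
    return grants


def etl_grants(role_arn, database, tables):
    """Writer profile: SELECT, INSERT, and ALTER on the tables the job writes."""
    return [table_grant(role_arn, database, t, ["SELECT", "INSERT", "ALTER"]) for t in tables]


def apply_grants(grants):
    """Send each request (requires AWS credentials)."""
    import boto3  # imported lazily so the builders can be tested offline

    lf = boto3.client("lakeformation")
    for grant in grants:
        lf.grant_permissions(**grant)
```

Keeping the grants as plain data makes them easy to review in a pull request before they are applied.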

Column-Level and Row-Level Security

Column-level security is one of Lake Formation’s most valuable features for teams handling regulated data. Rather than maintaining separate “sanitised” copies of tables, you grant column-level permissions so that different principals see different projections of the same underlying table.

For example, a customer transactions table might contain:

transaction_id, customer_id, amount, currency, email_address, sin_last4, card_bin

A fraud analytics team needs transaction_id, amount, currency, and card_bin. The data science team needs customer_id, amount, and currency. Neither team should see email_address or sin_last4. With Lake Formation column grants:

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/FraudAnalyticsRole \
  --permissions SELECT \
  --resource '{
    "TableWithColumns": {
      "DatabaseName": "cleansed",
      "Name": "customer_transactions",
      "ColumnNames": ["transaction_id", "amount", "currency", "card_bin"]
    }
  }'

When the fraud analytics role queries this table in Athena, the query will succeed only if it selects from the permitted columns. Attempting to select email_address fails, because that column is not visible to the role.

Row-level security via Lake Formation data filters provides similar granularity at the row level, allowing a regional team to query only rows where region = 'CA-ON', for instance. Row-level filters and cell-level security (combining row and column filters) are supported for Athena and Redshift Spectrum queries.
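A data filter for the regional example could be defined roughly as follows. This sketch assumes boto3's create_data_cells_filter and grant_permissions request shapes, and it assumes the table has a region column, which the sample column list above does not show. The helper names and the filter name are hypothetical.

```python
def regional_filter_request(catalog_id, database, table, filter_name, region, excluded_columns):
    """Build keyword arguments for lakeformation.create_data_cells_filter.

    Combines a row filter (one region) with a column exclusion list,
    which is what cell-level security amounts to.
    """
    if "'" in region:
        raise ValueError("region must not contain a single quote")
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": f"region = '{region}'"},
            "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
        }
    }


def filter_grant_request(principal_arn, filter_request):
    """Build a grant_permissions request giving SELECT through the filter."""
    data = filter_request["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": data["TableCatalogId"],
                "DatabaseName": data["DatabaseName"],
                "TableName": data["TableName"],
                "Name": data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }
```

The first request would be passed to create_data_cells_filter and the second to grant_permissions on a boto3 Lake Formation client.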

Cross-Account Data Sharing

In multi-account AWS environments (a common pattern for larger organisations separating production, development, and shared services accounts), Lake Formation enables secure cross-account data sharing without copying data.

The data producer account registers the S3 location and manages the Glue Data Catalog. The data consumer account queries the data via Athena using a Lake Formation RAM (Resource Access Manager) share. The setup involves:

In the producer account, create a Lake Formation resource share via AWS RAM that includes the target Glue database and tables.
In the consumer account, accept the RAM share.
In the producer account, grant the consumer account’s IAM principals the appropriate Lake Formation permissions on the shared resources.
In the consumer account, create a resource link: a Glue database entry that points to the shared database in the producer account.

Data never leaves the producer account’s S3 bucket. The consumer’s Athena queries are validated against Lake Formation permissions in real time. This pattern is particularly valuable for regulated industries where the data-owning team needs to maintain strict audit control over who accessed what and when.
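The producer-side grant and the consumer-side database link from the steps above can be sketched as two request builders. This assumes boto3's grant_permissions and Glue create_database request shapes; the account IDs, database names, and helper names are placeholders.

```python
def cross_account_grant_request(consumer_account_id, database, table, permissions=("SELECT",)):
    """Producer side: grant on a table to the consumer AWS account.

    The principal identifier is the 12-digit account ID rather than an ARN.
    """
    if not (consumer_account_id.isdigit() and len(consumer_account_id) == 12):
        raise ValueError("expected a 12-digit AWS account ID")
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def resource_link_input(link_name, producer_account_id, shared_database):
    """Consumer side: keyword arguments for glue.create_database.

    TargetDatabase makes the new entry a link to the producer's database
    instead of a regular local database.
    """
    return {
        "DatabaseInput": {
            "Name": link_name,
            "TargetDatabase": {
                "CatalogId": producer_account_id,
                "DatabaseName": shared_database,
            },
        }
    }
```

Analysts in the consumer account would then query through the link name in Athena, while the data and its grants remain owned by the producer account.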

Audit Logging with AWS CloudTrail

Every Lake Formation permission grant, revocation, and data access event is logged in AWS CloudTrail. For compliance with PIPEDA, UK GDPR, or sector-specific frameworks like PHIPA (Ontario’s Personal Health Information Protection Act), this audit trail is non-negotiable.

Enable CloudTrail in your data lake account and configure it to log to a separate, locked-down S3 bucket. Key events to monitor:

lakeformation:GrantPermissions — who granted access to what
lakeformation:RevokePermissions — when access was removed
glue:GetTable and glue:GetPartitions — metadata access events
s3:GetObject via CloudTrail S3 data events — individual file access (high volume, enable selectively)

Pair CloudTrail with Amazon CloudWatch Logs and set up metric filters to alert on anomalous grant events — for example, any grant of database-level SELECT permissions to an IAM role not in your approved list. This gives your security team real-time visibility into permission changes.
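A starting point for such a metric filter is sketched below, assuming boto3's CloudWatch Logs put_metric_filter call and a CloudTrail trail that already delivers to a log group. The log group, filter, metric, and namespace names are placeholders. This version counts every GrantPermissions event; narrowing it to roles outside an approved list would need additional matching on the request parameters, which is not shown.

```python
def grant_event_metric_filter(log_group):
    """Build keyword arguments for logs.put_metric_filter.

    The pattern matches CloudTrail records emitted by Lake Formation
    whose event name is GrantPermissions.
    """
    pattern = (
        '{ ($.eventSource = "lakeformation.amazonaws.com") '
        '&& ($.eventName = "GrantPermissions") }'
    )
    return {
        "logGroupName": log_group,
        "filterName": "LakeFormationGrantPermissions",
        "filterPattern": pattern,
        "metricTransformations": [
            {
                "metricName": "LakeFormationGrantCount",
                "metricNamespace": "DataLakeGovernance",
                "metricValue": "1",  # each matching event adds 1 to the metric
            }
        ],
    }


def create_metric_filter(log_group):
    """Create the filter (requires AWS credentials)."""
    import boto3  # imported lazily so the builder can be tested offline

    boto3.client("logs").put_metric_filter(**grant_event_metric_filter(log_group))
```

A CloudWatch alarm on the resulting metric can then notify the security team whenever a grant occurs.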

For broader metadata management, Lake Formation integrates with the AWS Glue Data Catalog, whose table metadata and schema version history complement the access audit logs.

Tag-Based Access Control (TBAC)

As the number of databases and tables grows, managing per-resource Lake Formation grants becomes administratively heavy. Lake Formation Tag-Based Access Control (TBAC) addresses this by letting you define attribute tags, called LF-Tags, such as data_classification=pii, domain=finance, or environment=production, and grant permissions based on those tags rather than individual resource names.

A grant like “the Data Science role may SELECT on all tables tagged domain=marketing and data_classification=non_pii” automatically covers any new table added with those tags, without requiring a new explicit grant. This dramatically reduces the operational overhead of permission management as your catalog grows from tens to hundreds of tables.
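The quoted grant could be expressed as a tag policy roughly like this. The sketch assumes boto3's grant_permissions request shape with an LFTagPolicy resource; the helper name and role ARN are hypothetical, and the tag keys and values must already exist as LF-Tags.

```python
def tag_policy_grant(principal_arn, tags, permissions=("SELECT", "DESCRIBE")):
    """Build a grant_permissions request scoped by LF-Tags instead of names.

    `tags` maps each tag key to a list of accepted values. Keys are combined
    with AND, and the values listed for one key are alternatives (OR).
    """
    expression = [
        {"TagKey": key, "TagValues": sorted(values)}
        for key, values in sorted(tags.items())
    ]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"LFTagPolicy": {"ResourceType": "TABLE", "Expression": expression}},
        "Permissions": list(permissions),
    }


def apply_tag_policy_grant(request):
    """Send the request (requires AWS credentials)."""
    import boto3  # imported lazily so the builder can be tested offline

    boto3.client("lakeformation").grant_permissions(**request)
```

For example, tag_policy_grant(role_arn, {"domain": ["marketing"], "data_classification": ["non_pii"]}) covers every table carrying both tags, including tables tagged later.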

TBAC is the recommended approach for data lakes with more than a few dozen tables. Start with a clear tagging taxonomy agreed on across your data platform team before enabling TBAC, as changing the taxonomy later requires revising all grants built on the old tags.

Common Mistakes to Avoid

Leaving IAMAllowedPrincipals in place: Mentioned earlier, but worth repeating. This is the single most common Lake Formation misconfiguration in production environments.

Over-privileged ETL roles: Glue ETL jobs that write to the data lake often get overly broad Lake Formation grants because it is easy. Scope ETL role grants to only the tables the job reads from and writes to, using explicit table-level grants rather than database-level grants.

Not testing permission denials: When setting up Lake Formation, teams usually test that authorised principals can access data. Fewer teams test that unauthorised principals cannot, and then discover gaps in production. Add negative access tests (checks that a role is denied the tables and columns it should not see) to your data platform CI/CD pipeline.
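One inexpensive form of such a test is a static check over the existing grants. The sketch below assumes entries shaped like the PrincipalResourcePermissions items returned by boto3's list_permissions; the function name is hypothetical. It only inspects grants, so an end-to-end test that assumes the role and runs a real query is still worth adding.

```python
def forbidden_column_grants(entries, principal_arn, forbidden_columns):
    """Return (table, exposed_columns) pairs where a principal can read
    columns it should never see."""
    forbidden = set(forbidden_columns)
    violations = []
    for entry in entries:
        if entry["Principal"]["DataLakePrincipalIdentifier"] != principal_arn:
            continue
        permissions = entry.get("Permissions", [])
        if "SELECT" not in permissions and "ALL" not in permissions:
            continue
        resource = entry["Resource"]
        if "Table" in resource:
            # A whole-table grant exposes every column, forbidden ones included.
            name = resource["Table"].get("Name", "*")
            violations.append((name, sorted(forbidden)))
        elif "TableWithColumns" in resource:
            columns = resource["TableWithColumns"]
            if "ColumnNames" in columns:
                exposed = forbidden & set(columns["ColumnNames"])
            else:
                wildcard = columns.get("ColumnWildcard", {})
                exposed = forbidden - set(wildcard.get("ExcludedColumnNames", []))
            if exposed:
                violations.append((columns["Name"], sorted(exposed)))
    return violations
```

In a CI job, the entries would come from list_permissions, and the build would fail whenever the returned list is not empty.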

Skipping the governance planning phase: Lake Formation enforces whatever governance policy you define. If that policy is not thought through before implementation, enforcement will be either too loose (undermining governance) or too tight (blocking legitimate work). Document your data classification taxonomy and access policy before writing the first Lake Formation grant.

Conclusion

AWS Lake Formation provides genuine, enforceable data governance for S3-based data lakes — but it requires deliberate setup and a clear governance policy to deliver on that promise. Revoking default grants, implementing column and row-level security, enabling cross-account sharing correctly, and monitoring via CloudTrail are the practices that separate a governed data lake from one that merely looks governed. Ready to build or optimise your AWS data infrastructure? Contact the Infra IT Consulting team for a free consultation.
