Automating Data Quality Checks with Great Expectations on AWS

By Infra IT Consulting · 9 min read

Data quality failures are expensive. A pipeline that silently writes incorrect data to your analytics layer can corrupt dashboards that executives use for decisions, trigger erroneous alerts in operational systems, and invalidate machine learning models trained on contaminated features. The cost of discovering a data quality issue three months after it was introduced — when every downstream system has already consumed the bad data — is orders of magnitude higher than catching it at the pipeline boundary where it entered.

Most engineering teams acknowledge this problem and respond with ad hoc assertions scattered through ETL code: a check that row counts are non-zero here, a column null check there. These checks erode over time as pipelines evolve, rarely produce actionable failure messages, and have no centralised reporting. Great Expectations is an open-source framework that brings structure, reusability, and observability to data quality validation, and its integration with AWS services makes it a practical choice for production data platforms.

What Great Expectations Actually Does

Great Expectations (GE) introduces a vocabulary for expressing data quality assertions — called expectations — that are declarative, composable, and executable against any tabular dataset. An expectation like expect_column_values_to_not_be_null(column="customer_id") or expect_column_values_to_be_between(column="order_value", min_value=0, max_value=1000000) reads like a human-readable data contract that is simultaneously executable code.

Expectations are grouped into Expectation Suites — named collections that define the quality contract for a specific dataset or pipeline stage. When you run a validation, GE evaluates the suite against your data and produces a Validation Result containing a pass/fail status for each expectation, the actual observed values, and a summary report.
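To make the shape of a Validation Result concrete, here is a minimal sketch that condenses one into an overall status plus a list of failed expectations. It works on a plain dict and assumes the JSON layout GE serialises results to (`success`, `results`, `expectation_config`, `result`). The sample data is hand-written for illustration and `summarize_validation` is a hypothetical helper, not part of GE.

```python
from typing import Any


def summarize_validation(result: dict[str, Any]) -> dict[str, Any]:
    """Condense a validation-result dict into overall status plus failures."""
    failures = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        failures.append(
            {
                "expectation": config.get("expectation_type"),
                "column": config.get("kwargs", {}).get("column"),
                "unexpected_percent": item.get("result", {}).get("unexpected_percent"),
            }
        )
    return {
        "success": bool(result.get("success")),
        "evaluated": len(result.get("results", [])),
        "failures": failures,
    }


# Hand-written sample in the assumed shape (not real pipeline output).
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "order_id"},
            },
            "result": {"element_count": 1000, "unexpected_count": 0, "unexpected_percent": 0.0},
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_in_set",
                "kwargs": {"column": "currency", "value_set": ["CAD", "USD", "GBP", "EUR"]},
            },
            "result": {"element_count": 1000, "unexpected_count": 12, "unexpected_percent": 1.2},
        },
    ],
}

summary = summarize_validation(sample_result)
```

A summary like this is handy for log lines and alert messages, where the full result document is too verbose.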

Data Docs is GE’s built-in reporting layer: it renders validation results as static HTML documentation that can be hosted on S3, giving data consumers and pipeline engineers a browsable audit trail of data quality history. This transforms data quality from a silent failure into a visible, documented system property.

Setting Up Great Expectations with S3 as the Backend

GE manages its configuration, expectation suites, validation results, and checkpoints through a Data Context. For AWS deployments, S3 is a practical backend for the expectation, validation, and checkpoint stores, as well as for the Data Docs site:

# great_expectations/great_expectations.yml (abbreviated)
config_version: 3.0
datasources:
  s3_datasource:
    module_name: great_expectations.datasource
    class_name: Datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: SparkDFExecutionEngine
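    data_connectors:
      # Discovers batches from files already stored under the S3 prefix
      s3_connector:
        class_name: InferredAssetS3DataConnector
        bucket: my-data-lake
        prefix: processed/orders/
      # Accepts in-memory DataFrames or explicit paths passed in a RuntimeBatchRequest
      runtime_connector:
        class_name: RuntimeDataConnector
        batch_identifiers:
          - run_date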

stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: my-ge-metadata
      prefix: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: my-ge-metadata
      prefix: validations/

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: my-ge-metadata
      prefix: checkpoints/

data_docs_sites:
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: my-ge-docs
      prefix: data_docs/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

This configuration stores everything in S3, making GE stateless from the compute perspective. Any Lambda function, Glue job, or EC2 instance with access to the S3 buckets can run validations and contribute to the shared reporting layer.

Writing Expectation Suites in Python

Expectation suites are defined programmatically and stored as JSON in your expectations S3 bucket. A practical suite for an orders dataset:

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

# Create or update an expectation suite
suite = context.create_expectation_suite(
    expectation_suite_name="orders.raw.v1",
    overwrite_existing=True
)

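# Get a validator against sample data.
# A RuntimeBatchRequest must target a RuntimeDataConnector on the datasource
# (assumed here to be named "runtime_connector", with "run_date" as a batch identifier).
validator = context.get_validator(
    batch_request=RuntimeBatchRequest(
        datasource_name="s3_datasource",
        data_connector_name="runtime_connector",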
        data_asset_name="orders",
        runtime_parameters={"path": "s3://my-data-lake/processed/orders/2024/04/15/"},
        batch_identifiers={"run_date": "2024-04-15"}
    ),
    expectation_suite_name="orders.raw.v1"
)

# Define expectations
validator.expect_column_to_exist("order_id")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")

validator.expect_column_to_exist("customer_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_match_regex("customer_id", r"^cust_[0-9]{6,}$")

validator.expect_column_values_to_be_between(
    column="order_value_cad",
    min_value=0.01,
    max_value=500000,
    mostly=0.999  # Allow 0.1% exceptions for legitimate edge cases
)

validator.expect_column_values_to_be_in_set(
    column="currency",
    value_set=["CAD", "USD", "GBP", "EUR"]
)

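# Row count should be within 30% of yesterday's count.
# yesterday_row_count is not defined in this snippet; load it beforehand
# (for example from pipeline metadata). The bounds are stored in the suite as fixed numbers.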
validator.expect_table_row_count_to_be_between(
    min_value=int(yesterday_row_count * 0.7),
    max_value=int(yesterday_row_count * 1.3)
)

validator.save_expectation_suite(discard_failed_expectations=False)

The mostly parameter on column value expectations is particularly useful for production data: it allows you to express “99.9% of values must satisfy this condition” rather than requiring 100% compliance, which is more realistic for data ingested from external sources.
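The arithmetic behind `mostly` can be sketched in a few lines of plain Python. This illustrates the semantics only and is not GE's implementation: `passes_with_mostly` is a hypothetical helper, and the handling of nulls and empty columns reflects an assumption about how column value expectations are typically evaluated.

```python
def passes_with_mostly(values, predicate, mostly=1.0):
    """Return (success, observed_fraction) for a column-level check.

    None values are skipped, on the assumption that column value
    expectations are evaluated against non-null values only.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing to evaluate; treated as vacuously passing in this sketch.
        return True, 1.0
    fraction = sum(1 for v in non_null if predicate(v)) / len(non_null)
    return fraction >= mostly, fraction


def in_range(value):
    """Same bounds as the order_value_cad expectation above."""
    return 0.01 <= value <= 500000


# 1 bad value out of 2,000: 99.95% compliant, which clears a 99.9% threshold.
one_bad = [25.0] * 1999 + [-5.0]
ok_one, fraction_one = passes_with_mostly(one_bad, in_range, mostly=0.999)

# 3 bad values out of 2,000: 99.85% compliant, which does not.
three_bad = [25.0] * 1997 + [-5.0] * 3
ok_three, fraction_three = passes_with_mostly(three_bad, in_range, mostly=0.999)
```

The practical consequence is that the tolerance scales with batch size: a `mostly` of 0.999 permits two bad rows in a batch of 2,000 but a thousand in a batch of a million, so choose the threshold with typical batch volumes in mind.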

Integrating Great Expectations into AWS Glue Jobs

The most practical integration point is running GE validation as a step within your Glue ETL job, either before processing (to validate raw input) or after transformation (to validate output quality). Running before processing ensures you don’t waste compute transforming data that will ultimately be rejected.

import sys
import great_expectations as ge
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_path', 'run_date'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read input data
raw_df = spark.read.parquet(args['source_path'])

# Run Great Expectations validation using Spark execution engine
context = ge.get_context()

checkpoint_result = context.run_checkpoint(
    checkpoint_name="orders_raw_checkpoint",
    validations=[
        {
            "batch_request": {
                "datasource_name": "s3_datasource",
                "data_connector_name": "runtime_connector",
                "data_asset_name": "orders_raw",
                "runtime_parameters": {"batch_data": raw_df},
                "batch_identifiers": {"run_date": args['run_date']}
            },
            "expectation_suite_name": "orders.raw.v1"
        }
    ]
)

# Fail the Glue job if validation fails
if not checkpoint_result.success:
    failed_expectations = [
        result.expectation_config.expectation_type
        for result in checkpoint_result.list_validation_results()[0].results
        if not result.success
    ]
    raise Exception(
        f"Data quality validation failed. Failed expectations: {failed_expectations}. "
        f"Check Data Docs at s3://my-ge-docs/data_docs/index.html"
    )

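# Proceed with transformation only if validation passed.
# apply_business_rules is your own DataFrame -> DataFrame function, defined elsewhere.
transformed_df = raw_df.transform(apply_business_rules)
# Write under the data lake bucket; "s3://processed/..." would treat "processed" as the bucket name.
transformed_df.write.format("parquet").mode("append").save("s3://my-data-lake/processed/orders/")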

When the validation fails, the Glue job raises an exception, causing the job to fail with a descriptive error message that includes which expectations were violated. If this Glue job is orchestrated by AWS Step Functions, the failure is caught by the state machine’s error handling logic, which can trigger notifications and halt downstream pipeline steps that depend on the validated data.
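As a rough illustration of that error handling, the sketch below builds an Amazon States Language definition as a Python dict: a Glue task with a `Catch` clause that routes any failure to an SNS notification and then to a `Fail` state. The state names, job name, and topic ARN are hypothetical placeholders, and `RunDownstreamSteps` stands in for whatever follows in your pipeline.

```python
import json


def build_state_machine(glue_job_name: str, alert_topic_arn: str) -> dict:
    """Build a state machine definition that halts the pipeline on Glue job failure."""
    return {
        "Comment": "Validate and transform orders; alert and stop on failure",
        "StartAt": "ValidateAndTransformOrders",
        "States": {
            "ValidateAndTransformOrders": {
                "Type": "Task",
                # .sync makes Step Functions wait for the Glue job to finish
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        # Keep the original input and attach error details under $.error
                        "ResultPath": "$.error",
                        "Next": "NotifyDataQualityFailure",
                    }
                ],
                "Next": "RunDownstreamSteps",
            },
            "NotifyDataQualityFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": alert_topic_arn,
                    "Message.$": "$.error.Cause",
                },
                "Next": "PipelineFailed",
            },
            "PipelineFailed": {"Type": "Fail", "Error": "DataQualityValidationFailed"},
            # Placeholder for the steps that depend on validated data
            "RunDownstreamSteps": {"Type": "Succeed"},
        },
    }


definition = build_state_machine(
    glue_job_name="orders-etl",  # hypothetical job name
    alert_topic_arn="arn:aws:sns:us-east-1:123456789012:data-quality-alerts",  # hypothetical
)
definition_json = json.dumps(definition, indent=2)
```

Because the downstream state is only reachable through the task's `Next`, a failed validation never reaches it; the only path out of the `Catch` ends in `PipelineFailed`.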

Checkpoint-Based Automation for Scheduled Pipelines

GE Checkpoints combine data sources, expectation suites, and action lists into a single executable unit. Actions are post-validation steps that execute conditionally on validation outcome — common actions include updating Data Docs, sending Slack notifications, and writing validation results to a database.

# great_expectations/checkpoints/orders_daily_checkpoint.yml
name: orders_daily_checkpoint
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-orders-daily"
expectation_suite_name: orders.raw.v1
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names:
        - s3_site
  - name: send_slack_notification
    action:
      class_name: SlackNotificationAction
      slack_webhook: "${GE_SLACK_WEBHOOK}"
      notify_on: failure
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
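Two settings in this checkpoint are easier to reason about with a small illustration: `run_name_template`, which uses strftime-style placeholders, and `notify_on`, which gates the Slack action on the validation outcome. The sketch below mimics both in plain Python. It is an approximation of the behaviour rather than GE's own code, and both function names are hypothetical.

```python
from datetime import datetime


def render_run_name(template: str, run_time: datetime) -> str:
    """Expand strftime-style placeholders in a run name template."""
    return run_time.strftime(template)


def should_notify(notify_on: str, validation_success: bool) -> bool:
    """Decide whether a notification fires for a given validation outcome."""
    if notify_on == "all":
        return True
    if notify_on == "failure":
        return not validation_success
    if notify_on == "success":
        return validation_success
    raise ValueError(f"Unsupported notify_on value: {notify_on!r}")


run_name = render_run_name(
    "%Y%m%d-%H%M%S-orders-daily",
    datetime(2024, 4, 15, 6, 30, 0),
)
notify_after_failure = should_notify("failure", validation_success=False)
notify_after_success = should_notify("failure", validation_success=True)
```

With `notify_on: failure`, the channel stays quiet on healthy runs, which keeps the alert meaningful when it does fire.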

Profiling and Automatic Expectation Generation

Rather than writing expectation suites from scratch, GE’s Profiler can analyse a representative sample of your data and generate a baseline expectation suite automatically. The profiler infers value ranges, cardinality, null rates, and distribution statistics from the sample data and creates expectations that capture those characteristics.

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=[
        "expect_column_values_to_be_in_set"  # Too granular for high-cardinality columns
    ],
    ignored_columns=["raw_json_payload"],
    value_set_threshold="few"
)
# build_suite() returns the generated ExpectationSuite
suite = profiler.build_suite()

This is a useful starting point when onboarding a new dataset, but always review auto-generated expectations before using them in production: a profiler run on a limited sample can generate overly specific expectations that fail on legitimate data variations.
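Part of that review can be automated. The sketch below scans a suite in its JSON form and flags two patterns that often indicate over-fitting to the sample: range expectations whose bounds are identical, and value sets larger than a chosen size. It assumes the suite layout of an `expectations` list whose entries carry `expectation_type` and `kwargs`. The helper name, the threshold, and the sample suite are hypothetical.

```python
def flag_overly_specific(suite: dict, max_set_size: int = 20) -> list[str]:
    """Return human-readable warnings for expectations that look over-fitted."""
    warnings = []
    for expectation in suite.get("expectations", []):
        etype = expectation.get("expectation_type", "")
        kwargs = expectation.get("kwargs", {})
        target = kwargs.get("column", "<table>")

        if etype.endswith("_to_be_between"):
            low, high = kwargs.get("min_value"), kwargs.get("max_value")
            if low is not None and low == high:
                warnings.append(f"{etype} on {target}: min_value == max_value ({low})")

        if etype.endswith("_to_be_in_set"):
            size = len(kwargs.get("value_set", []))
            if size > max_set_size:
                warnings.append(f"{etype} on {target}: value_set has {size} entries")
    return warnings


# Hand-written suite in the assumed JSON shape (not real profiler output).
profiled_suite = {
    "expectation_suite_name": "orders.profiled",
    "expectations": [
        {
            "expectation_type": "expect_table_row_count_to_be_between",
            "kwargs": {"min_value": 10432, "max_value": 10432},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "order_value_cad", "min_value": 0.01, "max_value": 48211.5},
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "currency", "value_set": ["CAD", "USD", "GBP", "EUR"]},
        },
    ],
}

review_warnings = flag_overly_specific(profiled_suite)
```

A check like this only catches the mechanical cases. Bounds such as the `order_value_cad` maximum above still need a human to decide whether the sample's observed range is a real business limit.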

Single-point validation tells you whether today’s data passed. Trend analysis tells you whether data quality is improving or degrading over time. GE’s validation results are stored as JSON in S3, making them queryable with Athena for trend analysis.

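A simple Athena query to track daily validation pass rates, assuming a table named ge_validation_results that exposes run_time, expectation_suite_name, and success as top-level columns (the raw result JSON nests the first two under meta, so a view or a flattening step is needed):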

SELECT
    DATE(from_iso8601_timestamp(run_time)) AS validation_date,
    expectation_suite_name,
    SUM(CASE WHEN success = true THEN 1 ELSE 0 END) AS passed,
    COUNT(*) AS total,
    ROUND(100.0 * SUM(CASE WHEN success = true THEN 1 ELSE 0 END) / COUNT(*), 2) AS pass_rate_pct
FROM ge_validation_results
GROUP BY 1, 2
ORDER BY 1 DESC;

Tracking pass rates over time allows you to detect gradual data quality degradation before it becomes a pipeline-breaking event — a pattern consistent with the DataOps practices that separate mature data platforms from fragile ones.
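One way to get the flat columns that the query relies on is to convert each validation result into a single JSON line before cataloguing it for Athena. The sketch below does that for one result dict. It assumes the run time sits under `meta.run_id.run_time` and the suite name under `meta.expectation_suite_name` in the stored JSON, so check a real result file before relying on those paths. `to_athena_row` and the sample values are hypothetical.

```python
import json


def to_athena_row(validation_result: dict) -> str:
    """Flatten one validation result into a JSON line with top-level columns."""
    meta = validation_result.get("meta", {})
    run_id = meta.get("run_id", {})
    row = {
        "run_time": run_id.get("run_time"),
        "expectation_suite_name": meta.get("expectation_suite_name"),
        "success": bool(validation_result.get("success")),
    }
    return json.dumps(row)


# Hand-written sample in the assumed shape (not real pipeline output).
stored_result = {
    "success": True,
    "meta": {
        "expectation_suite_name": "orders.raw.v1",
        "run_id": {
            "run_name": "20240415-063000-orders-daily",
            "run_time": "2024-04-15T06:30:00+00:00",
        },
    },
    "results": [],
}

json_line = to_athena_row(stored_result)
```

Writing one such line per run to a dedicated S3 prefix gives Athena a table whose columns match the query directly, without nested-field access in every SELECT.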

Conclusion

Great Expectations brings software engineering discipline to data quality: declarative contracts, automated execution, version-controlled definitions, and observable reporting. Integrated into AWS Glue jobs with S3-backed storage and orchestration-level failure handling (Step Functions in this walkthrough), it becomes a first-class component of a production data platform rather than an afterthought.

The investment in building expectation suites upfront pays dividends continuously — every pipeline run validates the data contract, every failure is caught at the boundary rather than propagated to consumers, and every validation result contributes to a growing body of data quality history.

If your AWS data platform lacks systematic data quality validation, or if you’re dealing with recurring data quality incidents that damage trust in your analytics, contact Infra IT Consulting to discuss how we design and implement data quality frameworks for production data engineering teams.
