Infra IT Consulting logo Infra ITC
AWS Data Engineering aws-wranglerpandaspython

AWS Data Wrangler: The Pandas-to-S3 Bridge You Need

By Infra IT Consulting · · 8 min read

Every Python data engineer has written the same boilerplate: create an S3 client with boto3, generate a presigned URL or use upload_fileobj, manage buffer handling for Parquet files, parse the Glue Data Catalog to discover schema, handle pagination in Athena query results, and stitch it all together with retry logic. The code works, but it is tedious to write and even more tedious to maintain across a team. AWS Data Wrangler — now officially renamed to the AWS SDK for Pandas (awswrangler) — is the library that eliminates this boilerplate and replaces it with a clean, Pandas-native API for the entire AWS data stack.

This post covers practical usage patterns, performance considerations, and the edge cases that trip up teams adopting awswrangler in production.

What awswrangler Actually Covers

awswrangler is not a Spark library. It runs on single-node Python environments — Lambda functions, Glue Python Shell jobs, EC2 instances, SageMaker notebooks, or local machines. Its value proposition is dramatically simplifying interaction with AWS data services from Python code that processes data at the Pandas scale (gigabytes, not terabytes).

The library provides high-level functions across:

  • Amazon S3: Read and write Parquet, CSV, JSON, ORC, Avro with automatic partitioning, multipart upload, and type inference
  • AWS Glue Data Catalog: Create, update, and query tables with automatic schema inference from DataFrames
  • Amazon Athena: Execute SQL queries and return results as DataFrames with pagination handling and CTAS optimisation
  • Amazon Redshift: Read from and write to Redshift tables with S3 staging handled automatically
  • AWS Glue ETL: Integration points for Glue Python Shell jobs
  • DynamoDB, Neptune, OpenSearch: Basic read/write support for those services

Reading and Writing Parquet to S3

The most fundamental operation is reading and writing Parquet files. awswrangler handles multipart downloads, parallel reads across multiple files in a prefix, and automatic schema merging:

import awswrangler as wr
import pandas as pd

# Write a DataFrame to S3 as partitioned Parquet
df = pd.DataFrame({
    'order_id': ['ord_001', 'ord_002', 'ord_003'],
    'customer_id': ['cust_100', 'cust_101', 'cust_100'],
    'order_date': pd.to_datetime(['2024-04-22', '2024-04-22', '2024-04-21']),
    'total_cad': [129.99, 459.00, 88.50],
    'region': ['ontario', 'alberta', 'ontario']
})

# Write with partitioning — awswrangler creates the partition structure automatically
wr.s3.to_parquet(
    df=df,
    path="s3://my-data-lake/orders/",
    partition_cols=["region"],
    dataset=True,           # Enable dataset mode for partition management
    mode="append",          # Options: overwrite, append, overwrite_partitions
    database="analytics",   # Update the Glue catalog automatically
    table="orders",
    dtype={"total_cad": "double", "order_date": "date"}
)

The dataset=True parameter is critical. Without it, awswrangler writes raw Parquet files. With it, you get automatic partition management, Glue catalog updates, and proper handling of the mode parameter. The overwrite_partitions mode is particularly useful for daily ETL: it overwrites only the partitions present in the new DataFrame rather than the entire table, enabling idempotent daily loads without touching historical data.

Reading is equally straightforward:

# Read an entire dataset
df_all = wr.s3.read_parquet(
    path="s3://my-data-lake/orders/",
    dataset=True
)

# Read with partition filtering — avoids scanning unnecessary partitions
df_ontario = wr.s3.read_parquet(
    path="s3://my-data-lake/orders/",
    dataset=True,
    partition_filter=lambda x: x["region"] == "ontario"
)

# Read only specific columns (projection pushdown)
df_summary = wr.s3.read_parquet(
    path="s3://my-data-lake/orders/",
    columns=["order_id", "total_cad", "region"],
    dataset=True
)

The partition_filter parameter applies client-side partition pruning — awswrangler lists the available partitions and downloads only those that match the filter. This is not the same as Athena’s server-side partition pruning but is effective for reducing S3 API calls and download volume.

Querying Athena and Getting Results as DataFrames

The Athena integration eliminates all the polling, result pagination, and S3 result file downloading that the raw boto3 Athena API requires:

# Simple query — returns a DataFrame directly
df_results = wr.athena.read_sql_query(
    sql="SELECT region, SUM(total_cad) AS revenue FROM orders GROUP BY region",
    database="analytics"
)

# Query with parameterisation (uses Athena prepared statements under the hood)
start_date = "2024-04-01"
end_date = "2024-04-22"
df_period = wr.athena.read_sql_query(
    sql=f"""
        SELECT order_id, customer_id, total_cad
        FROM orders
        WHERE order_date BETWEEN DATE '{start_date}' AND DATE '{end_date}'
        AND region = 'ontario'
    """,
    database="analytics",
    ctas_approach=True  # Use CREATE TABLE AS SELECT for large result sets
)

The ctas_approach=True parameter is important for queries returning more than 1 GB of results. Without it, awswrangler uses the standard Athena API, which writes results to S3 as a single CSV file and then downloads it — slow and memory-intensive for large results. With CTAS enabled, Athena writes results as Parquet files in parallel and awswrangler reads them with parallel S3 downloads, significantly faster for large result sets. The trade-off is that CTAS creates a temporary table in Glue that awswrangler cleans up automatically.

Managing the Glue Data Catalog Programmatically

awswrangler provides high-level functions for Glue catalog management that are far simpler than the raw boto3 Glue client:

# Create or update a table from a DataFrame
wr.catalog.create_parquet_table(
    database="analytics",
    table="customer_segments",
    path="s3://my-data-lake/customer_segments/",
    columns_types={
        "customer_id": "string",
        "segment": "string",
        "ltv_score": "double",
        "last_updated": "timestamp"
    },
    partitions_types={"year": "string", "month": "string"},
    description="Customer LTV segments updated daily by ML pipeline"
)

# Add new partitions to an existing table
wr.catalog.add_parquet_partitions(
    database="analytics",
    table="orders",
    partitions_values={
        "s3://my-data-lake/orders/region=ontario/": ["ontario"],
        "s3://my-data-lake/orders/region=alberta/": ["alberta"]
    }
)

# Check if a table exists
table_exists = wr.catalog.does_table_exist(
    database="analytics",
    table="orders"
)

# Get table schema as a dict
schema = wr.catalog.get_table_types(database="analytics", table="orders")

The AWS Glue Data Catalog guide covers catalog governance in depth, but awswrangler’s catalog functions handle the most common programmatic catalog management tasks without requiring Glue Crawlers for tables whose schema you control.

Redshift Integration with S3 Staging

Writing DataFrames to Redshift via awswrangler uses S3 as a staging area and Redshift’s COPY command under the hood, which is the highest-throughput path into Redshift:

# Write to Redshift using S3 staging
wr.redshift.to_sql(
    df=df,
    table="daily_orders",
    schema="analytics",
    con=wr.redshift.connect(secret_id="prod/redshift/credentials"),
    mode="upsert",
    primary_keys=["order_id"],
    temp_s3_path="s3://my-staging-bucket/redshift-temp/"
)

# Read from Redshift into a DataFrame
df_redshift = wr.redshift.read_sql_query(
    sql="SELECT * FROM analytics.customer_summary WHERE active = TRUE",
    con=wr.redshift.connect(secret_id="prod/redshift/credentials"),
    temp_s3_path="s3://my-staging-bucket/redshift-unload/"
)

The mode="upsert" option performs a staging-table merge pattern: write the new data to a temporary Redshift table, then merge it into the target table using the specified primary keys. This is the correct pattern for updating Redshift facts tables with incoming CDC data.

Performance Limits and When to Use Spark Instead

awswrangler is a single-node library. It is appropriate for processing datasets that fit in the memory of a Lambda function (up to 10 GB), a Glue Python Shell job (1 DPU = 16 GB), or an appropriately sized EC2 instance. For datasets exceeding a few hundred gigabytes, or for workloads that require distributed shuffle operations (joins, aggregations over billions of rows), you need Spark running on Glue ETL or EMR.

Common patterns for using both together:

  • Use awswrangler in Lambda for lightweight ETL and file metadata operations that don’t justify a Glue job startup time (typically 30-60 seconds)
  • Use awswrangler in Glue Python Shell for catalog management, Athena query orchestration, and small-to-medium DataFrame operations
  • Trigger Glue Spark ETL jobs from Lambda or Step Functions for heavy transformations
  • Use awswrangler to write Spark job outputs to Glue catalog after a Spark job writes Parquet to S3

The library also supports chunked reading, which allows you to process large S3 datasets in batches without loading everything into memory:

# Process a large dataset in chunks of 100,000 rows
for chunk_df in wr.s3.read_parquet(
    path="s3://my-data-lake/large-table/",
    chunked=100_000,
    dataset=True
):
    processed = apply_transformation(chunk_df)
    wr.s3.to_parquet(processed, path="s3://output/", dataset=True, mode="append")

This pattern enables processing of datasets significantly larger than available RAM, though without the parallelism of Spark. For the lightweight ETL patterns that fit in Lambda or a small Python Shell job, chunked reading is an effective tool.

Practical Configuration: Boto3 Session and Retry Settings

In production, always pass an explicit boto3 session to control credentials and region, and configure the retry behaviour for S3 operations:

import boto3
import awswrangler as wr

# Use a specific IAM role or credentials profile
session = boto3.Session(
    region_name="ca-central-1",
    # profile_name="data-pipeline-role"  # For local development
)

# Global configuration for the session
wr.config.boto3_session = session
wr.config.max_cache_seconds = 900      # Cache Athena query results for 15 minutes
wr.config.max_cache_query_inspections = 500

# Or pass session per-call
df = wr.s3.read_parquet(
    path="s3://my-data-lake/orders/",
    boto3_session=session,
    dataset=True
)

For Lambda functions, initialise the boto3 session and awswrangler configuration in the module-level code (outside the handler function) to benefit from container reuse between invocations. This avoids re-authenticating with AWS on every invocation, which can add 100–300 ms of latency.

Conclusion

awswrangler fills a genuine gap in the AWS Python ecosystem: a high-level, Pandas-native library that makes interacting with S3, Athena, Glue, and Redshift feel as natural as working with local files. It eliminates thousands of lines of boto3 boilerplate across a typical data engineering codebase and replaces them with expressive, readable one-liners that do the right thing by default.

The library is not a replacement for Spark on large-scale distributed workloads, but for the large category of data engineering tasks that involve moderate-sized datasets, catalog management, and Athena query orchestration, it is one of the most productive tools available in the AWS Python ecosystem.

If your Python data engineering codebase is drowning in boto3 boilerplate or you want to build a lightweight ETL layer on AWS without the overhead of Spark clusters, contact Infra IT Consulting to discuss the right architecture for your workload.

Related posts