AWS Data Engineering redshiftspectrumsql

Optimising Amazon Redshift Spectrum for Federated Queries

By Infra IT Consulting · March 4, 2024 · 9 min read

Content on this site is AI-assisted and personally reviewed by Hazem. Learn more

Amazon Redshift Spectrum extends the reach of your Redshift cluster to query data stored in Amazon S3 without loading it into Redshift storage. For organisations running a data lake on Amazon S3 alongside a Redshift data warehouse, Spectrum makes it possible to join hot, frequently queried data in Redshift with cold historical data in S3 — in a single SQL query, without ETL pipelines, without data movement, and without maintaining separate query tools for each tier.

But Spectrum queries can be slow and expensive if not configured correctly. This guide covers the optimisation techniques that make the difference between a Spectrum query that returns in 10 seconds and one that times out or generates a surprise cost spike.

How Spectrum Works Under the Hood

Understanding Spectrum’s execution model is essential for optimising it. When Redshift’s query planner encounters a reference to an external table (one backed by S3 via the Glue Data Catalog), it routes the relevant portion of the query to the Spectrum layer — a fleet of managed, parallelised scan nodes that operate independently of your Redshift cluster.

The Spectrum layer handles:

Reading S3 files (Parquet, ORC, CSV, JSON, Avro)
Applying partition filters to skip irrelevant S3 prefixes
Executing predicate pushdown — filtering rows within Parquet files using column statistics
Projecting only the columns referenced in the query
Returning pre-aggregated or pre-filtered results to Redshift for final joins and aggregations

Redshift then merges the Spectrum results with any native Redshift table data and applies the remaining query operations (final aggregations, ORDER BY, LIMIT).

This separation means Spectrum costs are charged separately — $5 per terabyte of data scanned from S3, identical to Amazon Athena’s pricing. Like Athena, Spectrum’s scan cost is dramatically reduced by good partitioning and columnar Parquet storage.

Setting Up External Schemas and Tables

Spectrum requires an external schema in Redshift that maps to a Glue Data Catalog database. Creating the external schema:

CREATE EXTERNAL SCHEMA glue_cleansed
FROM DATA CATALOG
DATABASE 'cleansed'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

The IAM role must have permissions to:

glue:GetDatabase, glue:GetTable, glue:GetPartitions — to read catalog metadata
s3:GetObject, s3:ListBucket — to read S3 files
If using Lake Formation: lakeformation:GetDataAccess — to validate column-level grants

Once the external schema is created, all Glue tables in the cleansed database are immediately queryable in Redshift under the glue_cleansed schema name:

SELECT * FROM glue_cleansed.web_sessions LIMIT 10;

External tables registered in the Glue catalog automatically appear here — including tables created by Glue crawlers, Athena DDL statements, or your ETL jobs’ explicit catalog API calls. This shared metadata layer is one of the most valuable aspects of building on the AWS Glue Data Catalog.

Partition Pruning in Spectrum

Partition pruning works in Spectrum the same way it works in Athena — Spectrum reads partition metadata from the Glue catalog and skips S3 prefixes that do not match the query’s partition filter predicates. The performance impact is identical: a well-filtered query on a 10 TB partitioned table might scan 50 GB.

Crucially, the partition filter must be expressed directly against the partition columns. Do not apply functions to partition columns in WHERE clauses:

-- WRONG: Function on partition column disables pruning
WHERE DATE_TRUNC('month', event_date) = '2024-01-01'

-- CORRECT: Direct predicate on partition column enables pruning  
WHERE year = '2024' AND month = '01'

For Spectrum queries that join a native Redshift table with an external S3 table, Redshift can sometimes push join predicates down into Spectrum — reducing the data Spectrum returns to Redshift before the join. This is called predicate pushdown across the Spectrum boundary. To enable it, the join condition must be on a partition column or a column with statistics that allow Spectrum to filter effectively.

Confirm partition pruning is working by examining the query plan with EXPLAIN:

EXPLAIN
SELECT s.session_id, u.email, s.page_views
FROM glue_cleansed.web_sessions s
JOIN users u ON s.user_id = u.user_id
WHERE s.year = '2024' AND s.month = '02';

Look for S3 Seq Scan nodes in the plan with partitions pruned=X indicating how many partitions were eliminated. If partitions pruned is zero when you expect pruning, the partition filter expression is not being recognised.

Spectrum Table Statistics

Redshift’s query optimiser uses table statistics to choose efficient join strategies and estimate result set sizes. For native Redshift tables, you run ANALYZE to gather statistics. For Spectrum external tables, you must set statistics manually using ALTER TABLE ... TABLE PROPERTIES:

ALTER TABLE glue_cleansed.web_sessions
TABLE PROPERTIES ('numRows'='5000000000');

Without accurate row count statistics, Redshift underestimates the cost of joining a large Spectrum table and may choose a suboptimal join strategy — for example, attempting a broadcast join on a table with 5 billion rows, which will fail or perform extremely poorly.

Update these statistics when the table grows significantly. A good practice is to update numRows as part of your ETL pipeline after each batch load, using an AWS Glue script or Lambda function that counts rows and calls the Redshift API via the Data API.

Query Design Patterns for Spectrum

Pattern 1: Historical data federated to Spectrum, recent data in Redshift.

The most common Spectrum use case. Keep the last 6-12 months of data in Redshift (fast, cached, highly concurrent), and push older data to S3-backed Spectrum tables (cheap, compressed, less frequently queried):

-- Unified view spanning both tiers
CREATE VIEW all_orders AS
  SELECT order_id, customer_id, order_date, total_amount
  FROM native_orders  -- Redshift table: last 12 months
  UNION ALL
  SELECT order_id, customer_id, order_date, total_amount
  FROM glue_curated.historical_orders  -- Spectrum: > 12 months old
;

Queries against this view that filter by order_date within the last 12 months will be served entirely from Redshift (no Spectrum cost, sub-second latency). Queries spanning historical periods will read from both tiers. Business users and BI tools see a single seamless table.

Pattern 2: Aggregation pushdown.

Spectrum performs aggregations before returning data to Redshift. Push as much aggregation work as possible into the Spectrum portion of the query by pre-aggregating in a subquery:

-- Suboptimal: Redshift receives all rows from Spectrum, then aggregates
SELECT customer_id, SUM(amount)
FROM glue_cleansed.transactions
WHERE year = '2024'
GROUP BY customer_id;

-- Better: Same SQL but Spectrum can push the GROUP BY if the planner cooperates
-- Verify with EXPLAIN that the S3 node shows aggregation in its description

Spectrum’s ability to push GROUP BY into the scan layer depends on the query structure and whether Spectrum determines the pushdown is beneficial. Review the EXPLAIN plan to confirm.

Pattern 3: Avoiding small file scans.

If your S3 data contains many small files (< 10 MB each), Spectrum incurs high per-file overhead — each file requires an S3 GET request, and many small requests limit parallelism. Compact small files before querying with Spectrum. A Glue ETL job that reads and re-writes a recent partition with target file sizes of 256-512 MB improves Spectrum query performance on that partition dramatically.

See our detailed guide on S3 data partitioning strategies for file compaction approaches that directly benefit Spectrum performance.

Concurrency and Cost Management

Spectrum queries consume Spectrum nodes from a shared pool managed by AWS. Your Redshift cluster’s node count and type influence how many Spectrum requests can be dispatched in parallel — more nodes means more parallel S3 reads. Spectrum concurrency is not a configurable parameter; it scales with your cluster.

For cost management, monitor the SpectrumScanSizeBytes CloudWatch metric per query using Redshift’s STL_SCAN system table:

-- Identify expensive Spectrum scans in the last 24 hours
SELECT query, perm_table_name, SUM(rows) as rows_scanned,
       SUM(bytes) / 1024 / 1024 / 1024 as gb_scanned
FROM SVL_S3QUERY_SUMMARY
WHERE starttime > DATEADD(hour, -24, GETDATE())
GROUP BY query, perm_table_name
ORDER BY gb_scanned DESC
LIMIT 20;

Queries with high gb_scanned but low rows_scanned are scanning many bytes to return few rows — a sign of poor partitioning or missing Parquet column statistics. Queries with high gb_scanned and high rows_scanned are expected for large aggregate queries; the focus there should be on whether the aggregation result set is reasonable for the business question.

Set Redshift query monitoring rules (QMR) to alert or cancel Spectrum queries that scan more than a threshold — for example, cancelling any query that scans more than 5 TB from Spectrum to prevent runaway costs from accidental full-table scans.

Combining Spectrum with Redshift Federated Query

Redshift also supports Federated Query — direct, live queries against Amazon RDS PostgreSQL, Amazon Aurora PostgreSQL, and Amazon Aurora MySQL. Unlike Spectrum (which queries S3), Federated Query reaches into the live operational database.

A powerful architecture combines Spectrum (for S3 data lake) and Federated Query (for live operational data) in a single Redshift query:

-- Join live CRM data (federated from Aurora), 
-- historical orders (Spectrum from S3),
-- and Redshift's own enriched customer table
SELECT c.customer_name,
       SUM(h.order_total) as lifetime_value,
       MAX(l.last_login_date) as last_activity
FROM aurora_schema.customers c           -- Federated Query (live Aurora)
JOIN glue_curated.order_history h        -- Spectrum (S3 data lake)
  ON c.customer_id = h.customer_id
JOIN redshift_customers rc               -- Native Redshift table
  ON c.customer_id = rc.customer_id
LEFT JOIN aurora_schema.logins l         -- Federated Query
  ON c.customer_id = l.customer_id
WHERE h.year >= '2022'
GROUP BY c.customer_name;

This pattern — the lakehouse architecture on AWS — collapses the distinction between operational databases, data lakes, and data warehouses. Data lives in the right system for its access patterns, and Redshift’s query engine federates across all of them.

Conclusion

Amazon Redshift Spectrum is a powerful extension to a Redshift data warehouse — enabling cost-effective federated queries over S3 data lake content without data movement. The performance and cost outcomes depend almost entirely on three factors: partition pruning (getting the WHERE clause right), accurate external table statistics, and well-sized Parquet files in the S3 data lake. Teams that invest in these fundamentals get Spectrum queries that are fast, predictable, and genuinely cheap. Teams that skip them spend their query budget on data that was never needed. Ready to build or optimise your AWS data infrastructure? Contact the Infra IT Consulting team for a free consultation.

AWS Data Engineering

Talk to our team →

Optimising Amazon Redshift Spectrum for Federated Queries

How Spectrum Works Under the Hood

Setting Up External Schemas and Tables

Partition Pruning in Spectrum

Spectrum Table Statistics

Query Design Patterns for Spectrum

Concurrency and Cost Management

Combining Spectrum with Redshift Federated Query

Conclusion

Related posts

AWS Lake Formation Best Practices for Data Governance

S3 Data Partitioning Strategies That Cut Athena Query Costs

AWS Glue vs. Apache Spark: Which ETL Tool Is Right for Your Pipeline?

How Spectrum Works Under the Hood

Setting Up External Schemas and Tables

Partition Pruning in Spectrum

Spectrum Table Statistics

Query Design Patterns for Spectrum

Concurrency and Cost Management

Combining Spectrum with Redshift Federated Query

Conclusion

Related posts

AWS Lake Formation Best Practices for Data Governance

S3 Data Partitioning Strategies That Cut Athena Query Costs

AWS Glue vs. Apache Spark: Which ETL Tool Is Right for Your Pipeline?

We value your privacy