Optimising Amazon Redshift Spectrum for Federated Queries
Amazon Redshift Spectrum extends the reach of your Redshift cluster to query data stored in Amazon S3 without loading it into Redshift storage. For organisations running a data lake on Amazon S3 alongside a Redshift data warehouse, Spectrum makes it possible to join hot, frequently queried data in Redshift with cold historical data in S3 â in a single SQL query, without ETL pipelines, without data movement, and without maintaining separate query tools for each tier.
But Spectrum queries can be slow and expensive if not configured correctly. This guide covers the optimisation techniques that make the difference between a Spectrum query that returns in 10 seconds and one that times out or generates a surprise cost spike.
How Spectrum Works Under the Hood
Understanding Spectrumâs execution model is essential for optimising it. When Redshiftâs query planner encounters a reference to an external table (one backed by S3 via the Glue Data Catalog), it routes the relevant portion of the query to the Spectrum layer â a fleet of managed, parallelised scan nodes that operate independently of your Redshift cluster.
The Spectrum layer handles:
- Reading S3 files (Parquet, ORC, CSV, JSON, Avro)
- Applying partition filters to skip irrelevant S3 prefixes
- Executing predicate pushdown â filtering rows within Parquet files using column statistics
- Projecting only the columns referenced in the query
- Returning pre-aggregated or pre-filtered results to Redshift for final joins and aggregations
Redshift then merges the Spectrum results with any native Redshift table data and applies the remaining query operations (final aggregations, ORDER BY, LIMIT).
This separation means Spectrum costs are charged separately â $5 per terabyte of data scanned from S3, identical to Amazon Athenaâs pricing. Like Athena, Spectrumâs scan cost is dramatically reduced by good partitioning and columnar Parquet storage.
Setting Up External Schemas and Tables
Spectrum requires an external schema in Redshift that maps to a Glue Data Catalog database. Creating the external schema:
CREATE EXTERNAL SCHEMA glue_cleansed
FROM DATA CATALOG
DATABASE 'cleansed'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
The IAM role must have permissions to:
glue:GetDatabase,glue:GetTable,glue:GetPartitionsâ to read catalog metadatas3:GetObject,s3:ListBucketâ to read S3 files- If using Lake Formation:
lakeformation:GetDataAccessâ to validate column-level grants
Once the external schema is created, all Glue tables in the cleansed database are immediately queryable in Redshift under the glue_cleansed schema name:
SELECT * FROM glue_cleansed.web_sessions LIMIT 10;
External tables registered in the Glue catalog automatically appear here â including tables created by Glue crawlers, Athena DDL statements, or your ETL jobsâ explicit catalog API calls. This shared metadata layer is one of the most valuable aspects of building on the AWS Glue Data Catalog.
Partition Pruning in Spectrum
Partition pruning works in Spectrum the same way it works in Athena â Spectrum reads partition metadata from the Glue catalog and skips S3 prefixes that do not match the queryâs partition filter predicates. The performance impact is identical: a well-filtered query on a 10 TB partitioned table might scan 50 GB.
Crucially, the partition filter must be expressed directly against the partition columns. Do not apply functions to partition columns in WHERE clauses:
-- WRONG: Function on partition column disables pruning
WHERE DATE_TRUNC('month', event_date) = '2024-01-01'
-- CORRECT: Direct predicate on partition column enables pruning
WHERE year = '2024' AND month = '01'
For Spectrum queries that join a native Redshift table with an external S3 table, Redshift can sometimes push join predicates down into Spectrum â reducing the data Spectrum returns to Redshift before the join. This is called predicate pushdown across the Spectrum boundary. To enable it, the join condition must be on a partition column or a column with statistics that allow Spectrum to filter effectively.
Confirm partition pruning is working by examining the query plan with EXPLAIN:
EXPLAIN
SELECT s.session_id, u.email, s.page_views
FROM glue_cleansed.web_sessions s
JOIN users u ON s.user_id = u.user_id
WHERE s.year = '2024' AND s.month = '02';
Look for S3 Seq Scan nodes in the plan with partitions pruned=X indicating how many partitions were eliminated. If partitions pruned is zero when you expect pruning, the partition filter expression is not being recognised.
Spectrum Table Statistics
Redshiftâs query optimiser uses table statistics to choose efficient join strategies and estimate result set sizes. For native Redshift tables, you run ANALYZE to gather statistics. For Spectrum external tables, you must set statistics manually using ALTER TABLE ... TABLE PROPERTIES:
ALTER TABLE glue_cleansed.web_sessions
TABLE PROPERTIES ('numRows'='5000000000');
Without accurate row count statistics, Redshift underestimates the cost of joining a large Spectrum table and may choose a suboptimal join strategy â for example, attempting a broadcast join on a table with 5 billion rows, which will fail or perform extremely poorly.
Update these statistics when the table grows significantly. A good practice is to update numRows as part of your ETL pipeline after each batch load, using an AWS Glue script or Lambda function that counts rows and calls the Redshift API via the Data API.
Query Design Patterns for Spectrum
Pattern 1: Historical data federated to Spectrum, recent data in Redshift.
The most common Spectrum use case. Keep the last 6-12 months of data in Redshift (fast, cached, highly concurrent), and push older data to S3-backed Spectrum tables (cheap, compressed, less frequently queried):
-- Unified view spanning both tiers
CREATE VIEW all_orders AS
SELECT order_id, customer_id, order_date, total_amount
FROM native_orders -- Redshift table: last 12 months
UNION ALL
SELECT order_id, customer_id, order_date, total_amount
FROM glue_curated.historical_orders -- Spectrum: > 12 months old
;
Queries against this view that filter by order_date within the last 12 months will be served entirely from Redshift (no Spectrum cost, sub-second latency). Queries spanning historical periods will read from both tiers. Business users and BI tools see a single seamless table.
Pattern 2: Aggregation pushdown.
Spectrum performs aggregations before returning data to Redshift. Push as much aggregation work as possible into the Spectrum portion of the query by pre-aggregating in a subquery:
-- Suboptimal: Redshift receives all rows from Spectrum, then aggregates
SELECT customer_id, SUM(amount)
FROM glue_cleansed.transactions
WHERE year = '2024'
GROUP BY customer_id;
-- Better: Same SQL but Spectrum can push the GROUP BY if the planner cooperates
-- Verify with EXPLAIN that the S3 node shows aggregation in its description
Spectrumâs ability to push GROUP BY into the scan layer depends on the query structure and whether Spectrum determines the pushdown is beneficial. Review the EXPLAIN plan to confirm.
Pattern 3: Avoiding small file scans.
If your S3 data contains many small files (< 10 MB each), Spectrum incurs high per-file overhead â each file requires an S3 GET request, and many small requests limit parallelism. Compact small files before querying with Spectrum. A Glue ETL job that reads and re-writes a recent partition with target file sizes of 256-512 MB improves Spectrum query performance on that partition dramatically.
See our detailed guide on S3 data partitioning strategies for file compaction approaches that directly benefit Spectrum performance.
Concurrency and Cost Management
Spectrum queries consume Spectrum nodes from a shared pool managed by AWS. Your Redshift clusterâs node count and type influence how many Spectrum requests can be dispatched in parallel â more nodes means more parallel S3 reads. Spectrum concurrency is not a configurable parameter; it scales with your cluster.
For cost management, monitor the SpectrumScanSizeBytes CloudWatch metric per query using Redshiftâs STL_SCAN system table:
-- Identify expensive Spectrum scans in the last 24 hours
SELECT query, perm_table_name, SUM(rows) as rows_scanned,
SUM(bytes) / 1024 / 1024 / 1024 as gb_scanned
FROM SVL_S3QUERY_SUMMARY
WHERE starttime > DATEADD(hour, -24, GETDATE())
GROUP BY query, perm_table_name
ORDER BY gb_scanned DESC
LIMIT 20;
Queries with high gb_scanned but low rows_scanned are scanning many bytes to return few rows â a sign of poor partitioning or missing Parquet column statistics. Queries with high gb_scanned and high rows_scanned are expected for large aggregate queries; the focus there should be on whether the aggregation result set is reasonable for the business question.
Set Redshift query monitoring rules (QMR) to alert or cancel Spectrum queries that scan more than a threshold â for example, cancelling any query that scans more than 5 TB from Spectrum to prevent runaway costs from accidental full-table scans.
Combining Spectrum with Redshift Federated Query
Redshift also supports Federated Query â direct, live queries against Amazon RDS PostgreSQL, Amazon Aurora PostgreSQL, and Amazon Aurora MySQL. Unlike Spectrum (which queries S3), Federated Query reaches into the live operational database.
A powerful architecture combines Spectrum (for S3 data lake) and Federated Query (for live operational data) in a single Redshift query:
-- Join live CRM data (federated from Aurora),
-- historical orders (Spectrum from S3),
-- and Redshift's own enriched customer table
SELECT c.customer_name,
SUM(h.order_total) as lifetime_value,
MAX(l.last_login_date) as last_activity
FROM aurora_schema.customers c -- Federated Query (live Aurora)
JOIN glue_curated.order_history h -- Spectrum (S3 data lake)
ON c.customer_id = h.customer_id
JOIN redshift_customers rc -- Native Redshift table
ON c.customer_id = rc.customer_id
LEFT JOIN aurora_schema.logins l -- Federated Query
ON c.customer_id = l.customer_id
WHERE h.year >= '2022'
GROUP BY c.customer_name;
This pattern â the lakehouse architecture on AWS â collapses the distinction between operational databases, data lakes, and data warehouses. Data lives in the right system for its access patterns, and Redshiftâs query engine federates across all of them.
Conclusion
Amazon Redshift Spectrum is a powerful extension to a Redshift data warehouse â enabling cost-effective federated queries over S3 data lake content without data movement. The performance and cost outcomes depend almost entirely on three factors: partition pruning (getting the WHERE clause right), accurate external table statistics, and well-sized Parquet files in the S3 data lake. Teams that invest in these fundamentals get Spectrum queries that are fast, predictable, and genuinely cheap. Teams that skip them spend their query budget on data that was never needed. Ready to build or optimise your AWS data infrastructure? Contact the Infra IT Consulting team for a free consultation.
Related posts
Book a free 30-minute consultation to discuss your data engineering and analytics needs.
Talk to our team â