Parquet vs. ORC on AWS: Choosing the Right Columnar Format

By Infra IT Consulting · May 20, 2024 · 8 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

When you are storing hundreds of terabytes in Amazon S3 and querying it with Amazon Athena or processing it through AWS Glue, the file format you choose is not a minor implementation detail — it is an architectural decision that compounds over time. Parquet and ORC are the two dominant columnar formats in the AWS ecosystem, and each has genuine strengths that make it the better choice in specific contexts. This guide cuts through the marketing noise and gives data engineers and CTOs the information needed to make the right call for their workload.

What Makes Columnar Formats Different

Row-oriented formats like CSV and JSON write all fields for a single record contiguously on disk. When an analytical query reads only three columns out of fifty, it must still scan every byte of every row. Columnar formats flip this layout: all values for a single column are stored together, compressed as a unit, and skipped entirely if the query does not reference them.
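The difference between the two layouts can be sketched with a toy in-memory model. This is an illustration only, not how Parquet or ORC encode bytes; the sample records and function names are made up for the example.

```python
# Toy model of row-oriented versus column-oriented storage (illustrative only).
rows = [
    {"order_id": 1, "region": "CA", "amount": 40.0},
    {"order_id": 2, "region": "US", "amount": 15.5},
    {"order_id": 3, "region": "CA", "amount": 22.5},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


def values_touched_row_layout(records):
    """A row-oriented scan reads every field of every record."""
    return sum(len(record) for record in records)


def values_touched_columnar(columns, wanted):
    """A columnar scan reads only the columns the query references."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columnar(rows)
total_amount = sum(columns["amount"])  # needs only the "amount" column
```

Summing one column touches three values in the columnar layout and nine in the row layout. The gap widens with every extra column the query does not need.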

Both Parquet and ORC take this approach, but they differ in their metadata structures, compression defaults, and ecosystem integrations. Understanding those differences requires a look at how each format is physically laid out.

Parquet stores data in row groups (default 128 MB each), and within each row group, data is stored column by column. Each column chunk carries min/max statistics, null counts, and optional bloom filters. The format is self-describing via a Thrift-encoded schema footer.

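ORC stores data in stripes (64 MB by default in current ORC writers; older Hive releases defaulted to 256 MB). Each stripe begins with index data holding min/max statistics for every group of 10,000 rows, so readers can skip row groups inside a stripe. ORC is also the file format underlying Hive’s ACID transactional tables, which is one reason it was the preferred format in the Hive ecosystem for years.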

Performance Characteristics on AWS Services

Amazon Athena

Athena charges $5 per terabyte of data scanned. Columnar formats dramatically reduce scanned bytes through two mechanisms: column pruning (only referenced columns are read) and predicate pushdown (row groups or stripes whose statistics prove they contain no matching rows are skipped entirely).
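Predicate pushdown can be sketched as a range-overlap test against per-chunk statistics. This is a simplified model under stated assumptions: the ChunkStats class, the integer date encoding, and the sample values are hypothetical, and real readers consult further metadata such as null counts and bloom filters.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Min/max statistics for one column in one row group or stripe."""
    min_value: int
    max_value: int
    num_rows: int


def may_contain(stats, lower, upper):
    """True if [lower, upper] overlaps the chunk's [min, max] range."""
    return stats.max_value >= lower and stats.min_value <= upper


def plan_scan(chunks, lower, upper):
    """Indexes of the chunks that must be read for lower <= col <= upper."""
    return [i for i, stats in enumerate(chunks) if may_contain(stats, lower, upper)]


# Hypothetical statistics for an order_date column stored as yyyymmdd integers.
chunks = [
    ChunkStats(20240101, 20240131, 1_000_000),
    ChunkStats(20240201, 20240229, 1_000_000),
    ChunkStats(20240301, 20240331, 1_000_000),
]
to_read = plan_scan(chunks, 20240210, 20240215)  # only the February chunk overlaps
```

The statistics can only prove that a chunk contains no matches, never that it does. Skipping therefore works best when data is sorted or clustered on the filtered column, so that each chunk covers a narrow range.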

In practice, converting a 1 TB CSV dataset to Parquet with Snappy compression typically reduces it to 150–250 GB on disk. For a query touching four columns, Athena might scan only 20–40 GB, which cuts the cost of that query by roughly 96–98% compared with scanning the raw CSV.
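The arithmetic behind these figures is easy to check. The sketch below assumes the $5 per terabyte list price quoted above and decimal units (1 TB = 1,000 GB) for simplicity. Confirm the current price and billing units for your region before relying on the numbers.

```python
PRICE_PER_TB_USD = 5.0  # list price quoted in this article; verify for your region
GB_PER_TB = 1000        # decimal units, an assumption made to keep the maths simple


def scan_cost_usd(scanned_gb, price_per_tb=PRICE_PER_TB_USD):
    """Cost of a single query that scans `scanned_gb` gigabytes."""
    return scanned_gb / GB_PER_TB * price_per_tb


def savings_pct(baseline_gb, scanned_gb):
    """Percentage reduction in scanned data relative to the baseline."""
    return (1 - scanned_gb / baseline_gb) * 100


csv_cost = scan_cost_usd(1000)        # full scan of the 1 TB CSV
parquet_best = scan_cost_usd(20)      # four-column query, optimistic case
parquet_worst = scan_cost_usd(40)     # four-column query, pessimistic case
best_saving = savings_pct(1000, 20)
worst_saving = savings_pct(1000, 40)
```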

Both formats deliver similar scan reduction for simple equality filters. ORC’s row index (statistics for every 10,000 rows) gives it a marginal edge on high-cardinality range scans over very large datasets. Parquet’s bloom filters (written by parquet-mr 1.12 and later, which ships with Spark 3.2 and later and therefore with AWS Glue 4.0 and later) narrow this gap for equality lookups on high-cardinality columns, provided both the writer and the query engine support them.

Verdict for Athena: Parquet is the default choice. The AWS Athena documentation recommends columnar formats such as Parquet and ORC for exactly these cost reasons, and Parquet has the broader tooling support. ORC is a valid alternative if your team already has ORC tooling in place.

AWS Glue

AWS Glue’s Data Catalog treats both formats as first-class citizens. The Glue ETL runtime (based on Apache Spark) reads and writes both formats natively. However, there are practical differences:

Glue Data Quality rules operate on DynamicFrames; converting between DynamicFrame and Spark DataFrame adds overhead. Both formats incur this equally.
Glue Crawlers infer schema from both formats, since both embed the schema in the file footer, but merging schemas across multiple files tends to be more reliable with Parquet. ORC files written by older Hive versions can record placeholder column names (_col0, _col1, ...) instead of the real ones, so ORC schema inference occasionally requires manual intervention when files were produced by different writers with schema evolution.
Glue Studio visual jobs default to writing Parquet. Changing to ORC requires a custom script or a code node.

For AWS Glue, Parquet is the path of least resistance.

Amazon EMR

EMR gives you the full Apache Spark and Hive ecosystem. Here the picture is more nuanced. Hive historically preferred ORC because its ACID support enables row-level updates and deletes — essential for slowly changing dimensions without a full partition rewrite. Spark, by contrast, has richer native Parquet support with vectorised reads and the Parquet-native page index.

If you are running Spark jobs on EMR (the majority of modern EMR workloads), Parquet with Snappy or ZSTD compression tends to outperform ORC, although the margin depends on the Spark version and the shape of your queries, so benchmark your own workload. If you are running Hive workloads or need native ACID semantics without a table format layer like Delta Lake or Iceberg, ORC remains compelling.

See our post on EMR Serverless vs. EMR on EC2 for guidance on choosing the right EMR deployment model before locking in your format decision.

Compression Codec Comparison

Both formats support multiple compression codecs. The choice of codec matters as much as the format itself:

Codec	Compression Ratio	CPU Cost	Splittable as a Raw File	Best For
Snappy	Moderate	Low	No	Default Spark/Glue workloads
ZSTD	High	Moderate	No	Storage-cost-sensitive pipelines
GZIP (ZLIB in ORC)	High	High	No	Cold archival, infrequent reads
LZO	Moderate	Low	Only with an index	Legacy Hadoop streaming
Uncompressed	None	None	Yes	Debugging only

For most AWS workloads, Snappy is the safe default — it decompresses fast enough that EMR and Glue workers rarely become CPU-bound. ZSTD at level 3 offers 20–30% better compression than Snappy with only a modest CPU penalty, making it worth evaluating for large datasets where S3 storage costs are significant.

Note that the Splittable column describes the codec applied to a raw file, such as a GZIP-compressed CSV, which must be decompressed from the start. Inside Parquet and ORC the codec is applied to each column chunk within a row group or stripe, not to the file as a whole, so Spark and Glue can parallelise reads at row group or stripe boundaries without decompressing the entire file, whichever codec you choose. Aim for file sizes of 128 MB to 1 GB so that each Spark task gets meaningful work without excessive small-file overhead.
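A rough way to turn the file-size guidance into a number is to divide the dataset size by a target file size. The sketch below is a back-of-the-envelope helper, not part of any AWS or Spark API; the 512 MB target is an assumed midpoint of the range above and the 128 MB row group matches the Parquet default mentioned earlier.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_output_files(dataset_bytes, target_file_bytes=512 * MIB, row_group_bytes=128 * MIB):
    """Return (number of output files, row groups per full-sized file)."""
    num_files = max(1, math.ceil(dataset_bytes / target_file_bytes))
    row_groups_per_file = max(1, math.ceil(target_file_bytes / row_group_bytes))
    return num_files, row_groups_per_file


# A hypothetical 200 GiB partition written as 512 MiB files.
num_files, row_groups_per_file = plan_output_files(200 * GIB)
```

In a Spark job the resulting file count could be passed to DataFrame.repartition before the write. Treat it as a starting point, because compressed output size is hard to predict from input size alone.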

Schema Evolution Support

Real pipelines evolve. New columns get added, types get widened, and old columns get renamed. Both Parquet and ORC support schema evolution, but with different capabilities:

Parquet supports adding columns at the end of the schema and widening numeric types (e.g., INT32 → INT64). It does not natively support reordering or renaming columns without a full rewrite.

ORC supports adding columns and has slightly better support for type promotion. Hive’s ORC reader can handle some renames if the schema-on-read mapping is configured.

Apache Iceberg and Delta Lake abstract away most of these limitations by tracking schema history at the table layer. If schema evolution is a primary concern, consider using a table format on top of Parquet rather than switching to ORC. Our guide to Apache Iceberg with AWS Glue covers this in depth.
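The evolution rules described above for plain Parquet can be expressed as a small compatibility check. This is a simplified sketch: the set of safe widenings is an assumption made for illustration, schemas are modelled as lists of (name, type) pairs, and real engines differ in what they accept.

```python
# Widenings assumed safe for this sketch; check your engine before relying on them.
SAFE_WIDENINGS = {("INT32", "INT64"), ("FLOAT", "DOUBLE")}


def is_compatible(old_schema, new_schema):
    """True if new_schema only appends columns or widens types in old_schema.

    Existing columns must keep their name and position, so renames,
    reorders, and dropped columns are all rejected.
    """
    if len(new_schema) < len(old_schema):
        return False  # a column was dropped
    for (old_name, old_type), (new_name, new_type) in zip(old_schema, new_schema):
        if old_name != new_name:
            return False  # rename or reorder
        if old_type != new_type and (old_type, new_type) not in SAFE_WIDENINGS:
            return False  # narrowing or unrelated type change
    return True


v1 = [("order_id", "INT32"), ("amount", "FLOAT")]
v2 = [("order_id", "INT64"), ("amount", "FLOAT"), ("region", "STRING")]  # widen and append
v3 = [("amount", "FLOAT"), ("order_id", "INT32")]                        # reorder
```

A check like this could run in CI before a pipeline deploys a new writer schema, flagging changes that would otherwise require a full rewrite.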

A Practical Decision Framework

Use Parquet when:

Your primary query engine is Athena, Spark on EMR, or AWS Glue
You need broad ecosystem compatibility (Redshift Spectrum, QuickSight SPICE imports, SageMaker Feature Store)
Your team is greenfield and has no existing format investment
You are adopting Delta Lake or Apache Iceberg (Delta Lake stores its data as Parquet, and Iceberg writes Parquet by default)

Use ORC when:

Your workload is Hive-heavy and requires native ACID semantics
You are migrating from an on-premises Hadoop cluster where ORC is the established standard
You have measured a specific performance advantage for your query patterns on ORC

A simple benchmark to run before committing: take a representative sample of your data, convert it to both formats with Snappy compression, and run your five most expensive Athena queries against each. The Athena console's query history shows the data scanned for each query, and the same figure is available as DataScannedInBytes from the GetQueryExecution API. The format with lower scanned bytes wins for your specific access patterns.
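Tallying the benchmark can be sketched as follows. The helper reads the Statistics.DataScannedInBytes field from dictionaries shaped like Athena GetQueryExecution responses. The responses here are hand-written stand-ins with made-up byte counts; in a real run each would come from boto3.client("athena").get_query_execution(QueryExecutionId=...).

```python
def scanned_bytes(response):
    """Extract DataScannedInBytes from a GetQueryExecution-shaped dict."""
    return response["QueryExecution"]["Statistics"]["DataScannedInBytes"]


def compare_formats(parquet_responses, orc_responses):
    """Sum scanned bytes per format and name the format that scanned less."""
    totals = {
        "parquet": sum(scanned_bytes(r) for r in parquet_responses),
        "orc": sum(scanned_bytes(r) for r in orc_responses),
    }
    if totals["parquet"] == totals["orc"]:
        winner = "tie"
    else:
        winner = min(totals, key=totals.get)
    return totals, winner


def fake_response(num_bytes):
    """Stand-in for a real API response (hypothetical values)."""
    return {"QueryExecution": {"Statistics": {"DataScannedInBytes": num_bytes}}}


parquet_runs = [fake_response(n) for n in (21_000_000_000, 4_500_000_000)]
orc_runs = [fake_response(n) for n in (23_500_000_000, 4_100_000_000)]
totals, winner = compare_formats(parquet_runs, orc_runs)
```

Summing across queries hides per-query differences, so it is worth looking at the individual figures too. In the made-up data above, ORC scans less on the second query even though Parquet scans less overall.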

Partitioning Amplifies Format Benefits

Columnar format benefits compound when combined with intelligent partitioning. If the partition layout does not match a query's filters, Athena still has to read the Parquet files in every partition it cannot rule out. Pair your format choice with a partitioning strategy that matches your most common query filters, typically date plus a low-cardinality business key like region or product line. High-cardinality keys make poor partition columns because they produce many tiny files.

The combination of Parquet + Snappy + date-based partitioning + Athena partition pruning can reduce query costs by 95–99% compared with CSV without partitioning. That is the difference between a $5,000/month Athena bill and a $50–$250 bill for the same analytical workload.
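Partition pruning itself can be sketched as a filter over object keys that follow the Hive-style column=value path convention. The bucket layout, column names, and keys below are hypothetical, and a real engine resolves partitions through the Glue Data Catalog, not by parsing paths.

```python
def partition_value(key, column):
    """Return the value of a Hive-style column=value path segment, or None."""
    prefix = column + "="
    for segment in key.split("/"):
        if segment.startswith(prefix):
            return segment[len(prefix):]
    return None


def prune(keys, column, lower, upper):
    """Keep only keys whose partition value falls within [lower, upper]."""
    kept = []
    for key in keys:
        value = partition_value(key, column)
        if value is not None and lower <= value <= upper:
            kept.append(key)
    return kept


# ISO dates compare correctly as strings, so no date parsing is needed here.
keys = [
    "sales/dt=2024-05-18/region=ca/part-0000.parquet",
    "sales/dt=2024-05-19/region=ca/part-0000.parquet",
    "sales/dt=2024-05-19/region=us/part-0000.parquet",
    "sales/dt=2024-05-20/region=ca/part-0000.parquet",
]
selected = prune(keys, "dt", "2024-05-19", "2024-05-19")
```

Only the files that survive pruning are opened at all. Column pruning and statistics-based skipping then reduce the bytes read within those files.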

Conclusion

For the vast majority of AWS data engineering teams, Parquet with Snappy or ZSTD compression is the right default. It has the broadest service compatibility, the most mature Spark integration, and is the foundation that Delta Lake and Apache Iceberg build upon. ORC earns its place in Hive-heavy environments and legacy migrations, but it is not the first choice for greenfield AWS workloads.

If you are designing a new data lake on S3 or modernising an existing pipeline, getting the format decision right early saves significant re-engineering effort down the line. Infra IT Consulting helps Canadian and international organisations design cost-efficient, performant AWS data architectures from the ground up. Get in touch to talk through your specific workload requirements.
