Tech Tutorials & How-Tos interviewcareeraws

50 AWS Data Engineering Interview Questions (With Answers)

By Infra IT Consulting · April 19, 2024 · 14 min read

Content on this site is AI-assisted and personally reviewed by Hazem. Learn more

Data engineering interviews test a wide range of knowledge: SQL fundamentals, Python and Spark, AWS service specifics, system design, and the behavioural side of working in a data team. This post compiles 50 questions that regularly appear in data engineering interviews at Canadian and US technology companies, with concise, accurate answers. Use it as a study guide, a self-assessment, or a reference when preparing candidates on your team.

Section 1: SQL Fundamentals (Questions 1–10)

1. What is the difference between WHERE and HAVING?

WHERE filters rows before aggregation is applied. HAVING filters groups after GROUP BY aggregation. Use WHERE to filter individual rows, HAVING to filter aggregated results.

-- WHERE filters before aggregation
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE status = 'delivered'  -- filter rows first
GROUP BY customer_id
HAVING COUNT(*) > 5;        -- then filter groups

2. Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER().

All three assign integer rankings within a partition. With tied values:

ROW_NUMBER() assigns unique sequential numbers (ties broken arbitrarily)
RANK() assigns the same number to ties and skips subsequent numbers (1, 1, 3)
DENSE_RANK() assigns the same number to ties without skipping (1, 1, 2)

3. What is a CTE and when should you use one instead of a subquery?

A CTE (Common Table Expression) defined with WITH is a named, reusable result set scoped to the current query. Use CTEs when: (a) the same subquery is referenced more than once, (b) the logic is complex enough that inline subqueries hurt readability, or (c) you need to build logic incrementally across multiple steps. CTEs do not inherently perform better than subqueries — the query optimizer usually treats them equivalently.

4. How does a LEFT JOIN differ from an INNER JOIN? Give a practical example.

INNER JOIN returns only rows with matching keys in both tables. LEFT JOIN returns all rows from the left table, with NULLs for right-table columns where no match exists. Use LEFT JOIN when you want to include records from the primary table regardless of whether a related record exists — for example, showing all customers including those with no orders:

SELECT c.customer_id, c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

5. What is a window function frame clause and why does LAST_VALUE behave unexpectedly without it?

The frame clause (ROWS BETWEEN ... AND ...) defines which rows relative to the current row are included in the window calculation. The default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. LAST_VALUE with this default frame returns the current row’s value — not the last value in the partition. To get the true last value, specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

6. How do you find duplicate records in a table?

SELECT email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;

To return the full rows of duplicates with a window function:

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at) AS rn
    FROM customers
)
WHERE rn > 1;

7. What is the difference between UNION and UNION ALL?

UNION removes duplicate rows between the two result sets; UNION ALL keeps all rows including duplicates. UNION ALL is significantly faster because it skips the deduplication sort step. Use UNION only when duplicates are genuinely a problem.

8. Explain what an index does and why it matters for analytical queries.

An index is a data structure (typically a B-tree or bitmap) that allows the database engine to locate rows matching a condition without scanning the full table. For OLAP analytical databases like Redshift, traditional indexes are replaced by sort keys (which physically order data on disk for range scans) and distribution keys (which control how data is spread across nodes to minimize cross-node joins).

9. What does NULL represent in SQL and how should it be handled in calculations?

NULL represents the absence of a value, not zero or an empty string. Arithmetic with NULL returns NULL. Comparisons with NULL using = or <> return NULL, not TRUE or FALSE — always use IS NULL or IS NOT NULL. Use COALESCE(value, default) to substitute a default for NULL in calculations.

10. Write a query that finds the second-highest salary per department.

SELECT department_id, employee_id, salary
FROM (
    SELECT
        department_id,
        employee_id,
        salary,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM employees
)
WHERE rnk = 2;

Section 2: Python and Apache Spark (Questions 11–20)

11. What is the difference between a Spark transformation and an action?

Transformations (e.g., filter, map, groupBy, join) are lazy — they define a computation plan (DAG) but do not execute. Actions (e.g., count, show, collect, write) trigger the actual execution of the DAG. This lazy evaluation allows Spark to optimize the full computation plan before any data moves.

12. What causes a shuffle in Spark and why is it expensive?

Shuffles occur when data must be redistributed across partitions — typically during groupBy, join, distinct, repartition, and orderBy. They are expensive because they require writing intermediate data to disk and transmitting it across the network between executors. Shuffles are the most common cause of Spark job slowness.

13. What is a broadcast join and when should you use it?

A broadcast join sends a copy of a small DataFrame to every executor, avoiding a shuffle for the join operation. Use it when joining a large DataFrame with a small one (typically under 100 MB). In PySpark:

from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), on="key_column", how="left")

14. Explain the difference between cache() and persist() in Spark.

Both keep a DataFrame in memory to avoid recomputation. cache() is equivalent to persist(StorageLevel.MEMORY_AND_DISK). persist() lets you specify the storage level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_AND_DISK_SER (serialized), etc. Use persist(MEMORY_AND_DISK) for DataFrames that are reused multiple times — Spark will spill to disk if memory is insufficient.

15. What is data skew in Spark and how do you fix it?

Data skew occurs when some partitions have significantly more data than others — causing certain tasks to take much longer and becoming the bottleneck. Common causes: joining on a key with low cardinality (e.g., joining on country_code where 70% of records are CA). Fixes: (a) salting the skewed key (add a random suffix, join on the salted key, then aggregate), (b) broadcast join if one side is small enough, (c) Adaptive Query Execution (AQE) in Spark 3.x which handles skew automatically.

16. Write Python code to read a Parquet file from S3 and filter rows where revenue > 1000.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl").getOrCreate()

df = spark.read.parquet("s3://my-data-lake/processed/orders/year=2024/")
filtered = df.filter(col("revenue") > 1000)
filtered.write.mode("overwrite").parquet("s3://my-data-lake/curated/high_value_orders/")

17. What is the difference between .collect() and .take(n) in Spark?

Both bring data from Spark executors to the driver. .collect() brings all rows — dangerous on large DataFrames as it can OOM the driver. .take(n) brings only the first n rows and is safe for sampling. Prefer .show() for debugging and .write for production output over .collect().

18. Explain Python’s GIL and why it matters for data processing.

The Global Interpreter Lock (GIL) prevents multiple Python threads from executing Python bytecode simultaneously. This limits CPU-bound parallelism with threads. For CPU-bound data processing, use multiprocessing (separate processes bypass the GIL) or libraries like pandas that release the GIL for NumPy operations. Spark sidesteps the GIL entirely because actual processing runs on the JVM across executors.

19. What are Python generators and when would you use them in a data pipeline?

Generators are functions that yield values lazily (one at a time) rather than computing a full list. They are memory-efficient for large data streams:

def read_csv_chunks(file_path: str, chunksize: int = 10_000):
    import pandas as pd
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        yield chunk

for chunk in read_csv_chunks("large_file.csv"):
    process(chunk)

Use generators when the full dataset does not fit in memory or when processing can start before the full input is available.

20. How do you handle schema evolution in a Spark DataFrame pipeline?

Schema evolution strategies: (a) use mergeSchema=True when reading Parquet (spark.read.option("mergeSchema", "true").parquet(...)) to union schemas across files, (b) validate expected columns at pipeline start and fail fast on schema mismatch, (c) use StructType schema enforcement with spark.read.schema(schema).parquet(...) to explicitly reject unexpected columns, (d) for slowly evolving schemas, use Delta Lake which handles schema evolution with MERGE semantics.

Section 3: AWS Data Services (Questions 21–35)

21. What is the difference between Amazon Redshift and Amazon Athena?

Redshift is a provisioned (or serverless) MPP data warehouse that stores data internally in columnar format. Athena is a serverless query engine that runs SQL directly against files in S3. Redshift is better for complex, repeated queries on structured data with high concurrency. Athena is better for ad hoc exploration, ELT staging, and querying raw S3 data without loading it. See our Redshift vs. Athena comparison for a detailed breakdown.

22. What is a Glue Data Catalog and why is it important?

The Glue Data Catalog is a centralized metadata repository — a Hive-compatible metastore that stores table definitions (schema, location, partitions, format). It is shared across Athena, Redshift Spectrum, EMR, and Glue ETL jobs, so all services see the same table definitions. Without it, each service would need its own metadata management.

23. Explain the difference between Glue DPUs and how to right-size a Glue job.

A DPU (Data Processing Unit) provides 4 vCPUs and 16 GB of memory. Each Glue job runs with a minimum of 2 DPUs. Right-sizing: start with the default, run the job, check the Glue job metrics for CPU and memory utilization. If memory utilization is consistently below 40%, reduce DPUs. If the job spills to disk (check the S3 staging directory size), increase DPUs. Use G.1X worker type for memory-intensive jobs and G.2X for compute-heavy jobs.

24. What is S3 Select and when would you use it?

S3 Select allows you to run SQL queries on individual S3 objects (CSV, JSON, Parquet, ORC) and retrieve only matching data, reducing data transfer. Useful when you repeatedly need a small subset of a large file — for example, filtering a 1 GB CSV to the last 7 days of records before loading it into pandas. It is faster and cheaper than downloading the full file.

25. What is an S3 lifecycle policy and why should data lakes use them?

S3 lifecycle policies automatically transition objects between storage classes or expire them after a specified number of days. Data lakes use them to move raw data from S3 Standard (frequently accessed) to S3 Standard-IA (infrequent access) after 30 days, and to Glacier for long-term archival after 90 days. This can reduce storage costs by 60-80% for data that is rarely accessed after initial processing.

26. Explain the purpose of S3 versioning and its cost implications.

S3 versioning preserves every version of an object when it is overwritten or deleted. Useful for data recovery (accidental deletes or bad writes) and audit trails. Cost implication: all versions consume storage — if you update a 1 GB file 10 times, you pay for all 10 versions until old ones are expired via lifecycle rules. Always add a lifecycle rule to expire non-current versions after 30 days in versioned buckets.

27. What is AWS Kinesis Data Firehose and how does it differ from Kinesis Data Streams?

Kinesis Data Firehose is a fully managed delivery service for streaming data — it receives records and automatically delivers them to S3, Redshift, OpenSearch, or Splunk. It handles batching, compression, and format conversion (to Parquet/ORC). Kinesis Data Streams is a lower-level streaming backbone — you consume records yourself with your own consumer application. Use Firehose when the destination is S3 or Redshift and you do not need custom consumer logic.

28. What is AWS Lake Formation and what does it add over raw S3 + Glue?

Lake Formation is a security and governance layer on top of S3 and the Glue Data Catalog. It adds fine-grained access control at the column and row level (column-level security, row filters), centralized permission management across Athena/Redshift Spectrum/Glue, data tagging for compliance, and governed tables with ACID transactions. Use it when you need column-level access control or centralized data governance across multiple consumers.

29. What is the purpose of an SQS dead-letter queue in a data pipeline?

A dead-letter queue (DLQ) captures messages that fail processing after a configured number of retries. In a Lambda-triggered pipeline, if a Lambda function fails to process an S3 event after 3 attempts, the event goes to the DLQ rather than being silently dropped. This enables: (a) investigation of failures without data loss, (b) replaying failed events after fixing the bug, (c) alerting on unexpected failure rates.

30. How does Redshift Spectrum differ from using Athena?

Both query S3 data using the Glue Data Catalog. Redshift Spectrum runs queries through the Redshift query planner, allowing joins between S3 external tables and internal Redshift tables in a single query. Athena is a standalone serverless engine — it cannot join against Redshift internal tables. Use Spectrum when you need to mix historical S3 data with recent Redshift data in the same query.

31. What is the difference between an AWS Step Functions Standard and Express workflow?

Standard workflows support long-running executions (up to 1 year), exactly-once execution semantics, and visible execution history. They cost per state transition. Express workflows support short executions (up to 5 minutes), at-least-once semantics, and are suited for high-volume event-driven pipelines. Express workflows are significantly cheaper for high-frequency, short-duration executions.

32. What is AWS Glue Streaming ETL and when would you use it over Kinesis Data Analytics?

Glue Streaming ETL runs a continuously running Spark Structured Streaming job that reads from Kinesis or Kafka, applies transforms, and writes to S3 or other targets. Use it when your streaming transformation logic is complex enough to benefit from Spark’s DataFrame API, when you want to reuse existing Glue/Spark transform code in a streaming context, or when you need Glue Catalog integration. Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) is better for lower-latency requirements and stateful stream processing with Flink’s operator model.

33. What is an IAM role vs. an IAM user, and which should services use?

An IAM user has permanent credentials (access key + secret). An IAM role has no permanent credentials — it issues temporary credentials (15 minutes to 12 hours) that are automatically rotated. Services like Lambda, ECS tasks, Glue jobs, and EC2 instances should always use IAM roles (attached as instance profiles or execution roles), never IAM users. This follows the principle of least privilege and eliminates credential rotation overhead.

34. Explain the purpose of an S3 VPC Endpoint.

An S3 VPC Endpoint (Gateway Endpoint) allows traffic from resources inside a VPC to reach S3 without going through the public internet. Benefits: (a) traffic stays on the AWS network — faster and more secure, (b) no data transfer charges for traffic through the endpoint (vs. public internet egress), (c) bucket policies can restrict access to specific VPCs for compliance. Essential for data engineering workloads in private subnets.

35. What is AWS Glue job bookmarking?

Glue job bookmarks track which data has already been processed by a job. When enabled, a Glue job records the S3 path and last-modified timestamp of processed files. On the next run, it only processes new or changed files, implementing incremental processing without custom state management. Useful for S3-to-S3 and S3-to-Redshift incremental loads.

Section 4: System Design (Questions 36–45)

36. Design a real-time pipeline that ingests 10,000 clickstream events per second and makes them queryable within 5 minutes.

High-level: events → Kinesis Data Streams (shards sized for throughput) → Kinesis Data Firehose (buffering 5 minutes or 128 MB) → S3 with Parquet conversion → Glue Crawler updates Data Catalog → Athena queries. For sub-minute latency, replace Firehose with a Lambda consumer writing micro-batches directly.

37. How would you design an idempotent ETL pipeline?

Idempotency means re-running the pipeline produces the same result. Strategies: (a) write outputs to a path that includes the execution date/batch ID — each run writes to a unique location, (b) use overwrite mode for the target partition, (c) track processed file hashes in DynamoDB to skip already-processed inputs, (d) use Glue job bookmarks for S3 source tracking.

38. What is the medallion architecture and how does it apply to S3 data lakes?

The medallion architecture organizes S3 data into three zones: Bronze (raw, unmodified data as ingested), Silver (cleaned, typed, deduplicated data), Gold (aggregated, business-ready datasets). This separation makes it easy to reprocess — if the silver transformation logic changes, you can reprocess from bronze without re-ingesting. See Building a Data Lake on S3 for a detailed implementation.

39. How do you handle late-arriving data in a batch pipeline?

Options: (a) use a watermark — ignore data arriving more than N hours late, (b) use a reprocess window — keep the last 3 days of partitions mutable, re-run them daily to pick up late arrivals, (c) use Delta Lake or Iceberg for ACID upserts when late records need to correct earlier results, (d) maintain a separate corrections table and union it at query time.

40. Describe the schema-on-read vs. schema-on-write distinction.

Schema-on-write (traditional data warehouses): schema is enforced when data is written — invalid data is rejected. Schema-on-read (data lakes): data is stored as-is; schema is applied at query time. Schema-on-read offers more flexibility for raw data ingestion but pushes schema validation downstream. Most production architectures combine both: raw zone uses schema-on-read, curated zone enforces schema-on-write via Glue or dbt transformations.

41. How would you design a slowly changing dimension (SCD Type 2) in Redshift?

SCD Type 2 preserves history by adding new rows for changes rather than updating existing ones. Each row has valid_from, valid_to, and is_current columns. Implement with: (a) a staging table with incoming records, (b) UPDATE existing rows to set valid_to = NOW(), is_current = FALSE where the business key matches but an attribute has changed, (c) INSERT new rows for changed and new records with valid_from = NOW(), is_current = TRUE.

42. What is data lineage and how do you implement it on AWS?

Data lineage tracks the origin, transformations, and destination of data assets. Implementation options: (a) AWS Glue Data Catalog with Glue lineage events, (b) dbt’s built-in DAG and documentation site tracks lineage between models, (c) Apache Atlas (via EMR) for fine-grained column-level lineage, (d) AWS Lake Formation’s governance features. Minimum viable: tag S3 objects with source metadata in their file path and record transformation jobs in a DynamoDB audit table.

43. How do you design for cost optimization in an S3 data lake?

Key levers: (a) use Parquet with Snappy compression — typically 5-10x smaller than CSV, (b) partition data by date for partition pruning in Athena (pay for data scanned), (c) apply S3 lifecycle rules to move cold data to IA/Glacier, (d) use Athena workgroup query result reuse to cache repeated queries, (e) right-size Glue DPUs and use spot instances for EMR, (f) use S3 Intelligent-Tiering for access patterns you cannot predict.

44. What is the fan-out pattern in Lambda and how is it used in data pipelines?

Fan-out distributes work across multiple parallel Lambda invocations. Pattern: a coordinator Lambda receives an S3 event or Step Functions task, lists all files to process, and invokes one worker Lambda per file asynchronously via invoke_type=Event. Worker Lambdas process in parallel, constrained by Lambda concurrency limits. This pattern works well for parallelizing processing of large file sets that can be worked on independently.

45. How would you implement data quality checks in a production pipeline?

Layer the checks: (a) schema validation on ingestion — reject records with missing required fields, (b) statistical checks in the transformation layer using Great Expectations or dbt tests — check for null rates, value ranges, referential integrity, (c) business rule validation — revenue cannot be negative, order_date cannot be in the future, (d) output validation — row count comparison between source and target, (e) anomaly detection — alert on >20% row count deviation from the previous run. See Data Quality with Great Expectations on AWS for implementation details.

Section 5: Behavioural Questions (Questions 46–50)

46. Describe a data pipeline you built that had a significant bug in production. What happened and what did you learn?

Strong answer elements: specific technical root cause (not vague), clear description of the business impact, structured incident response (detection → containment → fix → postmortem), and specific changes made to prevent recurrence (tests added, monitoring added, process changed). Avoid answers that minimize the impact or blame others.

47. How do you handle conflicting priorities between data engineering work and ad hoc analysis requests from stakeholders?

Strong answer: describe a concrete prioritization framework — for example, P0 (data down, blocks decisions), P1 (scheduled pipeline, SLA-bound), P2 (planned feature work), P3 (ad hoc requests). Mention proactively setting expectations with stakeholders, providing self-serve tools that reduce ad hoc demand over time, and escalating through your manager when priorities genuinely conflict at the organizational level.

48. Tell me about a time you had to learn a new technology quickly for a project.

Structure with STAR (Situation, Task, Action, Result). Emphasize your learning approach: documentation first, then hands-on experimentation, then community resources. Show that you validate your understanding by building something working rather than just reading. Mention what you would do differently with more time.

49. How do you ensure data pipelines remain maintainable as the team grows?

Key practices: (a) version control everything — code, infrastructure, SQL, (b) document intent in code comments and README files, not just implementation, (c) automated testing that runs in CI/CD, (d) consistent naming conventions enforced via linting, (e) modular design — pipelines composed of small, independently testable stages, (f) runbook documentation for on-call engineers, (g) regular pipeline reviews to retire unused pipelines.

50. Describe how you approach debugging a data pipeline that is producing incorrect results.

Systematic approach: (a) reproduce the issue with a minimal example, (b) identify the last known-good state (which pipeline run was correct?), (c) compare input data for correct vs. incorrect runs — is the issue in the source data?, (d) trace the transformation step by step using intermediate outputs, (e) check for silent failures — mismatched joins producing NULLs, type coercion silently converting values, duplicate rows from a non-deduplicated join. Avoid modifying production data until the root cause is confirmed.

Further Preparation

For the technical hands-on portions of interviews, practice writing actual SQL against a database (not just reading about it). Window functions appear in nearly every data engineering interview — review our SQL Window Functions in Amazon Athena tutorial for a practical refresher. For AWS architecture questions, building the systems described in our roadmap post on Getting Started as an AWS Data Engineer will give you the hands-on experience to answer system design questions with confidence.

If your organization needs data engineering expertise — whether building a team from scratch or augmenting an existing one — contact Infra IT Consulting. We work with Canadian companies across the full data engineering lifecycle.

Tech Tutorials & How-Tos

Talk to our team →

50 AWS Data Engineering Interview Questions (With Answers)

Section 1: SQL Fundamentals (Questions 1–10)

Section 2: Python and Apache Spark (Questions 11–20)

Section 3: AWS Data Services (Questions 21–35)

Section 4: System Design (Questions 36–45)

Section 5: Behavioural Questions (Questions 46–50)

Further Preparation

Related posts

Kafka vs. Kinesis: A Hands-On Comparison for Data Engineers

AWS CDK for Data Infrastructure: Type-Safe IaC for Data Teams

Python and Boto3: Automating S3 Data Operations

Section 1: SQL Fundamentals (Questions 1–10)

Section 2: Python and Apache Spark (Questions 11–20)

Section 3: AWS Data Services (Questions 21–35)

Section 4: System Design (Questions 36–45)

Section 5: Behavioural Questions (Questions 46–50)

Further Preparation

Related posts

Kafka vs. Kinesis: A Hands-On Comparison for Data Engineers

AWS CDK for Data Infrastructure: Type-Safe IaC for Data Teams

Python and Boto3: Automating S3 Data Operations

We value your privacy