
Data Engineering in Microsoft Fabric: Spark Notebooks, Pipelines, and Lakehouse Patterns

By Infra IT Consulting · 12 min read

Content on this site is AI-assisted and personally reviewed by Hazem.

A common reaction from data engineers evaluating Microsoft Fabric for the first time goes something like this: “OK, it’s Synapse with a better UI.” That reaction is understandable — Fabric’s Data Engineering workload is superficially similar to Azure Synapse Spark pools. But once you understand how the Lakehouse storage model integrates with Spark, how pipelines orchestrate across workloads, and how monitoring works, the experience is meaningfully different from what Synapse offered.

This guide covers the practical mechanics of building data engineering pipelines in Fabric — what the Lakehouse is, how Spark notebooks work in practice, how to structure a medallion architecture, and where Fabric’s approach compares favourably (and unfavourably) to AWS Glue.

The Fabric Data Engineering Workload

The Data Engineering workload in Microsoft Fabric includes four primary item types:

  • Lakehouses: The primary storage-and-compute unit. A Lakehouse is a Delta-format data store backed by OneLake with an attached Spark engine and a SQL analytics endpoint.
  • Notebooks: Browser-based, Jupyter-compatible notebooks for interactive PySpark, Scala, or Spark SQL development.
  • Spark Job Definitions: Configured, parameterisable Spark application definitions for scheduled or triggered batch processing.
  • Pipelines: Orchestration of data movement, Notebook execution, Spark Job Definitions, and Dataflow Gen2 runs in a DAG-style pipeline editor.

These four components work together as a complete ETL and data transformation layer. They are not separate services to connect — they are items within a Fabric workspace, sharing OneLake storage automatically.

Understanding the Lakehouse: Files Zone vs Tables Zone

When you create a Lakehouse in Fabric, it provisions two logical zones in OneLake:

Files zone (Files/): An unstructured storage area for raw files in any format — CSV, JSON, Parquet, XML, images, PDFs. The Files zone is analogous to the raw layer of an ADLS Gen2 account. Data here is not queryable via SQL; it requires a Spark read to access. This is where you land data from ingestion pipelines before any processing.

Tables zone (Tables/): A managed Delta Lake area. Every Delta table you create via Spark (df.write.format("delta").saveAsTable("my_table")) lands here. The Lakehouse SQL analytics endpoint reads these tables but does not create them. Tables in this zone are immediately queryable through:

  • PySpark in notebooks (spark.read.table("my_table"))
  • The SQL analytics endpoint (T-SQL queries without any connection configuration)
  • Power BI in Direct Lake mode (direct reads of the Parquet files underlying each Delta table)

The implication is that a data engineer writing a Delta table via Spark automatically makes that table available to the Power BI developer working in the same workspace. There is no dataset refresh to schedule, no import to trigger, no connection string to configure.
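Both zones can also be addressed by a full OneLake URI, which matters when a notebook reads from a Lakehouse other than its default. The sketch below shows how such a path is composed. The workspace and Lakehouse names are hypothetical, and the URI layout is the OneLake ABFS convention as we understand it, so verify it against your own tenant before relying on it.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_uri(workspace: str, lakehouse: str, zone: str, relative_path: str) -> str:
    """Build an abfss:// URI for a path in a Lakehouse's Files or Tables zone."""
    if zone not in ("Files", "Tables"):
        raise ValueError(f"zone must be 'Files' or 'Tables', got {zone!r}")
    return (
        f"abfss://{workspace}@{ONELAKE_HOST}/"
        f"{lakehouse}.Lakehouse/{zone}/{relative_path.lstrip('/')}"
    )


# Hypothetical workspace and Lakehouse names, for illustration only.
raw_uri = onelake_uri("analytics_ws", "bronze", "Files", "raw/transactions/2026-06-01")
table_uri = onelake_uri("analytics_ws", "silver", "Tables", "customer_transactions_clean")
print(raw_uri)
print(table_uri)
```

Inside a notebook attached to a default Lakehouse, the shorter relative forms (Files/... and plain table names) are usually enough.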

Writing PySpark Notebooks in Fabric

Fabric notebooks run in a browser-based, Jupyter-compatible interface. For developers coming from Databricks, the experience is similar in structure: cells are Python, Scala, or Spark SQL; outputs display inline; the spark session is pre-initialised. The primary difference is that Databricks-specific utilities (dbutils) are not available; Fabric's notebookutils module takes their place.
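As an illustration of where notebookutils takes over from dbutils, here is a small sketch that lists the CSV files in a landing folder. It assumes that notebookutils.fs.ls returns entries exposing name, path, and isFile attributes; treat those attribute names as an assumption to check in your runtime. A stand-in object replaces notebookutils.fs so the snippet also runs outside Fabric.

```python
from types import SimpleNamespace


def list_csv_files(fs, folder: str) -> list[str]:
    """Return the paths of the .csv files directly under `folder`, sorted.

    `fs` is expected to behave like notebookutils.fs: ls(path) returns
    objects with `name`, `path`, and `isFile` (assumed attribute names).
    """
    return sorted(
        entry.path
        for entry in fs.ls(folder)
        if entry.isFile and entry.name.lower().endswith(".csv")
    )


# Stand-in for notebookutils.fs so the sketch runs outside a Fabric notebook.
fake_fs = SimpleNamespace(
    ls=lambda folder: [
        SimpleNamespace(name="part-0002.csv", path=f"{folder}/part-0002.csv", isFile=True),
        SimpleNamespace(name="part-0001.csv", path=f"{folder}/part-0001.csv", isFile=True),
        SimpleNamespace(name="_checkpoint", path=f"{folder}/_checkpoint", isFile=False),
    ]
)

csv_files = list_csv_files(fake_fs, "Files/raw/transactions/2026-06-01")
print(csv_files)
```

In a Fabric notebook the call would be list_csv_files(notebookutils.fs, "Files/raw/transactions/2026-06-01").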

Reading from OneLake in a Fabric notebook uses standard Spark table reads for Tables zone data:

df = spark.read.table("silver_lakehouse.customer_transactions")

For Files zone data (raw files landed from ingestion):

df = spark.read.option("header", True).csv("Files/raw/transactions/2026-06-01/*.csv")
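Without an explicit schema or schema inference, Spark reads every CSV column as a string. A hedged sketch of a reader that pins the schema and builds the dated path follows. The column names and types are hypothetical, and the spark session is passed in rather than created, as it would be in a Fabric notebook.

```python
from datetime import date

# Hypothetical schema for the raw transaction files, written as a Spark DDL string.
TRANSACTION_SCHEMA = (
    "transaction_id STRING, customer_id STRING, "
    "amount DECIMAL(18,2), transaction_ts TIMESTAMP"
)


def raw_csv_path(source: str, run_date: date) -> str:
    """Files-zone glob for one source system and one landing date."""
    return f"Files/raw/{source}/{run_date.isoformat()}/*.csv"


def read_raw_csv(spark, source: str, run_date: date):
    """Read one day's CSV files with a fixed schema instead of inference."""
    return (
        spark.read
        .option("header", True)
        .schema(TRANSACTION_SCHEMA)  # DataFrameReader.schema accepts a DDL string
        .csv(raw_csv_path(source, run_date))
    )


path = raw_csv_path("transactions", date(2026, 6, 1))
print(path)  # Files/raw/transactions/2026-06-01/*.csv
```

In a notebook, df = read_raw_csv(spark, "transactions", date(2026, 6, 1)) would then return typed columns ready for validation.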

Writing a managed Delta table to the Tables zone:

(
    df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("silver.customer_transactions_clean")
)

Writing with MERGE for upsert patterns (critical for CDC and incremental loads):

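from delta.tables import DeltaTable

# source_df holds the incoming batch of new and changed rows,
# for example the DataFrame read from the Files zone above.
target = DeltaTable.forName(spark, "silver.customer_transactions_clean")
(
    target.alias("t")
    .merge(source_df.alias("s"), "t.transaction_id = s.transaction_id")  # match on the business key
    .whenMatchedUpdateAll()       # key already exists: update every column
    .whenNotMatchedInsertAll()    # key is new: insert the row
    .execute()
)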
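Two practical wrinkles with MERGE are composite keys and duplicate source rows: Delta raises an error when several source rows match the same target row. The sketch below wraps the pattern in a helper that builds the join condition from a key list and deduplicates the source first. The helper names and the line_number column are hypothetical, and dropDuplicates keeps an arbitrary row per key, so swap in a window function if you need the latest record.

```python
def merge_condition(keys, target_alias="t", source_alias="s"):
    """Build 't.k1 = s.k1 AND t.k2 = s.k2' from a list of key columns."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target_alias}.{k} = {source_alias}.{k}" for k in keys)


def upsert(spark, source_df, table_name, keys):
    """Deduplicate the source on `keys`, then MERGE it into `table_name`."""
    # Imported lazily so the helper module loads even where Delta is not installed.
    from delta.tables import DeltaTable

    # MERGE fails if several source rows match one target row,
    # so collapse duplicates before merging.
    deduped = source_df.dropDuplicates(keys)
    target = DeltaTable.forName(spark, table_name)
    (
        target.alias("t")
        .merge(deduped.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


condition = merge_condition(["transaction_id", "line_number"])
print(condition)  # t.transaction_id = s.transaction_id AND t.line_number = s.line_number
```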

Spark Job Definitions for Scheduled Workloads

Notebooks are interactive-first. For production batch workloads that run on a schedule or are triggered by events, Fabric provides Spark Job Definitions — a configured Spark application that references a .py or .jar file stored in OneLake, with configurable parameters and execution settings.

Spark Job Definitions are the correct pattern for:

  • Nightly batch transformations that run independently of a human session
  • Parameterised jobs that process data for a specific date or partition
  • Long-running Spark applications that exceed notebook session limits

The Spark Job Definition editor provides configuration for executor count, executor memory, driver memory, and Spark configuration overrides — equivalent to a spark-submit configuration without the cluster management overhead.
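To make the parameterised-job pattern concrete, here is a minimal sketch of what the main .py file behind a Spark Job Definition might look like. It assumes the job passes its configured command-line arguments to the script; the argument names and the bronze_transactions table name are hypothetical.

```python
import argparse
from datetime import date


def parse_args(argv=None):
    """Parse the job's command-line arguments (names are illustrative)."""
    parser = argparse.ArgumentParser(description="Nightly load for one partition date")
    parser.add_argument("--run-date", type=date.fromisoformat, required=True,
                        help="Partition date to process, YYYY-MM-DD")
    parser.add_argument("--source", default="transactions",
                        help="Source system folder under Files/raw/")
    return parser.parse_args(argv)


def raw_path(args) -> str:
    """Files-zone glob for the requested source and date."""
    return f"Files/raw/{args.source}/{args.run_date.isoformat()}/*.csv"


def main(argv=None):
    args = parse_args(argv)
    # Imported here so argument parsing can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(f"load-{args.source}-{args.run_date}").getOrCreate()
    df = spark.read.option("header", True).csv(raw_path(args))
    # Hypothetical target table; append keeps earlier partition dates intact.
    df.write.format("delta").mode("append").saveAsTable("bronze_transactions")


# Example of the arguments a scheduled run might supply.
args = parse_args(["--run-date", "2026-06-01"])
print(raw_path(args))
```

In the deployed file, a call to main() under the usual __main__ guard would replace the example lines at the bottom.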

Data Factory Pipelines in Fabric: Orchestration Layer

Fabric’s pipeline editor (Data Factory in Fabric) provides the orchestration layer above Spark notebooks and Spark Job Definitions. A pipeline is a directed acyclic graph of activities:

  • Copy activity: Moves data from a source connector (SQL Server, Salesforce, REST API, S3, Event Hubs, and 90+ others) into OneLake’s Files or Tables zone.
  • Notebook activity: Triggers a Fabric notebook with specified parameters.
  • Spark Job Definition activity: Triggers a Spark Job Definition.
  • Dataflow Gen2 activity: Runs a low-code transformation flow.
  • Control flow activities: If Condition, ForEach, Until, and Wait, the standard pipeline orchestration primitives.

A complete ingestion-to-gold pipeline in Fabric looks like:

[Copy Activity: Source → Bronze Files Zone]
        ↓
[Notebook Activity: Bronze → Silver (clean, validate, MERGE)]
        ↓
[Notebook Activity: Silver → Gold (aggregate, business logic)]
        ↓
[Success notification activity]

Pipelines run on a time-based schedule or on an event-based trigger, such as a storage event raised when a file lands in OneLake.
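On the notebook side, the parameters sent by a Notebook activity typically override the defaults declared in a cell marked as the parameter cell. The sketch below shows one way to structure that cell and normalise its values before any Spark work starts. The parameter names and the "yesterday" fallback are assumptions made for illustration.

```python
from datetime import date, timedelta

# --- parameter cell: a pipeline Notebook activity can override these defaults ---
run_date = ""                    # ISO date string; empty means "process yesterday"
source_system = "transactions"   # folder name under Files/raw/


# --- following cell: normalise the parameters before any Spark work ---
def resolve_run_date(value: str, today: date) -> date:
    """Use the supplied ISO date, or fall back to the day before `today`."""
    return date.fromisoformat(value) if value else today - timedelta(days=1)


effective_date = resolve_run_date(run_date, today=date.today())
bronze_path = f"Files/raw/{source_system}/{effective_date.isoformat()}/"
print(bronze_path)
```

Keeping the fallback in a small function makes manual reruns for a past date a matter of supplying run_date from the pipeline, with no change to the notebook.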

Medallion Architecture Implementation Walkthrough

The recommended Fabric architecture for a complete medallion implementation uses three Lakehouses in one workspace:

Bronze Lakehouse: Raw data as-landed. Copy activity writes CSV or JSON files to Files/raw/{source}/{date}/. No schema enforcement. Retain indefinitely for reprocessing.

Silver Lakehouse: Clean, typed, deduplicated data. A Spark notebook reads from Bronze Files, validates schema, deduplicates using MERGE, and writes to Silver Tables. Silver tables are the canonical source of truth.

Gold Lakehouse: Aggregated, business-logic-applied tables. A second Spark notebook reads from Silver Tables and writes aggregated Gold tables. Power BI semantic models connect to Gold in Direct Lake mode.
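The two notebook steps can be sketched as plain functions over DataFrames. This is a minimal illustration, not a full implementation: the column names are hypothetical, the quality rules are deliberately simple, and reading from Bronze and writing to the Silver and Gold tables is left to the calling notebook.

```python
def quality_filter(required_columns, non_negative_column):
    """Spark SQL predicate: required columns present and the amount not negative."""
    checks = [f"{col} IS NOT NULL" for col in required_columns]
    checks.append(f"{non_negative_column} >= 0")
    return " AND ".join(checks)


def bronze_to_silver(bronze_df, keys, required_columns, amount_column):
    """Drop rows that fail basic checks, then keep one row per business key."""
    return (
        bronze_df
        .filter(quality_filter(required_columns, amount_column))  # filter accepts a SQL string
        .dropDuplicates(keys)
    )


def silver_to_gold(silver_df, group_column, amount_column):
    """Aggregate Silver rows into one total per group for reporting."""
    return (
        silver_df
        .groupBy(group_column)
        .agg({amount_column: "sum"})  # produces a column named "sum(<amount_column>)"
        .withColumnRenamed(f"sum({amount_column})", "total_amount")
    )


predicate = quality_filter(["transaction_id", "customer_id"], "amount")
print(predicate)  # transaction_id IS NOT NULL AND customer_id IS NOT NULL AND amount >= 0
```

Keeping each layer transition in its own function makes it straightforward to call them from separate notebooks, one per Notebook activity in the pipeline.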

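Access control separates the layers: data engineers have Contributor access to all three Lakehouses; BI developers have read access to Gold only; end users access data only through Power BI reports with row-level security applied. Because workspace roles apply to every item in a workspace, restricting BI developers to Gold means either sharing the Gold Lakehouse with them at item level or placing each layer in its own workspace.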

Monitoring: Fabric Monitoring Hub and Spark History

Fabric provides a Monitoring Hub: a central view of the status and duration of Spark sessions, pipeline runs, and Spark Job Definition executions across the workspaces you can access. For each run you can access:

  • Spark job DAG and stage breakdown
  • Executor metrics (GC time, shuffle bytes, task failures)
  • Log output (stdout, stderr, driver logs)

The Monitoring Hub replaces what you would previously have accessed through the Azure Monitor + Log Analytics integration in Synapse — and is meaningfully easier to navigate for most data engineers.

For capacity-level monitoring, the Fabric Capacity Metrics app (a Power BI app installed into a workspace of its own) shows CU consumption by item, workload, and time period. This is essential for capacity right-sizing.
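The log output for a run is only as useful as what the notebook writes to it. A small sketch of a timing helper built on Python's standard logging module follows, assuming the runtime captures standard logging output alongside stdout and the driver logs. The logger name and step name are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("silver_load")  # hypothetical logger name


@contextmanager
def timed_step(name: str):
    """Log the start, outcome, and wall-clock duration of one pipeline step."""
    start = time.perf_counter()
    logger.info("step=%s status=started", name)
    try:
        yield
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("step=%s status=failed seconds=%.2f",
                         name, time.perf_counter() - start)
        raise
    else:
        logger.info("step=%s status=succeeded seconds=%.2f",
                    name, time.perf_counter() - start)


logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")

with timed_step("bronze_to_silver"):
    row_count = sum(range(1000))  # placeholder for the real Spark work
```

Consistent key=value messages like these are easy to search for when a run's logs are opened from the monitoring view.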

Fabric Pipelines vs AWS Glue

For teams familiar with AWS Glue evaluating Fabric as an alternative:

| Capability | Fabric Pipelines + Spark | AWS Glue |
|---|---|---|
| ETL authoring | Pipeline editor + PySpark notebooks | Glue Studio visual editor + PySpark scripts |
| Spark runtime | Open-source Spark (managed) | Glue Spark (proprietary, AWS-managed) |
| Job scheduling | Pipeline triggers (time/event) | Glue Triggers + EventBridge |
| Data catalogue | Fabric workspace (Delta tables) | Glue Data Catalog (Iceberg, Delta, Hudi) |
| Streaming | Eventstream + Spark Structured Streaming | Glue Streaming ETL |
| Serverless model | Yes (no cluster provisioning) | Yes (Glue serverless) |
| Debugging experience | Fabric Monitoring Hub + Spark UI | Glue Continuous Logging + CloudWatch |

Glue’s strength is its deep integration with the AWS service catalogue — native connectivity to S3, Redshift, DynamoDB, Kinesis, and 60+ other AWS services. Fabric’s strength is the zero-configuration storage integration with OneLake and the seamless handoff to Power BI. For AWS-native organisations, Glue remains the more natural choice. For Microsoft-ecosystem organisations, Fabric pipelines eliminate the cross-service connection management that made Synapse + ADF complex to operate.

Conclusion

Microsoft Fabric’s Data Engineering workload (Lakehouses, Spark notebooks, Spark Job Definitions, and Pipelines) is a coherent, production-capable data engineering platform. The Lakehouse model’s automatic integration with OneLake, the SQL analytics endpoint, and Power BI Direct Lake eliminates an entire category of integration work that burdened previous Azure analytics architectures.

For teams building from scratch in a Microsoft-aligned environment, Fabric is a defensible platform choice. For teams migrating from Azure Synapse, it is the clearly supported path forward. See our guide on migrating from Azure Synapse to Microsoft Fabric for the migration specifics.

