Saisree
Company Background
Global Freight Forwarders is a logistics company responsible for managing international shipping operations across multiple regions. The company generates daily logistics events such as shipment status updates, carrier movements, and delivery confirmations in JSON format.
Problem Statement
The organization required an efficient and scalable way to ingest logistics data incrementally into Microsoft Fabric without reprocessing historical files. Full reloads were costly, slow, and prone to duplication.
Challenges:
• Full data reloads consumed excessive compute resources
• Historical files were reprocessed unnecessarily
• Data duplication occurred frequently
• Pipeline performance degraded as data volume grew
Objectives
• Implement watermark-based incremental ingestion
• Avoid reprocessing historical files
• Improve pipeline performance and reliability
• Maintain data consistency with Delta tables
Design
• Source JSON files in Lakehouse Files
• Watermark control table tracks last processed timestamp
• Lookup activity reads the current watermark value
• Copy Data activity filters and loads only new records
• Notebook activity updates the watermark after a successful load
• Bronze_ShippingLogs Delta table stores the ingested data

Execution
Watermark Table Setup
What was implemented:
A Watermark Control Table was created using Spark SQL to store the last successfully processed timestamp for each data source (a sketch follows below).
Purpose:
To enable controlled and reliable incremental data ingestion.
Highlights:
- Stores Table Name and Watermark Value columns
- Initialized with a baseline timestamp
- Supports tracking across multiple source entities
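
A minimal sketch of this setup in a Fabric notebook, assuming illustrative names (Watermark_Control, TableName, WatermarkValue), since the exact schema is not shown:

# Runs in a Fabric notebook, where `spark` (the SparkSession) is preconfigured.
# Table and column names are illustrative assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS Watermark_Control (
        TableName      STRING,    -- source entity, e.g. 'ShippingLogs'
        WatermarkValue TIMESTAMP  -- last successfully processed timestamp
    ) USING DELTA
""")

# Initialize with a baseline timestamp so the first run picks up all records.
spark.sql("""
    INSERT INTO Watermark_Control
    VALUES ('ShippingLogs', TIMESTAMP '1900-01-01 00:00:00')
""")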
Lookup & Copy Activity Configuration
What was implemented:
A Lookup activity was configured to fetch the current watermark value, followed by a Copy Data activity to perform incremental loading.
Configuration details:
- Lookup retrieves the latest watermark from the control table
- Copy activity applies the filter: LogTimestamp > WatermarkValue
- Incremental records are loaded into the Bronze_ShippingLogs table
Result:
Only newly generated records are processed during each pipeline run.
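
For illustration, the equivalent lookup-and-filter logic expressed as PySpark (the pipeline itself performs this with Lookup and Copy Data activities; the file path and column names are assumptions):

from pyspark.sql.functions import col, lit

# 1. Look up the current watermark for the Shipping Logs source.
watermark = spark.sql(
    "SELECT WatermarkValue FROM Watermark_Control WHERE TableName = 'ShippingLogs'"
).first()["WatermarkValue"]

# 2. Read the source JSON files, keeping only records newer than the watermark.
new_records = (
    spark.read.json("Files/shipping_logs/")            # assumed Lakehouse Files path
         .filter(col("LogTimestamp") > lit(watermark)) # LogTimestamp > WatermarkValue
)

# 3. Append the incremental slice to the Bronze Delta table.
new_records.write.format("delta").mode("append").saveAsTable("Bronze_ShippingLogs")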
Incremental Load Validation
What was implemented:
The Bronze_ShippingLogs table was queried post-execution to validate the incremental load behavior.
Validation outcome:
- Only new records were ingested
- No duplication from previous runs
- Data consistency and integrity preserved (example checks below)
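
Example validation queries (ShipmentId is an assumed natural-key column; adjust to the actual schema):

# Confirm the ingested range: earliest/latest timestamps and row count.
spark.sql("""
    SELECT MIN(LogTimestamp) AS earliest,
           MAX(LogTimestamp) AS latest,
           COUNT(*)          AS row_count
    FROM Bronze_ShippingLogs
""").show()

# Duplicate check: any (ShipmentId, LogTimestamp) pair appearing more than once
# would indicate records reprocessed from a previous run.
spark.sql("""
    SELECT ShipmentId, LogTimestamp, COUNT(*) AS occurrences
    FROM Bronze_ShippingLogs
    GROUP BY ShipmentId, LogTimestamp
    HAVING COUNT(*) > 1
""").show()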
Watermark Update Notebook Setup
What was implemented:
A Notebook activity was added to update the watermark table after a successful data load.
Configuration details:
- Workspace: Project Workspace – Group B
- Notebook: Update_Watermark
- Triggered only after Copy activity success
Result:
Watermark values are updated automatically after each successful run.
Pipeline Timestamp Parameterization
What was implemented:
The pipeline trigger time was passed as a parameter to the notebook.
Parameter details:
- Name: pipelinetimestamp_va
- Type: String
- Value: @pipeline().TriggerTime
Result:
Notebook receives the exact execution timestamp for accurate watermark updates.
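
In a Fabric notebook the value is typically received through a parameters cell, whose default is overridden by the pipeline at run time; a minimal sketch:

# Cell tagged as the notebook's "parameters" cell.
# The pipeline overrides this default with @pipeline().TriggerTime.
pipelinetimestamp_va = "1900-01-01T00:00:00Z"  # placeholder default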
Notebook Watermark Update Logic
What was implemented:
Spark SQL logic was developed inside the notebook to update the watermark table.
Key aspects:
- Uses parameterized SQL for security (see the sketch below)
- Updates only the relevant Shipping Logs entry
- Atomic operation ensures data consistency
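
A sketch of that update, assuming the control-table names used earlier and Spark 3.4+ named-parameter support, so the timestamp is bound as a value rather than concatenated into the SQL string:

# Parameterized Spark SQL update; a Delta UPDATE commits as a single transaction.
spark.sql(
    """
    UPDATE Watermark_Control
    SET    WatermarkValue = CAST(:ts AS TIMESTAMP)
    WHERE  TableName = 'ShippingLogs'   -- update only the Shipping Logs entry
    """,
    args={"ts": pipelinetimestamp_va},
)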
Successful End-to-End Pipeline Execution
What was implemented:
The complete pipeline was executed and monitored for successful completion.
Outcome:
- End-to-end incremental ingestion validated
- Lookup, Copy, and Notebook activities executed successfully
- Pipeline Run ID captured for monitoring and audit
