Fortune 500 Logistics Provider — Real-Time Fleet Tracking

Summary
Built an event-driven fleet tracking platform on Azure using Event Hubs (Kafka endpoint) and AKS. GPS/vehicle telemetry is validated, deduplicated, and upserted into PostgreSQL (PostGIS) for geospatial queries, with Blob Storage capture for analytics. Live status and ETAs are exposed through API Management with clear SLAs, cutting WISMO call volume and enabling steady OTIF (on-time, in-full) improvements.
Problem
- Stale location data and missed events from heterogeneous devices/apps.
- No single low-latency source of truth for “last known location,” geofence status, or ETA.
- WISMO (“Where is my order?”) calls spiked during delays; updates to CRM/notifications were inconsistent.
Solution Mechanics
Primary pattern: Event-driven streaming (Java + Spring Boot on AKS).
Ingestion (Kafka on Azure)
- Devices/mobile apps publish to Azure Event Hubs (Kafka endpoint) with key = vehicleId to preserve ordering per vehicle; a producer sketch follows below.
- Topics: telemetry.raw, telemetry.parsed, events.geofence, errors.dlq.
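A minimal producer sketch of this keying scheme, using the plain Kafka client against the Event Hubs Kafka endpoint. The namespace, connection string, and payload fields are placeholders, not the project's actual configuration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TelemetryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Event Hubs exposes a Kafka endpoint on port 9093 with SASL_SSL + PLAIN auth.
        // Namespace and connection string below are placeholders.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-namespace.servicebus.windows.net:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"$ConnectionString\" password=\"<event-hubs-connection-string>\";");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        String vehicleId = "TRUCK-042";   // partition key: one vehicle -> one partition
        String payload = "{\"vehicleId\":\"TRUCK-042\",\"lat\":52.52,\"lon\":13.405,"
            + "\"speedKph\":63.0,\"eventTs\":\"2024-05-01T10:15:30Z\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by vehicleId keeps all events for a vehicle on the same partition,
            // which is what preserves per-vehicle ordering for downstream consumers.
            producer.send(new ProducerRecord<>("telemetry.raw", vehicleId, payload));
            producer.flush();
        }
    }
}
```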
Processing (AKS / Spring Boot)
- Telemetry Ingestor (Spring Kafka): JSON schema validation, clock-skew checks, dedupe by (vehicleId, eventTs), publish to telemetry.parsed; a consumer sketch follows below.
- Enricher/Aggregator: computes last-known location, speed, heading, stop/idle detection, and ETA; emits geofence enter/exit to events.geofence.
- Status Writer: idempotent UPSERT of per-vehicle status into Azure Database for PostgreSQL (PostGIS) with POINT geometry; maintains a compact history table for recent windows.
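A sketch of the ingestor's dedupe and skew checks, assuming a simple record shape and an in-memory last-seen map; in the real service, JSON Schema validation runs before deserialization and dedupe state would be persisted. Class, group, and record names are illustrative.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class TelemetryIngestor {

    // Minimal event shape assumed for this sketch.
    public record TelemetryEvent(String vehicleId, double lat, double lon,
                                 double speedKph, Instant eventTs) {}

    private final KafkaTemplate<String, TelemetryEvent> kafkaTemplate;
    // Last processed event time per vehicle. Because events are keyed by vehicleId,
    // one vehicle is always handled by the same consumer thread, so this map is safe here.
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    public TelemetryIngestor(KafkaTemplate<String, TelemetryEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "telemetry.raw", groupId = "telemetry-ingestor")
    public void onEvent(TelemetryEvent event) {
        // Basic validation; malformed events go to the DLQ for later replay.
        if (event.vehicleId() == null || event.eventTs() == null) {
            kafkaTemplate.send("errors.dlq", event.vehicleId(), event);
            return;
        }
        // Clock-skew guard: reject events stamped implausibly far in the future.
        if (event.eventTs().isAfter(Instant.now().plusSeconds(120))) {
            kafkaTemplate.send("errors.dlq", event.vehicleId(), event);
            return;
        }
        // Dedupe on (vehicleId, eventTs): drop anything at or behind the last processed
        // timestamp (a simplification; late-but-valid events are handled with a
        // tolerance window downstream).
        Instant previous = lastSeen.get(event.vehicleId());
        if (previous != null && !event.eventTs().isAfter(previous)) {
            return;
        }
        lastSeen.put(event.vehicleId(), event.eventTs());

        // Forward the normalized event for enrichment.
        kafkaTemplate.send("telemetry.parsed", event.vehicleId(), event);
    }
}
```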
Storage & Analytics
- Event Hubs Capture → Azure Blob Storage (Parquet) for long-term analytics and model training.
APIs & Notifications
- API Orchestration Layer (Spring Boot behind Azure API Management): GET /vehicles/{id}/status, GET /vehicles/search?bbox=…&since=…, POST /subscriptions/webhook (register customer/CRM webhooks); a controller sketch follows below.
- Azure Service Bus topics: fan out status changes and ETA deltas to CRM, customer comms, and alerting services.
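A sketch of the read surface exposed behind APIM, assuming a repository abstraction over the PostGIS snapshot table; the DTO and repository names are illustrative, not the actual codebase.

```java
import java.time.Instant;
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class VehicleStatusController {

    // Minimal response shape assumed for this sketch.
    public record VehicleStatus(String vehicleId, double lat, double lon,
                                String state, Instant eventTs, Instant etaNextStop) {}

    // Repository abstraction assumed; the PostGIS queries behind it are sketched
    // under Process Flow below.
    public interface VehicleStatusRepository {
        VehicleStatus findSnapshot(String vehicleId);
        List<VehicleStatus> findInBoundingBox(double minLon, double minLat,
                                              double maxLon, double maxLat, Instant since);
    }

    private final VehicleStatusRepository repository;

    public VehicleStatusController(VehicleStatusRepository repository) {
        this.repository = repository;
    }

    // Live snapshot for a single vehicle, served from the Postgres/PostGIS snapshot table.
    @GetMapping("/vehicles/{id}/status")
    public VehicleStatus status(@PathVariable("id") String vehicleId) {
        return repository.findSnapshot(vehicleId);
    }

    // Bounding-box search: bbox = "minLon,minLat,maxLon,maxLat", since = ISO-8601 timestamp.
    @GetMapping("/vehicles/search")
    public List<VehicleStatus> search(@RequestParam("bbox") String bbox,
                                      @RequestParam("since") String since) {
        String[] c = bbox.split(",");
        return repository.findInBoundingBox(
            Double.parseDouble(c[0]), Double.parseDouble(c[1]),
            Double.parseDouble(c[2]), Double.parseDouble(c[3]),
            Instant.parse(since));
    }
}
```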
Observability & Ops
- Micrometer → Azure Monitor/App Insights (producer lag, consumer lag, p95 ingest→status, DLQ depth).
- Replay tool: reprocess from Blob or errors.dlq by time range/vehicle; a seek-by-timestamp sketch follows below.
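A sketch of the replay control for errors.dlq, assuming replay-by-time-range via the Kafka client's offsetsForTimes lookup; topic names and the resubmission step are placeholders for the actual tool.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DlqReplay {

    // Replays errors.dlq from a given start time, optionally filtered to one vehicle.
    // A real tool would loop until an end timestamp; this sketch polls a single batch.
    public static void replay(Properties props, Instant from, String vehicleIdFilter) {
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("group.id", "dlq-replay-" + System.currentTimeMillis()); // throwaway group
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<PartitionInfo> partitions = consumer.partitionsFor("errors.dlq");
            Map<TopicPartition, Long> query = new HashMap<>();
            for (PartitionInfo p : partitions) {
                query.put(new TopicPartition(p.topic(), p.partition()), from.toEpochMilli());
            }
            consumer.assign(query.keySet());

            // Resolve the earliest offset at or after the requested timestamp per partition.
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
            offsets.forEach((tp, oat) -> {
                if (oat != null) {
                    consumer.seek(tp, oat.offset());
                }
            });

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                if (vehicleIdFilter != null && !vehicleIdFilter.equals(record.key())) {
                    continue; // key is vehicleId, so filtering by key selects one vehicle
                }
                // Re-submit to the ingestion topic (or hand to the ingestor directly).
                System.out.printf("replaying %s @ %d%n", record.key(), record.timestamp());
            }
        }
    }
}
```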
Diagram 1 - Context Diagram — Real-time fleet tracking on Azure
Diagram 2 - Sequence — Telemetry ingest to live status/ETA
Diagram 3 - Operations — DLQ & replay controls
Process Flow
- Producer (truck device/mobile) publishes a GPS/vehicle event to Event Hubs (Kafka) with key = vehicleId.
- Telemetry Ingestor validates the schema, drops duplicates (same vehicleId + eventTs), normalizes coordinates/timezone, and writes to telemetry.parsed.
- Enricher/Aggregator calculates last-known status (moving/idle), speed, geofence enter/exit, and ETA to next stop/hub.
- Status Writer upserts current snapshot into Postgres/PostGIS and appends a slim history row (TTL/partitioned).
- API Orchestration Layer serves GET /status and geospatial searches (e.g., bounding box) with p95 < 300 ms; a PostGIS upsert/query sketch follows this list.
- Service Bus publishes status/ETA changes to CRM and customer notification services; retries/DLQ are handled at the messaging layer.
- Event Hubs Capture writes raw streams to Blob; ops can replay selected windows to recover from defects.
- App Insights dashboards track freshness (ingest→status), lag per consumer group, and DLQ trends.
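A sketch of the Status Writer's idempotent upsert and the bounding-box read, assuming a vehicle_status table with a geometry(Point, 4326) column; the table, column names, and plain-JDBC access are illustrative rather than the actual schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

public class StatusWriterDao {

    // Idempotent upsert of the per-vehicle snapshot. Re-delivered or replayed events
    // simply rewrite the row with the same values, so retries and replays are safe.
    private static final String UPSERT = """
        INSERT INTO vehicle_status (vehicle_id, geom, speed_kph, state, event_ts)
        VALUES (?, ST_SetSRID(ST_MakePoint(?, ?), 4326), ?, ?, ?)
        ON CONFLICT (vehicle_id) DO UPDATE
           SET geom = EXCLUDED.geom,
               speed_kph = EXCLUDED.speed_kph,
               state = EXCLUDED.state,
               event_ts = EXCLUDED.event_ts
         WHERE EXCLUDED.event_ts >= vehicle_status.event_ts  -- never let an older event overwrite a newer snapshot
        """;

    // Bounding-box search behind GET /vehicles/search?bbox=…&since=…
    private static final String BBOX_QUERY = """
        SELECT vehicle_id, ST_X(geom) AS lon, ST_Y(geom) AS lat, state, event_ts
          FROM vehicle_status
         WHERE geom && ST_MakeEnvelope(?, ?, ?, ?, 4326)
           AND event_ts >= ?
        """;

    public void upsert(Connection conn, String vehicleId, double lon, double lat,
                       double speedKph, String state, Instant eventTs) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            ps.setString(1, vehicleId);
            ps.setDouble(2, lon);   // note: PostGIS points are (lon, lat)
            ps.setDouble(3, lat);
            ps.setDouble(4, speedKph);
            ps.setString(5, state);
            ps.setTimestamp(6, Timestamp.from(eventTs));
            ps.executeUpdate();
        }
    }

    public void searchBbox(Connection conn, double minLon, double minLat,
                           double maxLon, double maxLat, Instant since) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(BBOX_QUERY)) {
            ps.setDouble(1, minLon);
            ps.setDouble(2, minLat);
            ps.setDouble(3, maxLon);
            ps.setDouble(4, maxLat);
            ps.setTimestamp(5, Timestamp.from(since));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s (%f, %f) %s%n", rs.getString("vehicle_id"),
                        rs.getDouble("lat"), rs.getDouble("lon"), rs.getString("state"));
                }
            }
        }
    }
}
```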
Outcomes
- Fresher location data: ingest→status p95 within 5–8 s during peak (Verified in pre-prod load tests).
- Lower WISMO calls: proactive status updates and ETA deltas reduce “where is my order?” inquiries (Modeled −15–25% based on alert subscription uptake).
- OTIF uplift: geofence/ETA signals enable better exception handling (Modeled +2–5% assuming intervention on predicted delays).
- Single query surface for live tracking with bounding-box searches (Verified functional).
Strategic Business Impact
- Customer experience lift (Proxy): real-time visibility lowers uncertainty and escalations.
- Operational efficiency (Modeled): dispatcher actions on predicted late arrivals stabilize downstream slots.
- Data asset creation (Proxy): clean archive (Parquet on Blob) unlocks planning and driver scoring use cases.
Method tags: Verified (measured in environment tests), Modeled (estimates from baselines), Proxy (leading indicators such as freshness and adoption).
Role & Scope
Owned architecture and build for Event Hubs topics/partitions, AKS services (Ingestor, Aggregator, Status/API), Postgres/PostGIS schema, APIM exposures, Service Bus integration, Capture/replay, and observability dashboards.
Key Decisions & Trade-offs
- Event Hubs (Kafka endpoint) vs self-managed Kafka: managed ops and elastic throughput vs fewer broker-level knobs.
- Idempotent upserts over “exactly-once”: simpler recovery and replay safety vs slightly more write overhead.
- Postgres/PostGIS for hot geospatial reads vs specialized time-series DB: strong geospatial, fewer moving parts.
- Capture to Blob (Parquet) for durable history vs retaining long windows in Postgres: cheap storage with batch-friendly format.
- Per-vehicle partition key: preserves order per vehicle but can create hot partitions for large fleets → mitigated with partition scaling and compaction policies.
Risks & Mitigations
- Out-of-order/late events → sequence by event time with a tolerance window; recompute the snapshot if a late event arrives (see the sketch after this list).
- Clock skew → server-side timestamping + drift detection; reject extreme skews.
- Producer dropouts → heartbeat detection; create “stale” status after threshold and alert.
- Traffic spikes → autoscale AKS consumers; pre-provision Event Hubs throughput units.
- Privacy/PII → keep payload minimal (vehicleId, coords, timestamps); secure tokens via APIM; encrypt at rest and in transit.
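A small sketch of how the late-event and clock-skew rules could be expressed in the enricher; the 5-minute tolerance window and 2-minute skew bound are assumptions, not measured values from the project.

```java
import java.time.Duration;
import java.time.Instant;

public class LateEventPolicy {

    // Tolerance window for late/out-of-order events; the actual value is an assumption.
    private static final Duration TOLERANCE = Duration.ofMinutes(5);

    public enum Action { APPLY, RECOMPUTE_SNAPSHOT, DISCARD }

    // Decide how the enricher treats an incoming event relative to the current snapshot.
    public Action classify(Instant snapshotTs, Instant eventTs, Instant serverNow) {
        // Clock-skew guard: reject events stamped implausibly far in the future.
        if (eventTs.isAfter(serverNow.plus(Duration.ofMinutes(2)))) {
            return Action.DISCARD;
        }
        if (snapshotTs == null || eventTs.isAfter(snapshotTs)) {
            return Action.APPLY;                  // normal in-order event
        }
        if (Duration.between(eventTs, snapshotTs).compareTo(TOLERANCE) <= 0) {
            return Action.RECOMPUTE_SNAPSHOT;     // late but inside the tolerance window
        }
        return Action.DISCARD;                    // too late to affect the live snapshot
    }
}
```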
Suggested Metrics (run-time SLOs)
- Ingest→status latency p50/p95/p99.
- Event Hubs lag (per consumer group) & throughput units utilization.
- DLQ depth and replay success rate.
- API p95 for /status and bbox searches.
- Freshness % (vehicles with updates in the last N seconds); a Micrometer sketch for these SLOs follows this list.
- Notification latency (status change → Service Bus → consumer).
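A Micrometer sketch for two of these SLOs, ingest→status latency percentiles and freshness %; the metric names and the 60-second freshness window are assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class FleetSlos {

    private final Timer ingestToStatus;
    private final Map<String, Instant> lastStatusTs = new ConcurrentHashMap<>();

    public FleetSlos(MeterRegistry registry) {
        // Ingest -> status latency; percentiles feed the p50/p95/p99 SLOs.
        this.ingestToStatus = Timer.builder("fleet.ingest_to_status.latency")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);

        // Freshness %: share of vehicles with a status update in the last 60 seconds.
        Gauge.builder("fleet.status.freshness_ratio", this, FleetSlos::freshnessRatio)
            .register(registry);
    }

    // Called by the Status Writer after the snapshot row is committed.
    public void recordStatusWritten(String vehicleId, Instant eventTs) {
        ingestToStatus.record(Duration.between(eventTs, Instant.now()));
        lastStatusTs.put(vehicleId, Instant.now());
    }

    private double freshnessRatio() {
        if (lastStatusTs.isEmpty()) {
            return 1.0;
        }
        Instant cutoff = Instant.now().minusSeconds(60);
        long fresh = lastStatusTs.values().stream().filter(ts -> ts.isAfter(cutoff)).count();
        return (double) fresh / lastStatusTs.size();
    }
}
```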
Closing principle
Favor freshness and recovery over perfect delivery. Design every stage for idempotence and replay, so telemetry pipelines stay reliable under real-world noise.