September 9, 2025

Tier 1 US Telecom — Unified Topology Graph Platform (Ingestion + Stitching)

Figure: Unified topology platform, conceptual architecture.

Summary

Built a standards-aligned topology platform that ingests inventory from 30+ OSS sources, normalizes it to the TM Forum SID and ITU-T M.3100 models, persists it in Neo4j, and stitches relationships in both bulk and real time, so that service-impact and assurance traversals return in seconds.


Problem

Topology and inventory lived in heterogeneous systems (MongoDB, Oracle, legacy OSS, flat files). Schemas conflicted, lineage was unclear, and assurance teams lacked a single, trustworthy view. Impact analysis required manual joins; queries took hours, CMDB replicas drifted, and integration costs rose with every new source.


Solution Mechanics

  • Multi-source CDC ingestion via Kafka Connect (Mongo/Oracle/file/legacy interfaces).
  • Canonical modeling to TMF SID + ITU M.3100 (equipment, termination points, connections, capacity, alarms).
  • Stream processing with Apache Flink; batch backfills with Spring Batch.
  • Neo4j as the read-optimized topology store; GraphQL/REST for traversal and downstream replication.
  • Rules-driven stitching:
    • Rule templates stored in MongoDB as JSON documents.
    • A Java rules library generates Cypher and enforces relationship uniqueness.
    • Two paths: bulk (historical) and real-time (CDC events).
  • Validation & remediation run in parallel (constraints, rule engine, audited invalid store) to keep integrity high.
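
The rules-driven stitching described above can be sketched roughly as follows. This is a minimal illustration, not the project's Java rules library: it is written in Python for brevity, and every template field name (`from_label`, `rel_type`, and so on) and the `rule_to_cypher` helper are assumptions. It shows one plausible way a JSON rule template could be turned into a parameterised Cypher statement, using `MERGE` so that re-running the rule does not create duplicate relationships.

```python
import re

# Hypothetical rule template, shaped like a JSON document that could live in
# MongoDB. Every field name here is an assumption made for illustration.
RULE = {
    "name": "tp_on_equipment",
    "from_label": "TerminationPoint",
    "to_label": "Equipment",
    "rel_type": "HOSTED_ON",
    "from_key": "equipmentId",  # property on the source node
    "to_key": "id",             # property on the target node
}

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def rule_to_cypher(rule: dict) -> str:
    """Render a parameterised Cypher statement from a rule template.

    Labels, relationship types and property names cannot be passed as Cypher
    parameters, so they are checked against a strict identifier pattern
    before being interpolated. MERGE keeps relationship creation idempotent.
    """
    for name in ("from_label", "to_label", "rel_type", "from_key", "to_key"):
        if not _IDENT.match(rule[name]):
            raise ValueError(f"unsafe identifier in rule field {name!r}")
    return (
        "UNWIND $rows AS row "
        f"MATCH (a:{rule['from_label']} {{id: row.fromId}}) "
        f"MATCH (b:{rule['to_label']} {{{rule['to_key']}: a.{rule['from_key']}}}) "
        f"MERGE (a)-[:{rule['rel_type']}]->(b)"
    )


if __name__ == "__main__":
    print(rule_to_cypher(RULE))
```

The same generated statement could serve both paths: the bulk path would pass large `$rows` batches, while the real-time path would pass the handful of ids touched by a CDC event.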

Diagram 1 — System Flow

System flow: ingestion → transform → graph → consumers. Multi-source CDC → Flink transform → Neo4j → assurance/CMDB consumers.

Diagram 2 — Rules-Driven Stitching

Rules-driven stitching: templates, library, Cypher, sinks. Templates in MongoDB → rules processor → Cypher → Neo4j sink (bulk + real-time).


How it works (step-by-step)

  1. CDC connectors capture changes from MongoDB / Oracle / file feeds / legacy OSS.
  2. Events land in Kafka (topics partitioned by entity/type, idempotent keys).
  3. Apache Flink normalizes to TMF SID + ITU M.3100 and enriches with lineage.
  4. Nodes & properties are upserted into Neo4j with constraints and relationship uniqueness.
  5. A stitcher (Flink for real time, Spring Batch for bulk) loads the rule templates from MongoDB, generates Cypher, and creates relationships.
  6. Validation classifies records (valid / remediable / invalid); invalids go to an audited store with remediation loops.
  7. Consumers (assurance dashboards, GraphQL/REST clients, CMDB replication) query the connected graph in seconds.
  8. Backfills run in bulk windows while CDC maintains freshness with DLQs, backpressure, and controlled replay.
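
Step 6 (valid / remediable / invalid) can be sketched as a small classifier. The specific checks below are assumptions chosen only to make the three outcomes concrete: a record is invalid when a required property is missing or its type is unknown, remediable when a mechanical fix (trimming whitespace, correcting casing) makes it valid, and valid otherwise. The names `REQUIRED`, `KNOWN_TYPES`, and `classify` are hypothetical.

```python
from dataclasses import dataclass, field

REQUIRED = ("id", "type")  # assumed mandatory properties (string-valued)
KNOWN_TYPES = {"Equipment", "TerminationPoint", "Connection"}  # assumed subset


@dataclass
class Result:
    status: str  # "valid", "remediable" or "invalid"
    record: dict  # the (possibly repaired) record
    reasons: list = field(default_factory=list)


def classify(record: dict) -> Result:
    """Classify one normalised record and apply mechanical fixes if possible."""
    missing = [k for k in REQUIRED if not str(record.get(k) or "").strip()]
    if missing:
        return Result("invalid", record, [f"missing {k}" for k in missing])

    fixed = dict(record)
    reasons = []
    if fixed["id"] != fixed["id"].strip():
        fixed["id"] = fixed["id"].strip()
        reasons.append("trimmed id")
    if fixed["type"] not in KNOWN_TYPES:
        match = next(
            (t for t in KNOWN_TYPES if t.lower() == fixed["type"].lower()), None
        )
        if match is None:
            # No rule-backed fix exists: route to the audited invalid store.
            return Result("invalid", record, [f"unknown type {fixed['type']!r}"])
        fixed["type"] = match
        reasons.append("normalised type casing")
    return Result("remediable" if reasons else "valid", fixed, reasons)
```

Keeping the reasons alongside each result is what would make an invalid store auditable: every rejected or repaired record carries an explanation that a remediation loop can act on.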

Outcomes

  • Single source of truth for topology spanning 30+ systems.
  • Near-real-time freshness through CDC with idempotent upserts and DLQs.
  • End-to-end traversals (Equipment → TPs → Connections → Capacity → Customer) in seconds, not hours.
  • Reduced defect leakage via validation and remediation loops (audited invalids, rule-backed fixes).
  • Plug-and-play consumers: assurance dashboards, CMDB replication, GraphQL impact APIs.
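
To make the end-to-end traversal concrete, here is a toy, in-memory version of the Equipment → TPs → Connections → Capacity → Customer walk. It is only a sketch: the platform runs such traversals in Neo4j, whereas this uses a plain adjacency list, and all node ids and edges are invented for illustration.

```python
from collections import deque

# Toy adjacency list; node ids and edges are invented for illustration.
EDGES = {
    "equipment:olt-1": ["tp:olt-1/1", "tp:olt-1/2"],
    "tp:olt-1/1": ["connection:c-10"],
    "tp:olt-1/2": ["connection:c-11"],
    "connection:c-10": ["capacity:cap-a"],
    "connection:c-11": ["capacity:cap-b"],
    "capacity:cap-a": ["customer:acme"],
    "capacity:cap-b": ["customer:acme", "customer:globex"],
}


def impacted(start: str, kind: str = "customer") -> list:
    """Breadth-first walk from `start`, returning reachable nodes of `kind`."""
    seen, queue, hits = {start}, deque([start]), set()
    while queue:
        node = queue.popleft()
        if node.startswith(kind + ":"):
            hits.add(node)
        for nxt in EDGES.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(hits)


if __name__ == "__main__":
    print(impacted("equipment:olt-1"))
```

Once relationships are stitched, impact analysis reduces to a reachability query like this one, which is why it no longer needs manual joins across source systems.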

Strategic Business Impact

  • MTTR reduction on correlation-heavy incidents (Modeled; Med–High confidence)
    Assumptions: baseline MTTR, % incidents requiring topology correlation, path-finding time cut from hours → minutes.
    Modeled outcome: 20–40% MTTR reduction for those incidents → fewer customer-visible minutes.

  • Integration cost avoidance (Proxy)
    Standards-aligned model + reusable connectors reduced per-source mapping/rework.

  • Assurance capability uplift (Proxy)
    “Query-in-seconds” impact assessments replaced manual joins—faster decisions during events.

Method tags: MTTR = Modeled; Cost avoidance & capability uplift = Proxy.
Evidence hooks: traversal timings, incident post-mortems, connector catalog, runbooks.
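
The modeled MTTR figure applies only to correlation-heavy incidents, so its effect on overall MTTR depends on the assumptions listed above. The sketch below shows one way to combine them. The function name and all inputs are hypothetical, and no baseline values from the engagement are implied.

```python
def blended_mttr_reduction(share: float, cut: float,
                           mttr_corr: float, mttr_other: float) -> float:
    """Fractional reduction in overall mean MTTR.

    share      -- fraction of incidents that need topology correlation (0..1)
    cut        -- modeled MTTR reduction for those incidents (e.g. 0.2 to 0.4)
    mttr_corr  -- baseline mean MTTR of correlation-heavy incidents
    mttr_other -- baseline mean MTTR of all other incidents
    """
    if not (0 <= share <= 1 and 0 <= cut <= 1):
        raise ValueError("share and cut must be fractions between 0 and 1")
    before = share * mttr_corr + (1 - share) * mttr_other
    if before <= 0:
        raise ValueError("baseline MTTR must be positive")
    after = share * (1 - cut) * mttr_corr + (1 - share) * mttr_other
    return (before - after) / before
```

With equal baselines, the overall reduction is simply `share * cut`: if half of all incidents were correlation-heavy and their MTTR fell by 40%, overall MTTR would fall by 20%. These inputs are illustrative, not measured.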


Role & Scope

  • Led end-to-end architecture across ingestion, canonical modeling, Neo4j persistence, API exposures, and rules-driven stitching.
  • Aligned discovery/planning/CMDB/assurance teams on model contracts, SLAs, and governance (versioned rules, invalid-data workflow, DLQ recovery).

Key Decisions & Trade-offs

  • Standards first (TMF/ITU): higher up-front modeling, lower lifetime integration cost; avoids schema drift.
  • Graph as topology system of record: fast traversals and impact analysis; tuned constraints/APOC and batching for write throughput.
  • Rules outside code: faster iteration; enforced versioning, testing, and rollout gates.
  • Dual stitching paths (bulk + real-time): correctness for backfills, freshness for CDC; dedupe/uniqueness rules to prevent duplicates.
  • CDC over scheduled ETL: higher freshness; required idempotency and backpressure controls.
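
The idempotency requirement that comes with CDC can be illustrated with a sequence-guarded upsert. This is a sketch under assumptions: the event shape (`id`, `seq`, `op`, `props`) is hypothetical, and a dictionary stands in for the graph store. The point is that replaying or reordering events leaves the final state unchanged.

```python
def apply_event(store: dict, event: dict) -> bool:
    """Apply one CDC event to `store`; return True if the state changed.

    An event is applied only when its sequence number is newer than what is
    already stored for that entity, so duplicates and stale replays are no-ops.
    """
    key = event["id"]
    current = store.get(key)
    if current is not None and current["seq"] >= event["seq"]:
        return False  # duplicate or out-of-order replay
    if event.get("op") == "delete":
        # Keep a tombstone so a late, older upsert cannot resurrect the entity.
        store[key] = {"seq": event["seq"], "deleted": True, "props": {}}
    else:
        store[key] = {
            "seq": event["seq"],
            "deleted": False,
            "props": dict(event["props"]),
        }
    return True
```

The tombstone on delete is a design choice rather than a requirement: without it, a replayed upsert that arrives after a delete would silently recreate the node.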

Risks & Mitigations

  • Rule explosion / write contention → constrained rule grammar, pre-validation, batched Cypher, relationship uniqueness, bulk windows.
  • CDC volume spikes / lag → DLQs with auto-replay, backpressure, snapshot markers.
  • Model drift across sources → canonical schema registry + contract tests per connector.
  • Skewed relationship creation → partitioned processing, p95 write SLOs, adaptive batch sizing.
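
The text does not say which policy drove adaptive batch sizing, so the sketch below swaps in a simple additive-increase, multiplicative-decrease (AIMD) rule as one plausible option. The function name, step size, and bounds are all assumptions.

```python
def next_batch_size(current: int, p95_ms: float, slo_ms: float, *,
                    step: int = 100, floor: int = 50, ceiling: int = 5000) -> int:
    """Pick the next write batch size from the last observed p95 latency.

    Halve the batch when the p95 write latency breaches the SLO, otherwise
    grow it by a fixed step; the result always stays within [floor, ceiling].
    """
    if p95_ms > slo_ms:
        return max(floor, current // 2)
    return min(ceiling, current + step)
```

For example, `next_batch_size(1000, p95_ms=800, slo_ms=500)` returns 500, while the same call with a p95 of 300 ms returns 1100. Backing off quickly and recovering slowly tends to keep a skewed partition from repeatedly breaching the write SLO.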

Suggested Metrics (run-time SLOs)

  • CDC lag p95 per source; DLQ depth & recovery time.
  • Traversal latency p50/p95 by path length; relationship creation throughput per entity.
  • Invalid→valid remediation rate; constraint violation rate (trend).
  • Consumer freshness for CMDB/assurance datasets.
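
As a sketch of how the first metric could be computed, the snippet below derives a per-source p95 from raw lag samples using the nearest-rank method. The input shape (pairs of source name and lag in seconds) is an assumption. In practice this would more likely come from a metrics backend than from application code.

```python
import math
from collections import defaultdict


def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty sequence (pct in 1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def lag_p95_by_source(events):
    """Group (source, lag_seconds) pairs and return the p95 lag per source."""
    by_source = defaultdict(list)
    for source, lag in events:
        by_source[source].append(lag)
    return {source: percentile(lags, 95) for source, lags in by_source.items()}
```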

Principle carried forward:

Standards-aligned models, CDC correctness, rules-driven stitching, and graph-first assurance are the fastest route from data silos to reliable operations at telecom scale.
