Inspiration

Modern observability systems generate vast amounts of metrics, yet engineers often discover SLA breaches only after users are impacted. Alerts are noisy, dashboards are reactive, and root-cause analysis is manual. We were inspired to explore a different question: what if live traffic could explain itself in real time? BreachGuard was born from the idea that combining data in motion with AI can turn raw telemetry into immediate, actionable intelligence.


What it does

BreachGuard is a real-time, AI-powered SLA breach intelligence system. It continuously ingests live application traffic, computes SLA metrics on the fly, detects latency and availability violations, and uses AI to generate human-readable explanations and remediation insights, delivered instantly to engineers via Slack.

Instead of just signaling that something broke, BreachGuard explains what broke, why it matters, and what to do next, all while traffic is still flowing.


How we built it

  • A Node.js backend emits real-time request telemetry (latency, status codes, success signals).
  • Events are streamed into Confluent Cloud over Apache Kafka topics.
  • Confluent Cloud Flink SQL processes data entirely in motion:
    • Event-time alignment
    • Windowed p95 latency and error-rate computation
    • Continuous SLA evaluation
  • SLA breaches are emitted as structured Kafka events.
  • A Python alert engine consumes breach events and invokes Vertex AI to:
    • Interpret breach context
    • Generate explanations and recommendations
  • AI-generated insights are delivered to teams via Slack in real time.

Every component is event-driven, streaming-first, and horizontally scalable.
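
To make the windowed SLA step concrete, here is a minimal plain-Python sketch of the same idea. It is not the project's Flink SQL job: the field names, the 60-second tumbling window, the thresholds (500 ms p95, 1% error rate), and the choice to count 5xx responses as errors are all assumptions made for illustration. It also ignores late and out-of-order events, which a real event-time pipeline handles with watermarks.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestEvent:
    """One telemetry record. Field names are illustrative assumptions."""
    endpoint: str
    event_time_ms: int   # when the request happened (event time)
    latency_ms: float
    status_code: int


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def evaluate_windows(events, window_ms=60_000,
                     p95_limit_ms=500.0, error_rate_limit=0.01):
    """Group events into tumbling event-time windows and flag SLA breaches."""
    buckets = defaultdict(list)
    for e in events:
        # Align each event to the start of its window using event time.
        window_start = e.event_time_ms - (e.event_time_ms % window_ms)
        buckets[(e.endpoint, window_start)].append(e)

    breaches = []
    for key in sorted(buckets):
        endpoint, window_start = key
        group = buckets[key]
        latency_p95 = p95([e.latency_ms for e in group])
        # Assumption: any 5xx response counts against availability.
        error_rate = sum(e.status_code >= 500 for e in group) / len(group)
        reasons = []
        if latency_p95 > p95_limit_ms:
            reasons.append("latency")
        if error_rate > error_rate_limit:
            reasons.append("availability")
        if reasons:
            breaches.append({
                "endpoint": endpoint,
                "window_start_ms": window_start,
                "p95_latency_ms": latency_p95,
                "error_rate": error_rate,
                "reasons": reasons,
            })
    return breaches
```

Each returned dictionary plays the role of one structured breach event. In the streaming version, the same grouping and thresholds would be expressed as a windowed aggregation instead of an in-memory loop.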


Challenges we ran into

  • Designing accurate event-time windowing to avoid false positives under bursty traffic
  • Balancing SLA sensitivity against alert fatigue
  • Structuring Kafka topics to represent increasing semantic meaning
  • Ensuring AI outputs were concise, actionable, and trustworthy rather than generic
  • Coordinating streaming SQL logic with downstream AI interpretation
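
One common way to trade sensitivity against alert fatigue is to alert only after several consecutive breaching windows. The sketch below shows that technique in Python. It is an illustration and not necessarily how BreachGuard handles it: the class name `BreachDebouncer` and the `required_consecutive` parameter are hypothetical.

```python
from collections import defaultdict


class BreachDebouncer:
    """Fire an alert only after N consecutive breaching windows per endpoint."""

    def __init__(self, required_consecutive=3):
        self.required = required_consecutive
        self._streak = defaultdict(int)  # endpoint -> current breach streak

    def observe(self, endpoint, breached):
        """Record one window result and return True if an alert should fire.

        Returns True exactly once per streak, at the moment the streak
        reaches the required length. A healthy window resets the streak.
        """
        if not breached:
            self._streak[endpoint] = 0
            return False
        self._streak[endpoint] += 1
        return self._streak[endpoint] == self.required
```

A larger `required_consecutive` suppresses short bursts at the cost of slower detection, so the value is a tuning decision per endpoint.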

Accomplishments that we're proud of

  • Built a fully streaming SLA pipeline with zero batch processing
  • Used SQL-only Flink jobs to express complex operational logic clearly
  • Applied AI directly to live operational data, not historical logs
  • Delivered contextual, human-readable alerts instead of raw metrics
  • Demonstrated a real-world, production-relevant use case for AI on data in motion

What we learned

  • Streaming systems are most powerful when each topic represents a clear semantic contract
  • AI adds the most value when placed after signal extraction, not before
  • Real-time observability is as much about communication as computation
  • Acting on data in motion shortens the path from breach to decision compared with waiting for someone to notice a dashboard
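
The point about placing AI after signal extraction can be sketched as follows: the model sees only a compact, structured breach event, and its answer is wrapped into a Slack message. This is a hedged illustration, not the project's alert engine. The breach fields, the prompt wording, and the `generate` callable (a stand-in for whatever model client is used, such as a Vertex AI wrapper) are assumptions. The only external detail relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json


def build_prompt(breach):
    """Turn a structured breach event into a compact prompt for the model."""
    return (
        "You are an SRE assistant. In at most three sentences, explain this "
        "SLA breach and suggest one next step.\n"
        f"Breach event: {json.dumps(breach, sort_keys=True)}"
    )


def build_slack_payload(breach, explanation):
    """Build the JSON body for a Slack incoming webhook."""
    header = f"SLA breach on {breach['endpoint']} ({', '.join(breach['reasons'])})"
    return {"text": f"{header}\n{explanation.strip()}"}


def handle_breach(breach, generate):
    """Explain one breach event.

    `generate` is any callable that maps a prompt string to a response
    string. It is a hypothetical hook so the model call can be swapped
    or stubbed in tests.
    """
    explanation = generate(build_prompt(breach))
    return build_slack_payload(breach, explanation)
```

Keeping the model call behind a plain callable makes the message format testable without network access, and keeps the prompt limited to the already-extracted signal instead of raw telemetry.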

What's next for BreachGuard

  • Predictive SLA breach detection before thresholds are crossed
  • Adaptive, AI-tuned SLA thresholds per endpoint
  • Integration with incident management and auto-remediation workflows
  • Long-term learning from historical breach patterns
  • Expansion beyond SLAs into cost, security, and reliability intelligence
