BreachGuard

Inspiration

Modern observability systems generate vast amounts of metrics, yet engineers still discover SLA breaches after users are impacted. Alerts are noisy, dashboards are reactive, and root-cause analysis is manual. We were inspired to explore a different question: what if live traffic could explain itself in real time? BreachGuard was born from the idea that combining data in motion with AI can turn raw telemetry into immediate, actionable intelligence.

What it does

BreachGuard is a real-time, AI-powered SLA breach intelligence system. It continuously ingests live application traffic, computes SLA metrics on the fly, detects latency and availability violations, and uses AI to generate human-readable explanations and remediation insights, delivered instantly to engineers via Slack.

Instead of just signaling that something broke, BreachGuard explains what broke, why it matters, and what to do next, all while traffic is still flowing.

How we built it

A Node.js backend emits real-time request telemetry (latency, status codes, success signals).
Events are streamed into Confluent Cloud through Apache Kafka topics.
Confluent Cloud Flink SQL processes data entirely in motion:
- Event-time alignment
- Windowed p95 latency and error-rate computation
- Continuous SLA evaluation
SLA breaches are emitted as structured Kafka events.
A Python alert engine consumes breach events and invokes Vertex AI to:
- Interpret breach context
- Generate explanations and recommendations
AI-generated insights are delivered to teams via Slack in real time.

Every component is event-driven, streaming-first, and horizontally scalable.
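
To make the windowing step concrete, here is a minimal Python sketch of the kind of logic the Flink SQL stage expresses: group events into tumbling event-time windows, compute p95 latency and error rate per window, and flag a breach. It is an illustration, not the project's Flink SQL. The window size, both thresholds, the `RequestEvent` fields, and the choice to count 5xx responses as errors are all assumptions made for this example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

# All three values below are assumed for illustration only.
WINDOW_SECONDS = 60
P95_LATENCY_SLA_MS = 500.0
ERROR_RATE_SLA = 0.05


@dataclass
class RequestEvent:
    event_time: float   # seconds since epoch, stamped by the emitting service
    latency_ms: float
    status_code: int


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def evaluate_windows(events):
    """Bucket events by event time into tumbling windows and evaluate the SLA."""
    windows = defaultdict(list)
    for ev in events:
        window_start = int(ev.event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(ev)

    results = []
    for window_start in sorted(windows):
        batch = windows[window_start]
        latency_p95 = p95([ev.latency_ms for ev in batch])
        # Assumption: a 5xx status code counts as a failed request.
        error_rate = sum(ev.status_code >= 500 for ev in batch) / len(batch)
        results.append({
            "window_start": window_start,
            "p95_latency_ms": latency_p95,
            "error_rate": error_rate,
            "breach": latency_p95 > P95_LATENCY_SLA_MS or error_rate > ERROR_RATE_SLA,
        })
    return results
```

Bucketing by `event_time` rather than arrival time is what keeps late or bursty deliveries from being counted in the wrong window. A streaming engine additionally needs watermarks to decide when a window is complete, which this batch-style sketch leaves out.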

Challenges we ran into

Designing accurate event-time windowing to avoid false positives under bursty traffic
Balancing SLA sensitivity against alert fatigue
Structuring Kafka topics so that each stage carries more semantic meaning, from raw request telemetry up to SLA breach events
Ensuring AI outputs were concise, actionable, and trustworthy rather than generic
Coordinating streaming SQL logic with downstream AI interpretation
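
One common way to trade sensitivity against alert fatigue is to require several consecutive breaching windows before alerting, then suppress repeats for a cooldown period. The sketch below shows that pattern in Python. It is not necessarily how BreachGuard handles it; the `AlertGate` class and its default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertGate:
    """Alert only after a streak of breaching windows, then hold a cooldown."""
    required_consecutive: int = 3   # assumed default
    cooldown_seconds: int = 600     # assumed default
    _streak: int = 0
    _last_alert_at: Optional[float] = None

    def should_alert(self, window_end: float, breached: bool) -> bool:
        if not breached:
            # A healthy window resets the streak.
            self._streak = 0
            return False
        self._streak += 1
        if self._streak < self.required_consecutive:
            return False
        in_cooldown = (
            self._last_alert_at is not None
            and window_end - self._last_alert_at < self.cooldown_seconds
        )
        if in_cooldown:
            return False
        self._last_alert_at = window_end
        return True
```

A single bursty window never alerts under this scheme, while a sustained breach alerts once per cooldown period rather than once per window.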

Accomplishments that we're proud of

Built a fully streaming SLA pipeline with zero batch processing
Used SQL-only Flink jobs to express complex operational logic clearly
Applied AI directly to live operational data, not historical logs
Delivered contextual, human-readable alerts instead of raw metrics
Demonstrated a real-world, production-relevant use case for AI on data in motion

What we learned

Streaming systems are most powerful when each topic represents a clear semantic contract
AI adds the most value when placed after signal extraction, not before
Real-time observability is as much about communication as computation
Acting on data in motion enables faster decisions than waiting for someone to notice a dashboard
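
The point about placing AI after signal extraction can be sketched as follows: the model receives a small, already-computed breach event instead of raw telemetry, and its answer is wrapped for Slack. The breach fields, the service name, and the prompt wording are hypothetical, and the Vertex AI call itself is left out. The only external detail relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json

# Hypothetical breach event; the field names are assumptions for illustration.
EXAMPLE_BREACH = {
    "service": "checkout-api",
    "window_start": "2024-01-01T12:00:00Z",
    "window_end": "2024-01-01T12:01:00Z",
    "p95_latency_ms": 912.0,
    "p95_latency_sla_ms": 500.0,
    "error_rate": 0.08,
    "error_rate_sla": 0.05,
}


def build_prompt(breach: dict) -> str:
    """Turn an already-computed breach into a narrow, grounded prompt."""
    return (
        "You are assisting an on-call engineer. Using only the JSON below, "
        "state what breached, why it matters, and one or two next steps. "
        "Answer in at most five sentences.\n\n"
        + json.dumps(breach, indent=2, sort_keys=True)
    )


def to_slack_payload(breach: dict, explanation: str) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    header = f":rotating_light: SLA breach on {breach['service']}"
    return {"text": f"{header}\n{explanation}"}
```

Because the prompt contains only the extracted signal and its thresholds, the model has little room to produce generic advice, and the explanation can be checked against the numbers in the same message.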