💡 Inspiration
Modern DevOps teams are drowning in a sea of alerts. On average, engineers receive notifications from 5-10 different monitoring tools daily—GitHub, Kubernetes, Datadog, PagerDuty, Jenkins—each with its own dashboard and alert format. This fragmentation leads to:
- Alert fatigue: Critical issues get lost in the noise
- Slow incident response: Time wasted context-switching between tools
- Missed correlations: Related events across systems go unnoticed
We asked ourselves: What if AI could unify these streams and tell us what actually matters?
OpsVision was born from this question—a platform that transforms chaos into clarity using real-time stream processing and AI-powered insights.
🔨 How We Built It
OpsVision is a three-tier architecture: real-time event ingestion, stream processing, and AI-powered insight generation:
1. Event Ingestion Layer (FastAPI + Kafka)
We built a FastAPI backend that accepts webhooks from multiple DevOps tools and normalizes them into the CloudEvents specification. Events are serialized using Avro and published to Confluent Cloud Kafka for durability and scale.
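As a sketch of the normalization step, here is what wrapping a raw webhook payload in a CloudEvents 1.0 envelope might look like. The function name `normalize_event` and its parameters are illustrative, not our exact API; the field names follow the CloudEvents spec.

```python
from datetime import datetime, timezone
from uuid import uuid4

def normalize_event(source: str, event_type: str, payload: dict) -> dict:
    """Wrap a raw webhook payload in a CloudEvents 1.0 envelope.

    `source` and `event_type` are hypothetical parameters; the envelope
    keys (specversion, id, source, type, time, data) come from the spec.
    """
    return {
        "specversion": "1.0",
        "id": str(uuid4()),                     # unique per event
        "source": source,                       # e.g. "/github/webhooks"
        "type": event_type,                     # e.g. "com.github.push"
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": payload,                        # original tool payload
    }

event = normalize_event("/github/webhooks", "com.github.push", {"repo": "opsvision"})
```

In the real pipeline this dict is Avro-encoded before being published to Kafka.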
2. Stream Processing Layer (Apache Flink SQL)
The magic happens in Confluent Cloud Flink. We wrote 11 SQL statements that:
- Aggregate events into 5-minute tumbling windows:
$$W_i = [t_0 + 5i, t_0 + 5(i+1))$$
- Calculate health scores using severity weighting:
$$\text{Health Score} = \frac{10 \cdot n_{\text{critical}} + 5 \cdot n_{\text{error}} + 2 \cdot n_{\text{warning}}}{n_{\text{total events}}}$$
- Detect correlated incidents across systems using correlation IDs
- Produce a denormalized AI-ready summary table
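The health-score step can be sketched in plain Python. The 10/5/2 weights come from the formula above; the event shape (a `severity` field) is an assumption about our schema.

```python
SEVERITY_WEIGHTS = {"critical": 10, "error": 5, "warning": 2}

def health_score(events: list[dict]) -> float:
    """Severity-weighted score over one 5-minute window.

    Each event is assumed to carry a `severity` field; severities
    outside the weight table (e.g. "info") contribute 0.
    """
    if not events:
        return 0.0
    weighted = sum(SEVERITY_WEIGHTS.get(e.get("severity"), 0) for e in events)
    return weighted / len(events)

window = [{"severity": "critical"}, {"severity": "warning"}, {"severity": "info"}]
# (10 + 2 + 0) / 3 = 4.0
```

In production this runs as Flink SQL over the tumbling window, not in Python; the sketch just makes the weighting concrete.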
3. Intelligence Layer (Google Gemini + React)
A Kafka consumer feeds aggregated summaries to Google Gemini 2.0 Flash, which generates natural language insights like:
"⚠️ System Health: WARNING. Kubernetes is the primary concern with 23 pod restart errors in the last 5 minutes. Correlating with recent GitHub deployment at 14:32 UTC. Recommend: Check resource limits on affected pods."
The React dashboard displays everything in real-time via WebSockets.
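A minimal sketch of the consumer's prompt-building step, assuming illustrative field names for the summary row. The commented-out lines show roughly how the `google-generativeai` SDK would be invoked; they are not our exact integration code.

```python
import json

def build_prompt(summary: dict) -> str:
    """Turn one row of the denormalized summary table into a fixed-format
    prompt. The summary keys here are illustrative, not our real schema."""
    return (
        "You are a DevOps incident analyst. Summarize the system state "
        "in two sentences and give one recommendation.\n"
        f"Window summary (JSON):\n{json.dumps(summary, sort_keys=True)}"
    )

prompt = build_prompt({"source": "kubernetes", "error_count": 23})

# Sending it (requires the google-generativeai package and an API key):
# import google.generativeai as genai
# genai.configure(api_key="...")
# model = genai.GenerativeModel("gemini-2.0-flash")
# insight = model.generate_content(prompt).text
```

Keeping the prompt builder pure makes it easy to test the formatting without touching the API.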
📚 What We Learned
Flink SQL is Powerful—But Has Quirks
Confluent Cloud Flink lacks some window functions we take for granted (like LAG() and ROW_NUMBER() on non-time columns). We learned to work around these using:
- Self-joins with offset windows for trend detection
- Manual time bucketing for low-volume scenarios
- Application-layer calculations for complex aggregations
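The trend-detection workaround boils down to pairing each window with its predecessor, which LAG() would otherwise do. Here is an application-layer sketch; the window keys and scores are illustrative.

```python
def window_deltas(scores: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Emulate LAG(): pair each window's health score with the change
    from the previous window. `scores` is assumed sorted by window start."""
    return [
        (start, score - prev_score)
        for (_, prev_score), (start, score) in zip(scores, scores[1:])
    ]

deltas = window_deltas([("14:00", 2.0), ("14:05", 4.5), ("14:10", 3.5)])
# → [("14:05", 2.5), ("14:10", -1.0)]
```

In Flink SQL the same effect comes from self-joining the window table against itself with a one-window offset.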
Watermarks Are Everything
Stream processing requires handling late-arriving events. We learned that watermark strategy directly impacts result latency:
$$\text{Watermark} = \max(\text{event_time}) - \text{tolerance}$$
We settled on a 10-second tolerance—a balance between completeness and timeliness.
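In code form, the watermark rule is a one-liner (event times as epoch seconds are an assumption for the sketch):

```python
def watermark(event_times: list[float], tolerance_s: float = 10.0) -> float:
    """Watermark = max observed event time minus the lateness tolerance.
    Events timestamped below the watermark are treated as late."""
    return max(event_times) - tolerance_s

# With events at t=100, 105, 103 and a 10s tolerance, the watermark
# sits at 95: anything older than that is considered late.
```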
AI Needs Structure
Gemini works best when given structured, consistent input. Our denormalized gemini_summary table ensures the AI receives the same format every time, leading to more reliable insights.
🚧 Challenges We Faced
Challenge 1: Watermark Advancement in Low-Volume Scenarios
Problem: Flink needed roughly 250 events per partition before its watermarks would advance, far more traffic than a live demo generates.
Solution: We created an alternative aggregation strategy using manual time bucketing:
FLOOR(`time` TO MINUTE) - INTERVAL '1' MINUTE * (EXTRACT(MINUTE FROM `time`) MOD 5)
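The same bucketing logic, sketched in Python for readability: truncate the timestamp to the minute, then subtract (minute mod 5) minutes so every event lands on a 5-minute boundary.

```python
from datetime import datetime, timedelta

def bucket_5min(ts: datetime) -> datetime:
    """Floor a timestamp to its 5-minute bucket, mirroring the SQL above."""
    floored = ts.replace(second=0, microsecond=0)   # FLOOR(time TO MINUTE)
    return floored - timedelta(minutes=floored.minute % 5)

# 14:33:42 → 14:30:00
bucket = bucket_5min(datetime(2025, 1, 1, 14, 33, 42))
```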
Challenge 2: Real-Time AI Without Overloading
Problem: Calling Gemini on every event would be expensive and slow.
Solution: We batch events into 5-minute windows, reducing API calls by 99% while maintaining near-real-time insights.
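A sketch of the batching idea: group events by window key up front so the downstream AI call runs once per window rather than once per event. The `ts` field and the 300-second window size are illustrative.

```python
from collections import defaultdict

def group_by_window(events: list[dict], window_s: int = 300) -> dict[int, list[dict]]:
    """Batch events into fixed windows keyed by floor(ts / window_s).
    One Gemini call is then made per window, not per event."""
    windows: dict[int, list[dict]] = defaultdict(list)
    for e in events:
        windows[e["ts"] // window_s].append(e)
    return dict(windows)

events = [{"ts": 10}, {"ts": 250}, {"ts": 310}]
# → window 0 holds two events, window 1 holds one: 3 events, 2 API calls
```

At ~100 events per window, one call per window is a 99% reduction versus one call per event.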
Challenge 3: Cross-System Event Correlation
Problem: How do you link a Kubernetes pod crash to a GitHub deployment?
Solution: We implemented correlation IDs that propagate across systems, allowing Flink to group related events:
GROUP BY `correlation_id`, TUMBLE(...)
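The effect of that grouping can be sketched in Python: for each correlation ID, collect which systems it touched, so a pod crash and the deployment that caused it surface together. Field names here are assumptions.

```python
from collections import defaultdict

def correlated_sources(events: list[dict]) -> dict[str, list[str]]:
    """For each correlation_id, report the systems it appeared in.
    A Kubernetes crash sharing an ID with a GitHub deploy shows up
    under the same key."""
    groups: dict[str, set] = defaultdict(set)
    for e in events:
        groups[e["correlation_id"]].add(e["source"])
    return {cid: sorted(srcs) for cid, srcs in groups.items()}

events = [
    {"correlation_id": "deploy-42", "source": "github"},
    {"correlation_id": "deploy-42", "source": "kubernetes"},
]
# → {"deploy-42": ["github", "kubernetes"]}
```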
Challenge 4: Avro Deserialization with Schema Registry
Problem: Confluent Schema Registry integration required precise configuration.
Solution: We built a reusable AvroDeserializer class that handles schema fetching and caching automatically.
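A minimal sketch of the caching idea behind that class, with the registry's HTTP lookup (GET /schemas/ids/{id}) stubbed out as a pluggable `fetch` callable; the class and method names are illustrative.

```python
from typing import Callable

class CachingSchemaResolver:
    """Fetch each schema ID from the registry once, then serve it
    from memory. `fetch` stands in for the Schema Registry HTTP call."""

    def __init__(self, fetch: Callable[[int], str]):
        self._fetch = fetch
        self._cache: dict[int, str] = {}

    def schema(self, schema_id: int) -> str:
        if schema_id not in self._cache:          # miss: hit the registry
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]             # hit: no network round-trip
```

With the fetch function injected, the caching behavior is trivially testable: resolving the same ID twice triggers only one registry call.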
🎯 What's Next
- Predictive alerts: Use historical patterns to predict incidents before they happen
- Runbook automation: Let Gemini suggest and execute remediation steps
- Multi-tenant support: Scale to enterprise deployments
OpsVision: Because DevOps teams deserve clarity, not chaos.