💡 Inspiration
Modern DevOps teams are drowning in a sea of alerts. On average, engineers receive notifications from 5-10 different monitoring tools daily—GitHub, Kubernetes, Datadog, PagerDuty, Jenkins—each with its own dashboard and alert format. This fragmentation leads to:
- Alert fatigue: Critical issues get lost in the noise
- Slow incident response: Time wasted context-switching between tools
- Missed correlations: Related events across systems go unnoticed
We asked ourselves: What if AI could unify these streams and tell us what actually matters?
OpsVision was born from this question—a platform that transforms chaos into clarity using real-time stream processing and AI-powered insights.
🔨 How We Built It
OpsVision is a three-tier architecture: real-time event ingestion, stream processing, and AI-powered insight generation:
1. Event Ingestion Layer (FastAPI + Kafka)
We built a FastAPI backend that accepts webhooks from multiple DevOps tools and normalizes them into the CloudEvents specification. Events are serialized using Avro and published to Confluent Cloud Kafka for durability and scale.
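As a sketch of the normalization step, here is what wrapping a raw webhook payload in a CloudEvents 1.0 envelope might look like. The function name `normalize_event` and its parameters are illustrative, not our exact API; the field names follow the CloudEvents spec.

```python
from datetime import datetime, timezone
from uuid import uuid4

def normalize_event(source: str, event_type: str, payload: dict) -> dict:
    """Wrap a raw webhook payload in a CloudEvents 1.0 envelope.

    `source` and `event_type` are hypothetical parameters; the envelope
    keys (specversion, id, source, type, time, data) come from the spec.
    """
    return {
        "specversion": "1.0",
        "id": str(uuid4()),                     # unique per event
        "source": source,                       # e.g. "/github/webhooks"
        "type": event_type,                     # e.g. "com.github.push"
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": payload,                        # original tool payload
    }

event = normalize_event("/github/webhooks", "com.github.push", {"repo": "opsvision"})
```

In the real pipeline this dict is Avro-encoded before being published to Kafka.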
2. Stream Processing Layer (Apache Flink SQL)
The magic happens in Confluent Cloud Flink. We wrote 11 SQL statements that:
- Aggregate events into 5-minute tumbling windows:
$$W_i = [t_0 + 5i, t_0 + 5(i+1))$$
- Calculate health scores using severity weighting:
$$\text{Health Score} = \frac{10 \cdot n_{\text{critical}} + 5 \cdot n_{\text{error}} + 2 \cdot n_{\text{warning}}}{n_{\text{total events}}}$$
- Detect correlated incidents across systems using correlation IDs
- Produce a denormalized AI-ready summary table
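The health-score step can be sketched in plain Python. The 10/5/2 weights come from the formula above; the event shape (a `severity` field) is an assumption about our schema.

```python
SEVERITY_WEIGHTS = {"critical": 10, "error": 5, "warning": 2}

def health_score(events: list[dict]) -> float:
    """Severity-weighted score over one 5-minute window.

    Each event is assumed to carry a `severity` field; severities
    outside the weight table (e.g. "info") contribute 0.
    """
    if not events:
        return 0.0
    weighted = sum(SEVERITY_WEIGHTS.get(e.get("severity"), 0) for e in events)
    return weighted / len(events)

window = [{"severity": "critical"}, {"severity": "warning"}, {"severity": "info"}]
# (10 + 2 + 0) / 3 = 4.0
```

In production this runs as Flink SQL over the tumbling window, not in Python; the sketch just makes the weighting concrete.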
3. Intelligence Layer (Google Gemini + React)
A Kafka consumer feeds aggregated summaries to Google Gemini 2.0 Flash, which generates natural language insights like:
"⚠️ System Health: WARNING. Kubernetes is the primary concern with 23 pod restart errors in the last 5 minutes. Correlating with recent GitHub deployment at 14:32 UTC. Recommend: Check resource limits on affected pods."
The React dashboard displays everything in real-time via WebSockets.
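A minimal sketch of the consumer's prompt-building step, assuming illustrative field names for the summary row. The commented-out lines show roughly how the `google-generativeai` SDK would be invoked; they are not our exact integration code.

```python
import json

def build_prompt(summary: dict) -> str:
    """Turn one row of the denormalized summary table into a fixed-format
    prompt. The summary keys here are illustrative, not our real schema."""
    return (
        "You are a DevOps incident analyst. Summarize the system state "
        "in two sentences and give one recommendation.\n"
        f"Window summary (JSON):\n{json.dumps(summary, sort_keys=True)}"
    )

prompt = build_prompt({"source": "kubernetes", "error_count": 23})

# Sending it (requires the google-generativeai package and an API key):
# import google.generativeai as genai
# genai.configure(api_key="...")
# model = genai.GenerativeModel("gemini-2.0-flash")
# insight = model.generate_content(prompt).text
```

Keeping the prompt builder pure makes it easy to test the formatting without touching the API.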
📚 What We Learned
Flink SQL is Powerful—But Has Quirks
Confluent Cloud Flink lacks some window functions we take for granted (like LAG() and ROW_NUMBER() on non-time columns). We learned to work around these using:
- Self-joins with offset windows for trend detection
- Manual time bucketing for low-volume scenarios
- Application-layer calculations for complex aggregations
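The trend-detection workaround boils down to pairing each window with its predecessor, which LAG() would otherwise do. Here is an application-layer sketch; the window keys and scores are illustrative.

```python
def window_deltas(scores: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Emulate LAG(): pair each window's health score with the change
    from the previous window. `scores` is assumed sorted by window start."""
    return [
        (start, score - prev_score)
        for (_, prev_score), (start, score) in zip(scores, scores[1:])
    ]

deltas = window_deltas([("14:00", 2.0), ("14:05", 4.5), ("14:10", 3.5)])
# → [("14:05", 2.5), ("14:10", -1.0)]
```

In Flink SQL the same effect comes from self-joining the window table against itself with a one-window offset.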
Watermarks Are Everything
Stream processing requires handling late-arriving events. We learned that watermark strategy directly impacts result latency:
$$\text{Watermark} = \max(\text{event_time}) - \text{tolerance}$$
We settled on a 10-second tolerance—a balance between completeness and timeliness.
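In code form, the watermark rule is a one-liner (event times as epoch seconds are an assumption for the sketch):

```python
def watermark(event_times: list[float], tolerance_s: float = 10.0) -> float:
    """Watermark = max observed event time minus the lateness tolerance.
    Events timestamped below the watermark are treated as late."""
    return max(event_times) - tolerance_s

# With events at t=100, 105, 103 and a 10s tolerance, the watermark
# sits at 95: anything older than that is considered late.
```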
AI Needs Structure
Gemini works best when given structured, consistent input. Our denormalized gemini_summary table ensures the AI receives the same format every time, leading to more reliable insights.
🚧 Challenges We Faced
Challenge 1: Watermark Advancement in Low-Volume Scenarios
Problem: Flink needed roughly 250 events per partition before its watermarks would advance, far more traffic than a live demo generates.
Solution: We created an alternative aggregation strategy using manual time bucketing:
FLOOR(`time` TO MINUTE) - INTERVAL '1' MINUTE * (EXTRACT(MINUTE FROM `time`) MOD 5)
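The same bucketing logic, sketched in Python for readability: truncate the timestamp to the minute, then subtract (minute mod 5) minutes so every event lands on a 5-minute boundary.

```python
from datetime import datetime, timedelta

def bucket_5min(ts: datetime) -> datetime:
    """Floor a timestamp to its 5-minute bucket, mirroring the SQL above."""
    floored = ts.replace(second=0, microsecond=0)   # FLOOR(time TO MINUTE)
    return floored - timedelta(minutes=floored.minute % 5)

# 14:33:42 → 14:30:00
bucket = bucket_5min(datetime(2025, 1, 1, 14, 33, 42))
```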
Challenge 2: Real-Time AI Without Overloading
Problem: Calling Gemini on every event would be expensive and slow.
Solution: We batch events into 5-minute windows, reducing API calls by 99% while maintaining near-real-time insights.
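A sketch of the batching idea: group events by window key up front so the downstream AI call runs once per window rather than once per event. The `ts` field and the 300-second window size are illustrative.

```python
from collections import defaultdict

def group_by_window(events: list[dict], window_s: int = 300) -> dict[int, list[dict]]:
    """Batch events into fixed windows keyed by floor(ts / window_s).
    One Gemini call is then made per window, not per event."""
    windows: dict[int, list[dict]] = defaultdict(list)
    for e in events:
        windows[e["ts"] // window_s].append(e)
    return dict(windows)

events = [{"ts": 10}, {"ts": 250}, {"ts": 310}]
# → window 0 holds two events, window 1 holds one: 3 events, 2 API calls
```

At ~100 events per window, one call per window is a 99% reduction versus one call per event.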
Challenge 3: Cross-System Event Correlation
Problem: How do you link a Kubernetes pod crash to a GitHub deployment?
Solution: We implemented correlation IDs that propagate across systems, allowing Flink to group related events:
GROUP BY `correlation_id`, TUMBLE(...)
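The effect of that grouping can be sketched in Python: for each correlation ID, collect which systems it touched, so a pod crash and the deployment that caused it surface together. Field names here are assumptions.

```python
from collections import defaultdict

def correlated_sources(events: list[dict]) -> dict[str, list[str]]:
    """For each correlation_id, report the systems it appeared in.
    A Kubernetes crash sharing an ID with a GitHub deploy shows up
    under the same key."""
    groups: dict[str, set] = defaultdict(set)
    for e in events:
        groups[e["correlation_id"]].add(e["source"])
    return {cid: sorted(srcs) for cid, srcs in groups.items()}

events = [
    {"correlation_id": "deploy-42", "source": "github"},
    {"correlation_id": "deploy-42", "source": "kubernetes"},
]
# → {"deploy-42": ["github", "kubernetes"]}
```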
Challenge 4: Avro Deserialization with Schema Registry
Problem: Confluent Schema Registry integration required precise configuration.
Solution: We built a reusable AvroDeserializer class that handles schema fetching and caching automatically.
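A minimal sketch of the caching idea behind that class, with the registry's HTTP lookup (GET /schemas/ids/{id}) stubbed out as a pluggable `fetch` callable; the class and method names are illustrative.

```python
from typing import Callable

class CachingSchemaResolver:
    """Fetch each schema ID from the registry once, then serve it
    from memory. `fetch` stands in for the Schema Registry HTTP call."""

    def __init__(self, fetch: Callable[[int], str]):
        self._fetch = fetch
        self._cache: dict[int, str] = {}

    def schema(self, schema_id: int) -> str:
        if schema_id not in self._cache:          # miss: hit the registry
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]             # hit: no network round-trip
```

With the fetch function injected, the caching behavior is trivially testable: resolving the same ID twice triggers only one registry call.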
🎯 What's Next
- Predictive alerts: Use historical patterns to predict incidents before they happen
- Runbook automation: Let Gemini suggest and execute remediation steps
- Multi-tenant support: Scale to enterprise deployments
OpsVision: Because DevOps teams deserve clarity, not chaos.