SentinelOps: Autonomous SRE & DevOps Incident Command Portal

Every answer is provably sourced — matched runbooks with live cosine-similarity scores.
MongoDB Atlas Vector Search + Gemini 2.5 for grounded SRE incident diagnosis.
Live MongoDB Atlas data with Gemini-synthesized user memory profiles.
Ingest any runbook — embedded with text-embedding-004 into 768-dim vectors and written straight into MongoDB Atlas, instantly searchable.

Inspiration

Every engineer who has been on call knows the feeling: it's 3 AM, production is down, and you're frantically grepping through wikis, Slack threads, and half-outdated runbooks trying to remember how you fixed this last time. The knowledge exists somewhere — it's just scattered and slow to reach when every minute of downtime costs money.

At the same time, AI chatbots are everywhere now — but you can't trust a tool that confidently makes things up when production is on fire. We wanted an assistant that's both fast and provably trustworthy: it answers instantly, and it shows you exactly which document the answer came from. That combination — speed plus grounded transparency — became SentinelOps.

What it does

SentinelOps is an autonomous SRE (Site Reliability Engineering) agent that diagnoses production incidents using your own runbook library.

Ask in plain English — "How do I fix Redis cache key eviction during high traffic?" — and the agent retrieves the most relevant runbooks and answers from them.
Grounded, not hallucinated — every response shows a MongoDB Atlas Vector Search Results panel with the matched runbooks and their cosine-similarity scores, so you can verify the source.
Persistent memory — Gemini synthesizes a per-user memory profile from conversation history, stored in MongoDB Atlas.
Ingestion pipeline: paste any runbook and it's embedded into a 768-dimensional vector with text-embedding-004 and written straight into Atlas, where it becomes searchable as soon as the vector index picks up the write.
Dual models — switch between Gemini 2.5 Flash (fast) and Pro (deeper reasoning), both backed by the same Atlas grounding.
Observability hooks — an authenticated webhook endpoint lets monitoring tools trigger autonomous diagnosis.
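
As an illustration of the authenticated webhook, here is a minimal sketch of a shared-secret check. The header name X-Webhook-Secret comes from this writeup; the function name, the way the secret is passed in, and the Flask usage shown in the comments are assumptions, not the project's actual code.

```python
import hmac


def is_authorized(headers, expected_secret):
    """Return True only if the X-Webhook-Secret header matches the secret.

    `headers` is any mapping with a .get() method (for example Flask's
    request.headers). `expected_secret` is assumed to come from
    configuration such as an environment variable.
    """
    if not expected_secret:
        # Refuse everything when no secret is configured, so a missing
        # setting can never leave the endpoint open.
        return False
    supplied = headers.get("X-Webhook-Secret", "")
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the secret were guessed correctly.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_secret.encode("utf-8")
    )


# Hypothetical use inside a Flask view (route and names are illustrative):
#
#   @app.post("/webhook/alert")
#   def alert():
#       if not is_authorized(request.headers, os.environ.get("WEBHOOK_SECRET", "")):
#           return {"error": "unauthorized"}, 401
#       ...  # only now trigger the Gemini diagnosis
```

Rejecting the request before any model call is what keeps unauthenticated traffic from consuming Gemini quota.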

How we built it

The foundation is MongoDB Atlas, which powers three live collections — users, sessions, and knowledge_vectors.

The retrieval flow:

A user's question is embedded with Google's text-embedding-004 model into a 768-dimensional vector.
MongoDB Atlas Vector Search runs a $vectorSearch aggregation, ranking runbooks by cosine similarity:

$$\text{similarity}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}$$

The top matches are injected into Gemini 2.5's context as grounding, and the matched titles + scores are surfaced in the UI.
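
The retrieval steps above can be sketched as a small pipeline builder plus a reference implementation of the similarity formula. This is a sketch under assumptions: the index name vector_index and the field names embedding, title, and content are placeholders, and only the $vectorSearch stage, the 768-dimension size, and the knowledge_vectors collection come from this writeup.

```python
import math

EMBEDDING_DIM = 768  # output size of text-embedding-004, per this writeup


def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 path="embedding", limit=3,
                                 num_candidates=100):
    """Build an aggregation pipeline for Atlas Vector Search.

    index_name, path, and the projected field names are assumptions.
    """
    if len(query_vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions")
    if num_candidates < limit:
        raise ValueError("numCandidates must be at least limit")
    return [
        {"$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": limit,
        }},
        # Expose the relevance score next to each matched runbook.
        {"$project": {"_id": 0, "title": 1, "content": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]


def cosine_similarity(q, d):
    """Plain-Python version of the formula above, for sanity checks."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def normalized_score(cosine):
    """Map a raw cosine in [-1, 1] onto [0, 1]."""
    return (1 + cosine) / 2


# Hypothetical use with pymongo:
#   results = db["knowledge_vectors"].aggregate(
#       build_vector_search_pipeline(query_vector))
```

One detail worth checking against the Atlas documentation: for cosine indexes the reported vectorSearchScore is, as far as we know, the normalized value (1 + cosine) / 2 rather than the raw cosine, so scores shown in a UI may sit higher than the formula alone would suggest.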

Stack:

Backend: Python + Flask REST API, with four Vertex AI tools (search_knowledge_base, load_user_memory, save_chat_history, execute_mongodb_mcp_tool).
AI: Google Gemini 2.5 Flash & Pro via Vertex AI; text-embedding-004 for embeddings.
Database: MongoDB Atlas (M0 free tier) with a cosine-similarity vector index on 768-dim embeddings.
MCP: The official MongoDB MCP Server (25 tools) wired to Gemini over JSON-RPC 2.0, so the agent can run live database operations.
Infra: Containerized with Docker (Python + Node.js), deployed serverless on Google Cloud Run, with Cloud Logging and GCS runbook backups.
Frontend: A vanilla-JS glassmorphic dashboard on GitHub Pages — incident command, diagnostic chat, a live MongoDB Memory Core explorer, and a runbook ingester.
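
To make the MCP wiring more concrete, here is a rough sketch of how a JSON-RPC 2.0 tool call could be framed for an MCP server running as a subprocess. The tools/call method and the jsonrpc, id, and params fields follow the MCP and JSON-RPC 2.0 specifications; the helper names and the example tool name and arguments are illustrative assumptions, not the project's code.

```python
import itertools
import json

_request_ids = itertools.count(1)


def make_tool_call(tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def encode_message(message):
    """Serialize one message for a newline-delimited stdio transport."""
    # json.dumps escapes newlines inside strings, so each message stays
    # on a single line.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def parse_response(line, expected_id):
    """Decode a response line and return its result, raising on errors."""
    message = json.loads(line)
    if message.get("id") != expected_id:
        raise ValueError("response id does not match the request")
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


# Illustrative only: the tool name and argument keys are assumptions.
request = make_tool_call(
    "find", {"database": "sentinelops", "collection": "knowledge_vectors"}
)
wire_bytes = encode_message(request)
```

Matching each response to its request id matters once several tool calls are in flight over the same pipe.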

Challenges we ran into

Running the MCP server inside Cloud Run. The MongoDB MCP server is a Node.js subprocess, while our app is Python. Getting both runtimes into one container — and handling the cold-start timeout before the MCP handshake completed — took a multi-runtime Dockerfile and a threaded, timeout-guarded initialization with graceful fallback.
A silent AI-synthesis bug. Our memory-summary feature was calling an undefined GenerativeModel, which threw an exception that got swallowed by a try/except — so it silently fell back to a templated string instead of real AI output. We caught it by inspecting the live database and switched to the unified genai client.
Securing a public webhook. Our alert endpoint triggered a Gemini call on every request — an open door for quota abuse. We added X-Webhook-Secret authentication to lock it down.
"Committed" ≠ "deployed." More than once we fixed code, pushed to Git, and the live behavior didn't change — because Cloud Run was still serving the old revision. We learned to verify against the live endpoints, not the repo.
Vector index naming. Atlas Vector Search silently returns nothing if the index name in code doesn't match the one created in the Atlas UI — a subtle gotcha we now document explicitly.
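
One way to guard against the index-naming gotcha is to fail loudly at startup when the expected index is missing. The sketch below is an assumption about how that could look: the index name vector_index and the embedding field path are placeholders, and the helper is ours for illustration.

```python
# Definition matching the setup described in this writeup: cosine
# similarity over 768-dimensional vectors. The "embedding" path is assumed.
VECTOR_INDEX_DEFINITION = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 768,
            "similarity": "cosine",
        }
    ]
}


def require_search_index(index_docs, expected_name):
    """Raise if no search index with `expected_name` is present.

    `index_docs` is any iterable of index descriptions that carry a
    "name" key, such as the documents pymongo yields from
    Collection.list_search_indexes().
    """
    names = [doc.get("name") for doc in index_docs]
    if expected_name not in names:
        raise LookupError(
            f"search index {expected_name!r} not found; existing: {names}"
        )
    return expected_name


# Hypothetical startup check with pymongo:
#   require_search_index(
#       db["knowledge_vectors"].list_search_indexes(), "vector_index")
```

Turning an empty result set into an explicit startup error makes the mismatch visible before the first query runs.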

Accomplishments that we're proud of

Grounding you can see. The similarity-score panel turns "trust me" into "here's the proof" — exactly what makes an AI agent usable for real incident response.
A genuinely live, end-to-end system — deployed on Cloud Run, with real Atlas vector search, real Gemini answers, and a real ingestion write-path, all working together.
Honest by design. Simulated/demo panels are clearly badged, and our stats are all verifiable facts — no inflated metrics.
Real partner integration — not just storing data in MongoDB, but using Atlas Vector Search and the official MongoDB MCP Server as core, load-bearing parts of the agent.

What we learned

RAG is only as trustworthy as its transparency. Surfacing the retrieved sources and scores changed the product from "a chatbot" into "a tool an engineer would actually rely on."
Vector search quality lives in the details — embedding model choice, index configuration, numCandidates, and similarity metric all materially affect results.
Tool-use orchestration is powerful but fragile: subprocess lifecycles, timeouts, and silent fallbacks need deliberate handling, especially in serverless environments.
Demo integrity matters. Closing the gap between what we claimed and what the code did made the whole project stronger.
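
The lessons about timeouts and silent fallbacks can be illustrated with a small timeout-guarded initializer that reports why it fell back instead of swallowing the failure. This is a sketch, not the project's implementation; the function name, the status strings, and the start_mcp_server callable in the usage comment are all hypothetical.

```python
import logging
import threading

log = logging.getLogger("sentinelops.init")


def init_with_timeout(start_fn, timeout_s):
    """Run start_fn in a background thread and wait at most timeout_s.

    Returns (value, status) where status is "ok", "timeout", or "error",
    so the caller can tell a real result from a fallback.
    """
    outcome = {}

    def runner():
        try:
            outcome["value"] = start_fn()
        except Exception as exc:  # captured here, logged below
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        # The thread keeps running, but the request path is not blocked.
        log.warning("initialization exceeded %.1fs; using fallback", timeout_s)
        return None, "timeout"
    if "error" in outcome:
        log.error("initialization failed; using fallback",
                  exc_info=outcome["error"])
        return None, "error"
    return outcome["value"], "ok"


# Hypothetical use at application startup:
#   mcp_client, status = init_with_timeout(start_mcp_server, timeout_s=10.0)
#   if status != "ok":
#       ...  # serve without MCP tools and surface the status
```

Returning the status alongside the value makes a fallback an explicit, logged decision rather than something discovered later by inspecting the database.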

What's next for SentinelOps: Autonomous SRE & DevOps Incident Command Portal

Real integrations to replace the simulated panels — live Dynatrace/Datadog alert ingestion and genuine GitLab/GitHub merge-request creation for AI-generated hotfixes.
Multi-tenant runbook libraries so teams can ground the agent on their own private documentation.
Auto-ingestion of runbooks from existing wikis, Confluence, and Git repos.
Feedback loop — let engineers rate answers so retrieval and synthesis improve over time.
Proactive incident response — chaining Atlas-grounded diagnosis with automated remediation actions, with a human approval gate.

Built With

atlas-vector-search
css3
docker
flask
gemini
gemini-2.5
github
google-cloud
google-cloud-logging
google-cloud-run
html5
javascript
mcp
model-context-protocol
mongodb
mongodb-atlas
node.js
python
rest-api
text-embedding-004
vertex-ai

Updates

Avish Tejas Maniar started this project — Jun 11, 2026 12:55 PM EDT
