SentinelOps — Project Story
Inspiration
Every engineering team shares the same nightmare.
It's 2 AM. The CI pipeline fails. Someone gets paged. Three hours later, a senior engineer surfaces from a sea of raw log output to find the culprit: a missing environment variable. Something that could have been caught in 30 seconds — if anyone had been looking in the right place.
I've lived that night. Most engineers have. What makes it worse isn't the failure itself — it's the sameness of it. The same patterns repeat sprint after sprint. The same engineers lose the same hours. The same post-mortem is written, filed, and forgotten.
Current DevOps tools are built to alert you after the damage is done. They show raw logs with no context, no history, and no intelligence. They treat every failure as a fresh mystery — when most failures are reruns of incidents that have already been solved, somewhere, by someone, who no longer remembers.
In a typical 30-person engineering team, CI failures consume over 100 hours of productivity per month.
That number is what pushed me to build SentinelOps. Not to make a smarter alert — but to make the alert unnecessary in the first place.
What It Does
SentinelOps is a real-time AI engineering intelligence system — not a chatbot, not another monitoring dashboard. It's an autonomous co-pilot that operates across three phases of the development lifecycle:
Before a PR is merged: SentinelOps analyzes the code diff, author history, file types, and dependency changes. It assigns a risk probability (0–100%) and blocks dangerous merges autonomously via the GitHub Commit Status API — surfacing a visual traffic light (🟢 Safe / 🟡 Caution / 🔴 High Risk) before a human reviewer ever opens the PR.
When CI fails: The system automatically downloads the failure log, extracts the relevant error block, retrieves the code diff, and queries OpenAI GPT-4o with all three as context. It returns a structured root cause explanation in plain English — what broke, which files caused it, the exact fix as a patch diff, and an estimated time to resolve. What used to take 2–6 hours now takes under 2 minutes.
Over time: Every log is embedded using SentenceTransformers. When a new failure occurs, SentinelOps searches against the full history of incidents by cosine similarity — surfacing matches like: "This failure is 95% similar to Incident #234 — that was a memory leak in the connection pool." The system builds institutional memory automatically, so the same incident is never diagnosed from scratch twice.
How I Built It
Data Flow
GitHub PR opened
→ Webhook fires → FastAPI receives → Redis queue → Celery worker picks up
→ Static Risk Analyzer computes risk score (lines, file types, author history)
→ ML Predictor (Logistic Regression) computes failure probability P(failure)
→ Result stored in PostgreSQL → WebSocket broadcast → Dashboard updates live
CI workflow_run completed with failure
→ GitHub Actions log downloaded → Error block extracted
→ SentenceTransformer embedding created → Similar incidents retrieved
→ OpenAI GPT-4o called with: log + diff + similar incidents as context
→ JSON response: root cause + responsible files + patch diff
→ Stored as Incident → Dashboard shows real-time alert
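The "Error block extracted" step above can be sketched in a few lines of Python. The marker patterns and window sizes here are illustrative assumptions, not the exact production values:

```python
import re

# Keywords that typically mark the start of a failure block in CI logs.
# Illustrative list -- a real extractor would match more patterns.
ERROR_MARKERS = re.compile(r"(error|fail(ed|ure)?|exception|traceback)", re.IGNORECASE)

def extract_error_block(log_text: str, context: int = 5, tail: int = 30) -> str:
    """Return the lines around the first error marker, else fall back to the tail."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if ERROR_MARKERS.search(line):
            start = max(0, i - context)
            return "\n".join(lines[start:i + context + 1])
    # No marker found: send only the tail to keep the LLM prompt small.
    return "\n".join(lines[-tail:])
```

Keeping this step cheap matters: only the extracted block (plus the diff and similar incidents) ever reaches GPT-4o, not the full multi-megabyte log.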
Key Technical Components
1. Risk Scoring Engine
The pre-merge risk score is a weighted composite of five signals blended with a trained ML model:
$$\text{Risk Score} = \alpha \cdot \hat{P}_{\text{LR}}(\vec{x}) + (1 - \alpha) \cdot \text{StaticScore}(\vec{x})$$
Where the static component weights five features:
$$\text{StaticScore} = w_1 f_{\text{lines}} + w_2 f_{\text{filetype}} + w_3 f_{\text{author\_hist}} + w_4 f_{\text{pr\_size}} + w_5 f_{\text{dep\_changes}}$$
The ML component uses Logistic Regression trained on 1,000 synthetic PRs:
$$\hat{P}_{\text{LR}}(\vec{x}) = \sigma(\vec{w}^T \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w}^T \vec{x} + b)}}$$
I chose Logistic Regression deliberately. For tabular CI metadata, it is fast, interpretable, and produces a true probability — exactly what the Gatekeeper needs to decide whether to report success or failure to GitHub.
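A minimal sketch of the blend, using only the standard library. The value of α and the static weights are illustrative assumptions, not the trained values:

```python
import math

ALPHA = 0.6  # assumption: weight given to the ML probability vs. the static score

def sigmoid(z: float) -> float:
    """Logistic function: maps the linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def static_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the five normalized static signals (each in [0, 1])."""
    return sum(weights[name] * features[name] for name in weights)

def risk_score(features: dict[str, float], lr_weights: list[float],
               lr_bias: float, static_weights: dict[str, float]) -> float:
    """Blend P_LR (logistic regression) with the static heuristic score."""
    x = list(features.values())
    p_lr = sigmoid(sum(w * xi for w, xi in zip(lr_weights, x)) + lr_bias)
    return ALPHA * p_lr + (1 - ALPHA) * static_score(features, static_weights)
```

Because both components live in [0, 1], the blended score maps directly onto the traffic-light thresholds the Gatekeeper reports to GitHub.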
2. LLM Root Cause Pipeline
Structured prompt engineering with three context inputs: the raw log (tail + error block), the PR diff, and the top-k similar historical incidents. The response format is enforced as a typed JSON contract:
```json
{
  "root_cause": "string",
  "responsible_files": ["string"],
  "error_category": "string",
  "confidence": 0.0,
  "suggested_fix": "string",
  "fix_diff": "string",
  "risk_if_unresolved": "string",
  "estimated_fix_time": "string"
}
```
Treating the system prompt like an API contract — not a suggestion — was the key insight that made GPT-4o output reliably parseable.
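The fallback parsing can be sketched with only the standard library. The field names come from the contract above; the tolerate-surrounding-text strategy is an illustrative assumption about how the fallback works:

```python
import json

REQUIRED_FIELDS = {
    "root_cause", "responsible_files", "error_category", "confidence",
    "suggested_fix", "fix_diff", "risk_if_unresolved", "estimated_fix_time",
}

def parse_analysis(raw: str) -> dict:
    """Parse the model's reply, tolerating prose or fences around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in LLM reply")
    data = json.loads(raw[start:end + 1])
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        # A contract violation is an error, not something to silently patch over.
        raise ValueError(f"LLM reply missing fields: {sorted(missing)}")
    return data
```

Raising on missing fields (rather than filling defaults) keeps every downstream consumer honest about what the model actually returned.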
3. Failure Similarity Search
all-MiniLM-L6-v2 creates 384-dimensional embeddings of every CI log:
$$\vec{v} \in \mathbb{R}^{384}$$
New failures are matched against historical incidents using cosine similarity:
$$\text{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$
Incidents above a 70% similarity threshold are surfaced as related — turning every new failure into a search query against the team's full incident history.
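The matching step reduces to a few lines once embeddings exist. This sketch assumes the vectors are already available as plain float lists (in production they come from the SentenceTransformer encoder); `related_incidents` and the incident-ID mapping are illustrative names:

```python
import math

SIMILARITY_THRESHOLD = 0.70  # incidents above this are surfaced as related

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity; 0.0 for a zero-norm vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_incidents(query: list[float],
                      history: dict[int, list[float]]) -> list[tuple[int, float]]:
    """Rank historical incident embeddings by similarity to the new failure."""
    scored = [(iid, cosine_similarity(query, vec)) for iid, vec in history.items()]
    return sorted(
        [(iid, s) for iid, s in scored if s >= SIMILARITY_THRESHOLD],
        key=lambda t: t[1], reverse=True,
    )
```

At the team scale this targets, a linear scan over stored embeddings is fast enough; a vector index would only matter at much larger incident counts.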
4. Digital Twin — Monte Carlo Simulation
Before a deployment, SentinelOps runs $N = 1{,}000$ Monte Carlo iterations to estimate deployment stability from historical metrics (duration, flakiness rate, resource contention):
$$\hat{P}(\text{stable}) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\text{sim}_i = \text{success}]$$
This gives a confidence interval for production stability before any code ships.
5. Incident Memory Graph
An interactive React Flow visualization maps relationships between PR → Commit → Author → Failure → Log Pattern. Similar incidents are connected by dashed edges weighted by their similarity score — making cross-incident patterns visible at a glance.
Technologies Used
| Layer | Stack |
|---|---|
| Frontend | Next.js 14 · TypeScript · Tailwind CSS · Recharts · React Flow · Framer Motion |
| Backend | Python 3.11 · FastAPI · Celery · SQLAlchemy (async) · Alembic · WebSockets |
| Database & Cache | PostgreSQL 15 · Redis 7 |
| AI / ML | OpenAI GPT-4o · SentenceTransformers all-MiniLM-L6-v2 · scikit-learn LogisticRegression |
| Infrastructure | Docker · Docker Compose · Vercel (frontend) · Render (backend) |
| Integrations | GitHub REST API v3 · GitHub Webhooks · GitHub Actions |
Challenges I Faced
The cold start problem. The risk model and similarity search are useless without historical data. I solved this by building a seed script that generates realistic synthetic incident data — PRs, failures, embeddings — so SentinelOps is useful on day one, not month six.
WebSocket state consistency. Keeping CI run statuses live on the dashboard without polling — and without serving stale state after reconnects — required building a reconnection-aware hook that rehydrates from the REST API on every connect. Distributed state is easy until the connection drops.
Prompt reliability at scale. Getting GPT-4o to return structured, parseable JSON every time — not just most of the time — required careful prompt iteration, strict output format instructions, and fallback parsing logic. LLM prompting is an engineering discipline, not a trick.
LLM cost vs. latency. GPT-4o calls are not free. I added log truncation (sending only the tail and error lines), embedding caching, and async Celery dispatch so LLM analysis never blocks the API response path.
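The embedding cache can be sketched as a content-addressed store: identical logs hash to the same key, so each unique log is embedded exactly once. `EmbeddingCache` and its in-memory dict are illustrative; the production store could equally live in Redis, and `embed_fn` stands in for the SentenceTransformer encode call:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so identical logs are embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn          # e.g. the model's encode function
        self._store: dict[str, list[float]] = {}
        self.misses = 0                    # counts actual embedding calls

    def get(self, log_text: str) -> list[float]:
        key = hashlib.sha256(log_text.encode()).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(log_text)
        return self._store[key]
```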
GitHub Gatekeeper idempotency. Commit status updates must be idempotent — if a webhook fires twice, the status should not double-report. Handling webhook delivery failures, retries, and PR state transitions required careful design that I hadn't anticipated going in.
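The dedup half of that design can be sketched using GitHub's per-event delivery GUID (the `X-GitHub-Delivery` header) as the idempotency key. `Gatekeeper` and the in-memory set are illustrative; in the real system the seen-delivery store would be shared (e.g. Redis), and `post_status` stands in for the Commit Status API call:

```python
class Gatekeeper:
    """Deduplicate webhook deliveries so a retried event never double-posts."""

    def __init__(self, post_status):
        self._post_status = post_status   # callable(sha, state) -> posts to GitHub
        self._seen: set[str] = set()      # delivery IDs already handled

    def handle(self, delivery_id: str, sha: str, state: str) -> bool:
        """Return True if the status was posted, False if it was a duplicate."""
        if delivery_id in self._seen:
            return False
        self._seen.add(delivery_id)
        self._post_status(sha, state)
        return True
```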
What I Learned
Building SentinelOps compressed years of distributed systems learning into a single project.
I learned that async system design reveals itself slowly — the failure modes of coordinating FastAPI, Celery, Redis, and WebSockets don't appear until everything runs together under load. Graceful degradation isn't a feature you add at the end; it has to be designed in from the start.
I learned that ML in production is a different problem than ML in a notebook. The logistic regression model is simple. The hard parts are versioning, retraining triggers, feature drift, and what happens when predictions are systematically wrong on a new codebase with no history.
And I learned that the best system prompt is one you treat like a contract — written precisely, tested like code, and never changed without understanding the downstream consequences for every consumer of its output.
Impact
| Metric | Before SentinelOps | With SentinelOps |
|---|---|---|
| Mean Time to Root Cause | 2–6 hours | < 2 minutes |
| Repeat Incidents | ~40% | < 5% (similarity detection) |
| Risky PR Merges | Unknown until CI fails | Flagged before merge |
| Engineering Hours Lost / Month | 100+ | < 20 |
What Makes SentinelOps Different
Most tools in this space are AI assistants — systems that answer questions when asked. SentinelOps is a proactive intelligence system. It doesn't wait to be queried. It monitors, detects, analyzes, and acts — autonomously, at every stage of the pipeline.
The combination of pre-merge risk prediction + post-failure LLM analysis + cross-incident similarity search creates a compounding feedback loop: every failure makes the next one cheaper to diagnose, and every diagnosed failure reduces the probability of the one after that.
Future Roadmap
- Plugin architecture for CircleCI, Jenkins, and GitLab CI adapters
- Multi-org SaaS with isolated namespaces and Stripe billing
- Autonomous mode — SentinelOps opens GitHub PRs with fixes and tags the right reviewers automatically
- SOC2 readiness — audit logging, secret management, and compliance reporting
- Kubernetes deployment with auto-scaling Celery workers per organization
- Self-hosted LLM support — swap GPT-4o for a local model (Ollama, etc.) for privacy-sensitive environments
SentinelOps is built to be enterprise-ready from day one. Every architecture decision — async-first, queue-backed, containerized — prioritizes scalability, security, and extensibility.
Built with ❤️ by Arsh Verma
Built With
- celery
- docker
- fastapi
- githubrestapi
- gpt-4o
- next.js
- postgresql
- pydantic
- python3.11
- react
- redis7
- sqlalchemy
- tailwindcss
- typescript