Sentinel

Inspiration

AI factories run two things on top of GPUs: AI agents, and the infrastructure underneath them. Both fail in new ways, and the tools that exist today are split. Detection libraries flag threats but never act. Observability platforms log incidents after the damage is done. Cluster monitoring catches GPU memory and network problems but does not understand LLM semantics. Engineers stitch three or four panes of glass together, often at 3am, while an agent quietly drops a production table or a model serving config crashes a customer endpoint. Sentinel exists because the response to an agent attempting to delete a database should not look fundamentally different from the response to a GPU node overheating. Both are operational incidents. Both need the same answer: what happened, why, what action to take, who to wake up.

What it does

Sentinel is one operations layer for the AI Factory. Same reasoning engine, two signal streams.

Layer 1, agent firewall. Every tool call an AI agent makes is intercepted, scored by a two-tier ranker (regex heuristics first, an LLM for the ambiguous middle band), and routed by severity. Critical events auto-block. High severity opens a war room. Medium events post to a Stream chat channel with action buttons. Our labeled eval set of 36 examples shows 100% precision, 95% recall, F1 0.97.
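A minimal sketch of the two-tier ranking and severity routing described above. The regex patterns, the score thresholds, the "low" severity, and the route names are all illustrative assumptions, not Sentinel's actual rules; `llm_score` stands in for whatever model call resolves the ambiguous band.

```python
import re
from typing import Callable

# Hypothetical tier-1 patterns; the real heuristics are not shown in this writeup.
DESTRUCTIVE = re.compile(r"\b(drop\s+table|truncate\s+table|rm\s+-rf)\b", re.IGNORECASE)
BENIGN = re.compile(r"^\s*(select|ls)\b", re.IGNORECASE)

# Severity -> response. "low" and its "allow" route are assumptions.
ROUTES = {
    "critical": "auto_block",
    "high": "open_war_room",
    "medium": "post_to_chat",
    "low": "allow",
}


def rank(call_text: str, llm_score: Callable[[str], float]) -> str:
    """Return a severity label for one tool call."""
    # Tier 1: regex short-circuits. Destructive is checked first so that a
    # benign-looking prefix cannot mask a destructive suffix.
    if DESTRUCTIVE.search(call_text):
        return "critical"
    if BENIGN.search(call_text):
        return "low"
    # Tier 2: only the ambiguous middle band reaches the LLM scorer.
    score = llm_score(call_text)
    if score >= 0.9:  # hypothetical thresholds
        return "critical"
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


def route(call_text: str, llm_score: Callable[[str], float]) -> str:
    """Map a tool call to the response its severity triggers."""
    return ROUTES[rank(call_text, llm_score)]
```

In this sketch the LLM is never called for the obvious cases, which is the point of putting the regex tier first.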

Layer 2, AI Factory operations. Sentinel ingests Cisco's AI Factory dataset (18 scenarios across performance, GPU placement, and failure detection), reads alerts, logs, and runbooks, and returns Cisco's required structured recommendation: action, target, reason category, confidence, evidence. Each recommendation also includes a plain-English reasoning sentence and a three-step on-call playbook with specific thresholds. All 18 pass Cisco's official validate_recommendation.py --require-all validator.
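To make the shape of a recommendation concrete, here is a rough sketch of the five fields named above plus a local sanity check. The Python field names, the example action menu, and the `check` helper are assumptions for illustration; this is not Cisco's schema and it does not replace validate_recommendation.py.

```python
from dataclasses import dataclass

# Hypothetical per-track action menus; the real menus come from the dataset.
ALLOWED_ACTIONS = {
    "performance": {"reduce_batch_size", "lower_concurrency"},
    "failure_detection": {"drain_node", "restart_job"},
}


@dataclass
class Recommendation:
    action: str
    target: str
    reason_category: str
    confidence: float
    evidence: list[str]


def check(rec: Recommendation, track: str) -> list[str]:
    """Return a list of problems; an empty list means the sketch-level checks pass."""
    problems = []
    if rec.action not in ALLOWED_ACTIONS.get(track, set()):
        problems.append(f"action {rec.action!r} is not in the menu for track {track!r}")
    if not 0.0 <= rec.confidence <= 1.0:
        problems.append("confidence must be between 0 and 1")
    if not rec.evidence:
        problems.append("at least one piece of evidence is required")
    return problems
```

Running a check like this before the official validator would catch an off-menu action early, assuming the menus are loaded from the scenario context.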

Cross-layer escalation. Both layers share the same channels. A Stream incident feed with filterable tabs (All, Agent, Cisco, Escalations) is the audit log. A Tencent TRTC and Google Meet war room is the synchronous response surface for critical events. A VoiceOS MCP server lets engineers query incidents, evaluate Cisco scenarios, and decide on actions hands-free, from anywhere.

How we built it

The backend is FastAPI with an in-memory event bus and SSE streaming to the dashboard. The ranker is two tiers: regex heuristics for clearly destructive or clearly benign tool calls and an LLM for the ambiguous band, with results cached by call fingerprint. The Cisco advisor is a separate module that pulls scenario context (alerts, logs, runbooks, allowed action menu) and produces the structured recommendation, with deterministic heuristics routing by primary alert and the LLM enriching the reasoning and next-steps narrative. We exposed Sentinel as an MCP stdio server with seven tools, picked up by VoiceOS for the voice surface.
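Caching by call fingerprint could look roughly like the sketch below. It assumes the fingerprint is a SHA-256 hash of the tool name and arguments serialized as canonical JSON; the actual fingerprint scheme is not described here, and `VerdictCache` and `scorer` are hypothetical names.

```python
import hashlib
import json
from typing import Callable


def fingerprint(tool: str, args: dict) -> str:
    """Hash a tool call so that argument key order does not change the result."""
    canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VerdictCache:
    """In-memory cache of verdicts keyed by call fingerprint."""

    def __init__(self) -> None:
        self._verdicts: dict[str, str] = {}

    def get_or_score(self, tool: str, args: dict, scorer: Callable[[str, dict], str]) -> str:
        key = fingerprint(tool, args)
        if key not in self._verdicts:
            # Only a cache miss pays for the (expensive) scorer call.
            self._verdicts[key] = scorer(tool, args)
        return self._verdicts[key]
```

Sorting keys before hashing means two calls that differ only in argument order share one cached verdict.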

The frontend is Next.js 16 with Tailwind 4 and React 19. Live event table over SSE, severity counters, critical-event banner, Cisco scenario panel with a recommendation card, embedded Stream chat sidebar with filter tabs, and a war room modal that provisions a Google Meet room and shows the brief plus decision controls.

Sponsor integrations: Cisco dataset and validator, Stream Chat (server SDK + React SDK), Tencent TRTC (HMAC-SHA256 UserSig minting), VoiceOS (custom MCP integration), Google Meet (pre-provisioned room link).

Challenges we ran into

Sponsors moved underneath us as we discovered what they actually shipped. VoiceOS turned out to be a desktop action agent, not telephony, so we redesigned the voice flow around MCP tool calls instead of phone calls. Tencent did not give us general compute, so the cloud-deploy phase got dropped and we ran everything locally. Stream Chat React v14 changed where the custom message prop lives, which silently broke our action-button card and crashed the dashboard render until we tracked it down. Tencent's TRTC UserSig algorithm uses a non-standard base64 alphabet that their docs gloss over. Multi-device demos hit enterprise Wi-Fi client isolation, which pushed us to swap embedded TRTC for Google Meet so responders could join from any network.
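For anyone hitting the same UserSig problem, the sketch below shows only the base64 step. It assumes the substitution used in Tencent's open-source reference implementation as we understand it ("+" becomes "*", "/" becomes "-", "=" becomes "_"); verify the mapping against the official SDK before relying on it. The function names are ours.

```python
import base64

# Assumed substitution table; check it against Tencent's reference implementation.
_TO_VARIANT = str.maketrans("+/=", "*-_")
_FROM_VARIANT = str.maketrans("*-_", "+/=")


def encode_variant(data: bytes) -> str:
    """Standard base64, then swap the three characters that differ."""
    return base64.b64encode(data).decode("ascii").translate(_TO_VARIANT)


def decode_variant(text: str) -> bytes:
    """Undo the swap, then decode as standard base64."""
    return base64.b64decode(text.translate(_FROM_VARIANT))
```

Note that this is not the URL-safe alphabet from Python's `base64.urlsafe_b64encode`, which is why a signature minted with the standard helpers can fail to verify.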

Accomplishments that we're proud of

Four hackathon sponsors integrated end-to-end, each doing a distinct job. Cisco data drives the infrastructure layer, Stream is the unified incident audit log, VoiceOS is the voice surface via MCP, Tencent and Google Meet power the war room. Not bolted on, not for show.

All 18 Cisco scenarios pre-evaluated and validated. Our sentinel_recommendations.json passes Cisco's official validate_recommendation.py with strict --require-all mode. Schema clean, every required field present, every recommended action drawn from the allowed menu for its track. The dashboard surfaces this with a click-through "Cisco validator passed" badge.

A two-tier ranker with honest numbers. Regex heuristics short-circuit the obvious cases, an LLM resolves the ambiguous middle band, results cached by call fingerprint. A 36-example labeled eval set runs on every boot and reports 100% precision, 95% recall, F1 0.97 live in the dashboard footer. No vibes, no hand-waving.
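For reference, the three metrics follow from confusion counts as sketched below. The counts in the usage note are hypothetical: they are one split that is consistent with the reported figures, not the actual breakdown of our eval set.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 19 threats caught, 1 missed, 0 false alarms.
p, r, f1 = precision_recall_f1(tp=19, fp=0, fn=1)
```

With those assumed counts, precision is 1.0, recall is 0.95, and F1 rounds to 0.97.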

A reasoning UI that reads like an engineer talking. Each Cisco recommendation includes a one-sentence narrative connecting the signals to the action via the runbook, plus a three-step on-call playbook with specific thresholds ("reduce batch size by 20-30%, lower concurrency cap, monitor p95 latency for 10 minutes"). Engineers get an executable next step, not just an action name.

A reusable MCP server, not a bespoke integration. Sentinel exposes seven tools (incidents_lookup, incidents_decide, sentinel_status, cisco_scenarios, cisco_evaluate, current_meet_link, warroom_create_meet) over stdio. VoiceOS picks them up today; any MCP-compatible client (Claude Desktop, Cursor, AdaL) works tomorrow with zero changes on our side.

One coherent architecture, not two products in a trench coat. The same engine that intercepts agent tool calls also evaluates Cisco AI Factory scenarios. The same Stream channel logs both. The same Google Meet war room hosts both. The same VoiceOS commands query both. "Agents on top, infrastructure underneath" is the architectural decision, not a marketing line we wrote backwards.

Why Sentinel is a company, not a feature

The category is real and the buyer is obvious. Every company shipping AI agents into production in 2026 is hitting the same gap: their agents have prod-level access, their existing tools were not built to reason about agent failures, and a single prompt injection can move from "harmless email" to "DROP TABLE" in milliseconds. Gartner has already named "agent security" a top-three unsolved problem for the year. The competitive landscape splits cleanly: Lakera, NeMo Guardrails, and Llama Guard detect but do not act; LangSmith and Helicone log but do not prevent; Datadog and Cisco's own AI Factory tooling watch infrastructure but do not understand LLM semantics. Sentinel sits in the empty quadrant: real-time, action-taking, agent-aware, and infra-aware.

Who buys it. Platform teams at companies with more than 50 AI agents in production. Today that's a few hundred companies; by mid-2026 it's a few thousand.

How it monetizes. Per-intercepted-call pricing for the agent layer (think Cloudflare for agents), per-cluster for the infrastructure layer. Free tier covers small teams; enterprise tier ships SSO, audit logs, role-based decision authorization, and per-customer fine-tuned classifiers trained on the tenant's own agent traces.

Why it's defensible. The MCP-first architecture means Sentinel works with any agent stack today (Claude, Cursor, OpenAI, in-house) without integration work on the customer side. The two-tier ranker improves continuously as eval data grows. The Cisco-validated structured-recommendation format is a foothold into AI-infrastructure ops, a market Cisco itself is sizing in the billions.

Path from hackathon to v1. Real JSON-RPC MCP proxy in front of production agent fleets. Per-customer fine-tuning. Fleet-wide firewall rule generation from observed attacks. Role-based decisions, audit logs, multi-party approval for highest-stakes blocks. Deeper use of raw telemetry (node metrics, network metrics, checkpoint events) for hypothesis scoring instead of the pre-aggregated summary. Pricing pilot with three design-partner customers in Q3.

Sentinel is the operations pane for the AI Factory. The first version is built. The rest is execution.