💡 Inspiration

As Large Language Models (LLMs) like Gemini and GPT-4 become critical infrastructure, passive monitoring is no longer enough. We realized that companies don't just need to know when an AI fails; they need a system that actively prevents cascading failures.

We asked ourselves: "What if we could build a 'Circuit Breaker' for AI?"

This inspired the AI Trust Control Plane, a real-time decision engine that sits between users and AI models. Instead of just logging errors, it calculates a deterministic "Trust Score" and enforces active guardrails (like a Kill Switch) to block traffic when trust degrades.

⚙️ What it does

The platform provides a single pane of glass for AI reliability. It operates on a "Green → Red → Green" loop:

  1. Monitor: Detects rate limits (HTTP 429 responses) and latency spikes in real time.
  2. Score: Calculates a dynamic Trust Score (0-100). For example, a rate limit drops the score by 30 points immediately.
  3. Act: If the score drops below 60, the Kill Switch Guardrail activates and blocks API traffic to prevent cost overruns or bad user experiences.
  4. Audit: Every action (resolution, policy change) is hashed with SHA-256 and stored in an immutable ledger for compliance.
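
The "Act" step can be sketched as a small gate in front of the model call. This is a minimal illustration, not our exact implementation: only the threshold of 60 comes from the description above, and the names `checkGuardrail`, `guardedCall`, and `GuardrailDecision` are hypothetical.

```typescript
// Threshold from the write-up; every name below is illustrative.
const KILL_SWITCH_THRESHOLD = 60;

interface GuardrailDecision {
  allowed: boolean;
  reason?: string;
}

// Decide whether a request may be forwarded to the model.
function checkGuardrail(trustScore: number): GuardrailDecision {
  if (trustScore < KILL_SWITCH_THRESHOLD) {
    return {
      allowed: false,
      reason: `Trust score ${trustScore} is below ${KILL_SWITCH_THRESHOLD}`,
    };
  }
  return { allowed: true };
}

// Wrap any model call so it is skipped entirely while the kill switch is active.
async function guardedCall<T>(
  trustScore: number,
  callModel: () => Promise<T>
): Promise<T> {
  const decision = checkGuardrail(trustScore);
  if (!decision.allowed) {
    throw new Error(`GUARDRAIL ACTIVE: ${decision.reason ?? "trust degraded"}`);
  }
  return callModel();
}
```

Because the check runs before `callModel` is invoked, a blocked request never reaches the upstream API, which is what keeps costs from growing during an incident.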

🛠️ How we built it

We built a local-first, fail-safe architecture to ensure the Control Plane survives even if the backend falters.

  • Frontend: Built with Next.js 14 (App Router) and TypeScript for type safety. We used Shadcn/UI and Tailwind CSS for the mission-control aesthetic.
  • State Management: We encountered a challenge where rapid API failures caused React state glitches. We solved this by implementing a useRef-based "Instant Memory" system that tracks score degradation (100 → 70 → 40) synchronously, without waiting for a re-render.
  • Datadog Integration: The system pushes both Logs and Real-time Events to Datadog's US5 region via a Next.js API proxy, allowing us to visualize the "Staircase Degradation" effect on Datadog dashboards.
  • Security: We implemented a mock RBAC (Role-Based Access Control) system. The "SRE" role requires a security PIN to elevate privileges, while the "Auditor" role is read-only, demonstrating enterprise readiness.
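
To illustrate the "Instant Memory" idea from the State Management bullet, here is a framework-free sketch. It assumes the glitches came from two failures in the same tick both reading the same stale score from a render closure. `MutableRef` is a stand-in for the `{ current: T }` object that React's `useRef` returns, and `createScoreTracker` is a hypothetical helper.

```typescript
// Minimal stand-in for the object React's useRef returns.
interface MutableRef<T> {
  current: T;
}

function createScoreTracker(initial = 100) {
  const scoreRef: MutableRef<number> = { current: initial };
  return {
    // The read-modify-write happens synchronously on the ref, so each
    // back-to-back failure sees the value left by the previous one.
    penalize(points: number): number {
      scoreRef.current = Math.max(0, scoreRef.current - points);
      return scoreRef.current;
    },
    read(): number {
      return scoreRef.current;
    },
  };
}

const tracker = createScoreTracker();
tracker.penalize(30); // score is now 70
tracker.penalize(30); // score is now 40, even within the same tick
```

In a component, the ref would hold the authoritative score and a regular state setter would be called with `scoreRef.current` afterwards to trigger the re-render.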

🚧 Challenges we faced

The hardest part was the "Natural Degradation" logic. Initially, our Datadog graphs would jump straight from 100 to 0. To fix this, we wrote a custom scoring engine that subtracts penalties cumulatively (e.g., -15 for latency, -30 for rate limits). This created a realistic, organic "step-down" pattern in our observability metrics that mirrors real-world outages.
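
A minimal sketch of cumulative scoring, assuming only the two penalty values quoted above. The names `Incident`, `PENALTIES`, `applyPenalty`, and `scoreTimeline` are illustrative, not taken from our codebase.

```typescript
type Incident = "latency_spike" | "rate_limit";

// Penalty values from the write-up; the keys are illustrative names.
const PENALTIES: Record<Incident, number> = {
  latency_spike: 15,
  rate_limit: 30,
};

// Subtract one penalty from the current score, clamped to the 0-100 range.
function applyPenalty(score: number, incident: Incident): number {
  return Math.max(0, Math.min(100, score - PENALTIES[incident]));
}

// Replay a sequence of incidents and return every intermediate score,
// which is the "staircase" a dashboard would plot.
function scoreTimeline(incidents: Incident[], start = 100): number[] {
  const timeline = [start];
  let score = start;
  for (const incident of incidents) {
    score = applyPenalty(score, incident);
    timeline.push(score);
  }
  return timeline;
}

const steps = scoreTimeline(["rate_limit", "rate_limit", "latency_spike"]);
// steps is [100, 70, 40, 25]
```

Emitting each intermediate value as its own data point, rather than only the final score, is what produces the step-down shape instead of a single drop.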

🏆 Accomplishments that we're proud of

  • The Kill Switch: Seeing the "GUARDRAIL ACTIVE" red overlay trigger automatically when the score hit Critical was a huge win.
  • Immutable Audit Logs: Integrating the Web Crypto API to generate real SHA-256 hashes for every log entry makes the system feel truly "Audit-Ready."
  • Datadog Sync: We connected our local simulation to the Datadog cloud and watched the graphs move in real time.
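
The audit hashing can be sketched with the Web Crypto API's `crypto.subtle.digest`. This assumes a runtime where `crypto` and `TextEncoder` are globals (browsers and recent Node.js versions). The `AuditEntry` shape and the `sealEntry` helper are hypothetical. Note that a bare hash makes later edits to an entry detectable but, unlike a keyed signature, does not prove who wrote it.

```typescript
// Hash a string with SHA-256 and return the digest as lowercase hex.
async function sha256Hex(text: string): Promise<string> {
  const bytes = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Hypothetical audit entry shape, for illustration only.
interface AuditEntry {
  actor: string;
  action: string;
  timestamp: string;
}

// Attach a hash of the serialized entry so any later change is detectable.
async function sealEntry(
  entry: AuditEntry
): Promise<AuditEntry & { hash: string }> {
  const hash = await sha256Hex(JSON.stringify(entry));
  return { ...entry, hash };
}
```

Re-hashing the stored fields and comparing against `hash` is then enough to verify that an entry has not been modified.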
