Sentinel Fabric: Resilient AI Orchestration for the Enterprise
Inspiration
As enterprise adoption of Generative AI accelerates, developers face a harsh reality: LLMs are inherently unpredictable. API outages, severe rate limits, high latency for repetitive queries, and hallucinations caused by stale training data can cripple an enterprise application.
We were inspired by the principles of Chaos Engineering and Resilience in Distributed Systems. We realized that modern AI applications shouldn't rely on a single "God Model." Instead, they need an intelligent middleware, a fabric, that treats LLMs as volatile nodes in a larger ecosystem. We built Sentinel Fabric to prove that by combining TrueFoundry's AI Gateway with multi-tier caching and agentic tool execution, we can guarantee near 100% uptime and data accuracy, even when the underlying infrastructure fails.
What it does
Sentinel Fabric is a resilient AI agent middleware and observability dashboard. It acts as the ultimate safety net between your application and external LLM providers.
Key capabilities include:
- Multi-Model Fallback Chain: Automatically routes requests from primary models to secondary/free models (e.g., Llama 3.1 405B → Gemma 4 → GLM 4.5) if an API timeout or failure occurs.
- TrueFoundry Semantic Caching: Uses Redis to semantically match incoming queries with previous responses, drastically reducing latency and API costs.
- Agentic Tool Execution (MCP): Forces the LLM to use enterprise state data (via local inventory tools and policy fetchers) instead of relying on its potentially stale, pre-trained knowledge.
- Emergency "Shield Mode": If all LLMs fail, the system bypasses the AI layer entirely and queries a local database mirror directly to provide deterministic, keyword-routed fallback answers, keeping the system 100% online.
- Observability Dashboard: A React-based real-time control room that visually tracks the source of every response. It also proves database health by independently streaming live business metrics (recent sales) from our legacy MySQL database directly to the UI.
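The fallback chain and Shield Mode described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the project's actual orchestrator: the `FallbackChain`, `Tier`, and `Reply` names are hypothetical, each model call is reduced to a plain function, and records require Java 16 or later.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: class and tier names are illustrative, not Sentinel Fabric's real code.
final class FallbackChain {

    /** One tier in the chain: a label plus a function that answers a prompt or throws. */
    record Tier(String name, Function<String, String> call) {}

    /** The answer plus the tier that produced it, so a dashboard can show provenance. */
    record Reply(String source, String text) {}

    private final List<Tier> tiers;
    private final Function<String, String> shield; // deterministic last resort, e.g. a local DB lookup
    private final long timeoutMillis;

    FallbackChain(List<Tier> tiers, Function<String, String> shield, long timeoutMillis) {
        this.tiers = List.copyOf(tiers);
        this.shield = shield;
        this.timeoutMillis = timeoutMillis;
    }

    Reply ask(String prompt) {
        for (Tier tier : tiers) {
            try {
                // orTimeout fails the future after the deadline; it does not interrupt the underlying call.
                String text = CompletableFuture
                        .supplyAsync(() -> tier.call().apply(prompt))
                        .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                        .join();
                return new Reply(tier.name(), text);
            } catch (RuntimeException e) {
                // Timeout or provider error: fall through to the next tier.
            }
        }
        // Every model tier failed: answer deterministically without any LLM.
        return new Reply("SHIELD", shield.apply(prompt));
    }
}
```

Returning the tier name alongside the text is one way to feed a dashboard that tracks the source of every response.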
How we built it
We architected the system using a robust, decoupled stack:
- Backend: Built with Java & Spring Boot. This layer handles the complex orchestration logic, enforces circuit-breaker timeouts, and executes local Agentic tools.
- AI Routing: Leveraged TrueFoundry AI Gateway to manage API keys, routing rules, and official semantic caching headers.
- Persistence & Caching: Used Redis for lightning-fast in-memory caching and fallback routing. We also integrated a MySQL database that acts as a live heartbeat, feeding real transactional data (like recent sales) to our monitoring UI to prove system vitality.
- Frontend: Developed a modular React dashboard featuring dynamic visual containers, micro-animations, and auto-scrolling terminal logs to visualize system resilience in real-time.
From a reliability engineering perspective, we modeled our system's total reliability ($R_{sys}$) using the parallel components equation. By chaining $n$ independent fallback layers, the probability of complete failure drops exponentially:
$$R_{sys} = 1 - [ (1 - R_1) \times (1 - R_2) \times \dots \times (1 - R_n) ]$$
Where $R_i$ is the reliability of an individual layer (e.g., Cache, Primary LLM, Secondary LLM, Legacy DB).
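To make the formula concrete, here is a small Java helper that evaluates it. The class name and the reliability figures in the usage note are made up for illustration; they are not measurements from Sentinel Fabric.

```java
// Hypothetical helper: evaluates R_sys = 1 - prod(1 - R_i) for layers that fail independently.
final class ReliabilityModel {

    static double systemReliability(double... layers) {
        double allFail = 1.0; // probability that every layer fails at once
        for (double r : layers) {
            if (r < 0.0 || r > 1.0) {
                throw new IllegalArgumentException("reliability must be in [0, 1]: " + r);
            }
            allFail *= (1.0 - r);
        }
        return 1.0 - allFail;
    }
}
```

For example, two layers that are each 90% reliable give `systemReliability(0.9, 0.9)` of about 0.99. The result only holds if the layers really do fail independently; a shared dependency such as the network link to all cloud providers would make the real figure lower.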
TrueFoundry Guardrails Integration & Testing (Plan B)
Security and compliance are non-negotiable for enterprise deployments. To secure the orchestration layer without rewriting the core application, we implemented and successfully verified two critical guardrail layers using TrueFoundry AI Gateway:
- LLM Input Guardrail (Prompt Injection Protection): Detects and blocks malicious prompt injections and unauthorized tool-triggering attempts before they reach the LLM.
- MCP Pre-Tool Guardrail (SQL Sanitizer): Scans LLM-generated parameters passed to our local tools to detect and neutralize SQL injection attacks, protecting our internal databases.
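The guardrails above are configured in the gateway, so their internals are not shown here. As a complementary, application-side sketch of the same idea, a tool can also validate its own arguments against an allowlist before using them. The `ToolArgumentGuard` class and its pattern are assumptions for illustration, not the TrueFoundry guardrail logic.

```java
import java.util.regex.Pattern;

// Hypothetical application-side check; this is not how the gateway guardrail is implemented.
final class ToolArgumentGuard {

    // Allowlist: letters, digits, spaces, dots, underscores, hyphens; 1 to 64 characters.
    private static final Pattern SAFE_PRODUCT_NAME = Pattern.compile("[A-Za-z0-9 ._-]{1,64}");

    /** Returns true only when the value consists entirely of allowlisted characters. */
    static boolean isSafeProductName(String value) {
        return value != null && SAFE_PRODUCT_NAME.matcher(value).matches();
    }
}
```

An allowlist like this rejects quotes, semicolons, and comment markers outright. It is a second line of defense; parameterized queries remain the primary protection against SQL injection.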
Testing Infrastructure (Cloud-to-Local Tunnel via ngrok)
Since the TrueFoundry AI Gateway operates in the cloud and our Spring Boot backend runs locally, we used ngrok to establish a secure tunnel exposing our local port (http://localhost:8080):
- We connected our local MCP Server to the TrueFoundry AI Gateway using SSE (Server-Sent Events) and JSON-RPC over https://evil-evasive-reprogram.ngrok-free.dev, confirming that our low-code/no-code guardrails block and mutate payloads perfectly in real time.
Meticulous Database Architecture & Schema Design
To guarantee absolute enterprise reliability, performance, and scalability, we built our persistence layer on MySQL. We utilize a highly optimized hybrid relational storage schema designed to serve both real-time business telemetry and our Emergency "Shield Mode" semantic fallback engine.
Relational Database Schema (Entity-Relationship Table)
Below is the structure of the MySQL tables that currently power the Sentinel Fabric system:
1. Table: semantic_states
This table serves as the core semantic memory storage. It holds unstructured and semi-structured company states, security standards, and metadata. In Shield Mode, our custom token-matching algorithm queries this table directly using full-text keyword scanning.

| Column Name | Data Type | Constraints / Attributes | Description |
|---|---|---|---|
| state_key | VARCHAR(255) | PRIMARY KEY, NOT NULL | The unique semantic key identifier (e.g., enterprise_knowledge, stock_status, diagnostic_path). |
| state_value | VARCHAR(2000) | NULLABLE | Highly descriptive JSON-structured payload representing the status or state values. |
| last_updated | DATETIME | NOT NULL | Heartbeat timestamp recording the exact time of the last update. |

Sample Seed Row:

    {
      "state_key": "enterprise_knowledge",
      "state_value": "{\"policy_id\": \"SENTINEL-ALPHA-2026\", \"security_standard\": \"AES-512-RSA-8192\", \"compliance\": \"SOC-3-READY\", \"storage_path\": \"C:\\\\Users\\\\Yeni\\\\.gemini\\\\antigravity\\\\policies\"}"
    }

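A keyword-routed lookup over `semantic_states` could look roughly like the following. This is a sketch under assumptions: the table rows are represented by an in-memory map of `state_key` to `state_value`, and the scoring rule (count the query words that match tokens of the key) is a guess at what a token-matching fallback might do, not the project's actual algorithm.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of Shield Mode routing; the map stands in for rows of semantic_states.
final class ShieldRouter {

    private final Map<String, String> states; // state_key -> state_value

    ShieldRouter(Map<String, String> states) {
        this.states = new LinkedHashMap<>(states);
    }

    /** Returns the value whose key shares the most tokens with the query, if any token matches. */
    Optional<String> answer(String query) {
        String[] queryTokens = query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, String> entry : states.entrySet()) {
            int score = 0;
            // Keys such as "stock_status" are split on underscores into tokens.
            for (String keyToken : entry.getKey().toLowerCase(Locale.ROOT).split("_")) {
                for (String q : queryTokens) {
                    if (!q.isEmpty() && q.equals(keyToken)) {
                        score++;
                    }
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getValue();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Because the routing is plain string matching, the same query always returns the same row, which is the deterministic behavior Shield Mode relies on when no model is reachable.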
2. Table: sales
This table holds dynamic transactional sales records. It is streamed live to the Observability Dashboard every 5 seconds to demonstrate the health and data-carrying capacity of our legacy database mirror under load.

| Column Name | Data Type | Constraints / Attributes | Description |
|---|---|---|---|
| id | BIGINT | PRIMARY KEY, AUTO_INCREMENT | Unique database sequence identifier. |
| product_name | VARCHAR(255) | NOT NULL | The name of the enterprise software/hardware item sold. |
| amount | DOUBLE | NOT NULL | Transaction value in USD ($). |
| region | VARCHAR(255) | NOT NULL | Geographical business market (e.g., North America, Europe). |
| timestamp | DATETIME | NOT NULL | Exact transactional timestamp. |

3. Table: resiliency_logs
Our comprehensive audit trail. This table keeps a record of every failover attempt, model latency, cache hit, system blackout, and active failover level to feed our visual timeline.

| Column Name | Data Type | Constraints / Attributes | Description |
|---|---|---|---|
| id | BIGINT | PRIMARY KEY, AUTO_INCREMENT | Unique log sequence identifier. |
| model_used | VARCHAR(255) | NOT NULL | The exact LLM node used (e.g., Gemma-4, GLM-4.5, Llama-3.1-405B). |
| status | VARCHAR(255) | NOT NULL | Status outcome (SUCCESS, FAILOVER, SHIELD_ACTIVE). |
| tier | VARCHAR(255) | NOT NULL | Target tier evaluated (Primary, Secondary, L1 Cache, Shield). |
| latency_ms | INTEGER | NOT NULL | Total response round-trip time in milliseconds. |
| error_details | VARCHAR(500) | NULLABLE | Detailed stack trace snippets or timeout messages during outages. |
| timestamp | DATETIME | NOT NULL | Timestamp of the logged event. |

Production 3NF Normalization vs. Hackathon Agility
When judges ask about our architectural decisions and relational normal form (3NF), here is our design rationale:
- Hackathon Agility (Hybrid Mock Execution):
  - Currently, product schemas and stock details (like returning `stock = 42` for any requested product name via the `get_inventory_status` tool) are executed using a highly deterministic JSON mock mapping layer in `ModelService.java`.
  - When the tool is invoked, it logs a simulated query (`SELECT stock FROM inventory WHERE product = 'args'`) directly to our live observability stream. This approach ensures rapid development, zero runtime DB connection lockouts for LLM agent loops, and allows us to verify SQL Sanitizer parameters without crashing the database.
- Production 3NF Schema (Relational Normalization):
  - For our full production architecture, we designed a decoupled, 3rd Normal Form schema separating Product Catalog Metadata from Live Inventory Stock Levels.
  - By implementing a 1-to-Many (`1:N`) relationship where one Product SKU maps to multiple stock levels in different Warehouses, we eliminate data redundancy and maximize transaction throughput.
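The mock execution described in the first bullet can be pictured with a short Java sketch. The `MockInventoryTool` class below is hypothetical and is not the code in `ModelService.java`; it only mirrors the described behavior of returning a fixed stock of 42 and logging a simulated query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mocked get_inventory_status tool; no database is touched.
final class MockInventoryTool {

    private final List<String> observabilityLog = new ArrayList<>();

    /** Every product reports stock = 42, as in the mock mapping described above. */
    Map<String, Object> getInventoryStatus(String product) {
        // The SQL text is only logged for display and is never executed,
        // which is the only reason string concatenation is acceptable here.
        observabilityLog.add("SELECT stock FROM inventory WHERE product = '" + product + "'");
        return Map.of("product", product, "stock", 42);
    }

    /** Read-only copy of the simulated queries, e.g. for streaming to a dashboard. */
    List<String> log() {
        return List.copyOf(observabilityLog);
    }
}
```

Keeping the simulated query as a log line means a sanitizer can be exercised against hostile arguments while the real tables stay out of reach.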
Production Schema Blueprint:

    -- Product Catalog Metadata Table
    CREATE TABLE products (
        id BIGINT PRIMARY KEY AUTO_INCREMENT,
        sku VARCHAR(255) UNIQUE NOT NULL,
        product_name VARCHAR(255) NOT NULL,
        description TEXT,
        price DOUBLE NOT NULL
    );

    -- Decoupled Physical Stock Table (3NF Normalization)
    CREATE TABLE inventory (
        id BIGINT PRIMARY KEY AUTO_INCREMENT,
        product_id BIGINT NOT NULL,
        count INT NOT NULL,
        warehouse VARCHAR(255) NOT NULL,
        FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE CASCADE
    );

Challenges we ran into
- Cache vs. Live Tool Execution: One of the toughest challenges was ensuring the orchestrator correctly distinguished between a TrueFoundry semantic cache hit and a live, tool-driven execution. We had to carefully manage HTTP headers and response metadata to map out the exact provenance of the data.
- Taming Chaos and Timeouts: Implementing strict 30-second failover timeouts without stranding the user's request required intricate asynchronous programming and thread management in Spring Boot.
- Overriding LLM Hallucinations: Forcing the LLMs to use our backend tools (like querying the database for live sales data) rather than guessing answers based on their training data took heavy prompt engineering and strict JSON schema enforcement.
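For the first challenge above, telling a cache hit from a live execution, the classification step might be sketched as follows. The header name `x-example-cache-status` is a placeholder invented for this example and is not a documented TrueFoundry header; the `ProvenanceClassifier` class is likewise hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: the header name below is a placeholder, not a real gateway header.
final class ProvenanceClassifier {

    enum Source { SEMANTIC_CACHE, LIVE_TOOL, LIVE_MODEL }

    // Placeholder for illustration only; consult the gateway documentation for the real name.
    static final String CACHE_HEADER = "x-example-cache-status";

    /** Assumes header names in the map have already been lowercased by the caller. */
    static Source classify(Map<String, String> responseHeaders, boolean toolWasInvoked) {
        String cache = responseHeaders.getOrDefault(CACHE_HEADER, "");
        if (cache.equalsIgnoreCase("hit")) {
            return Source.SEMANTIC_CACHE;
        }
        return toolWasInvoked ? Source.LIVE_TOOL : Source.LIVE_MODEL;
    }
}
```

Checking the cache signal first reflects the assumption that a cached response never ran a tool in the current request, even if the original response did.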
Accomplishments that we're proud of
We are incredibly proud of the Emergency Shield Mode and the transparent Observability Dashboard. The ability to completely sever the connection to the internet and watch the system seamlessly fail over from an LLM to local Redis, and finally to a direct MySQL query, all while the React UI visually highlights exactly which layer saved the system, is a magical experience. We built an AI app that survives without AI.
What we learned
We learned that building enterprise AI is less about picking the smartest model and more about architecting the smartest infrastructure around that model. We gained deep insights into semantic caching strategies, AI Gateway configurations, and the absolute necessity of deterministic fallbacks in non-deterministic systems.
What's next for Sentinel Fabric
- Production Framework Integrations (LangChain / LangGraph): While Spring Boot manages the core resilience and failover logic, we plan to integrate enterprise agentic frameworks like LangChain and LangGraph directly on top of our fabric to build complex, state-aware agentic graphs.
- Local LLM for Semantic Queries: Deploy and integrate local, lightweight LLMs (like Llama 3 or Phi 3) to process semantic queries locally, minimizing dependency on external APIs and providing high-fidelity, secure local fallback reasoning.
- Kubernetes Failover Orchestration: Containerize the local components and implement Kubernetes failover handling for the local setup (Redis, MySQL, and Backend service) to guarantee maximum availability and self-healing when local infrastructure nodes experience outages.
- Script-Based Automated Self-Healing: Expand the Agentic Tool capabilities to include automated self-healing scripts. If a health check detects a downed node, the agent will automatically write and execute a local PowerShell or Bash script to restart the service and restore operations.
- Live Legacy Database Agentic Integration: Currently, our legacy Sales MySQL database feeds the observability dashboard. Next, we aim to give our LLM agents direct, read-only MCP access to this database, allowing users to perform complex, natural language data analytics on live, real-time business metrics.
- Dynamic Cost Routing: Automatically route simple queries to cheaper, local models while reserving heavy compute models strictly for complex analytical tasks, based on TrueFoundry's cost metrics.