Inspiration

Enterprise adoption of Splunk Observability frequently runs into massive technical and financial friction points:

  1. Compounding Ingestion Costs & Billing Unpredictability: Splunk's volume-ingest pricing models cause unpredictable billing spikes during microservice error storms. SRE teams are heavily penalized because Log Observer Connect queries already-indexed logs, forcing organizations to pay full volume premiums at the indexing edge before any filtering can be applied.
  2. The "AIOps Paradox" (Data Quality Degradation): AI troubleshooting tools are strictly bottlenecked by underlying telemetry data quality. Splunk’s schema-on-read architecture delays structure validation until query execution. If log formats are inconsistent or trace IDs are missing, AI diagnostics suffer from accuracy drops, regex timeouts, or hallucinations.
  3. Critical Security Gaps in Agentic Workflows (MCP Threats): Connecting autonomous AI agents to internal systems via the Model Context Protocol (MCP) introduces dangerous threat vectors. Centralized MCP servers require highly privileged tokens, exposing the system to key exploits identified by the Coalition for Secure AI (CoSAI):
     • MCP-T4 (Input/Instruction Boundary Failure): Indirect prompt injections via raw logs or support tickets that hijack an agent into executing malicious commands.
     • MCP-T5 (Inadequate Data Protection): Plaintext API keys, PII, or credentials leaking into debug logs and being pulled directly into the LLM's context window.
     • MCP-T6 (Missing Integrity Controls): Post-deployment modifications or "shadow" MCP servers spoofing tool definitions to silently intercept tokens.
     • MCP-T8 (Network Binding Failures): Binding to insecure interfaces (0.0.0.0), exposing internal REST management APIs.
     • MCP-T9 (Trust Boundary Failures): Confused deputy scenarios where compromised agents blindly use elevated privileges to bypass read/write boundaries or erase databases.
  4. Alert Fatigue & Manual Investigation Bottlenecks: On-call engineers lose critical hours context-switching across disconnected dashboards to figure out "what changed," which delays recovery and drives up MTTR.


What it does

Project AegisOps is an MCP-enabled agentic middleware proxy that bridges the gap between the Splunk MCP Server, the Splunk MCP TA, and local OpenTelemetry collector configurations. It acts as an intelligent, secure, and cost-aware layer operating via an autonomous causal loop:

  • Telemetry Purifier Engine: Monitors index volume run-rates and dynamically generates OpenTelemetry processor configurations to filter duplicate debug logs when ingestion volume spikes.
  • Telemetry Quality Validator: Profiles schemas against CIM standard formats at the edge to prevent AI agent processing failures and broken causal graphs.
  • Semantic MCP Firewall: Evaluates the semantic intent of incoming JSON-RPC 2.0 tool calls in real time. It calculates a dynamic risk index and blocks unauthorized or malicious commands before execution, returning a standardized error response.
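
As a rough illustration of the firewall's block path, the sketch below screens an MCP `tools/call` request and returns a JSON-RPC 2.0 error object. It is not the AegisOps implementation: the semantic risk scoring is replaced here by a static deny list, and the tool names, the error code value, and the `screen_tool_call` helper are all hypothetical.

```python
import json
from typing import Optional

# Hypothetical deny list standing in for the semantic risk evaluation.
BLOCKED_TOOLS = {"delete_index", "restart_service"}

# JSON-RPC 2.0 reserves -32000 to -32099 for implementation-defined server
# errors; picking -32000 for policy violations is an assumption of this sketch.
POLICY_VIOLATION = -32000


def screen_tool_call(raw: str) -> Optional[dict]:
    """Return a JSON-RPC error response if the call is blocked, else None."""
    request = json.loads(raw)
    # Only MCP tool invocations are screened; other methods pass through.
    if request.get("method") != "tools/call":
        return None
    tool = request.get("params", {}).get("name")
    if tool in BLOCKED_TOOLS:
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),  # echo the request id, as the spec requires
            "error": {
                "code": POLICY_VIOLATION,
                "message": f"Tool call blocked by policy: {tool}",
            },
        }
    return None
```

A `None` result means the proxy would forward the request unchanged; any other result is sent back to the caller in place of the tool's response.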

How we built it

We built AegisOps as a secure proxy containerized via Docker. We integrated the official read-only Splunk MCP Server alongside an open-source FastMCP implementation to manage multi-mode transport capabilities (STDIO, SSE, and REST API modes). The core firewall checks incoming natural language instructions and uses cosine-similarity scoring to compare prompts against safe operational policy boundaries. To guarantee safety, we implemented a strict Human-in-the-Loop (HITL) governance mechanism that integrates with Slack/Teams to halt medium/high-risk tasks (like microservice restarts or KV store deletions) until an SRE clicks "Approve."
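
To make the cosine-similarity check concrete, here is a minimal sketch that scores a prompt against a small set of safe policy statements and flags low-similarity prompts for human approval. Everything specific in it is an assumption: a bag-of-words vector stands in for whatever embedding the firewall actually uses, and the policy texts, the 0.5 threshold, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical examples of "safe" operational intents.
SAFE_POLICIES = [
    "search error logs for a service",
    "list recent alerts for a service",
]


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_policy_similarity(prompt: str) -> float:
    """Similarity to the closest safe policy statement."""
    vector = bag_of_words(prompt)
    return max(cosine_similarity(vector, bag_of_words(p)) for p in SAFE_POLICIES)


def needs_human_approval(prompt: str, threshold: float = 0.5) -> bool:
    """Prompts far from every safe policy are held for an SRE to approve."""
    return max_policy_similarity(prompt) < threshold
```

In this sketch a prompt such as "delete the kv store collection" shares no tokens with either policy, so it would be held for approval, while "search error logs for checkout service" stays close to the first policy and would pass.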


Challenges we ran into

A major technical challenge was adding semantic evaluation without introducing noticeable latency, so that searches stay fast during live incidents. The firewall has to evaluate the following risk index in real time:

$$\text{Risk Score} = w_1(\text{Command Entropy}) + w_2(\text{Access Privilege Level}) + w_3(\text{Semantic Deviation})$$

If the score exceeds the system threshold, the transaction must freeze immediately. Tuning the weights to avoid false positives during high-velocity data correlation, while still protecting fully against CoSAI trust boundary failures, required rigorous testing and simulated injection attacks.
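
The formula can be sketched in a few lines of Python. The write-up does not define how each term is measured, so this sketch makes its own assumptions: command entropy is the Shannon entropy of the command's characters scaled to [0, 1], the privilege level and semantic deviation are supplied already normalized to [0, 1], and the weights and the 0.6 threshold are placeholder values, not the tuned ones.

```python
import math
from collections import Counter
from dataclasses import dataclass


def normalized_entropy(command: str) -> float:
    """Shannon entropy of the command's characters, scaled to [0, 1]."""
    n = len(command)
    if n < 2:
        return 0.0
    counts = Counter(command)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # log2(n) is the maximum entropy of a string with n characters.
    return entropy / math.log2(n)


@dataclass(frozen=True)
class RiskWeights:
    # Placeholder weights (w1, w2, w3); the real values come from tuning.
    entropy: float = 0.2
    privilege: float = 0.5
    deviation: float = 0.3


def risk_score(
    command: str,
    privilege_level: float,
    semantic_deviation: float,
    weights: RiskWeights = RiskWeights(),
) -> float:
    """Weighted sum of the three terms in the risk index formula."""
    return (
        weights.entropy * normalized_entropy(command)
        + weights.privilege * privilege_level
        + weights.deviation * semantic_deviation
    )


def should_freeze(score: float, threshold: float = 0.6) -> bool:
    """Freeze the transaction when the score exceeds the threshold."""
    return score > threshold
```

Because the weights sum to 1 and every term lies in [0, 1], the score in this sketch also stays in [0, 1], which makes a single fixed threshold easy to reason about.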


Accomplishments that we're proud of

  • Built an active inline firewall capable of blocking 100% of unvetted, high-risk file and index operations.
  • Achieved a 30% to 50% reduction in redundant debug log volumes, effectively stabilizing cloud infrastructure costs.
  • Designed a pipeline that slashes incident response MTTR by up to 50% by eliminating manual telemetry enrichment bottlenecks.
  • Successfully married multi-agent automation with strict human oversight using real-time Slack approval interactions.

What we learned

We discovered that relying on a schema-on-read mechanism leaves AI agents exposed to corrupted data during active incidents. Structural validation must happen at the ingestion layer. Furthermore, we realized that traditional network firewalls are blind to agent workflows; because malicious commands look like valid JSON-RPC structures, security boundaries must evolve to evaluate semantic intent.
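
A minimal sketch of what validation at the ingestion layer could look like: each record is checked for required fields and a well-formed trace ID before it is forwarded. The field names and the `validate_record` helper are hypothetical and are not taken from CIM or from the AegisOps code; the trace ID rule follows the OpenTelemetry convention of 32 lowercase hex characters that are not all zeros.

```python
import re

# Hypothetical required fields for an incoming log record.
REQUIRED_FIELDS = ("timestamp", "severity", "body", "trace_id")

# OpenTelemetry trace IDs are 16 bytes, rendered as 32 lowercase hex characters.
TRACE_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is accepted."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    trace_id = record.get("trace_id")
    if trace_id is not None:
        if not isinstance(trace_id, str) or not TRACE_ID_PATTERN.fullmatch(trace_id):
            problems.append("trace_id is not 32 lowercase hex characters")
        elif set(trace_id) == {"0"}:
            # An all-zero trace ID is defined as invalid.
            problems.append("trace_id is all zeros")
    return problems
```

Records that return a non-empty list could be dropped, quarantined, or tagged before an agent ever queries them, instead of surfacing as failures at query time.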


What's next for AegisOps

We intend to scale AegisOps by creating automated OpenTelemetry remediation processors that patch missing trace IDs on the fly. We also plan to expand our out-of-the-box policy blueprints to cover multi-cloud Kubernetes environments, ensuring that generative AI can be safely and cost-effectively operationalized across heavily regulated enterprise sectors.


Updates

Update: Enterprise Readiness & Security Hardening

We are excited to announce a new update to AegisOps! This release significantly hardens the agentic SRE infrastructure against prompt injections, runaway execution loops, and credential leaks, making the platform enterprise-ready.

All of these new features are now available for testing in the feature/enterprise-readiness branch.

What's New?

  • Hardened Semantic MCP Firewall
  • Contextual Isolation
  • Dynamic OTel Edge Processing
  • Inline DLP Masking
  • Strict Memory Determinism
  • Episodic Memory

Test the Chaos Incidents

To verify these enterprise defenses, we've included three new mock chaos incidents.
