Vanguard Ops AI
1. Inspiration
Modern enterprises generate millions of logs, metrics, traces, and alerts every day. While observability platforms provide visibility into system health, engineers still spend significant time manually investigating incidents, correlating events, identifying root causes, and coordinating responses.
During outages, teams often struggle with:
- Alert fatigue caused by excessive notifications.
- Manual log analysis across multiple systems.
- Slow root-cause identification.
- Fragmented operational knowledge.
- Increasing Mean Time To Resolution (MTTR).
- High operational costs due to downtime.
We asked ourselves:
What if operational systems could explain their own failures?
This idea inspired us to build Vanguard Ops AI, an autonomous operational intelligence platform that combines Splunk observability data with AI agents to investigate incidents, identify root causes, recommend fixes, and generate reports automatically.
2. What it does
Vanguard Ops AI acts as an AI-powered operations teammate.
2.1 AI Incident Investigation
- Collects logs, metrics, and alerts.
- Detects abnormal patterns.
- Correlates related events.
- Summarizes incidents automatically.
- Prioritizes critical issues.
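One way the correlation and prioritization steps above could work is sketched below. It is a minimal illustration, not the project's actual logic: the event shape (`service`, `severity`, `timestamp`), the `SEVERITY_RANK` table, and the 60-second window are all assumptions made for the example.

```javascript
// Hypothetical event shape: { service, severity, timestamp } with timestamp in ms.
const SEVERITY_RANK = { critical: 3, warning: 2, info: 1 };

// Group events into incidents: an event joins the current incident when it
// occurs within `windowMs` of the previous event in that incident.
function correlateEvents(events, windowMs = 60000) {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  const groups = [];
  for (const event of sorted) {
    const current = groups[groups.length - 1];
    const last = current && current[current.length - 1];
    if (last && event.timestamp - last.timestamp <= windowMs) {
      current.push(event);
    } else {
      groups.push([event]);
    }
  }
  // Summarize each group by its most severe event and the services involved.
  return groups.map((group) => {
    const top = group.reduce((max, e) =>
      (SEVERITY_RANK[e.severity] ?? 0) > (SEVERITY_RANK[max.severity] ?? 0) ? e : max);
    return {
      severity: top.severity,
      services: [...new Set(group.map((e) => e.service))],
      eventCount: group.length,
      startedAt: group[0].timestamp,
    };
  });
}

// Return a copy of the incidents with the highest severity first.
function prioritize(incidents) {
  return [...incidents].sort(
    (a, b) => (SEVERITY_RANK[b.severity] ?? 0) - (SEVERITY_RANK[a.severity] ?? 0));
}
```

A fixed time window is the simplest grouping rule; a real deployment would likely also group on shared fields such as host or trace ID.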
2.2 Root Cause Intelligence Engine
- Analyzes system behavior.
- Correlates telemetry data.
- Maps service dependencies.
- Identifies probable root causes.
- Generates investigation timelines.
2.3 AI Operations Copilot
Users can ask questions such as:
- Why is latency increasing?
- What changed before the outage?
- Show critical incidents from today.
- Summarize database failures.
- Recommend the next troubleshooting step.
2.4 Smart Remediation Advisor
- Suggests corrective actions.
- Generates operational runbooks.
- Recommends configuration fixes.
- Provides preventive measures.
- Helps reduce MTTR.
2.5 Automated Reporting
- Writes executive summaries.
- Produces technical incident reports.
- Drafts postmortem documents.
- Assesses incident impact.
- Tracks resolution progress.
2.6 Workflow Automation
- Creates tickets automatically.
- Notifies relevant teams.
- Triggers operational workflows.
- Tracks investigations.
- Maintains audit history.
3. How we built it
3.1 Frontend
Built using:
- React
- Vite
- Tailwind CSS
Features:
- AI Copilot Dashboard
- Incident Investigation Workspace
- Analytics & Reporting Interface
- Responsive Enterprise UI
3.2 Backend
Built using:
- Node.js
- Express.js
- REST APIs
Responsibilities:
- Agent orchestration
- Incident processing
- Workflow execution
- Session management
3.3 Splunk Integration
Integrated with:
- Splunk MCP Server
- Splunk Search APIs
- Observability datasets
- Splunk Hosted Models
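As a rough illustration of the Search API side of this integration, the sketch below builds (but does not send) a request to Splunk's REST search export endpoint. The host name, token, and the `buildSearchRequest` helper are placeholders invented for this example; the endpoint path and the `search`, `earliest_time`, `latest_time`, and `output_mode` parameters come from Splunk's REST API, and bearer-token authentication is assumed to be enabled on the instance.

```javascript
// Build a request description for Splunk's search export endpoint.
// Nothing is sent here; pass the result to fetch(url, options) to run it.
function buildSearchRequest({ baseUrl, token, query, earliest = "-15m", latest = "now" }) {
  // The REST endpoint expects the query to begin with a generating command,
  // so prepend "search" unless the query already starts with it or with a pipe.
  const search = /^\s*(search\b|\|)/.test(query) ? query : `search ${query}`;
  const body = new URLSearchParams({
    search,
    earliest_time: earliest,
    latest_time: latest,
    output_mode: "json",
  });
  return {
    url: `${baseUrl}/services/search/jobs/export`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/x-www-form-urlencoded",
      },
      body: body.toString(),
    },
  };
}

// Example with placeholder values (8089 is Splunk's default management port).
const request = buildSearchRequest({
  baseUrl: "https://splunk.example.com:8089",
  token: "YOUR_TOKEN",
  query: "index=main error",
});
```

Keeping request construction separate from the network call makes the query logic easy to test without a live Splunk instance.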
3.4 AI Agent Architecture
Agent 1: Log Analysis Agent
Responsibilities:
- Parse logs
- Detect anomalies
- Identify suspicious events
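The anomaly-detection responsibility could be approximated with a simple z-score check over per-bucket error counts, as in the sketch below. This is a stand-in technique chosen for illustration; the write-up does not say which method the agent uses, and the `detectAnomalies` name and the 2.5 threshold are assumptions.

```javascript
// counts: error counts per time bucket (for example, per minute), oldest first.
// Flags buckets whose count sits more than `threshold` population standard
// deviations above the mean of the window.
function detectAnomalies(counts, threshold = 2.5) {
  const mean = counts.reduce((sum, c) => sum + c, 0) / counts.length;
  const variance =
    counts.reduce((sum, c) => sum + (c - mean) ** 2, 0) / counts.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return []; // a flat series has no outliers
  return counts
    .map((count, index) => ({ index, count, zScore: (count - mean) / stdDev }))
    .filter((point) => point.zScore > threshold);
}
```

With a population standard deviation, no z-score in a window of n buckets can exceed sqrt(n - 1), so short windows need a lower threshold than the conventional 3.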
Agent 2: Root Cause Agent
Responsibilities:
- Correlate telemetry
- Analyze dependencies
- Generate root-cause hypotheses
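To make the dependency analysis concrete, the sketch below shows one heuristic for generating root-cause hypotheses: among the failing services, prefer those whose own dependencies are all healthy, and rank them by how many other failures they could explain. The topology format and the `rankRootCauses` and `dependsOn` helpers are hypothetical, and the heuristic is an assumption rather than the agent's documented behavior.

```javascript
// dependencies: service name -> list of services it calls directly.
// Returns true when `service` reaches `target` through its dependency chain.
function dependsOn(dependencies, service, target, seen = new Set()) {
  if (seen.has(service)) return false; // guard against dependency cycles
  seen.add(service);
  return (dependencies[service] ?? []).some(
    (dep) => dep === target || dependsOn(dependencies, dep, target, seen));
}

// Candidates are failing services with no failing direct dependency.
// Each is annotated with the other failing services that depend on it.
function rankRootCauses(dependencies, unhealthy) {
  const failing = new Set(unhealthy);
  return [...failing]
    .filter((service) =>
      !(dependencies[service] ?? []).some((dep) => failing.has(dep)))
    .map((service) => ({
      service,
      explains: [...failing].filter(
        (other) => other !== service && dependsOn(dependencies, other, service)),
    }))
    .sort((a, b) => b.explains.length - a.explains.length);
}
```

The output is a list of hypotheses, not a verdict: a healthy-looking dependency can still be the true cause if its own telemetry is missing.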
Agent 3: Remediation Agent
Responsibilities:
- Generate recommendations
- Suggest fixes
- Create runbooks
Agent 4: Reporting Agent
Responsibilities:
- Write incident summaries
- Produce executive reports
- Draft postmortems
Agent 5: Workflow Agent
Responsibilities:
- Automate tasks
- Trigger actions
- Coordinate operational workflows
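The five agents can be pictured as a sequential pipeline over a shared context object, as in the minimal sketch below. The agent bodies are stubs standing in for real model and Splunk calls, and the `investigate` function, the context fields, and the `trace` list are assumptions introduced for this example.

```javascript
// Each agent is an async function that reads the shared context and returns
// the fields it contributes. The bodies are stubs, not real analysis.
const agents = [
  { name: "logAnalysis",
    run: async (ctx) => ({ anomalies: ctx.logs.filter((l) => l.level === "error") }) },
  { name: "rootCause",
    run: async (ctx) => ({
      hypothesis: ctx.anomalies.length ? `errors in ${ctx.anomalies[0].service}` : null,
    }) },
  { name: "remediation",
    run: async (ctx) => ({
      recommendations: ctx.hypothesis ? [`Investigate: ${ctx.hypothesis}`] : [],
    }) },
  { name: "reporting",
    run: async (ctx) => ({
      summary: `${ctx.anomalies.length} anomalous events, ` +
        `${ctx.recommendations.length} recommendations`,
    }) },
  { name: "workflow",
    run: async (ctx) => ({
      actions: ctx.recommendations.map((r) => ({ type: "ticket", title: r })),
    }) },
];

// Run the agents in order, merging each output into the context and
// recording which agent produced which fields.
async function investigate(logs, pipeline = agents) {
  let context = { logs, trace: [] };
  for (const agent of pipeline) {
    const output = await agent.run(context);
    context = {
      ...context,
      ...output,
      trace: [...context.trace, { agent: agent.name, produced: Object.keys(output) }],
    };
  }
  return context;
}
```

Recording a trace entry per agent is one inexpensive way to keep the investigation explainable, since every field in the final result can be attributed to the agent that produced it.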
4. Challenges we ran into
Challenge 1: Operational Data Complexity
- Large volumes of logs.
- Noisy telemetry.
- Unstructured information.
Challenge 2: Agent Coordination
- Context sharing between agents.
- Workflow orchestration.
- Consistent decision-making.
Challenge 3: Context Management
- Handling large datasets.
- Maintaining investigation history.
- Delivering concise responses.
Challenge 4: Explainability
- Building user trust.
- Making AI reasoning transparent.
- Providing actionable recommendations.
Challenge 5: User Experience
- Reducing information overload.
- Simplifying investigations.
- Maintaining enterprise-grade usability.
5. Accomplishments that we're proud of
Achievement 1
Built a complete AI-powered operational intelligence platform.
Achievement 2
Implemented autonomous incident investigation workflows.
Achievement 3
Created a multi-agent architecture for root-cause analysis.
Achievement 4
Integrated AI-powered remediation recommendations.
Achievement 5
Developed automated reporting and postmortem generation.
Achievement 6
Designed an enterprise-grade dashboard experience.
Achievement 7
Demonstrated how AI can actively participate in operations instead of simply monitoring systems.
6. What we learned
Lesson 1
AI performs best when integrated into structured workflows.
Lesson 2
Specialized agents improve reliability and explainability.
Lesson 3
Operational context is essential for meaningful insights.
Lesson 4
User trust depends on transparent AI reasoning.
Lesson 5
Observability data becomes significantly more valuable when combined with intelligent automation.
Lesson 6
Agentic systems represent the future of operational intelligence.
7. What's next for Vanguard Ops AI
Phase 1: Predictive Intelligence
- Predict incidents before they occur.
- Detect early warning signals.
- Forecast system failures.
Phase 2: Autonomous Remediation
- Execute approved fixes automatically.
- Reduce manual intervention.
- Accelerate recovery times.
Phase 3: Security Operations Integration
- Threat detection.
- Security investigations.
- Incident response automation.
Phase 4: Enterprise Intelligence Layer
- Knowledge graph integration.
- Organizational memory.
- Historical incident learning.
Phase 5: Multi-Agent Ecosystem
- Operational agents.
- Security agents.
- Platform agents.
- Collaboration between specialized AI systems.
Final Vision
Vanguard Ops AI transforms operational data into autonomous intelligence, helping organizations move from reactive monitoring to proactive, AI-driven operations.
Built With
- ai-agents
- express.js
- github
- javascript
- mongodb
- mongoose
- natural-language-processing
- node.js
- postman
- react
- render
- rest-api
- splunk-enterprise
- splunk-hosted-models
- splunk-mcp-server
- splunk-search-api
- tailwind-css
- vercel
- vite