Inspiration
Bragging rights, possible prize money, education, experience points, the pursuit of knowledge, skill refinement and identification, and the opportunity to contribute to the world of cybernetworks and cybersecurity.
What It Does
SignalSage is an AI-powered incident investigation copilot that connects to your live Splunk instance and automates the entire workflow from detection to resolution.
When an incident occurs, you simply point SignalSage at a service and time window. It then:
- Automatically generates and executes 12 targeted SPL queries in parallel
- Collects evidence across:
  - Logs
  - Metrics
  - Traces
  - Deployment events
- Normalizes all evidence into a unified timeline
It then performs ML-powered analysis using Splunk's Machine Learning Toolkit:
- Anomaly detection (z-score)
- Log clustering
- Cross-signal correlation
- Latency distribution analysis
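To make the z-score step concrete, here is a minimal TypeScript sketch of the idea. It is an illustration only: SignalSage runs this analysis through Splunk's Machine Learning Toolkit, and the `Point` type, the `zScoreAnomalies` helper, and the default threshold of 3 are assumptions made for this example.

```typescript
// Hypothetical data point: a timestamp and a metric value (e.g. latency in ms).
interface Point {
  time: number;
  value: number;
}

// Returns the points whose z-score (distance from the mean, measured in
// population standard deviations) is strictly greater than the threshold.
function zScoreAnomalies(points: Point[], threshold = 3): Point[] {
  if (points.length < 2) return [];
  const mean = points.reduce((sum, p) => sum + p.value, 0) / points.length;
  const variance =
    points.reduce((sum, p) => sum + (p.value - mean) ** 2, 0) / points.length;
  const std = Math.sqrt(variance);
  if (std === 0) return []; // A flat series has no outliers.
  return points.filter((p) => Math.abs(p.value - mean) / std > threshold);
}
```

With short windows a single spike inflates the standard deviation, so a lower threshold (for example 2) may be needed before anything is flagged.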
Output
SignalSage produces:
- Ranked root cause hypotheses with confidence scores
- Prioritized remediation playbooks, including:
  - Risk levels
  - Estimated resolution times
  - Human approval gates for high-risk actions
Key Features
- “Remediate Now”: Demonstrates autonomous AI-agent-driven incident response
- Real-time monitoring dashboard: Auto-refreshes from Splunk
- Ask Splunk Assistant:
  - Query data in plain English
  - Receive intelligent explanations (not raw tables)
- Post-incident report generator:
  - One-click export
  - Markdown output (Confluence/Jira-ready)
Impact
SignalSage reduces Mean Time to Understand (MTTU) by replacing:
- Manual dashboard switching
- Writing SPL queries by hand
- Mental cross-signal correlation
Result: A 45-minute manual investigation becomes a 30-second automated pipeline.
How We Built It
We built SignalSage using:
- Next.js 14
- TypeScript
- Tailwind CSS
Backend Architecture
- Connects to Splunk Enterprise via REST API (port 8089)
- Uses JWT authentication
- Executes SPL queries via:
  - Traditional search job lifecycle
  - Faster oneshot export mode
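As a rough illustration of the oneshot path, the sketch below builds the HTTP request for a blocking search against the Splunk REST API. It assumes the `search/jobs` endpoint with `exec_mode=oneshot` and a bearer token in the `Authorization` header; the project may use the `search/jobs/export` endpoint instead. The `SplunkRequest` type and `buildOneshotRequest` helper are hypothetical names for this example.

```typescript
// Hypothetical request descriptor; pass it to fetch or any other HTTP client.
interface SplunkRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a oneshot search request: Splunk runs the search and returns the
// results in the same response, so no job polling is needed.
function buildOneshotRequest(
  baseUrl: string,
  token: string,
  spl: string,
  earliest = "-15m",
  latest = "now"
): SplunkRequest {
  const trimmed = spl.trim();
  // Searches sent over REST must start with a generating command or "search".
  const search = /^(\||search\s)/i.test(trimmed) ? trimmed : `search ${trimmed}`;
  const params: Record<string, string> = {
    search,
    exec_mode: "oneshot",
    output_mode: "json",
    earliest_time: earliest,
    latest_time: latest,
  };
  const body = Object.keys(params)
    .map((k) => `${encodeURIComponent(k)}=${encodeURIComponent(params[k])}`)
    .join("&");
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/services/search/jobs`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body,
  };
}
```

Keeping request construction separate from the network call makes the query parameters easy to unit test without a live Splunk instance.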
Investigation Pipeline
- Query Generator → 12 targeted SPL queries per incident
- Live Evidence Collector → Executes queries in parallel
- Evidence Normalizer → Converts results into typed data
- Root Cause Analyzer:
  - Uses 7 scoring models
  - Enhanced with Splunk MLTK:
    - Anomaly detection
    - Clustering
    - Forecasting
    - Outlier detection
- Remediation Engine → Maps hypotheses to playbooks
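The parallel collection step can be sketched as follows. This is a simplified illustration, not the project's code: `NamedQuery`, `QueryResult`, `Runner`, and `collectEvidence` are hypothetical names, and the runner is injected so the sketch makes no assumption about how each query reaches Splunk.

```typescript
// Hypothetical shapes for a named SPL query and its outcome.
interface NamedQuery {
  name: string;
  spl: string;
}

interface QueryResult {
  name: string;
  rows: unknown[];
  error?: string;
}

// Any function that executes one SPL string and resolves with result rows.
type Runner = (spl: string) => Promise<unknown[]>;

// Starts every query at once and waits for all of them. A failing query is
// recorded with its error instead of rejecting the whole batch.
function collectEvidence(
  queries: NamedQuery[],
  run: Runner
): Promise<QueryResult[]> {
  return Promise.all(
    queries.map((q) =>
      run(q.spl).then(
        (rows): QueryResult => ({ name: q.name, rows }),
        (err): QueryResult => ({ name: q.name, rows: [], error: String(err) })
      )
    )
  );
}
```

Isolating failures per query means one slow or malformed search does not discard the evidence gathered by the others.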
Additional Components
- Splunk MCP server (development-time querying)
- Natural language → SPL interface
Frontend
- Tabbed investigation workflow
- Real-time dashboard (auto-refresh)
- Conversational AI assistant
Data Ingestion
- Uses HEC (HTTP Event Collector)
- Custom scripts generate realistic observability data:
  - Logs, metrics, traces, deployments
- All events use timestamps relative to “now” for freshness
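A minimal sketch of how events with "now"-relative timestamps might be prepared for HEC is shown below. It assumes the standard HEC event envelope (`time` in epoch seconds, `sourcetype`, `event`) and batching by concatenating JSON objects in one request body. The helper names and the example sourcetype are hypothetical.

```typescript
// Standard HEC envelope fields used in this sketch.
interface HecEvent {
  time: number; // Epoch seconds
  sourcetype: string;
  event: Record<string, unknown>;
}

// Builds one event stamped `secondsAgo` before the supplied "now".
function buildHecEvent(
  nowMs: number,
  secondsAgo: number,
  sourcetype: string,
  event: Record<string, unknown>
): HecEvent {
  return { time: (nowMs - secondsAgo * 1000) / 1000, sourcetype, event };
}

// HEC accepts several event objects in a single request body.
function toHecBody(events: HecEvent[]): string {
  return events.map((e) => JSON.stringify(e)).join("\n");
}

// Usage: three log events spread over the last two minutes.
const now = Date.now();
const batch = [120, 60, 0].map((ago) =>
  buildHecEvent(now, ago, "app:log", { level: "ERROR", msg: "timeout" })
);
const hecBody = toHecBody(batch);
```

The resulting body would be sent with a POST to the collector's event endpoint on port 8088, using an `Authorization: Splunk <token>` header.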
Challenges We Ran Into
Splunk AI Assistant Integration
- Successfully decoded cloud token and tenant/API structure
- Blocked by an OAuth2 issue: the client_credentials grant rejected the tenant ID as client_id
- Python SDK (splunk-cloud-sdk) is incompatible with Python 3.13
✅ Integration code is complete, but blocked on authentication
Performance Issues
The app frequently froze due to heavy UI effects:
- backdrop-blur on many elements
- Global will-change usage
- 50 confetti elements
- Staggered animations across 100+ components
✅ Solution: Removed GPU-heavy effects and replaced with lighter alternatives
Query Performance
- Splunk polling model:
  - 1 request/second
  - Up to 60 seconds latency
✅ Fixed using oneshot export mode
UI Glitch
Pulsing green border caused white flashes due to:
- Hover state conflicts
- Brightness filters
- Inset box-shadow interactions
✅ Required multiple iterations to resolve
Accomplishments We’re Proud Of
- ✅ Fully connected to a real Splunk instance (not a demo)
- ✅ 12-query parallel pipeline produces meaningful results
- ✅ ML-powered root cause analysis works on live data
- ✅ “Remediate Now” demonstrates autonomous incident response
- ✅ Natural language assistant explains results clearly
Performance Milestone
End-to-end workflow completes in under 30 seconds:
- Evidence collection
- Root cause ranking
- Remediation playbooks
- Post-incident report generation
Production Readiness
- Input validation
- SPL injection prevention
- Credential masking
- Time window limits
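The safeguards above could look roughly like the sketch below. The allowlist pattern, the 24-hour window cap, the `service` field name, and all helper names are assumptions chosen for illustration, not the project's actual rules.

```typescript
// Allowlist (assumed policy): letters, digits, underscore, dot, hyphen; 1 to 64 chars.
const SERVICE_NAME = /^[A-Za-z0-9_.-]{1,64}$/;
// Assumed upper bound on the investigation window: 24 hours.
const MAX_WINDOW_MS = 24 * 60 * 60 * 1000;

// Returns a list of validation errors; an empty list means the input is acceptable.
function validateInput(service: string, startMs: number, endMs: number): string[] {
  const errors: string[] = [];
  if (!SERVICE_NAME.test(service)) {
    errors.push("service name contains disallowed characters");
  }
  if (!(endMs > startMs)) {
    errors.push("time window must end after it starts");
  } else if (endMs - startMs > MAX_WINDOW_MS) {
    errors.push("time window exceeds the maximum allowed span");
  }
  return errors;
}

// Only interpolates a value into SPL after it has passed the allowlist,
// so quotes, pipes, and brackets can never reach the query text.
function serviceFilter(service: string): string {
  if (!SERVICE_NAME.test(service)) throw new Error("invalid service name");
  return `service="${service}"`;
}

// Shows at most the last four characters of a secret in logs or UI.
function maskSecret(secret: string): string {
  return secret.length <= 4 ? "****" : `****${secret.slice(-4)}`;
}
```

An allowlist is used instead of escaping because it rejects anything unexpected by default, which is simpler to reason about than enumerating dangerous characters.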
What We Learned
- The gap between a demo and a product is performance
- Heavy UI effects (blur, glassmorphism, animations):
  - Look good in screenshots
  - Hurt real-world usability
Key Technical Learnings
- Splunk REST API is:
  - Powerful
  - Designed for asynchronous workflows
  - ✅ Required:
    - Oneshot export mode
    - Parallel execution
- Rule-based NL → SPL works for ~80% of use cases
- Users value clear explanations over perfect query translation
- Splunk cloud AI is:
  - Powerful
  - Difficult to integrate compared to on-prem
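To show what rule-based NL → SPL can mean in practice, here is a small sketch with two pattern rules. The rules, index names (`app_logs`, `app_metrics`), and field names are hypothetical placeholders; the real rule set is not described in this write-up.

```typescript
// A rule pairs a pattern over the user's question with an SPL template.
interface Rule {
  pattern: RegExp;
  toSpl: (match: RegExpMatchArray) => string;
}

// Hypothetical rules; the capture group only admits safe service-name characters.
const RULES: Rule[] = [
  {
    pattern: /errors? (?:for|in) ([\w.-]+)/i,
    toSpl: (m) =>
      `index=app_logs service="${m[1]}" level=ERROR | stats count by message`,
  },
  {
    pattern: /latency (?:for|of) ([\w.-]+)/i,
    toSpl: (m) =>
      `index=app_metrics service="${m[1]}" | timechart avg(latency_ms)`,
  },
];

// Returns the SPL for the first matching rule, or null when no rule applies
// (the case where a fallback such as an LLM would be needed).
function nlToSpl(question: string): string | null {
  for (const rule of RULES) {
    const match = question.match(rule.pattern);
    if (match) return rule.toSpl(match);
  }
  return null;
}
```

A null result marks the questions that fall outside the rules, which is where an explanation layer or LLM-powered generation would take over.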
What’s Next for SignalSage
Immediate Next Step
- Complete Splunk AI Assistant integration
  - Awaiting proper OAuth2 credentials
- Enables:
  - LLM-powered SPL generation
  - Advanced explanations
Near-Term Roadmap
- Make “Remediate Now” fully functional:
  - Kubernetes rollbacks
  - Feature flag toggles
  - Connection pool scaling
  - Human approval workflows
- Add real-time alerting:
  - Auto-trigger investigations
  - Shift to proactive operations
Longer-Term Vision
- Multi-tenant support
- Team collaboration:
  - Shared investigations
  - @mentions
  - Handoffs
- Continuous learning system:
  - Improve root cause scoring from confirmed cases
  - Build institutional knowledge
  - Accelerate future incident resolution
Built With
- Languages: TypeScript, JavaScript, SPL (Search Processing Language), CSS
- Frameworks: Next.js 14 (React 18, App Router), Tailwind CSS 3, Node.js 24, Jest (testing)
- Platforms: Splunk Enterprise 10.2.3 (local instance), Railway (deployment)
- APIs & Services:
  - Splunk REST API (port 8089) → search job creation, polling, oneshot export
  - Splunk HTTP Event Collector (port 8088) → data ingestion
  - Splunk Machine Learning Toolkit (MLTK v5.7.4) → anomaly detection, clustering, forecasting
  - Splunk AI Assistant (cloud-connected, pending OAuth2 approval)
  - OpenAI API (gpt-4o-mini) → fallback AI summaries and explanations
  - Splunk MCP server → development-time query interface
  - Web Audio API → synthesized UI sound effects
- Libraries: Zod (runtime validation), uuid, sharp (image processing), OpenAI SDK
- Key techniques: JWT token authentication, oneshot synchronous search mode, parallel query execution, rule-based NL-to-SPL conversion with AI explanation layer, ML-boosted confidence scoring (z-score anomaly detection, log clustering, cross-signal correlation), SPL injection prevention (allowlist regex)