Inspiration

We were motivated by bragging rights, possible prize money, and the chance to learn: to gain experience, refine our existing skills, discover new ones, and possibly contribute to the world of networking and cybersecurity.

What It Does

SignalSage is an AI-powered incident investigation copilot that connects to your live Splunk instance and automates the entire workflow from detection to resolution.

When an incident occurs, you simply point SignalSage at a service and time window. It then:

  • Automatically generates and executes 12 targeted SPL queries in parallel
  • Collects evidence across:
    • Logs
    • Metrics
    • Traces
    • Deployment events
  • Normalizes all evidence into a unified timeline

Next, it runs ML-powered analysis using Splunk's Machine Learning Toolkit:

  • Anomaly detection (z-score)
  • Log clustering
  • Cross-signal correlation
  • Latency distribution analysis
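To illustrate the z-score check listed above, here is a minimal TypeScript sketch. It is not SignalSage's code or the MLTK implementation; the function name `zScoreAnomalies` and the default threshold are assumptions made for the example.

```typescript
// Hypothetical helper: flags points that sit far from the series mean.
interface Anomaly {
  index: number;
  value: number;
  zScore: number;
}

function zScoreAnomalies(values: number[], threshold = 3): Anomaly[] {
  if (values.length < 2) return [];

  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  // Population variance: average squared distance from the mean.
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return []; // A flat series has no outliers.

  const anomalies: Anomaly[] = [];
  values.forEach((value, index) => {
    const zScore = (value - mean) / stdDev;
    if (Math.abs(zScore) >= threshold) {
      anomalies.push({ index, value, zScore });
    }
  });
  return anomalies;
}
```

With a population standard deviation, the largest possible z-score in a series of n points is sqrt(n - 1), so short windows need a threshold below 3 to flag anything.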

Output

SignalSage produces:

  • Ranked root cause hypotheses with confidence scores
  • Prioritized remediation playbooks, including:
    • Risk levels
    • Estimated resolution times
    • Human approval gates for high-risk actions
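One way to picture this output is as ranked hypotheses that each carry a playbook, with high-risk steps held until someone approves them. The types and function names below are illustrative assumptions, not SignalSage's actual data model.

```typescript
type RiskLevel = "low" | "medium" | "high";

interface PlaybookStep {
  action: string;
  risk: RiskLevel;
  estimatedMinutes: number;
}

interface Hypothesis {
  cause: string;
  confidence: number; // Assumed to be in the range 0..1.
  playbook: PlaybookStep[];
}

// Highest-confidence hypothesis first; the input array is not mutated.
function rankHypotheses(hypotheses: Hypothesis[]): Hypothesis[] {
  return [...hypotheses].sort((a, b) => b.confidence - a.confidence);
}

// Returns the steps that may run now: everything up to the first
// high-risk step that has not been approved by a human.
function runnableSteps(
  playbook: PlaybookStep[],
  approvedActions: ReadonlySet<string>,
): PlaybookStep[] {
  const runnable: PlaybookStep[] = [];
  for (const step of playbook) {
    if (step.risk === "high" && !approvedActions.has(step.action)) break;
    runnable.push(step);
  }
  return runnable;
}
```

Stopping at the first unapproved high-risk step keeps later steps from running out of order.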

Key Features

  • “Remediate Now”: Demonstrates autonomous AI-agent-driven incident response
  • Real-time monitoring dashboard: Auto-refreshes from Splunk
  • Ask Splunk Assistant:
    • Query data in plain English
    • Receive intelligent explanations (not raw tables)
  • Post-incident report generator:
    • One-click export
    • Markdown output (Confluence/Jira-ready)

Impact

SignalSage reduces Mean Time to Understand (MTTU) by replacing:

  • Manual dashboard switching
  • Writing SPL queries by hand
  • Mental cross-signal correlation

Result: A 45-minute investigation becomes a 30-second automated pipeline.


How We Built It

We built SignalSage using:

  • Next.js 14
  • TypeScript
  • Tailwind CSS

Backend Architecture

  • Connects to Splunk Enterprise via REST API (port 8089)
  • Uses JWT authentication
  • Executes SPL queries via:
    • Traditional search job lifecycle
    • Faster oneshot export mode
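As a rough sketch of the oneshot path, the function below posts a search to Splunk's `/services/search/jobs` endpoint with `exec_mode=oneshot` and a bearer token, then reads the `results` array from the JSON response. The `HttpPost` type, `runOneshot`, and `formEncode` are hypothetical names for this example, and the base URL is an assumption; the project's own client may differ.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a fake.
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface OneshotOptions {
  baseUrl: string; // Assumed, e.g. "https://localhost:8089"
  token: string; // Splunk authentication token (JWT)
  spl: string; // Full search string, e.g. "search index=main | head 5"
  earliest: string; // e.g. "-15m"
  latest: string; // e.g. "now"
}

function formEncode(params: Record<string, string>): string {
  return Object.entries(params)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
}

async function runOneshot(
  post: HttpPost,
  opts: OneshotOptions,
): Promise<Record<string, unknown>[]> {
  const res = await post(`${opts.baseUrl}/services/search/jobs`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${opts.token}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: formEncode({
      search: opts.spl,
      exec_mode: "oneshot", // Run the search and return results in one response.
      output_mode: "json",
      earliest_time: opts.earliest,
      latest_time: opts.latest,
    }),
  });
  if (!res.ok) throw new Error(`Splunk search failed with HTTP ${res.status}`);
  const payload = (await res.json()) as { results?: Record<string, unknown>[] };
  return payload.results ?? [];
}
```

In a runtime with a global `fetch`, it can be passed directly as the first argument: `runOneshot(fetch, { ... })`.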

Investigation Pipeline

  • Query Generator → 12 targeted SPL queries per incident
  • Live Evidence Collector → Executes queries in parallel
  • Evidence Normalizer → Converts results into typed data
  • Root Cause Analyzer:
    • Uses 7 scoring models
    • Enhanced with Splunk MLTK:
      • Anomaly detection
      • Clustering
      • Forecasting
      • Outlier detection
  • Remediation Engine → Maps hypotheses to playbooks
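The collect-and-normalize stages above could look roughly like the sketch below: run every query concurrently, tolerate individual failures, and merge the rows into one time-ordered list. The type names, the `SearchFn` signature, and the reliance on an ISO 8601 `_time` field in each row are assumptions for illustration.

```typescript
type Signal = "logs" | "metrics" | "traces" | "deployments";

interface InvestigationQuery {
  id: string;
  signal: Signal;
  spl: string;
}

interface TimelineEvent {
  timestamp: number; // Milliseconds since the Unix epoch.
  signal: Signal;
  queryId: string;
  fields: Record<string, unknown>;
}

// Any function that runs one SPL query and resolves to its result rows.
type SearchFn = (spl: string) => Promise<Record<string, unknown>[]>;

async function collectEvidence(
  queries: InvestigationQuery[],
  search: SearchFn,
): Promise<{ timeline: TimelineEvent[]; failed: string[] }> {
  // allSettled lets one slow or failing query degrade the result
  // instead of aborting the whole investigation.
  const settled = await Promise.allSettled(queries.map((q) => search(q.spl)));

  const timeline: TimelineEvent[] = [];
  const failed: string[] = [];

  settled.forEach((outcome, i) => {
    const query = queries[i];
    if (outcome.status === "rejected") {
      failed.push(query.id);
      return;
    }
    for (const row of outcome.value) {
      const timestamp = Date.parse(String(row["_time"]));
      if (Number.isNaN(timestamp)) continue; // Skip rows without a usable time.
      timeline.push({ timestamp, signal: query.signal, queryId: query.id, fields: row });
    }
  });

  timeline.sort((a, b) => a.timestamp - b.timestamp);
  return { timeline, failed };
}
```

Returning the failed query IDs alongside the timeline lets later stages lower their confidence when evidence is missing.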

Additional Components

  • Splunk MCP server (development-time querying)
  • Natural language → SPL interface

Frontend

  • Tabbed investigation workflow
  • Real-time dashboard (auto-refresh)
  • Conversational AI assistant

Data Ingestion

  • Uses HEC (HTTP Event Collector)
  • Custom scripts generate realistic observability data:
    • Logs, metrics, traces, deployments
  • All events use timestamps relative to “now” for freshness
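A generator script in this style might build HEC payloads as sketched below. HEC accepts JSON event objects with a `time` field in epoch seconds, and several objects can be batched in one request body. The helper names, the `main` index, and the source label are assumptions, not the project's actual scripts.

```typescript
interface HecEvent {
  time: number; // Epoch seconds, as HEC expects.
  sourcetype: string;
  source: string;
  index: string;
  event: Record<string, unknown>;
}

// Builds one event stamped `secondsAgo` before `nowMs`, so replayed
// data always lands inside recent search windows.
function buildHecEvent(
  nowMs: number,
  secondsAgo: number,
  sourcetype: string,
  event: Record<string, unknown>,
  index = "main", // Assumed index name.
): HecEvent {
  return {
    time: (nowMs - secondsAgo * 1000) / 1000,
    sourcetype,
    source: "synthetic-generator", // Hypothetical source label.
    index,
    event,
  };
}

// Serializes events one JSON object per line for a single batched POST.
function toHecBody(events: HecEvent[]): string {
  return events.map((e) => JSON.stringify(e)).join("\n");
}
```

The resulting body would be sent to `/services/collector/event` on the HEC port with an `Authorization: Splunk <hec-token>` header. Passing `Date.now()` as `nowMs` gives fresh timestamps on every run, while a fixed value keeps tests deterministic.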

Challenges We Ran Into

Splunk AI Assistant Integration

  • Successfully decoded cloud token and tenant/API structure
  • Blocked by OAuth2 issue:
    • The client_credentials grant rejected our tenant ID as the client_id
  • Python SDK (splunk-cloud-sdk) incompatible with Python 3.13

✅ Integration code is complete, but blocked on authentication

Performance Issues

The app frequently froze due to heavy UI effects:

  • backdrop-blur on many elements
  • Global will-change usage
  • 50 confetti elements
  • Staggered animations across 100+ components

✅ Solution: Removed GPU-heavy effects and replaced with lighter alternatives

Query Performance

  • Splunk polling model:
    • 1 request/second
    • Up to 60 seconds latency

✅ Fixed using oneshot export mode

UI Glitch

A pulsing green border caused white flashes due to:

  • Hover state conflicts
  • Brightness filters
  • Inset box-shadow interactions

✅ Required multiple iterations to resolve

Accomplishments We’re Proud Of

  • ✅ Fully connected to a real Splunk instance (not a demo)
  • ✅ 12-query parallel pipeline produces meaningful results
  • ✅ ML-powered root cause analysis works on live data
  • ✅ “Remediate Now” demonstrates autonomous incident response
  • ✅ Natural language assistant explains results clearly

Performance Milestone

End-to-end workflow completes in under 30 seconds:

  1. Evidence collection
  2. Root cause ranking
  3. Remediation playbooks
  4. Post-incident report generation

Production Readiness

  • Input validation
  • SPL injection prevention
  • Credential masking
  • Time window limits
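The safeguards above can be sketched as one validation step that runs before any SPL is built. The allowlist pattern, the 24-hour cap, the `main` index, and the `service` field are assumptions chosen for the example; only the general approach (allowlist regex, window limit, masked credentials) comes from the project description.

```typescript
// Only letters, digits, dot, underscore, and hyphen: no quotes, pipes,
// or brackets can reach the search string.
const SERVICE_NAME_ALLOWLIST = /^[A-Za-z0-9][A-Za-z0-9._-]{0,62}$/;
const MAX_WINDOW_MS = 24 * 60 * 60 * 1000; // Assumed limit: 24 hours.

interface InvestigationInput {
  service: string;
  startMs: number;
  endMs: number;
}

type ValidationResult =
  | { ok: true; spl: string }
  | { ok: false; error: string };

function validateInvestigation(input: InvestigationInput): ValidationResult {
  if (!SERVICE_NAME_ALLOWLIST.test(input.service)) {
    return { ok: false, error: "Service name contains disallowed characters" };
  }
  if (
    !Number.isFinite(input.startMs) ||
    !Number.isFinite(input.endMs) ||
    input.endMs <= input.startMs
  ) {
    return { ok: false, error: "Time window is invalid" };
  }
  if (input.endMs - input.startMs > MAX_WINDOW_MS) {
    return { ok: false, error: "Time window exceeds the maximum allowed" };
  }
  const earliest = Math.floor(input.startMs / 1000);
  const latest = Math.ceil(input.endMs / 1000);
  return {
    ok: true,
    spl: `search index=main service="${input.service}" earliest=${earliest} latest=${latest}`,
  };
}

// Keeps only the last four characters of a credential for logs and UI.
function maskToken(token: string): string {
  return token.length <= 4 ? "****" : `****${token.slice(-4)}`;
}
```

Rejecting bad input outright, rather than trying to escape it, keeps the rule easy to audit.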

What We Learned

  • The gap between a demo and a product is performance
  • Heavy UI effects (blur, glassmorphism, animations):
    • Look good in screenshots
    • Hurt real-world usability

Key Technical Learnings

  • Splunk REST API is:
    • Powerful
    • Designed for asynchronous workflows
  • ✅ Getting responsive results from it required:
    • Oneshot export mode
    • Parallel execution
  • Rule-based NL → SPL works for ~80% of use cases
  • Users value clear explanations over perfect query translation
  • Splunk cloud AI is:
    • Powerful
    • Difficult to integrate compared to on-prem
What’s Next for SignalSage

Immediate Next Step

  • Complete Splunk AI Assistant integration
    • Awaiting proper OAuth2 credentials
  • Enables:
    • LLM-powered SPL generation
    • Advanced explanations

Near-Term Roadmap

  • Make “Remediate Now” fully functional:
    • Kubernetes rollbacks
    • Feature flag toggles
    • Connection pool scaling
    • Human approval workflows
  • Add real-time alerting:
    • Auto-trigger investigations
    • Shift to proactive operations

Longer-Term Vision

  • Multi-tenant support
  • Team collaboration:
    • Shared investigations
    • @mentions
    • Handoffs
  • Continuous learning system:
    • Improve root cause scoring from confirmed cases
    • Build institutional knowledge
    • Accelerate future incident resolution

Built With

  • Languages: TypeScript, JavaScript, SPL (Search Processing Language), CSS
  • Frameworks: Next.js 14 (React 18, App Router), Tailwind CSS 3, Node.js 24, Jest (testing)
  • Platforms: Splunk Enterprise 10.2.3 (local instance), Railway (deployment)
  • APIs & Services:
    • Splunk REST API (port 8089): search job creation, polling, oneshot export
    • Splunk HTTP Event Collector (port 8088): data ingestion
    • Splunk Machine Learning Toolkit (MLTK v5.7.4): anomaly detection, clustering, forecasting
    • Splunk AI Assistant (cloud-connected, pending OAuth2 approval)
    • OpenAI API (gpt-4o-mini): fallback AI summaries and explanations
    • Splunk MCP server: development-time query interface
    • Web Audio API: synthesized UI sound effects
  • Libraries: Zod (runtime validation), uuid, Sharp (image processing), OpenAI SDK
  • Key Techniques:
    • JWT token authentication
    • Parallel query execution
    • Oneshot synchronous search mode
    • Rule-based NL-to-SPL conversion with AI explanation layer
    • ML-boosted confidence scoring (z-score anomaly detection, log clustering, cross-signal correlation)
    • SPL injection prevention (allowlist regex)