RedForesight — Devpost Project Story
Inspiration
Every security operations center in the world runs on the same flawed assumption: wait for something bad to happen, then alert on it. The analyst's job becomes triaging a flood of backward-looking notifications — what happened, where, when. Meanwhile the attacker is already three moves ahead.
The insight that sparked RedForesight came from chess. A grandmaster doesn't just react to the last move — they've memorised thousands of games and can project forward, evaluating candidate lines several steps ahead, pruning low-probability branches, focusing attention on what matters. A chess engine doesn't ask "what just happened?" It asks "what happens next?"
Security operations has exactly the same structure. The board position is your current signal. The move library is MITRE ATT&CK — 14 tactics, hundreds of techniques, thousands of documented procedure chains. The missing piece has always been an engine that could hold the whole game in memory and plan ahead.
That's what we built.
What It Does
RedForesight is an AI-powered security agent that connects to Splunk and predicts an attacker's next move before they make it.
When a threat signal fires — a credential dump attempt, a phishing hit, a suspicious process creation — RedForesight doesn't just log it. It:
Pulls live context from your Splunk environment via the MCP Server, firing four concurrent SPL queries to assemble a picture of what the target host has been doing in the surrounding 30-minute window.
Classifies the attack tactic using semantic similarity against 697 embedded MITRE ATT&CK techniques, mapping the observed signal to the correct kill chain phase.
Expands a game tree of candidate next attacker moves, sourcing techniques from the subsequent tactics in the kill chain and scoring each branch using a probability formula that combines semantic similarity, platform compatibility weighting, and signal severity.
Re-scores with adversarial AI using Google Gemini prompted as a red team operator — not a helpful assistant. The LLM thinks like an attacker, assessing which follow-on techniques are most realistic given the specific signal and environmental context.
Surfaces a DefenderBrief to the analyst — ranked predictions with probability scores, confidence tiers, adversarial reasoning strings, and generated SPL hunting queries to run right now before the attacker executes the next move.
Learns from feedback. Every analyst confirmation or rejection writes back to episodic memory in ChromaDB. The next time a similar signal fires, the agent recalls past incidents and weights its predictions accordingly. It gets smarter for your specific environment over time.
The probability scoring formula at the heart of the game tree is:
$$p_{\text{raw}} = p_{\text{semantic}} \times p_{\text{platform}} \times p_{\text{severity}}$$
Where \( p_{\text{semantic}} \) is the cosine similarity score from ChromaDB semantic search, \( p_{\text{platform}} \) is a platform compatibility weight, and \( p_{\text{severity}} \) is derived from the signal severity tier. After LLM re-scoring, probabilities are re-normalised so they sum to 1.0:
$$p_{\text{final}}(i) = \frac{p_{\text{llm}}(i)}{\sum_{j} p_{\text{llm}}(j)}$$
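As a minimal sketch of the two formulas above, the raw score is a plain product and the final step is a division by the sum. The function names and the example values are illustrative assumptions, not the project's actual code.

```python
def raw_score(p_semantic: float, p_platform: float, p_severity: float) -> float:
    """Raw branch score: the product of the three factors in the formula above."""
    return p_semantic * p_platform * p_severity


def normalise(scores: dict) -> dict:
    """Rescale scores so they sum to 1.0; an all-zero input maps to all zeros."""
    total = sum(scores.values())
    if total <= 0:
        return {key: 0.0 for key in scores}
    return {key: value / total for key, value in scores.items()}


# Made-up LLM-adjusted scores for three candidate techniques.
p_llm = {"T1021.002": 0.6, "T1087.002": 0.3, "T1560.001": 0.3}
p_final = normalise(p_llm)
```

Here the three scores sum to 1.2, so the top candidate ends up with a final probability of 0.5.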
How We Built It
RedForesight is built in layers, each one building on the last.
Foundation — Splunk MCP Integration
The agent communicates with Splunk exclusively through the official Splunk MCP Server using JSON-RPC 2.0. All data retrieval is routed through MCP tools — splunk_run_query for SPL execution, splunk_get_indexes for environment discovery, splunk_get_knowledge_objects for saved search integration. Four concurrent SPL queries fire via asyncio.gather on every agent run, pulling process creation events, authentication events, network connections, and host activity summaries simultaneously. The MCP connection uses encrypted tokens with audience mcp as required by the Splunk MCP Server.
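The concurrent context pull can be sketched roughly as follows. `run_query` is a stand-in for the MCP `splunk_run_query` call, and the SPL strings are placeholders, since the real queries and index names are not given in this write-up.

```python
import asyncio


async def run_query(name: str, spl: str) -> dict:
    """Stand-in for an MCP splunk_run_query call; returns fake rows after a short wait."""
    await asyncio.sleep(0.01)
    return {"name": name, "rows": [{"spl": spl}]}


async def pull_context(host: str) -> dict:
    # Placeholder SPL only; the project's actual searches are not shown here.
    queries = {
        "process_creation": f"search host={host} process earliest=-30m",
        "authentication": f"search host={host} authentication earliest=-30m",
        "network": f"search host={host} network earliest=-30m",
        "host_summary": f"search host={host} earliest=-30m | stats count by sourcetype",
    }
    # All four coroutines are scheduled together and awaited as one batch.
    results = await asyncio.gather(*(run_query(n, q) for n, q in queries.items()))
    return {result["name"]: result["rows"] for result in results}


context = asyncio.run(pull_context("host-01"))
```

Because the four calls overlap, the total wait is close to the slowest single query instead of the sum of all four.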
MITRE ATT&CK Brain
We downloaded the full MITRE ATT&CK STIX 2.1 bundle and parsed 697 valid techniques — filtering deprecated, revoked, and malformed entries using regex validation ^T\d{4}(\.\d{3})?$. Every technique was embedded using sentence-transformers/all-MiniLM-L6-v2 and stored in ChromaDB with cosine distance metric. The embedding document for each technique combines technique ID, name, tactic, description (truncated to 500 characters), detection guidance (truncated to 300 characters), and platforms — carefully structured to maximise semantic search relevance within the model's token limit.
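A minimal sketch of the filtering step, assuming the standard ATT&CK STIX fields (`revoked`, `x_mitre_deprecated`, and an `external_references` entry with `source_name` set to `mitre-attack`). The helper name and the sample objects are hypothetical.

```python
import re
from typing import Optional

# The validation pattern quoted above: T plus four digits, optional .NNN sub-technique.
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")


def technique_id(obj: dict) -> Optional[str]:
    """Return the ATT&CK ID of a usable technique object, or None if it should be skipped."""
    if obj.get("type") != "attack-pattern":
        return None
    if obj.get("revoked") or obj.get("x_mitre_deprecated"):
        return None
    for ref in obj.get("external_references", []):
        if ref.get("source_name") == "mitre-attack":
            external_id = ref.get("external_id", "")
            return external_id if TECHNIQUE_ID.fullmatch(external_id) else None
    return None


def make(external_id: str, **extra) -> dict:
    """Build a tiny fake STIX object for the demonstration below."""
    ref = {"source_name": "mitre-attack", "external_id": external_id}
    return {"type": "attack-pattern", "external_references": [ref], **extra}


samples = [make("T1003"), make("T1003.001"), make("T1003.1"),
           make("T1059", revoked=True), make("T1086", x_mitre_deprecated=True)]
kept = [tid for tid in map(technique_id, samples) if tid is not None]
```

Only the first two sample objects survive: the third has a malformed sub-technique suffix, and the last two are flagged as revoked or deprecated.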
LangGraph Orchestrator
The agent runs as a LangGraph state machine with six typed nodes flowing through a shared AgentState TypedDict: ingest_signal → pull_splunk_context → classify_tactic → expand_game_tree → score_and_prune → generate_brief. Each node reads from state and returns only the fields it modifies. The graph degrades gracefully — if the MCP context pull fails, the node returns an empty SplunkContext and appends to the errors list rather than crashing the pipeline.
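The "return only what you changed" pattern and the fallback on a failed context pull can be illustrated in plain Python. This sketch imitates the state merge by hand and does not use LangGraph's API; `fetch_context` and `run_pipeline` are hypothetical names.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    signal: dict
    splunk_context: dict
    errors: list


def fetch_context(signal: dict) -> dict:
    """Stand-in for the MCP context pull; it always fails here to exercise the fallback."""
    raise ConnectionError("MCP endpoint unreachable")


def pull_splunk_context(state: AgentState) -> dict:
    """Return only the fields this node modifies, and never raise to the pipeline."""
    try:
        return {"splunk_context": fetch_context(state["signal"])}
    except Exception as exc:
        return {
            "splunk_context": {},
            "errors": state.get("errors", []) + [f"pull_splunk_context: {exc}"],
        }


def run_pipeline(state: dict, nodes: list) -> dict:
    """Apply each node in order, merging its partial update into the shared state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


final = run_pipeline({"signal": {"host": "host-01"}, "errors": []}, [pull_splunk_context])
```

Downstream nodes then see an empty context plus one recorded error, and the run continues.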
Game Tree + Kill Chain Expansion
The GameTree class implements a KILL_CHAIN_NEXT mapping that defines which tactics logically follow each observed tactic. For a Credential Access signal, the game tree expands techniques from Lateral Movement, Discovery, and Collection. Candidates are scored, normalised, pruned below a configurable threshold (default 0.15), and capped at five predictions.
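A rough sketch of the expansion and pruning steps, under stated assumptions: only the Credential Access entry of the mapping is described in this write-up, the function names are hypothetical, and the raw scores are made up. Whether the real code re-normalises after pruning is not stated, so this sketch does not.

```python
# Only the Credential Access entry comes from the write-up; other tactics are omitted.
KILL_CHAIN_NEXT = {
    "credential-access": ["lateral-movement", "discovery", "collection"],
}


def next_tactics(observed: str) -> list:
    """Tactics whose techniques become candidate next moves for the observed tactic."""
    return KILL_CHAIN_NEXT.get(observed, [])


def normalise_and_prune(raw_scores: dict, threshold: float = 0.15, cap: int = 5) -> list:
    """Normalise to sum 1.0, drop branches below the threshold, keep the top `cap`."""
    total = sum(raw_scores.values())
    if total <= 0:
        return []
    ranked = sorted(
        ((technique, score / total) for technique, score in raw_scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(technique, p) for technique, p in ranked if p >= threshold][:cap]


# Made-up raw scores for four candidate techniques.
raw = {"T1021.002": 0.40, "T1087.002": 0.30, "T1560.001": 0.20, "T1005": 0.10}
predictions = normalise_and_prune(raw)
```

With these numbers the last candidate normalises to roughly 0.10, falls below the 0.15 threshold, and is pruned.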
Dual-Layer Memory
ChromaDB holds two separate collections: mitre_techniques for semantic memory (697 ATT&CK techniques) and incident_episodes for episodic memory (past incidents). The AgentMemory class combines both using asyncio.gather for concurrent recall. Episodic recall uses signal embedding similarity to surface the most relevant past incidents, and those episodes weight the probability scores. The memory compounds: after 21 seeded episodes, recall returns 3 relevant past incidents for a similar signal.
LLM Re-scoring with Gemini
The score_and_prune node calls Google Gemini 1.5 Flash with a red team operator system prompt. The LLM receives the signal, environmental context, and candidate moves, and returns a re-scored JSON array with adversarial reasoning for each technique. The client implements a three-provider architecture — Gemini (default), Ollama (offline fallback), Anthropic (optional) — so the system works with or without external API access.
FastAPI Backend + Splunk HEC
A FastAPI server on port 8080 exposes /trigger (receives Splunk alert webhooks, fires agent as background task, returns 202 immediately), /trigger/status/{task_id} (polls agent run completion), and /feedback (receives analyst confirm/reject, updates episodic memory). Completed DefenderBrief objects are written back into Splunk via HTTP Event Collector with sourcetype redforesight:brief, making them immediately queryable via SPL.
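The write-back step can be sketched as building a standard HEC event envelope around the brief. The brief fields below are hypothetical, and the sketch only builds the payload without sending it.

```python
import json


def build_hec_payload(brief: dict, sourcetype: str = "redforesight:brief") -> str:
    """Wrap a brief in the HEC event envelope and serialise it to JSON."""
    return json.dumps({"sourcetype": sourcetype, "event": brief})


# Hypothetical brief fields, for illustration only.
brief = {"technique_id": "T1021.002", "probability": 0.5, "confidence": "medium"}
payload = build_hec_payload(brief)
```

The resulting string would be POSTed to the collector's `/services/collector/event` endpoint with an `Authorization: Splunk <token>` header.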
Splunk Dashboard
A custom Splunk App provides two panels: a prediction feed table showing the latest DefenderBriefs with technique, probability, and LLM reasoning columns, and an analyst feedback form that POSTs directly to the FastAPI backend. The dashboard uses SimpleXML with embedded JavaScript for the feedback form interaction.
Challenges We Ran Into
The MCP repository didn't exist.
Our first attempt to clone github.com/splunk/splunk-mcp returned "repository not found." The official Splunk MCP Server is not a separately cloned Node.js process — it is a Splunk App installed from Splunkbase (App ID 7931) that runs inside the Splunk instance itself. Discovering this cost time but led to a better architecture: no separate process to manage, no npm dependencies, and the MCP endpoint runs at https://localhost:8089/services/mcp with encrypted tokens using audience mcp specifically.
Token audience mismatch.
The Splunk MCP Server requires tokens with audience mcp, not standard Splunk API tokens. The error "Invalid token audience: redforesight-dev" was initially confusing — the standard token worked for Splunk REST API calls but was rejected by the MCP endpoint. The fix was creating an MCP Encrypted Token from the Splunk MCP Server dashboard specifically.
Bad SPL returning HTTP 200.
The Splunk MCP Server returns HTTP 200 for invalid SPL queries — the error is packaged inside the JSON response body rather than as an HTTP error code. This required defensive parsing throughout the MCP client: checking both result.success and validating that result.data contains actual row data rather than error strings.
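A minimal sketch of that defensive check. The dataclass is a stand-in that carries only the two fields named above, and the assumption that valid rows arrive as a list of dictionaries is ours, not something this write-up confirms.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MCPToolResult:
    """Stand-in for the project's result type, with only the two fields named above."""
    success: bool
    data: Any = None


def extract_rows(result: MCPToolResult) -> list:
    """Return row dictionaries, or an empty list when the payload is really an error."""
    if not result.success:
        return []
    # Assumed shape: rows are a list of dicts; an error string fails this check.
    if not isinstance(result.data, list):
        return []
    return [row for row in result.data if isinstance(row, dict)]


ok = extract_rows(MCPToolResult(True, [{"host": "host-01", "count": "3"}]))
bad_spl = extract_rows(MCPToolResult(True, "Error in 'search' command"))
failed = extract_rows(MCPToolResult(False, None))
```

The point is that a successful transport status alone is never treated as proof that rows came back.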
ChromaDB version drift.
After a machine restart, Docker automatically started a fresh ChromaDB container on port 8001, replacing the local chroma.exe instance that held our seeded data. The v1 heartbeat API (/api/v1/heartbeat) returned a deprecation error on the new version. We resolved this by switching to the local chroma.exe binary with a persistent --path flag, and documented the startup sequence in the README to prevent recurrence.
pydantic and mcp version conflict.
mcp==1.2.0 requires pydantic>=2.10.1, which conflicted with our initial pin of pydantic==2.9.2 and left pip unable to resolve the dependencies. We fixed this by loosening the pydantic constraint to pydantic>=2.10.1,<3.0.0 and letting pip resolve the langchain ecosystem versions freely.
LLM JSON parsing reliability.
Gemini 1.5 Flash occasionally prepends a sentence before the JSON array despite explicit instructions to return only JSON. Our _parse_llm_response method handles this by finding the first [ and last ] in the response and extracting only that substring before attempting json.loads(). This defensive parsing pattern made the LLM integration robust across hundreds of test calls.
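The bracket-extraction approach described above might look like the following sketch. Returning an empty list on failure is an assumption made here for simplicity, and the sample reply is invented.

```python
import json


def parse_llm_response(text: str) -> list:
    """Extract the outermost [...] span from an LLM reply and parse it as a JSON array."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Invented reply with a stray sentence before and after the JSON array.
reply = 'Here is the re-scored list:\n[{"technique_id": "T1021.002", "probability": 0.6}]\nDone.'
moves = parse_llm_response(reply)
```

One known limit of this pattern: a square bracket in the leading or trailing prose widens the extracted span and makes the parse fail, which is why the failure path matters.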
NumPy array truthiness in ChromaDB.
ChromaDB's Python client occasionally returns NumPy arrays for embeddings, causing ValueError: The truth value of an array with more than one element is ambiguous when checking for truthiness with if result["embeddings"]. Fixed by explicitly checking if result["embeddings"] is not None — a subtle but production-breaking bug that would have caused silent failures in the episodic memory recall path.
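The failure and the fix can be reproduced without NumPy installed. `ArrayLike` below is a tiny stand-in that raises the same error a multi-element NumPy array raises when coerced to a boolean; it is not ChromaDB's or NumPy's actual type.

```python
class ArrayLike:
    """Stand-in that mimics a multi-element NumPy array's refusal to act as a bool."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        raise ValueError(
            "The truth value of an array with more than one element is ambiguous"
        )


def has_embeddings(result: dict) -> bool:
    """Identity check against None, which never triggers the array's __bool__."""
    return result.get("embeddings") is not None


result = {"embeddings": ArrayLike([0.1, 0.2, 0.3])}

try:
    if result["embeddings"]:  # the original, fragile truthiness check
        pass
    naive_failed = False
except ValueError:
    naive_failed = True

safe = has_embeddings(result)
```

The truthiness check raises, while the `is not None` check returns cleanly for both array-backed and missing embeddings.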
Accomplishments That We're Proud Of
The game tree actually works. Given a Credential Access signal, RedForesight correctly expands to Lateral Movement, Discovery, and Collection techniques — never predicting the attacker repeats the same tactic. The kill chain logic is grounded in real adversarial tradecraft, not heuristics.
The LLM reasons adversarially. The Gemini reasoning strings are genuinely insightful — not generic descriptions but specific assessments of why an attacker would choose a technique given the observed signal and environmental context. This is the system prompt doing its job: "You are a red team operator simulating an advanced persistent threat actor."
The memory compounds. After seeding 21 episodes and running the agent on a similar signal, episodic recall returns 3 relevant past incidents and correctly weights the probability scores. The agent is demonstrably smarter with memory than without — which is the entire point.
No raw log data leaves the environment. All Splunk data retrieval routes through the Splunk MCP Server. The LLM receives signal metadata and technique names, never raw log data, so with the default Gemini provider that metadata is the only thing sent to an external API. DefenderBriefs write back into Splunk via HEC. Log content stays inside the Splunk deployment.
Graceful degradation throughout. The MCP client never raises to callers — always returns MCPToolResult. The LLM client falls back to original moves on any failure. The orchestrator handles empty context gracefully. The system degrades to useful rather than crashing.
What We Learned
The game tree metaphor is more than an analogy — it's the right data structure. MITRE ATT&CK techniques have real prerequisite dependencies encoded in the kill chain phase relationships. The tree has natural pruning conditions that dramatically reduce the search space. Surfacing this structure explicitly is what makes the predictions useful rather than just probable.
Episodic memory compounds faster than expected. After 21 incidents, recall was already noticeably environment-specific: the agent surfaced techniques that matched the attacker patterns in the BOTS v3 dataset, not just global MITRE frequencies. This is the compounding value proposition working in practice.
Prompt engineering for adversarial reasoning requires a different mindset. Standard assistant prompts produce defensive, hedged responses. The red team operator framing — "you are simulating an attacker, think about path of least resistance, forensic footprint, and tradecraft" — produces fundamentally different outputs. The framing of the system prompt is as important as the model choice.
The MCP protocol is genuinely powerful for security operations. Routing all Splunk interactions through MCP means the agent can be connected to any Splunk instance by changing one environment variable. The tool abstraction handles authentication, retry, and result formatting — the orchestrator just calls splunk_run_query and gets rows back.
What's Next for RedForesight
Campaign detection. Cluster related incidents over a sliding 7-day window using episodic memory similarity scores. When three or more related episodes are detected, surface a "campaign in progress" warning identifying the attacker's likely objective based on the kill chain trajectory.
Depth-5 game tree with beam search. The current implementation plans 3 moves ahead. Extending to depth-5 with beam search pruning — keeping only the top-3 branches at each level — would enable multi-hop campaign prediction: not just "what happens next" but "where does this campaign end."
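Since this is planned work, the following is only a sketch of how beam search over the kill chain could behave. The toy transition table, its probabilities, and the function names are all invented for illustration.

```python
# Invented transition probabilities between tactics, for demonstration only.
TOY_TRANSITIONS = {
    "credential-access": [("lateral-movement", 0.6), ("discovery", 0.4)],
    "lateral-movement": [("collection", 0.7), ("discovery", 0.3)],
    "discovery": [("lateral-movement", 0.5), ("collection", 0.5)],
    "collection": [("exfiltration", 1.0)],
}


def expand(tactic: str) -> list:
    return TOY_TRANSITIONS.get(tactic, [])


def beam_search(start: str, depth: int = 5, beam_width: int = 3) -> list:
    """Keep only the `beam_width` highest-scoring paths at each level of the tree."""
    beams = [([start], 1.0)]
    for _ in range(depth):
        candidates = []
        for path, score in beams:
            children = expand(path[-1])
            if not children:
                candidates.append((path, score))  # finished path, carried forward
                continue
            for nxt, p in children:
                candidates.append((path + [nxt], score * p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


beams = beam_search("credential-access", depth=3, beam_width=3)
best_path, best_score = beams[0]
```

Each path's score is the product of its step probabilities, so long chains are only kept when every hop is plausible.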
SOAR integration. When prediction confidence exceeds a configurable threshold (default 0.80), automatically trigger a Splunk SOAR playbook for the top predicted technique — pre-staging containment actions before the attacker executes.
Probability calibration. Use analyst feedback history to temperature-scale raw LLM probabilities. Confirmed predictions shift their priors upward, rejected predictions shift them down. After 100+ incidents, confidence scores become statistically meaningful rather than decorative.
Red team simulation mode. A /simulate endpoint that accepts a hypothetical initial access scenario and steps through the predicted campaign interactively — letting security teams run AI-powered tabletop exercises against their own Splunk data.
Multi-tenant memory. Separate episodic memory collections per customer environment, enabling RedForesight to learn different attacker patterns for different organisations from a single deployment.
Built With
- api
- chromadb
- docker
- enterprise
- fastapi
- httpx
- javascript
- langchain
- langgraph
- mcp
- pydantic
- python
- rich
- sentence-transformers
- spl
- splunk
- tenacity
- uvicorn