Inspiration

Modern Security Operations Centers (SOCs) run on platforms like Splunk that excel at detecting threats, but the true bottleneck is no longer "finding events"; it is turning thousands of daily alerts into timely, accurate decisions at scale. Analysts face severe alert fatigue, thin alert handoffs that lack business context, and the steep learning curve of writing complex Splunk Processing Language (SPL) queries under pressure.

Furthermore, investigating alerts in isolated silos prevents security teams from seeing the bigger picture. A single alert rarely tells the whole story of a sophisticated attack. Raw IP addresses and cryptic usernames mean very little without knowing the asset's actual business identity or how those entities connect to previous incidents. We realized that AI should not simply replace analysts; it should act as an autonomous, tireless "SOC Copilot." Our vision was to build a system that compresses hours of investigation time into seconds, correlates disparate network events, standardizes adversarial reasoning, and surfaces actionable, context-rich insights immediately.

What it does

At its core, ThinkingSOC is an AI-powered Agentic Ops Router designed to act as an autonomous extension of a human SOC analyst. Its primary mission is to detect and investigate sophisticated cyber threats and to help analysts neutralize them. Because modern security often overlaps with IT reliability, ThinkingSOC also handles Observability and IT Operations tasks. By unifying security investigations and system observability under one umbrella, it helps an organization stay secure from adversaries while maintaining operational performance.

Instead of analysts manually querying logs, cross-referencing assets, and deciphering raw data, ThinkingSOC automates the triage and investigation phase through the following features:

  • Smart Alert Handoff & Full Context Extraction: Webhooks often ship with only a few rows of data. ThinkingSOC captures Splunk webhooks and immediately reaches back via the Splunk REST API to fetch the full result set of the originating search job (identified by its sid), ensuring the AI has the complete picture from the start.
  • Asset Identity Management & Business Context: Raw IPs and obscure usernames are meaningless without real-world context. Before triggering any AI analysis, ThinkingSOC performs dynamic entity resolution. It maps raw Splunk log fields (src, dest, user) directly to your internal organizational asset inventory. This vital enrichment step ensures the AI instantly knows whether a targeted IP is a low-risk developer's testing VM or a mission-critical "crown jewel" database, and precisely attributes malicious actions to real human identities. This drastically reduces false positives and prioritizes actual business risks.
  • Advanced Graph-Based Correlation (The Game Changer): ThinkingSOC does not analyze alerts in a vacuum. It uses a graph engine to map every entity (IPs, hostnames, user accounts, file hashes) extracted from Splunk. When a new alert triggers, the system traverses this graph to uncover hidden relationships with historical alerts. For instance, it can connect a seemingly benign failed login from last week to a high-severity data exfiltration alert today because they share the same compromised asset or lateral movement path. This turns fragmented, isolated alerts into a unified, visual attack narrative, allowing analysts to see the full "blast radius" and timeline of a multi-stage breach.
  • Dual-Pipeline Agentic Routing: A dedicated LLM classifier evaluates every enriched incoming alert. It identifies the nature of the event and routes cyber threats to a Security Pipeline (for malicious activity, intrusions, and malware) and system issues to an Observability Pipeline (for server crashes, performance degradation, and IT health). This allows the SOC to operate as a single pane of glass for both security and operational reliability.
  • Autonomous Investigation via Multi-Agent Debates: ThinkingSOC orchestrates specialized, persona-driven AI agents (Defender, Hunter, Judge) using LangGraph. These agents autonomously consult external threat intelligence (e.g., VirusTotal), map findings to established security frameworks like MITRE ATT&CK, and debate the evidence among themselves before the Judge issues a final verdict.
  • Interactive SOC Chat & RAG Capabilities: ThinkingSOC features a robust conversational interface powered by Retrieval-Augmented Generation (RAG). Using Qdrant as a vector database, analysts can dynamically "chat" with their enterprise data. They can ask for historical context, recall previous incident responses, or use natural language to execute Text-to-SQL queries, instantly pulling past investigations and correlated events without ever leaving the dashboard.
  • Splunk AI Integration: The platform directly integrates with Splunk MCP (Model Context Protocol) and Splunk AI Assistant (SAIA) to autonomously translate the AI's natural language reasoning into precise, CIM-aligned investigation SPL, running active hunt queries directly against the user's live Splunk environment.
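To make the asset enrichment step above more concrete, here is a minimal sketch of how the src, dest, and user fields of an alert row could be resolved against an asset inventory before any AI analysis runs. It is an illustration under stated assumptions, not ThinkingSOC's actual implementation: the `Asset` class, the `INVENTORY` contents, the criticality labels, and the `enrich_alert` function are all hypothetical, and a real deployment would read the inventory from a database rather than an in-memory dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Asset:
    """One row of a hypothetical asset inventory."""
    name: str
    owner: str
    criticality: str  # assumed labels, e.g. "low", "medium", "crown_jewel"


# Hypothetical inventory keyed by IP address or lowercase username.
INVENTORY = {
    "10.0.5.20": Asset("payments-db-01", "finance-platform", "crown_jewel"),
    "10.0.9.77": Asset("dev-vm-test", "engineering", "low"),
    "jdoe": Asset("John Doe (workstation WS-114)", "sales", "medium"),
}

# Splunk fields that the text names as entity carriers.
ENTITY_FIELDS = ("src", "dest", "user")


def enrich_alert(row: dict) -> dict:
    """Return a copy of the alert row with an `assets` mapping attached.

    A field that is present but has no inventory match maps to None, so
    later steps can tell "unknown asset" apart from "field not present".
    """
    assets: dict[str, Optional[Asset]] = {}
    for field in ENTITY_FIELDS:
        value = row.get(field)
        if value is not None:
            # Normalize before lookup so "JDoe" and "jdoe" resolve alike.
            assets[field] = INVENTORY.get(str(value).strip().lower())
    return {**row, "assets": assets}
```

With this shape, a downstream classifier or agent prompt can read `enriched["assets"]["dest"].criticality` instead of reasoning about a bare IP address.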

How we built it

Our architecture was strictly designed to be context-aware, highly scalable, developer-friendly, and modular:

  • Backend: A robust, asynchronous FastAPI layer serves as the central nervous system, handling incoming Splunk webhooks, manual API calls, and complex agent orchestration with minimal latency.
  • Custom Developer SDK: To ensure ThinkingSOC is not just a closed, rigid dashboard but a highly extensible platform, we developed a dedicated, fully-featured Python SDK. This empowers SOC engineers and backend developers to programmatically interact with our AI agents, trigger custom security investigations, query the correlation graph, and seamlessly embed ThinkingSOC's capabilities into their own internal scripts, custom CLI tools, or CI/CD pipelines with just a few lines of code.
  • AI Orchestration: We heavily utilized LangGraph to construct our complex multi-agent pipelines (Security vs. Observability). To maintain vendor neutrality and maximum flexibility, we wrapped multiple top-tier LLMs (OpenAI, Anthropic, NVIDIA NIM) using LiteLLM.
  • Data, Memory & Graph Stores:
    • PostgreSQL: Acts as the primary relational database, storing structured triage records, user data, and the critical asset inventory.
    • Qdrant: Houses high-dimensional vector embeddings to power our SOC RAG pipelines, enabling hyper-fast semantic search over thousands of past incidents and playbooks.
    • Neo4j: Serves as the beating heart of our Correlation Engine, dynamically building nodes and edges between alerts, users, and assets to render the interactive, real-time attack graph.
  • Splunk Native Tools: Deep, native integration with Splunk 10+, Splunk REST API (port 8089), the Splunk MCP Server, and Splunk AI Assistant (SAIA) to ensure seamless query execution and data ingestion.
  • Frontend: A sleek, highly responsive Next.js Analyst UI where security teams can visualize the investigation timeline, deeply explore the Neo4j correlation graph visually, manage asset identities, and interact with the AI via the SOC Chat interface.
  • Engineering Rigor (Testing & Documentation): We treated this project as enterprise, production-grade software from day one. The entire codebase is backed by robust unit and integration testing (utilizing pytest) to ensure our AI agents, API endpoints, and data pipelines behave predictably even under severe edge cases. Furthermore, thanks to FastAPI, the project boasts out-of-the-box, auto-generated Swagger UI and OpenAPI documentation. This zero-config, interactive API documentation makes the platform instantly accessible, highly maintainable, and incredibly easy for other developers to integrate with or audit.
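As an illustration of the Splunk REST integration listed above, the sketch below shows how a backend could turn an incoming webhook payload into the URL for the full result set of the originating search job. It relies on the `sid` field that Splunk's webhook alert action includes in its JSON body and on the search job results endpoint of the REST API. The function name `results_url`, the host name in the usage note, and the choice of the classic (non-v2) endpoint path are assumptions; authentication, TLS settings, and the HTTP call itself are left out, and the path should be checked against the Splunk version in use.

```python
from urllib.parse import quote, urlencode

SPLUNK_MGMT_PORT = 8089  # management port named in the text


def results_url(payload: dict, host: str, count: int = 0) -> str:
    """Build the REST URL for the full result set behind a webhook alert.

    `payload` is the parsed JSON body of a Splunk webhook alert action,
    which carries the search job id in its `sid` field. count=0 asks for
    all rows instead of the default page size.
    """
    sid = payload.get("sid")
    if not sid:
        raise ValueError("webhook payload has no 'sid'; cannot fetch job results")
    query = urlencode({"output_mode": "json", "count": count})
    # Percent-encode the sid defensively so unusual characters in a job id
    # cannot break out of the path segment.
    return (
        f"https://{host}:{SPLUNK_MGMT_PORT}"
        f"/services/search/jobs/{quote(sid, safe='')}/results?{query}"
    )
```

A webhook handler would call `results_url(payload, "splunk.example.com")` and issue an authenticated GET against the returned URL, then pass the rows on to enrichment.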

Challenges we ran into

  • Graph Data Modeling & Avoiding "Super Nodes": Building the Neo4j correlation engine required meticulous schema design. We had to implement intelligent filtering algorithms to avoid creating "super nodes" (such as highly common public DNS IPs, broadcast addresses, or generic admin accounts) which would falsely link completely unrelated alerts together, thereby creating massive noise instead of clarity.
  • LLM Context Limits & Intelligent Log Summarization: When feeding massive arrays of raw Splunk logs and lengthy threat intelligence reports into our LangGraph agents, we quickly hit LLM context window limits and unacceptable latency spikes. To solve this, we engineered a dynamic data chunking and pre-summarization pipeline, ensuring our AI retained the critical adversarial signals without exceeding token caps or slowing down the real-time investigation pipeline.
  • Accurate Asset Identity Resolution: Reliably mapping wildly inconsistent raw log strings (like dynamic DHCP IPs, spoofed MAC addresses, or disparate usernames) to structured, meaningful business entities in real time required building a robust entity resolution pipeline that runs before any LLM evaluation step.
  • RAG & Chat Hallucinations: Designing the SOC Chat meant ensuring the AI relied only on factual data retrieved from Qdrant and our PostgreSQL databases. Tuning the Text-to-SQL capabilities and RAG retrieval mechanisms to return highly specific security context without generating dangerous hallucinations was a major, continuous hurdle.
  • Agentic Orchestration & SPL Generation: Translating abstract, natural language security concepts into precise, syntactically correct, and CIM-aligned Splunk SPL was notoriously difficult. We overcame this by deeply integrating with Splunk's native SAIA and MCP capabilities while rigorously prompt-engineering our LangGraph agents to output strict, validated JSON structures.
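The "super node" problem described above can be illustrated without a graph database. The sketch below links alerts through shared entities but skips any entity that is on a denylist or attached to too many alerts. It is a simplified, in-memory stand-in for the idea, not the project's Neo4j logic: the `DENYLIST` contents, the `MAX_DEGREE` threshold, and the `related_alerts` function are hypothetical, and the right threshold depends on alert volume.

```python
from collections import defaultdict

# Hypothetical denylist of entities that are shared by design.
DENYLIST = {"8.8.8.8", "255.255.255.255", "admin"}
MAX_DEGREE = 3  # assumed cap: entities on more alerts than this are ignored


def related_alerts(alerts: dict[str, set[str]], alert_id: str) -> dict[str, set[str]]:
    """Map each related alert id to the entities it shares with `alert_id`.

    `alerts` maps alert id -> set of entity strings (IPs, users, hashes).
    Denylisted entities and entities attached to more than MAX_DEGREE
    alerts are skipped so they cannot link unrelated alerts.
    """
    # Invert the mapping: entity -> ids of all alerts that mention it.
    by_entity: dict[str, set[str]] = defaultdict(set)
    for aid, entities in alerts.items():
        for entity in entities:
            by_entity[entity].add(aid)

    links: dict[str, set[str]] = defaultdict(set)
    for entity in alerts[alert_id]:
        if entity in DENYLIST or len(by_entity[entity]) > MAX_DEGREE:
            continue  # would act as a super node
        for other in by_entity[entity] - {alert_id}:
            links[other].add(entity)
    return dict(links)
```

A degree cap catches noisy entities that nobody thought to denylist, at the cost of occasionally hiding a busy but relevant asset, so in practice the two filters complement each other.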

What's next for ThinkingSOC

  • Automated Remediation via Splunk SOAR: We plan to move beyond merely providing "Next Steps" suggestions. Our goal is to empower ThinkingSOC to autonomously execute critical containment actions (e.g., automatically blocking malicious IPs at the firewall, isolating compromised hosts from the network, or forcefully rotating breached credentials) directly through Splunk SOAR, heavily guided by the precise blast radius identified in our Neo4j correlation graph.
  • Proactive, Graph-Triggered Threat Hunts: We are upgrading the correlation engine so that when Neo4j detects a slow, suspicious pattern forming organically across multiple low-severity, seemingly unrelated alerts, it will proactively spin up an autonomous hunting agent. This agent will investigate the network and hunt for lateral movement before a major, high-severity alert even has the chance to trigger.
  • Custom Enterprise Knowledge Base Integration: We aim to radically deepen the RAG (Qdrant) integration to securely ingest massive internal company playbooks, compliance documentation, and historical post-mortem reports. This will allow the AI Chat to recommend specific Standard Operating Procedures (SOPs) perfectly tailored to a specific organization's exact internal security protocols and legal requirements.
