ApexOps AI Architecture

Overview

ApexOps AI is an autonomous enterprise operations platform that combines Splunk observability, AI-powered executive agents, and automated remediation workflows. The system continuously monitors infrastructure, investigates incidents, assesses business impact, and recommends corrective actions.


Architecture Diagram

┌─────────────────────────────────────────────────────┐
│                Enterprise Infrastructure            │
├─────────────────────────────────────────────────────┤
│ Applications │ APIs │ Databases │ Cloud │ Network   │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│                Splunk Enterprise                    │
├─────────────────────────────────────────────────────┤
│ • HEC Event Ingestion                               │
│ • Metrics Collection                                │
│ • Logs & Traces                                     │
│ • SPL Searches                                      │
│ • ML Toolkit Anomaly Detection                      │
│ • Enterprise Security Events                        │
│ • Saved Alerts & Webhooks                           │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│            Splunk Alerting Layer                    │
├─────────────────────────────────────────────────────┤
│ Alert Generated                                     │
│ ↓                                                   │
│ Webhook Trigger                                     │
│ ↓                                                   │
│ Incident Sent to ApexOps Backend                    │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│              ApexOps FastAPI Backend                │
├─────────────────────────────────────────────────────┤
│ • Incident Orchestrator                             │
│ • Splunk Query Engine                               │
│ • Agent Coordinator                                 │
│ • Remediation Engine                                │
│ • Audit Logger                                      │
└─────────────────────────────────────────────────────┘
                          │

        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼

┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ SRE Agent    │ │ Security     │ │ Finance      │
│              │ │ Agent        │ │ Agent        │
│ Root Cause   │ │ Threat Hunt  │ │ Revenue      │
│ Analysis     │ │ MITRE Mapping│ │ Impact       │
└──────────────┘ └──────────────┘ └──────────────┘

        ▼                 ▼                 ▼

┌──────────────┐ ┌──────────────┐
│ Compliance   │ │ Remediation  │
│ Agent        │ │ Agent        │
│ GDPR/SOC2    │ │ Auto Actions │
└──────────────┘ └──────────────┘

               │
               ▼

┌─────────────────────────────────────────────────────┐
│           Groq Llama3-70B AI Models                 │
├─────────────────────────────────────────────────────┤
│ • Incident Summarization                            │
│ • Root Cause Investigation                          │
│ • Threat Analysis                                   │
│ • Business Impact Assessment                        │
│ • Executive Decision Reports                        │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│          Chief Operations Agent (COA)              │
├─────────────────────────────────────────────────────┤
│ • Aggregates Agent Findings                         │
│ • Determines Severity                               │
│ • Generates Recommendations                         │
│ • Creates Executive Decision Report                 │
└─────────────────────────────────────────────────────┘
                          │

        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼

┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Next.js      │ │ Slack Alerts │ │ Splunk Audit │
│ Dashboard    │ │ Notifications│ │ Trail Index  │
└──────────────┘ └──────────────┘ └──────────────┘

                          │
                          ▼

┌─────────────────────────────────────────────────────┐
│            Human Approved Remediation               │
├─────────────────────────────────────────────────────┤
│ • Restart Services                                  │
│ • Scale Infrastructure                              │
│ • Block Threat Actors                               │
│ • Generate Incident Tickets                         │
└─────────────────────────────────────────────────────┘

Data Flow

1. Data Collection

Enterprise systems continuously generate:

  • Application Logs
  • API Logs
  • Database Events
  • Infrastructure Metrics
  • Network Telemetry
  • Security Events

These events are ingested into Splunk Enterprise using HEC (HTTP Event Collector).


2. Splunk Analysis

Splunk performs:

  • Log aggregation
  • Metrics monitoring
  • Distributed tracing
  • SPL searches
  • ML-based anomaly detection
  • Enterprise Security correlation

When an anomaly or threat is detected, Splunk generates an alert.


3. Incident Triggering

Splunk alerts trigger webhooks that notify the ApexOps backend.

The backend creates an incident and launches the AI Executive Team.


4. AI Executive Investigation

SRE Executive

  • Root cause analysis
  • Infrastructure diagnostics
  • Performance investigation

Security Executive

  • Threat hunting
  • Attack analysis
  • MITRE ATT&CK mapping

Finance Executive

  • Revenue impact estimation
  • SLA risk analysis
  • Business impact assessment

Compliance Executive

  • GDPR evaluation
  • SOC2 assessment
  • Regulatory risk scoring

Remediation Executive

  • Recovery planning
  • Automated action generation
  • Verification procedures

5. AI Model Layer

All executives leverage Groq Llama3-70B to:

  • Analyze incidents
  • Generate findings
  • Produce recommendations
  • Create executive reports

6. Chief Operations Agent

The Chief Operations Agent (COA) combines outputs from all executives and generates:

  • Incident Severity
  • Root Cause
  • Security Assessment
  • Financial Impact
  • Compliance Assessment
  • Recommended Actions

7. Results & Audit Trail

The final decision report is:

  • Displayed in the Next.js dashboard
  • Sent via Slack notifications
  • Written back into Splunk

This creates a complete audit trail of AI-driven decisions.


8. Autonomous Remediation

Approved remediation actions can be executed, including:

  • Restarting services
  • Scaling infrastructure
  • Blocking malicious IPs
  • Creating incident tickets
  • Running recovery workflows

Key Technologies

Layer Technology
Frontend Next.js, React, TypeScript
Backend FastAPI, Python
Observability Splunk Enterprise
AI Models Groq Llama3-70B
Agent Framework Multi-Agent Architecture
Notifications Slack
Audit Trail Splunk Indexes
Deployment Docker, Vercel

Core Value Proposition

ApexOps transforms enterprise operations from:

Alert → Human Investigation → Manual Decision

into

Alert → AI Investigation → Executive Decision → Automated Remediation

reducing response time, improving visibility, and enabling autonomous operations at scale.

Built With

Share this project:

Updates