Inspiration

Every SRE and DevOps engineer knows the 3 AM pager alert. A production server is struggling—CPU spiking, memory leaking, processes hanging—and you're scrambling to diagnose the issue while half-asleep. I asked myself: What if AI could be the first responder?

Traditional monitoring tools tell you what is happening, but not why. They flood you with alerts but lack the reasoning to connect the dots. I was inspired by how experienced SREs think—they observe patterns, form hypotheses, simulate outcomes, and take measured action. I wanted to bring that same cognitive loop to AI, powered by Gemini 3's advanced reasoning capabilities.

The kernel is the heart of every Linux system. By tapping directly into kernel-level telemetry via eBPF, I can observe syscalls, scheduler decisions, and page faults as they happen. That visibility, paired with Gemini 3's ability to reason about complex, interconnected systems, looked like an opportunity to build something truly autonomous.

What it does

KernelSight AI is an autonomous Site Reliability Engineering (SRE) agent that monitors, diagnoses, and remediates Linux system issues in real time.

Core Capabilities:

  1. Real-Time Kernel Telemetry — eBPF tracers capture syscalls, scheduler events, I/O latency, and page faults at the kernel level with minimal overhead.

  2. Semantic Signal Processing — Raw events are transformed into meaningful "signals" (e.g., memory_pressure_high, io_bottleneck_detected, process_thrashing) that Gemini 3 can reason about.

  3. 6-Phase Autonomous Decision Cycle:

    • OBSERVE → Collect current system state and signals
    • EXPLAIN → Use Gemini 3 to understand why issues are occurring
    • SIMULATE → Predict outcomes of potential actions
    • DECIDE → Select the best remediation with confidence scoring
    • EXECUTE → Safely apply fixes (with human-in-the-loop option)
    • VERIFY → Confirm the issue is resolved

  4. Interactive Chat Interface — Ask questions in natural language, such as "Why is the server slow?", and get contextual answers backed by real telemetry data.

  5. Human-in-the-Loop Safety — The agent requests approval before executing any commands, ensuring humans remain in control.
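
The six-phase cycle above can be sketched as a plain Python function. This is a minimal illustration under stated assumptions, not the project's actual code: `run_cycle`, `CycleResult`, and the stand-in lambdas are hypothetical names, and the Gemini 3 calls are replaced by simple callables so the control flow is visible on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class CycleResult:
    trace: List[str] = field(default_factory=list)  # phases that actually ran
    action: Optional[str] = None
    executed: bool = False
    resolved: bool = False


def run_cycle(
    observe: Callable[[], List[str]],
    explain: Callable[[List[str]], str],
    simulate: Callable[[str], List[Tuple[str, float]]],
    decide: Callable[[List[Tuple[str, float]]], Tuple[str, float]],
    execute: Callable[[str], None],
    verify: Callable[[], bool],
    approve: Callable[[str, float], bool],
) -> CycleResult:
    """Run one OBSERVE -> VERIFY pass; stop early if there is nothing to fix
    or the reviewer declines the proposed action."""
    result = CycleResult()

    signals = observe()
    result.trace.append("OBSERVE")
    if not signals:
        return result  # healthy system: no further phases run

    diagnosis = explain(signals)
    result.trace.append("EXPLAIN")

    candidates = simulate(diagnosis)  # (action, predicted benefit) pairs
    result.trace.append("SIMULATE")

    action, confidence = decide(candidates)
    result.trace.append("DECIDE")
    result.action = action

    if not approve(action, confidence):  # human-in-the-loop gate
        return result

    execute(action)
    result.trace.append("EXECUTE")
    result.executed = True

    result.resolved = verify()
    result.trace.append("VERIFY")
    return result


# Stand-in phase implementations, for illustration only.
demo = run_cycle(
    observe=lambda: ["memory_pressure_high"],
    explain=lambda s: "a process is leaking memory",
    simulate=lambda d: [("restart_service", 0.9), ("drop_caches", 0.4)],
    decide=lambda c: max(c, key=lambda pair: pair[1]),
    execute=lambda a: None,
    verify=lambda: True,
    approve=lambda a, conf: conf >= 0.8,
)
```

Passing each phase in as a callable keeps the loop testable without a model or a live kernel, and makes the approval gate an explicit step rather than something buried inside the execute phase.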

How I built it

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     KernelSight AI                          │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │ eBPF Tracers│  │   Metrics   │  │  Semantic Ingestion │  │
│  │  (C/libbpf) │  │   Scraper   │  │      Pipeline       │  │
│  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘  │
│         │                │                     │            │
│         └────────────────┴─────────────────────┘            │
│                          │                                  │
│                    ┌─────▼─────┐                            │
│                    │  SQLite   │                            │
│                    │ Signal DB │                            │
│                    └─────┬─────┘                            │
│                          │                                  │
│         ┌────────────────┼────────────────┐                 │
│         │                                 │                 │
│  ┌──────▼──────┐                   ┌──────▼──────┐          │
│  │  Autonomous │                   │ Interactive │          │
│  │    Agent    │                   │    Agent    │          │
│  │ (Gemini 3)  │                   │ (Gemini 3)  │          │
│  └─────────────┘                   └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
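
In the diagram, every collector writes to a single SQLite signal database that both agents read from. A minimal sketch of such a store using Python's standard `sqlite3` module follows. The table name `signals`, its columns, and the helper names are assumptions made for illustration; the real schema is not shown in this writeup.

```python
import sqlite3
import time

# Hypothetical schema: one row per detected signal.
SCHEMA = """
CREATE TABLE IF NOT EXISTS signals (
    id        INTEGER PRIMARY KEY,
    ts        REAL NOT NULL,   -- Unix timestamp of detection
    name      TEXT NOT NULL,   -- e.g. memory_pressure_high
    severity  TEXT NOT NULL,   -- assumed levels: low / medium / high
    value     REAL             -- the measurement behind the signal
);
CREATE INDEX IF NOT EXISTS idx_signals_name_ts ON signals (name, ts);
"""


def open_db(path=":memory:"):
    """Open (or create) the signal database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_signal(conn, name, severity, value, ts=None):
    """Writer side: called by the ingestion pipeline for each new signal."""
    conn.execute(
        "INSERT INTO signals (ts, name, severity, value) VALUES (?, ?, ?, ?)",
        (time.time() if ts is None else ts, name, severity, value),
    )
    conn.commit()


def recent_signals(conn, since_ts):
    """Reader side: what an agent would fetch during its observe step."""
    rows = conn.execute(
        "SELECT ts, name, severity, value FROM signals "
        "WHERE ts >= ? ORDER BY ts",
        (since_ts,),
    )
    return rows.fetchall()
```

A single-file database keeps the writers and the two agents decoupled: collectors only insert rows, and each agent only runs queries.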

Technology Stack

Layer              | Technologies
Kernel Collection  | C, eBPF, libbpf, BTF
Data Pipeline      | Python, SQLite, NumPy
AI/ML              | Gemini 3 API (google-genai SDK)
Agent Framework    | Custom 6-phase loop with tool-augmented reasoning

Gemini 3 Integration

I use Gemini 3 as the "brain" of the agent through function calling and structured reasoning:

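from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Agent tools available to Gemini 3 (plain Python functions defined elsewhere)
tools = [
    get_current_signals,      # Fetch real-time system signals
    get_baseline_comparison,  # Compare against normal behavior
    execute_command,          # Run remediation commands
    query_signal_history,     # Analyze trends over time
]

# Gemini 3 reasons about the system state and may call the tools above
response = client.models.generate_content(
    model=MODEL_ID,  # placeholder for the Gemini 3 model ID in use
    contents=f"System signals: {signals}\nAnalyze and recommend action.",
    config=types.GenerateContentConfig(tools=tools),
)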

The agent leverages Gemini 3's:

  • Long context window — Feed comprehensive system state without truncation
  • Function calling — Dynamically invoke diagnostic and remediation tools
  • Reasoning capabilities — Understand causal relationships between signals
  • Safety alignment — Conservative action selection with confidence thresholds
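
The tool list names `get_baseline_comparison` without showing it. The sketch below is one way such a tool might look. The signature, the in-memory `BASELINES` table, the z-score cutoff of 3, and the return shape are all assumptions for illustration, not the project's code. Type hints and a docstring are included because the description of a tool that the model sees is typically derived from them.

```python
from statistics import mean, pstdev

# Hypothetical baseline samples; a real agent would load these from its signal DB.
BASELINES = {
    "cpu_percent": [22.0, 25.0, 19.0, 24.0, 20.0],
    "mem_percent": [48.0, 50.0, 52.0, 49.0, 51.0],
}


def get_baseline_comparison(metric: str, current: float) -> dict:
    """Compare a current metric value against its recorded baseline.

    Args:
        metric: Metric name, for example "cpu_percent".
        current: The latest observed value.

    Returns:
        A dict with the baseline mean, the deviation in standard deviations
        (z_score), and whether the value is anomalous (|z| >= 3).
    """
    samples = BASELINES.get(metric)
    if not samples:
        return {"metric": metric, "error": "no baseline recorded"}
    mu = mean(samples)
    sigma = pstdev(samples)
    # A flat baseline has no spread, so report zero deviation instead of dividing by 0.
    z = 0.0 if sigma == 0 else (current - mu) / sigma
    return {
        "metric": metric,
        "baseline_mean": mu,
        "z_score": round(z, 2),
        "anomalous": abs(z) >= 3.0,
    }
```

Returning a small, flat dictionary rather than raw samples gives the model a compact fact to reason over, and the explicit error entry lets it notice when no baseline exists instead of guessing.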

Challenges I ran into

1. eBPF Complexity

Writing correct eBPF programs is notoriously difficult: the kernel verifier is strict, and debugging is painful. I spent significant time making sure my tracers were safe and efficient, and that they captured the right events without overwhelming the system.

2. Signal-to-Noise Ratio

On a busy host, the kernel can generate millions of events per second. Transforming raw telemetry into meaningful signals that Gemini 3 could reason about required careful feature engineering and anomaly detection.
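
One common way to shrink a raw event stream into a handful of signals is to count events per second and only report runs where the rate stays above a threshold. The sketch below illustrates that idea. The function name, the thresholds, and the mapping from page-fault bursts to a `process_thrashing` signal are assumptions for the example, not the detection logic KernelSight AI actually uses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def detect_sustained_rate(
    events: Iterable[Tuple[float, str]],  # (timestamp_seconds, event_type)
    event_type: str,
    threshold_per_sec: int,
    min_consecutive_secs: int,
    signal_name: str,
) -> List[dict]:
    """Emit one signal per run of consecutive seconds in which `event_type`
    occurred more than `threshold_per_sec` times."""
    per_second = Counter(int(ts) for ts, kind in events if kind == event_type)
    if not per_second:
        return []

    signals = []
    run_start, run_len = None, 0
    # Walk every second in range (missing seconds count as zero); the extra
    # trailing second flushes a run that lasts until the final sample.
    for sec in range(min(per_second), max(per_second) + 2):
        if per_second.get(sec, 0) > threshold_per_sec:
            if run_len == 0:
                run_start = sec
            run_len += 1
        else:
            if run_len >= min_consecutive_secs:
                signals.append(
                    {"signal": signal_name, "start": run_start, "duration_secs": run_len}
                )
            run_len = 0
    return signals


# Synthetic stream: quiet for 3 s, a 4 s burst of page faults, then quiet again.
events = []
for sec, count in [(0, 5), (1, 5), (2, 5), (3, 50), (4, 50), (5, 50), (6, 50), (7, 5)]:
    events += [(sec + i / count, "page_fault") for i in range(count)]

signals = detect_sustained_rate(events, "page_fault", 20, 3, "process_thrashing")
```

Here 220 raw events collapse into a single record describing when the burst started and how long it lasted, which is the kind of compact input a model can reason about.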

3. Grounding AI in Reality

LLMs can hallucinate. I needed to ensure Gemini 3's recommendations were grounded in actual system data, not fabricated insights. My solution: always provide real telemetry in the prompt and validate proposed commands before execution.
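
A simple grounding check is to embed the telemetry verbatim in the prompt, ask for a structured reply that lists the evidence it relied on, and then reject any evidence that was never supplied. The sketch below assumes a JSON reply format with `diagnosis`, `evidence`, and `command` fields. That format and both helper names are hypothetical, chosen only to illustrate the technique.

```python
import json


def build_prompt(signals: list) -> str:
    """Embed real telemetry verbatim so the model reasons over data, not guesses."""
    return (
        "You are diagnosing a Linux host. Use ONLY the signals below.\n"
        "Signals (JSON):\n"
        + json.dumps(signals, indent=2, sort_keys=True)
        + "\n"
        'Reply as JSON: {"diagnosis": str, "evidence": [signal names], "command": str}'
    )


def ungrounded_evidence(signals: list, reply_json: str) -> list:
    """Return evidence names the model cited that were never in the telemetry."""
    known = {s["name"] for s in signals}
    reply = json.loads(reply_json)
    return [name for name in reply.get("evidence", []) if name not in known]


# Example: the reply cites one real signal and one that was never provided.
telemetry = [{"name": "memory_pressure_high", "value": 0.93}]
reply = json.dumps(
    {
        "diagnosis": "memory leak in a long-running service",
        "evidence": ["memory_pressure_high", "disk_full"],
        "command": "systemctl restart example.service",
    }
)
suspect = ungrounded_evidence(telemetry, reply)
```

A non-empty result is a cheap signal that the reply should be discarded or retried before any proposed command is considered.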

4. Safe Autonomous Action

Giving an AI agent the ability to run commands on a production system is scary. I implemented multiple safety layers: confidence thresholds, action allowlists, human approval gates, and automatic rollback detection.
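
Three of these layers (confidence threshold, action allowlist, human approval gate) can be combined into a single review step that runs before anything is executed. The sketch below is illustrative only: the allowlist contents, the 0.8 threshold, and the names `review_command` and `Verdict` are assumptions, not the project's configuration.

```python
import shlex
from dataclasses import dataclass

# Hypothetical allowlist: executable -> permitted subcommands (None = no restriction).
ALLOWED = {
    "systemctl": {"restart", "status"},
    "sync": None,
}
CONFIDENCE_FLOOR = 0.8  # assumed threshold
SHELL_METACHARACTERS = ";|&$`<>\n"


@dataclass
class Verdict:
    allowed: bool
    reason: str


def review_command(command: str, confidence: float, approved_by_human: bool) -> Verdict:
    """Apply each safety layer in turn; the first failing layer decides."""
    if confidence < CONFIDENCE_FLOOR:
        return Verdict(False, "confidence below threshold")
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return Verdict(False, "shell metacharacters rejected")
    try:
        argv = shlex.split(command)
    except ValueError:
        return Verdict(False, "unparseable command")
    if not argv or argv[0] not in ALLOWED:
        return Verdict(False, "executable not on allowlist")
    subcommands = ALLOWED[argv[0]]
    if subcommands is not None and (len(argv) < 2 or argv[1] not in subcommands):
        return Verdict(False, "subcommand not on allowlist")
    if not approved_by_human:
        return Verdict(False, "awaiting human approval")
    return Verdict(True, "ok")
```

An approved command is best executed as an argument list (for example with `subprocess.run(argv)` and no shell), so that nothing in the string is reinterpreted after it has passed review.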

5. Cross-Platform Build Issues

Developing on Windows with a Linux VM introduced challenges with line endings, clock skew, and shared folder corruption. I had to carefully manage the build environment.

Accomplishments that I'm proud of

  • 🏆 End-to-end autonomous remediation — From kernel event to AI diagnosis to safe fix, fully automated
  • 🏆 Sub-second signal detection — eBPF + streaming pipeline catches issues as they happen
  • 🏆 Natural language interaction — Ask "Why is memory high?" and get answers backed by real data
  • 🏆 Production-ready safety — Human-in-the-loop, confidence scoring, and action validation
  • 🏆 10+ automated tests — Unit, integration, and chaos testing for reliability
  • 🏆 Clean architecture — Modular design separating collection, processing, and reasoning

What I learned

  1. Gemini 3's reasoning shines with structured data — Providing well-organized signals (not raw logs) dramatically improves response quality.

  2. Function calling is powerful for agents — Letting Gemini 3 decide which tools to invoke creates more natural, adaptive behavior than hardcoded logic.

  3. eBPF is a superpower — Kernel-level visibility unlocks insights impossible with traditional monitoring.

  4. AI + Human collaboration > Full automation — The best results came from AI handling the cognitive load while humans retain final authority.

  5. Prompting is engineering — Careful prompt design for the agent's decision loop was as important as the code itself.

What's next for KernelSight AI

  • ☁️ Cloud-Native Expansion — Kubernetes-aware monitoring with pod and container context
  • 🔗 Integration Ecosystem — Connect with PagerDuty, Slack, Prometheus, and Grafana
  • 🧠 Fine-tuned Models — Train specialized models on SRE runbooks and incident postmortems
  • 🌐 Multi-Node Support — Correlate signals across distributed systems
  • 📱 Mobile Companion — Get AI-powered incident summaries on your phone

I believe the future of SRE is AI-augmented, not AI-replaced. KernelSight AI is my step toward that vision: an intelligent partner that watches the kernel so you don't have to.
