The Spark: Beyond Error Logging

The inspiration for Talos came from a specific, recurring frustration: "Notification Fatigue." Like many developers, I was tired of CI/CD pipelines acting like smoke detectors—loudly alerting me to a fire (a failed build) but doing nothing to put it out. I realized that the current generation of DevOps tools consists of passive observers. They tell you what broke, but the mental load of context switching, reading stack traces, and writing the patch still falls on the human. I wanted to build something that didn't just report errors but actually fixed them. I wanted a "Digital Employee"—an autonomous agent that treats a broken build as a "pain signal" and initiates a biological-style "healing response" without my intervention.

The Architecture: Anatomy of an Agent

Building Talos required moving beyond simple "chatbot" architecture into an agentic loop. I structured the system into three distinct biological components:

  1. The Nervous System (FastAPI + Supabase) I needed a robust backend to manage the state of "healing runs." Using FastAPI (Python 3.11), I built an event-driven core that listens for GitHub webhooks. When a workflow_run fails, the system calculates a "Pain Signal" intensity.
    • Storage: I used Supabase to store the history of "Patient Zero" (the root cause files) and the "Thought Process" logs.
    • Cognition: The brain is Google's Gemini 3 (gemini-3-flash-preview). I chose it for its massive context window, allowing me to feed it entire dependency graphs and error logs.
  2. The Hands (E2B Sandbox) This was the most critical integration. You cannot let an AI run rm -rf / on your production server. I utilized E2B to create ephemeral, secure sandboxes. When Talos attempts a fix, it clones the repo into this isolated environment, installs dependencies, and runs the tests there. If the tests fail, the "patient" dies in the sandbox, not in production.
  3. The Visual Cortex (Playwright) Code isn't just logic; it's also presentation. A common challenge with AI code generation is that it might fix a logic error but break the UI (e.g., changing a CSS class). I implemented a "Visual Cortex" using Playwright. Before submitting a fix, Talos takes a screenshot of the running app to ensure no visual regressions occurred.
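To make the "Pain Signal" idea concrete, here is a minimal sketch of how a failed workflow_run webhook payload might be scored. The payload fields follow GitHub's webhook schema, but the branch set, weights, and thresholds are illustrative assumptions, not Talos's actual values:

```python
# Hypothetical sketch: deriving a "Pain Signal" intensity from a
# GitHub workflow_run webhook payload. The weights and the set of
# critical branches are assumptions for illustration.

CRITICAL_BRANCHES = {"main", "release"}

def pain_signal(payload: dict) -> float:
    """Return a 0..1 pain intensity for a workflow_run event."""
    run = payload.get("workflow_run", {})
    if run.get("conclusion") != "failure":
        return 0.0  # only failed runs generate pain

    intensity = 0.5  # base pain for any red build
    if run.get("head_branch") in CRITICAL_BRANCHES:
        intensity += 0.3  # failures on protected branches hurt more
    if run.get("run_attempt", 1) > 1:
        intensity += 0.2  # repeated failures escalate the signal

    return min(intensity, 1.0)
```

In the real system this function would sit behind the FastAPI webhook endpoint and its output would decide whether a healing run is enqueued.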

The Logic: Quantifying the Fix

One of the hardest parts was determining when the agent should be confident enough to open a Pull Request. I modeled this as a weighted scoring function.

Let $$S_{fix}$$ be the confidence score of a generated patch. We define the acceptance threshold $$\theta$$ such that a PR is only opened if $$S_{fix} > \theta$$.

$$ S_{fix} = \alpha \cdot T_{pass} + \beta \cdot (1 - \Delta_{UI}) + \gamma \cdot C_{sem} $$

Where:

  • $$T_{pass} \in \{0, 1\}$$ is the binary result of the unit tests in the sandbox.
  • $$\Delta_{UI}$$ represents the visual divergence (pixel difference) detected by the Visual Cortex, normalized between 0 and 1.
  • $$C_{sem}$$ is the semantic consistency score returned by the LLM (does this code look like the rest of the repo?).
  • $$\alpha, \beta, \gamma$$ are weights tuning the strictness of the agent.

For Talos, I prioritized logical correctness: $$\alpha$$ dominates the other weights, and a failed test run ($$T_{pass} = 0$$) acts as a hard gate, zeroing the score regardless of the other terms.
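The scoring function above translates directly into code. This is a minimal sketch; the specific weight and threshold values are assumptions, but the hard-gate behaviour on failed tests matches the description:

```python
# Illustrative implementation of the S_fix scoring function.
# ALPHA, BETA, GAMMA, and THETA are assumed values for the sketch.

ALPHA, BETA, GAMMA = 0.6, 0.2, 0.2
THETA = 0.75  # acceptance threshold

def fix_confidence(t_pass: int, delta_ui: float, c_sem: float) -> float:
    """S_fix = alpha*T_pass + beta*(1 - delta_UI) + gamma*C_sem,
    with failed tests acting as a hard gate on the score."""
    if t_pass == 0:
        return 0.0  # failed sandbox tests zero the score outright
    return ALPHA * t_pass + BETA * (1.0 - delta_ui) + GAMMA * c_sem

def should_open_pr(t_pass: int, delta_ui: float, c_sem: float) -> bool:
    """Open a PR only when the confidence exceeds the threshold."""
    return fix_confidence(t_pass, delta_ui, c_sem) > THETA
```

With these weights, even a patch that passes all tests still needs reasonable visual and semantic scores to clear the threshold, which keeps the agent conservative.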

Challenges Faced

  1. The "Hallucination" Loop Early versions of Talos would get stuck in infinite loops. The AI would suggest a fix, the test would fail with a new error, and the AI would suggest the same fix again.
    • Solution: I implemented a "Healing History" context. The prompt sent to Gemini includes the previous failed attempts in the current run, effectively telling it: "You already tried X and it caused Y. Try something else."
  2. User Trust (The "Black Box" Problem) Developers don't trust AI that works in the dark. If a bot just opens a PR, you wonder, "How did it get here?"
    • Solution: I built the Neural Dashboard using Next.js 15 and Server-Sent Events (SSE). This created a "Glass Box" experience. Users can watch the "Neural Stream" in real-time—seeing the agent parse the stack trace, "think" about the dependency graph, and execute terminal commands—building trust through transparency.
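The "Healing History" fix for the hallucination loop can be sketched as a prompt builder that folds every failed attempt in the current run back into the next request. The prompt wording and attempt structure here are hypothetical:

```python
# Sketch of the "Healing History" context: prior failed attempts from
# the current run are included so the model does not repeat them.
# The prompt format is an illustrative assumption.

def build_healing_prompt(error_log: str, attempts: list[dict]) -> str:
    """attempts: [{"patch": str, "result": str}, ...] from this run."""
    sections = [f"Current failure:\n{error_log}"]
    for i, attempt in enumerate(attempts, start=1):
        sections.append(
            f"Previously tried fix #{i}:\n{attempt['patch']}\n"
            f"It failed with:\n{attempt['result']}\n"
            "Do not repeat this approach."
        )
    sections.append("Propose a different fix.")
    return "\n\n".join(sections)
```

Each retry therefore sees a strictly growing record of what was already tried and why it failed, which is what breaks the loop of resubmitting the same patch.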

What I Learned

Building Talos taught me that Agentic AI is fundamentally different from Generative AI.

  • Context is King: The quality of the fix is directly proportional to the quality of the "Patient Zero" context you provide.
  • Sandboxing is Non-Negotiable: To give an agent agency, you must give it a safe playground. E2B was the enabler that turned a text generator into a code executor.
  • Visuals Matter: For frontend debugging, text-based logs are insufficient. Giving the AI "eyes" (Playwright) drastically reduced false positives.

Talos isn't just a tool; it's a proof of concept for a future where developers oversee "digital squads" rather than writing every line of code themselves.
