Epilog: The "Why" Engine for AI Agents

Inspiration

Debugging traditional software is hard, but debugging AI agents is a nightmare. When a script fails, you get a line number. When an AI agent fails, you get a cryptic log like ElementNotFoundError or Timeout. Was the page different? Did an "Accept Cookies" banner cover the button? Did a paywall pop up?

We realized that agents are essentially "blind" to us while they work. We wanted to give them eyes and a recorded memory that developers could play back and analyze. Epilog was born from the need to turn "I don't know why it failed" into "Here is exactly what went wrong and how to fix it."

What it does

Epilog is a multimodal debugging platform designed for the era of autonomous agents. It doesn't just record what an agent does—it records what an agent sees and thinks.

Multimodal Tracing: Captures every tool call, internal thought, and—most importantly—a screenshot of the environment at the exact moment of failure.
AI-Powered "Autopsy": Uses Gemini to analyze the execution trace alongside visual context. It can identify visual obstacles like modals, auth-walls, or layout shifts that text-based logs miss.
Auto-Surgeon (Patch Generation): Beyond diagnosis, Epilog generates patches in standard unified diff format to fix the agent's code or logic on the fly.
Real-Time Dashboard: A high-fidelity dashboard, streamed over SSE (Server-Sent Events), where you can watch your agents navigate the web in real time.
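
To make the patch format concrete, here is a minimal sketch of producing a unified diff with Python's standard difflib module. It is an illustration only, not Epilog's actual patch generator: the make_patch helper, the agent.py filename, and the selectors are all hypothetical.

```python
import difflib


def make_patch(original: str, fixed: str, path: str = "agent.py") -> str:
    """Return a unified diff that turns `original` into `fixed`.

    `path` is a hypothetical filename used only for the diff headers.
    """
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical fix: dismiss a cookie banner before clicking the real target.
before = 'page.click("#submit")\n'
after = 'page.click("#accept-cookies")\npage.click("#submit")\n'
patch = make_patch(before, after)
print(patch)
```

A diff in this shape can be reviewed by a human or applied with standard tools such as git apply.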

How we built it

We built Epilog using a modern, high-performance stack centered around multimodal reasoning.

The Brain: We built a custom diagnosis engine on Gemini 1.5 Flash that reasons jointly over JSON traces and JPEG artifacts.
The Eyes: We developed a Python SDK with a LangChain/LangGraph Callback Handler that seamlessly integrates with existing agent loops. For visual context, we built a lightweight capture layer using Playwright.
The API: A FastAPI backend manages high-frequency event streaming and stores traces in a PostgreSQL database accessed asynchronously.
The Dashboard: A Next.js 15 frontend using Tailwind CSS and shadcn/ui provides a premium, developer-first experience with real-time state synchronization.
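
As a rough sketch of what the capture side of such an SDK might look like, the class below records tool events as JSON-serializable dictionaries. It uses only the standard library. The method names mirror LangChain's tool callback hooks (on_tool_start, on_tool_end, on_tool_error), but the TraceRecorder class, its event schema, and the field names are assumptions made for illustration, not Epilog's SDK. A real integration would presumably subclass LangChain's BaseCallbackHandler and be passed in through the callbacks list.

```python
import json
import time
from typing import Any, Dict, List


class TraceRecorder:
    """Hypothetical recorder; not the real Epilog handler."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: List[Dict[str, Any]] = []

    def _record(self, kind: str, payload: Dict[str, Any]) -> None:
        # `seq` gives each event a stable order even if timestamps collide.
        self.events.append({
            "run_id": self.run_id,
            "seq": len(self.events),
            "ts": time.time(),
            "kind": kind,
            "payload": payload,
        })

    def on_tool_start(self, serialized: Dict[str, Any], input_str: str, **kwargs: Any) -> None:
        self._record("tool_start", {"tool": serialized.get("name"), "input": input_str})

    def on_tool_end(self, output: Any, **kwargs: Any) -> None:
        self._record("tool_end", {"output": str(output)})

    def on_tool_error(self, error: BaseException, **kwargs: Any) -> None:
        self._record("tool_error", {"error": type(error).__name__, "message": str(error)})

    def to_json(self) -> str:
        return json.dumps(self.events)


# Simulate a failed click the way an agent framework would report it.
rec = TraceRecorder("run-1")
rec.on_tool_start({"name": "click"}, "#submit")
rec.on_tool_error(TimeoutError("element not found"))
print(rec.to_json())
```

In this sketch, a screenshot reference could be attached to the tool_error payload so the diagnosis step sees the trace and the image together.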

Challenges we ran into

One of our biggest hurdles was Multimodal Latency. Sending full-resolution PNGs during every agent step would kill performance and explode token costs. We solved this by building a custom Image Compression Pipeline using Pillow, which resizes and transcodes screenshots to optimized JPEGs, maintaining semantic clarity for Gemini while reducing bandwidth by over 90%.
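
A minimal sketch of the resize-and-transcode step is shown below, assuming Pillow is installed. The 1024-pixel cap and the JPEG quality of 70 are illustrative guesses, not the values Epilog uses, and the function names are hypothetical.

```python
import io
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def compress_screenshot(png_bytes: bytes, max_side: int = 1024, quality: int = 70) -> bytes:
    """Convert a PNG screenshot to a smaller JPEG (assumed parameters)."""
    from PIL import Image  # Pillow; imported lazily so fit_within works without it

    with Image.open(io.BytesIO(png_bytes)) as img:
        rgb = img.convert("RGB")  # JPEG has no alpha channel
        resized = rgb.resize(fit_within(rgb.width, rgb.height, max_side))
    out = io.BytesIO()
    resized.save(out, format="JPEG", quality=quality, optimize=True)
    return out.getvalue()


# A 1920x1080 capture shrinks to 1024 pixels on its longer side.
print(fit_within(1920, 1080, 1024))
```

Capping the longer side keeps the aspect ratio intact, so on-screen text and buttons stay proportionate for the model.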

Another challenge was Real-time Synchronization: ensuring the dashboard accurately reflected the agent's state without overwhelming the frontend. We implemented an SSE (Server-Sent Events) architecture that streams each trace event to the dashboard as it is recorded.
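
For readers unfamiliar with SSE, the wire format is plain text: optional id and event fields, a data field, and a blank line that ends each message. The sketch below formats trace events this way using only the standard library. The event shape and the "kind" key are assumptions for illustration. In a FastAPI backend, a generator like this could plausibly be returned through StreamingResponse with the text/event-stream media type.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def format_sse(data: Dict[str, Any], event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one message in Server-Sent Events wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits no raw newlines by default, so one data line is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def stream(events: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield SSE messages, using the position in the stream as the event id."""
    for seq, ev in enumerate(events):
        yield format_sse(ev, event=ev.get("kind", "message"), event_id=str(seq))


# Hypothetical trace events, shaped like the recorder output an SDK might emit.
chunks = list(stream([
    {"kind": "tool_start", "tool": "click"},
    {"kind": "tool_error", "error": "TimeoutError"},
]))
print("".join(chunks))
```

Sending an id with each message lets a browser EventSource report the last id it saw when it reconnects, so the server can resume the stream from that point.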

Accomplishments that we're proud of

We are incredibly proud of the Zero-Config Developer Experience. You can take an existing complex LangGraph agent and enable full visual debugging by just adding one line of code to your callbacks.

Watching the "AI Diagnosis" feature successfully identify that a LinkedIn login wall was the cause of a scraping failure—and then suggest the exact selector change to handle it—was a true "Aha!" moment for the team.

What we learned

We learned that Visual Grounding is the missing piece in agent development. Many "unreliable" agents aren't actually suffering from poor logic; they are simply reacting to a dynamic web environment that their developers can't see. Multimodal debugging bridges the gap between what a developer expects and what an agent actually experiences.

What's next for Epilog

Epilog is just getting started. Our roadmap includes:

Proactive Self-Healing: Enabling agents to apply the generated "Auto-Surgeon" patches autonomously and retry their tasks without human intervention.
Video Playback: Moving from static screenshots to frame-by-frame recordings of agent interactions.
Multi-Agent Orchestration Visualizer: A specialized view for debugging complex swarms of agents interacting with each other in real time.