Inspiration

AI agents are powerful, but when they fail, teams are often left piecing together raw logs, broken tool calls, and unclear execution paths. We built AgentLens to make those failures observable, understandable, and fixable, especially for teams running Vertex AI agents in production.

What it does

AgentLens is a self-healing observability platform for Vertex AI agents. It ingests failure events, clusters repeated issues by fingerprint, highlights which tools are breaking most often, and shows operators the exact trace step where an execution failed. It then generates root-cause analysis and a suggested fix diff so teams can review and apply remediations with confidence.
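The clustering-by-fingerprint idea can be sketched as follows. This is a minimal illustration, not the actual AgentLens code: the field names (`tool`, `error_type`, `step`) and the hashing scheme are assumptions chosen to show how repeated failures collapse into one issue.

```python
import hashlib
from collections import defaultdict

def fingerprint(tool_name: str, error_type: str, failed_step: str) -> str:
    """Hash the stable parts of a failure so repeats share one identity."""
    key = f"{tool_name}|{error_type}|{failed_step}"
    return hashlib.sha256(key.encode()).hexdigest()[:12]

def cluster_failures(events: list[dict]) -> dict[str, list[dict]]:
    """Group failure events that share a fingerprint into one issue cluster."""
    clusters: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        fp = fingerprint(event["tool"], event["error_type"], event["step"])
        clusters[fp].append(event)
    return dict(clusters)

events = [
    {"tool": "search_api", "error_type": "Timeout", "step": "call_tool"},
    {"tool": "search_api", "error_type": "Timeout", "step": "call_tool"},
    {"tool": "db_query", "error_type": "AuthError", "step": "call_tool"},
]
clusters = cluster_failures(events)  # two clusters: repeated timeout + one auth error
```

Counting cluster sizes is then enough to surface which tools break most often.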

How we built it

We built AgentLens as a full-stack MVP with a React, Vite, Tailwind, and React Flow frontend paired with a FastAPI and Pydantic backend. The frontend powers a command center dashboard, remediation lab, and time-travel trace visualizer, while the backend handles ingestion, failure fingerprinting, issue clustering, RCA orchestration, and persistence through a flexible repository layer. We designed the platform to fit naturally into a Google Cloud workflow using Cloud Logging, Pub/Sub, Vertex AI, Cloud Run, Firebase Hosting, and Auth0.
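The "flexible repository layer" mentioned above can be sketched as an abstract interface with a swappable backing store, so the same service code runs against an in-memory store locally and Firestore in production. The interface and method names here are illustrative assumptions, not the real AgentLens API:

```python
from abc import ABC, abstractmethod

class IssueRepository(ABC):
    """Storage boundary: services depend on this, not on a concrete database."""

    @abstractmethod
    def save(self, issue: dict) -> None: ...

    @abstractmethod
    def list_by_tool(self, tool: str) -> list[dict]: ...

class InMemoryIssueRepository(IssueRepository):
    """Dev/test implementation; a Firestore-backed class would share the interface."""

    def __init__(self) -> None:
        self._issues: list[dict] = []

    def save(self, issue: dict) -> None:
        self._issues.append(issue)

    def list_by_tool(self, tool: str) -> list[dict]:
        return [i for i in self._issues if i["tool"] == tool]

repo = InMemoryIssueRepository()
repo.save({"tool": "search_api", "error_type": "Timeout"})
repo.save({"tool": "db_query", "error_type": "AuthError"})
timeouts = repo.list_by_tool("search_api")
```

Keeping ingestion, RCA orchestration, and the API handlers coded against the abstract interface is what makes the persistence layer "flexible" in practice.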

Challenges we ran into

One of the biggest challenges was turning noisy agent failures into something actionable. We had to normalize inconsistent logs, cluster repeated incidents, identify the exact broken step in a trace, and generate fix suggestions that felt concrete instead of generic. We also had to balance a polished user experience with a production-ready architecture for authentication, cloud ingestion, and persistent storage.
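Normalizing inconsistent logs typically means stripping volatile tokens (timestamps, IDs, counters) so that equivalent failures reduce to one template. A minimal sketch of that idea, with regex patterns chosen for illustration rather than taken from AgentLens:

```python
import re

def normalize_message(msg: str) -> str:
    """Collapse volatile tokens so equivalent failures produce one template."""
    # ISO-style timestamps -> placeholder
    msg = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+Z?", "<TIMESTAMP>", msg)
    # UUIDs -> placeholder
    msg = re.sub(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}",
        "<UUID>",
        msg,
    )
    # Remaining digit runs (durations, counts) -> placeholder
    msg = re.sub(r"\d+", "<NUM>", msg)
    return msg

template = normalize_message("2024-05-01T12:03:44Z Timeout after 30s")
```

Two log lines that differ only in timestamp and duration now normalize to the same template, which is exactly what the fingerprinting step needs as input.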

Accomplishments that we're proud of

We’re proud that AgentLens already delivers an end-to-end workflow: ingestion, clustering, dashboard visibility, trace replay, RCA, and remediation review in one product. We’re also proud that we made a technically dense problem feel approachable through clear UI patterns and a focused operator workflow. Most of all, we’re proud that AgentLens was built around real operational pain points instead of treating observability and remediation as separate problems.

What we learned

We learned that agent observability is not just about collecting logs; it is about reconstructing execution in a way humans can quickly understand and act on. We also learned that self-healing systems need transparency: teams trust automated suggestions much more when they can inspect the trace, understand the reasoning, and review a proposed diff before anything changes. From an engineering perspective, we learned the importance of drawing strong boundaries between UI, APIs, RCA logic, and storage from the very beginning.

What's next for AgentLens

Next, we want to deepen the production path with live project connections, richer filtering, remediation history, and safer approval flows for applying fixes. We also want to improve the RCA layer so it can compare incidents across releases, catch regressions earlier, and recommend higher-confidence remediations. Longer term, we want AgentLens to become the operational control plane for AI agents: not just showing what broke, but helping teams prevent the same failures from happening again.

Built With

  • auth0
  • cloud-logging
  • cloud-run
  • fastapi
  • firebase-hosting
  • firestore
  • gemini
  • google-cloud-platform-(gcp)
  • pub/sub
  • react
  • vertex-ai