Inspiration

Prompt engineering in production is painful. Developers spend hours manually digging through telemetry logs, diagnosing edge cases, rewriting instructions, and running manual comparisons to ensure changes don't cause regressions.

We built Argus to automate this cycle end to end. Our goal was a self-improving developer agent that sits between LLM observability (Arize Phoenix) and DevSecOps (GitLab), autonomously optimizing prompts and deploying the improvements directly to production.

What it does

Argus is an autonomous LLM Eval-to-Improvement Loop Agent. Here is how it functions:

  1. Monitor: Programmatically watches live telemetry traces on Arize Phoenix Cloud, auditing response correctness using Gemini LLM-as-a-Judge evaluations.
  2. Diagnose: When accuracy falls below a threshold, it isolates the failing traces and clusters them into distinct root-cause categories (e.g. prompt leakage, truncation, policy violation) using structured Gemini classification.
  3. Optimize: It engineers three distinct candidate prompt strategies (targeted fixes, structured workflows, and few-shot negative constraints).
  4. Evaluate: It runs zero-temperature shadow evaluations against a high-fidelity golden dataset using a double-layer check (expected substrings + LLM judges) to select the winner.
  5. Deploy: It uses a GitLab MCP server to autonomously spin up a Git branch, update the localized prompt config, and open a Merge Request complete with a rich Markdown metrics dashboard comparing latency, tokens, and accuracy.
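As a rough illustration of the Evaluate step, the double-layer check and winner selection could look something like the sketch below. This is not the Argus source: `GoldenCase`, `passes_double_check`, `accuracy`, and `select_winner` are hypothetical names, the `generate` and `judge` callables stand in for the zero-temperature Gemini call and the LLM judge, and case-insensitive substring matching is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical callable shapes, assumed for this sketch:
#   generate(system_prompt, user_input) -> model response text
#   judge(user_input, response)         -> True if the judge accepts the response
Generate = Callable[[str, str], str]
Judge = Callable[[str, str], bool]


@dataclass
class GoldenCase:
    """One golden-dataset entry: an input plus substrings the answer must contain."""
    user_input: str
    expected_substrings: Sequence[str]


def passes_double_check(response: str, case: GoldenCase, judge: Judge) -> bool:
    # Layer 1: cheap deterministic check on expected substrings.
    lowered = response.lower()
    if not all(s.lower() in lowered for s in case.expected_substrings):
        return False
    # Layer 2: only responses that pass layer 1 are sent to the judge.
    return judge(case.user_input, response)


def accuracy(system_prompt: str, cases: Sequence[GoldenCase],
             generate: Generate, judge: Judge) -> float:
    if not cases:
        return 0.0
    passed = sum(
        passes_double_check(generate(system_prompt, c.user_input), c, judge)
        for c in cases
    )
    return passed / len(cases)


def select_winner(candidates: Dict[str, str], cases: Sequence[GoldenCase],
                  generate: Generate, judge: Judge) -> Tuple[str, Dict[str, float]]:
    """Score every candidate prompt and return the best name plus all scores."""
    scores = {name: accuracy(prompt, cases, generate, judge)
              for name, prompt in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Running the substring check first means the judge is only called for responses that already contain the required facts, which keeps the number of judge calls down.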

How we built it

Argus was built using a modular Python architecture:

  • Agent Framework: Google ADK (Agent Development Kit) & Gemini models to orchestrate reasoning.
  • Observability & Introspection: Arize Phoenix Cloud coupled with OpenInference auto-instrumentation and the @arizeai/phoenix-mcp server.
  • GitOps Integration: @structured-world/gitlab-mcp server for executing git actions.
  • Dashboard: FastAPI server delivering a single-page HSL-tailored glassmorphic dark-mode interface with live pipeline progress and run logs.

Challenges we ran into

  • Gemini API Rate Limiting: Running concurrent shadow evaluations against the golden dataset for three system prompt candidates produced bursts of requests that hit Gemini API rate limits. We solved this by wrapping the calls in retry logic with exponential backoff.
  • OTel Span Duplication: When tracking the agent itself, overlapping auto-instrumentations from both Google ADK and Google GenAI duplicated spans. We resolved this by isolating GenAI's instrumentor while running optimization sweeps.
  • State Persistence Robustness: When corporate network settings bottlenecked MongoDB Atlas connectivity, we made Argus fall back to local JSON history logging.
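For the rate-limiting challenge, a wrapper with exponential backoff might be sketched as follows. This is a minimal example under stated assumptions, not the code Argus ships: `RateLimitError` is a hypothetical stand-in for whatever exception the Gemini client raises on a rate-limit response, and `with_backoff` and its parameters are illustrative names.

```python
import functools
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the API client's rate-limit exception."""


def with_backoff(max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry the wrapped call on RateLimitError, doubling the wait each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    # Give up and re-raise once the final attempt has failed.
                    if attempt == max_attempts - 1:
                        raise
                    # 1x, 2x, 4x, ... the base delay, capped at max_delay.
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    # Up to 10% jitter so concurrent workers do not retry in lockstep.
                    sleep(delay + random.uniform(0, delay * 0.1))
        return wrapper
    return decorator
```

Decorating the function that issues a single evaluation request (for example `@with_backoff(max_attempts=5)`) keeps the retry policy in one place. The `sleep` parameter is injectable so the behaviour can be tested without actually waiting.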

Accomplishments that we're proud of

  • Complete Closed-Loop Autonomy: Argus goes from detecting real-world telemetry failures to opening a formatted GitLab MR with zero human intervention.
  • Dual MCP Integration: Leveraging both Arize Phoenix and GitLab Model Context Protocol servers in the same workflow to demonstrate the power of standardized agent tools.
  • Design Polish: Building a highly responsive, premium glassmorphism dashboard that visualizes the pipeline's stages in real time.

What we learned

  • Structured Outputs are Mandatory: Guaranteeing structured JSON outputs via Pydantic schemas is absolutely essential when building agent chains where one LLM's output directly fuels another's optimization loop.
  • MCP is a Game Changer: Standardizing git operations and observability lookups behind MCP servers dramatically simplified our tool definitions, avoiding raw API integration spaghetti.
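To make the structured-output point concrete, here is a small sketch of validating one model's JSON before it is handed to the next stage. Argus uses Pydantic schemas for this; the example below swaps in a standard-library dataclass with manual checks so that it runs without extra dependencies. The `Diagnosis` fields and the `parse_diagnosis` helper are hypothetical, and the category names are taken from the examples in the Diagnose step.

```python
import json
from dataclasses import dataclass

# Root-cause categories mentioned in the Diagnose step (illustrative, not exhaustive).
ALLOWED_CATEGORIES = {"prompt_leakage", "truncation", "policy_violation"}
REQUIRED_FIELDS = {"trace_id", "category", "rationale"}


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical schema for one classified failing trace."""
    trace_id: str
    category: str
    rationale: str


def parse_diagnosis(raw: str) -> Diagnosis:
    """Parse and validate model output; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return Diagnosis(
        trace_id=str(data["trace_id"]),
        category=category,
        rationale=str(data["rationale"]),
    )
```

Rejecting malformed output at the boundary means the optimizer stage only ever sees well-formed diagnoses, so a bad classification fails loudly instead of silently skewing the next prompt.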

What's next for Argus

  • Multi-Agent Chain Optimization: Extending the optimizer from single-prompt optimizations to multi-turn agent conversations and agent graphs, where prompts depend heavily on one another.
  • GitLab MR Comment Loop: Allowing developers to leave comments directly on the GitLab MR (e.g., "Make the tone more casual"), which the agent will scrape and use to regenerate and test a new variant.
  • Dynamic Dataset Curation: Autonomously promoting newly discovered telemetry edge-cases directly into the Golden Dataset to continuously raise the testing bar.
