# 🛡️ Aegis Budget Guardian

## Inspiration

While working with AI agents in production, I witnessed a common nightmare: **runaway costs**. A single misconfigured agent making continuous LLM API calls burned through $2,000 in just 3 hours before anyone noticed. Traditional monitoring only showed total spend - by the time alerts fired, the damage was done.

I realized we needed **proactive** cost monitoring - not reactive billing alerts. What if we could track the **rate of spend** ($/minute) instead of just total cost? That's how Aegis was born.

## What it does

**Aegis Budget Guardian** monitors AI agents in real time and automatically kills runaway sessions before they drain your budget.

### Key Features:
- 📊 **Real-time cost velocity tracking** ($/minute metric)
- 🤖 **LLM observability** via Datadog (traces, tokens, quality)
- 🚨 **Automated alerts** when cost velocity exceeds threshold
- 🔧 **Self-healing** via Datadog Workflow Automation
- ⚡ **Emergency kill switch** to stop runaway agents
- 🌐 **Production deployment** on Google Cloud Run

### The Innovation: Cost Velocity

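Instead of alerting when you've **already spent** $100, Aegis alerts when you're **burning at** $5/minute. At that rate $100 takes 20 minutes to spend, so a velocity alert raised within the first minute arrives with most of those 20 minutes still left to react.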
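To make the arithmetic concrete, here is a minimal sketch of a rolling cost-velocity window. The class and function names (`CostVelocityWindow`, `minutes_until`) and the 60-second window are illustrative assumptions, not the project's actual code.

```python
from collections import deque


class CostVelocityWindow:
    """Rolling spend rate in $/minute over the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp_seconds, cost_usd), oldest first

    def record(self, timestamp, cost):
        """Register the cost of one LLM call at `timestamp` (seconds)."""
        self._events.append((timestamp, cost))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def velocity(self, now):
        """Dollars spent inside the window, scaled to $/minute."""
        self._evict(now)
        spent = sum(cost for _, cost in self._events)
        return spent / (self.window_seconds / 60.0)


def minutes_until(budget_remaining, velocity):
    """Minutes before `budget_remaining` is spent at a constant velocity."""
    if velocity <= 0:
        return float("inf")
    return budget_remaining / velocity


window = CostVelocityWindow(window_seconds=60)
for second in range(0, 60, 12):        # five $1 calls within one minute
    window.record(timestamp=second, cost=1.0)
rate = window.velocity(now=59)         # 5.0 $/minute
print(minutes_until(100.0, rate))      # 20.0
```

A total-spend alert at $100 would fire only at the end of that 20-minute window, while the velocity reading is already available after the first minute.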

## How I built it

### Tech Stack:
- **LLM:** Google Gemini 2.5 Flash (via AI Studio API)
- **Backend:** FastAPI + Python 3.11
- **Observability:** Datadog LLM Observability + Custom Metrics
- **Automation:** Datadog Workflow Engine
- **Deployment:** Google Cloud Run (containerized)
- **Frontend:** Vanilla HTML/JS (no framework needed)

### Architecture Flow:

1. Agent request
2. Gemini API call (tracked)
3. Custom metrics → Datadog
4. Cost velocity monitor
5. If the threshold is exceeded, a Datadog Workflow is triggered
6. HTTP POST to the `/kill` endpoint
7. Session terminated


### Implementation Highlights:

**1. Custom Metrics Collector** (`agent/monitors.py`)
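```python
def track_cost_velocity(cost, duration, session_id, ticker):
    # cost is in USD and duration in seconds; skip zero-length calls
    # so the division below cannot raise ZeroDivisionError
    if duration <= 0:
        return
    velocity = (cost / duration) * 60  # $/second -> $/minute
    send_to_datadog('aegis.cost_velocity', velocity)
```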

**2. Agent with Session Tracking** (`agent/agent.py`)

- Gemini API integration
- Token counting & cost calculation
- Session management with kill capability
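The pieces listed above could fit together roughly as follows. This is a sketch under stated assumptions: `SessionRegistry`, `SessionKilled`, and `call_cost` are hypothetical names, and the per-token prices are placeholders rather than real Gemini pricing.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens (assumption, not real pricing).
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 2.00


def call_cost(input_tokens, output_tokens):
    """Cost in USD of one LLM call, derived from its token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


class SessionKilled(Exception):
    """Raised when work is attempted on a session that has been killed."""


@dataclass
class Session:
    session_id: str
    total_cost: float = 0.0
    killed: bool = False


class SessionRegistry:
    """In-memory session store with a kill switch (hypothetical design)."""

    def __init__(self):
        self._sessions = {}

    def start(self, session_id):
        self._sessions[session_id] = Session(session_id)
        return self._sessions[session_id]

    def kill(self, session_id):
        """Flag a session as killed; returns False if the id is unknown."""
        session = self._sessions.get(session_id)
        if session is None:
            return False
        session.killed = True
        return True

    def charge(self, session_id, input_tokens, output_tokens):
        """Add one call's cost to the session; refuses killed sessions."""
        session = self._sessions[session_id]
        if session.killed:
            raise SessionKilled(session_id)
        session.total_cost += call_cost(input_tokens, output_tokens)
        return session.total_cost
```

In this design the agent loop would call `charge` around every LLM request, so a `kill` issued from the `/kill` endpoint takes effect on the very next call.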

**3. Datadog Integration**

- LLM Observability for traces
- Custom metrics via HTTP API (no agent needed)
- Monitors with dynamic thresholds
- Workflow automation for self-healing
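As an illustration of agentless metric submission, the sketch below posts a gauge point to Datadog's v1 series endpoint using only the standard library. The body of the project's own `send_to_datadog` is not shown in this write-up, so this version, its `tags` parameter, and the `session:` tag name are assumptions; the host shown is the default US1 site, and other Datadog sites use a different host.

```python
import json
import os
import time
import urllib.request

# v1 metrics intake on the default US1 site (assumption: adjust for your site).
DATADOG_SERIES_URL = "https://api.datadoghq.com/api/v1/series"


def build_gauge_payload(metric, value, tags=None, timestamp=None):
    """Build the JSON body for a single gauge point."""
    ts = int(timestamp if timestamp is not None else time.time())
    return {
        "series": [{
            "metric": metric,
            "type": "gauge",
            "points": [[ts, float(value)]],
            "tags": list(tags or []),
        }]
    }


def send_to_datadog(metric, value, tags=None, api_key=None, timeout=5.0):
    """POST one gauge point to Datadog and return the HTTP status code."""
    api_key = api_key or os.environ["DD_API_KEY"]
    body = json.dumps(build_gauge_payload(metric, value, tags)).encode("utf-8")
    request = urllib.request.Request(
        DATADOG_SERIES_URL,
        data=body,
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the payload builder separate from the network call makes the metric format easy to unit-test without credentials.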

**4. Production Deployment**

- Dockerized FastAPI app
- Cloud Run for serverless scaling
- Environment-based configuration

## Challenges I ran into

**1. Vertex AI vs Google AI Studio Confusion**

I initially tried Vertex AI (`langchain-google-vertexai`) but hit authentication issues. Switching to the Google AI Studio API resolved them. Lesson learned: choose the API that fits the use case.

**2. Datadog Agent Dependency**

StatsD metrics require a local Datadog Agent. I solved this by sending metrics through the Datadog HTTP API directly, which keeps the service serverless-friendly.

**3. LangChain Version Conflicts**

Version conflicts between LangChain packages turned into dependency hell. I stripped the project down to minimal dependencies - only what's needed for production.

**4. Cost Velocity Calibration**

Finding the right threshold was tricky: too low causes false alarms, too high misses runaway cases. I settled on $0.01/min for Gemini Flash (adjustable per model).

**5. Cloud Run Environment Variables**

I had to carefully separate secrets (API keys) from config (service names), and used `--set-env-vars` at deployment.
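A small settings loader can make the secrets-versus-config split explicit at startup. This is a sketch: the variable names `GEMINI_API_KEY` and `COST_VELOCITY_THRESHOLD`, and the default service name, are assumptions rather than the project's real configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str        # secret
    datadog_api_key: str       # secret
    service_name: str          # plain config
    velocity_threshold: float  # plain config, $/minute


def load_settings(env=None):
    """Read settings from the environment, failing fast on missing secrets."""
    env = os.environ if env is None else env
    missing = [key for key in ("GEMINI_API_KEY", "DD_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError("missing required secrets: " + ", ".join(missing))
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        datadog_api_key=env["DD_API_KEY"],
        service_name=env.get("DD_SERVICE", "aegis-budget-guardian"),
        velocity_threshold=float(env.get("COST_VELOCITY_THRESHOLD", "0.01")),
    )
```

Secrets have no defaults, so a misconfigured deployment fails at boot instead of at the first API call, while plain config falls back to sensible values.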

## Accomplishments that I'm proud of

- ✅ **Novel metric:** Cost velocity ($/min) is more actionable than total spend
- ✅ **End-to-end automation:** From detection to remediation with zero human intervention
- ✅ **Production-ready:** Live on Cloud Run, not just a demo
- ✅ **Clean architecture:** Modular, testable, extensible
- ✅ **Real Datadog integration:** Not mocked - actual LLM Observability + Workflows

**Most proud of:** The system actually works in production. It's not vaporware - you can test it right now at the live URL.