# 🛡️ Aegis Budget Guardian

## Inspiration

While working with AI agents in production, I witnessed a common nightmare: **runaway costs**. A single misconfigured agent making continuous LLM API calls burned through $2,000 in just 3 hours before anyone noticed. Traditional monitoring only showed total spend - by the time alerts fired, the damage was done.

I realized we needed **proactive** cost monitoring - not reactive billing alerts. What if we could track the **rate of spend** ($/minute) instead of just total cost? That's how Aegis was born.

## What it does

**Aegis Budget Guardian** monitors AI agents in real-time and automatically kills runaway sessions before they drain your budget.

### Key Features:
- 📊 **Real-time cost velocity tracking** ($/minute metric)
- 🤖 **LLM observability** via Datadog (traces, tokens, quality)
- 🚨 **Automated alerts** when cost velocity exceeds threshold
- 🔧 **Self-healing** via Datadog Workflow Automation
- ⚡ **Emergency kill switch** to stop runaway agents
- 🌐 **Production deployment** on Google Cloud Run

### The Innovation: Cost Velocity

Instead of alerting when you've **already spent** $100, Aegis alerts when you're **burning at** $5/minute. At that rate it takes 20 minutes to spend $100, so an alert on the rate can fire long before an alert on the total would.
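The arithmetic behind that comparison can be sketched in a few lines. This is an illustrative helper, not code from the Aegis repository; the function name `minutes_until_budget` and its parameters are assumptions, and the figures are the ones quoted above.

```python
def minutes_until_budget(budget_usd: float, spent_usd: float, velocity_usd_per_min: float) -> float:
    """Minutes left before `budget_usd` is reached at the current burn rate."""
    if velocity_usd_per_min <= 0:
        return float("inf")  # not spending, so the budget is never reached
    remaining = max(budget_usd - spent_usd, 0.0)
    return remaining / velocity_usd_per_min


# A $100 budget, nothing spent yet, burning at $5/minute.
print(minutes_until_budget(100.0, 0.0, 5.0))  # 20.0
```

A total-spend alert only fires once this value has dropped to zero, while a velocity alert can fire while most of those minutes are still left.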

## How I built it

### Tech Stack:
- **LLM:** Google Gemini 2.5 Flash (via AI Studio API)
- **Backend:** FastAPI + Python 3.11
- **Observability:** Datadog LLM Observability + Custom Metrics
- **Automation:** Datadog Workflow Engine
- **Deployment:** Google Cloud Run (containerized)
- **Frontend:** Vanilla HTML/JS (no framework needed)

### Architecture Flow:

1. Agent request
2. Gemini API call (tracked)
3. Custom metrics → Datadog
4. Cost velocity monitor
5. If threshold exceeded: Datadog Workflow triggered
6. HTTP POST to `/kill` endpoint
7. Session terminated
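To make the flow above concrete, here is a minimal in-process simulation of the "measure velocity, compare to threshold, kill" loop. In Aegis itself the monitoring and the trigger run inside Datadog (a monitor plus a Workflow), so this is only a sketch of the idea under simplifying assumptions: the class `VelocityGuard`, the sliding window, and the `on_exceed` callback are all hypothetical names introduced for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class VelocityGuard:
    """Tracks one session's spend over a sliding window and reports when $/minute exceeds a limit."""

    def __init__(self, session_id: str, limit_usd_per_min: float, window_s: float,
                 on_exceed: Callable[[str, float], None]) -> None:
        self.session_id = session_id
        self.limit_usd_per_min = limit_usd_per_min
        self.window_s = window_s
        self.on_exceed = on_exceed
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, cost)

    def record(self, cost_usd: float, now: Optional[float] = None) -> float:
        """Record the cost of one LLM call and return the current velocity in $/minute."""
        now = time.monotonic() if now is None else now
        self._events.append((now, cost_usd))
        # Drop calls that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        velocity = sum(cost for _, cost in self._events) / self.window_s * 60.0
        if velocity > self.limit_usd_per_min:
            self.on_exceed(self.session_id, velocity)  # stands in for the workflow's POST to /kill
        return velocity


killed = []
guard = VelocityGuard("demo-session", limit_usd_per_min=0.01, window_s=60.0,
                      on_exceed=lambda sid, v: killed.append(sid))
for second in range(5):
    guard.record(0.004, now=float(second))  # one simulated $0.004 call per second
```

With these made-up numbers the callback first fires on the third call, once the window holds more than $0.01 of spend.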


### Implementation Highlights:

**1. Custom Metrics Collector** (`agent/monitors.py`)
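```python
def track_cost_velocity(cost, duration, session_id, ticker):
    """Report the spend rate of one agent call; `duration` is in seconds."""
    if duration <= 0:
        return  # avoid dividing by zero on instantaneous calls
    velocity = (cost / duration) * 60  # $/second -> $/minute
    send_to_datadog('aegis.cost_velocity', velocity)
```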

**2. Agent with Session Tracking** (`agent/agent.py`)

- Gemini API integration
- Token counting & cost calculation
- Session management with kill capability
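The token-cost and kill-capability pieces listed above could look roughly like the following. This is a sketch, not the contents of `agent/agent.py`: the names `call_cost`, `Session`, `SessionRegistry`, and `run_agent_loop` are hypothetical, the LLM call is simulated, and the per-token prices are placeholders rather than real Gemini pricing.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict

# Placeholder prices in USD per 1M tokens; NOT real Gemini pricing.
PRICE_PER_M_INPUT = 0.10
PRICE_PER_M_OUTPUT = 0.40


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call computed from its token counts."""
    return (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


@dataclass
class Session:
    session_id: str
    total_cost: float = 0.0
    killed: threading.Event = field(default_factory=threading.Event)


class SessionRegistry:
    """Keeps live sessions and lets an external caller (e.g. a /kill handler) stop one."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}
        self._lock = threading.Lock()

    def start(self, session_id: str) -> Session:
        with self._lock:
            return self._sessions.setdefault(session_id, Session(session_id))

    def kill(self, session_id: str) -> bool:
        """Flag the session as killed; returns False if the session is unknown."""
        with self._lock:
            session = self._sessions.get(session_id)
        if session is None:
            return False
        session.killed.set()
        return True


def run_agent_loop(session: Session, max_steps: int) -> int:
    """Toy agent loop that checks the kill flag before every (simulated) LLM call."""
    steps = 0
    while steps < max_steps and not session.killed.is_set():
        session.total_cost += call_cost(input_tokens=1_000, output_tokens=500)
        steps += 1
    return steps
```

Checking the flag before each call means a kill takes effect at the next step boundary; a call that is already in flight still completes and is still billed.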

**3. Datadog Integration**

- LLM Observability for traces
- Custom metrics via HTTP API (no agent needed)
- Monitors with dynamic thresholds
- Workflow automation for self-healing
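For the "custom metrics via HTTP API" item, a gauge can be submitted with only the standard library. The sketch below builds a request for Datadog's v2 series endpoint; the helper names `build_series_payload` and `build_request`, the tag values, and the choice of the `datadoghq.com` site are assumptions, and the write-up does not say which endpoint version Aegis actually uses.

```python
import json
import time
import urllib.request
from typing import List, Optional

# Assumed site; other Datadog regions use a different host.
DD_SERIES_URL = "https://api.datadoghq.com/api/v2/series"
GAUGE = 3  # metric type enum in the v2 series API: 3 = gauge


def build_series_payload(metric: str, value: float, tags: List[str],
                         timestamp: Optional[int] = None) -> dict:
    """Build the JSON body for a single gauge point."""
    ts = int(time.time()) if timestamp is None else timestamp
    return {
        "series": [
            {
                "metric": metric,
                "type": GAUGE,
                "points": [{"timestamp": ts, "value": value}],
                "tags": tags,
            }
        ]
    }


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Wrap the payload in a POST request authenticated with the DD-API-KEY header."""
    return urllib.request.Request(
        DD_SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


payload = build_series_payload("aegis.cost_velocity", 0.004, ["session:demo", "ticker:AAPL"])
request = build_request(payload, api_key="<your-api-key>")
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

Because nothing here depends on a local StatsD listener, the same code runs unchanged in a container on Cloud Run.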

**4. Production Deployment**

- Dockerized FastAPI app
- Cloud Run for serverless scaling
- Environment-based configuration

## Challenges I ran into

### 1. Vertex AI vs Google AI Studio Confusion

I initially tried Vertex AI (`langchain-google-vertexai`) but hit authentication issues. Switching to the Google AI Studio API worked seamlessly. Lesson learned: choose the API that fits the use case.

### 2. Datadog Agent Dependency

StatsD metrics require a local Datadog Agent. I solved this by calling the Datadog HTTP API directly, which keeps the service serverless-friendly.

### 3. LangChain Version Conflicts

I ran into dependency hell between LangChain versions, so I stripped the project down to minimal dependencies: only what production needs.

### 4. Cost Velocity Calibration

Finding the right threshold was tricky: too low means false alarms, too high means missed runaway cases. I settled on $0.01/min for Gemini Flash (adjustable per model).
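One way to keep the threshold adjustable per model is to generate the monitor definition from a small table. The sketch below produces a dictionary in the shape accepted by Datadog's create-monitor endpoint (`POST /api/v1/monitor`). Only the Gemini Flash value comes from this write-up; the second table entry, the `model` tag on the metric, the 5-minute window, and the function name `build_monitor` are assumptions made for illustration.

```python
from typing import Dict

# Only the Gemini Flash value comes from the write-up; the other entry is a made-up placeholder.
THRESHOLDS_USD_PER_MIN: Dict[str, float] = {
    "gemini-2.5-flash": 0.01,
    "some-larger-model": 0.10,  # hypothetical
}


def build_monitor(model: str, window: str = "last_5m") -> dict:
    """Build a metric-alert monitor definition for one model's cost velocity."""
    threshold = THRESHOLDS_USD_PER_MIN[model]
    return {
        "name": f"Aegis cost velocity too high ({model})",
        "type": "metric alert",
        # Assumes the metric is tagged with `model:<name>` when it is submitted.
        "query": f"avg({window}):avg:aegis.cost_velocity{{model:{model}}} > {threshold}",
        "message": "Cost velocity exceeded the per-model limit.",
        "options": {"thresholds": {"critical": threshold}},
    }


print(build_monitor("gemini-2.5-flash")["query"])
# avg(last_5m):avg:aegis.cost_velocity{model:gemini-2.5-flash} > 0.01
```

Averaging over a window rather than alerting on a single point is one way to reduce false alarms from a single expensive call, at the price of a slower reaction.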

### 5. Cloud Run Environment Variables

I had to carefully separate secrets (API keys) from config (service names). I used `--set-env-vars` for deployment.
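On the application side, the split between secrets and plain config can be made explicit when the environment is read at startup. This is a sketch, not the project's actual settings code: the `Settings` class, the `load_settings` function, the default values, and the variable names `GEMINI_API_KEY` and `AEGIS_VELOCITY_LIMIT` are assumptions (`DD_API_KEY` and `DD_SERVICE` follow common Datadog naming).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str    # secret: no default, must be provided
    dd_api_key: str        # secret: no default, must be provided
    dd_service: str        # plain config
    velocity_limit: float  # plain config, in $/minute


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment and fail fast if a secret is missing."""
    missing = [name for name in ("GEMINI_API_KEY", "DD_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required secrets: {', '.join(missing)}")
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        dd_api_key=env["DD_API_KEY"],
        dd_service=env.get("DD_SERVICE", "aegis-budget-guardian"),
        velocity_limit=float(env.get("AEGIS_VELOCITY_LIMIT", "0.01")),
    )
```

Failing at startup when a secret is missing surfaces a bad deployment immediately instead of on the first agent request. For the secrets themselves, Cloud Run can also inject values from Secret Manager via `--set-secrets`, which may be worth considering instead of passing API keys through `--set-env-vars`.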

## Accomplishments that I'm proud of

- **Novel metric:** Cost velocity ($/min) is more actionable than total spend
- **End-to-end automation:** From detection to remediation with zero human intervention
- **Production-ready:** Live on Cloud Run, not just a demo
- **Clean architecture:** Modular, testable, extensible
- **Real Datadog integration:** Not mocked - actual LLM Observability + Workflows

**Most proud of:** The system actually works in production. It's not vaporware - you can test it right now at the live URL.

## What I learned

### Technical:
- Datadog LLM Observability is powerful for agent monitoring
- Workflow Automation enables true self-healing systems
- Cost tracking needs to be real-time, not batch
- Serverless (Cloud Run) is perfect for agent APIs

### Product:
- Developers care more about rate of spend than total spend
- Self-healing systems need human override (kill button)
- Good observability = proactive, not reactive

### Hackathon:
- Ship fast, iterate faster
- Focus on ONE novel idea (cost velocity) done well
- Production deployment >>> localhost demo

## What's next for Aegis Budget Guardian

### Short-term:
- 📧 Slack/email notifications when sessions are killed
- 📊 Cost forecasting based on velocity trends
- 🎯 Per-project budget limits for multi-tenant use

### Medium-term:
- 🔀 Multi-agent orchestration monitoring
- 💰 Budget allocation across teams/projects
- 📈 Cost optimization recommendations (model suggestions)

### Long-term:
- 🤖 Auto-scaling cost limits based on usage patterns
- 🌐 Multi-cloud support (AWS Bedrock, Azure OpenAI)
- 💳 Billing API integration for actual spend reconciliation

## Try it yourself

**Live Demo:** https://aegis-budget-guardian-387404025158.us-central1.run.app

**GitHub:** https://github.com/Harsh8818198/aegis-budget-guardian.git

### Test Cases:
- **AAPL** - Normal analysis (~3s, $0.00002)
- **SLOW** - 10-second delay test
- **LOOP** - Runaway scenario (auto-killed)

Built for the Datadog AI Partner Catalyst Hackathon

Because AI agents shouldn't cost more than your coffee budget

## Built With

- **Languages:** Python 3.11, JavaScript, HTML/CSS
- **Frameworks:** FastAPI, Uvicorn
- **APIs:** Google Gemini 2.5 Flash (AI Studio), Datadog Metrics API
- **Cloud:** Google Cloud Run, Google Cloud Build, Artifact Registry
- **Observability:** Datadog (LLM Observability, Custom Metrics, Monitors, Workflows)
- **Tools:** Docker, ddtrace