# 🛡️ Aegis Budget Guardian
## Inspiration
While working with AI agents in production, I witnessed a common nightmare: **runaway costs**. A single misconfigured agent making continuous LLM API calls burned through $2,000 in just 3 hours before anyone noticed. Traditional monitoring only showed total spend - by the time alerts fired, the damage was done.
I realized we needed **proactive** cost monitoring - not reactive billing alerts. What if we could track the **rate of spend** ($/minute) instead of just total cost? That's how Aegis was born.
## What it does
**Aegis Budget Guardian** monitors AI agents in real-time and automatically kills runaway sessions before they drain your budget.
### Key Features:
- 📊 **Real-time cost velocity tracking** ($/minute metric)
- 🤖 **LLM observability** via Datadog (traces, tokens, quality)
- 🚨 **Automated alerts** when cost velocity exceeds threshold
- 🔧 **Self-healing** via Datadog Workflow Automation
- ⚡ **Emergency kill switch** to stop runaway agents
- 🌐 **Production deployment** on Google Cloud Run
### The Innovation: Cost Velocity
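Instead of alerting when you've **already spent** $100, Aegis alerts when you're **burning at** $5/minute. At that rate a $100 spend alert fires only after 20 minutes, while a velocity alert can fire within the first minute, giving you roughly 20x more reaction time.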
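To make that arithmetic concrete, here is a minimal sketch comparing when the two kinds of alert would fire at a constant burn rate. The function names and the one-minute evaluation window are illustrative assumptions, not part of the Aegis codebase.

```python
def minutes_until_spend_alert(spend_threshold: float, burn_rate_per_min: float) -> float:
    """Minutes until cumulative spend reaches the threshold at a constant burn rate."""
    if burn_rate_per_min <= 0:
        raise ValueError("burn rate must be positive")
    return spend_threshold / burn_rate_per_min


def reaction_time_gain(spend_threshold: float, burn_rate_per_min: float,
                       velocity_window_min: float = 1.0) -> float:
    """How many times sooner a velocity alert fires.

    Assumes (hypothetically) that the velocity alert needs one full
    evaluation window of data before it can trigger.
    """
    return minutes_until_spend_alert(spend_threshold, burn_rate_per_min) / velocity_window_min


print(minutes_until_spend_alert(100, 5))  # 20.0 minutes before a $100 alert fires
print(reaction_time_gain(100, 5))         # 20.0, i.e. the "20x" figure
```

A longer evaluation window shrinks the gain proportionally, so the 20x figure depends on how quickly the monitor evaluates.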
## How I built it
### Tech Stack:
- **LLM:** Google Gemini 2.5 Flash (via AI Studio API)
- **Backend:** FastAPI + Python 3.11
- **Observability:** Datadog LLM Observability + Custom Metrics
- **Automation:** Datadog Workflow Engine
- **Deployment:** Google Cloud Run (containerized)
- **Frontend:** Vanilla HTML/JS (no framework needed)
### Architecture Flow:
1. Agent request
2. Gemini API call (tracked)
3. Custom metrics → Datadog
4. Cost velocity monitor
5. If threshold exceeded: Datadog Workflow triggered
6. HTTP POST to `/kill` endpoint
7. Session terminated
### Implementation Highlights:
**1. Custom Metrics Collector** (`agent/monitors.py`)
```python
def track_cost_velocity(cost, duration, session_id, ticker):
    # cost is in dollars and duration in seconds, so scale to dollars per minute
    velocity = (cost / duration) * 60  # $/minute
    send_to_datadog('aegis.cost_velocity', velocity)
```
**2. Agent with Session Tracking** (`agent/agent.py`)
- Gemini API integration
- Token counting & cost calculation
- Session management with kill capability
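The token-counting and kill-capable session management listed above could look roughly like the sketch below. It is not the code in `agent/agent.py`: the `SessionRegistry` class, its method names, and the per-token prices are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# Placeholder prices in USD per one million tokens (assumed values, not real Gemini pricing).
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 2.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one LLM call from its token counts."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


@dataclass
class Session:
    session_id: str
    total_cost: float = 0.0
    killed: bool = False


class SessionRegistry:
    """In-memory session store; a killed session refuses further calls."""

    def __init__(self) -> None:
        self._sessions: dict[str, Session] = {}

    def start(self, session_id: str) -> Session:
        session = Session(session_id)
        self._sessions[session_id] = session
        return session

    def record_call(self, session_id: str, input_tokens: int, output_tokens: int) -> float:
        session = self._sessions[session_id]
        if session.killed:
            raise RuntimeError(f"session {session_id} was killed")
        cost = estimate_cost(input_tokens, output_tokens)
        session.total_cost += cost
        return cost

    def kill(self, session_id: str) -> bool:
        """Mark a session as killed; returns False if the session is unknown."""
        session = self._sessions.get(session_id)
        if session is None:
            return False
        session.killed = True
        return True
```

Checking the `killed` flag before each call is what lets an external trigger, such as a workflow hitting a kill endpoint, stop a loop between two LLM requests.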

**3. Datadog Integration**
- LLM Observability for traces
- Custom metrics via HTTP API (no agent needed)
- Monitors with dynamic thresholds
- Workflow automation for self-healing

**4. Production Deployment**
- Dockerized FastAPI app
- Cloud Run for serverless scaling
- Environment-based configuration
## Challenges I ran into
### 1. Vertex AI vs Google AI Studio Confusion
I initially tried Vertex AI (`langchain-google-vertexai`) but hit authentication issues. Switching to the Google AI Studio API worked seamlessly. Lesson learned: choose the right API for the use case.
### 2. Datadog Agent Dependency
StatsD metrics required a local Datadog Agent. I solved this by calling the Datadog HTTP API directly, which makes the setup serverless-friendly.
### 3. LangChain Version Conflicts
I hit dependency hell between LangChain versions, so I stripped the project down to minimal dependencies: only what's needed for production.
### 4. Cost Velocity Calibration
Finding the right threshold was tricky: too low means false alarms, too high means missed runaway cases. I settled on $0.01/min for Gemini Flash (adjustable per model).
### 5. Cloud Run Environment Variables
I had to carefully separate secrets (API keys) from config (service names). I used `--set-env-vars` at deployment.
## Accomplishments that I'm proud of
- ✅ **Novel metric:** Cost velocity ($/min) is more actionable than total spend
- ✅ **End-to-end automation:** From detection to remediation with zero human intervention
- ✅ **Production-ready:** Live on Cloud Run, not just a demo
- ✅ **Clean architecture:** Modular, testable, extensible
- ✅ **Real Datadog integration:** Not mocked - actual LLM Observability + Workflows

**Most proud of:** The system actually works in production. It's not vaporware - you can test it right now at the live URL.
## What I learned
### Technical:
- Datadog LLM Observability is powerful for agent monitoring
- Workflow Automation enables true self-healing systems
- Cost tracking needs to be real-time, not batch
- Serverless (Cloud Run) is perfect for agent APIs
### Product:
- Developers care more about rate of spend than total spend
- Self-healing systems need human override (kill button)
- Good observability = proactive, not reactive
### Hackathon:
- Ship fast, iterate faster
- Focus on ONE novel idea (cost velocity) done well
- Production deployment >>> localhost demo
## What's next for Aegis Budget Guardian
### Short-term:
- 📧 Slack/email notifications when sessions are killed
- 📊 Cost forecasting based on velocity trends
- 🎯 Per-project budget limits for multi-tenant use
### Medium-term:
- 🔀 Multi-agent orchestration monitoring
- 💰 Budget allocation across teams/projects
- 📈 Cost optimization recommendations (model suggestions)
### Long-term:
- 🤖 Auto-scaling cost limits based on usage patterns
- 🌐 Multi-cloud support (AWS Bedrock, Azure OpenAI)
- 💳 Billing API integration for actual spend reconciliation
## Try it yourself
**Live Demo:** https://aegis-budget-guardian-387404025158.us-central1.run.app

**GitHub:** https://github.com/Harsh8818198/aegis-budget-guardian.git

**Test Cases:**
- `AAPL` - Normal analysis (~3s, $0.00002)
- `SLOW` - 10-second delay test
- `LOOP` - Runaway scenario (auto-killed)
Built for the Datadog AI Partner Catalyst Hackathon
Because AI agents shouldn't cost more than your coffee budget ☕
## Built With
- **Languages:** Python 3.11, JavaScript, HTML/CSS
- **Frameworks:** FastAPI, Uvicorn
- **APIs:** Google Gemini 2.5 Flash (AI Studio), Datadog Metrics API, Datadog LLM Observability
- **Cloud:** Google Cloud Run, Google Cloud Build, Artifact Registry
- **Observability:** Datadog (LLM Observability, Custom Metrics, Monitors, Workflows)
- **Tools:** Docker, ddtrace