Inspiration

I have been exploring agentic AI platforms for a few months now — experimenting with LangChain, building small automations, and trying to understand how AI agents can do real work beyond just answering questions. During this journey, I kept running into the same pattern: infrastructure breaks at the worst possible time, someone gets paged, they SSH into a server, run a few commands, and fix something that honestly follows a very predictable playbook.

That got me thinking — if the fix is predictable, why does a human have to do it?

I was exploring DigitalOcean's API documentation and was genuinely surprised by how clean and well-structured it is. You can reboot a droplet, resize a node, or query live metrics with a single API call. That was the lightbulb moment. If an LLM could read an alert, reason about what is wrong, and call the right DigitalOcean API endpoint — you basically have an on-call engineer that never sleeps.

That is how DOzen was born.

What it does

DOzen is an AI-driven operations agent built exclusively for the DigitalOcean platform. It sits between your monitoring stack and your infrastructure, and acts autonomously when things go wrong.

Here is the flow:

  1. Alert Ingestion — Prometheus/Alertmanager fires a webhook when something breaks.
  2. AI Diagnosis — An LLM reads the alert payload, error traces, and system load, then produces a plain-English diagnostic summary and a remediation plan.
  3. Autonomous Action — DOzen executes the fix directly via the DigitalOcean API — rebooting a droplet, scaling a node, or flagging the issue for human review if it is outside safe thresholds.
  4. Cost Guardrails — Every action is validated against a configurable spending ceiling so autonomous operations never run away financially.
  5. Live Dashboard — A React frontend shows a real-time infrastructure map, live telemetry, spending charts, and a terminal-style task feed so you always know exactly what DOzen is doing and why.

How we built it

The backend is Python and FastAPI. It exposes a webhook endpoint that Alertmanager posts to, handles the AI reasoning loop, and talks to the DigitalOcean API. SQLite stores incident history and action logs. WebSockets push live updates to the frontend the moment something happens.

For the AI layer, I built an agentic reasoning loop where the LLM is not just generating text — it is deciding which tool to call, calling it, reading the result, and deciding what to do next. This is what makes it genuinely agentic rather than just a chatbot wrapper.

The frontend is React with TypeScript, styled with Tailwind CSS, and animated with Framer Motion. Recharts handles the live spending and telemetry graphs. The goal was for the dashboard to feel like a real command center with substance behind every panel.

For testing the whole pipeline end-to-end, I wrote a chaos engineering script that fires synthetic alerts and lets you watch DOzen diagnose and respond in real time.

Observability is handled via OpenLIT OTLP integration so every LLM call, token count, and latency is traced and visible.

Challenges we ran into

Getting the agentic loop right was the hardest part. Early versions of the reasoning loop would either over-act (trying to do too much from a single vague alert) or under-act (producing a diagnosis but not committing to a fix). Tuning the prompts and the tool boundaries took a lot of iteration.

DigitalOcean API auth and rate limits tripped me up early. Understanding how Personal Access Tokens scope across resources and how to safely handle API errors inside an autonomous loop required careful reading and a fair amount of trial and error.

WebSocket state synchronization between the backend and the React dashboard was trickier than expected. Keeping the live map and the task feed consistent without race conditions needed some deliberate design around how events were sequenced and broadcast.

The cost guardrail logic was conceptually simple but practically subtle. Deciding what counts as a safe autonomous action versus something that needs human approval required thinking carefully about edge cases — especially for resize operations that are irreversible.

Accomplishments that we're proud of

The moment the end-to-end loop worked for the first time — a synthetic alert fired, the LLM diagnosed it correctly, and DOzen executed a droplet reboot via the API autonomously — that felt like a real milestone. Shipping something that actually does something consequential in the real world is a different kind of satisfaction than building a tutorial project.

The dashboard also came together well. The live infrastructure map, the animated terminal feed, and the real-time telemetry charts communicate what is happening under the hood clearly and without clutter.

What we learned

  • Agentic AI is not magic — it is careful prompt engineering plus solid tool design. The quality of the agent's decisions is directly tied to how clearly you define what each tool does and what its boundaries are.

  • DigitalOcean's API is genuinely developer-friendly. The documentation was clear and the API was predictable, which made a significant difference when building on top of it with minimal prior experience on the platform.

  • Autonomous systems need guardrails from day one, not as an afterthought. Letting an AI agent make real API calls with real consequences focuses your mind on safety in a way that purely generative projects do not.

  • Observability matters. Integrating OpenLIT early made debugging the AI loop significantly easier because every model input, decision, and output was traceable.

What's next for DOzen: AI-Powered Ops Agent for DigitalOcean Infrastructure

  • Multi-cloud expansion — extending beyond DigitalOcean to AWS and GCP so the same agentic loop works across providers.
  • Memory and pattern learning — giving DOzen a persistent incident memory so it recognizes recurring failure patterns and gets smarter about remediation over time.
  • Slack and PagerDuty integration — posting diagnostic summaries and requesting human approval for high-risk actions directly in the tools teams already use.
  • Expanded remediation library — more DigitalOcean API actions including managed database failovers, load balancer reconfiguration, and firewall rule updates.
  • Fine-tuned ops model — eventually training a smaller, faster model specifically on infrastructure incident data for lower latency and lower cost per resolution.

Built With

Share this project:

Updates