Inspiration
We were exploring Akash Network and realized something - if your deployment goes down, there's literally nothing to bring it back. No health checks, no auto-recovery, nothing. You have to manually go and redeploy everything yourself.
We spoke to Greg Osuri (CEO of Akash Network) about this and he confirmed that yes, there is no self-healing on the platform right now.
That's when we decided to build AkashGuard.
What it does
AkashGuard is an autonomous agent that monitors your Akash deployments and fixes them automatically when they break.
It runs a loop every 30 seconds - checks health of all services, and if something is down, it sends the health data to Llama 3.3 70B on AkashML which diagnoses the problem and gives a confidence score. If confidence is above 70%, the agent automatically closes the broken deployment, creates a new one, accepts a bid from a provider, creates the lease, and waits till the new service is live.
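A minimal sketch of the decision gate in that loop, assuming health results arrive as a service-to-boolean map and that diagnosis and recovery are passed in as callables. The names `Diagnosis`, `run_cycle`, `diagnose` and `recover` are hypothetical and not taken from the AkashGuard code; only the 70% threshold comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

CONFIDENCE_THRESHOLD = 0.70  # from the write-up: act only above 70% confidence


@dataclass
class Diagnosis:
    problem: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def run_cycle(
    health: Dict[str, bool],
    diagnose: Callable[[str], Diagnosis],
    recover: Callable[[str], None],
) -> List[str]:
    """Run one monitoring pass and return the services that were recovered."""
    recovered = []
    for service, healthy in health.items():
        if healthy:
            continue  # nothing to do for healthy services
        diagnosis = diagnose(service)  # stand-in for the LLM call
        if diagnosis.confidence > CONFIDENCE_THRESHOLD:
            recover(service)  # stand-in for close, redeploy, lease, wait
            recovered.append(service)
    return recovered
```

In the real agent this pass would be scheduled every 30 seconds; the sketch leaves scheduling out so the gate itself is easy to test.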
After recovery, Venice AI's vision model takes a screenshot of the restored service and visually verifies it's actually working.
Everything shows live on a real-time dashboard and you get voice alerts on Telegram. No human needed at any point.
How we built it
Python + FastAPI for the agent. SQLite for storing service data, health checks, and AI decisions.
For Akash we built a full Console API client that handles 8 endpoints - deployment create/close, bid fetching, lease creation, certificate management, URI polling. SDL templates are stored per service.
AkashML Llama 3.3 70B does the diagnosis. We give it structured health data and it returns JSON with what's wrong, how confident it is, and what to do.
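As a rough illustration of consuming that JSON safely, here is a sketch that validates the model's reply before anything acts on it. The field names (`diagnosis`, `confidence`, `action`) and the allowed action values are assumptions made for this example; the actual schema used by AkashGuard is not shown in this write-up.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_ACTIONS = {"redeploy", "wait"}  # hypothetical action names


@dataclass
class Decision:
    diagnosis: str
    confidence: float
    action: str


def parse_decision(raw: str) -> Optional[Decision]:
    """Return a Decision, or None if the reply is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model returned something that is not JSON
    if not isinstance(data, dict):
        return None
    diagnosis = data.get("diagnosis")
    confidence = data.get("confidence")
    action = data.get("action")
    if not isinstance(diagnosis, str):
        return None
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    # bool is a subclass of int, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Decision(diagnosis, float(confidence), action)
```

Treating an unparseable reply as "no decision" keeps a malformed response from ever triggering a redeploy.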
We use Venice AI for three things - TTS (voice alerts on Telegram), chat completions (generating incident narratives), and the Qwen 2.5 VL vision model (screenshot verification after recovery).
Dashboard is a single HTML file with SSE streaming, Chart.js graphs, and a pipeline visualization showing each step live. Telegram gets voice notes, incident report cards generated with Pillow, and vision check results.
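For readers unfamiliar with SSE, each pipeline step reaches the browser as a small text frame. The helper below is only a sketch of that wire format; the function name `sse_event`, the event name and the payload fields are made up for illustration and are not the ones the dashboard uses.

```python
import json


def sse_event(event: str, payload: dict) -> str:
    """Format one Server-Sent Events frame: event name, JSON data, blank line."""
    # json.dumps without indent emits a single line, which keeps the
    # payload inside one "data:" field.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


frame = sse_event("pipeline_step", {"step": "lease_created"})
```

On the browser side, an `EventSource` listener registered for the same event name would receive the JSON string in `event.data`.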
Challenges we ran into
Akash Console API documentation is very limited, so we had to figure out a lot through trial and error. For example, deployments don't auto-accept bids, so you have to create the lease separately. The SDL has to be sent as raw YAML, not as a file path. A certificate is needed before lease creation.
Getting the LLM to give consistent JSON responses for infrastructure decisions was tough. We had to do a lot of prompt tuning to handle edge cases like intermittent failures vs actual outages.
The timing of the recovery pipeline was also tricky - bids take time, leases take time, services take time to boot. We had to build polling with proper timeouts and also handle cleanup if something fails midway, such as orphaned deployments.
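The polling-plus-cleanup idea can be sketched roughly as follows. This is not the AkashGuard implementation: `poll_until` and `recover_with_cleanup` are hypothetical helpers, and the Akash calls are passed in as plain callables so no real API is assumed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it returns a non-None value or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("gave up waiting")
        sleep(interval_s)


def recover_with_cleanup(create_deployment, wait_for_bid, create_lease, close_deployment):
    """Create a deployment and lease; close the deployment if a later step fails."""
    dseq = create_deployment()
    try:
        bid = wait_for_bid(dseq)
        return create_lease(dseq, bid)
    except Exception:
        close_deployment(dseq)  # avoid leaving an orphaned deployment behind
        raise
```

Here `wait_for_bid` would typically wrap `poll_until` around a bid-listing call, so a timeout while waiting for bids also triggers the cleanup path.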
Accomplishments that we're proud of
As far as we know, this is the first self-healing agent for Akash Network. We have not found anything like it on the platform.
The agent handles the entire Akash deployment lifecycle through code - 8 API endpoints orchestrated with error handling and cleanup at every step.
The AI decision making is fully autonomous. Real health data goes in, real diagnosis comes out, real action happens. No human approval anywhere.
The live dashboard lets you watch the entire detect-diagnose-recover cycle happening in real time. During the demo you can literally see each step as it happens.
What we learned
How Akash deployments actually work under the hood - the bid-lease-certificate lifecycle, provider selection, URI assignment, and how to drive all of it programmatically.
Building AI agents for infrastructure is very different from building chatbots. The confidence calibration really matters. Too aggressive and you redeploy on small blips. Too conservative and real outages sit there.
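One common way to avoid reacting to small blips, independent of the LLM's confidence score, is to require several failed checks in a row before escalating. The sketch below shows that idea only; it is an assumption for illustration, not something the write-up says AkashGuard does, and `FailureTracker` is a hypothetical name.

```python
from collections import defaultdict
from typing import DefaultDict


class FailureTracker:
    """Escalate only after a service fails several consecutive health checks."""

    def __init__(self, required_failures: int = 3):
        self.required_failures = required_failures
        self._streaks: DefaultDict[str, int] = defaultdict(int)

    def record(self, service: str, healthy: bool) -> bool:
        """Record one check; return True when the outage looks sustained."""
        if healthy:
            self._streaks[service] = 0  # a single success resets the streak
            return False
        self._streaks[service] += 1
        return self._streaks[service] >= self.required_failures
```

A lower `required_failures` makes the agent more aggressive and a higher one makes it more conservative, which is the same trade-off described above.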
Graceful degradation is a must for autonomous systems. If Venice is down, skip voice but still recover. If AkashML is down, wait, don't act blindly. The core loop should never break.
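Those two rules can be sketched as a single handler. Everything here is illustrative: `handle_outage` and its callables are hypothetical names, `diagnose` is assumed to return a confidence between 0.0 and 1.0, and the status strings are invented for the example.

```python
import logging
from typing import Callable

log = logging.getLogger("akashguard.sketch")


def handle_outage(
    service: str,
    diagnose: Callable[[str], float],
    recover: Callable[[str], None],
    send_voice_alert: Callable[[str], None],
    threshold: float = 0.70,
) -> str:
    """Apply the degradation rules and return a short status string."""
    try:
        confidence = diagnose(service)
    except Exception:
        # Diagnosis backend unavailable: wait rather than act blindly.
        log.warning("diagnosis unavailable for %s, waiting", service)
        return "waiting"
    if confidence <= threshold:
        return "waiting"
    recover(service)
    try:
        send_voice_alert(service)
    except Exception:
        # Alerting is optional: recovery already happened, so just log it.
        log.warning("voice alert failed for %s", service)
        return "recovered_without_alert"
    return "recovered"
```

The key design choice is that only the diagnosis failure blocks recovery; a failure in the optional alerting step is logged and swallowed so the loop keeps running.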
What's next for AkashGuard
Provider scoring - track which providers give the best uptime and prefer them during recovery instead of just taking the cheapest bid.
Multi-strategy recovery - not just redeploy but also scaling and provider migration, based on what type of failure the AI detects.
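Since provider scoring is still a plan, the following is only one possible shape for it: filter bids by a tracked uptime ratio, then take the cheapest of what remains. The `Bid` type, the `uptime` map and the 0.95 cutoff are all assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Bid:
    provider: str
    price: float


def pick_bid(bids: List[Bid], uptime: Dict[str, float], min_uptime: float = 0.95) -> Bid:
    """Prefer the cheapest bid from a reliable provider; bids must be non-empty."""
    # Providers with no history are treated as unproven (uptime 0.0).
    reliable = [b for b in bids if uptime.get(b.provider, 0.0) >= min_uptime]
    # Fall back to all bids so recovery still proceeds when none qualify.
    candidates = reliable or bids
    return min(candidates, key=lambda b: b.price)
```

Falling back to the cheapest bid when no provider meets the cutoff keeps recovery from stalling on a cold start with no uptime history.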
AkashGuard as a service where any Akash deployer can register their services and get self-healing without running their own agent.