Inspiration

Cloud monitoring is typically reactive: you fix problems after they have already affected users. We wanted to build a preventative platform, an AI that could predict cloud outages and stop them before they happen.

What it does

"ZeroDown AI" is a 3-agent system that autonomously analyzes cloud logs to predict imminent failures. When a user requests a health check, a "Coordinator" agent (built with Google's ADK) retrieves the latest service logs through the Log Analysis Agent. It then sends these logs to a gemma2:9b model running on a separate, GPU-accelerated Cloud Run service. This "Prediction Agent" looks for the "needle in the haystack" error pattern and alerts the user to a critical failure before it occurs.
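The hand-off from the Coordinator to the Prediction Agent can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: it assumes the gemma2:9b model is served behind an Ollama-compatible `/api/generate` endpoint (the `gemma2:9b` tag follows Ollama's naming, but the write-up does not say which server is used), and the function names, prompt wording, and base URL are hypothetical.

```python
import json
import urllib.request
from typing import List, Optional

# Assumption: the Prediction Agent speaks the Ollama HTTP API.
MODEL_TAG = "gemma2:9b"


def build_prediction_request(
    base_url: str,
    log_lines: List[str],
    id_token: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking the model to assess the logs."""
    # Hypothetical prompt; the real prompt is not given in the write-up.
    prompt = (
        "You are a cloud reliability analyst. Review the service logs below "
        "and state whether a failure looks imminent, citing the relevant lines."
        "\n\n" + "\n".join(log_lines)
    )
    payload = {"model": MODEL_TAG, "prompt": prompt, "stream": False}
    headers = {"Content-Type": "application/json"}
    if id_token:
        # A private Cloud Run service expects an identity token as a bearer token.
        headers["Authorization"] = f"Bearer {id_token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def predict(
    base_url: str,
    log_lines: List[str],
    id_token: Optional[str] = None,
    timeout: float = 120.0,
) -> str:
    """Send the request and return the model's text answer."""
    req = build_prediction_request(base_url, log_lines, id_token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # With "stream": False, Ollama returns one JSON object whose
        # "response" field holds the generated text.
        return json.loads(resp.read())["response"]
```

Separating request construction from sending keeps the prompt and payload easy to inspect without a running GPU service.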

How we built it

We built this as four microservices (a frontend plus the three agents), all running on Google Cloud Run:

Frontend: A simple Streamlit UI.

Coordinator Agent: The "brain" of the operation, built with Google ADK.

Log Analysis Agent: A specialist ADK agent that fetches log data from GCS.

Prediction Agent: The gemma2:9b model served on an NVIDIA L4 GPU.

Crucially, all backend services are private. They call each other using Google's standard service-to-service authentication, with the Cloud Run Invoker IAM role (roles/run.invoker) granted to each calling service's identity. We also connected our ADK agents to a Cloud SQL database for persistent, stateful sessions, so session state survives restarts and is shared across instances.
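For readers unfamiliar with this pattern, here is a minimal sketch of how one Cloud Run service can obtain an identity token for another using only the standard library. It relies on the metadata server endpoint that Google documents for Cloud Run; the helper names and the example service URL are hypothetical, and the project itself may well use the `google-auth` library instead.

```python
import urllib.parse
import urllib.request
from typing import Optional

# Documented metadata server endpoint for identity tokens; it is only
# reachable from inside Google Cloud runtimes such as Cloud Run.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def build_identity_token_request(audience: str) -> urllib.request.Request:
    """Build the metadata-server request for a token scoped to `audience`.

    `audience` is the URL of the receiving (private) Cloud Run service.
    """
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )


def fetch_identity_token(audience: str, timeout: float = 5.0) -> str:
    """Fetch the token; this only works when running on Google Cloud."""
    req = build_identity_token_request(audience)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def authorized_request(
    url: str, token: str, data: Optional[bytes] = None
) -> urllib.request.Request:
    """Attach the token as a bearer credential for the receiving service."""
    return urllib.request.Request(
        url, data=data, headers={"Authorization": f"Bearer {token}"}
    )
```

The receiving service accepts the call only if the caller's service account holds roles/run.invoker on it, which is what keeps the backends closed to unauthenticated traffic.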

Challenges we ran into

Scoping: Our initial "7-agent, multi-cloud" vision was a full startup, not a hackathon project. We scoped it down to a 3-agent proof of concept that demonstrated the core idea.

Security: Implementing secure service-to-service authentication was the biggest challenge. It was complex, but it was the right way to build, and it keeps every backend service closed to unauthenticated callers.

Accomplishments that we're proud of

Blending Categories: We combined the "AI Agents" category (using ADK) and the "GPU" category (using an L4 GPU) into a single, cohesive project.

Production-Ready Architecture: We didn't take shortcuts. We built a secure (using IAM) and stateful (using Cloud SQL) system that is designed to scale and follows best practices.

What we learned

ADK is for Orchestration: Google ADK is well suited to building a "team" of specialist agents that collaborate on a complex workflow.

Serverless GPUs are a Game-Changer: Using an L4 GPU on Cloud Run for AI reasoning is powerful. Because the service scales to zero, we get that power without paying for idle GPU time.

"Hack" the Data, Not the Architecture: We learned it's better to simplify your data (using a simulated log file) than to simplify your architecture. Building the secure, multi-service app made the project a success.
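To illustrate what "simplifying the data" can look like, the sketch below generates a simulated log with a single warning buried among routine entries. Everything here is hypothetical: the message texts, timestamps, and the `simulate_log` helper are illustrative assumptions, not the log file the project actually used.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import List


def simulate_log(n_lines: int = 200, needle_at: int = 150, seed: int = 0) -> List[str]:
    """Return n_lines log lines with one warning hidden at index needle_at."""
    rng = random.Random(seed)  # seeded so the simulated file is reproducible
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    # Hypothetical routine traffic that makes up the "haystack".
    routine = [
        "GET /healthz 200",
        "GET /api/items 200",
        "POST /api/orders 201",
        "cache hit ratio 0.97",
    ]
    lines = []
    for i in range(n_lines):
        ts = (start + timedelta(seconds=i)).isoformat()
        if i == needle_at:
            # Hypothetical early-warning signal: the "needle".
            lines.append(f"{ts} WARN db-pool connections 98/100, wait queue growing")
        else:
            lines.append(f"{ts} INFO {rng.choice(routine)}")
    return lines
```

A file produced this way can be uploaded to a storage bucket and used to exercise the full multi-service pipeline without a live log source.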

What's next for ZeroDown AI

Go Real-Time: Connect our system to live log streams from Google Cloud Operations.

Get Smarter: Fine-tune our Gemma model on more failure data to make it an expert "Digital Technician."

Add an "Action" Agent: Create a new agent that can automatically fix the problem (like rerouting traffic) when a failure is predicted.
