Inspiration
Modern software depends on many connected services, but when something breaks, it can be hard for teams to understand what failed, why it failed, and how to recover. Since the hackathon theme was community-driven, we wanted to represent infrastructure as a community of services: each node depends on the others, reports its own problems, and works together with an AI commander to recover.
What it does
Cloud Pilot is an AI-powered infrastructure simulation dashboard. It visualizes a cloud system as a live graph with users, API servers, caches, databases, and load balancers. Users can trigger incidents like traffic surges, cache bypasses, and database outages.
As the system changes, nodes update in real time, report their status, and send messages to the AI Commander. The AI Commander analyzes the situation, explains the root cause, and can apply recovery actions such as scaling API servers, enabling cache routing, failing over the database, or adding a load balancer.
How we built it
We built the frontend with Next.js, React, and a graph-based interface to show infrastructure as connected nodes and edges. The dashboard includes live metrics, scenario controls, service status, event timelines, and AI commander output.
We built the backend with FastAPI and WebSockets. It owns the simulation state and pushes real-time state_update messages to the frontend. We created a deterministic simulation engine that calculates traffic, latency, error rate, reliability, node health, and system pressure from the current infrastructure state.
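As a rough illustration of what "deterministic" means here, the sketch below derives a few metrics from a state snapshot using a pure function. The class name, field names, constants, and formulas are all assumptions made for this example; they are not the engine's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraState:
    """Hypothetical snapshot of the simulated infrastructure."""
    incoming_rps: float   # requests per second arriving from users
    api_servers: int      # number of API server nodes
    cache_enabled: bool   # whether reads are routed through the cache
    database_up: bool     # whether the primary database is reachable


# Illustrative constants, not values taken from the project.
RPS_PER_API_SERVER = 100.0
CACHE_HIT_RATIO = 0.8
BASE_LATENCY_MS = 50.0


def simulate(state: InfraState) -> dict:
    """Pure function: the same state always yields the same metrics."""
    if state.api_servers < 1:
        raise ValueError("at least one API server is required")

    capacity = state.api_servers * RPS_PER_API_SERVER
    pressure = state.incoming_rps / capacity

    # Fraction of requests dropped because traffic exceeds capacity.
    overload = 1.0 - 1.0 / pressure if pressure > 1.0 else 0.0

    # Share of requests that must reach the database.
    db_share = (1.0 - CACHE_HIT_RATIO) if state.cache_enabled else 1.0
    db_errors = 0.0 if state.database_up else db_share

    # A request succeeds only if it survives both failure sources.
    error_rate = 1.0 - (1.0 - overload) * (1.0 - db_errors)

    return {
        "pressure": pressure,
        "latency_ms": BASE_LATENCY_MS * max(1.0, pressure),
        "error_rate": error_rate,
        "reliability": 1.0 - error_rate,
    }
```

Because the function has no randomness or hidden state, replaying the same scenario produces the same numbers, which keeps a live demo predictable.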
For the AI layer, we used OpenRouter to call an LLM. The AI does not directly mutate the system. Instead, it returns a structured recovery plan, and the backend validates the actions before applying them safely.
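The validation step might look something like the following minimal sketch. The JSON shape, the action names, and the scaling limit are hypothetical stand-ins chosen for this example; the real plan schema may differ.

```python
import json

# Hypothetical action names mirroring the recovery actions described above.
ALLOWED_ACTIONS = {
    "scale_api",
    "enable_cache",
    "failover_database",
    "add_load_balancer",
}
MAX_API_SERVERS = 10  # illustrative upper bound


def validate_plan(raw: str) -> list[dict]:
    """Return only the actions that pass validation.

    Malformed or unexpected model output yields an empty list
    instead of raising, so a bad response never changes the system.
    """
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(plan, dict):
        return []
    actions = plan.get("actions")
    if not isinstance(actions, list):
        return []

    safe = []
    for action in actions:
        if not isinstance(action, dict):
            continue
        kind = action.get("type")
        if not isinstance(kind, str) or kind not in ALLOWED_ACTIONS:
            continue  # drop anything outside the allow-list
        if kind == "scale_api":
            count = action.get("count")
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(count, bool) or not isinstance(count, int):
                continue
            if not 1 <= count <= MAX_API_SERVERS:
                continue
            safe.append({"type": kind, "count": count})
        else:
            # Copy only the known field; extra keys from the model are ignored.
            safe.append({"type": kind})
    return safe
```

Rebuilding each action from known fields, rather than passing the model's dictionaries through, means unexpected keys cannot reach the code that mutates the simulation.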
Challenges we ran into
One major challenge was separating frontend visuals from backend simulation logic. At first, the frontend handled too much of the simulation, which made the app feel hardcoded. We refactored so the backend became the source of truth and the frontend only rendered live state.
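One way to picture the refactored split is a message that carries a complete snapshot, so the frontend has nothing left to compute. In this sketch only the state_update type name comes from the project; the function name and the other fields are assumptions for illustration.

```python
import json


def build_state_update(seq: int, nodes: list[dict], metrics: dict) -> str:
    """Serialize a full snapshot so the client can render it without simulating."""
    message = {
        "type": "state_update",  # message type named in this write-up
        "seq": seq,              # hypothetical counter so clients can discard stale frames
        "nodes": nodes,          # hypothetical per-node status list
        "metrics": metrics,      # hypothetical system-wide metrics
    }
    return json.dumps(message, sort_keys=True)
```

Sending whole snapshots instead of deltas costs a little bandwidth, but a client that reconnects mid-incident can redraw the graph from the next message alone.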
Another challenge was deploying the backend. Since our app uses WebSockets, we had to move away from a serverless-style deployment and use a backend host that supports long-lived WebSocket connections.
We also had to carefully design the AI system so it felt powerful without becoming unreliable. Instead of letting the model invent arbitrary infrastructure, we restricted it to safe actions like scaling API servers, enabling cache routing, failing over the database, and adding a load balancer.
Accomplishments that we’re proud of
We are proud that Cloud Pilot feels like a living system rather than a static dashboard. The graph updates in real time, incidents propagate through connected services, and the recovery actions visibly change the infrastructure.
We are also proud of the controlled AI architecture. The AI can explain what is happening and propose recovery actions, but the backend still validates everything before changing the system. This made the demo more stable while still feeling intelligent.
Finally, we are proud of creating a project that connects deeply to the theme. Cloud Pilot shows that infrastructure is not just isolated servers; it is a community of services that must coordinate under pressure.
What we learned
We learned how to structure a real-time app using WebSockets, how to separate simulation state from frontend rendering, and how to design safer AI systems by validating model outputs.
We also learned that hackathon projects need more than technical complexity. They need a clear story, strong visuals, and a demo flow that judges can understand quickly.
What’s next for Cloud Pilot
Next, we want to connect Cloud Pilot to real infrastructure tools like Kubernetes, Docker, AWS, or Datadog-style logs. Instead of only simulating incidents, Cloud Pilot could monitor real services, detect problems, and suggest safe recovery actions.
We also want to add collaborative incident response, where multiple team members can view the same system, discuss AI recommendations, and approve recovery plans together. Ultimately, Cloud Pilot could become a training and operations tool for teams learning how modern infrastructure fails and recovers.
Built With
- fastapi
- javascript
- nextjs
- python
- react
- tailwind
- websockets