Inspiration
The idea came from a real pain point every DevOps and SRE team faces: That is when a Kubernetes pod crashes at 2am or a GPU node runs out of memory mid-training, engineers waste precious time digging through logs, Stack Overflow, and runbooks. We wanted to build an AI copilot that acts like a senior SRE sitting next to you, instantly diagnosing the problem and telling you exactly what to run.
What it does
NimbusOps is a CloudOps incident response and troubleshooting copilot. You paste in logs, error messages, or describe an infrastructure issue, and the AI responds with a structured diagnosis:
- Issue Summary — what went wrong
- Root Cause Analysis — why it happened technically
- Fix Steps — numbered step-by-step resolution
- CLI Commands — exact kubectl, docker, and linux commands to run
- Prevention — best practices to stop it from recurring
It handles Kubernetes CrashLoopBackOff, OOMKilled, Docker restart loops, CUDA out of memory errors, CI/CD failures, Terraform issues, and can generate full RCA incident reports.
How we built it
- Frontend: Next.js 16, Tailwind CSS, shadcn/ui, Vercel AI SDK for streaming chat with a dark neon cyan DevOps dashboard theme
- Backend: FastAPI (Python) with a Hermes-3 agentic loop that automatically calls tools — log pattern analyzer, CLI command generator, and documentation search — before generating a response
- AI Model: NousResearch/Hermes-3-Llama-3.1-70B, a NemoClaw-style function-calling agent running on NVIDIA Nemotron via Crusoe Cloud Managed Inference
- Infrastructure: PostgreSQL (via Docker) for chat history, Redis for resumable streams, NextAuth for authentication
- Model routing: Openai-compatible connects the frontend directly to Crusoe Cloud, with Vercel AI Gateway as fallback for other models
Challenges we ran into
We had challenges when Hermes-3 tool calling required a multi-round agentic loop: run tools first, inject results into context, then stream the final structured response. Further, getting the AI to consistently respond in the structured SRE format (Issue Summary → Root Cause → Fix Steps → Commands → Prevention) required careful system prompt engineering.
Accomplishments that we're proud of
We were able to build a fully working end-to-end AI agent that runs a real open-source model (Hermes-3) on dedicated GPU infrastructure (Crusoe Cloud). Further, we were able to set up the structured SRE response format that makes the AI output immediately actionable — no vague answers, always exact commands
What we learned
We learnt how to use Crusoe Cloud's managed inference for development. We also learnt to use NVIDIA Nemotron-based models like Hermes-3 for for structured, technical reasoning tasks.
What's next for NimbusOps
- Live log ingestion — connect directly to Kubernetes clusters, Datadog, or CloudWatch to pull logs in real time instead of copy-pasting
- Auto-remediation — with user approval, execute the suggested kubectl/docker commands directly from the UI
- Multi-cloud support — add AWS, GCP, and Azure-specific troubleshooting knowledge
Built With
- crusoe
- fastapi
- hermes
- nemoclaw
- next.js
- nousresearch
- nvidia-nemotron
- openai
- postgresql
- tailwind
- vercel
Log in or sign up for Devpost to join the conversation.