NimbusOps

Inspiration

The idea came from a real pain point every DevOps and SRE team faces: That is when a Kubernetes pod crashes at 2am or a GPU node runs out of memory mid-training, engineers waste precious time digging through logs, Stack Overflow, and runbooks. We wanted to build an AI copilot that acts like a senior SRE sitting next to you, instantly diagnosing the problem and telling you exactly what to run.

What it does

NimbusOps is a CloudOps incident response and troubleshooting copilot. You paste in logs, error messages, or describe an infrastructure issue, and the AI responds with a structured diagnosis:

Issue Summary — what went wrong
Root Cause Analysis — why it happened technically
Fix Steps — numbered step-by-step resolution
CLI Commands — exact kubectl, docker, and linux commands to run
Prevention — best practices to stop it from recurring

It handles Kubernetes CrashLoopBackOff, OOMKilled, Docker restart loops, CUDA out of memory errors, CI/CD failures, Terraform issues, and can generate full RCA incident reports.

How we built it

Frontend: Next.js 16, Tailwind CSS, shadcn/ui, Vercel AI SDK for streaming chat with a dark neon cyan DevOps dashboard theme
Backend: FastAPI (Python) with a Hermes-3 agentic loop that automatically calls tools — log pattern analyzer, CLI command generator, and documentation search — before generating a response
AI Model: NousResearch/Hermes-3-Llama-3.1-70B, a NemoClaw-style function-calling agent running on NVIDIA Nemotron via Crusoe Cloud Managed Inference
Infrastructure: PostgreSQL (via Docker) for chat history, Redis for resumable streams, NextAuth for authentication
Model routing: Openai-compatible connects the frontend directly to Crusoe Cloud, with Vercel AI Gateway as fallback for other models

Challenges we ran into

We had challenges when Hermes-3 tool calling required a multi-round agentic loop: run tools first, inject results into context, then stream the final structured response. Further, getting the AI to consistently respond in the structured SRE format (Issue Summary → Root Cause → Fix Steps → Commands → Prevention) required careful system prompt engineering.

Accomplishments that we're proud of

We were able to build a fully working end-to-end AI agent that runs a real open-source model (Hermes-3) on dedicated GPU infrastructure (Crusoe Cloud). Further, we were able to set up the structured SRE response format that makes the AI output immediately actionable — no vague answers, always exact commands

What we learned

We learnt how to use Crusoe Cloud's managed inference for development. We also learnt to use NVIDIA Nemotron-based models like Hermes-3 for for structured, technical reasoning tasks.

What's next for NimbusOps

Live log ingestion — connect directly to Kubernetes clusters, Datadog, or CloudWatch to pull logs in real time instead of copy-pasting
Auto-remediation — with user approval, execute the suggested kubectl/docker commands directly from the UI
Multi-cloud support — add AWS, GCP, and Azure-specific troubleshooting knowledge

Built With

crusoe
fastapi
hermes
nemoclaw
next.js
nousresearch
nvidia-nemotron
openai
postgresql
tailwind
vercel

Updates

Private user started this project — May 25, 2026 06:30 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.