Inspiration

The idea came from a real pain point every DevOps and SRE team faces: That is when a Kubernetes pod crashes at 2am or a GPU node runs out of memory mid-training, engineers waste precious time digging through logs, Stack Overflow, and runbooks. We wanted to build an AI copilot that acts like a senior SRE sitting next to you, instantly diagnosing the problem and telling you exactly what to run.

What it does

NimbusOps is a CloudOps incident response and troubleshooting copilot. You paste in logs, error messages, or describe an infrastructure issue, and the AI responds with a structured diagnosis:

  • Issue Summary — what went wrong
  • Root Cause Analysis — why it happened technically
  • Fix Steps — numbered step-by-step resolution
  • CLI Commands — exact kubectl, docker, and linux commands to run
  • Prevention — best practices to stop it from recurring

It handles Kubernetes CrashLoopBackOff, OOMKilled, Docker restart loops, CUDA out of memory errors, CI/CD failures, Terraform issues, and can generate full RCA incident reports.

How we built it

  • Frontend: Next.js 16, Tailwind CSS, shadcn/ui, Vercel AI SDK for streaming chat with a dark neon cyan DevOps dashboard theme
  • Backend: FastAPI (Python) with a Hermes-3 agentic loop that automatically calls tools — log pattern analyzer, CLI command generator, and documentation search — before generating a response
  • AI Model: NousResearch/Hermes-3-Llama-3.1-70B, a NemoClaw-style function-calling agent running on NVIDIA Nemotron via Crusoe Cloud Managed Inference
  • Infrastructure: PostgreSQL (via Docker) for chat history, Redis for resumable streams, NextAuth for authentication
  • Model routing: Openai-compatible connects the frontend directly to Crusoe Cloud, with Vercel AI Gateway as fallback for other models

Challenges we ran into

We had challenges when Hermes-3 tool calling required a multi-round agentic loop: run tools first, inject results into context, then stream the final structured response. Further, getting the AI to consistently respond in the structured SRE format (Issue Summary → Root Cause → Fix Steps → Commands → Prevention) required careful system prompt engineering.

Accomplishments that we're proud of

We were able to build a fully working end-to-end AI agent that runs a real open-source model (Hermes-3) on dedicated GPU infrastructure (Crusoe Cloud). Further, we were able to set up the structured SRE response format that makes the AI output immediately actionable — no vague answers, always exact commands

What we learned

We learnt how to use Crusoe Cloud's managed inference for development. We also learnt to use NVIDIA Nemotron-based models like Hermes-3 for for structured, technical reasoning tasks.

What's next for NimbusOps

  • Live log ingestion — connect directly to Kubernetes clusters, Datadog, or CloudWatch to pull logs in real time instead of copy-pasting
  • Auto-remediation — with user approval, execute the suggested kubectl/docker commands directly from the UI
  • Multi-cloud support — add AWS, GCP, and Azure-specific troubleshooting knowledge

Built With

  • crusoe
  • fastapi
  • hermes
  • nemoclaw
  • next.js
  • nousresearch
  • nvidia-nemotron
  • openai
  • postgresql
  • tailwind
  • vercel
Share this project:

Updates