Inspiration
AI infrastructure is a passion of mine. As someone who has worked in both full-stack development and machine learning, I've seen firsthand how much effort goes into keeping models running in production. I also noticed a large gap: few tools go beyond reporting incidents to self-heal while considering both security and safety.
What it does
NeuralOps autonomously triages AI cluster incidents. Gemini diagnoses, Claude fixes, and AdaL decides whether to resolve, escalate, or retry. Repeated incidents are resolved from cache, whereas escalations go to human review via VoiceOS.
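The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `triage`, `Outcome`, the `max_retries` limit, and the choice to escalate once retries run out are all assumptions, and the three callables stand in for Gemini, Claude, and AdaL.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Decision(Enum):
    RESOLVE = "resolve"
    ESCALATE = "escalate"
    RETRY = "retry"


@dataclass
class Outcome:
    decision: Decision
    fix: str
    from_cache: bool = False


def triage(
    signature: str,
    diagnose: Callable[[str], str],              # stand-in for the Gemini step
    propose_fix: Callable[[str], str],           # stand-in for the Claude step
    decide: Callable[[str, str], Decision],      # stand-in for the AdaL step
    cache: Dict[str, str],
    max_retries: int = 2,                        # hypothetical limit
) -> Outcome:
    # Repeated incidents are resolved straight from the cache.
    if signature in cache:
        return Outcome(Decision.RESOLVE, cache[signature], from_cache=True)

    fix = ""
    for _ in range(max_retries + 1):
        diagnosis = diagnose(signature)
        fix = propose_fix(diagnosis)
        decision = decide(diagnosis, fix)
        if decision is Decision.RESOLVE:
            cache[signature] = fix  # remember the fix for next time
            return Outcome(decision, fix)
        if decision is Decision.ESCALATE:
            return Outcome(decision, fix)
        # Decision.RETRY: loop again with a fresh diagnosis.

    # Assumption: running out of retries hands the incident to a human.
    return Outcome(Decision.ESCALATE, fix)
```

In this sketch an escalated `Outcome` is what would be handed off for human review.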
How we built it
A FastAPI backend connects our agents to the frontend. Gemini CLI and Claude Code run as MCP subprocesses, while AdaL is the main orchestrator and decision maker for agent reports. SQLite is our cache layer and local database for incidents, React is our frontend (with live WebSocket streaming), and Cisco's AI factory dataset provides real production scenarios.
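To make the SQLite cache layer concrete, here is a small sketch using Python's standard `sqlite3` module. The table name `resolved_incidents`, its columns, and the idea of keying on an incident "signature" are assumptions for illustration and may not match the real schema.

```python
import sqlite3
from typing import Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the incident cache. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resolved_incidents ("
        "signature TEXT PRIMARY KEY, fix TEXT NOT NULL)"
    )
    return conn


def remember_fix(conn: sqlite3.Connection, signature: str, fix: str) -> None:
    """Store the fix for an incident, overwriting any earlier one."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO resolved_incidents (signature, fix) "
            "VALUES (?, ?)",
            (signature, fix),
        )


def lookup_fix(conn: sqlite3.Connection, signature: str) -> Optional[str]:
    """Return the cached fix for a repeated incident, or None if unseen."""
    row = conn.execute(
        "SELECT fix FROM resolved_incidents WHERE signature = ?",
        (signature,),
    ).fetchone()
    return row[0] if row else None
```

A cache hit from `lookup_fix` is what would let a repeated incident skip the diagnosis and fix agents entirely.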
Challenges we ran into
I wanted to try something new, so I chose AdaL (knowing it was a CLI) as the orchestrator, to push the boundaries of agents that both have root access and control other agents. One issue was that AdaL would often ignore my MCP tools and call its own. I solved this by pre-computing the Gemini and Claude outputs separately and feeding AdaL only their condensed reports, so that it acts as a decision maker, not a researcher.
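The workaround can be sketched like this. It only shows the shape of the idea: the prompt wording, the 400-character limit, and the helper names `condense`, `build_decision_prompt`, and `parse_decision` are assumptions, and no real AdaL, Gemini, or Claude interface is called here.

```python
ALLOWED_DECISIONS = ("resolve", "escalate", "retry")


def condense(report: str, max_chars: int = 400) -> str:
    """Collapse whitespace and truncate a pre-computed agent report."""
    text = " ".join(report.split())
    if len(text) <= max_chars:
        return text
    return text[: max_chars - 3] + "..."


def build_decision_prompt(diagnosis: str, fix: str) -> str:
    """Give the orchestrator only condensed reports, so it has nothing to research."""
    return (
        "You are the decision maker. Do not call tools or investigate further.\n"
        f"Diagnosis report: {condense(diagnosis)}\n"
        f"Proposed fix: {condense(fix)}\n"
        "Answer with exactly one word: resolve, escalate, or retry."
    )


def parse_decision(reply: str) -> str:
    """Map the orchestrator's reply to a decision; anything unclear escalates."""
    word = reply.strip().lower().rstrip(".")
    return word if word in ALLOWED_DECISIONS else "escalate"
```

Defaulting an unparseable reply to `escalate` is a design choice made for this sketch: when the orchestrator's answer is ambiguous, a human looks at it.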
Accomplishments that we're proud of
I am proud that I was able to use a CLI tool as an orchestration layer over more capable specialist models, giving AdaL authority without giving it the research burden.
What we learned
I learned that orchestrating agents is harder than just calling APIs, since each agent needs a specific job. I also learned that a system that knows when NOT to act autonomously is more valuable than an agent that brute-forces every problem.
What's next for NeuralOps
I would like to expand my list of subagents, broaden dataset coverage, and add a feedback loop in which real operator decisions improve AdaL's confidence thresholds over time.
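One possible shape for that feedback loop, purely as a sketch of future work: the update rule, the step size, and the bounds below are all hypothetical, not something NeuralOps does today.

```python
def update_threshold(
    threshold: float,
    auto_resolved: bool,
    operator_approved: bool,
    step: float = 0.02,    # hypothetical adjustment per operator decision
    lower: float = 0.5,    # hypothetical bounds on the threshold
    upper: float = 0.99,
) -> float:
    """Nudge the confidence needed for autonomous action after a human review.

    - Escalated, but the operator approved the proposed fix as-is:
      the agent was too cautious, so lower the threshold.
    - Auto-resolved, but the operator rejected the fix afterwards:
      the agent was too eager, so raise the threshold.
    - Otherwise the agent and the operator agreed; leave it unchanged.
    """
    if not auto_resolved and operator_approved:
        threshold -= step
    elif auto_resolved and not operator_approved:
        threshold += step
    return min(upper, max(lower, threshold))
```

Each reviewed incident would call `update_threshold` once, so the threshold drifts slowly toward the level of autonomy operators are actually comfortable with.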
Built With
- adal
- ai-factory-dataset
- cisco
- claude-code
- cli
- duckdb
- fastapi
- gemini-cli
- javascript
- mcp
- ollama/qwen2.5:7b
- ollama/qwen2.adal-cli
- python
- react
- sqlite
- voiceos
- websocket