Inspiration

Every second of downtime can cost companies thousands of dollars, yet most incident response systems remain reactive and manual. Inspired by real-world DevOps and MLOps pain points, our team envisioned a system that does not just alert but thinks, reasons, and acts autonomously. OpsAgent was born from the idea of combining agentic reasoning (LLMs) with real-time observability, using AWS and NVIDIA's GPU-accelerated inference to build a next-generation incident response system.

What it does

OpsAgent is an autonomous agent that continuously monitors infrastructure logs, detects anomalies, and executes corrective actions, all without human intervention.

OpsAgent = Detect + Diagnose + Respond + Learn

Core functions:

  • Ingests system metrics, logs, and traces from AWS CloudWatch, Prometheus, or local observability tools.
  • Uses the Llama-3.1-Nemotron-Nano-8B-v1 NIM for multi-step reasoning on incident context.
  • Retrieves similar historical incidents using a Retrieval Embedding NIM for contextual grounding.
  • Generates and executes corrective scripts (e.g., restarting services, reallocating GPU resources).
  • Continuously learns from past responses to improve decision accuracy.
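To make the Detect + Diagnose + Respond + Learn loop concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's actual code: the z-score detector, the `AgentLoop` and `Incident` names, and the `diagnose`/`respond` callables (which stand in for the reasoning NIM and the action executor) are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Incident:
    metric: str
    value: float
    diagnosis: str = ""
    action: str = ""


@dataclass
class AgentLoop:
    # Hypothetical stand-ins for the reasoning model and the action executor.
    diagnose: Callable[[Incident], str]
    respond: Callable[[Incident], str]
    z_threshold: float = 3.0  # assumed threshold, not taken from the project
    history: List[Incident] = field(default_factory=list)

    def detect(self, metric: str, baseline: List[float], value: float) -> Optional[Incident]:
        """Detect: flag a value that deviates strongly from its baseline."""
        sigma = pstdev(baseline)
        if sigma == 0:
            return None  # a flat baseline gives no scale to judge against
        z = abs(value - mean(baseline)) / sigma
        return Incident(metric, value) if z > self.z_threshold else None

    def step(self, metric: str, baseline: List[float], value: float) -> Optional[Incident]:
        incident = self.detect(metric, baseline, value)
        if incident is None:
            return None
        incident.diagnosis = self.diagnose(incident)  # Diagnose
        incident.action = self.respond(incident)      # Respond
        self.history.append(incident)                 # Learn: keep for later retrieval
        return incident
```

A caller would feed each new metric sample to `step` together with a recent baseline window; only samples that cross the threshold reach the diagnose and respond stages.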

How we built it

OpsAgent is built as a modular cloud-native microservice system, leveraging AWS, NVIDIA, and open-source frameworks.

Architecture overview:

Data Layer → Vector Index (Retrieval NIM) → Reasoning Core (Llama-3 NIM) → Action Executor (Ops Scripts)

  • Frontend: a lightweight ReactJS dashboard showing live incident summaries and agent actions.
  • Backend: built with Python (FastAPI) to orchestrate communication between models and infrastructure.
  • Reasoning Core: powered by the Llama-3.1-Nemotron-Nano-8B-v1 NIM, hosted on an NVIDIA NIM inference server for efficient multi-step reasoning.
  • Retrieval Layer: uses Retrieval Embedding NIM for vectorizing logs, traces, and historical incident data to provide contextual support.
  • Storage: PostgreSQL for structured metadata and a FAISS vector index for fast similarity search.
  • Cloud Infrastructure: deployed on AWS ECS, using S3 for log storage and CloudWatch for monitoring.
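The retrieval step can be pictured as a nearest-neighbour lookup over incident embeddings. The sketch below uses a plain cosine-similarity scan in standard-library Python as a stand-in for the FAISS index, and it assumes the embedding vectors have already been produced by the Retrieval Embedding NIM. The `cosine` and `top_k` helpers and the incident IDs are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the IDs of the k stored incidents most similar to the query."""
    scored = sorted(
        ((cosine(query, vec), incident_id) for incident_id, vec in index.items()),
        reverse=True,
    )
    return [incident_id for _, incident_id in scored[:k]]
```

A linear scan like this is only reasonable for small incident histories; an approximate index such as FAISS serves the same role at scale.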

Challenges we ran into

  • NIM orchestration: getting the Llama-3 reasoning and retrieval NIMs to communicate efficiently under latency constraints.
  • Action safety: ensuring the AI's generated corrective actions are validated before execution.
  • Token optimization: balancing reasoning quality against inference cost under hackathon runtime limits.
  • Vector drift: managing updates to the retrieval index as new logs and incidents are continuously added.
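One common way to approach the action-safety problem is to check each generated command against an allow-list before it ever reaches a shell. The sketch below shows that idea only; the write-up does not say how OpsAgent validates actions, and `ALLOWED_ACTIONS`, `FORBIDDEN_CHARS`, and `validate_action` are hypothetical names chosen for illustration.

```python
import shlex

# Assumed allow-list of (program, subcommand) pairs the agent may run.
ALLOWED_ACTIONS = {
    ("systemctl", "restart"),
    ("kubectl", "rollout"),
}

# Characters that enable chaining, substitution, or redirection in a shell.
FORBIDDEN_CHARS = set(";|&$`><\n")


def validate_action(command: str) -> bool:
    """Return True only if the command matches an allow-listed action."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes or similar malformed input
    # Require a program, a subcommand, and at least one target argument.
    return len(tokens) >= 3 and (tokens[0], tokens[1]) in ALLOWED_ACTIONS
```

The check is deliberately strict: anything the allow-list does not name is rejected, so a hallucinated or malicious command fails closed rather than open.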

Accomplishments that we're proud of

  • Built a fully autonomous prototype that responds to simulated AWS infrastructure incidents in real time.
  • Integrated Llama-3 NIM reasoning with Retrieval Embedding NIM contextual grounding.
  • Built an explainable AI console where every action includes a rationale trace.
  • Achieved sub-3s response latency for common incident classes.

What we learned

  • How to architect agentic AI systems combining reasoning, memory, and action.
  • How to deploy and optimize NVIDIA NIM models for multi-agent collaboration.
  • The critical role of retrieval grounding in preventing LLM hallucinations during reasoning.
  • Practical aspects of observability automation and cloud resource orchestration.

What's next for OpsAgent

  • Integrating reinforcement learning from human feedback (RLHF) to refine action policies.
  • Adding autonomous remediation playbooks and security incident response capabilities.
  • Building an OpsAgent SaaS platform for SMBs to access affordable intelligent incident management.
