Inspiration
On-call engineers spend more time finding information than solving problems.
Alerts force engineers to jump across dashboards, logs, metrics, and runbooks under pressure. This is painful for everyone—and especially challenging for new engineers who lack system context.
I wanted to build an agent that does the investigation, not just summarizes it.
What it does
Agentic On-Call Engineer acts as an autonomous on-call responder.
Given an alert, the agent investigates telemetry, correlates signals, forms hypotheses, and produces a concise incident brief with recommended next steps.
How I built it
I built this using the Gemini API as an autonomous on-call investigation engine. The system relies on Gemini function calling to let the model actively query alerts, metrics, logs, deploy history, feature flags, dependencies, and runbooks instead of just suggesting next steps. I use gemini-flash-lite-latest for fast, iterative tool-driven investigation and gemini-3-pro-preview for final deep reasoning and synthesis. The final output is generated using structured JSON responses with a defined schema, ensuring reliable confidence scoring, evidence tracking, and actionable next steps.
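The tool-driven loop described above can be sketched roughly as follows. This is a minimal illustration, not the real implementation: the tool names (`getRecentDeploys`, `getErrorLogs`), the `IncidentBrief` shape, and the `model` callback are all hypothetical stand-ins for the Gemini function-calling API (`@google/genai` with `functionDeclarations` and a JSON response schema in the real system).

```typescript
type ToolCall = { name: string; args: Record<string, string> };
type ModelTurn = { toolCall?: ToolCall; brief?: IncidentBrief };

// Shape of the final structured JSON output (illustrative fields).
interface IncidentBrief {
  hypothesis: string;
  confidence: number; // 0..1, enforced by the response schema
  evidence: string[];
  nextSteps: string[];
}

// Hypothetical telemetry tools; real ones would query live systems.
const tools: Record<string, (args: Record<string, string>) => string> = {
  getRecentDeploys: () => "deploy 4812 to checkout-service 14m ago",
  getErrorLogs: (a) => `checkout-service: 500s spiking since ${a.since ?? "14m"}`,
};

// One investigation: hand the model the transcript, execute any tool it
// requests, and append the result so investigation state persists across
// steps, until the model emits a final structured brief.
function investigate(
  model: (transcript: string[]) => ModelTurn,
  alert: string,
): IncidentBrief {
  const transcript: string[] = [`ALERT: ${alert}`];
  for (let step = 0; step < 10; step++) {
    const turn = model(transcript);
    if (turn.brief) return turn.brief; // model is done: structured output
    if (turn.toolCall) {
      const result = tools[turn.toolCall.name](turn.toolCall.args);
      transcript.push(`${turn.toolCall.name} -> ${result}`);
    }
  }
  throw new Error("investigation did not converge");
}
```

The key design point is that the model drives the loop: it decides which tool to call next based on the accumulated transcript, rather than following a fixed runbook order.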
Challenges I ran into
- Hitting 429 rate limits on Gemini 3 when it ran every step; I eventually landed on a lighter model for the tool loop and Gemini 3 only for the final response

- Scoping the prototype to demonstrate real agentic behavior without building a full production system
- Choosing which parts of the on-call workflow to model vs intentionally leave out
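The rate-limit mitigation above can be sketched as a simple routing-plus-backoff policy. This is an illustrative sketch, not the real code: the `RateLimitError` class, `withBackoff` helper, and `pickModel` function are assumptions; only the two model names come from the writeup.

```typescript
const TOOL_MODEL = "gemini-flash-lite-latest"; // many cheap, fast calls
const SYNTHESIS_MODEL = "gemini-3-pro-preview"; // one final deep-reasoning call

// Stand-in for a 429 response from the API.
class RateLimitError extends Error {}

// Retry a call with exponential backoff when rate-limited.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof RateLimitError) || attempt >= maxRetries) throw err;
      // Backoff: 500ms, 1s, 2s, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}

// Route the iterative tool loop to the light model and reserve the
// heavier model for the single synthesis step.
function pickModel(step: "tool-loop" | "synthesis"): string {
  return step === "synthesis" ? SYNTHESIS_MODEL : TOOL_MODEL;
}
```

Splitting the workload this way cuts most of the traffic to the rate-limited model, since an investigation makes many tool-loop calls but only one synthesis call.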
Accomplishments that I'm proud of
- Built an agent that acts, not just chats
- Implemented autonomous multi-step investigations
- Preserved familiar on-call workflows
- Reduced cognitive load across experience levels
What I learned
- Tool execution matters more than prompt complexity
- Most on-call toil is procedural and automatable
- Large context windows make agentic workflows easier by preserving investigation state across steps
What's next for Agentic On-Call Engineer
- Use specialist models for anomaly detection and feed their output to the LLM
- Let engineers chat with the agent to provide any additional context
- Integrate real telemetry sources
- Learn from historical incidents
Built With
- gemini-3
- google-ai-studio
- typescript