Inspiration
On-call engineers spend more time finding information than solving problems.
Alerts force engineers to jump across dashboards, logs, metrics, and runbooks under pressure. This is painful for everyone—and especially challenging for new engineers who lack system context.
I wanted to build an agent that does the investigation, not just summarizes it.
What it does
Agentic On-Call Engineer acts as an autonomous on-call responder.
Given an alert, the agent investigates telemetry, correlates signals, forms hypotheses, and produces a concise incident brief with recommended next steps.
How I built it
I built this using the Gemini API as an autonomous on-call investigation engine. The system relies on Gemini function calling to let the model actively query alerts, metrics, logs, deploy history, feature flags, dependencies, and runbooks instead of just suggesting next steps. I use gemini-flash-lite-latest for fast, iterative tool-driven investigation and gemini-3-pro-preview for final deep reasoning and synthesis. The final output is generated using structured JSON responses with a defined schema, ensuring reliable confidence scoring, evidence tracking, and actionable next steps.
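The tool-driven loop described above can be sketched roughly as follows. This is a minimal illustration, not the real implementation: the tool names (`getRecentDeploys`, `getErrorLogs`), the `IncidentBrief` shape, and the `model` callback are all hypothetical stand-ins for the Gemini function-calling API (`@google/genai` with `functionDeclarations` and a JSON response schema in the real system).

```typescript
type ToolCall = { name: string; args: Record<string, string> };
type ModelTurn = { toolCall?: ToolCall; brief?: IncidentBrief };

// Shape of the final structured JSON output (illustrative fields).
interface IncidentBrief {
  hypothesis: string;
  confidence: number; // 0..1, enforced by the response schema
  evidence: string[];
  nextSteps: string[];
}

// Hypothetical telemetry tools; real ones would query live systems.
const tools: Record<string, (args: Record<string, string>) => string> = {
  getRecentDeploys: () => "deploy 4812 to checkout-service 14m ago",
  getErrorLogs: (a) => `checkout-service: 500s spiking since ${a.since ?? "14m"}`,
};

// One investigation: hand the model the transcript, execute any tool it
// requests, and append the result so investigation state persists across
// steps, until the model emits a final structured brief.
function investigate(
  model: (transcript: string[]) => ModelTurn,
  alert: string,
): IncidentBrief {
  const transcript: string[] = [`ALERT: ${alert}`];
  for (let step = 0; step < 10; step++) {
    const turn = model(transcript);
    if (turn.brief) return turn.brief; // model is done: structured output
    if (turn.toolCall) {
      const result = tools[turn.toolCall.name](turn.toolCall.args);
      transcript.push(`${turn.toolCall.name} -> ${result}`);
    }
  }
  throw new Error("investigation did not converge");
}
```

The key design point is that the model drives the loop: it decides which tool to call next based on the accumulated transcript, rather than following a fixed runbook order.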
Challenges I ran into
- Hitting 429 rate limits on Gemini 3 when it ran every step; I eventually landed on a lighter model for the tool loop and Gemini 3 only for the final response

- Scoping the prototype to demonstrate real agentic behavior without building a full production system
- Choosing which parts of the on-call workflow to model vs intentionally leave out
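The rate-limit mitigation above can be sketched as a simple routing-plus-backoff policy. This is an illustrative sketch, not the real code: the `RateLimitError` class, `withBackoff` helper, and `pickModel` function are assumptions; only the two model names come from the writeup.

```typescript
const TOOL_MODEL = "gemini-flash-lite-latest"; // many cheap, fast calls
const SYNTHESIS_MODEL = "gemini-3-pro-preview"; // one final deep-reasoning call

// Stand-in for a 429 response from the API.
class RateLimitError extends Error {}

// Retry a call with exponential backoff when rate-limited.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof RateLimitError) || attempt >= maxRetries) throw err;
      // Backoff: 500ms, 1s, 2s, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}

// Route the iterative tool loop to the light model and reserve the
// heavier model for the single synthesis step.
function pickModel(step: "tool-loop" | "synthesis"): string {
  return step === "synthesis" ? SYNTHESIS_MODEL : TOOL_MODEL;
}
```

Splitting the workload this way cuts most of the traffic to the rate-limited model, since an investigation makes many tool-loop calls but only one synthesis call.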
Accomplishments that I'm proud of
- Built an agent that acts, not just chats
- Implemented autonomous multi-step investigations
- Preserved familiar on-call workflows
- Reduced cognitive load across experience levels
What I learned
- Tool execution matters more than prompt complexity
- Most on-call toil is procedural and automatable
- Large context windows make agentic workflows easier by preserving investigation state across steps
What's next for Agentic On-Call Engineer
- Use specialist models for anomaly detection and feed their output to the LLM
- Let engineers chat with the agent to provide any additional context
- Integrate real telemetry sources
- Learn from historical incidents
Built With
- gemini-3
- google-ai-studio
- typescript