Inspiration
Every engineer has been there: it's 3 AM, alerts are firing, and you're drowning in logs, dashboards, and Slack threads trying to figure out what's broken. We built Gemini SRE Commander because incident response shouldn't feel like detective work with missing clues. We wanted to give SREs an AI-powered teammate that can read logs, analyze screenshots, and tell you what's wrong and how to fix it.
What it does
Gemini SRE Commander transforms chaotic system data into actionable incident intelligence:
- Upload logs & screenshots – Drop in text logs, JSON exports, Grafana dashboards, even architecture diagrams
- AI-powered analysis – Gemini 3 Flash processes everything together using multimodal reasoning
- Get instant answers – Root cause, severity rating, evidence timeline, and step-by-step mitigation plan
- Take action – Copy-paste runbook commands tailored to your specific incident
- Track progress – Interactive checklist to mark mitigation steps as complete
- Export & share – Download a markdown post-mortem for your incident review
Bonus features: Real-time log streaming via WebSocket for live incident analysis, and 4 pre-built demo scenarios to instantly show the system in action.
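To make the checklist and export steps above concrete, here is a minimal sketch of rendering a markdown post-mortem from an analysis result. The `PostMortemInput` and `MitigationStep` shapes and the `renderPostMortem` function are assumptions made for illustration, not the project's actual data model.

```typescript
// Hypothetical shapes: the real analysis object may carry more fields
// (for example the evidence timeline).
interface MitigationStep {
  description: string;
  done: boolean;
}

interface PostMortemInput {
  title: string;
  severity: string;
  rootCause: string;
  steps: MitigationStep[];
}

// Render the analysis as markdown, with mitigation steps as a task list
// so checklist progress survives the export.
function renderPostMortem(input: PostMortemInput): string {
  const checklist = input.steps.map(
    (step) => `- [${step.done ? "x" : " "}] ${step.description}`,
  );
  return [
    `# Post-mortem: ${input.title}`,
    "",
    `**Severity:** ${input.severity}`,
    "",
    "## Root cause",
    input.rootCause,
    "",
    "## Mitigation",
    ...checklist,
  ].join("\n");
}
```

Completed steps export as checked task-list items, so the downloaded file reflects how far mitigation had progressed at export time.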
How we built it
| Component | Tech | Why We Chose It |
| --- | --- | --- |
| Runtime | Bun | Native TypeScript + HTTP/WebSocket in one tool |
| AI Engine | Gemini 3 Flash Preview | Multimodal (text + images) + structured JSON output |
| Frontend | Vanilla HTML/CSS/JS | Zero build step, instant load, easy to demo |
Key architectural decisions:
- Structured output schema – Enforcing a JSON schema keeps responses consistent and parseable, with a fallback parser for the occasional malformed reply
- Context-aware runbooks – Keyword matching serves relevant commands based on incident type
- Smart truncation – Large logs auto-truncate with markers to stay within token limits
- Connection-scoped state – Each WebSocket session has isolated buffers to prevent data leaks
Challenges we ran into
1. Context window limits
Large log files exceeded AI token limits. We implemented smart truncation that preserves the most recent logs (usually the most relevant) with clear [TRUNCATED] markers.
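A minimal sketch of this kind of tail-preserving truncation is shown below. It assumes a character budget as a stand-in for a token budget; the `truncateLogs` name and the `maxChars` parameter are hypothetical and not taken from the project.

```typescript
const TRUNCATION_MARKER = "[TRUNCATED]";

// Keep the newest whole lines that fit in the budget and prefix a marker
// so the reader (and the model) can tell that older lines were dropped.
// Assumption: characters approximate tokens; a real limit would be in tokens.
function truncateLogs(logText: string, maxChars: number): string {
  if (logText.length <= maxChars) return logText;

  const lines = logText.split("\n");
  const kept: string[] = [];
  // Reserve room for the marker line plus its newline.
  let used = TRUNCATION_MARKER.length + 1;

  // Walk backwards so the most recent lines are kept first.
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i] ?? "";
    const cost = line.length + 1; // +1 for the newline
    if (used + cost > maxChars) break;
    kept.push(line);
    used += cost;
  }

  kept.reverse();
  return [TRUNCATION_MARKER, ...kept].join("\n");
}
```

Cutting on line boundaries avoids handing the model a half-written log entry. If the budget is smaller than the marker itself, this sketch returns only the marker.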
2. Structured output reliability
LLMs occasionally return malformed JSON or extra prose. We combined Gemini's native schema enforcement with a graceful fallback parser that returns valid error objects instead of crashing the UI.
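One way such a fallback parser could look is sketched below. The `IncidentAnalysis` fields, the severity values, and the `parseModelReply` function are assumptions for illustration; the project's real schema is not shown in this write-up, and no Gemini API call is made here.

```typescript
// Hypothetical response shape; the real schema likely has more fields.
interface IncidentAnalysis {
  rootCause: string;
  severity: string;
  mitigationSteps: string[];
}

type ParseResult =
  | { ok: true; analysis: IncidentAnalysis }
  | { ok: false; error: string };

const SEVERITIES: readonly string[] = ["low", "medium", "high", "critical"];

// Runtime check that parsed JSON has the fields the UI relies on.
function isIncidentAnalysis(value: unknown): value is IncidentAnalysis {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const rootCause = record["rootCause"];
  const severity = record["severity"];
  const steps = record["mitigationSteps"];
  return (
    typeof rootCause === "string" &&
    typeof severity === "string" &&
    SEVERITIES.includes(severity) &&
    Array.isArray(steps) &&
    steps.every((step) => typeof step === "string")
  );
}

// Never throws: every failure becomes an error object the UI can render.
function parseModelReply(raw: string): ParseResult {
  // Replies sometimes wrap JSON in prose; take the outermost {...} span.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, error: "No JSON object found in model reply" };
  }
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (!isIncidentAnalysis(parsed)) {
      return { ok: false, error: "Reply JSON does not match the expected shape" };
    }
    return { ok: true, analysis: parsed };
  } catch {
    return { ok: false, error: "Reply contained malformed JSON" };
  }
}
```

Returning a tagged result instead of throwing means the frontend has a single code path for showing either the analysis or an error card.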
3. WebSocket state management
Managing buffered logs and analysis state across concurrent connections was tricky. We solved it with unique connection IDs and isolated state per session.
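The idea can be sketched independently of the server framework. The `SessionRegistry` class and its method names below are hypothetical; in a Bun server, these calls would presumably be made from the WebSocket `open`, `message`, and `close` handlers.

```typescript
// Per-connection state: nothing here is shared between sessions.
interface SessionState {
  buffer: string[];
  analyzing: boolean;
}

class SessionRegistry {
  private sessions = new Map<string, SessionState>();
  private nextId = 1;

  // Called when a socket opens; returns the ID to attach to that socket.
  // Assumption: a simple counter is unique enough within one process.
  open(): string {
    const id = `conn-${this.nextId++}`;
    this.sessions.set(id, { buffer: [], analyzing: false });
    return id;
  }

  // Called for each incoming log line on a given connection.
  append(id: string, line: string): void {
    const session = this.sessions.get(id);
    if (!session) throw new Error(`Unknown connection: ${id}`);
    session.buffer.push(line);
  }

  // Returns a copy so callers cannot mutate another session's buffer.
  snapshot(id: string): readonly string[] {
    return [...(this.sessions.get(id)?.buffer ?? [])];
  }

  // Called when the socket closes; drops the buffer so nothing leaks.
  close(id: string): void {
    this.sessions.delete(id);
  }
}
```

Keying all state by connection ID makes cleanup a single `delete`, and a log line sent on one socket can never appear in another session's analysis.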
4. Runbook relevance
Generic commands aren't helpful during specific incidents. We built a keyword-matching system that serves context-aware commands (e.g., database commands for DB incidents, cache commands for Redis issues).
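A keyword matcher along these lines might look like the sketch below. The catalog entries, keywords, and the `matchRunbooks` function are illustrative assumptions, not the project's actual runbook data.

```typescript
interface Runbook {
  category: string;
  keywords: string[];
  commands: string[];
}

// Hypothetical catalog: categories, keywords, and commands are examples only.
const RUNBOOKS: Runbook[] = [
  {
    category: "database",
    keywords: ["postgres", "connection pool", "deadlock"],
    commands: ["psql -c 'SELECT count(*) FROM pg_stat_activity;'"],
  },
  {
    category: "cache",
    keywords: ["redis", "cache miss", "eviction"],
    commands: ["redis-cli INFO memory"],
  },
];

// Return runbooks whose keywords appear in the incident text,
// ordered by how many keywords matched (most relevant first).
function matchRunbooks(incidentText: string): Runbook[] {
  const text = incidentText.toLowerCase();
  return RUNBOOKS
    .map((runbook) => ({
      runbook,
      hits: runbook.keywords.filter((keyword) => text.includes(keyword)).length,
    }))
    .filter((entry) => entry.hits > 0)
    .sort((x, y) => y.hits - x.hits)
    .map((entry) => entry.runbook);
}
```

For example, `matchRunbooks("Redis eviction spike after deploy")` would return only the cache runbook, while text with no known keywords returns an empty list rather than generic commands.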
Accomplishments that we're proud of
- End-to-end workflow – From log upload to exportable post-mortem in one seamless flow
- 10-second demo – Judges click "Load Demo" and see a complete incident analysis within seconds
- True multimodal reasoning – The AI actually "looks" at architecture diagrams and metrics screenshots, not just text logs
- Smart runbooks – Copy-paste commands with confirmation warnings for destructive operations
- Real-time streaming – Live log analysis that triggers automatically when patterns emerge
What we learned
Technical insights:
- Gemini's structured output made a big difference: enforcing JSON schemas at the API level eliminates most parsing bugs, though a fallback parser is still worth keeping
- Bun's native WebSocket support makes real-time features trivial to implement
- Multimodal AI unlocks use cases impossible with text-only models
Product insights:
- Progress tracking transforms static reports into collaborative tools
- Export functionality (markdown post-mortems) lets the tool's output feed directly into an incident review
What's next for Gemini SRE Commander
- Team collaboration features (comments, assignments, shared timelines)
- Custom runbook editor for team-specific commands
- Alert correlation to group related alerts into single incidents
Long-term vision:
"The Self-Healing Platform" – A system that not only diagnoses incidents but executes safe remediation steps automatically, with human approval for critical actions.