Inspiration

Engineers can lose critical minutes switching between Datadog dashboards, logs, and deployment histories during a production incident. Incident handling follows a painful pattern: gathering signals, correlating metrics with recent changes, searching past incidents for patterns, and then synthesizing all of it into a coherent solution. We wanted to parallelize and automate this workflow with AI. Our goal was to build an intelligent command center where teams and an AI assistant can collaborate in real time.

What it does

Incident Command Center is a place for users and organizations to manage their tickets, monitoring, and incident response, backed by a multi-agent AI system that orchestrates the response to incidents. Organizations can connect their Datadog monitors using their Datadog credentials. When a Datadog alert fires, or when a natural language query such as “why is checkout failing?” comes in, the system orchestrates multiple AI agents that analyze logs, metrics, and historical incidents for patterns. A moderator agent then synthesizes their findings into a recommendation, a severity assessment, and a probable root cause. The platform also includes persistent team chat, where users can collaborate in real time and query the AI assistant with @assistant mentions.

How we built it

We used Confluent to orchestrate multiple specialized AI agents. Each agent answers a specific question and passes its context to a moderator agent, which produces the final recommendation for an incident. Incidents can be created from natural language queries, either by prompting on the homepage or by mentioning @assistant in a team chat, where teams discuss solutions under each incident. Incidents are also created when Datadog monitors send webhook alerts. A webhook alert triggers all AI agents, whereas a natural language query triggers only the agents relevant to its intent.
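The routing rule described above (webhook alerts fan out to every agent, natural language queries go only to agents matching their intent) can be sketched roughly as follows. This is an illustrative sketch, not our actual implementation: the agent names, the keyword lists, the `IncidentEvent` type, and the fall-back-to-all-agents behavior are all assumptions made for the example, and the Confluent messaging layer is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical agent names; the real agents may be named and scoped differently.
ALL_AGENTS = ("logs", "metrics", "history")

# Hypothetical keyword hints used to guess which agents a query needs.
INTENT_KEYWORDS = {
    "logs": ("error", "exception", "failing", "log"),
    "metrics": ("latency", "slow", "cpu", "memory", "metric"),
    "history": ("before", "previous", "again", "past"),
}


@dataclass(frozen=True)
class IncidentEvent:
    source: str  # "datadog_webhook" or "natural_language"
    text: str    # alert title or the user's query


def select_agents(event: IncidentEvent) -> list[str]:
    """Return the agents that should run for this event."""
    # Webhook alerts carry no user intent, so every agent runs.
    if event.source == "datadog_webhook":
        return list(ALL_AGENTS)

    # For natural language queries, keep only agents whose keywords appear.
    lowered = event.text.lower()
    selected = [
        agent
        for agent in ALL_AGENTS
        if any(keyword in lowered for keyword in INTENT_KEYWORDS[agent])
    ]
    # Assumption: if no intent is recognized, fall back to running every agent.
    return selected or list(ALL_AGENTS)
```

With these example keywords, the query “why is checkout failing?” would be routed to the logs agent only, while any Datadog webhook would run all three.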

Challenges we ran into

Orchestrating multiple AI agents was challenging, because how they gather context has to change with the intent of the request and with the source of the incident (a natural language query or a Datadog webhook). The nondeterministic nature of LLMs also made writing MCPs and handling responses difficult, and it made the code verbose at times.
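To give a sense of why response handling gets verbose, here is a minimal sketch of defensive parsing for an agent reply. It is a sketch under stated assumptions, not our production code: the `severity` and `summary` field names, the severity scale, and the `parse_agent_reply` helper are hypothetical and chosen only for illustration.

```python
import json
import re

# Hypothetical severity scale; the real one may differ.
VALID_SEVERITIES = {"low", "medium", "high", "critical"}


def parse_agent_reply(raw: str) -> dict:
    """Extract a JSON object from an LLM reply, tolerating extra prose around it."""
    data = {}
    # LLMs often wrap JSON in chatty text, so grab the outermost {...} span.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                data = parsed
        except json.JSONDecodeError:
            # Malformed JSON: fall through and use the defaults below.
            pass

    # Normalize the severity and reject values outside the known scale.
    severity = str(data.get("severity", "")).lower()
    if severity not in VALID_SEVERITIES:
        severity = "unknown"

    # If no usable summary was returned, keep the raw text so nothing is lost.
    summary = str(data.get("summary") or raw.strip())
    return {"severity": severity, "summary": summary}
```

Every field needs its own fallback because the model may omit it, rename it, or return it in an unexpected form, which is where most of the extra code comes from.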

Accomplishments that we're proud of

This was the first time either of us had worked with Datadog or Confluent, which added a bit of a learning curve, but we learned a lot about what the platforms are capable of and how to put them to use.

What we learned

We learned a lot about Datadog and Confluent as platforms, the importance of observability and incident response, and how to manage and use telemetry data to its fullest.

What's next for Incident Command Center

If we were to expand Incident Command Center, we would add more agents to deepen the analysis, such as a cost agent or an SLO agent, so that organization-facing and user-facing viewpoints are each considered. We would also add more customizable dashboard components so the platform can suit more organizations.

FOR TESTING: sign in with email: guest@example.com password: Password123!
