Agentic Infrastructure troubleshooting

Inspiration

In my group, we face challenges troubleshooting our infrastructure deployed on Kubernetes, AWS ECS, and other platforms. We aim to simplify operational incident resolution, which is why I started using the open-source tool Home GPT.

What it does

The tool connects to a large language model of your choice and offers various tools for infrastructure troubleshooting, such as the kubectl helm command. It can read from corporate Confluence, as well as logs and metrics from Loki and Grafana.

Challenges we ran into

However, I identified a few gaps that I wanted to address for my specific use case. With the Holmes GPT tool deployed as it is, there is currently no way to follow up with the tool after an LLM investigation, nor is there a method to verify user questions. This there is no way to e.g deploy it as a Slack bot to help to help engineers to troubleshot their infrastructure.

How we built it

I am using the open-source tool Holmes GPT and have extended it with additional functionality. I contributed my work to GitHub so that the wider community could benefit from it. https://github.com/robusta-dev/holmesgpt/pull/395

Accomplishments that we're proud of

I have extended the tool to make it more useful for real-world production use cases.

What we learned

It is easy to create a basic proof of concept with an LLM agent, but significant work is needed to make it production-ready.

What's next for Agentic Infrastructure troubleshooting

We plan to support running tools deployed as MCP servers remotely.

Built With

holmesgpt
llm
python

Updates

Alex Zveruk posted an update — May 03, 2025 09:34 AM EDT

There is an issue with my Vimeo link. You can view the live demo at https://share.cleanshot.com/H9mSgzxh.

I am using the code from my merge request to troubleshoot the Kubernetes cluster as part of dmoe. I initially asked vague questions, and after my code change, HolmesGPT asked for clarification and conducted an analysis. Following that, it requested a follow-up to help resolve the issue. Previously, this open-source project did not support questions after the investigation was completed.

Log in or sign up for Devpost to join the conversation.

Alex Zveruk started this project — May 03, 2025 09:29 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.