Inspiration
In my group, we face challenges troubleshooting our infrastructure deployed on Kubernetes, AWS ECS, and other platforms. We aim to simplify operational incident resolution, which is why I started using the open-source tool Home GPT.
What it does
The tool connects to a large language model of your choice and offers various tools for infrastructure troubleshooting, such as the kubectl helm command. It can read from corporate Confluence, as well as logs and metrics from Loki and Grafana.
Challenges we ran into
However, I identified a few gaps that I wanted to address for my specific use case. With the Holmes GPT tool deployed as it is, there is currently no way to follow up with the tool after an LLM investigation, nor is there a method to verify user questions. This there is no way to e.g deploy it as a Slack bot to help to help engineers to troubleshot their infrastructure.
How we built it
I am using the open-source tool Holmes GPT and have extended it with additional functionality. I contributed my work to GitHub so that the wider community could benefit from it. https://github.com/robusta-dev/holmesgpt/pull/395
Accomplishments that we're proud of
I have extended the tool to make it more useful for real-world production use cases.
What we learned
It is easy to create a basic proof of concept with an LLM agent, but significant work is needed to make it production-ready.
What's next for Agentic Infrastructure troubleshooting
We plan to support running tools deployed as MCP servers remotely.
Built With
- holmesgpt
- llm
- python
Log in or sign up for Devpost to join the conversation.