Inspiration
Our team has over 20 years of experience working with observability tools. When you are the monitoring guy, you also become the troubleshooting guy, because it is often up to you to prove why an alert wasn't a false one. We thought it would be cool to throw something together that leverages AI to describe the underlying issue and perform additional troubleshooting, so the operations staff have a better idea of what is going on.
What it does
When a site is added, the app continually checks its HTTP response, response time, and route. When it finds an issue and the issue persists, it offers some insight into the problem along with some commands that can be run from the app. This is where the human-in-the-loop piece comes in: each command is listed with its risk level, if any. In the end, the idea is to give operations teams a better sense of where to go next with a detected failure.
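The flow described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the names `CheckResult`, `SiteMonitor`, `SuggestedCommand`, and `approved_commands`, the 2-second threshold, the three-failure rule, and the risk labels are all assumptions made for the example. The route check is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    """One probe of a site (hypothetical shape)."""
    status_code: Optional[int]  # None when the request failed outright
    response_seconds: float


@dataclass
class SuggestedCommand:
    """A troubleshooting command shown to the operator (hypothetical shape)."""
    command: str
    risk: str  # assumed labels: "none", "low", "high"


@dataclass
class SiteMonitor:
    url: str
    max_response_seconds: float = 2.0  # assumed threshold
    failures_before_alert: int = 3     # assumed persistence rule
    consecutive_failures: int = field(default=0, init=False)

    def is_healthy(self, result: CheckResult) -> bool:
        # Healthy means a 2xx/3xx status returned within the time budget.
        return (
            result.status_code is not None
            and 200 <= result.status_code < 400
            and result.response_seconds <= self.max_response_seconds
        )

    def record(self, result: CheckResult) -> bool:
        """Record a probe; return True once the issue has persisted."""
        if self.is_healthy(result):
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_before_alert


def approved_commands(
    suggestions: List[SuggestedCommand],
    approve: Callable[[SuggestedCommand], bool],
) -> List[SuggestedCommand]:
    """Human in the loop: keep only the commands the operator approves."""
    return [s for s in suggestions if approve(s)]
```

In a sketch like this, nothing runs automatically: the `approve` callback stands in for the operator reviewing each command and its risk level before it is executed.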
How we built it
Code was produced via ClaudeCode and GeminiCLI, utilizing the GSD plugin to maintain an organized and methodical approach to the build.
Challenges we ran into
Time. We both have jobs and families, and this was our first AI hackathon. We had additional plans for the app and its scope that had to be pushed out. Some of those items are RBAC and agent skills for when the app has access to the site infrastructure, such as recommending app server and DB server actions.
Accomplishments that we're proud of
It works! It is a solid MVP (minimum viable product) for something that could be very useful in managing compute assets. Sure, we could have added a lot more features, but we took this on because we were getting excited about AI and thought it would be a good first shot at it.
What we learned
Using skills and add-ons with your tools can really keep things moving forward. Without them, things can go off the rails fairly easily.
What's next for Virtual SRE - External
- RBAC for users
- Server-side features
- RAG functionality for server-side content. A use case: when a new release of your app server software comes out, you can add the docs so that your agent is grounded in the latest data.
Built With
- claudecode
- geminicli
- influxdb
- nova
- prometheus
- python