Inspiration ✨
In the financial services industry, downtime is more than an inconvenience: it can translate directly into significant financial losses and erode customer trust. Inspired by Google’s Site Reliability Engineering (SRE) principles and the Bank of Anthos project, we set out to create a tool that demonstrates how AI-driven automation can reduce outages, improve reliability, and let SRE teams focus on innovation instead of repetitive firefighting.
What We Learned 📚
- Building a Kubernetes-native operator in Python.
- Creating a REST API with FastAPI for status checks, interventions, and root cause analysis (RCA) outputs.
- Designing healing playbooks in YAML for automated incident response.
- Using the Google Gemini API to provide AI-powered RCA.
- Validating self-healing workflows with chaos testing (e.g., pod crashes).
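To illustrate the playbook idea, here is a minimal sketch of how a healing rule might be selected for an incoming event. The playbook is written as the Python dict a YAML parser would produce. The schema (`rules`, `match`, `action`), the rule names, and the simplified event fields are all assumptions made for this example, not the project's actual format.

```python
from typing import Optional

# Hypothetical playbook, shown as the dict a YAML parser would produce.
# Field names (rules, match, kind, reason, action) are assumptions.
PLAYBOOK = {
    "rules": [
        {
            "name": "restart-crashlooping-pod",
            "match": {"kind": "Pod", "reason": "CrashLoopBackOff"},
            "action": "restart_pod",
        },
        {
            "name": "rollback-on-image-pull-failure",
            "match": {"kind": "Pod", "reason": "ImagePullBackOff"},
            "action": "rollback_deployment",
        },
    ]
}


def select_action(event: dict, playbook: dict) -> Optional[str]:
    """Return the action of the first rule whose match fields all equal the event's."""
    for rule in playbook.get("rules", []):
        if all(event.get(key) == value for key, value in rule["match"].items()):
            return rule["action"]
    return None  # no rule matched: leave the incident for a human


# Simplified, hypothetical event; real Kubernetes events carry more fields.
event = {"kind": "Pod", "reason": "CrashLoopBackOff", "name": "example-pod"}
print(select_action(event, PLAYBOOK))  # restart_pod
```

First-match-wins keeps the behavior predictable: rule order in the playbook doubles as priority, and an unmatched event results in no automated action.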
How We Built It 🛠️
- Implemented the SRE-Agent as a Python service using FastAPI.
- Created a ConfigMap-driven playbook system where healing rules are defined in YAML.
- Integrated with the Kubernetes API to monitor events and trigger remediation.
- Added Prometheus hooks to enable metric-based healing actions.
- Connected to the Google Gemini API for AI-powered RCA reports.
- Packaged the app in a Docker container and deployed it to GKE.
# Example: Apply healing rules
kubectl create configmap sre-agent-playbook --from-file=healing-playbook.yaml
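The metric-based healing step could look roughly like the sketch below. It assumes metric samples have already been fetched from Prometheus (the query itself is out of scope here) and only decides whether a rule should fire. `MetricRule`, its fields, the metric name, and the action name are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MetricRule:
    """Hypothetical metric rule; none of these field names come from the project."""
    metric: str
    threshold: float
    sustained_samples: int  # consecutive breaches required before healing
    action: str


def should_heal(rule: MetricRule, samples: Sequence[float]) -> bool:
    """True when the most recent `sustained_samples` values all exceed the threshold."""
    if rule.sustained_samples < 1:
        raise ValueError("sustained_samples must be at least 1")
    if len(samples) < rule.sustained_samples:
        return False  # not enough data yet; avoid acting on a single spike
    recent = samples[-rule.sustained_samples:]
    return all(value > rule.threshold for value in recent)


rule = MetricRule(
    metric="http_error_ratio",
    threshold=0.05,
    sustained_samples=3,
    action="restart_deployment",
)

print(should_heal(rule, [0.01, 0.09, 0.02, 0.08]))  # False: isolated spikes
print(should_heal(rule, [0.01, 0.06, 0.07, 0.09]))  # True: sustained breach
```

Requiring several consecutive breaches is one simple way to keep a remediation from firing on a transient spike; the right window depends on the scrape interval and on how disruptive the action is.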
Challenges We Faced ⚡
- Ensuring safe remediation with a dry-run mode before applying fixes.
- Balancing real-time healing with AI-powered RCA (without adding delays).
- Implementing leader election logic in Python to avoid conflicts.
- Integrating multiple systems (Kubernetes, Prometheus, Gemini API) smoothly.
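As a rough illustration of the dry-run and leader election points, the sketch below uses an in-memory lease with a time-to-live so that only one replica acts at a time, and a dry-run flag that reports an action instead of executing it. The `Lease` class, replica names, and TTL value are hypothetical. An in-memory object is not safe across processes; in a cluster the lease would need to live in shared storage with atomic updates, for example a Kubernetes Lease object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory stand-in for a shared lease record (hypothetical)."""
    holder: Optional[str] = None
    renewed_at: float = 0.0


def try_acquire(lease: Lease, candidate: str, now: float, ttl: float = 15.0) -> bool:
    """Acquire or renew the lease; return True if `candidate` is now the leader."""
    expired = lease.holder is None or (now - lease.renewed_at) > ttl
    if expired or lease.holder == candidate:
        lease.holder = candidate
        lease.renewed_at = now
        return True
    return False


def remediate(action: str, is_leader: bool, dry_run: bool) -> str:
    """Describe what a replica does; only a leader outside dry-run executes."""
    if not is_leader:
        return f"skipped {action}: not leader"
    if dry_run:
        return f"dry-run: would execute {action}"
    return f"executed {action}"


lease = Lease()
a_leads = try_acquire(lease, "agent-a", now=0.0)  # lease is free, agent-a wins
b_leads = try_acquire(lease, "agent-b", now=5.0)  # lease still held by agent-a

print(remediate("restart_pod", a_leads, dry_run=True))   # dry-run: would execute restart_pod
print(remediate("restart_pod", b_leads, dry_run=False))  # skipped restart_pod: not leader
print(try_acquire(lease, "agent-b", now=30.0))           # True: agent-a's lease expired
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes the takeover-after-expiry path easy to exercise in tests.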
Built With 🧩
- Language: Python
- Framework: FastAPI
- Cloud Platform: Google Kubernetes Engine (GKE)
- APIs: Kubernetes API, Google Gemini API
- Monitoring: Prometheus
- Containerization: Docker
Try It Out 🚀
- GitHub Repo → [Add your repo link here]
- Demo Video → [Add Loom/YouTube link]