Rift - Project Story
Inspiration
Rift was inspired by 3 AM production outages, when developers scramble to diagnose and fix issues. I wanted to build a platform that never sleeps: an autonomous infrastructure guardian that detects, diagnoses, and fixes problems before humans even notice.
What it does
Rift is an autonomous infrastructure orchestration platform that:
- Monitors cloud infrastructure every 30 seconds using Prometheus metrics
- Detects incidents like high CPU, memory leaks, or disk space issues
- Diagnoses root causes using AI-powered analysis
- Remediates problems automatically via SSH execution
- Provisions new infrastructure through natural language requests
- Learns from every incident to improve future responses
All powered by DigitalOcean Gradient AI (DeepSeek R1) with zero-touch automation.
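As a rough illustration of the detection step, the sketch below checks per-instance metric samples against thresholds. The PromQL expression uses Node Exporter's standard `node_cpu_seconds_total` metric. The threshold values, the `Incident` shape, and the `detect_incidents` name are assumptions made for this example, not Rift's actual code.

```python
from dataclasses import dataclass

# PromQL for per-instance CPU utilisation, built on Node Exporter's
# node_cpu_seconds_total counter.
CPU_USAGE_QUERY = (
    "100 - (avg by (instance) "
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)

# Hypothetical thresholds in percent; the real values are not stated in this write-up.
THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 80.0}


@dataclass(frozen=True)
class Incident:
    instance: str
    metric: str
    value: float
    threshold: float


def detect_incidents(samples: dict[str, dict[str, float]]) -> list[Incident]:
    """Return one Incident for every metric that exceeds its threshold."""
    incidents = []
    for instance, metrics in sorted(samples.items()):
        for metric, value in sorted(metrics.items()):
            limit = THRESHOLDS.get(metric)
            # Metrics without a configured threshold are ignored.
            if limit is not None and value > limit:
                incidents.append(Incident(instance, metric, value, limit))
    return incidents
```

In this sketch the samples would come from querying Prometheus with expressions such as `CPU_USAGE_QUERY`; keeping the threshold check a pure function makes it easy to test without a running Prometheus server.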
How I built it
Tech Stack:
- Backend: FastAPI + Python (async)
- Frontend: Next.js + TypeScript + Tailwind CSS
- AI Engine: DigitalOcean Gradient AI (DeepSeek R1 model)
- Monitoring: Prometheus + Node Exporter
- Infrastructure: Terraform + DigitalOcean API
- Automation: Cloud-init + SSH key distribution
Architecture:
- Multi-agent system with specialized AI agents (Monitor, Diagnostic, Remediation, Provisioner)
- MCP abstraction layer for cloud provider independence
- Knowledge base (RAG) with runbooks and best practices
- Real-time WebSocket updates for incident tracking
- Autonomous loop running every 30 seconds
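The autonomous loop described above can be sketched with `asyncio`. This is a minimal outline under stated assumptions: the `monitor`, `diagnose`, and `remediate` callables stand in for the Monitor, Diagnostic, and Remediation agents, and their signatures, the dict-shaped incident, and the `max_cycles` parameter are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Incident = dict  # hypothetical shape, e.g. {"host": "web-1", "issue": "high_cpu"}


async def run_cycle(
    monitor: Callable[[], Awaitable[list]],
    diagnose: Callable[[Incident], Awaitable[str]],
    remediate: Callable[[Incident, str], Awaitable[bool]],
) -> list:
    """Run one monitor -> diagnose -> remediate pass and report the outcomes."""
    results = []
    for incident in await monitor():
        try:
            diagnosis = await diagnose(incident)
            fixed = await remediate(incident, diagnosis)
        except Exception as exc:
            # A failure on one incident must not stop the rest of the cycle.
            diagnosis, fixed = f"error: {exc}", False
        results.append((incident, diagnosis, fixed))
    return results


async def autonomous_loop(
    cycle: Callable[[], Awaitable[object]],
    interval_s: float = 30.0,
    max_cycles: Optional[int] = None,
) -> None:
    """Call `cycle` every `interval_s` seconds; `max_cycles` exists only for testing."""
    done = 0
    while max_cycles is None or done < max_cycles:
        await cycle()
        done += 1
        await asyncio.sleep(interval_s)
```

Catching exceptions per incident is one way to address the coordination problem: a diagnosis that fails for one host is recorded as an error while the other incidents in the same cycle are still handled.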
Automation Features:
- Auto-install Node Exporter on new VMs via cloud-init
- Auto-register new VMs with Prometheus for monitoring
- Auto-inject SSH keys for immediate remediation access
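One common way to register new VMs without restarting Prometheus is file-based service discovery (`file_sd_configs`), where Prometheus watches a JSON file of target groups and picks up changes on its own. Whether Rift uses this exact mechanism is an assumption; the sketch below only shows what the registration step could look like. The function names, the `vm_name` label, and the file path are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

NODE_EXPORTER_PORT = 9100  # Node Exporter's default listen port


def _load(path: Path) -> list:
    """Read the current target groups, or an empty list if the file is missing."""
    return json.loads(path.read_text()) if path.exists() else []


def _save(path: Path, groups: list) -> None:
    # Write to a temp file and rename it, so Prometheus never reads a partial file.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    os.replace(tmp, path)


def register_target(path: Path, ip: str, name: str) -> None:
    """Add a VM to the target file, replacing any existing entry for the same IP."""
    target = f"{ip}:{NODE_EXPORTER_PORT}"
    groups = [g for g in _load(path) if target not in g["targets"]]
    groups.append({"targets": [target], "labels": {"vm_name": name}})
    _save(path, groups)


def deregister_target(path: Path, ip: str) -> None:
    """Remove a VM from the target file, for example after it is destroyed."""
    target = f"{ip}:{NODE_EXPORTER_PORT}"
    _save(path, [g for g in _load(path) if target not in g["targets"]])
```

A provisioning step could then call `register_target(Path("targets.json"), "10.0.0.5", "web-1")` once the VM has an IP address, with the Prometheus scrape job pointing its `file_sd_configs` at the same file.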
Challenges I ran into
- Agent Coordination: Orchestrating multiple AI agents to work together autonomously required careful state management and error handling
- SSH Authentication: Password prompts initially blocked automation - solved with the DigitalOcean SSH Keys feature
- Prometheus Integration: Had to build dynamic config management to add/remove targets without manual intervention
- Real-time Updates: Implementing WebSocket broadcasts for live incident updates across multiple clients
- Safety Validation: Ensuring AI-generated remediation commands are safe before execution
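To make the safety-validation challenge concrete, here is a minimal allowlist-based check for a single AI-generated command. It is a sketch under assumptions, not Rift's actual validator: the allowed binaries, the forbidden character set, and the `validate_command` name are all hypothetical, and a production policy would likely also inspect arguments.

```python
import shlex

# Hypothetical policy: the only binaries the remediation agent may run.
ALLOWED_BINARIES = {"systemctl", "journalctl", "df", "du", "free", "kill", "docker"}

# Characters that allow chaining, redirection, or substitution in a shell.
FORBIDDEN_FRAGMENTS = (";", "&", "|", ">", "<", "`", "$(", "\n")


def validate_command(command: str) -> tuple:
    """Return (is_safe, reason) for one AI-generated shell command."""
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # for example, unbalanced quotes
        return False, f"unparseable: {exc}"
    if not tokens:
        return False, "empty command"
    # Deliberately conservative: reject these characters even inside quotes.
    if any(fragment in command for fragment in FORBIDDEN_FRAGMENTS):
        return False, "shell metacharacters are not allowed"
    # Look past a leading sudo to find the binary that will actually run.
    binary = tokens[1] if tokens[0] == "sudo" and len(tokens) > 1 else tokens[0]
    if binary not in ALLOWED_BINARIES:
        return False, f"binary not allowlisted: {binary}"
    return True, "ok"
```

An allowlist is used here rather than a blocklist of known-dangerous commands because it fails closed: anything the policy has not explicitly approved, such as `rm -rf /`, is rejected by default.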
Accomplishments
✨ Zero-touch automation - From provisioning to monitoring to remediation, everything happens automatically
🤖 Autonomous healing - The system detects and fixes issues without human intervention
⚡ Fast incident response - Detects issues within 30 seconds and can remediate in under 1 minute
🔒 Production-ready safety - Built-in validation prevents destructive operations
🎯 Natural language provisioning - A request like "Create a web server with 2 GB RAM" results in deployed infrastructure
📊 Complete observability - Real-time dashboards, incident tracking, and audit logs
What I learnt
- AI orchestration is powerful but complex - Multi-agent systems need robust coordination
- Automation requires thoughtful design - Small details (like SSH keys) can block entire workflows
- Monitoring is critical - Without good metrics, autonomous systems are blind
- Safety first - AI-generated commands need validation before execution
- MCP pattern scales well - Abstraction layers make multi-cloud support feasible
- Cloud-init is underrated - First-boot automation eliminates manual VM setup
- DeepSeek R1 is impressive - Excellent for infrastructure code generation and problem analysis
What's next for Rift
🌍 Multi-cloud support - Extend to AWS, GCP, Azure with the same autonomous capabilities
🧠 Learning system - Build incident history database to improve diagnosis accuracy over time
📈 Predictive maintenance - Use ML to predict failures before they happen
🔐 Advanced security - Automated vulnerability scanning and patching
🎨 Visual infrastructure designer - Drag-and-drop interface for complex architectures
🤝 Team collaboration - Multi-user support with role-based access control
📱 Mobile app - Monitor and control infrastructure from anywhere
🔔 Smart alerting - Context-aware notifications that don't wake you up for minor issues
Built With
- amazon-web-services
- digitalocean
- fastapi
- mcp
- next.js
- prometheus
- rag
- terraform
