Rift - Project Story

Inspiration

The inspiration came from experiencing 3 AM production outages where developers scramble to diagnose and fix issues. I wanted to build a platform that never sleeps - an autonomous infrastructure guardian that detects, diagnoses, and fixes problems before humans even notice.

What it does

Rift is an autonomous infrastructure orchestration platform that:

  • Monitors cloud infrastructure every 30 seconds using Prometheus metrics
  • Detects incidents like high CPU, memory leaks, or disk space issues
  • Diagnoses root causes using AI-powered analysis
  • Remediates problems automatically by running commands over SSH
  • Provisions new infrastructure through natural language requests
  • Learns from every incident to improve future responses

All of this is powered by DigitalOcean Gradient AI (DeepSeek R1), with zero-touch automation.
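
To make the detection step concrete, here is a minimal sketch of turning a Prometheus instant-query response into incidents. It is illustrative only, not Rift's actual code: the PromQL expression, the threshold, the `find_incidents` helper, and the incident dictionary shape are all assumptions. Only the response layout (`status`, `data.result`, and `[timestamp, "value"]` pairs) follows the Prometheus HTTP API.

```python
from typing import Any

# Assumed PromQL for per-instance CPU usage in percent, based on Node Exporter metrics.
CPU_QUERY = '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)'


def find_incidents(response: dict[str, Any], threshold: float, kind: str) -> list[dict[str, Any]]:
    """Return one incident per series whose value exceeds the threshold.

    `response` is the parsed JSON body of a Prometheus /api/v1/query call.
    """
    if response.get("status") != "success":
        return []
    incidents = []
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        _timestamp, raw_value = series["value"]  # instant vectors carry [unix_ts, "string"]
        value = float(raw_value)
        if value > threshold:
            incidents.append({"instance": instance, "kind": kind, "value": value})
    return incidents


# Hand-written sample response (made-up values) in the shape Prometheus returns.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.0.0.5:9100"}, "value": [1700000000.0, "97.5"]},
            {"metric": {"instance": "10.0.0.6:9100"}, "value": [1700000000.0, "12.0"]},
        ],
    },
}

high_cpu = find_incidents(sample, threshold=90.0, kind="high_cpu")
```

In a real deployment the response would come from sending `CPU_QUERY` to the Prometheus server over HTTP; the sample dictionary stands in for that call so the sketch runs on its own.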

How I built it

Tech Stack:

  • Backend: FastAPI + Python (async)
  • Frontend: Next.js + TypeScript + Tailwind CSS
  • AI Engine: DigitalOcean Gradient AI (DeepSeek R1 model)
  • Monitoring: Prometheus + Node Exporter
  • Infrastructure: Terraform + DigitalOcean API
  • Automation: Cloud-init + SSH key distribution

Architecture:

  • Multi-agent system with specialized AI agents (Monitor, Diagnostic, Remediation, Provisioner)
  • MCP abstraction layer for cloud provider independence
  • Knowledge base (RAG) with runbooks and best practices
  • Real-time WebSocket updates for incident tracking
  • Autonomous loop running every 30 seconds
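
The 30-second loop and the hand-off between agents could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the `Incident` fields, the status strings, and the three agent callables are hypothetical stand-ins for the Monitor, Diagnostic, and Remediation agents.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class Incident:
    instance: str
    kind: str
    diagnosis: str = ""
    actions: list[str] = field(default_factory=list)
    status: str = "detected"


# Hypothetical agent interfaces: each agent is just an async callable.
MonitorFn = Callable[[], Awaitable[list[Incident]]]
DiagnoseFn = Callable[[Incident], Awaitable[str]]
RemediateFn = Callable[[Incident], Awaitable[list[str]]]


async def run_cycle(monitor: MonitorFn, diagnose: DiagnoseFn, remediate: RemediateFn) -> list[Incident]:
    """Run one detect -> diagnose -> remediate pass and return the incidents seen."""
    incidents = await monitor()
    for incident in incidents:
        try:
            incident.diagnosis = await diagnose(incident)
            incident.status = "diagnosed"
            incident.actions = await remediate(incident)
            incident.status = "resolved"
        except Exception as exc:
            # One failing incident must not stop the others from being handled.
            incident.status = f"failed: {exc}"
    return incidents


async def run_forever(monitor: MonitorFn, diagnose: DiagnoseFn, remediate: RemediateFn,
                      interval: float = 30.0) -> None:
    """Repeat the cycle, sleeping `interval` seconds between passes."""
    while True:
        await run_cycle(monitor, diagnose, remediate)
        await asyncio.sleep(interval)
```

Keeping each pass in `run_cycle` separate from the endless loop makes a single pass easy to test with stub agents, and recording a per-incident status is one simple way to keep state explicit when several agents touch the same incident.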

Automation Features:

  • Auto-install Node Exporter on new VMs via cloud-init
  • Auto-register new VMs with Prometheus for monitoring
  • Auto-inject SSH keys for immediate remediation access
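
One way to auto-register VMs with Prometheus without editing its main config is file-based service discovery: Prometheus watches a JSON file of target groups and picks up changes on its own. The sketch below assumes that approach; the write-up does not say which mechanism Rift uses. The function names and file path are hypothetical, and it presumes `prometheus.yml` already has a scrape job whose `file_sd_configs` points at the file.

```python
import json
import os
from pathlib import Path


def _load(path: Path) -> list[dict]:
    """Read the current target groups, or an empty list if the file is missing."""
    return json.loads(path.read_text()) if path.exists() else []


def _save(path: Path, groups: list[dict]) -> None:
    """Write to a sibling temp file, then swap it in so Prometheus never reads a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(groups, indent=2))
    os.replace(tmp, path)


def _without(groups: list[dict], address: str) -> list[dict]:
    """Drop `address` from every group, discarding groups that become empty."""
    kept = []
    for group in groups:
        targets = [t for t in group.get("targets", []) if t != address]
        if targets:
            kept.append({**group, "targets": targets})
    return kept


def register_target(path: Path, address: str, labels: dict[str, str]) -> None:
    """Add (or re-add) a scrape target such as "10.0.0.5:9100" with its labels."""
    groups = _without(_load(path), address)
    groups.append({"targets": [address], "labels": labels})
    _save(path, groups)


def deregister_target(path: Path, address: str) -> None:
    """Remove a scrape target, for example when a VM is destroyed."""
    _save(path, _without(_load(path), address))
```

A provisioning step could then call `register_target(Path("targets/rift.json"), "10.0.0.5:9100", {"role": "web"})` once a new VM has an IP address; the path and label here are placeholders.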

Challenges I ran into

  1. Agent Coordination: Orchestrating multiple AI agents to work together autonomously required careful state management and error handling
  2. SSH Authentication: Password prompts initially blocked automation; registering keys through DigitalOcean's SSH Keys feature solved this
  3. Prometheus Integration: Had to build dynamic config management to add/remove targets without manual intervention
  4. Real-time Updates: Implementing WebSocket broadcasts for live incident updates across multiple clients
  5. Safety Validation: Ensuring AI-generated remediation commands are safe before execution

Accomplishments

Zero-touch automation - From provisioning to monitoring to remediation, everything happens automatically

🤖 Autonomous healing - The system detects and fixes issues without human intervention

Fast incident response - Detects issues within 30 seconds and can remediate in under 1 minute

🔒 Production-ready safety - Built-in validation prevents destructive operations

🎯 Natural language provisioning - A request like "Create a web server with 2 GB RAM" is enough to get the infrastructure deployed

📊 Complete observability - Real-time dashboards, incident tracking, and audit logs

What I learnt

  • AI orchestration is powerful but complex - Multi-agent systems need robust coordination
  • Automation requires thoughtful design - Small details (like SSH keys) can block entire workflows
  • Monitoring is critical - Without good metrics, autonomous systems are blind
  • Safety first - AI-generated commands need validation before execution
  • MCP pattern scales well - Abstraction layers make multi-cloud support feasible
  • Cloud-init is underrated - First-boot automation eliminates manual VM setup
  • DeepSeek R1 is impressive - Excellent for infrastructure code generation and problem analysis
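
To illustrate the cloud-init point above, here is a small sketch that builds first-boot user data for a new VM. It assumes a Debian or Ubuntu image, where Node Exporter is packaged as `prometheus-node-exporter`. The `build_user_data` helper is hypothetical, and whether Rift injects keys through cloud-init or only through DigitalOcean's SSH Keys feature is not stated, so the key list is optional.

```python
import json


def build_user_data(ssh_public_keys: list[str]) -> str:
    """Return a #cloud-config document that installs Node Exporter on first boot."""
    lines = [
        "#cloud-config",
        "package_update: true",
        "packages:",
        "  - prometheus-node-exporter",  # assumes a Debian/Ubuntu package name
    ]
    if ssh_public_keys:
        lines.append("ssh_authorized_keys:")
        # json.dumps yields a double-quoted string, which is also a valid YAML scalar.
        lines.extend(f"  - {json.dumps(key)}" for key in ssh_public_keys)
    lines += [
        "runcmd:",
        "  - systemctl enable --now prometheus-node-exporter",
    ]
    return "\n".join(lines) + "\n"


# Placeholder key; a real public key would be supplied by the provisioning step.
user_data = build_user_data(["ssh-ed25519 AAAAEXAMPLE rift@controller"])
```

The resulting string would be passed as the user data field when creating the VM, so the machine exposes metrics on its first boot with no manual setup.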

What's next for Rift

🌍 Multi-cloud support - Extend to AWS, GCP, Azure with the same autonomous capabilities

🧠 Learning system - Build incident history database to improve diagnosis accuracy over time

📈 Predictive maintenance - Use ML to predict failures before they happen

🔐 Advanced security - Automated vulnerability scanning and patching

🎨 Visual infrastructure designer - Drag-and-drop interface for complex architectures

🤝 Team collaboration - Multi-user support with role-based access control

📱 Mobile app - Monitor and control infrastructure from anywhere

🔔 Smart alerting - Context-aware notifications that don't wake you up for minor issues
