Inspiration

DevOps teams waste hours provisioning infrastructure, writing deployment scripts, and managing cloud resources. We asked: what if you could just tell Kibana "deploy this to 100 VMs" and have it happen automatically? Elastic Infra Commander was born from the vision of making massively parallel deployments as simple as having a conversation.

What it does

Elastic Infra Commander turns Kibana into a conversational infrastructure control center. Users chat with Kibana's AI agent to deploy applications across many VMs in parallel:

  • Natural language deployments: "Deploy my app to 50 VMs" - that's it
  • Massively parallel execution: Deploy to 2, 10, or 100+ VMs simultaneously in ~50 seconds
  • Real-time monitoring: Track deployment progress through Elasticsearch indices
  • Instant preview URLs: Get secure, token-protected URLs for each deployed instance
  • Zero DevOps overhead: No Terraform, no Kubernetes, no infrastructure code

The system uses Elasticsearch as the orchestration backbone, with a distributed runner that polls for deployment requests and executes them across Blaxel's perpetual sandbox infrastructure.

How we built it

Architecture:

  1. Kibana Agent Builder - Natural language interface for deployment requests
  2. Elasticsearch Indices - Message queue and results storage (distributed-tool-requests, distributed-tool-results, deployment-logs)
  3. Distributed Runner - Python async worker that polls Elasticsearch and orchestrates parallel deployments
  4. Blaxel Sandboxes - Instant-launching VMs that deploy applications in isolated environments
  5. Workflow YAML Files - Agent Builder workflows for deployment, status checks, and VM listing

Tech Stack:

  • Python 3.11+ with asyncio for concurrent deployments
  • Elasticsearch 8.x for orchestration and logging
  • Blaxel SDK for VM provisioning and management
  • YAML-based workflow definitions for Kibana integration

Key Innovation: Using Elasticsearch as a distributed task queue allowed us to decouple the UI (Kibana) from execution (runner), enabling horizontal scaling and fault tolerance.
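The request/result flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's runner: two in-memory dictionaries stand in for the distributed-tool-requests and distributed-tool-results indices, and the names submit_request, handle, and poll_once, along with the "status" field and its values, are assumptions made for the example.

```python
import asyncio

# In-memory stand-ins for the two indices; a real runner would query Elasticsearch.
requests_index = {}   # request_id -> request document
results_index = {}    # request_id -> result document


def submit_request(request_id, num_vms):
    """What the Kibana side would do: write a pending request document."""
    requests_index[request_id] = {"num_vms": num_vms, "status": "pending"}


async def handle(request_id, doc):
    """Placeholder for the real deployment work (hypothetical)."""
    await asyncio.sleep(0)
    return {"request_id": request_id, "deployed": doc["num_vms"], "status": "completed"}


async def poll_once():
    """One polling pass: claim pending requests, process them, store results."""
    pending = [(rid, doc) for rid, doc in requests_index.items()
               if doc["status"] == "pending"]
    for _, doc in pending:
        doc["status"] = "running"   # mark as claimed so a later pass skips it
    for rid, doc in pending:
        results_index[rid] = await handle(rid, doc)
        doc["status"] = "completed"
    return len(pending)
```

Because the UI only writes requests and reads results, the runner can be restarted or replaced without touching Kibana. With several runners polling the same index, the claim step would need to be atomic, which this single-process sketch does not attempt.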

Challenges we ran into

  1. Blaxel SDK Evolution: The SDK deprecated sandbox.wait() mid-development. We had to search the documentation and adapt to the new instant-ready sandbox model, where VMs are available immediately without explicit waiting.

  2. Type Conversion Bug: Elasticsearch returned num_vms as a string, causing runtime errors. We fixed this by adding an explicit int() conversion in the runner.

  3. Document ID Mismatch: Initially, the runner let Elasticsearch auto-generate document IDs, so Kibana's agent failed when retrieving results by request_id. We solved this by using request_id as the document ID, which allows direct lookups.

  4. Process Execution Patterns: We learned to use wait_for_completion: True with timeouts for blocking commands (npm install, build) and wait_for_completion: False for background processes (servers).

  5. Preview URL Security: Implemented token-based authentication for preview URLs with 24-hour expiration to balance security and usability.
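The fixes for challenges 2 and 3 might look something like the sketch below. The helper names parse_num_vms and build_result_write, the fallback default of 1, and the "status" and "preview_urls" fields are hypothetical; only num_vms, request_id, and the distributed-tool-results index name come from the description above. The returned dictionary is meant to be passed as keyword arguments to an index call such as the Elasticsearch Python client's index(), assuming an 8.x client.

```python
def parse_num_vms(raw, default=1):
    """Coerce num_vms to a positive int; it may arrive from Elasticsearch as a string."""
    try:
        n = int(raw)
    except (TypeError, ValueError):
        return default          # unparseable input falls back to the default
    return n if n > 0 else default


def build_result_write(request_id, urls):
    """Build keyword arguments for an index call, keyed by request_id."""
    return {
        "index": "distributed-tool-results",
        "id": request_id,       # explicit ID, so results can be fetched by request_id
        "document": {
            "request_id": request_id,
            "status": "completed",
            "preview_urls": urls,
        },
    }
```

Setting the document ID explicitly means the agent can fetch the result with a direct get-by-ID instead of a search, which avoids waiting for the index to refresh.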

Accomplishments that we're proud of

  • Sub-minute parallel deployments: About 50 seconds to deploy, build, and serve applications across 2+ VMs simultaneously
  • True conversational infrastructure: Users write no YAML and run no CLI commands - just natural language
  • Production-ready architecture: Fault-tolerant design with comprehensive logging and error handling
  • Seamless Elasticsearch integration: Leveraged existing Elastic stack without custom infrastructure
  • Clean, minimal codebase: ~200 lines of Python for the entire distributed runner

What we learned

  • Elasticsearch works well as a task queue for distributed systems - it provides built-in persistence, querying, and near-real-time updates
  • Async Python with asyncio.gather() makes parallel VM deployments straightforward and fast
  • Blaxel's perpetual sandboxes eliminate cold start problems - VMs are ready in <25 ms from standby
  • Agent Builder workflows can orchestrate complex infrastructure operations through simple YAML definitions
  • Documentation matters: Blaxel's API evolved rapidly; staying current with docs was critical
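As an illustration of the asyncio.gather() point above, here is a minimal sketch of fanning out deployments concurrently. It is not the project's runner: deploy_to_vm is a hypothetical stand-in that only sleeps, where the real version would call the Blaxel SDK, and the failing parameter exists purely to show how one failure can be isolated from the rest.

```python
import asyncio
import time


async def deploy_to_vm(vm_index, fail=False):
    """Stand-in for one VM deployment; the sleep simulates provisioning and build time."""
    await asyncio.sleep(0.05)
    if fail:
        raise RuntimeError(f"vm-{vm_index} failed")
    return f"vm-{vm_index}"


async def deploy_all(num_vms, failing=()):
    """Run all deployments concurrently and split successes from failures."""
    tasks = [deploy_to_vm(i, fail=(i in failing)) for i in range(num_vms)]
    # return_exceptions=True keeps one failed VM from cancelling the whole batch.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    succeeded = [o for o in outcomes if not isinstance(o, Exception)]
    failed = [str(o) for o in outcomes if isinstance(o, Exception)]
    return succeeded, failed


if __name__ == "__main__":
    ok, bad = asyncio.run(deploy_all(20, failing={3}))
    print(len(ok), "succeeded;", bad)
```

Since every coroutine waits at the same time, the total wall-clock time stays close to that of the slowest single deployment rather than the sum of all of them.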

What's next for Elastic Infra Commander

  1. Auto-scaling: Automatically adjust VM count based on load metrics from Elasticsearch
  2. Multi-region deployments: Deploy across geographic regions for global applications
  3. Rollback capabilities: One-click rollback to previous deployments with state snapshots
  4. Cost optimization: Automatic VM hibernation during idle periods using Blaxel's standby mode
  5. CI/CD integration: GitHub Actions workflow to trigger deployments on push
  6. Custom runtime templates: Support for Python, Go, Rust, and other language runtimes beyond Node.js
  7. Health monitoring: Automated health checks and alerting through Elasticsearch Watcher
  8. Team collaboration: Multi-user support with RBAC for enterprise deployments
