Elastic Infra Commander

Inspiration

DevOps teams waste hours provisioning infrastructure, writing deployment scripts, and managing cloud resources. We asked: what if you could just tell Kibana "deploy this to 100 VMs" and have it happen automatically? Elastic Infra Commander was born from the vision of making massive parallel deployments as simple as having a conversation.

What it does

Elastic Infra Commander transforms Kibana into a conversational infrastructure control center. Users chat with Kibana's AI agent to deploy applications across many VMs in parallel:

Natural language deployments: "Deploy my app to 50 VMs" - that's it
Massively parallel execution: Deploy to 2, 10, or 100+ VMs simultaneously in ~50 seconds
Real-time monitoring: Track deployment progress through Elasticsearch indices
Instant preview URLs: Get secure, token-protected URLs for each deployed instance
Zero DevOps overhead: No Terraform, no Kubernetes, no infrastructure code

The system uses Elasticsearch as the orchestration backbone, with a distributed runner that polls for deployment requests and executes them across Blaxel's perpetual sandbox infrastructure.
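The polling side of this design can be sketched roughly as follows. This is a minimal illustration, not the project's actual runner. It assumes `client` behaves like the Elasticsearch 8.x Python client (`search` and `update`), and the `status` field, its `pending` and `running` values, and the `handle` callback are hypothetical names that do not come from this write-up.

```python
import asyncio
from typing import Any, Awaitable, Callable

REQUESTS_INDEX = "distributed-tool-requests"  # index name taken from this write-up


async def poll_once(
    client: Any,
    handle: Callable[[dict], Awaitable[None]],
    batch_size: int = 10,
) -> int:
    """Fetch pending requests once, run the handler on each, return the count."""
    # Assumption: each request document has a "status" field that starts as "pending".
    resp = client.search(
        index=REQUESTS_INDEX,
        query={"term": {"status": "pending"}},
        size=batch_size,
    )
    hits = resp["hits"]["hits"]
    for hit in hits:
        # Mark the request as taken before starting the slow deployment work.
        client.update(index=REQUESTS_INDEX, id=hit["_id"], doc={"status": "running"})
        await handle(hit["_source"])
    return len(hits)


async def run_forever(client: Any, handle, interval_s: float = 2.0) -> None:
    """Poll on a fixed interval until the process is stopped."""
    while True:
        await poll_once(client, handle)
        await asyncio.sleep(interval_s)
```

The sketch calls a synchronous client from inside a coroutine, which blocks the event loop during each request. It also does not guard against two runners claiming the same document. A real runner would need to address both.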

How we built it

Architecture:

Kibana Agent Builder - Natural language interface for deployment requests
Elasticsearch Indices - Message queue and results storage (distributed-tool-requests, distributed-tool-results, deployment-logs)
Distributed Runner - Python async worker that polls Elasticsearch and orchestrates parallel deployments
Blaxel Sandboxes - Instant-launching VMs that deploy applications in isolated environments
Workflow YAML Files - Agent Builder workflows for deployment, status checks, and VM listing

Tech Stack:

Python 3.11+ with asyncio for concurrent deployments
Elasticsearch 8.x for orchestration and logging
Blaxel SDK for VM provisioning and management
YAML-based workflow definitions for Kibana integration

Key Innovation: Using Elasticsearch as a distributed task queue allowed us to decouple the UI (Kibana) from execution (runner), enabling horizontal scaling and fault tolerance.

Challenges we ran into

Blaxel SDK Evolution: The SDK deprecated sandbox.wait() mid-development. We had to dig through the documentation and adapt to the new instant-ready sandbox model, where VMs are available immediately without explicit waiting.
Type Conversion Bug: Elasticsearch returned num_vms as a string, causing runtime errors. Fixed by adding explicit int() conversion in the runner.
Document ID Mismatch: Initially, the runner auto-generated Elasticsearch document IDs, causing Kibana's agent to fail when retrieving results by request_id. Solved by using request_id as the document ID for direct lookups.
Process Execution Patterns: Learned to use wait_for_completion: True with timeouts for blocking commands (npm install, build) and wait_for_completion: False for background processes (servers).
Preview URL Security: Implemented token-based authentication for preview URLs with 24-hour expiration to balance security and usability.
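Two of the fixes above, the num_vms conversion and the request_id document ID, might look roughly like this. It is a hedged sketch rather than the project's code: `parse_num_vms`, `store_result`, and `fetch_result` are hypothetical helper names, and `client` is assumed to behave like the Elasticsearch 8.x Python client (`index` and `get`).

```python
from typing import Any

RESULTS_INDEX = "distributed-tool-results"  # index name taken from this write-up


def parse_num_vms(raw: Any) -> int:
    """Coerce num_vms to a positive int; it may arrive as a string such as "50"."""
    count = int(raw)  # raises ValueError for non-numeric strings such as "many"
    if count < 1:
        raise ValueError(f"num_vms must be at least 1, got {count}")
    return count


def store_result(client: Any, request_id: str, result: dict) -> None:
    """Write the result under request_id instead of an auto-generated ID."""
    client.index(index=RESULTS_INDEX, id=request_id, document=result)


def fetch_result(client: Any, request_id: str) -> dict:
    """Look the result up directly by document ID, with no search query."""
    return client.get(index=RESULTS_INDEX, id=request_id)["_source"]
```

Because the document ID equals request_id, the agent can retrieve a result with a single get call instead of searching for a field value.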

Accomplishments that we're proud of

Sub-minute parallel deployments: About 50 seconds to deploy, build, and serve applications across 2+ VMs simultaneously
True conversational infrastructure: End users write no YAML and run no CLI commands - just natural language
Production-ready architecture: Fault-tolerant design with comprehensive logging and error handling
Seamless Elasticsearch integration: Leveraged existing Elastic stack without custom infrastructure
Clean, minimal codebase: ~200 lines of Python for the entire distributed runner

What we learned

Elasticsearch works well as a task queue for distributed systems - built-in persistence, querying, and near-real-time updates
Async Python with asyncio.gather() makes parallel VM deployments trivial and performant
Blaxel's perpetual sandboxes eliminate cold start problems - VMs are ready in <25ms from standby
Agent Builder workflows can orchestrate complex infrastructure operations through simple YAML definitions
Documentation matters: Blaxel's API evolved rapidly; staying current with docs was critical
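The asyncio.gather() pattern mentioned above can be illustrated with a small sketch. Here `deploy_to_vm` is a hypothetical stand-in that only sleeps. The real runner would call the Blaxel SDK at that point, which is not shown because this write-up does not include those calls.

```python
import asyncio


async def deploy_to_vm(vm_index: int) -> dict:
    """Hypothetical stand-in for provisioning one sandbox and starting the app."""
    await asyncio.sleep(0.1)  # simulates network-bound provisioning work
    return {"vm": vm_index, "status": "deployed"}


async def deploy_all(num_vms: int) -> list:
    """Start every deployment at once and collect results in input order."""
    tasks = [deploy_to_vm(i) for i in range(num_vms)]
    # return_exceptions=True keeps one failed VM from discarding the other results.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Calling `asyncio.run(deploy_all(5))` runs the five simulated deployments concurrently, so the total wait is close to one deployment's duration rather than the sum of all five.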

What's next for Elastic Infra Commander

Auto-scaling: Automatically adjust VM count based on load metrics from Elasticsearch
Multi-region deployments: Deploy across geographic regions for global applications
Rollback capabilities: One-click rollback to previous deployments with state snapshots
Cost optimization: Automatic VM hibernation during idle periods using Blaxel's standby mode
CI/CD integration: GitHub Actions workflow to trigger deployments on push
Custom runtime templates: Support for Python, Go, Rust, and other language runtimes beyond Node.js
Health monitoring: Automated health checks and alerting through Elasticsearch watchers
Team collaboration: Multi-user support with RBAC for enterprise deployments

Built With

asyncio
blaxel
elasticsearch
git
kibana
node.js
python
yaml

Updates

Success Nwachukwu started this project — Feb 27, 2026 11:56 AM EST
