Inspiration
Running AI workloads today is unreliable. Jobs fail, GPUs are unavailable, queues stall, and developers waste time debugging infrastructure instead of building.
We kept seeing the same pattern:
- A run fails → we retry → it suddenly works
- No config change in between → capacity simply became available
The problem isn’t access to GPUs. It’s fragmented, unreliable execution.
What it does
Jungle Grid is an intent-based execution layer for AI workloads and agents.
You don’t pick GPUs.
You describe the workload.
The system:
- Classifies the workload (inference, training, batch)
- Selects compatible GPU options
- Routes across multiple providers
- Retries automatically until it finds a viable run
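The routing and retry steps above can be sketched roughly as follows. This is a minimal illustration, not Jungle Grid's actual code: the `Workload`, `Offer`, and `Launcher` types, the VRAM-based compatibility check, and the provider names are all assumptions made for the example, and the workload is assumed to be classified already.

```go
package main

import (
	"errors"
	"fmt"
)

// Workload is a hypothetical intent description; field names are assumptions.
type Workload struct {
	Kind      string // "inference", "training", or "batch"
	MinVRAMGB int    // minimum GPU memory the job needs
}

// Offer is a hypothetical GPU option advertised by a provider.
type Offer struct {
	Provider string
	GPU      string
	VRAMGB   int
}

// Launcher starts a workload on an offer and returns an error if it fails.
type Launcher func(w Workload, o Offer) error

var errNoCapacity = errors.New("no capacity")

// compatible keeps only the offers with enough GPU memory for the workload.
func compatible(w Workload, offers []Offer) []Offer {
	var out []Offer
	for _, o := range offers {
		if o.VRAMGB >= w.MinVRAMGB {
			out = append(out, o)
		}
	}
	return out
}

// run tries each compatible offer in order and returns the first one that
// launches. If none does, it returns every failure joined into one error.
func run(w Workload, offers []Offer, launch Launcher) (Offer, error) {
	var errs []error
	for _, o := range compatible(w, offers) {
		if err := launch(w, o); err != nil {
			errs = append(errs, fmt.Errorf("%s/%s: %w", o.Provider, o.GPU, err))
			continue
		}
		return o, nil
	}
	if len(errs) == 0 {
		return Offer{}, errors.New("no compatible offers")
	}
	return Offer{}, errors.Join(errs...)
}

func main() {
	offers := []Offer{
		{"provider-a", "A100", 80},
		{"provider-b", "T4", 16},
		{"provider-c", "A100", 80},
	}
	// Simulated launcher: provider-a is out of capacity right now.
	launch := func(w Workload, o Offer) error {
		if o.Provider == "provider-a" {
			return errNoCapacity
		}
		return nil
	}
	o, err := run(Workload{Kind: "training", MinVRAMGB: 40}, offers, launch)
	fmt.Println(o.Provider, err) // prints "provider-c <nil>"
}
```

The caller never names a GPU: it states a requirement, and the failed attempt on provider-a is absorbed by moving on to the next compatible offer.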
We also introduced an agentic layer built on the Model Context Protocol (MCP):
- Agents can submit workloads directly
- Execution becomes part of autonomous workflows
- No human-in-the-loop infrastructure decisions
How we built it
- Go backend orchestrator for scheduling, routing, and failover
- Redis for job queue and real-time state
- PostgreSQL for persistence (jobs, nodes, users)
- Scoring engine using price, latency, reliability, and availability
- Node agent (distributed compute layer) for external GPU providers
- CLI + API for submission and integration
- Integrated with managed GPU providers for real execution
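To make the scoring engine more concrete, here is one way a score over price, latency, reliability, and availability could be computed. It is a sketch under assumptions: the `Candidate` fields, the weights, the normalisation, and the choice to treat availability as a hard filter rather than a weighted term are all illustrative and not taken from the project.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate holds hypothetical per-provider metrics; fields and units are assumptions.
type Candidate struct {
	Name         string
	PricePerHour float64 // USD per GPU hour
	LatencyMs    float64 // observed round-trip latency to the provider
	Reliability  float64 // fraction of recent jobs that completed, 0..1
	Available    bool    // whether the provider currently reports capacity
}

// Weights are illustrative values, not tuned numbers from the project.
type Weights struct {
	Price, Latency, Reliability float64
}

// score returns a higher-is-better value. Price and latency are scaled to
// 0..1 against the worst candidate and inverted so cheaper and faster win.
func score(c Candidate, w Weights, maxPrice, maxLatency float64) float64 {
	price := 1 - c.PricePerHour/maxPrice
	latency := 1 - c.LatencyMs/maxLatency
	return w.Price*price + w.Latency*latency + w.Reliability*c.Reliability
}

// rank drops unavailable candidates and sorts the rest best-first.
func rank(cs []Candidate, w Weights) []Candidate {
	var live []Candidate
	maxPrice, maxLatency := 0.0, 0.0
	for _, c := range cs {
		if !c.Available {
			continue
		}
		live = append(live, c)
		if c.PricePerHour > maxPrice {
			maxPrice = c.PricePerHour
		}
		if c.LatencyMs > maxLatency {
			maxLatency = c.LatencyMs
		}
	}
	// Guard against division by zero when every metric is zero.
	if maxPrice == 0 {
		maxPrice = 1
	}
	if maxLatency == 0 {
		maxLatency = 1
	}
	sort.SliceStable(live, func(i, j int) bool {
		return score(live[i], w, maxPrice, maxLatency) > score(live[j], w, maxPrice, maxLatency)
	})
	return live
}

func main() {
	cs := []Candidate{
		{"cheap-flaky", 1.0, 200, 0.60, true},
		{"pricey-solid", 2.0, 100, 0.99, true},
		{"offline", 0.5, 50, 0.99, false},
	}
	for _, c := range rank(cs, Weights{Price: 0.3, Latency: 0.2, Reliability: 0.5}) {
		fmt.Println(c.Name)
	}
}
```

With reliability weighted at 0.5, the more expensive but dependable candidate ranks ahead of the cheaper one, which matches the point that completed workloads matter more than raw price.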
Challenges we ran into
- Capacity fragmentation: GPUs exist, but not where/when you need them
- Provider inconsistencies: different failure modes, APIs, and behaviors
- Cold starts & queue delays: unpredictable execution timing
- Image/runtime mismatches: jobs failing due to environment issues
- Retry design: building a system that keeps trying long enough for capacity to appear, without giving up too early
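One common way to approach the retry challenge is to separate transient failures (no capacity yet) from permanent ones (such as an image/runtime mismatch) and to bound retries with a budget and capped exponential backoff. The sketch below shows that general technique; the sentinel errors, the `retry` helper, and its parameters are hypothetical and not the project's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors used to classify failures.
var (
	ErrNoCapacity = errors.New("no capacity")       // transient: worth retrying
	ErrPermanent  = errors.New("permanent failure") // e.g. image/runtime mismatch
)

// retry calls attempt until it succeeds, returns a permanent error, or the
// attempt budget runs out. The wait doubles after each transient failure,
// capped at maxDelay. It returns the number of attempts made.
func retry(maxAttempts int, base, maxDelay time.Duration, attempt func(n int) error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	delay := base
	var err error
	for n := 1; n <= maxAttempts; n++ {
		err = attempt(n)
		if err == nil {
			return n, nil
		}
		if errors.Is(err, ErrPermanent) {
			return n, err // retrying cannot fix this, so stop immediately
		}
		if n < maxAttempts {
			time.Sleep(delay)
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulated job: capacity shows up on the third attempt.
	n, err := retry(5, 10*time.Millisecond, 100*time.Millisecond, func(n int) error {
		if n < 3 {
			return ErrNoCapacity
		}
		return nil
	})
	fmt.Println(n, err) // prints "3 <nil>"
}
```

The budget keeps a job from looping forever, while the permanent-error check avoids spending that budget on failures that another attempt cannot fix.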
What we learned
- Reliability matters more than raw compute
- Developers don’t want GPUs; they want completed workloads
- The future isn’t GPU selection; it’s intent-based execution
- Agent-driven systems need infrastructure abstraction, not exposure
What’s next
- Expand provider coverage
- Improve scheduling with real-time latency + region awareness
- Deepen agent (MCP) integration for autonomous execution
- Build a global distributed supply layer via node agents
Built With
- cli-tooling
- distributed-systems
- docker
- github-actions
- go
- gpu-compute-(cuda)
- multi-provider
- node.js
- postgresql
- redis
- rest-apis
- typescript