About the Project: Vibindu ✨

This documentation details the story, architecture, and engineering effort behind Vibindu—the world's first "Vibe Coding" industrial automation project driven entirely by voice.


💡 What Inspired Us

Industrial automation and PLC (Programmable Logic Controller) programming have traditionally been rigid, deeply technical fields requiring specialized software and steep learning curves. We asked ourselves: What if designing complex automated sequences was as intuitive as having a conversation with a senior engineer?

Inspired by the release of Google's Gemini Live, which supports real-time multimodal interaction, we envisioned a platform where engineers could simply "speak" their factory processes into existence. We wanted to bridge the gap between human intent and machine execution, transforming the tedious task of SFC (Sequential Function Chart) implementation into a fluid, voice-driven "Vibe Coding" experience.


🏗️ How We Built It

We designed the system with a modern, modular architecture deployed on Google Cloud Platform, using WebSockets and REST APIs to bridge the interface and the AI core.

App Architecture & Cloud Deployment

Our tech stack is cleanly decoupled into containerized microservices, all deployed seamlessly on Google Cloud Run to ensure auto-scaling, high availability, and secure secret management:

  1. Frontend (Cloud Run - React + TypeScript): An intuitive visual editor utilizing Vite and Zustand. It captures raw audio and provides a real-time rendering engine for GRAFCET diagrams.
  2. Backend (Cloud Run - AdonisJS): A robust Node.js middleware responsible for API routing, compiling our custom SFC DSL, managing user-scoped project state (via PostgreSQL & Cloud Storage), and serving the WebSocket relay.
  3. AI Core (Cloud Run - Python FastAPI): A powerful, autonomous agentic cluster built with the Google Agent Development Kit (ADK) that interacts directly with the suite of Google Gemini foundational models.

The most important part of the backend is the Backend Compiler Context. Rather than blindly accepting LLM output, our backend attempts to statically compile the AI-generated logic.
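To make the idea concrete, here is a minimal sketch of such a compile-validate-fix loop. Every name here (`compile_sfc`, `generate_with_feedback`, the `TRANSITION` check) is an illustrative stand-in, not Vibindu's actual DSL or API:

```python
# Minimal sketch of a compiler feedback loop for LLM-generated logic.
# All identifiers are illustrative, not the real Vibindu backend.
from dataclasses import dataclass


@dataclass
class CompileResult:
    ok: bool
    errors: list


def compile_sfc(source: str) -> CompileResult:
    """Stand-in static compiler: rejects sequences that define no transitions."""
    errors = []
    if "TRANSITION" not in source:
        errors.append("sequence defines no transitions")
    return CompileResult(ok=not errors, errors=errors)


def generate_with_feedback(ask_llm, prompt: str, max_rounds: int = 3) -> str:
    """Re-prompt the model with compiler errors until the DSL compiles."""
    source = ask_llm(prompt)
    for _ in range(max_rounds):
        result = compile_sfc(source)
        if result.ok:
            return source
        # Feed the diagnostics back so the agent can repair its own output.
        source = ask_llm(f"{prompt}\nFix these compiler errors: {result.errors}")
    raise RuntimeError("could not produce compilable SFC logic")
```

The key design point is that the compiler, not the model, is the arbiter of correctness: the LLM only ever sees structured diagnostics, never a silent acceptance of broken logic.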

Agent Architecture & Google AI Models

Vibindu is not powered by a single prompt. It orchestrates 7 specialized Google AI models through a collaborative Swarm of AI Agents that handles the entire engineering lifecycle:

  1. Live Agent (Gemini 2.5 Flash Native Audio API): The central bridge handling real-time, low-latency voice interaction. By piping raw PCM audio via WebSockets to the Native Audio API, it interprets voice commands instantly and handles A2A (Agent-to-Agent) dispatching.
  2. ThinkingForge Swarm (Gemini 3.1 Pro via ADK): The heavy-lifting reasoning cluster. It groups the Analyst, Gemma Architect, SFC Engineer, and Simulation Agents, which extract the physical I/O and then run iterative "Code -> Compile -> Fix" loops until the generated automation sequence compiles cleanly and satisfies the GRAFCET state-evolution rules. Let it think, and it builds production-grade SFC logic.
  3. Computer / UI Navigator Agent (Gemini 2.5 Computer Use & Gemini Nano): Operates graphical interfaces. It takes encoded UI screenshots via the Computer Use model to compute precise (X,Y) clicks, falling back to local Gemini Nano for fast DOM parsing.
  4. Storyteller Agent (Veo 3.0 Fast, Imagen 4.0, Gemini TTS): A creative agent that documents the journey. It autonomously prompts Veo 3.0 for cinematic backgrounds, Imagen 4.0 for asset generation, and Gemini TTS to narrate the finalized project media.
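As a concrete illustration of the Live Agent's audio path, here is a hedged sketch of how raw PCM audio could be framed before being relayed over the WebSocket bridge. The 16 kHz sample rate and 20 ms frame size are assumptions for the example, not Vibindu's actual streaming parameters:

```python
# Sketch of framing raw PCM audio for low-latency WebSocket relay.
# Frame size and sample rate are assumed values, not the real config.
FRAME_SAMPLES = 320       # 20 ms at 16 kHz
BYTES_PER_SAMPLE = 2      # 16-bit little-endian PCM


def frame_pcm(pcm: bytes) -> list:
    """Split a raw PCM buffer into fixed-size frames, padding the tail with silence."""
    frame_bytes = FRAME_SAMPLES * BYTES_PER_SAMPLE
    frames = [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
    if frames and len(frames[-1]) < frame_bytes:
        # Zero-fill the final partial frame so every frame has equal duration.
        frames[-1] = frames[-1] + b"\x00" * (frame_bytes - len(frames[-1]))
    return frames
```

Fixed-size frames keep per-message latency predictable, which matters when the goal is a response fast enough not to break the user's train of thought.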

🎓 What We Learned

Building Vibindu taught us several profound lessons about LLMs and domain-specific engineering:

  • Agents Need Tools to Validate: Initially, asking an LLM to write raw logic unassisted resulted in frequent hallucinations. The system only reached autonomy once we built the Compiler Feedback Loop. If the code breaks, the agent fixes it.
  • Latency is Key for Voice: Achieving fluid conversation required extensive tuning of the WebSocket bridge, ensuring that the Live Agent responds without breaking the user's train of thought.
  • Math & Logic Representation: GRAFCET sequences represent state machines where transitions must strictly adhere to Boolean logic. For any state $S_n$, the fundamental activation equation can be represented as: $$ S_n(t+1) = \left( S_{n-1} \cdot T_{n-1} + S_n \right) \cdot \overline{T_{n}} $$ Teaching our custom SFC Engineer agent to adhere rigorously to this state-evolution math (via prompt engineering and compiler feedback) was our biggest breakthrough.
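That activation equation translates directly into code. A minimal Python rendering of the state evolution (step and transition names here are generic, not tied to any particular sequence):

```python
def next_step_state(s_prev: bool, t_prev: bool, s_n: bool, t_n: bool) -> bool:
    """GRAFCET step evolution: S_n(t+1) = (S_{n-1} * T_{n-1} + S_n) * NOT T_n.

    A step activates when its predecessor step is active and fires its
    transition, stays active while its own outgoing transition is false,
    and deactivates the instant its outgoing transition T_n fires.
    """
    return (s_prev and t_prev or s_n) and not t_n
```

Encoding the rule this explicitly is what lets a compiler (rather than a prompt) verify that every generated transition respects Boolean activation semantics.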

🚧 Challenges We Faced

  1. Deterministic Logic from Probabilistic Models: PLC systems run factories; they cannot afford a "hallucination" in an automated drill or conveyor sequence. Bridging the probabilistic nature of LLMs with the absolute deterministic necessity of industrial control was incredibly difficult.
  2. Real-time A2A (Agent-to-Agent) Communication: Coordinating 6+ specialized agents asynchronously while maintaining a seamless, single-timeline voice session with the user required building a robust, custom internal transaction log and event dispatcher.
  3. GCP Quotas and Scalability: Scaling our FastAPI swarm deployment on Google Cloud Run meant aggressively optimizing our container sizes and resolving port-binding issues that surfaced under concurrency pressure.
  4. Vibe Coding with Voice: Converting voice directly to executable industrial architecture meant dealing with vague verbal instructions. Our Analyst and Architect agents had to be strictly programmed to ask clarifying questions before attempting to generate the logic.
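A toy version of the internal transaction log and event dispatcher described in point 2 might look like the sketch below; the event names and payload shape are hypothetical:

```python
# Toy A2A event dispatcher with an ordered transaction log.
# Event names and payloads are illustrative, not Vibindu's real schema.
import asyncio
from dataclasses import dataclass, field


@dataclass
class Dispatcher:
    log: list = field(default_factory=list)       # single ordered timeline
    handlers: dict = field(default_factory=dict)  # event name -> handler list

    def subscribe(self, event: str, handler):
        self.handlers.setdefault(event, []).append(handler)

    async def dispatch(self, event: str, payload: dict):
        # Record first, then fan out: the log stays the source of truth
        # even if a downstream agent handler fails.
        self.log.append((event, payload))
        await asyncio.gather(*(h(payload) for h in self.handlers.get(event, [])))
```

The append-first ordering is the point: every agent interaction lands on one timeline, which is what keeps the user-facing voice session coherent while agents work asynchronously behind it.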

Vibindu proves that the future of industrial logic design isn't found in a complex toolset, but rather embedded in natural, intelligent conversation.
