Video demo (we couldn't upload it directly because of Wi-Fi issues): https://docs.google.com/document/d/1jgZlpW468rwLQ5fCjtlQqrQzp8c2zg3cW8bx6mJkYeQ/edit?usp=sharing

🚀 Agen-Teach: Multi-Agent Workflow Learning System

🧠 Inspiration

We wanted to build an AI system that could learn the way humans teach each other: by showing, not scripting. Most automation tools today require rigid prompts or programming. But humans naturally teach through a mix of talking, showing, and doing. We imagined an AI that could watch your screen, listen to your explanation, and learn your workflow, turning that demonstration into a reusable automation skill.

That idea became Agen-Teach, a system that learns by observing you.

💡 What It Does

Agen-Teach learns new skills simply by watching and listening as you demonstrate a task on your computer. It combines voice interaction, screen observation, and tool experimentation to generate reusable AI Skills that can be executed later through 500+ integrations (Slack, Gmail, Notion, Discord, etc.).

Key features:

๐ŸŽ™๏ธ Voice-Guided Teaching โ€“ You explain tasks naturally; the system asks clarifying questions.

๐Ÿ‘€ Real-Time Screen Observation โ€“ Watches your actions and understands workflow context.

โš™๏ธ Parallel Multi-Agent Collaboration โ€“ Three AI agents (Voice, Listener, Executor) work together in real time.

๐Ÿงฉ Automatic Skill Generation โ€“ Produces Claude-compatible Skills with parameters, examples, and test cases.

๐Ÿ” Live Feedback Loop โ€“ Shows skill construction in real time as you demonstrate.

The result: a workflow youโ€™d normally โ€œteach a personโ€ โ€” now teachable directly to AI.
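To make "Skills with parameters, examples, and test cases" more concrete, here is a minimal Python sketch of what a generated skill could look like in memory and how it could be rendered to text. The class names (`SkillSpec`, `SkillParameter`), the field names, the sample skill, and the Markdown layout are all assumptions made for illustration; they are not Agen-Teach's actual output format.

```python
from dataclasses import dataclass, field


@dataclass
class SkillParameter:
    """One input the skill needs at run time (hypothetical shape)."""
    name: str
    type: str
    description: str
    required: bool = True


@dataclass
class SkillSpec:
    """Hypothetical container for a skill learned from one demonstration."""
    name: str
    description: str
    parameters: list[SkillParameter] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    test_cases: list[dict] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the spec as Markdown with a small metadata header."""
        lines = [
            "---",
            f"name: {self.name}",
            f"description: {self.description}",
            "---",
            "",
            "## Parameters",
        ]
        for p in self.parameters:
            flag = "required" if p.required else "optional"
            lines.append(f"- `{p.name}` ({p.type}, {flag}): {p.description}")
        lines += ["", "## Examples"]
        lines += [f"- {example}" for example in self.examples]
        lines += ["", "## Test cases"]
        for case in self.test_cases:
            lines.append(f"- input: {case['input']} -> expect: {case['expect']}")
        return "\n".join(lines)


def example_skill() -> SkillSpec:
    """Build a made-up skill, as if learned from a Slack demonstration."""
    return SkillSpec(
        name="post-standup-summary",
        description="Post a daily standup summary to a Slack channel",
        parameters=[SkillParameter("channel", "string", "Slack channel to post in")],
        examples=["Post today's standup summary to #team-updates"],
        test_cases=[{"input": {"channel": "#team-updates"}, "expect": "message posted"}],
    )
```

Calling `example_skill().to_markdown()` yields a single text document that a person can review before the skill is reused.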

๐Ÿ—๏ธ How We Built It

Agen-Teach is powered by a multi-agent architecture running across three coordinated runtimes:

Main Application (Bun/TypeScript)

Hosts the Voice Agent (OpenAI Realtime API) and Listener Agent (action detector).

Uses a WebSocket event bus for synchronized data exchange.
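The real event bus runs in Bun over WebSockets. The sketch below only illustrates the topic-based publish/subscribe pattern behind it, in-process and in Python for brevity. The `EventBus` class and the topic names (`voice.transcript`, `listener.action`) are hypothetical and not taken from the project's code.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[dict[str, Any]], Awaitable[None]]


class EventBus:
    """Minimal in-process pub/sub: handlers register per topic."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload: dict[str, Any]) -> int:
        """Deliver payload to every handler on the topic; return how many ran."""
        handlers = self._subscribers.get(topic, [])
        await asyncio.gather(*(handler(payload) for handler in handlers))
        return len(handlers)


async def demo() -> list[str]:
    """Wire a pretend skill builder to voice and listener topics."""
    bus = EventBus()
    seen: list[str] = []

    async def on_transcript(event: dict[str, Any]) -> None:
        seen.append(f"transcript: {event['text']}")

    async def on_action(event: dict[str, Any]) -> None:
        seen.append(f"action: {event['kind']}")

    bus.subscribe("voice.transcript", on_transcript)
    bus.subscribe("listener.action", on_action)

    await bus.publish("voice.transcript", {"text": "open Slack"})
    await bus.publish("listener.action", {"kind": "click"})
    await bus.publish("unknown.topic", {})  # no subscribers, silently ignored
    return seen
```

Keeping the agents decoupled behind topics like this means a new consumer, such as a UI overlay, can subscribe without the Voice or Listener Agent knowing about it.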

Executor Agent (Python + FastAPI)

Powered by Claude Sonnet 4.5, orchestrating tool calls via MCP and Composio.

Executes up to 20 parallel actions with validation feedback.
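One common way to cap concurrency at 20 while still collecting per-action validation results is an `asyncio.Semaphore`. The sketch below assumes that approach; `run_actions`, the result dictionary shape, and the fake tool call are hypothetical stand-ins, not the Executor Agent's actual code.

```python
import asyncio
from typing import Any, Awaitable, Callable

MAX_PARALLEL = 20  # the limit quoted above

Action = Callable[[], Awaitable[Any]]


async def run_actions(
    actions: list[Action],
    validate: Callable[[Any], bool],
    limit: int = MAX_PARALLEL,
) -> list[dict[str, Any]]:
    """Run actions concurrently, at most `limit` at a time, and validate each."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(index: int, action: Action) -> dict[str, Any]:
        async with semaphore:
            try:
                result = await action()
            except Exception as exc:  # a failed tool call becomes feedback
                return {"index": index, "ok": False, "error": str(exc)}
            return {"index": index, "ok": validate(result), "result": result}

    # gather preserves input order, so results line up with actions
    return await asyncio.gather(*(run_one(i, a) for i, a in enumerate(actions)))


async def demo() -> tuple[list[dict[str, Any]], int]:
    """Run 50 fake tool calls and record the peak number in flight."""
    running = 0
    peak = 0

    async def fake_tool_call(n: int) -> int:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        if n == 3:
            raise RuntimeError("tool failed")
        return n * 2

    actions = [lambda n=n: fake_tool_call(n) for n in range(50)]
    results = await run_actions(actions, validate=lambda r: isinstance(r, int))
    return results, peak
```

Failures are returned as data instead of being raised, so one broken tool call does not cancel the other 49 and the caller can feed the error back to the model.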

Desktop UI (Tauri + React)

Transparent overlay to visualize captured frames, voice logs, and live skill compilation.

We used:

🧠 OpenAI Realtime API for natural voice interaction

🤖 Anthropic Claude Sonnet 4.5 for action execution and reasoning

🔌 Composio for tool integration

⚡ Bun runtime for low-latency event streaming

🪟 Tauri for a lightweight cross-platform desktop app

🧱 Architecture Overview

┌──────────────────────────┐
│    Desktop UI (Tauri)    │
│ Voice + Screen Feedback  │
└───────────┬──────────────┘
            │ WebSocket
┌───────────┴──────────────┐
│      Main App (Bun)      │
│  Voice Agent | Listener  │
│ Skill Builder | Pub/Sub  │
└───────────┬──────────────┘
            │ HTTP
┌───────────┴──────────────┐
│ Executor Agent (Python)  │
│ Claude + MCP + Composio  │
└──────────────────────────┘

๐Ÿ˜ตโ€๐Ÿ’ซ Challenges We Ran Into

Realtime synchronization between audio, video, and actions โ€” getting frames, transcripts, and events aligned was tough.

Voice stream stability โ€” balancing 24kHz audio with OpenAI Realtime WebSockets.

Authentication setup โ€” juggling multiple API keys (OpenAI, Anthropic, Composio) across environments.

Cross-runtime communication โ€” Bun โ†” FastAPI โ†” Tauri integration required deep debugging.

Screen recording permissions on macOS โ€” lots of test cycles.
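The synchronization challenge at the top of this list comes down to joining three streams on a shared clock. A minimal sketch, assuming every frame, utterance, and action carries a timestamp in seconds from the same clock: each action is paired with the most recent frame and with whatever was being said at that moment. The record types, the `align` function, and the one-second staleness cutoff are illustrative assumptions, not the project's implementation.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    ts: float
    frame_id: str


@dataclass
class Utterance:
    start: float
    end: float
    text: str


@dataclass
class Action:
    ts: float
    kind: str


def align(
    actions: list[Action],
    frames: list[Frame],
    utterances: list[Utterance],
    max_frame_age: float = 1.0,
) -> list[dict[str, Optional[str]]]:
    """Attach the latest recent frame and the overlapping utterance to each action."""
    frames = sorted(frames, key=lambda f: f.ts)
    frame_times = [f.ts for f in frames]
    aligned = []
    for action in sorted(actions, key=lambda a: a.ts):
        # index of the last frame captured at or before the action
        i = bisect_right(frame_times, action.ts) - 1
        frame_id = None
        if i >= 0 and action.ts - frames[i].ts <= max_frame_age:
            frame_id = frames[i].frame_id
        said = next(
            (u.text for u in utterances if u.start <= action.ts <= u.end), None
        )
        aligned.append({"action": action.kind, "frame": frame_id, "said": said})
    return aligned


def demo() -> list[dict[str, Optional[str]]]:
    frames = [Frame(0.0, "f0"), Frame(0.5, "f1"), Frame(1.0, "f2")]
    utterances = [Utterance(0.2, 0.9, "open Slack"), Utterance(2.0, 3.0, "type the message")]
    actions = [Action(2.5, "type"), Action(0.6, "click"), Action(5.0, "scroll")]
    return align(actions, frames, utterances)
```

Dropping frames older than the cutoff avoids pairing an action with a screenshot that no longer reflects the screen, at the cost of some actions having no frame at all.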

๐Ÿ† Accomplishments Weโ€™re Proud Of

Built a working voice-and-vision multi-agent architecture from scratch in under 48 hours.

Achieved real-time teaching feedback โ€” you can literally see the skill being built as you talk.

Seamless integration of Claude and OpenAI models for collaborative reasoning.

Designed a transparent Tauri UI overlay that feels magical โ€” the AI โ€œwatchingโ€ your workflow.

Created reusable Skill specifications that can run on any MCP-compatible system.

📚 What We Learned

How to orchestrate multiple large models together effectively: streaming data, event memory, and reasoning handoff.

The importance of clear event schemas: even the smallest sync bug between agents breaks the illusion of “learning”.

That teaching AI by demonstration feels far more natural than prompt engineering.

Bun is ridiculously fast for real-time TypeScript systems.
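As an illustration of the event-schema lesson above, the sketch below validates every incoming message against one shared envelope before any agent acts on it. The envelope fields (`type`, `source`, `ts_ms`, `payload`) and the set of event types are assumptions chosen for the example, not the schema Agen-Teach uses.

```python
import json
from dataclasses import dataclass
from typing import Any

# Hypothetical event types exchanged between the three agents
KNOWN_TYPES = {"voice.transcript", "listener.action", "executor.result"}
REQUIRED_FIELDS = {"type", "source", "ts_ms", "payload"}


@dataclass(frozen=True)
class Event:
    type: str
    source: str
    ts_ms: int
    payload: dict[str, Any]


def parse_event(raw: str) -> Event:
    """Parse a JSON message and reject anything that does not fit the envelope."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("event must be a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["type"] not in KNOWN_TYPES:
        raise ValueError(f"unknown event type: {data['type']}")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(data["ts_ms"], int) or isinstance(data["ts_ms"], bool):
        raise ValueError("ts_ms must be an integer number of milliseconds")
    if not isinstance(data["payload"], dict):
        raise ValueError("payload must be an object")
    return Event(data["type"], str(data["source"]), data["ts_ms"], data["payload"])
```

Rejecting malformed events at the boundary turns a silent desynchronization between agents into an immediate, debuggable error.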

🔮 What’s Next for Agen-Teach

🎧 Voice cloning + multimodal memory to preserve user teaching style

🧠 Skill Library & Marketplace – share and remix learned workflows

🌐 Cloud sync + team training – let multiple users co-teach one AI

💻 Browser extension for observing web workflows

🛠️ Skill validation sandbox – auto-test before deployment

🧩 Built With

OpenAI Realtime API • Claude Sonnet 4.5 • Bun • FastAPI • Composio • Tauri • React • TypeScript • Python
