Iris-chan

Inspiration

We live in a strange paradox. We send two-minute voice notes on WhatsApp, jump on video calls across continents, and speak to friends and colleagues naturally. But with AI? We open a sandboxed app, switch windows, type, wait, copy, switch again, paste — and call this progress.

In 2018, Google demoed an assistant that could make phone calls autonomously. The world watched in awe. And then… nothing. The industry stagnated. Innovation became incremental instead of foundational.

Iris-chan starts from a simple question: What if computers reacted to what we look at, the same way humans react to what we pay attention to?

What it does

Iris-chan is a persistent, voice-first AI orchestrator that:

Sees what you see — context-aware visual understanding of your screen
Speaks and listens naturally — voice-first interaction with no prompt engineering required
Commands agent teams — delegates tasks to specialized sub-agents (Iris-chan + clawdbot)
Maintains continuity — durable memory, state, and preferences across sessions
Observes and reports — live task monitoring, cost tracking, and decision visibility

Key capabilities include:

Intent-driven interaction (predict user needs before explicit commands)
Continuous screen vocabulary extraction
Automatic 2FA detection and autofill
Runtime behavioral modes: proactive, silent, feedback, introversion
Daily research reports via autonomous web exploration
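
To make the 2FA capability concrete, here is a minimal sketch of how a one-time code could be picked out of text already extracted from the screen. It is an illustration under assumptions, not Iris-chan's actual detector: the function name detectTwoFactorCode, the keyword list, and the six-digit format are all hypothetical.

```typescript
// Hypothetical helper, not taken from the Iris-chan codebase.
// Assumption: screen content has already been converted to plain text.
const KEYWORDS = /\b(verification|security|login|one-time|2fa|otp)\b/i;
const SIX_DIGITS = /\b(\d{6})\b/;

function detectTwoFactorCode(screenText: string): string | null {
  for (const line of screenText.split("\n")) {
    // Only trust digits that appear on a line mentioning a 2FA-like keyword.
    if (!KEYWORDS.test(line)) continue;
    const match = SIX_DIGITS.exec(line);
    if (match) return match[1] ?? null;
  }
  return null;
}
```

Requiring a keyword on the same line is a deliberately conservative choice: it avoids autofilling order numbers or other six-digit strings that happen to be on screen.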

How we built it

Architecture pillars:

Persistence — Durable state stored in Convex with event logs (what was attempted, what worked, why)
Hot-reload — Update logic and skills without killing the entity or losing context
Skill system — Detect, version, and test stabilized capabilities
Focus bridge — Focus as the "project compiler" (Specs → Tasks → execution), Iris-chan as the "conductor"
Visual representation — A live UI showing agents, tasks, blockers, decisions, cost, and latency
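
The persistence pillar can be sketched as an append-only event log that is replayed to rebuild state. This is a rough illustration only: it keeps events in memory rather than in Convex, and the names AgentEvent, EventLog, and replay are assumptions, not the project's schema.

```typescript
// Hypothetical in-memory stand-in for a durable event log.
// Assumption: each event records what was attempted, whether it worked, and why.
type Outcome = "worked" | "failed";

interface AgentEvent {
  taskId: string;
  attempted: string;
  outcome: Outcome;
  why: string;
}

class EventLog {
  private readonly events: AgentEvent[] = [];

  append(event: AgentEvent): void {
    this.events.push(event);
  }

  // Rebuild the latest outcome per task by replaying events in order.
  replay(): Map<string, Outcome> {
    const state = new Map<string, Outcome>();
    for (const e of this.events) state.set(e.taskId, e.outcome);
    return state;
  }
}
```

Because state is derived from the log rather than stored separately, a restarted agent could recover where it left off by replaying the same events.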

Tech stack: Voice via Gemini Live API, GUI execution via UI-TARS, knowledge retrieval via Perplexity, storage via Convex, orchestration runtime built in TypeScript/Node.js.

Challenges we ran into

The recompilation paradox — Every prompt change breaks context. We designed a continuous-state model to eliminate this.
Tool opacity — Cognitive latency (understanding what's happening) is often worse than machine latency. We made debugging native and observability first-class.
Keeping an agent "alive" — Most frameworks produce disposable runs. Designing persistence, hot-reload, and identity required rethinking the agent lifecycle entirely.
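
One way to picture hot-reload without losing context is to keep the agent's state outside its skills, so a skill can be swapped while the state object survives. The sketch below assumes this separation; SkillRegistry, AgentState, and the versioning scheme are hypothetical names for illustration, not the project's implementation.

```typescript
// Hypothetical sketch: state lives outside skills, so skills can be replaced live.
interface AgentState {
  turns: number;
  notes: string[];
}

type Skill = (state: AgentState, input: string) => string;

class SkillRegistry {
  private skills = new Map<string, { version: number; run: Skill }>();

  // Install or replace a skill; returns the new version number.
  load(name: string, run: Skill): number {
    const version = (this.skills.get(name)?.version ?? 0) + 1;
    this.skills.set(name, { version, run });
    return version;
  }

  // Run a skill against the shared, long-lived state.
  run(name: string, state: AgentState, input: string): string {
    const skill = this.skills.get(name);
    if (!skill) throw new Error(`unknown skill: ${name}`);
    state.turns += 1;
    return skill.run(state, input);
  }
}
```

Calling load a second time with the same name replaces the logic and bumps the version, while the turns counter and notes held in the state carry over.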

What we learned

An agent should behave like a persistent entity, with continuity, the ability to evolve, and a stable identity. The key insight is to build an organization of agents, not a pipeline.

The system's three core pillars:

Intent and context understanding
Continuous improvement loop (frustration capture → task extraction → skill promotion)
Runtime behavioral modes (proactive, silent, feedback, introversion)
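
The continuous improvement loop (frustration capture, then task extraction, then skill promotion) might look roughly like this. Everything here is an assumption for illustration: the ImprovementLoop class, the rule of one task per distinct complaint, and the promotion threshold of three successes are not described in the project.

```typescript
// Hypothetical sketch of frustration capture -> task extraction -> skill promotion.
interface Task {
  description: string;
  successes: number;
  promoted: boolean;
}

// Assumed value: successful runs needed before a task is promoted to a skill.
const PROMOTION_THRESHOLD = 3;

class ImprovementLoop {
  private tasks = new Map<string, Task>();

  // Capture a complaint and extract (or reuse) the task that addresses it.
  captureFrustration(complaint: string): Task {
    const key = complaint.trim().toLowerCase();
    let task = this.tasks.get(key);
    if (!task) {
      task = { description: key, successes: 0, promoted: false };
      this.tasks.set(key, task);
    }
    return task;
  }

  // Record a successful run; promote once the threshold is reached.
  recordSuccess(task: Task): boolean {
    task.successes += 1;
    if (task.successes >= PROMOTION_THRESHOLD) task.promoted = true;
    return task.promoted;
  }
}
```

A real system would need a smarter way to group complaints than exact text matching; the normalization here only stands in for that step.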

What's next

Finalize the 20-task roadmap focused on intent prediction, memory, and runtime modes
Ship the Sentinel demo ("watch your back" — event detection + warnings)
Integrate UI-TARS for surgical-precision desktop control
Add day/night cycle UI in introversion mode
Publish daily research reports from autonomous exploration
Add a custom avatar; see https://youtube.com/shorts/2JHv3zZXVgg

Built With

convex
gemini
node.js
perplexity
typescript
ui-tars

Updates

Livio Gamassia started this project — Mar 16, 2026 01:33 PM EDT
