Inspiration
We live in a strange paradox. We send two-minute voice notes on WhatsApp, jump on video calls across continents, and speak to friends and colleagues naturally. But with AI? We open a sandboxed app, switch windows, type, wait, copy, switch again, paste — and call this progress.
In 2018, Google demoed Duplex, an assistant that could make phone calls autonomously. The world watched in awe. And then… nothing. Voice-first assistants stalled, and innovation became incremental instead of foundational.
Iris-chan starts from a simple question: What if computers reacted to what we look at, the same way humans react to what we pay attention to?
What it does
Iris-chan is a persistent, voice-first AI orchestrator that:
- Sees what you see — context-aware visual understanding of your screen
- Speaks and listens naturally — voice-first interaction with no prompt engineering required
- Commands agent teams — delegates tasks to specialized sub-agents (Iris-chan + clawdbot)
- Maintains continuity — durable memory, state, and preferences across sessions
- Observes and reports — live task monitoring, cost tracking, and decision visibility
Key capabilities include:
- Intent-driven interaction (predict user needs before explicit commands)
- Continuous screen vocabulary extraction
- Automatic 2FA detection and autofill
- Multiple runtime behavioral modes: proactive, silent, feedback, introversion
- Daily research reports via autonomous web exploration
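To make the runtime behavioral modes concrete, here is a minimal sketch of how a mode could gate when the agent speaks. The four mode names come from the list above. Everything else (`ModePolicy`, `Trigger`, `shouldSpeak`, and the meaning given to each mode) is an assumption made for illustration, not a description of Iris-chan's actual behavior.

```typescript
// The mode names are from the write-up; their semantics below are assumed.
type RuntimeMode = "proactive" | "silent" | "feedback" | "introversion";

// Hypothetical policy flags describing what a mode allows.
interface ModePolicy {
  speaksUnprompted: boolean; // may the agent start talking on its own?
  answersWhenAsked: boolean; // does it reply to a direct request?
  asksForFeedback: boolean;  // does it check in after finishing a task?
}

// Assumed mapping, for illustration only.
const MODE_POLICIES: Record<RuntimeMode, ModePolicy> = {
  proactive:    { speaksUnprompted: true,  answersWhenAsked: true,  asksForFeedback: false },
  silent:       { speaksUnprompted: false, answersWhenAsked: true,  asksForFeedback: false },
  feedback:     { speaksUnprompted: false, answersWhenAsked: true,  asksForFeedback: true },
  introversion: { speaksUnprompted: false, answersWhenAsked: false, asksForFeedback: false },
};

// Hypothetical events that could cause the agent to say something.
type Trigger = "user_request" | "agent_suggestion" | "task_finished";

// Decide whether the agent should speak for a given mode and trigger.
function shouldSpeak(mode: RuntimeMode, trigger: Trigger): boolean {
  const policy = MODE_POLICIES[mode];
  switch (trigger) {
    case "user_request":
      return policy.answersWhenAsked;
    case "agent_suggestion":
      return policy.speaksUnprompted;
    case "task_finished":
      return policy.asksForFeedback || policy.speaksUnprompted;
  }
}
```

Keeping the policy in a lookup table means a new mode is one more row rather than another branch in the voice loop.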
How we built it
Architecture pillars:
- Persistence — Durable state stored in Convex with event logs (what was attempted, what worked, why)
- Hot-reload — Update logic and skills without killing the entity or losing context
- Skill system — Detect, version, and test stabilized capabilities
- Focus bridge — Focus as the "project compiler" (Specs → Tasks → execution), Iris-chan as the "conductor"
- Visual representation — A live UI showing agents, tasks, blockers, decisions, cost, and latency
Tech stack: Voice via Gemini Live API, GUI execution via UI-TARS, knowledge retrieval via Perplexity, storage via Convex, orchestration runtime built in TypeScript/Node.js.
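The persistence pillar (an event log of what was attempted, what worked, and why) can be sketched as an append-only log that survives a restart. This is an in-memory stand-in written under our own assumptions: it does not use the Convex API, and the names `AgentEvent`, `EventLog`, `snapshot`, and `restore` are hypothetical.

```typescript
type Outcome = "succeeded" | "failed";

// One log entry: what was attempted, whether it worked, and why.
interface AgentEvent {
  seq: number;       // position in the log, starting at 1
  at: string;        // ISO timestamp of when the event was recorded
  attempted: string;
  outcome: Outcome;
  why: string;
}

// Hypothetical append-only log; a real deployment would write to durable storage.
class EventLog {
  private events: AgentEvent[] = [];

  append(attempted: string, outcome: Outcome, why: string): AgentEvent {
    const event: AgentEvent = {
      seq: this.events.length + 1,
      at: new Date().toISOString(),
      attempted,
      outcome,
      why,
    };
    this.events.push(event);
    return event;
  }

  // Return only the attempts that succeeded.
  whatWorked(): AgentEvent[] {
    return this.events.filter((e) => e.outcome === "succeeded");
  }

  // Serialize the log so a later session can pick it up.
  snapshot(): string {
    return JSON.stringify(this.events);
  }

  // Rebuild a log from a previous snapshot.
  static restore(serialized: string): EventLog {
    const log = new EventLog();
    log.events = JSON.parse(serialized) as AgentEvent[];
    return log;
  }
}
```

A new session would call `EventLog.restore(...)` with the last snapshot and keep appending, so the sequence numbers continue where the previous session stopped.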
Challenges we ran into
- The recompilation paradox — Every prompt change breaks context. We designed a continuous-state model to eliminate this.
- Tool opacity — Cognitive latency (understanding what's happening) is often worse than machine latency. We made debugging native and observability first-class.
- Keeping an agent "alive" — Most frameworks produce disposable runs. Designing persistence, hot-reload, and identity required rethinking the agent lifecycle entirely.
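One way to picture hot-reload without losing context is a long-lived agent object whose skills can be swapped while its state stays in place. The sketch below is a simplified illustration under our own assumptions. `LiveAgent`, `AgentState`, `loadSkill`, and `invoke` are hypothetical names, not the project's runtime.

```typescript
// State that must survive a skill reload (hypothetical shape).
interface AgentState {
  turns: number;
  notes: string[];
}

// A skill reads and updates the shared state and returns a reply.
type Skill = (state: AgentState, input: string) => string;

class LiveAgent {
  private skills = new Map<string, { version: number; run: Skill }>();
  readonly state: AgentState = { turns: 0, notes: [] };

  // Install or replace a skill. The state object is never touched here.
  loadSkill(name: string, run: Skill): number {
    const version = (this.skills.get(name)?.version ?? 0) + 1;
    this.skills.set(name, { version, run });
    return version;
  }

  // Run the current version of a skill against the persistent state.
  invoke(name: string, input: string): string {
    const skill = this.skills.get(name);
    if (!skill) throw new Error(`unknown skill: ${name}`);
    this.state.turns += 1;
    return skill.run(this.state, input);
  }
}

const agent = new LiveAgent();
agent.loadSkill("greet", (state, who) => {
  state.notes.push(who);
  return `hello ${who}`;
});
agent.invoke("greet", "Ada");

// Hot-reload: replace the skill's logic while the agent keeps running.
agent.loadSkill("greet", (state, who) => `welcome back ${who}, turn ${state.turns}`);
const reply = agent.invoke("greet", "Ada");
```

After the reload, `agent.state` still holds the turn count and the note written by the first version of the skill, which is the property a disposable run cannot offer.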
What we learned
An agent should behave like a persistent entity, with continuity, the ability to evolve, and a stable identity. The key insight is to build an organization of agents, not a pipeline.
The system's three core pillars:
- Intent and context understanding
- Continuous improvement loop (frustration capture → task extraction → skill promotion)
- Runtime behavioral modes (proactive, silent, feedback, introversion)
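The continuous improvement loop (frustration capture, then task extraction, then skill promotion) can be sketched as a small state machine. This is an illustrative sketch only: the names `Frustration`, `CandidateTask`, and `ImprovementLoop` are hypothetical, and the promotion rule (three recorded successes) is an assumed threshold, not a figure from the project.

```typescript
// A captured moment of user frustration, tagged with the task it points at.
interface Frustration {
  taskHint: string; // hypothetical label for the task the user struggled with
  note: string;
}

// A task extracted from one or more frustration reports.
interface CandidateTask {
  name: string;
  reports: number;   // how many times users hit this pain point
  successes: number; // how many times the agent later handled it well
}

const PROMOTION_THRESHOLD = 3; // assumed value, for illustration

class ImprovementLoop {
  private tasks = new Map<string, CandidateTask>();
  private skills = new Set<string>();

  // Steps 1 and 2: capture a frustration and fold it into a candidate task.
  capture(f: Frustration): CandidateTask {
    const task = this.tasks.get(f.taskHint) ?? { name: f.taskHint, reports: 0, successes: 0 };
    task.reports += 1;
    this.tasks.set(f.taskHint, task);
    return task;
  }

  // Step 3: record a successful run; promote to a skill once it has stabilized.
  recordSuccess(name: string): boolean {
    const task = this.tasks.get(name);
    if (!task) return false;
    task.successes += 1;
    if (task.successes >= PROMOTION_THRESHOLD) {
      this.skills.add(name);
      return true;
    }
    return false;
  }

  promotedSkills(): string[] {
    return Array.from(this.skills);
  }
}
```

Promotion is deliberately gated on repeated success rather than on the number of complaints, so a capability only becomes a skill after it has proven stable.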
What's next
- Finalize the 20-task roadmap focused on intent prediction, memory, and runtime modes
- Ship the Sentinel demo ("watch your back" — event detection + warnings)
- Integrate UI-TARS for surgical-precision desktop control
- Add day/night cycle UI in introversion mode
- Publish daily research reports from autonomous exploration
- Add a custom avatar, including: https://youtube.com/shorts/2JHv3zZXVgg
Built With
- convex
- gemini
- node.js
- perplexity
- typescript
- ui-tars