Inspiration
We've all been in meetings where you're expected to contribute meaningfully while simultaneously trying to keep up with everything being said. Whether it's a fast-moving standup, a client call, or a technical review, the pressure to listen, think, and respond at the same time is exhausting. We wanted to build a tool that could sit alongside you in those meetings, one that listens, understands context, and can even speak on your behalf when you need it to. The idea of an AI that sounds like you and responds like you made Clairo feel like something genuinely useful rather than just another bot.
What it does
Clairo is an AI-powered meeting assistant that joins Zoom calls as a participant. It transcribes the conversation in real time, maintains running notes with decisions and open questions, and tracks action items as they come up. Users can choose how involved they want Clairo to be: it can take notes silently, surface suggested replies in the UI for the user to review, or automatically respond to direct questions using a clone of the user's own voice. After the meeting ends, Clairo generates a structured summary with the full transcript, key decisions, and a complete action item list.
How we built it
Clairo is a four-service microservices architecture:

- Next.js Control UI: the web dashboard where users manage meetings, view live transcripts, configure the agent, and enroll their voice. Built with Next.js 16, React 19, Tailwind CSS 4, and Supabase for auth and database.
- Zoom Meeting Gateway: a service that joins Zoom calls using the Zoom Meeting SDK, captures inbound audio, and injects synthesized speech back into the call.
- Gemini Intelligence Service: the agent brain, powered by Google Gemini's Live API. It processes streaming audio, builds a rolling transcript, maintains meeting context, and generates spoken replies.
- ElevenLabs Voice Runtime: handles voice enrollment via ElevenLabs Instant Voice Cloning, converts reply text to speech in the user's cloned voice, and manages the speech job queue.

All user data is stored in Supabase (Postgres) with row-level security, so users only ever access their own sessions and notes. Backend services communicate through a token-protected internal webhook on the frontend.
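The token-protected internal webhook can be sketched as a Next.js route handler. This is an illustrative sketch only: the header name, env var, and event shape are assumptions, not Clairo's actual API.

```typescript
// Hypothetical shape of an event posted by a backend service.
type AgentEvent = {
  sessionId: string;
  kind: "transcript" | "note" | "action_item";
  payload: unknown;
};

// Internal webhook route: backend services POST events here, authenticated
// with a shared service token that never leaves the server side.
export async function POST(req: Request): Promise<Response> {
  // Reject any caller that does not present the shared internal token.
  const token = req.headers.get("x-internal-token");
  if (!token || token !== process.env.INTERNAL_WEBHOOK_TOKEN) {
    return new Response("forbidden", { status: 403 });
  }

  const event = (await req.json()) as AgentEvent;
  // ...persist the event to Supabase and fan out to the live UI...
  return Response.json({ ok: true, received: event.kind });
}
```

Because the token check happens on the server, none of the service credentials are ever exposed to the browser.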
Challenges we ran into
Getting low-latency audio to flow reliably through multiple services, from Zoom's SDK through the intelligence layer and back out as synthesized speech, is genuinely hard. Zoom's Meeting SDK has strict constraints on how audio is captured and injected, and coordinating that with Gemini's streaming API and ElevenLabs' TTS pipeline introduced compounding latency at every step. On the frontend side, synchronizing the live transcript view with real-time polling while keeping the UI responsive required careful state management. We also had to think carefully about security: API keys, service tokens, and voice data needed to stay server-side at every layer.
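The transcript-synchronization problem above boils down to merging each poll's results into existing state without duplicates or reordering. A minimal sketch of that merge step, assuming a simple segment shape (the real data model may differ):

```typescript
// One transcript segment as returned by a poll (assumed shape).
type Segment = { id: number; speaker: string; text: string };

// Merge a poll's incoming segments into current state, keyed by segment id,
// so duplicate or out-of-order responses don't corrupt the live view.
function mergeSegments(current: Segment[], incoming: Segment[]): Segment[] {
  const byId = new Map<number, Segment>();
  for (const s of current) byId.set(s.id, s);
  for (const s of incoming) byId.set(s.id, s); // later polls win for revised segments
  return [...byId.values()].sort((a, b) => a.id - b.id);
}
```

Keeping the merge pure like this makes it easy to drop into React state updates and to test independently of the polling loop.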
Accomplishments that we're proud of
We're proud of how complete and production-ready the control UI feels. The live meeting view, with its real-time transcript, tabbed notes and action items, agent status sidebar, and animated speaking indicator, gives a genuine sense of what the full product will look like in action. The voice enrollment wizard, the full database schema with RLS, the internal event webhook used by backend services, and the graceful fallback behavior when backends aren't running all reflect a level of architectural care we're happy with. The product brand and visual identity (Clairo, the deep navy + cyan + purple palette) came together in a way that feels polished.
What we learned
Building this forced us to think carefully about the boundary between the frontend and the backend services: what state lives where, how events flow between services, and how to design APIs that are stable enough for multiple consumers from day one. We learned a lot about the practical constraints of real-time audio pipelines and how much latency can accumulate across service boundaries. We also learned that designing for graceful degradation early (so the UI works without any backends) dramatically accelerates development.
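The graceful-degradation pattern described above can be sketched as a small fetch wrapper: every backend call falls back to a safe default when the service is unreachable, so the UI keeps rendering. The function name and defaults here are hypothetical, not Clairo's actual code.

```typescript
// Wrap a backend call so that a missing or failing service degrades to a
// safe default instead of breaking the UI.
async function fetchWithFallback<T>(url: string, fallback: T): Promise<T> {
  try {
    const res = await fetch(url);
    if (!res.ok) return fallback;
    return (await res.json()) as T;
  } catch {
    // Backend not running (e.g. local dev without the gateway): degrade gracefully.
    return fallback;
  }
}
```

With this in place, every dashboard view can be developed and demoed before its backing service exists.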
What's next for Clairo
The three backend services (the Zoom gateway, the Gemini intelligence layer, and the ElevenLabs voice runtime) are fully specced and ready to be built. Beyond completing those, we want to add support for additional meeting platforms (Google Meet, Microsoft Teams), expand the post-meeting summary into a shareable document, and explore proactive agent behavior where Clairo can surface relevant context or documents mid-meeting without being asked. Longer term, we see Clairo evolving from a meeting assistant into a persistent work companion that understands your projects, your team, and your communication style.
Built With
- auth
- claude
- elevenlabs
- gemini
- github
- next.js
- openai
- react.js
- sql
- supabase
- tailwind
- typescript
- vercel
- whispr
- zoom