Inspiration

We kept running into the same wall: brilliant teachers and motivated students stuck on opposite sides of a language barrier in live calls. Subtitles and note‑taking bots helped a bit, but they never made the lesson feel native to each learner. We wanted something closer to a real co‑teacher that could sit in the call, understand what’s happening, and help every participant in their own language in real time. OmniTutor was born from that frustration—and from the idea that modern multimodal models are finally good enough to make this feel natural, not like a clunky translation overlay.

What it does

OmniTutor is an AI co‑tutor that joins your Google Meet as a participant, listens to the conversation, and provides real‑time, bidirectional translation and explanation. A student pastes a Meet link into our web app, chooses their language, and OmniTutor joins the call as an agent. When the teacher speaks (say, English), the student hears a fluent translation (e.g., French) in a natural voice; when the student speaks, the teacher hears their question in English. The app also shows live transcripts, health metrics, and an AI “Studio” HUD, and after the session it can generate creative summaries, key moments, and quizzes.

How we built it

We built a Vite + React frontend with a polished “Tutor Studio” that manages sessions, live transcripts, and a technical health HUD over WebSockets. Users create sessions via a backend FastAPI‑style REST API (/api/v1/sessions), which spins up a dedicated session agent and returns a ws_url for real‑time orchestration. On the backend, a browser automation layer (Playwright‑powered) launches a Chrome‑compatible instance, navigates to the Google Meet link, and uses a vision‑based navigator endpoint (/api/v1/navigator/calculate-action) to click through the join flow. Once in the call, an audio pipeline hooks into Meet audio, runs ASR → translation → TTS with LLMs and speech models, and streams transcripts and health updates back to the Studio UI while sending synthesized audio back into the Meet as the agent’s “voice”.
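The ASR → translation → TTS chain can be sketched as a stream of stages over incoming audio chunks. This is a minimal illustration, not our production code: the toy stage callables passed in below stand in for the real speech and LLM models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Caption:
    source_text: str       # speaker's words as recognized by ASR
    translated_text: str   # rendering in the listener's language
    audio: bytes           # synthesized speech to inject back into the call


def run_pipeline(
    chunks: Iterable[bytes],
    asr: Callable[[bytes], str],
    translate: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Iterator[Caption]:
    """ASR -> translation -> TTS over a stream of audio chunks."""
    for chunk in chunks:
        text = asr(chunk)
        if not text:  # skip silence / empty recognitions
            continue
        translated = translate(text)
        yield Caption(text, translated, tts(translated))
```

Because each stage is just a callable, the same skeleton works whether the stages are local models or remote API calls, and the generator shape keeps transcripts flowing to the UI as soon as each utterance is ready.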

Challenges we ran into

Joining Google Meet reliably with an automated agent was much harder than just scripting DOM selectors: layouts change, login flows vary, and some environments show “Meet doesn’t work on your browser”. We had to design a pure vision‑driven navigation loop that can interpret screenshots and decide where to click, with robust fallbacks when the backend navigator isn’t reachable. On the real‑time side, getting audio latency low enough that translation still feels conversational—while juggling ASR, translation, and TTS—required careful buffering, streaming, and health monitoring. We also had to design a WebSocket protocol that stays simple enough for the frontend but expressive enough to drive all the session states (mic, language, barge‑in, health).
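The vision-driven join loop with its selector fallback looks roughly like this. All the callables here (`take_screenshot`, `navigator`, `click`, `click_selector`) are hypothetical stand-ins for the real Playwright layer and the `/api/v1/navigator/calculate-action` client, and the action schema is illustrative.

```python
def join_call(take_screenshot, navigator, click, click_selector,
              fallback_selectors=("button:has-text('Join now')",),
              max_steps=10):
    """Screenshot -> ask the vision navigator -> click, until it says 'done'.

    Falls back to known (brittle) DOM selectors when the navigator
    backend isn't reachable.
    """
    for _ in range(max_steps):
        shot = take_screenshot()
        try:
            # Hypothetical response shape: {"type": "click", "x": .., "y": ..}
            action = navigator(shot)
        except ConnectionError:
            # Navigator down: try the selector-based fallback path instead.
            for sel in fallback_selectors:
                if click_selector(sel):
                    return "joined-via-fallback"
            return "failed"
        if action["type"] == "done":
            return "joined"
        click(action["x"], action["y"])
    return "failed"
```

The key property is that the loop never trusts a single screenshot: it re-observes after every click, which is what makes it robust to layout changes that break hard-coded selectors.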

Accomplishments that we're proud of

We’re proud that OmniTutor can treat a Google Meet like a first‑class environment for AI, not just a video stream to record. The fact that an agent can join a call, understand what’s going on visually and acoustically, and then respond in a way that feels native in the Studio UI is a big milestone. The Studio HUD, which visualizes latency, vision FPS, audio sync, and voice activity in real time, turned out to be both useful for debugging and surprisingly delightful. We’re also proud of building a clean contract between frontend and backend—REST for session lifecycle, WebSockets for orchestration, and a dedicated vision endpoint—so the system can evolve without breaking the user experience.
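To make the frontend/backend contract concrete, here is a sketch of how orchestration messages on the WebSocket might be applied to session state. The message shapes and field names are illustrative assumptions, not the actual wire protocol.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionState:
    mic_on: bool = False
    language: str = "en"
    barge_in: bool = False
    health: dict = field(default_factory=dict)


def handle_message(state: SessionState, raw: str) -> SessionState:
    """Apply one orchestration message (JSON text) to the session state."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "mic":
        state.mic_on = msg["on"]
    elif kind == "language":
        state.language = msg["code"]
    elif kind == "barge_in":
        state.barge_in = True  # interrupt any TTS currently playing
    elif kind == "health":
        state.health.update(msg["metrics"])  # e.g. latency_ms, vision_fps
    return state
```

Keeping every state change behind a single typed-message dispatcher is what lets the REST lifecycle, the WebSocket stream, and the HUD evolve independently without breaking each other.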

What we learned

We learned that “just translate the call” is really a full orchestration problem: voice activity detection, turn‑taking, barge‑in, error handling, and UI feedback all matter as much as raw model quality. We saw firsthand how brittle traditional selectors are for video UIs and how much more robust a multimodal / vision‑guided approach can be, even for something as “simple” as clicking a Join button. We also learned that exposing internal health metrics (latency, sync, emotion state) to users isn’t just a dev tool—it builds trust by showing that the AI is actually doing work and gives them visibility when something degrades.
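As a taste of why turn-taking matters as much as model quality, here is a toy energy-threshold voice-activity detector with a "hangover" period, so brief pauses don't prematurely end a speaker's turn. This is a simplified sketch of the concept, not the VAD we actually run.

```python
def detect_turns(frame_energies, threshold=0.1, hangover=2):
    """Return (start, end) frame indices of speech turns.

    A turn ends only after more than `hangover` consecutive quiet frames,
    so short pauses mid-sentence don't split a turn in two.
    """
    turns, start, quiet = [], None, 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i  # speech onset
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                turns.append((start, i - quiet))  # last speech frame
                start, quiet = None, 0
    if start is not None:  # stream ended mid-turn
        turns.append((start, len(frame_energies) - 1 - quiet))
    return turns
```

Tuning `hangover` is exactly the conversational-feel trade-off we kept hitting: too short and the tutor cuts people off mid-thought, too long and the translation lags behind the speaker.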

What's next for OmniTutor

Next, we want to expand beyond Google Meet to Zoom and Teams, using the same vision and audio orchestration layer so OmniTutor can join any major platform. On the intelligence side, we’ll go beyond translation to adaptive pedagogy: detecting confusion, slowing down explanations, generating targeted examples, and pushing micro‑quizzes in real time. We also plan to harden the infrastructure for classroom‑scale deployments (multi‑session management, per‑student profiles, institution dashboards) and to support more languages and voices. Long‑term, we see OmniTutor as a universal AI teaching layer that can sit on top of any live learning experience and make it accessible, personalized, and multilingual by default.

Built With

React, Vite, FastAPI, Playwright, WebSockets
