Inspiration

Modern professionals don’t struggle with a lack of information. They struggle with what happens after information. Meetings end, videos finish, podcasts pause, and suddenly there’s a familiar burden: What did we decide? What do I need to do? Where do I put this? Existing transcription tools promise help, but they mostly create more work. They turn conversations into piles of text that still need to be read, interpreted, summarized, and organized by a human. We wanted to flip that model. Instead of building another tool that records what happened, we asked a different question: What if the system did the follow-up thinking for you?

What it does

Cue is an intelligent Chrome extension that turns the browser from a passive viewing surface into an active collaborator. Whether you’re in a Google Meet, watching a technical YouTube video, or listening to a podcast, Cue listens alongside you and then handles the work that normally comes after. It doesn’t just capture content. It understands it, reasons through it, and turns it into action.

How we built it

Cue is built on the Gemini 3 ecosystem, combining its agent tooling, protocol support, and native multimodal APIs into a single pipeline:

  • Google Agent Development Kit (ADK): We architected a multi-agent workforce using Google ADK. A "Router Agent" (Gemini 3 Flash) constantly monitors browser signals to classify user intent, while a "Scribe Agent" manages audio ingestion. This allowed us to build complex, asynchronous workflows that a simple API call couldn't handle.
  • Model Context Protocol (MCP): To give our AI "hands," we implemented the Model Context Protocol. This allows Cue to securely connect to the local Filesystem (to scaffold projects) and Google Workspace (Gmail/Calendar) using a standardized layer, rather than writing fragile custom integrations.
  • Native Multimodal Streaming: We utilized Gemini 3’s Native Audio API via WebSockets. By streaming raw audio chunks, we achieve the near-zero latency required for our "Go Live" feature to drive 3D animations in real time.
  • Reasoned Output, Not Summaries: We employ chain-of-thought prompting to ensure the model doesn't just summarize, but actually reasons through the conversation to categorize outputs accurately.
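As an illustration of the Router Agent's job described above, here is a minimal sketch of signal-to-intent classification. All names (`BrowserSignal`, `classify_intent`) are hypothetical; in the real system this decision is made by Gemini 3 Flash through ADK rather than hand-written rules.

```python
from dataclasses import dataclass

@dataclass
class BrowserSignal:
    """A simplified browser signal the Router Agent might observe."""
    url: str
    tab_audible: bool

# Hypothetical stand-in for the Gemini 3 Flash call the real Router Agent makes.
def classify_intent(signal: BrowserSignal) -> str:
    """Map a browser signal to a coarse session intent."""
    if "meet.google.com" in signal.url and signal.tab_audible:
        return "live_meeting"   # hand off to the Scribe Agent
    if "youtube.com" in signal.url and signal.tab_audible:
        return "video_session"
    return "idle"               # nothing to capture

print(classify_intent(BrowserSignal("https://meet.google.com/abc-defg", True)))
```

The point of routing first is that cheap, fast classification gates the expensive work: the Scribe Agent only spins up when the intent warrants it.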
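The streaming side can be sketched similarly: raw PCM is sliced into fixed-size chunks before being pushed over the WebSocket. The chunk size below is an illustrative assumption, not a requirement of the Gemini API.

```python
def chunk_audio(pcm: bytes, chunk_size: int = 3200) -> list[bytes]:
    """Slice raw PCM audio into fixed-size chunks for streaming.

    3200 bytes = 100 ms of 16 kHz, 16-bit mono audio -- an assumed
    frame size for illustration only.
    """
    return [pcm[i:i + chunk_size] for i in range(0, len(pcm), chunk_size)]

# One second of silent 16 kHz, 16-bit mono audio -> ten 100 ms chunks.
frames = chunk_audio(b"\x00" * 32000)
print(len(frames))  # 10
```

Streaming small frames as they arrive, instead of waiting for a complete recording, is what keeps latency low enough to drive the "Go Live" 3D animations.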

Challenges we ran into

  • Browser Permissions: A key challenge was managing browser permissions correctly. We needed to ensure the extension had reliable access to audio while respecting Chrome’s permission model, which required careful handling to avoid interruptions, blocked access, or repeated permission prompts.
  • Preserving Context for Correct Reasoning: Another challenge was preserving enough context for the model to reason accurately. Early iterations struggled to distinguish casual discussion from concrete decisions. We addressed this by improving how contextual information was passed to Gemini 3, enabling clearer task prioritization and decision detection.
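The fix for the context problem can be sketched as a rolling transcript buffer that carries session metadata alongside the most recent utterances into every prompt. The class, field names, and window size here are illustrative assumptions, not our production code.

```python
from collections import deque

class SessionContext:
    """Rolling window of recent utterances plus session metadata."""

    def __init__(self, max_utterances: int = 20):
        # Old utterances fall off automatically once the window is full.
        self.utterances = deque(maxlen=max_utterances)
        self.metadata: dict[str, str] = {}

    def add(self, speaker: str, text: str) -> None:
        self.utterances.append(f"{speaker}: {text}")

    def build_prompt(self, task: str) -> str:
        """Assemble the structured context passed to the model."""
        header = "\n".join(f"{k}: {v}" for k, v in self.metadata.items())
        transcript = "\n".join(self.utterances)
        return f"{header}\n--- transcript ---\n{transcript}\n--- task ---\n{task}"

ctx = SessionContext()
ctx.metadata["session_type"] = "google_meet"
ctx.add("Ana", "Let's ship the beta on Friday.")
prompt = ctx.build_prompt("List the concrete decisions made.")
```

Giving the model the session type and surrounding discussion, rather than an isolated line of transcript, is what lets it tell a firm decision apart from casual chatter.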

Accomplishments that we're proud of

  • Fully Agentic Workflow: We successfully built a system where a user can finish a session and immediately have a structured task list without clicking a single button.
  • Higher Audio Accuracy: Leveraging Gemini's native audio capabilities gave us significantly higher accuracy on technical jargon than standard Whisper-based pipelines.
  • Actionable Across Google Applications: Cue doesn’t stop at understanding conversations; it can act on them. We built integrations that allow Cue to draft emails, create documents, and add events to calendars, turning insights directly into execution inside Google apps.
  • The "Halo Strip" UI: We designed a "calm tech" interface that anticipates needs without being intrusive, appearing only when the confidence score of our intent model crosses a threshold.

What we learned

  • Chrome Extension & Permissions: We learned how to build a Chrome extension that respects the browser’s permission model while still reliably accessing audio and context from active tabs. Handling permissions correctly, without annoying repeated prompts and while staying compatible with other Google apps, was key to a smooth user experience.
  • Context is Critical for Reasoning: Feeding the model raw audio text alone wasn’t enough; supplying structured context from the session and surrounding discussion was essential to help Gemini 3 differentiate casual commentary from actual decisions and action items. Improving how we passed context into the reasoning pipeline made outputs far more meaningful.

What's next for Cue

Once Cue analyzes sessions and generates structured tasks, we plan to leverage tools like Nano Banana and Veo 3 to create short clips or key images that capture the most important moments of a meeting or video. This will allow users to see exactly what happened, understand decisions at a glance, and quickly grasp context without rereading anything. By combining task extraction with visual highlights, Cue will make follow-up actions faster, clearer, and more intuitive.
