Inspiration: In today's digitally connected world, effective communication is more critical than ever. We noticed a recurring challenge in our internal team meetings and presentations: ensuring everyone could follow along, especially in a diverse, remote-first environment. We were inspired to create a solution that not only enhances real-time communication but also champions accessibility and knowledge preservation. The goal was to build a tool that could instantly make live-streamed content understandable and accessible to everyone, regardless of hearing ability or environmental distractions. We wanted to move beyond simple streaming and create a platform that actively enriches the communication experience.

What it does: Air-Caption is a self-contained, real-time streaming platform designed for internal team use. At its core, it allows a user to stream video and audio directly from their web browser. While they stream, the platform leverages the power of Google's Gemini 1.5 Flash model to generate incredibly fast and accurate live captions, which are overlaid directly onto the video player for all viewers to see.

Key features include:

  • Live Video & Audio Streaming: Seamlessly stream from a webcam and microphone using WebRTC.
  • Real-time AI Captions: Sub-2-second latency caption generation powered by the Gemini 1.5 Flash streaming API.
  • Long-Duration Session Support: Conduct continuous sessions for up to 3 hours, with a robust session resumption feature that handles brief network disconnects gracefully.
  • Customizable User Experience: Users can customize the appearance of captions, including color, font size, and position, to suit their needs.
  • Recording & Export: Sessions can be recorded and exported as video files with synchronized captions. The caption data can also be exported separately in standard formats like SRT, VTT, or JSON for documentation purposes.
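
The SRT export is mostly timestamp formatting over timed caption cues. As a minimal sketch (the `Cue` shape below is illustrative, not Air-Caption's actual data model), the serialization can look like this:

```typescript
// Sketch: serializing caption cues to SRT. The Cue shape is an
// illustrative assumption, not Air-Caption's real data model.

interface Cue {
  startMs: number; // cue start, milliseconds from session start
  endMs: number;   // cue end
  text: string;    // caption text
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(ms: number): string {
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Each SRT block: sequence number, time range, text, blank line.
function toSrt(cues: Cue[]): string {
  return cues
    .map((c, i) =>
      `${i + 1}\n${srtTimestamp(c.startMs)} --> ${srtTimestamp(c.endMs)}\n${c.text}\n`)
    .join("\n");
}
```

A VTT export differs mainly in the header line and using a period instead of a comma in timestamps, so one formatter parameterized on the separator can cover both.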

How we built it: Air-Caption is built on a modern, robust technology stack, meticulously planned to handle the demands of real-time data processing.

Frontend: We used a Next.js and React frontend with TypeScript for type safety and Tailwind CSS for styling. The user interface is built around the Plyr.js video player, which provides a flexible foundation for our custom caption overlay.

Real-time Media: WebRTC is the engine that captures the user's audio and video streams directly in the browser.

Backend: The backend is a Node.js application written in TypeScript. It uses a WebSocket server to maintain a persistent, low-latency, bidirectional connection with the client. This is the channel through which audio chunks are sent to the server and caption data is sent back to the client.
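
To make that bidirectional channel concrete, here is a hedged sketch of a JSON message envelope for the WebSocket connection; the message types and field names are assumptions for illustration, not Air-Caption's actual wire format:

```typescript
// Hypothetical message envelope for the client<->server WebSocket
// channel. Types and fields are illustrative assumptions.

type ClientMessage =
  | { type: "audio-chunk"; sessionId: string; seq: number; payload: string } // base64 audio
  | { type: "resume"; sessionId: string; lastSeq: number };

type ServerMessage =
  | { type: "caption"; startMs: number; endMs: number; text: string }
  | { type: "resumed"; sessionId: string };

// JSON frames keep the protocol easy to log and debug.
function encode(msg: ClientMessage | ServerMessage): string {
  return JSON.stringify(msg);
}

// Validate the discriminant before trusting a frame from the wire.
function decodeServer(frame: string): ServerMessage {
  const msg = JSON.parse(frame) as ServerMessage;
  if (msg.type !== "caption" && msg.type !== "resumed") {
    throw new Error(`unknown message type: ${(msg as { type: string }).type}`);
  }
  return msg;
}
```

Carrying a sequence number on every audio chunk is what makes resumption cheap later: after a reconnect, the client only needs to report the last `seq` the server acknowledged.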

AI Model & Integration: The audio stream is relayed from our backend to the Google Gemini 1.5 Flash streaming API. We chose this model for its exceptional speed, accuracy, and native support for real-time, continuous transcription.

Infrastructure: To ensure stability and resilience, we integrated Redis as a high-speed in-memory store for managing session state and enabling our session resumption feature. For storing recordings, we used Appwrite, a secure and scalable open-source backend platform that handles our storage needs.

The data flow is a continuous loop: WebRTC captures the audio, the client sends it via WebSocket to our Node.js server, the server streams it to the Gemini API, receives caption results in real-time, and immediately pushes them back down the WebSocket to be displayed on the client's screen.

Challenges we ran into: Building a real-time system of this complexity presented several challenges.

Achieving Ultra-Low Latency: The primary challenge was minimizing the delay between a word being spoken and its caption appearing on screen. This required optimizing every step of the data pipeline—from audio chunking on the client to efficient processing on the backend and leveraging the fastest possible response from the Gemini API.

Ensuring Session Stability: Supporting 3-hour continuous sessions is non-trivial. Network blips are inevitable over such a long period. We overcame this by designing and implementing a robust session resumption mechanism using Redis to persist session state, allowing users to automatically reconnect and continue their stream without losing context.
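
The resumption logic can be sketched as follows. The field names and the 60-second grace window are illustrative assumptions, and a plain `Map` stands in for Redis here so the example is self-contained:

```typescript
// Sketch of the session state a resumption mechanism might persist in
// Redis. Fields and the grace window are assumptions; a Map stands in
// for the Redis store to keep the example self-contained.

interface SessionState {
  sessionId: string;
  lastSeq: number;         // last audio chunk acknowledged
  captionCount: number;    // captions emitted so far
  disconnectedAt?: number; // epoch ms; set when the socket drops
}

const GRACE_MS = 60_000; // how long a dropped session may resume

class SessionStore {
  private store = new Map<string, SessionState>();

  save(state: SessionState): void {
    this.store.set(state.sessionId, state);
  }

  markDisconnected(sessionId: string, now: number): void {
    const s = this.store.get(sessionId);
    if (s) s.disconnectedAt = now;
  }

  // On reconnect: resume if within the grace window, else start fresh.
  tryResume(sessionId: string, now: number): SessionState | null {
    const s = this.store.get(sessionId);
    if (!s || s.disconnectedAt === undefined) return null;
    if (now - s.disconnectedAt > GRACE_MS) {
      this.store.delete(sessionId); // expired, like a Redis TTL firing
      return null;
    }
    delete s.disconnectedAt;
    return s;
  }
}
```

With Redis itself, the expiry would typically be a TTL set on the session key at disconnect time rather than a timestamp check, so stale sessions clean themselves up.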

API Cost Optimization: Live-streaming audio to a powerful AI model can be costly. To make the project viable, we had to implement intelligent cost-saving measures. Our solution was to build a server-side Voice Activity Detection (VAD) module that pauses the data stream to the Gemini API during periods of silence, significantly reducing unnecessary processing costs.
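
A minimal energy-based gate illustrates the idea; the RMS threshold below is an illustrative assumption, not the tuning of our actual module:

```typescript
// Minimal energy-based VAD sketch: a simple stand-in for the
// server-side module described above. The threshold is an assumption.

const SILENCE_RMS = 0.01; // below this RMS, treat the chunk as silence

// Root-mean-square energy of one chunk of PCM samples in [-1, 1].
function rms(samples: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}

// Gate: only chunks with voice activity are forwarded to the API.
function shouldForward(samples: Float32Array): boolean {
  return rms(samples) >= SILENCE_RMS;
}
```

A production gate usually adds hysteresis (hold the stream open for a short tail after speech ends) so words are not clipped at sentence boundaries.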

Accomplishments that we're proud of:

  • End-to-End Real-Time Pipeline: We are incredibly proud of creating a stable, end-to-end pipeline that handles real-time video, audio, and AI processing with very low latency. Seeing a caption appear almost instantly as you speak is a testament to the architecture's success.
  • Flawless Gemini Integration: Successfully integrating the Gemini 1.5 Flash streaming API was a major accomplishment. We were able to harness its full potential for a seamless and highly accurate live transcription experience.
  • Robust Session Resumption: The session resumption feature is a standout accomplishment. It provides a level of reliability and user confidence that is critical for long-form presentations or important meetings.
  • Meaningful Accessibility: More than just a technical achievement, we built a tool that genuinely improves accessibility. The customizable, real-time captions make live content immediately more inclusive for everyone.

What we learned: This project was a tremendous learning experience. We gained deep, practical knowledge of managing real-time data flows using a combination of WebRTC and WebSockets. We learned best practices for integrating and optimizing state-of-the-art streaming AI APIs, including handling asynchronous data streams, managing API-specific features like speaker diarization, and implementing cost-control strategies. The project reinforced the absolute importance of thorough planning and documentation: our detailed Project Scope, PRD, and Architecture documents were invaluable guides that kept the project on track and ensured consistency.

We learned that for complex, stateful applications like this, a dedicated state management solution like Redis is not just a "nice-to-have" but a core component for building reliable and resilient systems.

What's next for Air-Caption: This MVP is a strong foundation, and we have a clear vision for the future. The next steps for Air-Caption are focused on expanding its collaborative and enterprise capabilities.

  • Advanced User Management: We plan to introduce a full user authentication system with user profiles and role-based access control.
  • Enhanced Interactivity: We want to add features like a live chat or a moderated Q&A module to run alongside the video stream.
  • Scalability and Performance: We will evolve the architecture to support a much larger number of concurrent users by implementing load balancing and auto-scaling infrastructure.
  • Third-Party Integrations: To embed Air-Caption into existing workflows, we plan to add integrations for platforms like Slack (for notifications) and Google Calendar (for scheduling streams).
  • Mobile Experience: Developing native iOS and Android applications to allow users to both stream from and view sessions on their mobile devices is a key part of our long-term roadmap.
