🎓 EduMentor Live: Your Bilingual Multimodal AI Tutor

🌟 Inspiration

As a final-year Computer Science student at UMT, I often found myself overwhelmed by dense research papers and complex global scholarship requirements. The constant context-switching—copying text from a PDF and pasting it into a chatbot—was a major flow-breaker.

I envisioned a "Live Companion" that feels like a real tutor sitting next to me—one who can see exactly what I'm looking at and discuss it naturally in my native language (Urdu/Punjabi) as well as English. This inspired the birth of EduMentor Live.

🚀 What it does

EduMentor Live is a real-time multimodal tutoring assistant designed to make education accessible. By sharing their screen, students can have an interactive dialogue about any visual content.

Vision-Enabled: It "sees" and analyzes screen content (like code, diagrams, or documents) at 1 FPS.

Bilingual Conversations: It bridges the language gap by seamlessly switching between English and Urdu.

Zero-Friction UI: Uses a manual Push-to-Talk (Spacebar) mechanism to ensure clear, turn-based dialogue without accidental interruptions.

🛠️ How I built it

The project is built on the cutting-edge Gemini 2.0 Flash (Native Audio) model, which handles both visual and auditory inputs natively.

Backend: Developed using FastAPI to manage high-concurrency, bidirectional WebSockets.

Hosting: Fully containerized with Docker and deployed on Google Cloud Run for scalable, low-latency streaming.

Audio Pipeline: I engineered a custom Web Audio pipeline using the browser's MediaStream API to capture audio, downsample it to 16 kHz PCM, and stream the chunks to the server.
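In the app this conversion runs in browser JavaScript, but the arithmetic can be mirrored in Python; the 48 kHz source rate (a typical browser capture rate) and the naive decimation without an anti-alias filter are simplifying assumptions to keep the sketch short:

```python
# Sketch of the capture-side conversion: float32 samples at 48 kHz
# decimated to 16 kHz signed 16-bit little-endian PCM. Naive 3:1
# decimation (no low-pass filter) is used here for brevity only.
import struct

def downsample_to_16k_pcm(samples, source_rate=48000, target_rate=16000):
    step = source_rate // target_rate      # 3:1 for 48 kHz -> 16 kHz
    decimated = samples[::step]
    pcm = bytearray()
    for s in decimated:
        s = max(-1.0, min(1.0, s))         # clamp to [-1.0, 1.0]
        pcm += struct.pack("<h", int(s * 32767))  # int16, little-endian
    return bytes(pcm)
```

Each resulting chunk is then sent over the WebSocket as a binary frame.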

Frontend: A high-performance, minimalist UI crafted with Vanilla JavaScript to keep the footprint small and the focus on the learning content.

🧠 Challenges I faced

Deploying a real-time multimodal app in a serverless cloud environment wasn't easy:

WebSocket Stability & Security: Managing idle timeouts on Cloud Run required a custom "Text Ping" heartbeat to keep the session alive. Switching from plain ws:// locally to secure wss:// in production was also critical, since browsers only grant microphone and screen-capture access in a secure context.
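The heartbeat idea can be sketched as a background task running alongside the receive loop; the 20-second interval and JSON ping payload are assumptions, not the tuned production values:

```python
# Sketch of the "Text Ping" heartbeat: a background task sends a small
# text frame at a fixed interval so the Cloud Run idle timeout never
# fires. Interval and message format are illustrative assumptions.
import asyncio

async def heartbeat(send_text, interval=20.0):
    """Send a ping frame every `interval` seconds until cancelled."""
    try:
        while True:
            await asyncio.sleep(interval)
            await send_text('{"type": "ping"}')
    except asyncio.CancelledError:
        pass  # session closed; stop pinging
```

In the endpoint this would be started with `asyncio.create_task(heartbeat(ws.send_text))` and cancelled when the client disconnects.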

Audio Jitter: To prevent stuttering over varied internet speeds (common in Pakistan), I had to fine-tune the audio buffer sizes and synchronization logic.
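One common way to absorb that jitter, sketched here with assumed thresholds rather than the tuned production values, is a prebuffering playback queue: playback only starts once a small backlog has accumulated, trading a little latency for smooth output.

```python
# Sketch of a playback jitter buffer. The prebuffer threshold below is
# an illustrative assumption; in practice it is tuned per network.
from collections import deque

class JitterBuffer:
    def __init__(self, prebuffer_chunks=5):
        self.queue = deque()
        self.prebuffer_chunks = prebuffer_chunks
        self.playing = False

    def push(self, chunk):
        """Queue an incoming audio chunk from the network."""
        self.queue.append(chunk)
        if not self.playing and len(self.queue) >= self.prebuffer_chunks:
            self.playing = True  # enough backlog to ride out jitter

    def pop(self):
        """Next chunk to play, or None (silence) while buffering."""
        if not self.playing:
            return None
        if not self.queue:
            self.playing = False  # underrun: rebuffer before resuming
            return None
        return self.queue.popleft()
```

On underrun the buffer falls back to prebuffering instead of stuttering chunk-by-chunk, which is the behaviour slow connections need.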

Cloud Permissions: Navigating IAM policies for Artifact Registry and Cloud Build was a steep learning curve, but it taught me the importance of the Principle of Least Privilege.

📚 What I learned

This hackathon was a deep dive into the future of Human-AI interaction. I mastered the art of handling Multimodal Data Streams simultaneously and learned how to build production-grade AI applications on Google Cloud. Most importantly, I saw firsthand how AI can democratize education by providing high-quality tutoring in regional languages.
