Inspiration
Everyone wants to work out correctly but can't afford a personal trainer. Generic YouTube videos can't see YOU. We set out to build something that actually watches you, understands your movement in real time, and calls you out — loudly — when you're doing it wrong. Like a desi trainer who doesn't let you get away with lazy squats.
What it does
FitSenseAI is a real-time AI gym trainer that:
- 📷 Watches your form using MoveNet Thunder pose detection (17 keypoints at 15+ FPS) running entirely in your browser via TensorFlow.js
- 📐 Scores every rep 0–100% using a custom angle-constraint engine per movement phase (top/bottom/eccentric/concentric) across 22 exercises
- 🎙️ Coaches you with voice — Sarvam AI (STT + TTS) gives you spoken feedback in Hindi by default, switchable to English
- 🚨 Interrupts mid-exercise when form score drops below 40% — like a trainer physically stopping you to correct your form
- 💬 Answers questions naturally: speak "bhai meri form kaisi hai?" and the AI coach responds contextually using Gemini 2.5 Flash
- 📊 Tracks your history — sessions, rep scores, streak calendar stored in Prisma + SQLite
How we built it
Frontend (React 18 + TypeScript + Vite + TailwindCSS):
- MoveNet Thunder loaded via
@tensorflow-models/pose-detection— runs in-browser, no server round-trip for pose inference - Custom
useFormScoringhook calculates joint angles using law of cosines, compares against per-exercise, per-phase angle constraints useVoiceAgenthook: MediaRecorder buffers audio → streams over WebSocket to backend → plays back TTS audio via Web Audio API
Backend (Node.js + Express + TypeScript):
CoachEngine: rate-limited proactive coaching triggers, bilingual Hindi/English messages,checkUrgentFormInterruption()fires when score < 40 (4s cooldown)geminiClient: Gemini 2.5 Flash via Google GenAI SDK — system prompt with[LANG]tag for bilingual coaching, multi-model fallback chain (2.5-flash → 2.5-flash-lite → 2.0-flash)sarvamClient: Sarvam AI STT (saarika:v2.5) and TTS (bulbul:v2, speakeranushka) — defaulthi-IN- WebSocket
/ws/voice: duplex audio stream, each connection tracks language from?lang=param
Challenges we ran into
- Rep counting accuracy: MoveNet detects both sides of the body — we had to score the "best-side" angle at each frame to avoid counting half-reps
- Hindi voice understanding: Sarvam's STT returned transliterated Roman Hindi ("bhai kitna bacha hai") which required keyword matching in our fallback feedback engine
- Voice interruption timing: making the coach interrupt mid-rep without cutting off the user's own speech required WebSocket state tracking and an AudioContext cancel mechanism
- Phase detection threshold calibration: loosened joint angle thresholds significantly from textbook values — real humans at home have variable body proportions and camera angles
Accomplishments that we're proud of
- A coach that yells at you in Hindi when your squat form breaks — "Arre bhai! Ghutne bahar rakho!" — this genuinely feels like a real trainer
- 22 exercises covering home workouts and gym/dumbbell movements, all with custom biomechanical angle constraints
- The voice agent being interruptible — you can cut off the coach mid-sentence and ask your own question
- Full end-to-end voice pipeline: mic → WebSocket → WAV → Sarvam STT → Gemini prompt → Gemini response → Sarvam TTS → WAV → speaker in under 2 seconds
What we learned
- TensorFlow.js MoveNet runs remarkably well in-browser for a model this size — pose inference at 15+ FPS on a standard laptop with no GPU
- Sarvam AI's Hindi voice output (
bulbul:v2) sounds genuinely natural — far more so than any multilingual TTS we tested - Gemini 2.5 Flash's short-context coaching responses (150 tokens) are fast enough for real-time use when the system prompt is tightly constrained
- Building bilingual AI products for Indian users requires more than just language translation — tone, idioms, and energy all need to match the cultural context
What's next for FitSenseAI
- Google Cloud Run deployment for scalable global access
- Gemini Live API integration — replace the current request-response pattern with true streaming live audio for sub-500ms latency
- Workout plan generation — Gemini creates a weekly plan based on your history and goals
- Multi-person support — track multiple athletes in the same frame
- Mobile PWA — phone camera as portable gym trainer
Built With
- docker
- express.js
- google-gemini-2.5-flash
- google-genai-sdk
- movenet-thunder
- node.js
- prisma
- react-18
- sarvam-ai
- sqlite
- tailwindcss
- tensorflow.js
- typescript
- vite
- websocket
Log in or sign up for Devpost to join the conversation.