Inspiration
I wanted to build a lightweight, voice-first learning companion that makes short, focused practice sessions easy and engaging anywhere — especially on phones. The project was inspired by conversational tutors, accessibility-first design, and the idea that regular micro-practice beats occasional long sessions.
What it does
Core idea: An interactive voice agent that listens, transcribes, responds with audio, and tracks learner progress.
Features: Live mic input, streaming WebSocket audio, synthesized audio responses, photo upload for progress checks, and a small progress/XP system.
User flow: User taps the mic → audio streams to the server → server transcribes and reasons → server sends back a transcript and audio responses → client plays the audio and updates progress.
How we built it
Backend: Fast, async WebSocket endpoints (served by uvicorn/ASGI) that accept PCM audio frames and stream responses.
Frontend: Single-page UI with a mic button, canvas waveform, and camera capture. Uses navigator.mediaDevices.getUserMedia() and an AudioContext pipeline to capture and convert audio.
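The AudioContext pipeline produces Float32 samples that have to be converted to 16-bit PCM before streaming to the backend. A minimal sketch of that conversion (the function name and exact framing are illustrative, not the app's actual code), assuming samples in [-1.0, 1.0]:

```python
import struct

def float32_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp out-of-range samples
        ints.append(int(s * 32767))  # scale into the int16 range
    return struct.pack("<%dh" % len(ints), *ints)
```

Each frame of converted bytes can then be sent as a binary WebSocket message, keeping frames small for low latency.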
Integration: Images are captured as base64 and sent over the existing WebSocket; progress is persisted via a small REST API.
Dev stack: Python for server logic, WebSockets for low-latency streaming, modern vanilla JS for the client, and Docker for deployment.
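Sending images over the same WebSocket means wrapping the base64 payload in a small message envelope so the server can tell audio frames and photos apart. A sketch of that envelope (the "type"/"mime"/"data" field names are assumptions for illustration, not the app's actual schema):

```python
import base64
import json

def image_message(image_bytes, mime="image/jpeg"):
    """Wrap raw image bytes in a JSON envelope for the shared WebSocket.

    Binary audio frames stay as binary messages; images travel as text
    messages carrying base64 data, so one connection serves both.
    """
    payload = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"type": "image", "mime": mime, "data": payload})
```

On the server, `json.loads` plus `base64.b64decode` recovers the original bytes for the progress check.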
Challenges we ran into
Permissions & secure context: Mobile browsers require a secure origin (HTTPS) and a user gesture before prompting for microphone access. During early deploys we mistakenly connected to ws://localhost, which failed to open the socket, so the mic prompt never appeared on phones.
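The fix is to derive the WebSocket scheme from the page's own origin rather than hard-coding it. A small helper sketching that rule (the `/ws` path is an assumption for illustration):

```python
from urllib.parse import urlsplit

def websocket_url(page_origin, path="/ws"):
    """Pick wss:// for HTTPS pages and ws:// otherwise.

    Browsers block mixed-content ws:// connections from an HTTPS page,
    which on mobile also means the mic prompt never fires.
    """
    parts = urlsplit(page_origin)
    scheme = "wss" if parts.scheme == "https" else "ws"
    return f"{scheme}://{parts.netloc}{path}"
```

The equivalent client-side check is reading `location.protocol` and choosing `wss:` whenever it is `https:`.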
Accomplishments that we're proud of
Full end-to-end voice loop: Real-time mic capture → server processing → immediate audio reply with smooth playback and synchronized transcripts.
Mobile-first UI: Polished waveform, large mic affordance, and camera capture for progress checks.
Robust offline-friendly design: Graceful fallbacks when camera/mic are blocked; XP system and lightweight persistence for user continuity.
What we learned
Secure origins and the correct WebSocket scheme (wss) are essential for mobile access and microphone permission flows. Keep secrets out of images: use provider environment variables (Render/Heroku) instead of committing .env files. Small, focused UI touches (a big mic button, clear status) greatly increase usability on mobile.
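Reading secrets from provider-injected environment variables keeps them out of the Docker image entirely. A minimal sketch, with a loud failure when a variable is missing (the helper name and variable name are illustrative):

```python
import os

def require_env(name):
    """Read a secret from the environment (e.g. Render/Heroku config vars).

    Failing fast at startup beats a confusing runtime error later when a
    provider variable was never set, and nothing sensitive ends up in git
    or baked into the container image.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```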
What's next for Jarvis
Add adaptive curricula using session analytics (recommendations based on weak skills). Improve robustness: reconnect logic, backpressure handling for high-latency networks, and smaller audio frames for poor mobile networks.
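The planned reconnect logic would likely use capped exponential backoff so flaky mobile networks don't hammer the server. A sketch of the delay schedule (base and cap values are illustrative assumptions, not tuned figures):

```python
def backoff_delays(attempts, base=0.5, cap=30.0):
    """Exponential backoff schedule (in seconds) for WebSocket reconnects.

    Delays double each attempt but are capped, so a long outage never
    produces an unbounded wait before the next retry.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

In practice a small random jitter is usually added to each delay so many clients don't reconnect in lockstep after an outage.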
Built With
- api
- css
- genai
- html5
- javascript
- python
- uvicorn
- websockets