Inspiration
Technical interviews are high-stakes but hard to practice. Mock interviews with friends are hard to schedule, platforms like LeetCode give no conversational feedback, and hiring a coach is expensive. I wanted to build something that feels like a real interview, where you speak out loud, write actual code, and get honest feedback, available any time and for free.
What it does
AI Mock Interviewer conducts a full technical interview in real time using voice and vision. You speak to an AI interviewer named Alex, answer concept questions out loud, write code in your editor while Alex watches your screen, and receive a scored coaching report the moment the interview ends. The entire session from greeting to feedback report runs autonomously with no human involvement.
How we built it
The core is Google Gemini Live for real-time bidirectional audio and vision. The browser captures microphone audio at 16kHz via an AudioWorklet and screen frames as JPEGs every 5 seconds; both are streamed to a FastAPI backend over WebSocket. The backend forwards audio and video to Gemini Live and plays back the AI's response at 24kHz.

The hardest part was the audio pipeline. Gemini Live requires explicit stream_end signals to know when the user has finished speaking, so we built a custom VAD system from scratch: computing RMS energy on every 256ms PCM chunk, detecting sustained silence, applying a cooldown after the AI finishes speaking so echo doesn't trigger false stream_ends, and handling barge-in when the candidate speaks over Alex.

The second challenge was making the interview follow a reliable structure. A tool-based state machine enforces the correct sequence (concept questions, coding phase, wrap-up) by blocking tool calls that fire out of order and returning error messages so Gemini self-corrects. Five concurrent async tasks per session handle audio sending, audio receiving, VAD, stuck-candidate nudging, and timer watching. The final report is generated by Gemini 2.5 Flash and rendered as markdown in the browser.
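The silence-detection part of the VAD can be sketched roughly like this. This is a minimal illustration, not our actual code: the class name, thresholds, and chunk sizes here are hypothetical placeholders.

```python
import struct


class SilenceDetector:
    """Toy RMS-based voice activity detector (illustrative thresholds)."""

    def __init__(self, silence_threshold=500.0,
                 silence_chunks_needed=4, cooldown_chunks=8):
        self.silence_threshold = silence_threshold          # RMS below this counts as silence
        self.silence_chunks_needed = silence_chunks_needed  # sustained silence before stream_end
        self.cooldown_chunks = cooldown_chunks              # ignore input right after AI speech (echo guard)
        self.silent_run = 0
        self.cooldown = 0

    def on_ai_finished_speaking(self):
        # Start the echo-guard cooldown window so playback bleed
        # doesn't trigger a false stream_end.
        self.cooldown = self.cooldown_chunks
        self.silent_run = 0

    def rms(self, pcm_chunk: bytes) -> float:
        # 16-bit little-endian PCM -> root-mean-square energy.
        samples = struct.unpack(f"<{len(pcm_chunk) // 2}h", pcm_chunk)
        if not samples:
            return 0.0
        return (sum(s * s for s in samples) / len(samples)) ** 0.5

    def process(self, pcm_chunk: bytes) -> bool:
        """Return True when a stream_end signal should be sent."""
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        if self.rms(pcm_chunk) < self.silence_threshold:
            self.silent_run += 1
            if self.silent_run >= self.silence_chunks_needed:
                self.silent_run = 0
                return True
        else:
            self.silent_run = 0  # any speech resets the silence run
        return False
```

In the real pipeline this runs inside an async task per session; barge-in is a separate path that interrupts playback as soon as speech energy is detected.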
Challenges we ran into
- 1011 WebSocket errors - Gemini Live closes idle connections. Fixed with a keepalive loop sending silent audio every 20 seconds with guards to prevent it firing during tool calls or active speech.
- Duplicate audio - Gemini sometimes sends the same sentence twice as separate audio turns. Fixed by comparing each turn's transcript against the previous and sending an interrupted signal to the browser to discard duplicates.
- Model vocalising tool calls - Gemini occasionally speaks "log behavioral note" aloud instead of calling the tool silently. Fixed with transcript cleaning that strips function-call syntax and auto-rescue that detects and executes spoken tool calls directly in Python.
- Premature interview endings - Without guardrails, Gemini would call end_interview after 30 seconds. Fixed with a closing_spoken gate that only unblocks end_interview after a turn containing actual closing words like "thanks for your time."
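The tool-gating pattern behind these fixes can be sketched as a small phase machine. This is a simplified illustration under assumed names: the phase labels, tool names, and error strings are hypothetical, not the project's actual implementation.

```python
from enum import Enum, auto


class Phase(Enum):
    CONCEPTS = auto()
    CODING = auto()
    WRAP_UP = auto()


# Which tools are legal in each phase (hypothetical tool names).
ALLOWED = {
    Phase.CONCEPTS: {"ask_concept_question", "advance_to_coding"},
    Phase.CODING:   {"present_problem", "advance_to_wrap_up"},
    Phase.WRAP_UP:  {"end_interview"},
}

CLOSING_WORDS = ("thanks for your time", "that concludes")


class InterviewState:
    def __init__(self):
        self.phase = Phase.CONCEPTS
        self.closing_spoken = False  # gate: end_interview blocked until a real closing turn

    def note_ai_turn(self, transcript: str):
        # Unblock end_interview only after a genuine closing sentence.
        if any(w in transcript.lower() for w in CLOSING_WORDS):
            self.closing_spoken = True

    def call_tool(self, name: str) -> str:
        if name not in ALLOWED[self.phase]:
            # Returning an error string (instead of raising) lets the
            # model read it and self-correct on the next turn.
            return f"ERROR: {name} is not allowed during {self.phase.name}"
        if name == "end_interview" and not self.closing_spoken:
            return "ERROR: say a proper closing before calling end_interview"
        if name == "advance_to_coding":
            self.phase = Phase.CODING
        elif name == "advance_to_wrap_up":
            self.phase = Phase.WRAP_UP
        return "OK"
```

The key design choice is that out-of-order calls return errors rather than raise exceptions, so the model sees the message in its tool result and retries in the correct order.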
Accomplishments that we're proud of
A fully autonomous AI interviewer that conducts a real technical interview from greeting to scored report - completely hands-free, in real time, with voice and vision. The VAD pipeline handles natural speech with pauses, barge-in, and echo cancellation without any third-party VAD library. The tool state machine reliably enforces interview structure across every session.
What we learned
Gemini Live is powerful, but it requires careful state management on the application side. The model is capable of conducting a great interview when given the right guardrails; the challenge is building those guardrails reliably in an async real-time environment where timing matters at the millisecond level. Pure "let the LLM decide everything" doesn't work in production: structured tool validation and state machines are essential for agentic systems that need to follow a reliable flow.
What's next for AI Mock Interviewer
- Support for multiple interview types: system design, behavioral-only, frontend
- Session history so candidates can track improvement over time across multiple sessions
- Configurable difficulty levels: junior, mid, senior
- Multi-language support for non-English speakers
- Integration with job descriptions: paste a JD and Alex tailors the interview to that specific role
Built With
- fastapi
- gemini-2.5-flash
- gemini-live
- google-adk
- google-cloud-run
- python
- web-audio-api
- websockets