Visionary Recruiter

The Next-Generation Multimodal AI Interview Coach


Inspiration

Technical interviews are notoriously broken.

For years, candidates have prepared with static text boxes, LeetCode grinding, and non-interactive video rubrics. But real interviews aren’t just about the code you write; they are about how you communicate under pressure, how you maintain your composure, and how you articulate the "why" behind your decisions.

We were inspired to build Visionary Recruiter when we saw the capabilities of the Gemini Multimodal Live API.

For the first time, we realized we could create an AI that doesn't just read your answers—it actually hears the hesitation in your voice, sees your body language through the webcam, and interrupts you when you ramble.

We wanted to build a high-fidelity emotional and technical simulation that genuinely prepares people for the intensity of elite tech interviews.


What it does

Visionary Recruiter is a real-time, multimodal AI interview coach.

It acts as a realistic "Senior Recruiter" named Sarah, interacting through a sub-second latency WebSocket connection.

The system evaluates both technical thinking and human communication signals.


Contextual Intelligence

You upload your resume, and the AI dynamically generates its first question based on your actual past experience.
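In practice, this is prompt templating. A minimal sketch, assuming the resume has already been parsed to plain text; buildSystemInstruction and the wording are our own, not a library API:

```ts
// Sketch (assumed names): fold the parsed resume text into the system
// instruction so Sarah's opening question is grounded in real experience.
function buildSystemInstruction(resumeText: string): string {
  return [
    "You are Sarah, a senior technical recruiter.",
    "Open the interview with ONE question grounded in the resume below.",
    "--- RESUME ---",
    resumeText,
  ].join("\n");
}
```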


Multimodal Evaluation

Using your webcam, it tracks:

  • Body language
  • Posture

If you enter the Technical Dive track, you can literally draw a system architecture diagram on a piece of paper, hold it up to the camera, and Gemini will visually grade your design logic.
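Under the hood, that means snapshotting the webcam feed and shipping it as an image. A minimal sketch of the frame capture, where videoEl is the <video> element bound to getUserMedia:

```ts
// Grab the current webcam frame as a base64 JPEG so Gemini can visually
// grade a hand-drawn architecture diagram held up to the camera.
function captureFrame(videoEl: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = videoEl.videoWidth;
  canvas.height = videoEl.videoHeight;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(videoEl, 0, 0);
  // Drop the "data:image/jpeg;base64," prefix; the API wants raw base64.
  return canvas.toDataURL("image/jpeg", 0.8).split(",")[1];
}
```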


Live Telemetry & STAR Analysis

As you speak, the AI continuously executes function calls such as update_interview_metrics().

Your answer is parsed through the STAR framework:

[ STAR = Situation + Task + Action + Result ]

The system updates a live dashboard (payload shape sketched below) showing:

  • Confidence score
  • Articulation quality
  • STAR completion progress
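The payload behind update_interview_metrics() maps directly onto those dashboard fields. A minimal sketch of its shape; the field names here are illustrative, not our exact schema:

```ts
// Illustrative shape of the metrics payload driving the live dashboard.
interface InterviewMetrics {
  confidence: number;   // 0-100 confidence score
  articulation: number; // 0-100 articulation quality
  star: {               // STAR completion progress
    situation: boolean;
    task: boolean;
    action: boolean;
    result: boolean;
  };
}
```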

Reverse Q&A

We implemented a Wrap-Up phase where the AI prompts the user:

“Do you have any questions for me?”

This evaluates the candidate's insightfulness, mirroring real interviews.


How we built it

We architected Visionary Recruiter to run entirely client-side to minimize latency and make the interview feel genuinely live.


Frontend

We built a rich glassmorphism UI using:

  • React
  • Tailwind CSS
  • Framer Motion

The product follows a 4-stage interview journey:

[ Landing \rightarrow Setup \rightarrow Live\ Room \rightarrow Debrief ]


Audio Pipeline

We leveraged the Web Audio API and created a custom AudioWorklet.

The system (sketched below):

  • Captures microphone input at 16 kHz
  • Applies an RMS-based Voice Activity Detection (VAD) noise gate
  • Streams raw PCM audio instantly
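A minimal sketch of that capture path, assuming a worklet processor registered under the name "pcm-capture" (our own name) that posts Float32 chunks back to the main thread:

```ts
// Capture mic audio at 16 kHz via an AudioWorklet and convert each
// Float32 chunk to signed 16-bit PCM for streaming.
async function startCapture(onChunk: (pcm: Int16Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  await ctx.audioWorklet.addModule("pcm-capture-worklet.js"); // assumed file
  const source = ctx.createMediaStreamSource(stream);
  const worklet = new AudioWorkletNode(ctx, "pcm-capture");
  worklet.port.onmessage = (e: MessageEvent<Float32Array>) => {
    const f32 = e.data;
    const i16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      // Clamp and scale [-1, 1] floats to 16-bit samples.
      i16[i] = Math.max(-32768, Math.min(32767, Math.round(f32[i] * 32767)));
    }
    onChunk(i16);
  };
  source.connect(worklet);
}
```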

The Brain

We established a bidirectional WebSocket connection to the Gemini 2.0 Flash Multimodal Live API.

We simultaneously stream:

  • PCM audio data
  • Base64-encoded webcam frames

This allows Gemini to see and hear the candidate in real time.
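A sketch of the two outbound streams. The message shape follows what the Multimodal Live API documented for gemini-2.0-flash-exp at the time we built this; treat the field names as illustrative if the API has since changed:

```ts
// Stream one PCM audio chunk over the Live API WebSocket.
function sendAudio(ws: WebSocket, pcm: Int16Array) {
  const bytes = new Uint8Array(pcm.buffer);
  let bin = "";
  for (const b of bytes) bin += String.fromCharCode(b);
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: btoa(bin) }],
    },
  }));
}

// Stream one webcam frame (already base64-encoded JPEG).
function sendFrame(ws: WebSocket, base64Jpeg: string) {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "image/jpeg", data: base64Jpeg }],
    },
  }));
}
```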


Tool Calling

We defined a strict JSON schema inside our function_declarations; a sketch follows the list below.

Gemini triggers these functions to update the UI instantly.

Example tool: update_interview_metrics()

This powers:

  • STAR breakdown telemetry
  • Confidence scoring
  • Real-time interview analytics
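A sketch of the declaration we register in the session setup message; the parameter names mirror our dashboard fields, and the exact schema wording here is illustrative:

```ts
// Tool declaration Gemini can trigger to push telemetry to the UI.
const functionDeclarations = [{
  name: "update_interview_metrics",
  description: "Push live grading telemetry to the candidate dashboard.",
  parameters: {
    type: "OBJECT",
    properties: {
      confidence:   { type: "NUMBER", description: "0-100 confidence score" },
      articulation: { type: "NUMBER", description: "0-100 articulation quality" },
      star_situation: { type: "BOOLEAN" },
      star_task:      { type: "BOOLEAN" },
      star_action:    { type: "BOOLEAN" },
      star_result:    { type: "BOOLEAN" },
    },
    required: ["confidence", "articulation"],
  },
}];
```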

Challenges we ran into


Audio Artifacts & Noise

Sending raw microphone data caused Gemini to hear its own voice echo, leading to hallucinations.

We solved this by building a Voice Activity Detection (VAD) system using Root Mean Square (RMS) calculations in JavaScript.

[ RMS = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2} ]

Only audio chunks where the user is actively speaking are transmitted.
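A minimal gate matching the formula above; the threshold value is illustrative (we tuned ours by ear):

```ts
const VAD_THRESHOLD = 0.01; // illustrative; tuned empirically

// Compute per-chunk RMS energy and forward only chunks above threshold.
function isSpeech(chunk: Float32Array): boolean {
  let sumSquares = 0;
  for (let i = 0; i < chunk.length; i++) sumSquares += chunk[i] * chunk[i];
  return Math.sqrt(sumSquares / chunk.length) > VAD_THRESHOLD;
}
```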


Function Calling Overload

If Gemini attempted to grade too many metrics simultaneously, the JSON tool calls were sometimes truncated or malformed.

We stabilized the system by:

  • Enforcing strict JSON schemas
  • Injecting a CRITICAL SYSTEM INSTRUCTION prioritizing tool-execution structure (illustrative version below)
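An illustrative version of that instruction (our production wording differs):

```ts
// Hypothetical wording; the real prompt is longer and tuned per track.
const CRITICAL_SYSTEM_INSTRUCTION = `
CRITICAL: When calling update_interview_metrics, emit ONE complete,
schema-valid JSON object. Never truncate arguments, never grade fields
outside the declared schema, and never mix tool calls with spoken output.
`;
```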

Multilingual Hallucinations

When users mumbled, the Live API occasionally attempted to respond in other languages.

We fixed this by hardcoding system prompts that restrict both processing and output to English (US).


Accomplishments that we're proud of

  • Successfully integrating concurrent multimodal streams (audio + visual evaluation of whiteboard drawings).
  • Achieving sub-second latency, making the AI feel like a real, impatient recruiter.
  • Designing a futuristic telemetry dashboard that visualizes complex AI function calls like Live STAR breakdowns.

What we learned

Building Visionary Recruiter taught us the massive difference between REST APIs and bidirectional streaming AI systems.


State Management in Streams

Handling asynchronous toolCall responses mixed with audio streaming buffers in React without causing render thrashing required careful state design.
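The pattern we converged on is sketched below (the hook name is ours): buffer incoming payloads in a ref and flush them to React state once per animation frame, so a burst of tool calls triggers a single render.

```ts
import { useEffect, useRef, useState } from "react";

// Buffer high-frequency toolCall payloads and commit at most one state
// update per animation frame to avoid render thrashing.
function useMetricsBuffer<T>() {
  const queue = useRef<T[]>([]);
  const [latest, setLatest] = useState<T | null>(null);

  useEffect(() => {
    let raf = 0;
    const flush = () => {
      if (queue.current.length > 0) {
        // Only the newest payload matters for a live dashboard.
        setLatest(queue.current[queue.current.length - 1]);
        queue.current = [];
      }
      raf = requestAnimationFrame(flush);
    };
    raf = requestAnimationFrame(flush);
    return () => cancelAnimationFrame(raf);
  }, []);

  return { latest, push: (p: T) => queue.current.push(p) };
}
```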


Prompt Engineering for Live Agents

Traditional prompting does not work well for live audio agents.

Instead, Live Agents require short and strict instructions, such as:

"Keep responses to 2–3 sentences max."

Otherwise the AI becomes too verbose and robotic.


What's next for Visionary Recruiter


Persistent Learner Profiles

Store user interviews in a database (e.g., Firebase) to track STAR improvement over 30 days.


More Tracks & Personas

Adding specialized interview tracks:

  • Product Management
  • Data Science

Each with different AI recruiter personalities and voices.


Video Export Summaries

Allow candidates to export a split-screen interview recording showing:

  • The candidate
  • The AI interviewer
  • The real-time grading overlay

This can be shared with career counselors or mentors.

Built With

  • framer-motion
  • gemini-flash-2.0-multimodal
  • lucide-react
  • prompt-engineering
  • vad
  • web-audio-api