We've all sat through mock interviews that felt nothing like the real thing: a friend reading questions off a list, a platform serving the same generic prompts regardless of how you answer. Real interviewers adapt. They follow up on what you said, skip questions you already covered, and evaluate you differently depending on what the role actually demands. We wanted to build a mock interview tool that behaves the way a great human interviewer does, but is available at 2 AM the night before your Stripe final round.

What it does

Horizon Hire is a voice-first AI mock interview platform that adapts its questions in real time based on your answers, evaluates every response across seven weighted dimensions calibrated to each question's purpose, and tracks your weaknesses across sessions to steer future practice.

You upload your resume, select a target role and company, choose an AI interviewer voice, and have a realistic spoken conversation. After the interview, you get:

  • A readiness score calculated as a weighted aggregate: \( \text{readiness} = \sum_{i} w_i \cdot s_i \), where \( w_i \) is the per-criterion weight and \( s_i \) is the normalized score
  • Per-criterion breakdowns with the weights used to calculate them
  • A narrative summary with actionable tips
  • A question-by-question drill-down with specific rationales grounded in what you actually said
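As a rough illustration of the readiness formula above, the sketch below computes the weighted aggregate in Python. The criterion names, the example values, and the requirement that weights sum to 1 are assumptions made for this example; the write-up does not list the seven dimensions or say how the weights are normalized.

```python
from typing import Mapping


def readiness(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted aggregate: the sum of w_i * s_i over all criteria.

    Assumes scores are normalized to [0, 1] and weights sum to 1.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1, got {total_weight}")
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical criteria and values, for illustration only.
example_scores = {"communication": 0.8, "leadership": 0.5, "skills_match": 1.0}
example_weights = {"communication": 0.5, "leadership": 0.3, "skills_match": 0.2}

print(round(readiness(example_scores, example_weights), 2))
```

Rejecting mismatched criteria and unnormalized weights up front keeps a misconfigured question from silently inflating or deflating the score.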

How we built it

Backend: FastAPI + Supabase (auth, PostgreSQL, file storage). Voice powered by ElevenLabs (TTS) and Deepgram (STT with automatic fallback).

Frontend: React + TypeScript + Material UI + Tailwind, with Zustand for state management.

Evaluation engine: Claude API with a system prompt selected through a controlled experiment. We tested three prompt architectures and scored each against hand-authored ground-truth observations using an LLM-as-judge:

  • Prompt A (Coach): coaching feedback directed at the candidate → 48% strict recall, 62% weighted recall
  • Prompt B (Open): scored criteria + open-ended observations → 59% strict recall, 76% weighted recall
  • Prompt C (Flagged): scored criteria + predefined flag taxonomy → 56% strict recall, 71% weighted recall

We chose Prompt B, the open-ended evaluator that lets the model surface whatever it notices rather than constraining it to predefined categories.
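The write-up does not define strict versus weighted recall, so the following is only one plausible reading, sketched in Python: the judge labels each ground-truth observation as fully caught, partially caught, or missed; strict recall counts only full matches, while weighted recall gives partial matches half credit. The label names and the 0.5 credit are assumptions, not the scheme actually used in the experiment.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed judge labels; the credit given to a partial match is a guess.
CREDIT = {"full": 1.0, "partial": 0.5, "missed": 0.0}


def recall_scores(verdicts: Iterable[str]) -> Tuple[float, float]:
    """Return (strict_recall, weighted_recall) for one prompt.

    Each verdict is the judge's label for one ground-truth observation.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - set(CREDIT)
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one verdict")
    strict = counts["full"] / total
    weighted = sum(CREDIT[label] * n for label, n in counts.items()) / total
    return strict, weighted


# Four ground-truth observations: two caught, one partially caught, one missed.
print(recall_scores(["full", "full", "partial", "missed"]))
```

Under this reading, weighted recall can never fall below strict recall, which matches the direction of the gap in the numbers reported above.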

Adaptive questioning: After each answer, the LLM checks whether the candidate has already addressed any question still in the queue. Questions that were already covered are removed from the remaining list, so the path through an interview depends on what the candidate actually says.
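A minimal sketch of that pruning step, assuming the coverage check is exposed as a callable. In the real system that check is an LLM call; here `toy_is_covered` is a hypothetical keyword stub so the example runs on its own, and the `Question` type and its `topic` field are likewise invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Question:
    text: str
    topic: str  # hypothetical tag, used only by the toy stub below


def prune_covered(
    answer: str,
    remaining: List[Question],
    is_covered: Callable[[str, Question], bool],
) -> List[Question]:
    """Drop every remaining question the latest answer already addressed."""
    return [q for q in remaining if not is_covered(answer, q)]


def toy_is_covered(answer: str, question: Question) -> bool:
    # Stand-in for the LLM judgment: a plain keyword match on the topic tag.
    return question.topic.lower() in answer.lower()


queue = [
    Question("Tell me about a conflict with a teammate.", "conflict"),
    Question("Tell me about a time you missed a deadline.", "deadline"),
]
answer = "I resolved a conflict over API ownership by proposing a shared schema."
queue = prune_covered(answer, queue, toy_is_covered)
print([q.topic for q in queue])
```

Passing the check in as a parameter keeps the pruning logic testable without a network call.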

Challenges we ran into

Getting the evaluation prompt right was harder than building the app. Our first attempt used predefined flag categories (authenticity concerns, accountability issues, communication issues) to structure feedback. A controlled experiment against \( n = 14 \) sample responses with 180+ hand-authored ground-truth observations showed that this cost us 11 percentage points of recall on weak answers compared to the open-ended approach:

$$\Delta_{\text{bad answers}} = 71\% - 60\% = 11\,\text{pp}$$

The taxonomy was causing the model to miss observations that didn't fit neatly into buckets — tense shifts, self-image tells, specific phrasings that reveal how a candidate thinks about the question.

We also wrestled with voice reliability: STT transcripts sometimes come back incoherent, so we built a flagged-response flow that shows the user what was captured and lets them retry (practice mode) or accept (simulation mode). Getting Supabase storage uploads to work with Row Level Security required switching to a service-role key for backend operations.
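The flagged-response flow could be sketched roughly as follows. How incoherent transcripts are actually detected is not described here, so `looks_incoherent`, its thresholds, and the assumption that the STT provider supplies a confidence value are all hypothetical; only the retry-in-practice and accept-in-simulation split comes from the description above.

```python
from enum import Enum


class Mode(Enum):
    PRACTICE = "practice"
    SIMULATION = "simulation"


def looks_incoherent(
    transcript: str,
    confidence: float,
    min_confidence: float = 0.6,  # hypothetical threshold
    min_words: int = 3,  # hypothetical threshold
) -> bool:
    """Flag transcripts that are very short or that the STT engine was unsure about."""
    return confidence < min_confidence or len(transcript.split()) < min_words


def next_action(transcript: str, confidence: float, mode: Mode) -> str:
    """Decide what the UI does with a freshly captured answer."""
    if not looks_incoherent(transcript, confidence):
        return "continue"
    # Flagged: the user is shown what was captured in both modes.
    if mode is Mode.PRACTICE:
        return "offer_retry"
    return "confirm_and_accept"


print(next_action("um the", 0.9, Mode.PRACTICE))
print(next_action("I led the migration to Postgres", 0.95, Mode.SIMULATION))
```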

Accomplishments that we're proud of

The prompt experiment: Before writing evaluation code, we built 7 behavioral questions with paired good/bad answers, hand-authored ground-truth observations (no category bias — deliberately), ran all three prompts through a blind LLM judge, and made a data-driven decision. The winning prompt catches 71% of ground-truth observations on weak answers versus 60% for the structured alternative. That gap is where coaching actually matters — strong answers are easy to evaluate, but surfacing what went wrong in a vague or evasive response is the hard problem.

End-to-end voice interaction: The entire system — setup, live interview with TTS/STT, immediate feedback, post-interview results with question-by-question drill-down — works with real spoken conversation, not text boxes.

What we learned

  • Constraining an LLM with predefined categories feels like it should improve consistency, but it costs coverage. The open-ended prompt surfaces observations about tone, pacing, and word choice that a structured rubric has no bucket for.
  • Interview evaluation is a recall problem, not a precision problem: \( \text{catching 10 real observations with 2 false positives} > \text{catching 6 with 0} \).
  • Building voice-first means treating transcript quality as a first-class UX concern: the flagged-response system isn't an edge-case handler; it's core to making the product trustworthy.
  • Per-question weight calibration matters: a behavioral question about conflict resolution should weight leadership at \( w = 0.3 \) and zero out skills match, while a system design question does the opposite.
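To make the weight-calibration point concrete, here is a small sketch of per-question weight profiles. Only the leadership weight of 0.3 and the zeroed skills-match weight for the behavioral question come from the text; reading "the opposite" as a 0.3 skills-match weight with leadership zeroed is an interpretation, and the other dimension names and values are placeholders (the real system uses seven dimensions).

```python
import math
from typing import Dict

# Hypothetical profiles; only the leadership and skills_match entries for the
# behavioral conflict-resolution question are taken from the write-up.
WEIGHT_PROFILES: Dict[str, Dict[str, float]] = {
    "behavioral_conflict": {
        "leadership": 0.3,
        "skills_match": 0.0,
        "communication": 0.4,
        "problem_solving": 0.3,
    },
    "system_design": {
        "leadership": 0.0,
        "skills_match": 0.3,
        "communication": 0.3,
        "problem_solving": 0.4,
    },
}


def validate_profiles(profiles: Dict[str, Dict[str, float]]) -> None:
    """Check that every profile uses the same dimensions and sums to 1."""
    dimension_sets = {frozenset(weights) for weights in profiles.values()}
    if len(dimension_sets) > 1:
        raise ValueError("all profiles must use the same dimensions")
    for name, weights in profiles.items():
        if any(w < 0 for w in weights.values()):
            raise ValueError(f"{name}: weights must be non-negative")
        if not math.isclose(sum(weights.values()), 1.0):
            raise ValueError(f"{name}: weights must sum to 1")


validate_profiles(WEIGHT_PROFILES)
print(WEIGHT_PROFILES["behavioral_conflict"]["leadership"])
```

Validating the profiles once at startup means a single miscalibrated question type is caught before it skews any candidate's scores.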

What's next for Horizon Hire

  • Animated interviewer avatars with lip-sync via Tavus
  • Conversational feedback mode — ask follow-up questions about your evaluation
  • Company-specific question banks seeded from real interview reports
  • Career trajectory mapping — readiness scores vs. target role requirements
  • Multiplayer mode — two candidates interview each other with AI evaluation on both sides

Built With

  • claude-api-(anthropic)
  • deepgram
  • elevenlabs
  • fastapi
  • material-ui
  • python
  • react
  • supabase-(postgresql, auth, storage)
  • tailwind-css
  • typescript
  • vite
  • zustand