Inspiration

Screening calls are time-consuming, inconsistent, and hard to scale. Recruiters told us their biggest bottleneck is first-round signal, not sourcing. Candidates said the process feels subjective and repetitive. We set out to build a privacy-first, multimodal interviewer that feels human, scores transparently, and only forwards the most promising applicants—without storing raw video by default.

What it does

Simulated Interviewer: Role-aware questions (role fit, project deep-dive, behavioral, light technical), with targeted follow-ups.

Speech + CV signals (on-device):

Speech: clarity, relevance, completeness (via LLM on transcripts).

Computer vision: eye-contact %, blink rate, head stability, posture lean (MediaPipe).

Optional Roboflow snapshots: attentive / distracted / away / second-face flag.

Resume parsing: Extracts skills/experience to contextualize questions and scoring.

Scoring & Summary: Per-answer and final scores, concise candidate profile, pass/fail vs. job threshold.

Recruiter dashboard: Transcript, per-metric charts, flags, one-click Forward or Reject with feedback.
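As a sketch of how per-answer metrics roll up into the final pass/fail decision: LLM speech scores and CV aggregates are combined with a weighted sum, averaged across answers, and compared to the job's threshold. The weights, field names, and threshold below are illustrative, not our production values.

```typescript
// Illustrative rubric aggregation. LLM scores are 0–5, CV aggregates 0–1;
// everything is normalized to 0–1 so all signals share one scale.
interface AnswerSignals {
  clarity: number;       // LLM, 0–5
  relevance: number;     // LLM, 0–5
  completeness: number;  // LLM, 0–5
  eyeContactPct: number; // CV, 0–1
}

const WEIGHTS = { clarity: 0.3, relevance: 0.3, completeness: 0.25, eyeContactPct: 0.15 };

function answerScore(s: AnswerSignals): number {
  return (
    WEIGHTS.clarity * (s.clarity / 5) +
    WEIGHTS.relevance * (s.relevance / 5) +
    WEIGHTS.completeness * (s.completeness / 5) +
    WEIGHTS.eyeContactPct * s.eyeContactPct
  );
}

function finalDecision(
  answers: AnswerSignals[],
  threshold = 0.6
): { score: number; pass: boolean } {
  const score = answers.reduce((sum, a) => sum + answerScore(a), 0) / answers.length;
  return { score, pass: score >= threshold };
}
```

Because the weights live in one place, recruiters can tune them per role without touching the pipeline.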

How we built it

Frontend: Next.js + Tailwind (single app with role-gated routes), getUserMedia for mic/cam, MediaRecorder per-answer.

On-device CV: MediaPipe Tasks (Face & Pose Landmarker) at ~10 FPS → feature aggregates only (no frames).
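The aggregate-only approach looks roughly like this: per-frame readings derived from landmarks are reduced in the browser to a handful of summary numbers, and only those summaries ever leave the device. The field names and derivations here are an illustrative sketch, not our exact feature set.

```typescript
// Per-frame readings derived from MediaPipe landmarks (illustrative shape).
interface FrameReading {
  gazeOnScreen: boolean; // from iris/face landmarks
  eyesClosed: boolean;   // from eye-aspect ratio
  headX: number;         // normalized nose x-position, 0–1
}

interface BehaviorAggregates {
  eyeContactPct: number; // fraction of frames with on-screen gaze
  blinkRate: number;     // blinks per minute
  headStability: number; // 1 − std dev of head x (higher = steadier)
}

function aggregate(frames: FrameReading[], durationSec: number): BehaviorAggregates {
  const onScreen = frames.filter((f) => f.gazeOnScreen).length;
  // Count closed→open transitions as blinks.
  let blinks = 0;
  for (let i = 1; i < frames.length; i++) {
    if (frames[i - 1].eyesClosed && !frames[i].eyesClosed) blinks++;
  }
  const mean = frames.reduce((s, f) => s + f.headX, 0) / frames.length;
  const variance =
    frames.reduce((s, f) => s + (f.headX - mean) ** 2, 0) / frames.length;
  return {
    eyeContactPct: onScreen / frames.length,
    blinkRate: (blinks / durationSec) * 60,
    headStability: 1 - Math.sqrt(variance),
  };
}
```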

(Optional) Roboflow: Lightweight browser classification for environment check & engagement snapshots (1 FPS).

Speech-to-Text: Batch per answer (Whisper / Google STT) via Firebase Cloud Functions.

LLM: Interviewer follow-ups + rubric scoring (prompted to return strict JSON).
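Even when prompted for strict JSON, models occasionally wrap output in markdown fences or drift out of range, so we validate before trusting a score. A minimal sketch of that defensive parse (field names and the 0–5 range are illustrative):

```typescript
// Defensive parsing of an LLM rubric response: extract the JSON object,
// parse it, and check shape and ranges. Returns null on any failure so
// the caller can retry or fall back.
interface RubricScore {
  clarity: number;
  relevance: number;
  completeness: number;
  followUp: string;
}

function parseRubric(raw: string): RubricScore | null {
  // Tolerate models that wrap JSON in markdown fences.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  let obj: unknown;
  try {
    obj = JSON.parse(match[0]);
  } catch {
    return null;
  }
  const r = obj as Record<string, unknown>;
  const inRange = (v: unknown): v is number =>
    typeof v === "number" && v >= 0 && v <= 5;
  if (!inRange(r.clarity) || !inRange(r.relevance) || !inRange(r.completeness)) return null;
  if (typeof r.followUp !== "string") return null;
  return r as unknown as RubricScore;
}
```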

Backend: Firebase Auth, Firestore (interviews/messages/scores), Storage (optional raw audio), Functions for STT/LLM/scoring.

Charts: Chart.js for behavior/time series.

Privacy: Store transcripts + aggregates; raw media off by default with explicit toggle.

Challenges we ran into

Browser quirks: Autoplay/permissions, device differences, keeping stable FPS.

Latency budget: Balancing the simplicity of batch STT against user-perceived responsiveness.

Normalization: Head-stability readings vary by camera and field of view; we added min–max normalization and sanity gates.
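The min–max step with a sanity gate looks roughly like this; the bounds passed in are illustrative calibration values, not the ones we shipped.

```typescript
// Min–max normalization with a sanity gate: readings outside a plausible
// physical range are flagged as likely sensor noise and clamped rather
// than allowed to skew the score.
function normalizeWithGates(
  value: number,
  min: number,
  max: number,
  saneMin: number,
  saneMax: number
): { score: number; flagged: boolean } {
  const flagged = value < saneMin || value > saneMax;
  const clamped = Math.min(Math.max(value, min), max);
  return { score: (clamped - min) / (max - min), flagged };
}
```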

Fairness: Avoiding demographic proxies; keeping only behavior aggregates and exposing the rubric.

Model hosting: Serving .task files locally to avoid CDN hiccups.

Accomplishments that we're proud of

Shipped a true multimodal pipeline end-to-end in under 24 hours.

Privacy-first CV (features, not pixels) that still produces useful signals.

Transparent, math-backed scoring that recruiters can tune.

Clean, role-gated UX: candidate flow → recruiter dashboard with actionable summaries.

What we learned

Multimodal ≠ heavy: On-device landmarks + simple aggregates provide strong signal.

Prompting is product: JSON schemas + short, role-aware prompts keep LLM outputs reliable.

Latency > complexity: Batch STT per answer hits the sweet spot for an MVP.

Explainability builds trust: Showing how scores are computed reduces skepticism.

What’s next for Round1

Voice & flow: Streaming TTS for the interviewer; smarter follow-ups.

Fairness & calibration: Small human-rated set for scaling/offsets; bias audits & model cards.

Integrations: ATS export (Greenhouse/Lever), Slack/Email webhooks.

Signals: Gesture/nod detection, gaze stability, multilingual STT, better environment QA via Roboflow.

Enterprise: Admin controls, audit logs, data retention policies.

Built With

Next.js · Tailwind · Firebase (Auth, Firestore, Storage, Functions) · MediaPipe · Roboflow · Chart.js · Whisper / Google STT
