About the Project

Polaris is an AI-assisted mock interview coach designed to help students and early-career applicants improve not only what they say, but also how they communicate it.

Many candidates understand the technical concepts required for a role but struggle to demonstrate their knowledge during an interview. Pressure can lead to rushed answers, excessive filler words, weak eye contact, distracting movement, poor answer structure, and low confidence. Most interview-preparation platforms focus mainly on generating questions or evaluating written responses, while non-verbal communication and delivery receive less attention.

Polaris explores a more complete approach by combining personalized interview questions, speech transcription, answer analysis, presentation feedback, progress tracking, and improvement roadmaps in one platform.

Inspiration

Our initial idea was to build an all-in-one career platform for recent graduates. We considered including job discovery, portfolio preparation, interview practice, skill-gap analysis, scheduling, and ghost-job detection.

However, we realized that attempting to solve the entire job-search process would produce an unfocused product, especially within a short hackathon development period. We decided to concentrate on one important and measurable problem, that being, interview confidence and communication.

The project was inspired by the observation that many final-year students do not fail interviews because they completely lack knowledge. Instead, they often struggle to communicate under pressure. A candidate may know the correct answer but present it with an unclear structure, avoid looking toward the interviewer, speak too quickly, use repetitive filler words, or appear visibly nervous.

This led us to the central question behind Polaris: How can AI help candidates understand both the content and delivery of their interview performance?

We named the project Polaris after the North Star. Just as the North Star has historically helped people navigate uncertain journeys, Polaris is intended to guide candidates through the uncertainty of interview preparation.

What it does

Polaris provides a complete mock-interview workflow.

Users begin by entering information such as:

  • Their name
  • Target role
  • Experience level
  • Target company
  • Preferred interview language

Based on this information, Polaris generates a personalized set of interview questions covering behavioral, technical, collaborative, and role-specific topics.

During the interview, the platform:

  • Presents questions one at a time
  • Reads questions aloud using browser text-to-speech
  • Captures the user's microphone and camera
  • Transcribes spoken answers using browser speech recognition
  • Displays live audio activity
  • Estimates movement and screen alignment
  • Provides prototype visual indicators for posture, gaze, and gesture feedback
  • Allows users to review and edit their transcript before submitting it

After each response, Polaris evaluates the answer using several criteria, including:

STAR answer structure

  • Relevance and completeness
  • Use of measurable results
  • Filler-word frequency
  • Communication quality
  • Professionalism
  • Speaking pace
  • Movement and presentation stability

The STAR framework can be represented as: $$ [ \text{STAR} = \text{Situation} + \text{Task} + \text{Action} + \text{Result} ] $$

Polaris checks whether the candidate provides enough context, explains their responsibility, describes what they personally did, and communicates the outcome. Then, the prototype combines content and delivery indicators into a weighted score: $$ [ w_c S_{\text{content}} + w_s S_{\text{STAR}} + w_v S_{\text{verbal}} + w_n S_{\text{nonverbal}} ], $$

where:

$$(S_{\text{content}})$$ measures answer relevance and detail

$$(S_{\text{STAR}})$$ measures response structure

$$(S_{\text{verbal}})$$ measures delivery-related indicators such as filler words and pace

$$(S_{\text{nonverbal}})$$ represents movement and alignment estimates $$(w_c, w_s, w_v,)$$ and $$(w_n)$$ are weighting factors

At the end of the session, users receive:

  • An overall interview score
  • Individual question feedback
  • STAR analysis
  • Detected filler words
  • Strengths and weaknesses
  • Actionable improvement suggestions
  • Saved interview history
  • A personalized four-week improvement roadmap

How we built it

Prototype simulation

Polaris was built as a functional proof of concept using a React and TypeScript frontend with a Node.js and Express backend.

The frontend was developed with:

  • React
  • TypeScript
  • Vite
  • Tailwind CSS
  • Browser Media APIs
  • Web Speech Recognition
  • Speech Synthesis
  • Web Audio API
  • HTML Canvas

The backend handles:

  • Interview-question generation
  • Answer evaluation
  • Session storage
  • Improvement-roadmap generation
  • Communication with the Gemini API

Then, using an available Gemini API key, it is used to:

  1. Generate questions tailored to the user's role and experience
  2. Analyze interview transcripts
  3. Produce personalized improvement roadmaps

We also created local fallback systems so that the prototype can continue functioning when an external AI service is unavailable. These fallbacks use predefined role-specific questions and rule-based text analysis.

The speech-recognition layer uses the browser's built-in speech-recognition API. The Web Audio API measures live microphone activity and approximate volume levels.

For the camera-analysis prototype, we used lightweight frame comparison. Consecutive low-resolution webcam frames are compared to estimate the amount of movement and the approximate horizontal center of the candidate.

A simplified movement calculation is: $$ [ M_t = \frac{1}{N} \sum_{i=1}^{N} \left|I_t(i)-I_{t-1}(i)\right|, ] $$ where:

  • $$(I_t(i))$$ is the intensity of pixel (i) in the current frame
  • $$(I_{t-1}(i))$$ is the same pixel in the previous frame
  • $$(N)$$ is the number of sampled pixels
  • $$(M_t)$$ represents estimated movement

The current landmark, gaze, posture, and hand visualizations are prototype simulations. They demonstrate the intended user experience and future architecture but do not yet represent full pose or facial-landmark detection.

We explored using OpenCV, MediaPipe, and pretrained hand-pose models, but reliable real-time gesture classification could not be completed within the development period. Rather than claiming that the prototype performs complete gesture recognition, we retained the interface as a demonstration of how a future vision model could integrate with the platform.

Interview sessions and roadmaps are currently stored using a JSON-based persistence layer, with browser local storage as a fallback. This was chosen for rapid prototyping and would later be replaced by an authenticated production database.

Challenges we ran into

Defining a Realistic Scope

Our largest early challenge was defining a realistic scope. We initially wanted Polaris to include:

  • Job searching
  • Portfolio evaluation
  • Skill matching
  • Scheduling
  • Scam detection
  • Personalized learning roadmaps
  • Interview preparation

Although these features were related, attempting to implement all of them would have weakened the main user experience. We therefore reduced the scope and focused on solving one problem well: AI-powered interview coaching.

Real-Time Gesture Recognition

Accurate gesture recognition was one of the most difficult technical challenges.

A hand-pose model may be able to locate hand keypoints, but keypoints alone cannot determine whether a gesture is appropriate in an interview. Reliable classification would require:

  • Consistent hand tracking
  • Temporal movement analysis
  • A dataset containing realistic interview gestures
  • Labels identifying appropriate and distracting behavior
  • Context from the candidate's speech
  • Testing across different cameras, lighting conditions, and body positions

We experimented with pretrained models, but their predictions were not stable enough to include as a reliable feature in the final prototype.

Distinguishing Movement From Meaningful Behavior

Pixel-level movement detection can determine that something has changed in the video frame, but it cannot always determine what caused the change.

For example, detected movement may come from:

  • The candidate's hands
  • Head movement
  • A changing background
  • Camera noise
  • Lighting changes
  • Another person entering the frame

This taught us that movement detection and gesture understanding are two different problems. Detecting motion is relatively simple, but interpreting whether that motion is meaningful, appropriate, or distracting requires additional context and more advanced models.

Browser Compatibility

Speech-recognition support behaves differently across browsers. Some browsers support continuous speech recognition, while others provide limited support or no support at all.

To make the prototype more accessible, we added a typed-transcript fallback so that users could still complete an interview even when automatic speech recognition was unavailable.

Limited Development Time

Because the project was developed during a short hackathon, we had to balance several responsibilities:

  • Research
  • Development
  • Testing
  • User-interface design
  • Model experimentation
  • Documentation
  • Demo preparation

Several advanced features had to remain simulations or future extensions so that we could complete a coherent, functional, end-to-end prototype.

This required us to prioritize features that best demonstrated the project's central idea rather than attempting to build every planned capability.

Responsible Presentation

Another challenge was deciding how to communicate the prototype accurately. The interface was designed to demonstrate a future system that could potentially include:

  • Facial-landmark detection
  • Gaze estimation
  • Posture analysis
  • Hand tracking
  • Speech and prosody analysis

However, not all of these systems were fully implemented in the prototype. We learned that it is important to clearly distinguish between:

  • A working feature
  • A heuristic estimate
  • A simulated interface
  • A planned future capability

This distinction is especially important when AI-generated scores and feedback may influence how users perceive their abilities. The system should not present uncertain or experimental measurements as objective facts.

Accomplishments that we're proud of

We are proud that Polaris provides a complete user journey rather than an isolated AI demonstration.

A user can:

  1. Configure a personalized interview
  2. Receive role-specific questions
  3. Answer using their microphone and camera
  4. Obtain a transcript
  5. Receive structured feedback
  6. Review question-level results
  7. Save their session
  8. Generate a long-term improvement roadmap

We are also proud that the platform can operate in two modes:

  • Gemini-assisted analysis when an API key is available
  • Local fallback analysis when the external model is unavailable

This makes the prototype more resilient and easier to demonstrate.

Other accomplishments include:

  • Building a polished and consistent user interface
  • Integrating camera, microphone, transcription, and text-to-speech features
  • Creating role-specific question-generation logic
  • Implementing STAR-based answer evaluation
  • Providing actionable rather than purely numerical feedback
  • Creating session-history and progress-tracking features
  • Keeping raw webcam footage and audio out of server storage
  • Completing a working production build within the hackathon period

Most importantly, we transformed a broad and uncertain idea into a focused product with a clear target user and problem statement.

What we learned

AI products require more than an API call

We learned that integrating a generative model is only one part of building an AI product.

A useful application also requires:

  • A clearly defined problem
  • Reliable inputs
  • Thoughtful scoring logic
  • Fallback behavior
  • An understandable interface
  • Privacy safeguards
  • Honest communication of limitations

Detection is not the same as interpretation

Detecting movement is relatively easy. Determining whether that movement represents nervousness, confidence, distraction, or an appropriate gesture is much more difficult.

The same gesture can have different meanings depending on:

  • Cultural background
  • Interview context
  • Camera framing
  • Individual communication style
  • Physical accessibility
  • The content being spoken

Therefore, future non-verbal feedback should be presented as guidance rather than absolute judgment.

Feedback should be explainable

A score alone is not enough. Users need to understand why they received it and how they can improve.

For this reason, Polaris attempts to connect feedback to observable behaviors, such as:

  • Missing a measurable result
  • Overusing filler words
  • Giving an answer without a clear action
  • Speaking too quickly
  • Moving excessively within the frame

Fallback systems matter

External APIs may fail because of:

  • Missing credentials
  • Network problems
  • Rate limits
  • Service outages

Building local fallback questions and evaluation rules allowed Polaris to remain usable even without a generative model.

Scope management is a technical skill

Reducing scope was not a failure. It was necessary to create a working product.

We learned that a smaller complete system is often more valuable than a large collection of unfinished features.

Responsible AI requires transparency

Interview feedback can affect a person's confidence. Therefore, the application must avoid presenting uncertain estimates as objective truths.

We learned that future versions should:

  • Explain how every score is calculated
  • Display confidence levels
  • Allow users to challenge or ignore feedback
  • Avoid judging protected or identity-related characteristics
  • Account for accessibility and cultural differences
  • Clearly disclose when a feature is simulated or experimental

What's next for Polaris

The next stage of Polaris would focus on replacing prototype simulations with reliable, validated systems.

Real pose and facial-landmark analysis

We plan to integrate MediaPipe or a similar framework for:

  • Face landmark detection
  • Head-pose estimation
  • Shoulder alignment
  • Hand tracking
  • Body-position stability

This would replace simulated landmarks with measurements derived from the actual camera feed.

Temporal gesture classification

Instead of judging one frame at a time, future versions would analyze sequences of movement.

A temporal model could distinguish among:

  • Natural explanatory gestures
  • Repetitive fidgeting
  • Face touching
  • Crossed arms
  • Excessive movement
  • Long periods of unnatural stillness

This would require a responsibly collected and labelled interview-gesture dataset.

Genuine vocal analysis

We plan to calculate real delivery metrics such as:

$$ \text{WPM} = \frac{\text{spoken word count}} {\text{speaking duration in minutes}} $$

Additional features could include:

  • Pause duration
  • Pitch variation
  • Volume consistency
  • Speech-energy patterns
  • Repetition
  • Filler-word frequency
  • Response latency

These signals should support coaching, not medical or psychological diagnosis.

Role-specific evaluation

The current local evaluator is more effective for technical roles than for every profession.

Future versions would use evaluation rubrics tailored to fields such as:

  • Software engineering
  • Design
  • Marketing
  • Finance
  • Teaching
  • Healthcare
  • Research
  • Entrepreneurship

Personalized learning loops

Future roadmaps would use actual session history to adjust daily activities.

For example, if filler-word frequency decreases but STAR structure remains weak, Polaris could reduce filler-word exercises and prioritize structured-answer practice.

Authentication and secure storage

A production release would include:

  • User accounts
  • Secure authentication
  • Private session ownership
  • Encrypted database storage
  • Data deletion controls
  • API rate limiting
  • Consent-based data processing

Human-in-the-loop coaching

Polaris is not intended to replace teachers, career advisers, or recruiters.

A future version could allow users to share selected sessions with:

  • Mentors
  • Career counsellors
  • Teachers
  • University career centres
  • Trusted peers

The AI would provide preliminary analysis, while humans would supply context, empathy, and professional judgment.

Our long-term vision is for Polaris to become a practical interview-training environment that helps candidates practise repeatedly, recognize patterns in their performance, and enter real interviews with greater clarity and confidence.

Model benchmarking and dataset development

We also plan to test a wider range of computer-vision, speech-analysis, and language models instead of relying on the first available solution. Each candidate model would be evaluated under the same conditions to determine which provides the best balance of:

  • Accuracy
  • Inference speed
  • Real-time performance
  • Hardware requirements
  • Robustness across lighting, camera, accent, and background conditions
  • Fairness across different users
  • Ease of deployment

For a model (m), we could define a prototype selection score such as:

$$ [ Q(m) = \frac{ \alpha A(m)+\beta R(m) }{ \gamma L(m)+\delta C(m) } ] $$

where:

  • (A(m)) represents accuracy
  • (R(m)) represents robustness
  • (L(m)) represents latency
  • (C(m)) represents computational cost
  • (\alpha,\beta,\gamma,\delta) represent the importance assigned to each factor

The model with the highest accuracy may not necessarily be the best choice if it is too slow or computationally expensive for real-time browser use. Polaris therefore needs systematic benchmarking rather than selecting models based only on their reported performance.

We would also investigate existing datasets for interview behavior, facial landmarks, body posture, hand gestures, speech delivery, filler words, and answer quality. Before using any dataset, we would review its:

  • Licensing and permitted use
  • Data-collection and consent process
  • Demographic representation
  • Label quality
  • Relevance to interview scenarios
  • Potential cultural or accessibility biases

If an appropriate dataset does not already exist, we could create a small, consent-based dataset specifically for interview coaching. Participants could perform mock interviews under different conditions, while trained annotators label observable behaviors such as repeated fidgeting, speaking pace, pauses, answer structure, and natural explanatory gestures.

This dataset could then be used to fine-tune or train specialized models. However, the labels should describe observable behavior rather than subjective traits. For example, the system may identify repeated movement or prolonged silence, but it should not claim to determine whether a person is inherently confident, trustworthy, competent, or suitable for employment.

Finally, we would compare fine-tuned models against pretrained and rule-based baselines. Fine-tuning would only be retained when it produces a measurable and reliable improvement on a separate validation set, rather than being used simply because it sounds more advanced.

Built With

Share this project:

Updates