Helivar

An adaptive AI tutor for NYC students preparing for the Specialized High Schools Admissions Test (SHSAT).

Live Demo


Inspiration

The SHSAT determines admission to eight of NYC's nine specialized high schools — Stuyvesant, Bronx Science, Brooklyn Tech, and others. For many students, it's a high-stakes gate with limited prep resources. Commercial test prep is expensive, and free alternatives tend to be static question banks with no adaptivity.

We wanted to explore whether a small, focused app could deliver genuinely personalized tutoring by leveraging LLMs for the parts they're good at (planning, content generation, encouragement) while keeping them away from the parts they're bad at (grading, math computation). The result is a system where the AI is the coach and route-planner, but never the referee.


What it does

Helivar is an AI-powered study platform that adapts to each student. Rather than presenting the same questions to everyone, it follows a continuous learning loop:

Diagnose (Skill Scan) → Plan (AI generates path) → Teach (lessons, drills, quizzes) → Reassess

The app covers all 12 SHSAT topics — 10 math and 2 ELA — with procedurally generated math questions for infinite variety and curated verbal question banks for Revising/Editing and Reading Comprehension. Every question includes a pedagogical explanation that teaches students why the answer is correct.

Key Features

  • Skill Scan onboarding — A 12-question diagnostic (one per topic) that calibrates the student's proficiency without score pressure
  • Adaptive difficulty — An AI planner selects topics and difficulty levels based on the student's evolving proficiency model
  • Seven content types — Practice quizzes, AI-generated lessons, rapid-fire drills, 60-second timed challenges, timed estimation (number sense), spot-the-error activities, and term matching/ordering games
  • Session routing — The AI interleaves focus topics (70%) with spaced review (30%) within each session, ensuring no two consecutive activities share the same format
  • Proficiency tracking — An exponential moving average (EMA) model that weights recent performance more heavily:

$$P_{n+1} = (1 - \alpha) \cdot P_n + \alpha \cdot S_d$$

where $\alpha = 0.3$ and $S_d$ is a difficulty-weighted score (e.g., a hard correct answer yields $S_d = 95$, an easy wrong answer yields $S_d = 15$)

  • Gamification — XP, 10 level titles (Newcomer to SHSAT Prodigy), streaks with weekly freeze protection, daily goals with bonus XP, personal best tracking, and 15 unlockable achievements
  • Variable rewards — 15% chance of bonus XP on correct answers, near-miss nudges when close to milestones, and full-screen level-up celebrations
  • Dark/light theme — Persisted preference with system-aware defaults
  • Mobile-first design — Bottom tab navigation on mobile, responsive layouts, tap-optimized interactions
  • Periodic reassessment — After every 5 sessions, a mini diagnostic updates the student model

Research-Based Design

Every major design decision in Helivar is grounded in cognitive science and learning research:

  • Adaptive difficulty targets each student's zone of proximal development (Vygotsky, 1978; VanLehn, 2011)
  • Spaced review and interleaved practice strengthen long-term retention (Cepeda et al., 2006; Rohrer & Taylor, 2007)
  • Retrieval practice (testing effect) drives six of seven content types (Roediger & Karpicke, 2006; Dunlosky et al., 2013)
  • Immediate corrective feedback with pedagogical explanations prevents misconceptions from consolidating (Hattie & Timperley, 2007)
  • Error analysis ("spot the error") leverages the self-explanation effect (Chi et al., 1994)
  • Gamification is designed around Self-Determination Theory — supporting autonomy, competence, and relatedness (Deci & Ryan, 2000)
  • Variable rewards and near-miss nudges use reinforcement schedules and the goal gradient effect to sustain engagement (Kivetz et al., 2006)
  • Low-stakes framing ("discover your strengths") reduces test anxiety and produces more accurate diagnostic data (Dweck, 2006)
  • Timed fluency drills build automaticity for a timed exam (LaBerge & Samuels, 1974)

For full details and citations, see Research Foundations.


How we built it

Architecture

The core architectural decision was separating routing from grading:

| Responsibility | Handler | Why |
| --- | --- | --- |
| Topic selection | Google Gemini (LLM) | Requires weighing proficiency, spacing, and ZPD targeting |
| Session planning | Google Gemini (LLM) | Needs to balance focus vs. review and select content types |
| Lesson generation | Google Gemini (LLM) | Creative, structured teaching content |
| Drill generation | Google Gemini (LLM) | Template-validated to catch LLM math errors |
| Spot-the-error generation | Google Gemini (LLM) | Worked solutions or paragraphs with deliberate mistakes |
| Match/order generation | Google Gemini (LLM) | Term-definition pairs or sequencing activities |
| Grading | Deterministic code | `selectedChoice === correctIndex` — no hallucination risk |
| Explanations | Static (pre-generated) | Consistent, reviewed, no latency |

Estimation is fully procedural — no LLM involved. For LLM-generated content types, errors are surfaced to the user with retry buttons rather than silently serving fallback content.
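
To make the routing/grading boundary concrete, here is a minimal sketch of the grading side (type and function names are illustrative, not the app's actual modules):

```ts
// Illustrative sketch of the deterministic grading path (hypothetical names).
// The LLM never touches this code: grading is a pure function over a
// pre-computed answer key, so the same answer always scores the same.
interface GradedQuestion {
  prompt: string;
  choices: string[];
  correctIndex: number; // computed by the procedural engine, never the LLM
  explanation: string;  // pre-generated and reviewed, served verbatim
}

export function grade(question: GradedQuestion, selectedChoice: number) {
  const correct = selectedChoice === question.correctIndex;
  return { correct, explanation: question.explanation };
}
```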

Math Question Engine

Rather than storing a fixed question bank for math, the app generates questions procedurally using seeded random number generation. Each of the 10 math topics has a template file with multiple question archetypes. For example, the ratios template generates problems like:

$$\text{If } \frac{a}{b} = \frac{x}{d}, \text{ find } x$$

with randomized values, computed correct answers, and carefully crafted distractors that mirror common student mistakes (e.g., inverting the ratio, using additive instead of multiplicative reasoning). This gives infinite variety while guaranteeing correctness — something an LLM alone cannot do reliably for math.
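
A condensed sketch of what one archetype can look like — mulberry32 stands in for a seeded PRNG, and the template shape is illustrative rather than the app's actual file format:

```ts
// Seeded PRNG: the same seed always reproduces the same question.
function mulberry32(seed: number): () => number {
  return () => {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One ratios archetype: a/b = x/d, with d chosen so x is a whole number.
export function ratioQuestion(seed: number) {
  const rand = mulberry32(seed);
  const int = (lo: number, hi: number) => lo + Math.floor(rand() * (hi - lo + 1));

  const a = int(2, 9);
  const b = int(2, 9);
  const k = int(2, 6);
  const d = b * k;
  const correct = a * k; // solves a/b = x/d exactly, by construction

  // Distractors mirror real misconceptions rather than random noise.
  const distractors = [
    a + (d - b), // additive reasoning: "b grew by d - b, so add that to a"
    a * b,       // cross-multiplied the wrong pair
    a * (k + 1), // off-by-one scale factor
  ];
  // (The real engine also dedupes choices against the answer and shuffles.)

  return {
    prompt: `If ${a}/${b} = x/${d}, what is x?`,
    choices: [correct, ...distractors].map(String),
    correctIndex: 0,
  };
}
```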

Student Model

The student model tracks per-topic proficiency using an EMA that responds quickly to recent performance while smoothing noise:

$$P_{\mathrm{new}} = 0.7 \cdot P_{\mathrm{old}} + 0.3 \cdot S_d$$

Difficulty scoring is asymmetric by design — getting a hard question right is rewarded more than an easy one, and getting an easy question wrong is penalized more than a hard one:

| Difficulty | Correct $S_d$ | Wrong $S_d$ |
| --- | --- | --- |
| Easy | 60 | 15 |
| Medium | 75 | 30 |
| Hard | 95 | 40 |

Because every $S_d$ lies in $[15, 95]$, proficiency stays within $[0, 100]$, and the asymmetry gives the planner a clean signal for targeting the student's zone of proximal development — the 30-70% proficiency band where learning is most effective.
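
The update rule maps directly to code; a minimal sketch with illustrative names:

```ts
// EMA proficiency update: P_new = (1 - alpha) * P_old + alpha * S_d.
const ALPHA = 0.3;

type Difficulty = "easy" | "medium" | "hard";

const SCORES: Record<Difficulty, { correct: number; wrong: number }> = {
  easy:   { correct: 60, wrong: 15 },
  medium: { correct: 75, wrong: 30 },
  hard:   { correct: 95, wrong: 40 },
};

export function updateProficiency(
  current: number,        // P_old, in [0, 100]
  difficulty: Difficulty,
  correct: boolean,
): number {
  const s = correct ? SCORES[difficulty].correct : SCORES[difficulty].wrong;
  // Result stays in [0, 100] because both P_old and S_d do.
  return (1 - ALPHA) * current + ALPHA * s;
}
```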

Tech Stack

  • Next.js 15 (App Router) + React 19 + TypeScript
  • Tailwind CSS 3 with CSS custom properties for theming
  • Framer Motion for animations (drag-to-reorder, countdown sequences, timer bars)
  • NextAuth v5 with Google OAuth
  • Vercel AI SDK + Google Gemini (gemini-3-flash-preview)
  • localStorage for persistence (no backend needed)
  • Vercel for deployment

Challenges we ran into

LLMs Can't Do Math (Reliably)

Early prototypes had the LLM generate math questions and answers. The error rate was unacceptable — roughly 5-10% of generated math questions had wrong answers or flawed distractors. The fix was the procedural engine: hand-authored templates with randomized parameters and computed answers. The LLM still generates drill questions, but those go through a template-validation step where the LLM solves its own problem and the answer is cross-checked.
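
A sketch of that cross-check using the Vercel AI SDK's generateObject — the actual prompts and schemas differ; this only shows the generate-then-solve-then-compare shape:

```ts
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const drillSchema = z.object({
  question: z.string(),
  choices: z.array(z.string()).length(4),
  correctIndex: z.number().int().min(0).max(3),
});

export async function generateValidatedDrill(topic: string) {
  const model = google("gemini-3-flash-preview");

  // Pass 1: generate the drill question.
  const { object: drill } = await generateObject({
    model,
    schema: drillSchema,
    prompt: `Write one SHSAT ${topic} drill question with four choices.`,
  });

  // Pass 2: have the model solve its own question without seeing the key.
  const { object: check } = await generateObject({
    model,
    schema: z.object({ answerIndex: z.number().int().min(0).max(3) }),
    prompt: `Solve this and return the index (0-3) of the correct choice:\n` +
      `${drill.question}\n${drill.choices.join("\n")}`,
  });

  // Serve the drill only if the independent solve agrees with the key;
  // otherwise the UI surfaces the error with a retry button.
  if (check.answerIndex !== drill.correctIndex) {
    throw new Error("Drill failed answer cross-check");
  }
  return drill;
}
```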

Grading Must Be Deterministic

It's tempting to use an LLM to evaluate free-text answers or provide nuanced grading. But LLM grading introduces non-determinism — the same answer might be graded differently on different runs. For a test-prep app where students need to trust their scores, this is a dealbreaker. Every grading decision in Helivar is a simple equality check against a pre-computed answer key.

Adaptive Difficulty Without a Backend

With no database, the entire student model lives in localStorage. That meant writing migration logic for schema changes (v1 to v2 added the studentModel field), forward-compatibility guards so new topics auto-populate with defaults, and a single loadData() / saveData() API so raw localStorage access isn't scattered across components.
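
A sketch of the pattern, with illustrative field names and a hypothetical neutral prior for new topics:

```ts
// All persistence flows through loadData()/saveData() (illustrative shape).
const STORAGE_KEY = "helivar:data";
const ALL_TOPICS = ["ratios", "percentages" /* ...ids for all 12 topics */];

interface AppData {
  version: number;
  studentModel: Record<string, number>; // per-topic proficiency, added in v2
  // ...sessions, achievements, streaks
}

export function loadData(): AppData {
  const raw =
    typeof window !== "undefined" ? localStorage.getItem(STORAGE_KEY) : null;
  const data: AppData = raw ? JSON.parse(raw) : { version: 2, studentModel: {} };

  // v1 -> v2 migration: older saves predate the studentModel field.
  if (data.version < 2) {
    data.studentModel ??= {};
    data.version = 2;
  }

  // Forward-compatibility guard: newly added topics auto-populate
  // at a neutral prior (50 is an assumed default here).
  for (const topic of ALL_TOPICS) {
    data.studentModel[topic] ??= 50;
  }
  return data;
}

export function saveData(data: AppData): void {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(data));
}
```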

Keeping Sessions Engaging for 11-13 Year Olds

The target audience has short attention spans and high test anxiety. We addressed this through framing ("Let's discover your strengths" rather than "Let's test you"), variable rewards (randomized bonus XP to sustain motivation), interleaved sessions (mixing topics leverages the spacing effect from cognitive science), and making timers optional to reduce anxiety.

Content Monotony

Early versions delivered all practice as 4-choice multiple choice questions — the same tap-A/B/C/D loop every session. For 11-13 year olds, format variety is critical for sustained engagement. We added four new content formats (timed challenges, estimation, spot-the-error, matching/ordering), each with distinct interactions and pacing. The session planner now enforces that no two consecutive activities share the same format, and high-proficiency review topics rotate through diverse activity types instead of defaulting to more quizzes.
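
Stated as code, the two planner constraints look roughly like the validator below. This is an assumption about where enforcement lives — in the app the LLM produces the plan, and checks like these would act as guardrails:

```ts
type Format =
  | "quiz" | "lesson" | "drill" | "timed"
  | "estimation" | "spotError" | "matchOrder";

interface Activity {
  topic: string;
  format: Format;
}

export function validatePlan(plan: Activity[], focusTopics: string[]): boolean {
  // 1. No two consecutive activities share the same format.
  const noRepeats = plan.every(
    (a, i) => i === 0 || a.format !== plan[i - 1].format,
  );
  // 2. Roughly 70% focus topics, 30% spaced review (band is illustrative).
  const focusCount = plan.filter((a) => focusTopics.includes(a.topic)).length;
  const focusShare = focusCount / plan.length;
  return noRepeats && focusShare >= 0.6 && focusShare <= 0.8;
}
```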

Hydration Mismatches with Procedural Generation

Procedurally generated questions use randomness, which means server-rendered HTML won't match client-rendered HTML. The solution was to generate all math questions client-side only, avoiding Next.js hydration errors entirely.
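
A minimal sketch of the client-only pattern (illustrative component, reusing the ratioQuestion sketch from above):

```tsx
"use client";
import { useEffect, useState } from "react";
import { ratioQuestion } from "./questions/ratios"; // hypothetical path

export function MathQuestion({ seed }: { seed: number }) {
  const [question, setQuestion] =
    useState<ReturnType<typeof ratioQuestion> | null>(null);

  // useEffect only runs in the browser, so the first client render matches
  // the server-rendered placeholder and hydration succeeds.
  useEffect(() => {
    setQuestion(ratioQuestion(seed));
  }, [seed]);

  if (!question) return <p>Loading question…</p>;
  return <p>{question.prompt}</p>;
}
```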


Accomplishments that we're proud of

  • The procedural math engine — 10 topic templates generating infinite unique questions with guaranteed-correct answers and pedagogically meaningful distractors. No two students see the same problem set.
  • The routing/grading separation — By restricting the LLM to planning and content generation while keeping all grading deterministic, we eliminated an entire class of trust issues. Students can rely on their scores being consistent and correct.
  • The proficiency model — The EMA-based student model with asymmetric difficulty scoring turned out to be both simple to implement and surprisingly effective at targeting the zone of proximal development. It adapts within a single session.
  • Zero-backend adaptive learning — The entire student model, session history, achievements, and streak tracking runs client-side in localStorage with robust migration logic. No server costs, no database, no user data collection.
  • Seven content formats — From 60-second speed rounds to timed estimation with speed bonuses, spot-the-error "be the teacher" activities, and drag-to-reorder sequencing — each session feels different while still targeting the student's weak areas.
  • Full content pipeline — From diagnostic calibration through AI-planned sessions with seven content types to periodic reassessment — the complete adaptive learning loop is functional end-to-end.

What we learned

  • LLMs are better coaches than calculators. The biggest insight was discovering where LLMs add value (personalized planning, generating teaching content, encouragement) versus where they're a liability (math computation, consistent grading). Drawing that line early saved us from building on an unreliable foundation.
  • Procedural generation beats static banks for math. Hand-crafting distractor logic for each question archetype is tedious, but the payoff is enormous — every distractor corresponds to a real student misconception, which makes wrong answers diagnostic rather than random.
  • Engagement design matters as much as content. Variable rewards, streak mechanics, and careful framing ("discover your strengths" vs. "take a test") measurably affect whether a 12-year-old comes back tomorrow. The behavioral science behind gamification is real.
  • Prompts are product. The quality of AI-generated lessons and session plans depends entirely on prompt engineering. We treated prompts as first-class code — version-controlled, tested, and iterated just like any other component.

What's next for Helivar

  • Backend persistence (MongoDB Atlas) — Migrating from localStorage to MongoDB Atlas with a 4-collection schema (users, profiles, sessions, achievements). Design document: MongoDB Migration Plan
  • Parent/teacher dashboard — A read-only view of student progress for parents and educators to track improvement over time
  • Mobile app — A React Native wrapper so students can practice on their phones during commutes
  • Multiplayer challenges — Optional head-to-head quizzes between students to add a social/competitive layer
  • Analytics and insights — Surfacing patterns like "you tend to miss geometry questions after 8pm" to help students study smarter

Setup

```bash
npm install
cp .env.local.example .env.local  # add your API keys
npm run dev
```

Environment Variables

| Variable | Description |
| --- | --- |
| `GOOGLE_GENAI_API_KEY` | Google Gemini API key |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret |
| `AUTH_SECRET` | NextAuth session secret |

Deploying

```bash
npx vercel --prod
```

Built With

  • gemini
  • nextauth
  • nextjs
  • tailwind
  • vercel