## Inspiration
AI tools broke traditional code quizzes overnight. ChatGPT can solve a typical multiple-choice coding question in seconds. Conestoga instructors told us they can no longer tell whether a student understands the material or just pasted the prompt into a chatbot.
So we asked the inverse question: what's the one thing you can't fake by pasting into an AI?

Reasoning. Specifically, comparative reasoning about code that already exists. Don't ask students to write code — show them several solutions and ask which one's best and why, in their own words. AI hallucinates fluent-sounding answers; real students hesitate, hedge, miss tradeoffs. The shape of the explanation tells you what the shape of the understanding is. We built NewGen Learning around that flip.

## What it does

NewGen Learning is a reasoning-first code assessment platform that integrates into the eConestoga LMS look-and-feel. Four flows:

  • Question Builder. Instructor writes a scenario and the correct solution. Gemini generates three plausible wrong solutions with built-in flaws (off-by-one, wrong complexity class, broken
    contract). Instructor reviews each one and accepts, rejects, or regenerates.
  • Quiz Taking. Student sees the scenario and the multiple code blocks. They pick the best one and explain — in plain language — why their pick beats the alternatives.
  • AI Validation. Instructor clicks "Analyze All." Claude runs over every submission for the quiz, scoring the reasoning quality per answer: strengths, gaps, and an AI-detection confidence.
    Re-running replaces the old analysis.
  • Instructor Dashboard. Per-student scores and Claude's feedback, plus class-level aggregates: average score, most common gaps, AI-flag count.
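
As a rough illustration of the class-level aggregates on the dashboard, here is a minimal sketch of how average score, most common gaps, and AI-flag count could be computed from per-answer analyses. The `Analysis` dataclass, its field names, and `class_summary` are hypothetical names chosen for this example, not the project's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Analysis:
    """One per-answer analysis (hypothetical shape, for illustration only)."""
    student: str
    score: int                                    # reasoning score on a 0-100 scale
    gaps: list[str] = field(default_factory=list)  # gaps called out in the feedback
    ai_flagged: bool = False                      # boolean AI-detection flag


def class_summary(analyses: list[Analysis], top_n: int = 3) -> dict:
    """Aggregate per-answer analyses into class-level numbers."""
    if not analyses:
        # Nothing analyzed yet: return an empty summary instead of dividing by zero.
        return {"average_score": None, "common_gaps": [], "ai_flag_count": 0}

    # Count how often each gap appears across all answers.
    gap_counts = Counter(gap for a in analyses for gap in a.gaps)
    return {
        "average_score": round(mean(a.score for a in analyses), 1),
        "common_gaps": gap_counts.most_common(top_n),
        "ai_flag_count": sum(a.ai_flagged for a in analyses),
    }
```

Calling `class_summary` on an empty list returns a summary with `average_score` set to `None`, so a dashboard can render a quiz that has not been analyzed yet.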

In our demo, three students answered the same question. Two picked the right answer with strong reasoning. One picked the wrong answer with a vague "it just works" justification, and Claude flagged "didn't engage with alternatives" and "failed to recognize that 'it works' ignores efficiency." That delta is the whole pitch.

## How we built it

Three FastAPI microservices and a React frontend, sharing a single Neon Postgres database:

  • Quiz API (port 8001) — owns quizzes, questions, solutions; integrates with Gemini for AI-generated distractors.
  • Submission API (port 8002) — owns submissions and per-question reasoning.
  • Analysis Service (port 8003) — shells out to the Claude CLI via subprocess and stores per-answer analyses.
  • Frontend — React + Vite, styled with the actual eConestoga LMS CSS so it's visually indistinguishable from the real thing.

A few architectural calls we're glad we made early:

  • Backends never call each other over HTTP. Coordination happens through the shared DB, with each service owning specific tables. This kept the contracts simple and the team unblocked — backend
    and frontend could move in parallel without stepping on each other.
  • Validation is instructor-triggered, batch by quiz_id — not auto-on-submit. Cost control, plus the instructor gets a satisfying "analyze the class" moment.
  • Role-based response shaping via ?role=student|instructor — the same endpoint serves both, just hides the answer key for students. No auth needed, fits the hackathon scope.
  • Stack: React (Vite), FastAPI (Python 3.13), asyncpg, Pydantic, Neon Postgres, Claude CLI subprocess, Gemini API.
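
The role-based response shaping described above can be sketched as a plain function that an endpoint would call before returning a quiz. This is a minimal sketch under assumptions: the payload layout (`questions`, `solutions`) and the `is_correct` field are hypothetical names, and `shape_quiz` is not taken from the project's code.

```python
from copy import deepcopy

ALLOWED_ROLES = ("student", "instructor")


def shape_quiz(quiz: dict, role: str) -> dict:
    """Return the quiz payload for the given role.

    Instructors get the full payload. Students get a copy with the
    answer key (the hypothetical `is_correct` field) removed.
    """
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    if role == "instructor":
        return quiz

    shaped = deepcopy(quiz)  # never mutate the stored quiz
    for question in shaped.get("questions", []):
        for solution in question.get("solutions", []):
            solution.pop("is_correct", None)
    return shaped
```

Because the role arrives as an unauthenticated query parameter, this hides the answer key only from clients that ask honestly, which matches the hackathon scope but not a production deployment.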

## Challenges we ran into

  • The Bug. Accepting an AI-generated wrong option for Question 2 silently added it as a new solution to Question 1. Took us a while to realize it wasn't a backend bug — it was a render-scope bug. The pending-options state was a single flat array, and the AI panel rendered inside every question's card, so the user was clicking what looked like Q2's accept button but was actually rendering under Q1. Fix: state keyed by questionId, panel only renders for the question it was generated for.

  • Claude CLI subprocess plumbing. First call hung for three seconds waiting on stdin (it was looking for piped input). Solved with stdin=subprocess.DEVNULL. Cut latency from ~9s/call to ~7s/call. Across 3 submissions, that's a noticeable demo improvement.

  • Two-layer JSON unwrap. The Claude CLI returns {"type":"result","result":""} — we had to json.loads twice and add defensive markdown-fence stripping for when Claude added ```json despite our explicit instructions.

  • JSONB codec mystery. asyncpg's set_type_codec('jsonb', ...) registered cleanly but JSONB columns kept returning as strings on read. Worked around with a _decode_jsonb() helper. Still don't fully understand why the codec didn't take effect.

  • Gemini billing wall. Mid-demo we got a 429: prepayment credits depleted even though we were on the free tier. Turns out billing-tier projects gate every model behind the same prepay check. Spun up a fresh project's API key — instant fix.
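
Three of the workarounds above (closing stdin for the subprocess, the two-layer JSON unwrap with fence stripping, and the JSONB decode helper) can be sketched as follows. This is an illustrative sketch, not the project's code: `run_cli`, `strip_fence`, and `unwrap_cli_result` are hypothetical names, the command to run is passed in by the caller because the exact CLI invocation is not shown here, and `decode_jsonb` is a guess at what a helper like `_decode_jsonb()` might do.

```python
import json
import subprocess
import sys

FENCE = "`" * 3  # markdown code-fence marker, built this way to avoid a literal fence


def run_cli(cmd: list[str], timeout: float = 60.0) -> str:
    """Run a CLI command with stdin closed and return its stdout."""
    completed = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # the child sees EOF immediately instead of waiting for piped input
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,                # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


def strip_fence(text: str) -> str:
    """Remove a surrounding markdown fence (with optional language tag) if present."""
    text = text.strip()
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[3:-3]
        newline = body.find("\n")
        if newline != -1:
            body = body[newline + 1:]  # drop the opening line, e.g. the "json" tag
        return body.strip()
    return text


def unwrap_cli_result(stdout: str) -> dict:
    """Decode the outer envelope, then the JSON string held in its "result" field."""
    envelope = json.loads(stdout)                        # layer 1: {"type": "result", "result": "..."}
    return json.loads(strip_fence(envelope["result"]))  # layer 2: the model's own JSON


def decode_jsonb(value):
    """Decode a JSONB column that came back as a string; pass other values through."""
    return json.loads(value) if isinstance(value, str) else value
```

Passing the command in as a list keeps the sketch independent of any particular CLI flags, and the same `run_cli` can be exercised in a smoke test with a stand-in command such as `[sys.executable, "-c", "print('ok')"]`.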

## Accomplishments that we're proud of

  • End-to-end flow working in under 48 hours. Quiz creation with AI distractors → student submission with reasoning → Claude validation → instructor dashboard. Every step backed by real data, not
    mocks.
  • The pedagogical signal actually works. Claude's per-answer feedback isn't generic. For our weak-reasoning student it specifically called out "didn't compare against alternatives" and "treats 'it works' as sufficient" — these are teachable moments an instructor would actually want to act on.
  • AI-detection delta is meaningful even when the flag is false. Strong reasoning came back at 0.35 confidence; vague reasoning at 0.90 confidence. The flag itself stayed false, but the confidence is a usable instructor signal we can surface on the dashboard.
  • Strict service contracts kept us moving. Two people, three backend services, one frontend, parallel git branches, zero merge conflicts that mattered.
  • eConestoga clone. The UI is so close to the real LMS that the demo looks like an actual feature shipping inside Conestoga's existing system.

## What we learned

  • De-risk the scariest integration first. We wrote a 30-line standalone Claude CLI smoke test before any FastAPI plumbing. If claude -p had failed under subprocess, we would have known in hour 2, not hour 18. Worth it every time.
  • AI in a product means defending against AI failure modes. Rate limits, malformed JSON, latency spikes, occasional markdown fences: every Claude/Gemini call is now wrapped in defensive parsing. We lost an hour the first time we trusted the response shape.
  • Service contracts pay compounding rent. Every cross-service ambiguity we resolved up front in CONTRACT.md saved at least one merge conflict later. The ?role= decision saved us from building auth in the hackathon.
  • Reasoning-first is a fundamentally different product from multiple choice. Building it forced us to think about pedagogy, not just data flow. What does a "good gap" look like? When does a strength actually demonstrate understanding vs. parrot the prompt? Those are product questions, not engineering ones.

## What's next for NewGen Learning

  • Docker Compose for one-command local dev — currently in progress, scoping the Claude CLI subprocess inside a container.
  • GCP deployment via Terraform. Cloud Run for the three services, frontend on Firebase Hosting or Cloud Run, keep Neon as the DB. Will require swapping the Claude CLI subprocess for the Anthropic API SDK for portability.
  • Surface the AI-detection confidence on the dashboard — currently stored, not displayed. Confidence delta is more useful than the boolean flag.
  • Custom evaluation criteria UI. Instructors already write eval_criteria per question; it gets injected into Claude's prompt. We want a friendlier editor — focus areas, weights, and previewable
    rubrics.
  • Class-level analytics. Hardest question, most-misunderstood concept, students whose reasoning consistently flags as AI-generated. Trend lines across multiple quizzes.
  • Multi-language support beyond Python. The plumbing is there; just need test prompts for C#, JavaScript, Java distractor generation that Gemini handles well.
  • Rubric scoring transparency. Right now Claude scores 0–100 and explains why — but instructors can't easily see which criteria most influenced the score. Adding a per-criterion breakdown is next.
