Inspiration

As a team, we work across math, computer science, economics, finance, and statistics. While these fields share tools, equations, and notation, we kept running into the same issue: understanding breaks in different ways depending on how someone reasons, not just what subject they are studying.

In math, it breaks when a proof relies on an unstated assumption. In computer science, when a concept works in theory but fails in implementation. In economics and finance, when intuition diverges from formal models. In statistics, when formulas are applied without understanding the data behind them.

Despite this, most learning tools treat all mistakes the same. Slides, videos, problem sets, and even AI tutors primarily check whether an answer is correct. They do not examine how the learner reasoned, they do not adapt when two students struggle for different reasons, and they do not maintain a model of misunderstanding over time.

We realized this was not a content gap. It was a systems gap.

Great instructors act like intelligent agents. They observe reasoning, interrupt when logic breaks, ask students to externalize their thinking through diagrams and explanations, and adapt continuously based on feedback. They build a mental model of the learner and use it to decide what to do next.

We asked a Turing City question: what if learning software behaved the same way?

Big Brain was built as an agentic learning system that studies the learner, not just the task. It diagnoses misunderstanding, adapts learning paths, and makes cognition visible through an interactive canvas. Big Brain is designed for complex, abstract domains where correctness is not enough and understanding must hold when conditions change.

The goal is not faster answers. The goal is robust understanding. That is why we built Big Brain.

What it does

Big Brain is an agentic learning platform that continuously models how a student thinks.

Instead of starting with lessons, it starts with diagnosis. Instead of delivering static content, it decides what to do next based on observed reasoning, feedback, and learning dynamics.

At a system level, Big Brain:

  • Generates diagnostic quizzes that surface conceptual gaps, partial understanding, and false confidence rather than testing memorization
  • Builds and updates a personalized learning graph instead of following a fixed syllabus
  • Dynamically adapts learning paths by pruning mastered concepts and expanding fragile areas
  • Pulls relevant course videos and explanations at the concept and timestamp level
  • Learns user strengths, weaknesses, and recurring misconception patterns over time
  • Verifies understanding through teach-back instead of one-shot answers

Big Brain is built around real-time reasoning interaction, designed to feel like university office hours with a TA or professor.

Rather than providing static explanations, the system engages in back-and-forth problem solving. Learners explain their thinking out loud while working through problems, and the agent listens, responds, and intervenes in real time.

This spoken reasoning assistance allows Big Brain to:

  • Listen to learners articulate their reasoning as they solve problems
  • Detect hesitation, uncertainty, and conceptual confusion as it emerges
  • Interrupt at the moment reasoning breaks, not after an answer is submitted
  • Ask clarifying questions, reframe prompts, or redirect logic mid-solution
  • Resolve doubts immediately through dialogue rather than delayed feedback

Speech is treated as a first-class learning signal, not just an interface convenience. Pauses, corrections, revisions, and response timing inform the system’s model of understanding, enabling it to adapt explanations, pacing, and difficulty on the fly.
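As a rough illustration of how the timing signals above could become features, the sketch below assumes the transcription step yields word-level timestamps. The `Word` type, the 0.8 second pause threshold, and the filler list are hypothetical choices made for this example, not values taken from Big Brain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the end of the prompt
    end: float


# Hypothetical values chosen for illustration only.
LONG_PAUSE_SECONDS = 0.8
FILLERS = {"um", "uh", "hmm", "wait"}


def hesitation_features(words: list[Word]) -> dict:
    """Summarize response latency, long pauses, and fillers in one spoken answer."""
    if not words:
        return {"response_latency": 0.0, "long_pauses": 0, "filler_ratio": 0.0}
    # Silence between the end of one word and the start of the next.
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    fillers = sum(1 for w in words if w.text.lower().strip(",.") in FILLERS)
    return {
        "response_latency": words[0].start,
        "long_pauses": sum(1 for g in gaps if g >= LONG_PAUSE_SECONDS),
        "filler_ratio": fillers / len(words),
    }


words = [
    Word("the", 1.0, 1.2),
    Word("derivative", 1.2, 1.8),
    Word("um", 3.0, 3.2),
    Word("is", 3.3, 3.4),
]
features = hesitation_features(words)
```

Here the 1.2 second silence before "um" counts as one long pause, and one of the four words is a filler. Features like these would be one input among several to the learner model, not a verdict on their own.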

Alongside speech, Big Brain uses a canvas-first interface where learners externalize their thinking through diagrams, sketches, and partial solutions. Canvas interactions are treated as first-class signals, and the agent responds directly to what is drawn or written.

Together, speech and canvas form a continuous learning loop where reasoning is visible, diagnosable, and improvable in real time.

Each interaction updates a persistent cognitive state representation of the learner. That state drives future questions, content selection, pacing, and review.
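A minimal sketch of what such a state could look like, assuming each interaction is reduced to an evidence score between 0 and 1 for a single concept. The `LearnerState` class, the exponential moving average update, the 0.5 starting prior, and the smoothing factor are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LearnerState:
    mastery: dict[str, float] = field(default_factory=dict)   # concept -> estimate in [0, 1]
    misconceptions: Counter = field(default_factory=Counter)  # tag -> times observed

    def observe(self, concept: str, score: float,
                misconception: Optional[str] = None, alpha: float = 0.3) -> None:
        """Blend a new evidence score in [0, 1] into the running estimate."""
        prior = self.mastery.get(concept, 0.5)  # assumed prior when there is no evidence yet
        self.mastery[concept] = (1 - alpha) * prior + alpha * score
        if misconception:
            self.misconceptions[misconception] += 1

    def next_focus(self) -> Optional[str]:
        """Return the concept with the lowest current estimate, if any."""
        return min(self.mastery, key=self.mastery.get) if self.mastery else None


state = LearnerState()
state.observe("chain_rule", 0.2, misconception="treats composition as product")
state.observe("limits", 0.9)
focus = state.next_focus()
```

With these two observations the estimate for `chain_rule` drops to about 0.41 while `limits` rises to about 0.62, so `next_focus()` points at `chain_rule`.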

Big Brain behaves like a learning agent operating in a closed loop of observation, decision, action, and feedback.

How we built it

We built Big Brain as an end-to-end agentic system rather than a single AI feature.

At the core is a continuous feedback loop that models learner cognition and decides how to intervene. Every interaction updates an internal representation of understanding, which then determines the system’s next action.

The architecture consists of four layers:

1. Diagnostic and reasoning layer
We generate adaptive quizzes and open-ended prompts designed to expose reasoning, not memorization. Responses are analyzed for partial understanding, misconception patterns, and false confidence rather than binary correctness.

2. Learning graph and content orchestration
Concepts are represented as a dependency graph instead of a linear syllabus. Based on diagnostic signals, the agent prunes mastered concepts, expands fragile areas, and reorders learning paths. Course videos and explanations are dynamically selected at the concept and timestamp level.

3. Cognitive memory layer
We maintain a persistent learner state that tracks strengths, weaknesses, error patterns, and retention dynamics. This allows the system to avoid repeating ineffective explanations and to revisit concepts before understanding decays.

4. Interactive canvas interface
The frontend is built around a canvas where users externalize their thinking through diagrams, sketches, and explanations. Canvas interactions are treated as first-class signals. The agent responds directly to these interactions, making reasoning observable and actionable.

This closed-loop architecture allows Big Brain to function as a continuously adapting agent rather than a stateless tutor.
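The pruning and expansion described for the learning graph layer can be sketched with the standard library's `graphlib`. This is an illustration under stated assumptions: the two thresholds, the `plan_path` helper, and the rule that a fragile concept pulls its direct prerequisites back in for review are hypothetical, not the project's actual policy.

```python
from graphlib import TopologicalSorter

# Assumed thresholds on a mastery estimate in [0, 1].
MASTERED = 0.8
FRAGILE = 0.4


def plan_path(prereqs: dict[str, set[str]], mastery: dict[str, float]) -> list[str]:
    """Order the concepts still worth studying so prerequisites come first."""
    # Prune: drop concepts whose estimate is at or above the mastered threshold.
    keep = {c for c in prereqs if mastery.get(c, 0.0) < MASTERED}
    # Expand: a fragile concept brings its direct prerequisites back for review.
    for concept in list(keep):
        if mastery.get(concept, 0.0) < FRAGILE:
            keep |= prereqs[concept]
    order = TopologicalSorter(prereqs).static_order()
    return [c for c in order if c in keep]


prereqs = {
    "limits": set(),
    "derivatives": {"limits"},
    "chain_rule": {"derivatives"},
    "integrals": {"derivatives"},
}
mastery = {"limits": 0.9, "derivatives": 0.85, "chain_rule": 0.3, "integrals": 0.6}
path = plan_path(prereqs, mastery)
```

In this example `limits` is pruned, `derivatives` is mastered but comes back because `chain_rule` is fragile, and the resulting path starts with `derivatives` before the two concepts that depend on it.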

Real-Time Speech and Multimodal Learning with LiveKit

We used LiveKit as the real-time communication and streaming backbone for our speech-based learning interactions. Our goal was to make spoken reasoning a first-class signal rather than a secondary input.

Low-Latency Speech Capture

LiveKit enables continuous, low-latency audio streaming between the learner and our system. This allows students to explain their thinking out loud in real time while solving problems, rather than typing partial or polished answers.

Audio streams are captured with minimal latency and forwarded to downstream reasoning and transcription components for immediate analysis.

Spoken Reasoning as a Learning Signal

By integrating LiveKit, we treat speech as a core modality for understanding how a learner thinks. Spoken explanations often surface hesitation, uncertainty, and incomplete mental models that are difficult to capture through text alone.

LiveKit allows us to preserve the temporal structure of speech, including pauses and corrections, which are essential signals for diagnosing understanding.

Event-Level Observability

LiveKit session events are logged alongside learning interactions, enabling us to associate:

  • audio segments with specific concepts or quiz prompts
  • timing information such as response latency and pauses
  • transitions between speaking, drawing, and answering

This creates a unified multimodal trace of each learning interaction.
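One way to picture that unified trace: merge the per-modality event streams into a single timeline, then read off where the learner switches modality. The sketch below is plain Python and does not call the LiveKit SDK. The `Event` fields and both helper functions are assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float          # seconds since session start
    modality: str     # "speech", "canvas", or "answer"
    prompt_id: str    # quiz prompt active when the event fired
    detail: str = ""


def unified_trace(*streams: list[Event]) -> list[Event]:
    """Merge per-modality streams (each already time-sorted) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.t))


def transitions(trace: list[Event]) -> list[tuple[str, str, float]]:
    """List (from_modality, to_modality, gap_seconds) wherever the modality changes."""
    return [
        (a.modality, b.modality, round(b.t - a.t, 3))
        for a, b in zip(trace, trace[1:])
        if a.modality != b.modality
    ]


speech = [Event(0.5, "speech", "q1", "starts explaining"),
          Event(4.0, "speech", "q1", "self-corrects")]
canvas = [Event(2.0, "canvas", "q1", "draws axes")]
answers = [Event(6.5, "answer", "q1", "submits")]

trace = unified_trace(speech, canvas, answers)
switches = transitions(trace)
```

For this session the timeline runs speech, canvas, speech, answer, and the first transition is a switch from speaking to drawing after 1.5 seconds.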

Real-Time Feedback Loop

Because LiveKit supports bidirectional real-time communication, our system can respond immediately to spoken reasoning. The agent can interrupt, ask clarifying questions, or redirect the learner based on what is said, not just the final answer.

This enables a conversational learning loop that more closely resembles human tutoring.

Infrastructure-First Design

LiveKit serves as the real-time layer that makes speech-based, agent-driven learning possible at scale. It integrates cleanly with our tracing and evaluation pipeline, allowing speech interactions to be analyzed, evaluated, and improved over time using the same tooling as text-based generations.

LiveKit transformed speech from an interface feature into an observable, analyzable component of our learning system.

How We Used Arize to Improve Our Agent

We treated quiz generation as a first-class LLM system and used Arize as the core observability, evaluation, and experimentation layer. Our primary goal was not just to generate quizzes, but to rigorously measure and improve how different prompting strategies affect downstream reasoning quality.

Tracing at Scale

We instrumented our application with Arize tracing to capture fine-grained, generation-level data across the entire quiz generation lifecycle. Each trace records:

  • the raw user learning prompt
  • the full quiz generation prompt template
  • the generation configuration including temperature and mode
  • the LLM generated quiz content
  • structured metadata such as concept tags and quiz intent

This allowed us to collect a high volume of comparable traces across multiple prompt strategies and learning contexts.
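The fields listed above map naturally onto a small record type. The sketch below is plain Python and does not use the Arize SDK. The `QuizTrace` class, its field names, and the `quiz.` key prefix are hypothetical and only show one way the recorded data could be shaped before it is attached to a trace.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class QuizTrace:
    user_prompt: str        # raw user learning prompt
    prompt_template: str    # full quiz generation prompt template
    temperature: float      # generation configuration
    mode: str               # e.g. "creative" or "structured"
    quiz_content: str       # LLM-generated quiz
    concept_tags: list[str] = field(default_factory=list)
    quiz_intent: str = ""

    def to_attributes(self) -> dict:
        """Flatten the record into prefixed key/value pairs."""
        return {f"quiz.{key}": value for key, value in asdict(self).items()}


trace = QuizTrace(
    user_prompt="Help me understand Bayes' rule",
    prompt_template="Write a diagnostic quiz about {topic} ...",
    temperature=0.9,
    mode="creative",
    quiz_content="Q1: A test is 95% accurate ...",
    concept_tags=["bayes_rule", "conditional_probability"],
    quiz_intent="surface base-rate neglect",
)
attrs = trace.to_attributes()
```

Keeping every strategy's generations in the same shape is what makes traces comparable later, since evaluators can read the same keys regardless of which prompt produced the quiz.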

Multi-Strategy Prompt Evaluation

We implemented two distinct quiz generation strategies:

  • a creative, higher temperature strategy designed to surface exploratory reasoning
  • a structured, lower temperature strategy designed for precise assessment

Rather than selecting a strategy heuristically, we used Arize evaluations to empirically compare their performance.

LLM-as-a-Judge Evaluators

We built LLM-as-a-judge evaluators inside Arize to score each generated quiz on multiple dimensions:

  • conceptual alignment with the user prompt
  • strength and clarity of the reasoning signal elicited
  • effectiveness for diagnosing misconceptions

Evaluators were run directly on traced generations, enabling automated, repeatable comparison across prompt variants.
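To make the evaluator idea concrete, here is a minimal sketch of a rubric-based judge. It is not the Arize evaluator API: the prompt wording, the 1 to 5 scale, the dimension keys, and the `call_llm` parameter are assumptions, and a stub stands in for the model so the example runs on its own.

```python
import json
from typing import Callable

# Hypothetical keys for the three dimensions listed above.
DIMENSIONS = ("conceptual_alignment", "reasoning_signal", "diagnostic_value")


def build_judge_prompt(user_prompt: str, quiz: str) -> str:
    """Assemble the rubric prompt sent to the judge model."""
    keys = ", ".join(DIMENSIONS)
    return (
        "Rate the quiz from 1 (poor) to 5 (excellent) on each dimension.\n"
        f"Reply with a JSON object with exactly these keys: {keys}.\n\n"
        f"Learner request:\n{user_prompt}\n\nQuiz:\n{quiz}\n"
    )


def score_quiz(user_prompt: str, quiz: str,
               call_llm: Callable[[str], str]) -> dict[str, int]:
    """Ask the judge for scores and validate that each one is in range."""
    reply = json.loads(call_llm(build_judge_prompt(user_prompt, quiz)))
    scores = {}
    for dim in DIMENSIONS:
        value = int(reply[dim])
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} out of range: {value}")
        scores[dim] = value
    return scores


def fake_llm(prompt: str) -> str:
    """Stub judge used only so this example is self-contained."""
    return json.dumps(
        {"conceptual_alignment": 4, "reasoning_signal": 5, "diagnostic_value": 3}
    )


scores = score_quiz("I keep mixing up correlation and causation", "Q1: ...", fake_llm)
```

Passing the model call in as a function keeps the scoring logic testable without network access and lets the same rubric run against any prompt variant.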

Dataset Construction and Golden Set

From our trace store, we created a curated dataset of quiz generations. We then constructed a golden dataset by having humans label which quiz better exposed reasoning gaps. This allowed us to benchmark LLM-based evaluations against human judgment and validate evaluator reliability.
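One common way to benchmark a judge against human labels is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The source does not say which statistic was used, so treat this as an illustrative choice. The labels below are made-up example data, where "A" and "B" mark which quiz in a pair was preferred.

```python
from collections import Counter


def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge = ["A", "A", "B", "A", "A", "B", "B", "B"]
observed, kappa = agreement_and_kappa(human, judge)
```

In this toy example the raters agree on 6 of 8 pairs (0.75), chance agreement is 0.5, and kappa comes out to 0.5.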

Controlled Experiments

Using Arize experiments, we ran repeated evaluations across the same dataset while iterating on prompt design. By holding the dataset constant and varying only prompt strategy, we were able to observe consistent metric improvements and isolate the impact of each change.

Measurable Outcomes

Insights from Arize experiments directly informed runtime behavior. The system now selects quiz generation strategies based on empirical evaluation results rather than fixed rules. This led to measurable improvements in reasoning signal clarity and diagnostic usefulness across user interactions.
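A sketch of how evaluation results could drive that runtime choice, assuming scores are grouped by some notion of learning context. The context names, strategy names, scores, and the `best_strategy_by_context` helper are hypothetical, and the numbers are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def best_strategy_by_context(results: list[tuple[str, str, float]]) -> dict[str, str]:
    """Map each context to the strategy with the highest mean evaluator score.

    Each result is a (context, strategy, score) tuple.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for context, strategy, score in results:
        scores[context][strategy].append(score)
    return {
        context: max(by_strategy, key=lambda s: mean(by_strategy[s]))
        for context, by_strategy in scores.items()
    }


# Invented evaluation results for illustration only.
results = [
    ("proof", "creative", 0.8),
    ("proof", "structured", 0.6),
    ("proof", "creative", 0.7),
    ("computation", "creative", 0.5),
    ("computation", "structured", 0.9),
]
policy = best_strategy_by_context(results)
```

The resulting lookup table can be rebuilt whenever a new experiment finishes, so the runtime rule stays tied to the latest evaluation data instead of a hard-coded default.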

Advanced Arize Usage

We actively used the Arize Prompt Playground to iterate on prompts, inspect trace-level behavior, and refine evaluators. The pipeline is designed to extend naturally into Phoenix APIs for deeper reasoning and error-mode analysis.

Arize functioned as the experimentation backbone of our system, enabling a closed loop from trace collection to evaluation, dataset creation, experimentation, and agent improvement.

Challenges we ran into

1. Modeling understanding instead of answers
Most learning systems are built around correctness. We had to rethink how to detect partial understanding, false confidence, and reasoning gaps. This meant designing prompts and diagnostics that reveal how someone thinks, not just whether they are right.

2. Balancing flexibility with structure
Different subjects require different representations. Math needs diagrams and symbolic reasoning. Computer science needs step-by-step logic. Economics and finance rely heavily on intuition and models. Designing a single system that supports all of these without becoming rigid or confusing was a major challenge.

3. Making adaptation feel intentional, not random
Adaptive systems can easily feel unpredictable to users. We had to ensure that changes in the learning path felt purposeful and understandable, not arbitrary. This required careful sequencing and clear feedback.

4. Working within hackathon constraints
Building an end-to-end learning system in a limited time forced us to prioritize core ideas over completeness. We focused on making the intelligence and interaction visible rather than covering every edge case.

These challenges shaped Big Brain into a system that emphasizes diagnosis, transparency, and real understanding over surface-level performance.

Accomplishments that we're proud of

  • Built a learning system that models how students think and adaptively changes content, questions, and pacing based on reasoning patterns rather than correctness alone.

  • Designed and implemented a canvas-first learning experience where understanding is made visible through diagrams, sketches, and partial solutions, enabling direct AI feedback beyond chat.

  • Developed an adaptive diagnostic and syllabus generation pipeline that dynamically prunes mastered concepts and expands fragile understanding across multiple technical disciplines.

  • Created a teach-back mechanism that verifies real understanding by challenging learners to explain concepts, interrupting faulty logic, and preventing surface-level mastery.

These accomplishments reflect our focus on building intelligence that teaches, not just answers.

What we learned

  • Understanding is hard to measure, and modeling how someone thinks requires looking at process, not just outcomes or correctness.

  • Small UI decisions deeply affect learning behavior, and moving from chat to a canvas changes users from passive consumers into active thinkers.

  • Adaptation only feels intelligent when it is transparent and grounded in clear diagnostic signals rather than hidden heuristics.

  • Building agentic systems is less about adding autonomy and more about designing tight feedback loops between observation, decision-making, and action.

These lessons shaped both our technical architecture and our approach to designing learning systems.

What's next for Big Brain

  • Introduce collaborative canvas sessions where students can learn together while the system tracks individual reasoning paths.

  • Build instructor dashboards that surface misconception patterns and learning gaps at the class level.

  • Integrate with learning management systems to support real coursework rather than standalone practice.

  • Improve multimodal interaction by deepening voice interaction and adding structured handwriting recognition on the canvas.

  • Explore spatial and augmented reality learning modes for subjects that benefit from visual and geometric intuition.

Our goal is to continue building Big Brain into a system that understands learners more deeply over time and helps them develop knowledge that holds beyond individual problems.
