EduMeter

Measuring Mentor

@Inspiration

The rapid growth of AI tutors powered by Large Language Models (LLMs) prompted us to question their real educational value. These systems generate fluent responses, but we noticed that they often:

↪ Fail to spot student mistakes precisely.

↪ Provide vague or misleading guidance.

This sparked the idea behind EduMeter: a system that evaluates whether AI (or human) tutor responses are both pedagogically sound and helpful for learning.

@What We Learned

⇱ The difference between content correctness and pedagogical effectiveness: a tutor can say the right thing in a way that doesn’t actually help the student.

⇱ How to classify responses on a three-level scale (Yes, To some extent, No) rather than as simply correct or incorrect.

⇱ That evaluating meta-reasoning (how the tutor reasons about the student’s mistake) is as important as evaluating the answer itself.

⇱ Practical experience with annotated educational dialogues and designing metrics for education-focused NLP.

@How We Built It

⟫ Dataset: Annotated student–tutor dialogues where the final tutor turn was labeled along two dimensions:

» Mistake Identification

» Providing Guidance

⟫ Modeling:

» We experimented with sentence embeddings (using Sentence-BERT) to capture semantic meaning.

» We built one classifier per track, trying both Logistic Regression and XGBoost.

» We also tested LLM prompting with few-shot examples to compare zero-shot and few-shot performance against the trained classifiers.

⟫ Evaluation:

» Classification labels: $\{\text{Yes},\ \text{To some extent},\ \text{No}\}$

» Metrics: Macro-F1 and accuracy, with special attention to class imbalance.
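As a rough illustration of why Macro-F1 is reported alongside accuracy, the sketch below computes both in plain Python for the three labels. The toy gold labels and the always-"To some extent" baseline are invented for illustration and are not EduMeter data or results; the helper names (`f1_for_label`, `macro_f1`, `accuracy`) are hypothetical.

```python
# Minimal sketch: accuracy vs. Macro-F1 on an imbalanced three-label task.
# All data below is made up for illustration only.

LABELS = ("Yes", "To some extent", "No")


def f1_for_label(gold, pred, label):
    """F1 for one label, treating it as the positive class (0.0 if never hit)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


def accuracy(gold, pred):
    """Fraction of examples where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy imbalanced gold labels and a baseline that always predicts the frequent label.
gold = ["To some extent"] * 6 + ["Yes"] * 2 + ["No"] * 2
baseline = ["To some extent"] * len(gold)

print(f"accuracy = {accuracy(gold, baseline):.2f}")
print(f"macro-F1 = {macro_f1(gold, baseline):.2f}")
```

On this toy split the baseline reaches 0.60 accuracy but only 0.25 Macro-F1, because two of the three labels are never predicted. That gap is the reason to track both numbers when one label dominates.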

@Challenges We Ran Into

⇲ Ambiguity in Labels: Sometimes tutor responses were partially right, making it difficult to distinguish between Yes and To some extent.

⇲ Guidance vs. Identification: A tutor could identify the mistake correctly but fail to provide meaningful guidance, so keeping the two tracks independent was tricky.

⇲ Data Balance: The “To some extent” label was frequent, so models tended to overpredict it.

⇲ LLM Hallucinations: When prompted, large models occasionally over-explained and drifted away from a precise evaluation.
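Two of these challenges have common mitigations, sketched below under stated assumptions rather than as a record of what EduMeter actually did. `balanced_class_weights` applies the inverse-frequency heuristic (`n_samples / (n_classes * count)`) so that a frequent label contributes less per example during training. `parse_label` accepts an LLM reply only if it begins with one of the three labels, which discards replies that drift into free-form explanation. Both function names and the sample strings are hypothetical.

```python
import re
from collections import Counter

LABELS = ("Yes", "To some extent", "No")


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_label)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def parse_label(raw_output):
    """Return the label that the reply starts with, or None if it starts with none."""
    text = raw_output.strip().lower()
    # Try longer labels first; \b stops "No" from matching words like "Not".
    for label in sorted(LABELS, key=len, reverse=True):
        if re.match(re.escape(label.lower()) + r"\b", text):
            return label
    return None


# Invented training labels with one dominant class.
train_labels = ["To some extent"] * 6 + ["Yes"] * 2 + ["No"] * 2
weights = balanced_class_weights(train_labels)
for label in LABELS:
    print(f"{label!r}: weight {weights[label]:.2f}")

# Invented LLM replies: the last two are rejected instead of being guessed at.
for reply in ["Yes, the tutor names the sign error.",
              "  to some extent - it only hints at the step",
              "Not sure",
              "The tutor response is long and friendly..."]:
    print(repr(reply), "->", parse_label(reply))
```

Rejected replies (those returning `None`) can be re-prompted or counted as errors. Either choice keeps the evaluation tied to the fixed label set.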

@Accomplishments We’re Proud Of

⨳ Designing a framework that goes beyond correctness to evaluate teaching effectiveness.

⨳ Achieving competitive performance on both tracks with relatively lightweight models.

⨳ Creating a project that can be extended to real-world educational platforms to ensure students get not just answers, but better learning experiences.

@What’s Next

◪ Multi-subject extension: EduMeter currently focuses on mathematics, but the same idea can apply to science, programming, and languages.

◪ Human–AI comparison: Benchmarking human tutor responses against LLMs.

◪ Integration: Building a plug-in for EdTech platforms to automatically evaluate tutor quality.