Inspiration

People are building their class schedules now for the next semester, and many wish they had a reliable place to see which classes are most GPA-friendly. There used to be a site that listed easy-A classes, but it’s gone and lacked key features like professor-level rankings, full comment visibility, and AI-assisted summaries. SBU-Recs fills that gap by combining complete grade-distribution data with cleaned student comments and an AI-backed assistant so students can make smarter scheduling decisions.

What it does

  • Ranks courses and instructors by a robust "ease" score derived from real grade-distribution data (not just raw A-counts).
  • Shows full grade-distribution details per section and per instructor so you can judge not only how many As are given but how many students were in the class.
  • Displays every cleaned student comment ("valuable" vs "improvement") for each course/section, with search and filtering.
  • Provides professor/instructor leaderboards (rankings) for departments and courses.
  • Includes a floating RAG chat assistant that answers free-text questions about courses, instructors, and trends:
    • SBC/vector mode: semantic retrieval + LLM synthesis (when an API key is available).
    • Lenient/offline mode: a fast, heuristic retriever that uses the cleaned CSVs so answers work without an API key.
  • Client-side CSV ingestion and a server/CLI toolchain to clean, split, and prepare large datasets for efficient retrieval.
  • Prefers pre-cleaned datasets stored in cleaned/ if present, making nightly or batch cleaning straightforward.

How we built it

  • Frontend: React + TypeScript (Vite) for a fast SPA experience, Tailwind-like utilities for styling, and Zustand for local app state.
  • CSV parsing & client logic: PapaParse + custom normalization in csv.ts.
  • Data cleaning & pipeline scripts: Python (pandas) scripts in scripts/ to clean comments, remove footers/signatures, sanitize SBC tags, and split large RAG datasets into <50 MB parts.
  • RAG & CLI tooling: rag.py supports both vector-backed retrieval (OpenAI embeddings + Chroma) and a lenient offline mode.
  • Developer helpers: split/recombine scripts and a manifest system (index.json) to let the frontend prefer cleaned assets.
  • Assisted development: used Cursor and GitHub Copilot to accelerate prototyping and refactoring.

Challenges we ran into

  • Data acquisition: getting the full Classie Evals exports and aligning headers/encodings was nontrivial.
  • Data cleaning: comments contain repeated footers, signatures, and noisy tokens (SBC tags with digits or long tokens). Building robust cleaning heuristics required iterating on real data.
  • Serving large datasets: single CSVs can exceed hosting limits; we wrote strict splitting/recombination tools to keep parts under size limits for vector ingestion and hosting.
  • Ranking math: raw A-counts aren’t meaningful — a single-student section shouldn’t outrank a 200-student section. We used Bayesian smoothing toward the global A-rate and weighted volume to avoid small-sample bias.
  • AI & infra: making RAG reliable was tricky — vector stores, embeddings, and prompt design needed tuning. The LLM sometimes ignored evidence; the offline retriever gives stability.

Accomplishments that we're proud of

  • A full-stack pipeline that goes from raw Classie CSVs to cleaned, split, and consumable datasets.
  • A responsive UI showing grade distributions, instructor rankings, and all original comments.
  • A flexible RAG assistant with a graceful offline fallback for users without API keys.
  • Robust heuristics for cleaning comments and normalizing SBC tags.
  • Tools to split large RAG datasets into <50 MB parts for hosting and ingestion.

What we learned

  • Data quality takes the most time — cleaning messy comments required far more effort than expected.
  • Small-sample bias is real — naive metrics are misleading without smoothing.
  • Offline fallback support dramatically improves reliability.
  • Iterating in small steps (clean → split → ingest → UI) reduces risk and speeds development.

What's next for SBU-Recs

  • Deeper AI integration: better prompt engineering, reranking, and domain-specific tuning.
  • Schedule generator: propose GPA-optimized semester plans with prerequisites and offering patterns.
  • More data enrichment: instructor office hours, workload estimates, cross-listing cleanup.
  • UX polish: better mobile UI for the assistant, more visualizations, per-department dashboards.
  • Deployment & privacy: admin tools for dataset refresh and improved privacy controls for comments.

Built With

Share this project:

Updates