Inspiration

SVSearch was born from a critical challenge in educational technology: how do we automatically align millions of open educational resources with curriculum standards at scale? OpenStax, serving 18 million students worldwide, faces a monumental task: manually tagging thousands of textbook sections with educational standards.

For Rice Datathon 2026's Education Track, our team asked: What if AI could understand educational content semantically, not just match keywords? SVSearch is our answer: a zero-shot semantic classifier that achieves 30.85% Top-1 accuracy and 57.45% Top-3 accuracy without requiring any training data, changing how educational content gets aligned with curriculum standards.

What it does

SVSearch is an intelligent standards prediction engine that automatically tags educational content with the most relevant curriculum standards using semantic understanding rather than keyword matching.

Core Capabilities:

Zero-Shot Classification

  • Predicts standards for any educational content without prior training examples
  • Leverages pre-trained semantic knowledge from billions of documents
  • Eliminates the need for 10,000+ labeled training samples traditional ML requires

Semantic Understanding

  • Recognizes that "repeating decimals" relates to "rational numbers" (different words, same concept)
  • Understands "quadratic equation" ≈ "polynomial roots" (mathematical equivalence)
  • Captures domain expertise embedded in 768-dimensional vector space
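The semantic matching above boils down to cosine similarity between embedding vectors. Here is a minimal sketch with tiny 4-dimensional toy vectors standing in for real 768-dimensional Gemini embeddings (the example vectors are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors; 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins: with real Gemini embeddings, "repeating decimals" and
# "rational numbers" land close together despite sharing no keywords.
repeating_decimals = [0.8, 0.6, 0.1, 0.0]
rational_numbers   = [0.7, 0.7, 0.2, 0.1]
photosynthesis     = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(repeating_decimals, rational_numbers))  # high (~0.98)
print(cosine_similarity(repeating_decimals, photosynthesis))    # low  (~0.12)
```

Keyword matching would score both pairs at zero overlap; the vector geometry is what separates them.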

Dual-Mode Architecture

  • Primary: MongoDB Atlas Vector Search (0.284s per query, 2.4x faster)
  • Fallback: Local cosine similarity (0.690s, 100% reliability)
  • Automatic failover ensures zero downtime
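The failover between the two modes can be sketched as a simple try/except wrapper; the search functions here are hypothetical stand-ins, not the project's actual code:

```python
def predict_standard(query_embedding, primary, fallback, top_k=3):
    """Try the primary (Atlas Vector Search) path; on any failure,
    fall back to local cosine similarity so a query never goes unanswered."""
    try:
        return primary(query_embedding, top_k)    # ~0.284 s/query
    except Exception:
        return fallback(query_embedding, top_k)   # ~0.690 s/query, always available

# Demo with stand-in search functions (for illustration only):
def failing_atlas(query, k):
    raise ConnectionError("Atlas index unavailable")

def local_search(query, k):
    return [("8.NS.A.1", 0.92)][:k]

print(predict_standard([0.1] * 768, failing_atlas, local_search))
```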

Production-Ready Performance

  • Processes 94 test items in 3 seconds (vs. 470 minutes manually)
  • 99% time reduction with 67% average confidence
  • Handles edge cases: empty text, missing standards, API failures

How we built it

Methodology:

  1. Data Loading & Cleaning:
     • Flattened the nested JSON into tabular format while preserving hierarchical context
     • Engineered composite features by concatenating metadata from each level (book, domain, cluster, description)
     • Validated data consistency between training and testing sets
     • Processed 551 training items and 94 test items with 173 unique standard definitions
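The flattening step might look like the following sketch; the field names are assumptions for illustration, not the Datathon data's actual schema:

```python
# Hypothetical nested record; real field names in the dataset may differ.
record = {
    "book": "Prealgebra",
    "domain": "The Number System",
    "clusters": [
        {"cluster": "Know that there are numbers that are not rational",
         "standards": [{"code": "8.NS.A.1",
                        "description": "Convert a repeating decimal to a rational number"}]},
    ],
}

def flatten(record):
    """Walk the hierarchy and emit one flat row per standard, keeping
    book/domain/cluster context alongside the leaf description."""
    rows = []
    for cl in record["clusters"]:
        for std in cl["standards"]:
            rows.append({
                "book": record["book"],
                "domain": record["domain"],
                "cluster": cl["cluster"],
                "code": std["code"],
                # Composite feature: concatenated metadata from every level.
                "text": " | ".join([record["book"], record["domain"],
                                    cl["cluster"], std["description"]]),
            })
    return rows

rows = flatten(record)
print(rows[0]["text"])
```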

  2. Embedding Generation:
     • Used the Google Gemini API (models/text-embedding-004) to generate vector embeddings for educational standards definitions
     • Enhanced each standard's text with hierarchical context before embedding to capture its full educational meaning
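A sketch of this step, based on the public google-generativeai package docs rather than the team's exact code; the `" > "` context format is an assumption:

```python
def build_embedding_text(book, domain, cluster, description):
    """Prepend hierarchical context so the embedding captures the
    standard's full educational meaning, not just its leaf text."""
    return f"{book} > {domain} > {cluster}: {description}"

def embed_standard(text: str) -> list[float]:
    """Call Gemini's text-embedding-004 model (768 dimensions).
    Requires `pip install google-generativeai` and a configured API key."""
    import google.generativeai as genai
    result = genai.embed_content(model="models/text-embedding-004",
                                 content=text,
                                 task_type="retrieval_document")
    return result["embedding"]

text = build_embedding_text("Prealgebra", "The Number System",
                            "Irrational numbers",
                            "Convert a repeating decimal to a rational number")
print(text)
```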

  3. Vector Database Setup:
     • Designed a schema to store standard codes, definitions, and 768-dimensional embeddings
     • Configured a vector search index with the cosine similarity metric for semantic matching
     • Uploaded 173 standards with their embeddings as our semantic reference library
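An Atlas Vector Search index for this schema could be defined as below; the index and field names are our assumptions, not confirmed project values:

```python
# Atlas Vector Search index definition: cosine similarity over 768-dim vectors.
VECTOR_INDEX = {
    "fields": [{
        "type": "vector",
        "path": "embedding",        # document field holding the Gemini vector
        "numDimensions": 768,
        "similarity": "cosine",
    }]
}

def create_index(collection):
    """Create the index via pymongo (requires pymongo >= 4.7 for type=)."""
    from pymongo.operations import SearchIndexModel
    collection.create_search_index(
        SearchIndexModel(definition=VECTOR_INDEX,
                         name="standards_vector_index",
                         type="vectorSearch"))
```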

  4. Semantic Search:

  Primary Method: MongoDB Atlas Vector Search
     • Uses an approximate nearest neighbors (ANN) algorithm for fast similarity search
     • Performance: 0.284 seconds per query
     • Searches multiple candidates and returns the best match
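The Atlas query runs as a `$vectorSearch` aggregation stage; this is a sketch with assumed index and field names:

```python
def vector_search_pipeline(query_vector, limit=3, num_candidates=50):
    """Aggregation pipeline for Atlas $vectorSearch: ANN over `num_candidates`
    candidates, returning the `limit` closest standards with their scores."""
    return [
        {"$vectorSearch": {
            "index": "standards_vector_index",   # assumed index name
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": num_candidates,
            "limit": limit,
        }},
        {"$project": {"code": 1, "description": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

# In production this would be passed to collection.aggregate(pipeline).
pipeline = vector_search_pipeline([0.1] * 768)
```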

Fallback Method: Local Cosine Similarity

  • Manually calculates similarity across all 173 standards
  • Performance: 0.690 seconds per query
  • Guaranteed to work when the Atlas index is unavailable
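The fallback is a brute-force pass over the standards library, sketched here with toy 2-D vectors in place of the real 768-dimensional embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def local_search(query_vec, standards, top_k=3):
    """Score the query against every standard embedding; slower than Atlas
    but has no external dependency, so it always works."""
    scored = [(code, cosine(query_vec, vec)) for code, vec in standards.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]

# Toy 2-D library (hypothetical codes; real vectors are 768-dimensional):
library = {"8.NS.A.1": [1.0, 0.0], "8.EE.A.2": [0.0, 1.0], "8.F.A.1": [0.7, 0.7]}
print(local_search([0.9, 0.1], library, top_k=2))
```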
  5. Prediction & Evaluation:
     • Implemented Top-1 accuracy (exact match: 30.85%)
     • Calculated Top-3 accuracy (correct answer in top 3: 57.45%)
     • Tracked confidence scores (average: 87.62%, range: 70-95%)
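Top-k accuracy generalizes both metrics: an item counts as correct if its gold standard appears anywhere in the top k ranked predictions. A minimal sketch with made-up codes:

```python
def top_k_accuracy(predictions, truth, k):
    """predictions: ranked standard codes per item; truth: gold code per item.
    Returns the fraction of items whose gold code is in the top k."""
    hits = sum(1 for preds, gold in zip(predictions, truth) if gold in preds[:k])
    return hits / len(truth)

preds = [["8.NS.A.1", "8.EE.A.2", "8.F.A.1"],
         ["8.EE.A.2", "8.NS.A.1", "8.F.A.1"],
         ["8.F.A.1", "8.G.A.1", "8.SP.A.1"]]
gold  = ["8.NS.A.1", "8.NS.A.1", "8.EE.A.2"]

print(top_k_accuracy(preds, gold, 1))  # 1/3: only the first item is a Top-1 hit
print(top_k_accuracy(preds, gold, 3))  # 2/3: the second item's gold code is ranked 2nd
```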

Challenges we ran into

Our biggest challenge came early: we first tested our models without any embeddings, tokenization, or semantic search in the system, which left us with low-accuracy predictions of the most relevant standard definitions for a given course context. It took a while to integrate the Google Gemini embedding API with a vector search cluster to improve predictions and reach our best RF and Top-1 accuracy.

Accomplishments that we're proud of

We solved the core questions posed by the track and reached our best RF prediction score for the problem.

What we learned

  1. Understand the data first: clean it, inspect it, and visualize what it shows.
  2. How to implement MongoDB for semantic search and the Google Gemini API for embeddings.
  3. How to communicate, work, and cooperate as a team.

What's next for Semantic Vector Search for Standard Classification

  • We plan to bring this technology to education systems broadly, helping students prepare for any standardized course or test.