Inspiration

OpenStax serves 1.7 million students in 169 countries with free textbooks. But mapping educational content to learning standards is slow and expensive. Manual labeling can take months and cost thousands of dollars, leaving under-resourced schools at a disadvantage. Our goal is to democratize curriculum alignment so every teacher can get expert-level standard mapping instantly and for free.

What it does

The objective is to assign the most relevant educational standard to each textbook item (section, example, title, etc.) using curriculum constraints and semantic similarity. The solution does not rely on supervised classification or label-based training. Instead, it combines curriculum-aware filtering with Sentence-BERT embeddings and cosine similarity ranking.

How we built it

Candidate Standard Filtering

Each concept block in the dataset includes a list of domain-level standard tokens (e.g., NS.A). Only standards whose full codes contain these tokens (e.g., 8.NS.A.1) are considered as candidates. This enforces curriculum structure and prevents cross-domain assignments.
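
A minimal sketch of this filtering step (the names `domain_tokens` and `all_standards` are illustrative, not our exact identifiers):

```python
def candidate_standards(domain_tokens, all_standards):
    """Keep only standards whose full codes contain one of the
    concept block's domain-level tokens (e.g., 'NS.A' matches '8.NS.A.1')."""
    return [
        code for code in all_standards
        if any(token in code for token in domain_tokens)
    ]

# A block tagged NS.A never receives a geometry (G) standard:
candidate_standards(["NS.A"], ["8.NS.A.1", "8.NS.A.2", "8.G.B.7"])
# -> ['8.NS.A.1', '8.NS.A.2']
```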

Text Extraction

For each textbook item, a text representation is constructed:

  • Primary source: If a URL is available, the script fetches the HTML page and extracts the section content under the heading corresponding to type + number (e.g., Section 9.6, Example 9.24).
  • Fallback source: If HTML extraction is unavailable or fails, structured JSON fields (book, concept, domain, text, description) are used instead.
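
A sketch of how this two-step extraction can work; the field names (`url`, `type`, `number`, etc.) and the heading-matching logic are simplified assumptions, not our exact script:

```python
import requests
from bs4 import BeautifulSoup

def extract_item_text(item):
    """Primary: scrape the section under the matching heading.
    Fallback: concatenate structured JSON fields."""
    if item.get("url"):
        try:
            html = requests.get(item["url"], timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            target = f'{item["type"]} {item["number"]}'  # e.g., "Example 9.24"
            heading = soup.find(
                lambda tag: tag.name in ("h1", "h2", "h3") and target in tag.get_text()
            )
            if heading:
                # Collect sibling text until the next heading of the same level.
                parts = []
                for sib in heading.find_next_siblings():
                    if sib.name == heading.name:
                        break
                    parts.append(sib.get_text(" ", strip=True))
                if parts:
                    return " ".join(parts)
        except requests.RequestException:
            pass  # fetch failed -> fall through to the JSON fallback
    fields = ("book", "concept", "domain", "text", "description")
    return " ".join(str(item[f]) for f in fields if item.get(f))
```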

Semantic Embedding

  • Standard definitions and textbook item text are embedded using Sentence-BERT.
  • Embeddings are L2-normalized to enable cosine similarity via dot product.
  • Standard embeddings are cached per domain token set for efficiency.
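
In sentence-transformers terms, this step looks roughly like the sketch below (the checkpoint name is an assumption; any SBERT variant works the same way):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
_cache = {}  # standard embeddings, keyed by the set of domain tokens

def embed_standards(domain_tokens, standard_texts):
    """Embed candidate standard definitions once per domain-token set.
    normalize_embeddings=True L2-normalizes the vectors, so a dot
    product later is exactly cosine similarity."""
    key = frozenset(domain_tokens)
    if key not in _cache:
        _cache[key] = model.encode(standard_texts, normalize_embeddings=True)
    return _cache[key]
```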

Similarity-Based Assignment

  • Cosine similarity is computed between the item embedding and all candidate standard embeddings. The highest-scoring standard is selected.
  • If the maximum similarity is below a configurable threshold, no standard is assigned.
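
Putting the pieces together, a sketch of the assignment step (the 0.35 threshold is illustrative; ours is configurable):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint, as above

def assign_standard(item_text, candidate_codes, candidate_embs, threshold=0.35):
    """Return the best-matching standard code, or None to abstain.
    candidate_embs: L2-normalized rows aligned with candidate_codes."""
    item_emb = model.encode(item_text, normalize_embeddings=True)
    scores = candidate_embs @ item_emb  # dot product == cosine on unit vectors
    best = int(np.argmax(scores))
    return candidate_codes[best] if scores[best] >= threshold else None
```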

Evaluation

If ground-truth labels are present in testing.json, the script reports:

  • Top-1 accuracy
  • Number of evaluated items
  • Number of correct predictions
  • Missing labels and predictions

Evaluation is performed strictly for reporting purposes and does not influence inference.
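
The reporting logic reduces to a few lines; the dict shapes here ({item_id: standard_code}) are an assumption about the script's internals:

```python
def report_top1(predictions, labels):
    """predictions, labels: {item_id: standard_code} mappings."""
    evaluated = [k for k in labels if predictions.get(k) is not None]
    correct = sum(predictions[k] == labels[k] for k in evaluated)
    print(f"Evaluated items:     {len(evaluated)}")
    print(f"Correct predictions: {correct}")
    print(f"Top-1 accuracy:      {correct / max(len(evaluated), 1):.3f}")
    print(f"Missing predictions: {len(labels) - len(evaluated)}")
```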

Challenges we ran into

  • HTML structure differences may affect section extraction quality.
  • Some standards are semantically close and difficult to distinguish using text alone.
  • Threshold selection involves a trade-off between precision and coverage.

Accomplishments that we're proud of

  • Cracked the zero-shot problem: Maintained accuracy even though 44% of standards had no training examples
  • URL scraping: Two-step approach using URL first, then JSON fallback, giving clear accuracy gains
  • Domain-aware filtering: Stops nonsensical predictions, like applying a geometry standard to algebra content
  • Kept it simple: With 98.7% of sections being single-label, we chose a straightforward approach instead of over-engineering

What we learned

At first, we built a model that scored 0% accuracy. That’s when we realized we needed a completely new approach, and honestly, it was a long journey to get here!

What's next for SemanticMatching

**Addressing our limitations:**

  • Better HTML parsing: using multiple HTML extraction strategies
  • Clearer distinctions: applying hierarchical modeling and multimodal signals to separate closely related standards
  • Predictive model on real data: Training a supervised model with TF-IDF features once a much larger labeled dataset is available

Hopefully, with this project:

  • More schools get access to the same quality of textbooks
  • Textbook updates can be rolled out much more quickly
  • Standardized learning resources are available to more students, regardless of location

**BGM: Chopin Nocturne Op. 9 No. 2**
