Inspiration

OpenStax serves 1.7 million students in 169 countries with free textbooks. But mapping educational content to learning standards is slow and expensive. Manual labeling can take months and cost thousands of dollars, leaving under-resourced schools at a disadvantage. Our goal is to democratize curriculum alignment so every teacher can get expert-level standard mapping instantly and for free.

What it does

The objective is to assign the most relevant educational standard to each textbook item (section, example, title, etc.) using curriculum constraints and semantic similarity. The solution does not rely on supervised classification or label-based training. Instead, it combines curriculum-aware filtering with Sentence-BERT embeddings and cosine similarity ranking.

How we built it

Candidate Standard Filtering

Each concept block in the dataset includes a list of domain-level standard tokens (e.g., NS.A). Only standards whose full codes contain these tokens (e.g., 8.NS.A.1) are considered as candidates. This enforces curriculum structure and prevents cross-domain assignments.
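
A minimal sketch of this filtering step (the names `domain_tokens` and `all_standards` are illustrative, not our exact identifiers):

```python
def candidate_standards(domain_tokens, all_standards):
    """Keep only standards whose full codes contain one of the
    concept block's domain-level tokens (e.g., 'NS.A' matches '8.NS.A.1')."""
    return [
        code for code in all_standards
        if any(token in code for token in domain_tokens)
    ]

# A block tagged NS.A never receives a geometry (G) standard:
candidate_standards(["NS.A"], ["8.NS.A.1", "8.NS.A.2", "8.G.B.7"])
# -> ['8.NS.A.1', '8.NS.A.2']
```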

Text Extraction

For each textbook item, a text representation is constructed:

  • Primary source: If a URL is available, the script fetches the HTML page and extracts the section content under the heading corresponding to type + number (e.g., Section 9.6, Example 9.24).
  • Fallback source: If HTML extraction is unavailable or fails, structured JSON fields (book, concept, domain, text, description) are used instead.
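
A sketch of how this two-step extraction can work; the field names (`url`, `type`, `number`, etc.) and the heading-matching logic are simplified assumptions, not our exact script:

```python
import requests
from bs4 import BeautifulSoup

def extract_item_text(item):
    """Primary: scrape the section under the matching heading.
    Fallback: concatenate structured JSON fields."""
    if item.get("url"):
        try:
            html = requests.get(item["url"], timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            target = f'{item["type"]} {item["number"]}'  # e.g., "Example 9.24"
            heading = soup.find(
                lambda tag: tag.name in ("h1", "h2", "h3") and target in tag.get_text()
            )
            if heading:
                # Collect sibling text until the next heading of the same level.
                parts = []
                for sib in heading.find_next_siblings():
                    if sib.name == heading.name:
                        break
                    parts.append(sib.get_text(" ", strip=True))
                if parts:
                    return " ".join(parts)
        except requests.RequestException:
            pass  # fetch failed -> fall through to the JSON fallback
    fields = ("book", "concept", "domain", "text", "description")
    return " ".join(str(item[f]) for f in fields if item.get(f))
```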

Semantic Embedding

  • Standard definitions and textbook item text are embedded using Sentence-BERT.
  • Embeddings are L2-normalized to enable cosine similarity via dot product.
  • Standard embeddings are cached per domain token set for efficiency.
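
In sentence-transformers terms, this step looks roughly like the sketch below (the checkpoint name is an assumption; any SBERT variant works the same way):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
_cache = {}  # standard embeddings, keyed by the set of domain tokens

def embed_standards(domain_tokens, standard_texts):
    """Embed candidate standard definitions once per domain-token set.
    normalize_embeddings=True L2-normalizes the vectors, so a dot
    product later is exactly cosine similarity."""
    key = frozenset(domain_tokens)
    if key not in _cache:
        _cache[key] = model.encode(standard_texts, normalize_embeddings=True)
    return _cache[key]
```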

Similarity-Based Assignment

  • Cosine similarity is computed between the item embedding and all candidate standard embeddings. The highest-scoring standard is selected.
  • If the maximum similarity is below a configurable threshold, no standard is assigned.
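
Putting the pieces together, a sketch of the assignment step (the 0.35 threshold is illustrative; ours is configurable):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint, as above

def assign_standard(item_text, candidate_codes, candidate_embs, threshold=0.35):
    """Return the best-matching standard code, or None to abstain.
    candidate_embs: L2-normalized rows aligned with candidate_codes."""
    item_emb = model.encode(item_text, normalize_embeddings=True)
    scores = candidate_embs @ item_emb  # dot product == cosine on unit vectors
    best = int(np.argmax(scores))
    return candidate_codes[best] if scores[best] >= threshold else None
```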

Evaluation

If ground-truth labels are present in testing.json, the script reports:

  • Top-1 accuracy
  • Number of evaluated items
  • Number of correct predictions
  • Missing labels and predictions

Evaluation is performed strictly for reporting purposes and does not influence inference.
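
The reporting logic reduces to a few lines; the dict shapes here ({item_id: standard_code}) are an assumption about the script's internals:

```python
def report_top1(predictions, labels):
    """predictions, labels: {item_id: standard_code} mappings."""
    evaluated = [k for k in labels if predictions.get(k) is not None]
    correct = sum(predictions[k] == labels[k] for k in evaluated)
    print(f"Evaluated items:     {len(evaluated)}")
    print(f"Correct predictions: {correct}")
    print(f"Top-1 accuracy:      {correct / max(len(evaluated), 1):.3f}")
    print(f"Missing predictions: {len(labels) - len(evaluated)}")
```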

Challenges we ran into

  • HTML structure differences may affect section extraction quality.
  • Some standards are semantically close and difficult to distinguish using text alone.
  • Threshold selection involves a trade-off between precision and coverage.

Accomplishments that we're proud of

  • Cracked the zero-shot problem: Maintained accuracy even though 44% of standards had no training examples
  • URL scraping: Two-step approach using URL first, then JSON fallback, giving clear accuracy gains
  • Domain-aware filtering: Stops nonsensical predictions, like applying a geometry standard to algebra content
  • Kept it simple: With 98.7% of sections being single-label, we chose a straightforward approach instead of over-engineering

What we learned

At first, we built a model that scored 0% accuracy. That’s when we realized we needed a completely new approach, and honestly, it was a long journey to get here!

What's next for SemanticMatching

**Addressing our limitations:**

  • Better HTML parsing: using multiple HTML extraction strategies
  • Clearer distinctions: applying hierarchical modeling and multimodal signals to separate closely related standards
  • Predictive model on real data: Training a supervised model with TF-IDF features once a much larger labeled dataset is available

Hopefully, with this project:

  • More schools get access to the same quality of textbooks
  • Textbook updates can be rolled out much more quickly
  • Standardized learning resources are available to more students, regardless of location

**BGM: Chopin Nocturne Op. 9 No. 2**
