SCHEMA

Inspiration

PDFs store important information, but they are difficult to search effectively. Traditional search depends on exact keyword matching, which fails when you remember a concept but not the exact wording.

We built Schema to enable semantic search over documents so users can retrieve information based on meaning instead of exact strings.

What Schema Does

Schema is a semantic document storage and retrieval platform for PDFs.

Users can:

Create and manage collections
Upload PDFs into collections
Rename PDFs without changing physical storage paths
Mark whether a PDF contains handwritten content
Track indexing status
Search globally across all collections, or within a single collection for higher precision

Search results return:

PDF name
Collection name
Page number
Relevant text snippet
Similarity score
Direct navigation to the correct page
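
As an illustration of the result fields listed above, here is a minimal sketch of one search hit. The class name `SearchResult`, the method `page_link`, and the example URL are hypothetical and not taken from the Schema codebase. The `#page=N` fragment is one common way to open a PDF at a given page in a browser viewer; Schema's own navigation may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One search hit, mirroring the fields listed above (names are assumptions)."""
    pdf_name: str
    collection_name: str
    page_number: int   # 1-based page that contains the snippet
    snippet: str
    score: float       # similarity score; assumed here that higher means closer

    def page_link(self, pdf_url: str) -> str:
        # Append a #page=N fragment so a viewer can jump to the matching page.
        return f"{pdf_url}#page={self.page_number}"


hit = SearchResult("Lecture Notes", "Biology", 12, "the mitochondria is...", 0.83)
link = hit.page_link("https://example.com/files/abc123.pdf")
```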

How We Built Schema

Tech Stack

Frontend: React + Vite + TypeScript
Backend: FastAPI
Relational Database: Supabase (Postgres)
Vector Database: Actian VectorAI
Embedding Model: BAAI/bge-large-en-v1.5

Architecture

Schema uses three storage layers:

Relational database for metadata
File storage for raw PDFs, keyed by stable internal IDs
Vector database for chunk embeddings and semantic retrieval
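
The sketch below shows one way a stable internal ID can decouple the user-facing name from the storage path, which is what makes renaming safe. `PdfRecord`, `storage_path`, `rename`, and the `pdfs/` prefix are hypothetical names chosen for illustration, and a UUID is assumed as the ID format.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PdfRecord:
    """Metadata row for one PDF (field names are assumptions)."""
    display_name: str
    # Generated once at upload time and never changed afterwards.
    pdf_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def storage_path(self) -> str:
        # The path depends only on the internal ID, never on the display name.
        return f"pdfs/{self.pdf_id}.pdf"


def rename(record: PdfRecord, new_name: str) -> None:
    """Change the user-facing name; the stored file does not move."""
    record.display_name = new_name


record = PdfRecord("draft.pdf")
path_before = record.storage_path
rename(record, "final-report.pdf")
```

Because embeddings and metadata reference `pdf_id` rather than the file name, a rename in this design would touch a single metadata field.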

Ingestion Pipeline

When a PDF is uploaded:

Store the file using a stable internal ID
Extract text with PyMuPDF, or with TrOCR when the PDF is marked as handwritten
Split the text into moderately sized chunks while preserving page numbers
Generate embeddings for each chunk
Store embeddings with metadata for scoped retrieval
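
The chunking step above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function name `chunk_pages`, the word-based window, and the default sizes are all hypothetical, and `chunk_index` is assumed to run across the whole PDF. The input is a list of per-page strings, which could come from PyMuPDF's `page.get_text()` or from an OCR step.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page_number: int   # 1-based page the text came from
    chunk_index: int   # running index across the whole PDF (assumption)
    text: str


def chunk_pages(pages: list[str], max_words: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split per-page text into overlapping word windows that never cross a page."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    step = max_words - overlap
    chunks: list[Chunk] = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(Chunk(page_number, len(chunks), " ".join(window)))
            if start + max_words >= len(words):
                break  # this window already reached the end of the page
    return chunks


demo = chunk_pages(["a b c d e", "", "f g"], max_words=3, overlap=1)
```

Keeping every chunk inside a single page makes the page number unambiguous, at the cost of splitting passages that span a page break.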

Each chunk links to:

pdf_id
collection_id
user_id
page_number
chunk_index
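
A minimal sketch of the metadata stored alongside each embedding, using the five fields above. The class name `ChunkMetadata`, the helper `to_payload`, and the choice of string IDs are assumptions; the real schema may use different types.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    pdf_id: str
    collection_id: str
    user_id: str
    page_number: int
    chunk_index: int


def to_payload(meta: ChunkMetadata) -> dict:
    """Flatten the metadata into a plain dict to store next to the vector."""
    return asdict(meta)


payload = to_payload(ChunkMetadata("pdf-1", "col-1", "user-1", 3, 7))
```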

Search Pipeline

When a user submits a query:

Convert the query into an embedding
Perform nearest-neighbor search in vector space
Restrict retrieval to the user's own chunks and, optionally, to a single collection
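
The search steps above can be sketched as a brute-force scan. This stands in for the vector database, which would use its own index and filtering; it is not the Actian VectorAI API. The function names, the in-memory `index` layout, and the use of cosine similarity are assumptions, and the query embedding is passed in rather than computed.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vec: list[float], index: list[tuple[list[float], dict]],
           user_id: str, collection_id: Optional[str] = None,
           top_k: int = 5) -> list[tuple[float, dict]]:
    """Return the top_k (score, metadata) pairs visible to this user."""
    # Filter first so another user's chunks are never scored at all.
    scoped = [
        (vec, meta) for vec, meta in index
        if meta["user_id"] == user_id
        and (collection_id is None or meta["collection_id"] == collection_id)
    ]
    scored = [(cosine_similarity(query_vec, vec), meta) for vec, meta in scoped]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


index = [
    ([1.0, 0.0], {"user_id": "u1", "collection_id": "c1", "page_number": 1}),
    ([0.0, 1.0], {"user_id": "u1", "collection_id": "c2", "page_number": 4}),
    ([1.0, 0.0], {"user_id": "u2", "collection_id": "c1", "page_number": 9}),
]
global_hits = search([1.0, 0.0], index, user_id="u1")
scoped_hits = search([1.0, 0.0], index, user_id="u1", collection_id="c2")
```

Global search omits `collection_id`; collection-level search passes it. In both cases the `user_id` filter is mandatory, which is one way to keep the two modes safe.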

Challenges

Choosing chunk sizes that balance precision and context
Supporting both global and collection-level search safely
Finding an embedding model that was effective yet practical to host on our infrastructure

What We Learned

Embeddings alone do not guarantee strong search quality.
Chunking strategy and metadata design significantly affect retrieval performance.
Vector search operates as geometric nearest-neighbor search in high-dimensional space.
Clean architectural separation improves extensibility and maintainability.
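
The geometric view of vector search can be made concrete with a short identity. Assuming cosine similarity as the metric and unit-length embeddings (both assumptions; the metric Schema actually uses is not stated here), ranking chunks by similarity to a query vector q is the same as ranking them by Euclidean distance:

```latex
\[
\operatorname{sim}(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert},
\qquad
\lVert q - c \rVert^2
  = \lVert q \rVert^2 + \lVert c \rVert^2 - 2\, q \cdot c
  = 2 - 2\,\operatorname{sim}(q, c)
\quad \text{when } \lVert q \rVert = \lVert c \rVert = 1 .
\]
```

Under those assumptions, the chunk with the highest similarity score is also the closest point to the query in the embedding space.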