Inspiration

PDFs store important information, but they are difficult to search effectively. Traditional search depends on exact keyword matching, which fails when you remember a concept but not the exact wording.

We built Schema to enable semantic search over documents so users can retrieve information based on meaning instead of exact strings.

What Schema Does

Schema is a semantic document storage and retrieval platform for PDFs.

Users can:

  • Create and manage collections
  • Upload PDFs into collections
  • Rename PDFs without changing physical storage paths
  • Mark whether a PDF contains handwritten content
  • Track indexing status
  • Perform:
    • Global search across all collections
    • Collection-level search for higher precision

Search results return:

  • PDF name
  • Collection name
  • Page number
  • Relevant text snippet
  • Similarity score
  • Direct navigation to the correct page

How We Built Schema

Tech Stack

  • Frontend: React + Vite + TypeScript
  • Backend: FastAPI
  • Relational Database: Supabase (Postgres)
  • Vector Database: Actian VectorAI
  • Embedding Model: BAAI/bge-large-en-v1.5

Architecture

Schema uses three storage layers:

  1. Relational database for metadata
  2. File storage for raw PDFs using stable internal IDs
  3. Vector database for chunk embeddings and semantic retrieval
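
A minimal sketch of why stable internal IDs let a PDF be renamed without moving its file. The names here (PdfRecord, storage_path, rename_pdf) and the use of a UUID as the internal ID are assumptions for illustration, not Schema's actual code.

```python
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PdfRecord:
    """Hypothetical metadata row kept in the relational database."""
    pdf_id: str          # stable internal ID, never shown to the user
    collection_id: str
    display_name: str    # user-facing name, free to change


def new_pdf_record(collection_id: str, display_name: str) -> PdfRecord:
    # Assumption: a random UUID serves as the stable internal ID.
    return PdfRecord(uuid.uuid4().hex, collection_id, display_name)


def storage_path(root: Path, pdf_id: str) -> Path:
    # The path depends only on the internal ID, never on the display name.
    return root / f"{pdf_id}.pdf"


def rename_pdf(record: PdfRecord, new_name: str) -> None:
    # Renaming touches metadata only; the stored file stays where it is.
    record.display_name = new_name
```

Because storage_path never reads display_name, a rename is a single metadata update, and the chunk embeddings keyed by pdf_id stay valid.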

Ingestion Pipeline

When a PDF is uploaded:

  1. Store the file using a stable internal ID
  2. Extract text with PyMuPDF, or with TrOCR for PDFs marked as handwritten
  3. Split text into moderate-sized chunks while preserving page numbers
  4. Generate embeddings for each chunk
  5. Store embeddings with metadata for scoped retrieval

Each chunk links to:

  • pdf_id
  • collection_id
  • user_id
  • page_number
  • chunk_index
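
The chunking step can be sketched as follows. This is an illustration under stated assumptions: word-based windows, the max_words and overlap defaults, and a chunk_index that runs across the whole document are all our guesses, since the writeup only says chunks are moderate-sized and keep their page numbers. Text extraction and embedding are left out; pages is assumed to be a list of already extracted page texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Metadata fields mirror the list above; text is the chunk content.
    pdf_id: str
    collection_id: str
    user_id: str
    page_number: int   # 1-based page the chunk came from
    chunk_index: int   # assumption: running index across the document
    text: str


def chunk_pages(pages, pdf_id, collection_id, user_id,
                max_words=200, overlap=40):
    """Split each page into overlapping word windows, never crossing pages."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(Chunk(pdf_id, collection_id, user_id,
                                page_number, len(chunks), " ".join(window)))
            if start + max_words >= len(words):
                break  # this window already reached the end of the page
    return chunks
```

Keeping every chunk inside a single page is what lets a search hit point at one exact page number, at the cost of splitting passages that run across a page break.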

Search Pipeline

When a user submits a query:

  1. Convert the query into an embedding
  2. Perform nearest-neighbor search in vector space
  3. Scope retrieval by user and optionally by collection
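
The steps above can be sketched with a brute-force scan standing in for the vector database. This is not the Actian VectorAI API; IndexedChunk, search, and the choice of cosine similarity as the score are assumptions for illustration. The sketch applies the user and collection filter before ranking, so that chunks from other users can never occupy a top-k slot; whether the real system filters before or after the nearest-neighbor step is not stated here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class IndexedChunk:
    """Hypothetical stored embedding with its scoping metadata."""
    user_id: str
    collection_id: str
    pdf_id: str
    page_number: int
    vector: tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: Sequence[float], index: Sequence[IndexedChunk],
           user_id: str, collection_id: Optional[str] = None, top_k: int = 5):
    """Return up to top_k (score, chunk) pairs, best match first."""
    # Scope first: always by user, and by collection when one is given.
    candidates = [c for c in index
                  if c.user_id == user_id
                  and (collection_id is None or c.collection_id == collection_id)]
    scored = [(cosine_similarity(query_vector, c.vector), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Leaving collection_id as None gives a global search across the user's collections; passing it narrows the same query to one collection.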

Challenges

  • Choosing chunk sizes that balance precision and context
  • Supporting both global and collection-level search safely
  • Finding an embedding model that was both effective and realistic to host within our infrastructure

What We Learned

  • Embeddings alone do not guarantee strong search quality.
  • Chunking strategy and metadata design significantly affect retrieval performance.
  • Vector search operates as geometric nearest-neighbor search in high-dimensional space.
  • Clean architectural separation improves extensibility and maintainability.

What's Next

  • Hybrid keyword + semantic search
  • Improved result grouping
  • Support for additional document types
  • Summaries and question-answering over collections
