Inspiration
PDFs store important information, but they are difficult to search effectively. Traditional search depends on exact keyword matching, which fails when you remember a concept but not the exact wording.
We built Schema to enable semantic search over documents so users can retrieve information based on meaning instead of exact strings.
What Schema Does
Schema is a semantic document storage and retrieval platform for PDFs.
Users can:
- Create and manage collections
- Upload PDFs into collections
- Rename PDFs without changing physical storage paths
- Mark whether a PDF contains handwritten content
- Track indexing status
- Search globally across all collections
- Search within a single collection for higher precision
Search results return:
- PDF name
- Collection name
- Page number
- Relevant text snippet
- Similarity score
- Direct navigation to the correct page
How We Built Schema
Tech Stack
- Frontend: React + Vite + TypeScript
- Backend: FastAPI
- Relational Database: Supabase (Postgres)
- Vector Database: Actian VectorAI
- Embedding Model: BAAI/bge-large-en-v1.5
Architecture
Schema uses three storage layers:
- Relational database for metadata
- File storage for raw PDFs using stable internal IDs
- Vector database for chunk embeddings and semantic retrieval
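To illustrate why stable internal IDs matter, here is a minimal sketch of how a rename can stay a metadata-only operation. The `PdfRecord` class, its field names, and the `pdfs/<id>.pdf` path layout are assumptions made for illustration, not Schema's actual tables or storage layout.

```python
import uuid
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class PdfRecord:
    """Metadata row for one PDF (hypothetical shape, not Schema's real table)."""
    collection_id: str
    user_id: str
    display_name: str
    is_handwritten: bool = False
    # Stable internal ID, generated once at upload time.
    pdf_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def storage_path(self) -> str:
        # The path depends only on the internal ID, never on the display name.
        return str(PurePosixPath("pdfs") / f"{self.pdf_id}.pdf")


def rename(record: PdfRecord, new_name: str) -> None:
    """Renaming touches metadata only; the stored file is not moved."""
    record.display_name = new_name


record = PdfRecord(collection_id="c1", user_id="u1", display_name="notes.pdf")
path_before = record.storage_path
rename(record, "Lecture 3 notes.pdf")
path_after = record.storage_path
```

Because `storage_path` is derived from `pdf_id` alone, `path_before` and `path_after` are identical, and any chunk that references the same `pdf_id` remains valid after the rename.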
Ingestion Pipeline
When a PDF is uploaded:
- Store the file using a stable internal ID
- Extract text (PyMuPDF by default, TrOCR for PDFs marked as handwritten)
- Split text into moderate-sized chunks while preserving page numbers
- Generate embeddings for each chunk
- Store embeddings with metadata for scoped retrieval
Each chunk links to:
- pdf_id
- collection_id
- user_id
- page_number
- chunk_index
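The chunking step can be sketched roughly as follows. This is one possible approach, assuming text has already been extracted into one string per page; the `Chunk` class, the `chunk_pages` helper, and the `max_words` and `overlap` defaults are illustrative assumptions, not the values Schema uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    pdf_id: str
    collection_id: str
    user_id: str
    page_number: int  # 1-based page the text came from
    chunk_index: int  # position of the chunk within the whole PDF
    text: str


def chunk_pages(pages, pdf_id, collection_id, user_id, max_words=200, overlap=40):
    """Split per-page text into overlapping word windows.

    Chunks never cross a page boundary, so each one maps to exactly one page.
    max_words and overlap are illustrative values, not Schema's settings.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    step = max_words - overlap
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(
                Chunk(pdf_id, collection_id, user_id,
                      page_number, len(chunks), " ".join(window))
            )
            # Stop once this window has reached the end of the page.
            if start + max_words >= len(words):
                break
    return chunks


demo = chunk_pages(["a b c d e", "", "f g"], "pdf-1", "col-1", "user-1",
                   max_words=4, overlap=1)
```

Each chunk's `text` would then be passed to the embedding model, and the remaining fields stored next to the vector so that a search hit can be traced back to its PDF, collection, and page.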
Search Pipeline
When a user submits a query:
- Convert the query into an embedding
- Perform nearest-neighbor search in vector space
- Scope retrieval by user and optionally by collection
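The search steps above can be sketched as a brute-force scan in plain Python. This is only an illustration: a vector database would normally apply the metadata filter and the nearest-neighbor lookup inside its own index, and the use of cosine similarity, the `IndexedChunk` class, and the `search` function are assumptions rather than Schema's actual code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    user_id: str
    collection_id: str
    pdf_id: str
    page_number: int
    text: str
    embedding: tuple  # vector produced by the embedding model


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_embedding, user_id, collection_id=None, top_k=5):
    """Return up to top_k (score, chunk) pairs, best match first."""
    # Scope first: always by user, and by collection only when one is given.
    candidates = [
        c for c in index
        if c.user_id == user_id
        and (collection_id is None or c.collection_id == collection_id)
    ]
    scored = [(cosine_similarity(query_embedding, c.embedding), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


index = [
    IndexedChunk("u1", "c1", "p1", 3, "gradient descent notes", (1.0, 0.0)),
    IndexedChunk("u1", "c2", "p2", 7, "unrelated recipe", (0.0, 1.0)),
    IndexedChunk("u2", "c1", "p3", 1, "another user's file", (1.0, 0.0)),
]
global_hits = search(index, (1.0, 0.0), user_id="u1")
scoped_hits = search(index, (1.0, 0.0), user_id="u1", collection_id="c2")
```

Filtering on `user_id` before scoring means another user's chunks are never candidates, even when their vectors match the query exactly. Passing `collection_id` narrows the same query to a single collection.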
Challenges
- Choosing chunk sizes that balance precision and context
- Supporting both global and collection-level search safely
- Finding an embedding model that was both effective and practical to host on our infrastructure
What We Learned
- Embeddings alone do not guarantee strong search quality.
- Chunking strategy and metadata design significantly affect retrieval performance.
- Vector search operates as geometric nearest-neighbor search in high-dimensional space.
- Clean architectural separation improves extensibility and maintainability.
What's Next
- Hybrid keyword + semantic search
- Improved result grouping
- Support for additional document types
- Summaries and question-answering over collections
Built With
- actiandb
- baai/bge-large-en-v1.5
- fastapi
- postgresql
- react
- supabase
- trocr
- typescript
- vite