About the project
What Inspired Me
During my time as a Software Engineer Intern at Capillary Technologies working on AI-powered conversational bots, I encountered a recurring bottleneck in Retrieval-Augmented Generation (RAG) pipelines: context poisoning. The generative models rarely lacked reasoning capability; rather, they hallucinated because the underlying search infrastructure fed them peripheral or irrelevant context.
Traditional BM25 lexical search provides excellent precision for exact keyword matching but completely fails to grasp semantic intent. Conversely, pure vector search using bi-encoders understands broad concepts but struggles with exact IDs and suffers from vocabulary mismatches. The Elastic Blogathon 2026 theme, Vectorized Thinking, inspired me to tackle this gap: I wanted to build an architecture that bridges raw data retrieval and human-like contextual understanding, a two-stage semantic reranking engine designed to put the single most relevant document at position #1.
How I Built the Project
I built a full-stack, AI-powered bookstore catalog MVP using the CMU Book Summary Dataset (containing 16,559 Wikipedia book plots). The architecture operates entirely on Elasticsearch 8.17+ and integrates natively with Jina AI.
Data Ingestion & Vectorization: I engineered an Elasticsearch ingest pipeline utilizing the open Inference API. As the CMU dataset streams in, the `jina-embeddings-v3` model automatically converts the `plot_summary` field into 1024-dimensional dense vectors stored via an HNSW graph.
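As a rough sketch of that setup, the following Python (using the official `elasticsearch` client) creates a Jina AI inference endpoint, an ingest pipeline with an inference processor, and the `dense_vector` mapping. The endpoint IDs, index name, and connection URL are placeholders, and exact client signatures may vary across 8.x releases:

```python
from elasticsearch import Elasticsearch

# Placeholder connection details; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

# 1. Inference endpoint backed by Jina AI (IDs here are illustrative).
es.inference.put(
    task_type="text_embedding",
    inference_id="jina-embeddings",
    inference_config={
        "service": "jinaai",
        "service_settings": {
            "api_key": "<JINA_API_KEY>",
            "model_id": "jina-embeddings-v3",
        },
    },
)

# 2. Ingest pipeline: embed plot_summary into plot_vector at index time.
es.ingest.put_pipeline(
    id="book-embeddings",
    processors=[
        {
            "inference": {
                "model_id": "jina-embeddings",
                "input_output": {
                    "input_field": "plot_summary",
                    "output_field": "plot_vector",
                },
            }
        }
    ],
)

# 3. Index mapping: 1024-dim dense vectors, HNSW-indexed, cosine similarity.
es.indices.create(
    index="books",
    settings={"default_pipeline": "book-embeddings"},
    mappings={
        "properties": {
            "title": {"type": "text"},
            "plot_summary": {"type": "text"},
            "plot_vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)
```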
Stage 1 - Hybrid Retrieval: When a user submits a query from the React (Vite) frontend, the FastAPI middle tier routes it to Elasticsearch. The system first executes a hybrid search, running BM25 and kNN concurrently, and fuses the two signals using Reciprocal Rank Fusion (RRF), defined mathematically as:

$$\mathrm{RRF}(d) = \frac{1}{k + \mathrm{rank}_{\mathrm{BM25}}(d)} + \frac{1}{k + \mathrm{rank}_{\mathrm{kNN}}(d)}$$

where $k$ is a rank constant (typically 60). This broad net pulls in the top 100 candidate books.
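As a rough sketch (reusing the `es` client from the ingestion sketch; the query text and parameter values are illustrative), the server-side `rrf` retriever expresses the whole Stage 1 fusion in one request:

```python
query = "a dystopian novel about surveillance and censorship"  # example user query

# Stage 1: hybrid retrieval, fusing BM25 and kNN with RRF server-side.
hybrid_retriever = {
    "rrf": {
        "retrievers": [
            # Lexical leg: classic BM25 over the raw plot text.
            {"standard": {"query": {"match": {"plot_summary": query}}}},
            # Semantic leg: kNN over the Jina embeddings, with the query
            # vectorized on the fly by the same inference endpoint.
            {
                "knn": {
                    "field": "plot_vector",
                    "query_vector_builder": {
                        "text_embedding": {
                            "model_id": "jina-embeddings",
                            "model_text": query,
                        }
                    },
                    "k": 100,
                    "num_candidates": 500,
                }
            },
        ],
        "rank_constant": 60,      # the k in the RRF formula above
        "rank_window_size": 100,  # the broad net of 100 candidates
    }
}

resp = es.search(index="books", retriever=hybrid_retriever, size=100)
```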
Stage 2 - Semantic Reranking: Instead of returning those 100 books to the user, the new `text_similarity_reranker` framework intercepts them. It ships the candidate set to the `jina-reranker-v3` cross-encoder. Unlike bi-encoders, this model computes causal self-attention across both the query and the document simultaneously within the same transformer window, assigning highly calibrated relevancy scores.

Final Output: The API returns the top 10 strictly reordered results to the React UI, surfacing books based on deep narrative intent rather than superficial keyword overlap.
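Continuing the sketch (reusing `es`, `query`, and `hybrid_retriever` from Stage 1, with illustrative endpoint IDs), Stage 2 wraps the hybrid retriever in a `text_similarity_reranker`:

```python
# Rerank inference endpoint backed by Jina's cross-encoder (illustrative ID).
es.inference.put(
    task_type="rerank",
    inference_id="jina-reranker",
    inference_config={
        "service": "jinaai",
        "service_settings": {
            "api_key": "<JINA_API_KEY>",
            "model_id": "jina-reranker-v3",
        },
    },
)

# Stage 2: the reranker intercepts the Stage 1 candidates before they
# ever reach the application layer.
resp = es.search(
    index="books",
    retriever={
        "text_similarity_reranker": {
            "retriever": hybrid_retriever,   # Stage 1 from the previous sketch
            "field": "plot_summary",         # text the cross-encoder reads
            "inference_id": "jina-reranker",
            "inference_text": query,
            "rank_window_size": 100,         # rerank only the top 100
        }
    },
    size=10,  # final, strictly reordered top 10 for the React UI
)
```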
Challenges I Faced
The Latency vs. Quality Tradeoff: Cross-encoders are computationally expensive because their complexity scales quadratically with sequence length. Initially, attempting to rerank the entire retrieval corpus resulted in unacceptable API latency. I overcame this by analyzing relevance saturation (the Pareto distribution) and implementing a strict Top-K threshold, proving that reranking only the top 30–100 documents yields $\approx 90\%$ of the maximum possible NDCG@10 gain while keeping latency well under budget.
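As a back-of-envelope illustration of why the Top-K cap matters: each candidate costs one cross-encoder forward pass whose attention scales with the square of the concatenated query-plus-document length $L$, so per-query rerank cost grows roughly as

$$\text{cost}_{\text{rerank}} \propto N_{\text{candidates}} \cdot L^{2}$$

and capping $N_{\text{candidates}}$ at 100 instead of the full 16,559-document corpus cuts the number of forward passes by a factor of $16{,}559 / 100 \approx 166$.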
Mapping and Normalization Contracts: While configuring the `dense_vector` fields, I encountered scoring anomalies. I learned that mixing vector similarity metrics requires strict adherence to mathematical contracts. Because I opted to use `cosine` similarity to measure the angular distance between vectors, I had to ensure all generated query and document vectors were strictly $L_2$-normalized ($||v||_2 = 1$); see the normalization sketch after the next item.

Architectural Complexity: Orchestrating a two-stage retrieval pipeline traditionally requires heavy application-side logic to merge arrays and sort scores. I mitigated this challenge by adopting Elasticsearch 8.16+'s new unified `retriever` API framework, which allowed me to compress the entire BM25, kNN, and external Jina AI inference flow into a single, elegant JSON query block.
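A minimal numpy sketch of that normalization contract (the helper name is my own):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale v to unit L2 norm so cosine scoring reduces to a pure dot product."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return v / norm

vec = np.random.rand(1024).astype(np.float32)  # stand-in for a Jina embedding
unit = l2_normalize(vec)
assert np.isclose(np.linalg.norm(unit), 1.0)   # the ||v||_2 = 1 contract
```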
What I Learned
The Power of Cross-Encoders: I learned exactly why bi-encoders leave a "precision gap" at the top of search results and how late-interaction models and cross-encoders calculate absolute semantic overlap to fix it.
Infrastructure Optimization: I gained deep insights into scaling Elasticsearch for vector workloads, such as tuning HNSW graph parameters (`m` and `ef_construction`), reducing replica counts to prevent cache contention, and leveraging shard-level request caching to protect the JVM heap from OutOfMemory exceptions during heavy inference loads (see the tuning sketch below).

Context Engineering: I realized that vectors are no longer just a niche machine learning feature; they are the foundational database infrastructure required for the future of autonomous Agentic AI. Searching is no longer just matching text; it's reasoning about intent.
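For context, here is a sketch of what that HNSW tuning looks like in a `dense_vector` mapping (the values are illustrative starting points, not my production settings):

```python
# Reusing the `es` client from the earlier sketches.
es.indices.create(
    index="books-tuned",
    settings={"number_of_replicas": 0},  # fewer replicas -> less cache contention
    mappings={
        "properties": {
            "plot_vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,                 # max graph connections per node
                    "ef_construction": 100,  # build-time candidate beam width
                },
            }
        }
    },
)
```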