Paste your source text
Choose a chunking strategy
Visualise the chunks
Choose the kind of embedding
Visualise the embeddings
Visualise embeddings and their relationships
Select a distance metric
Select a reranking strategy
View the retrieved chunks
View the differences caused by reranking
Choose a system prompt
View the LLM output
Analyse the response
Use an LLM as judge

RAG Test Suite - Interactive RAG Pipeline Testing Platform

Inspiration

The RAG Pipeline Educator was born from the need of developers, researchers, and students to test the many controls and knobs of Retrieval-Augmented Generation (RAG) systems. Chunking, embedding, retrieval, and generation each expose many options, and the choices made at one stage affect the others. We wanted an interactive platform where users can experiment with every knob and dial in a RAG pipeline, making the complex simple through visualisation and hands-on experimentation.

What it does

The RAG Test Suite is a comprehensive testing and educational platform that provides:

🔧 Interactive RAG Component Testing

Chunking Playground: Test different text segmentation strategies (fixed-size, hierarchical, sentence-based, paragraph-based) with real-time parameter adjustment and visual feedback
Embedding Laboratory: Visualise how text transforms into vectors using real AWS Bedrock models (Titan, Cohere) with 2D/3D scatter plots showing semantic relationships
Query Engine: Experiment with multiple similarity metrics (cosine, euclidean, manhattan, dot product) and advanced reranking strategies (cross-encoder, BM25 hybrid, LLM-based)
Generation Studio: See how retrieved context affects LLM responses with prompt construction visualization and guardrail monitoring
Automated Evaluation Pipeline: Tests the RAG pipeline and measures faithfulness, answer relevancy, context precision, and context recall
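
To illustrate why the choice of similarity metric in the Query Engine matters, here is a minimal sketch in plain Python. It is not the project's code: the function names, the two-dimensional vectors, and the chunk labels "A" and "B" are made up for illustration. It shows that dot product rewards vector length, while cosine similarity and the two distance metrics do not, so the same query can rank the same chunks differently.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product of the two vectors after normalising each to unit length.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def rank(query, chunks, metric):
    """Return chunk names ordered from best to worst match for one metric."""
    scorers = {
        # (scoring function, True if a higher score means a better match)
        "cosine": (cosine_similarity, True),
        "dot": (dot, True),
        "euclidean": (euclidean_distance, False),
        "manhattan": (manhattan_distance, False),
    }
    score, higher_is_better = scorers[metric]
    return sorted(chunks, key=lambda name: score(query, chunks[name]),
                  reverse=higher_is_better)

# Hypothetical toy embeddings, chosen so the metrics disagree.
query = [1.0, 0.0]
chunks = {
    "A": [0.9, 0.1],  # short vector, nearly parallel to the query
    "B": [3.0, 3.0],  # long vector at 45 degrees to the query
}

for metric in ("cosine", "dot", "euclidean", "manhattan"):
    print(metric, rank(query, chunks, metric))
```

With these toy vectors, cosine, euclidean, and manhattan all place "A" first, while dot product places "B" first because of its larger magnitude. If embeddings are normalised to unit length, cosine and dot product produce the same ordering.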

How we built it

Frontend Architecture (React + TypeScript)

Modular Component System: Each RAG stage has its own interactive component with real-time parameter controls
D3.js Visualizations: Custom interactive visualizations for embedding spaces, similarity networks, and hierarchical chunk relationships
Real-time State Management: Zustand for seamless data flow between components with instant visual feedback

Python Backend (FastAPI + LangChain)

AWS Bedrock Integration: Direct integration with production-grade embedding models (Titan, Cohere) and LLMs (Claude 3, Titan Text)
Advanced Chunking: LangChain's RecursiveCharacterTextSplitter with hierarchical parent-child relationships
FAISS Vector Search: High-performance similarity search with multiple distance metrics
Guardrails & Safety: Content validation, token limit monitoring, and relevance scoring
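
As a rough illustration of hierarchical parent-child chunking, the sketch below uses plain Python with fixed-size character windows. It deliberately does not use LangChain's RecursiveCharacterTextSplitter, and the `Chunk` dataclass, the id scheme, and the function names are assumptions made for this example rather than the project's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a top-level (parent) chunk

def split_fixed(text: str, size: int, overlap: int = 0) -> List[str]:
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return pieces

def hierarchical_chunks(text: str, parent_size: int, child_size: int,
                        child_overlap: int = 0) -> List[Chunk]:
    """Split text into large parents, then split each parent into children."""
    chunks: List[Chunk] = []
    for p, parent_text in enumerate(split_fixed(text, parent_size)):
        parent_id = f"p{p}"
        chunks.append(Chunk(parent_id, parent_text))
        children = split_fixed(parent_text, child_size, child_overlap)
        for c, child_text in enumerate(children):
            chunks.append(Chunk(f"{parent_id}.c{c}", child_text, parent_id))
    return chunks

demo = hierarchical_chunks("abcdefghijklmnopqrst", parent_size=10,
                           child_size=4, child_overlap=2)
for chunk in demo:
    print(chunk.chunk_id, repr(chunk.text), chunk.parent_id)
```

One common reason to keep the `parent_id` link is to embed and search the small child chunks, then hand the larger parent text to the LLM so that it sees more surrounding context. Whether the platform does exactly this is not stated here.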

Agentic RAGAS System

LLM-Powered Code Generation: Uses Claude 3 to interpret natural language evaluation requirements
RAGAS Framework Integration: Automatically generates evaluation pipelines using the RAGAS library
Synthetic Data Pipeline: Creates realistic test datasets with controlled variations and difficulty levels
Evaluation Orchestration: Manages end-to-end evaluation workflows from data generation to metric calculation
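
To give a feel for one of the retrieval metrics involved, the sketch below computes context precision in the rank-weighted form described in the RAGAS documentation: the mean of precision@k over the positions k that hold a relevant chunk. It is a simplified, hypothetical helper, not RAGAS code. In RAGAS the relevance verdicts come from an LLM judge, whereas here they are passed in as booleans.

```python
from typing import Sequence

def context_precision(relevance: Sequence[bool]) -> float:
    """Mean of precision@k over the ranks k where the chunk is relevant.

    `relevance[i]` says whether the chunk retrieved at rank i + 1 was judged
    relevant. Returns 0.0 when no retrieved chunk is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# Relevant chunks ranked early score higher than the same chunks ranked late.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
```

The metric rewards rankings that put relevant chunks first, which is why it responds to changes in the distance metric or reranking strategy even when the set of retrieved chunks stays the same.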

Challenges we ran into

Real-time Performance Optimization

AWS Bedrock Rate Limits: Implemented intelligent caching and request batching to handle real-time parameter adjustments without hitting API limits
Visualization Performance: Large embedding spaces with hundreds of points required Canvas-based rendering and level-of-detail optimization
Memory Management: Frequent re-computations for parameter changes needed careful cleanup and efficient data structure reuse
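
The caching and batching idea can be sketched as follows. This is an assumption-laden illustration, not the project's implementation: `EmbeddingCache`, `embed_batch`, and `fake_embed_batch` are hypothetical names, and the fake backend stands in for whatever function actually calls the embedding model. Vectors are cached under a hash of the model id and the chunk text, and only cache misses are sent to the backend, in batches.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]

class EmbeddingCache:
    """Cache embeddings per (model, text) and batch the cache misses."""

    def __init__(self, embed_batch: Callable[[str, Sequence[str]], List[Vector]],
                 batch_size: int = 16):
        self._embed_batch = embed_batch  # placeholder for the real model call
        self._batch_size = batch_size
        self._store: Dict[str, Vector] = {}
        self.backend_calls = 0  # counts requests actually sent to the backend

    @staticmethod
    def _key(model_id: str, text: str) -> str:
        return hashlib.sha256(f"{model_id}\x00{text}".encode("utf-8")).hexdigest()

    def embed(self, model_id: str, texts: Sequence[str]) -> List[Vector]:
        keys = [self._key(model_id, t) for t in texts]
        # Collect unique cache misses in first-seen order.
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store and key not in missing:
                missing[key] = text
        miss_keys = list(missing)
        for start in range(0, len(miss_keys), self._batch_size):
            batch_keys = miss_keys[start:start + self._batch_size]
            vectors = self._embed_batch(model_id, [missing[k] for k in batch_keys])
            self.backend_calls += 1
            for key, vector in zip(batch_keys, vectors):
                self._store[key] = vector
        return [self._store[k] for k in keys]

def fake_embed_batch(model_id: str, texts: Sequence[str]) -> List[Vector]:
    # Stand-in backend for the demo: NOT a real embedding model.
    return [[float(len(t)), float(t.count(" "))] for t in texts]

cache = EmbeddingCache(fake_embed_batch, batch_size=2)
first = cache.embed("model-a", ["a b", "cd", "a b"])  # two unique misses, one call
second = cache.embed("model-a", ["cd", "efg"])        # only "efg" is a miss
print(first, second, cache.backend_calls)
```

Because the key depends only on the model and the chunk text, changing a downstream setting such as the distance metric or the reranker reuses every cached vector, while changing the chunking parameters only re-embeds chunks whose text actually changed.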

Educational UX Complexity

Cognitive Load: Balancing comprehensive functionality with intuitive user experience - solved through progressive disclosure and guided tours
Visual Clarity: Making complex mathematical concepts (vector similarities, dimensionality reduction) visually understandable through interactive scatter plots and connection networks

Agentic System Reliability

Code Generation Accuracy: Ensuring generated RAGAS code is syntactically correct and semantically meaningful required extensive prompt engineering and validation
Evaluation Consistency: Making synthetic datasets realistic while maintaining evaluation reliability across different domains and use cases

Accomplishments that we're proud of

🎯 Educational Impact

Interactive Learning: Created the first comprehensive visual RAG education platform where users can see every step of the pipeline in action
Real Production Models: Integration with actual AWS Bedrock models means users learn with the same tools they'll use in production
Hierarchical Visualization: Pioneered interactive parent-child chunk relationship visualization that makes complex document structures intuitive

🚀 Technical Innovation

Multi-Modal Visualization: Seamless integration of text processing, vector mathematics, and interactive graphics in a single coherent interface
Production-Ready Architecture: Built with scalability in mind - can handle concurrent users and large document processing

🔬 Research Contribution

Evaluation Democratization: Made advanced RAG evaluation accessible to developers without deep ML expertise
Parameter Exploration: Enabled systematic exploration of RAG parameter spaces that was previously manual and time-consuming

What we learned

🧠 RAG System Complexity

Parameter Interdependence: Discovered how chunking strategies dramatically affect downstream embedding quality and retrieval accuracy
Evaluation Challenges: Learned that traditional metrics often miss nuanced quality aspects that only human evaluation or sophisticated synthetic datasets can capture
Model Behavior: Gained deep insights into how different embedding models (Titan vs Cohere) perform across various text types and domains

🎨 Educational Technology

Visualization Psychology: Learned that interactive exploration is far more effective than static explanations for complex technical concepts
Progressive Complexity: Discovered the importance of layered learning - starting simple and gradually revealing complexity as users gain confidence

🤖 Agentic System Design

Prompt Engineering: Mastered the art of creating prompts that generate reliable, executable code from natural language descriptions
Evaluation Framework Integration: Learned to seamlessly integrate multiple evaluation frameworks (RAGAS, custom metrics) into a unified system

What's next for RAG Test Suite

🔮 Advanced Agentic Capabilities

Multi-Agent Evaluation: Deploy specialized agents for different evaluation aspects (factual accuracy, coherence, relevance) that collaborate on comprehensive assessments
Adaptive Learning: Implement reinforcement learning to improve synthetic dataset generation based on evaluation results and user feedback
Domain Specialization: Create domain-specific evaluation agents (medical, legal, technical) with specialized knowledge and evaluation criteria

🌐 Platform Expansion

Collaborative Features: Multi-user workspaces where teams can share configurations, compare results, and collaborate on RAG system optimization
Integration Ecosystem: APIs and plugins for popular RAG frameworks (LlamaIndex, Haystack, LangChain) to bring the testing capabilities into existing workflows
Cloud Deployment: Fully managed SaaS version with enterprise features, team management, and advanced analytics

📊 Advanced Analytics

Performance Benchmarking: Comprehensive benchmarking suite comparing different RAG configurations across standardized datasets
Cost Optimization: Intelligent recommendations for balancing performance, accuracy, and cost based on use case requirements
A/B Testing Framework: Built-in experimentation platform for systematic RAG system optimization

🎓 Educational Evolution

Certification Program: Structured learning paths with assessments and certifications for RAG system design and optimization
Research Integration: Direct integration with academic research, allowing researchers to test new methods and share results with the community
Industry Case Studies: Real-world case studies and best practices from successful RAG implementations across different industries

Key Features Summary

Interactive RAG Pipeline Components

Chunking Module
- Multiple strategies: Fixed-size, Hierarchical, Sentence-based, Paragraph-based
- Real-time parameter adjustment
- Visual chunk boundary highlighting
- Parent-child relationship visualization
Embedding Module
- AWS Bedrock integration (Titan, Cohere models)
- Vector space visualization in 2D
- Interactive similarity exploration
- Dimensionality reduction (t-SNE, PCA)
Retrieval Module
- Multiple similarity metrics
- Advanced reranking strategies
- Real-time query processing
- Visual result highlighting
Generation Module
- Prompt construction visualization
- Context window management
- Guardrail monitoring
- Response quality analysis
Evaluation Module
- LLM-as-judge evaluation at the end of the pipeline
- Explanation for each evaluation result
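
The Generation Module's prompt construction and context window management could look roughly like the sketch below. Everything in it is an assumption made for illustration: the whitespace-based `estimate_tokens` is a crude stand-in for a real tokenizer, and `pack_context`, `build_prompt`, and the prompt layout are hypothetical rather than the platform's actual format.

```python
from typing import List, Tuple

def estimate_tokens(text: str) -> int:
    """Crude whitespace-based estimate; a real tokenizer will give other counts."""
    return len(text.split())

def pack_context(ranked_chunks: List[str], budget: int) -> Tuple[List[str], List[str]]:
    """Greedily keep chunks, in rank order, while they fit the token budget."""
    kept: List[str] = []
    dropped: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)  # too large for the remaining budget
    return kept, dropped

def build_prompt(system_prompt: str, question: str, chunks: List[str]) -> str:
    """Assemble system prompt, numbered context chunks, and the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

ranked = ["one two three", "four five six seven", "eight"]
kept, dropped = pack_context(ranked, budget=4)
prompt = build_prompt("Answer using only the context.", "What comes last?", kept)
print(prompt)
print("dropped:", dropped)
```

This version skips a chunk that does not fit and keeps trying lower-ranked ones, which fills the window more fully but can break the rank order of what the model sees. Stopping at the first chunk that does not fit is the stricter alternative. Returning the dropped chunks makes it possible to show the user what was left out of the prompt.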

The RAG Test Suite represents a paradigm shift from black-box RAG development to transparent, interactive, and scientifically rigorous RAG system design and evaluation.

Updates

Haider Ali started this project — Oct 20, 2025 11:29 PM EDT