RAG Test Suite - Interactive RAG Pipeline Testing Platform

Inspiration

The RAG Test Suite grew out of a need shared by developers, researchers, and students: a way to test the many controls of a Retrieval-Augmented Generation (RAG) system. Chunking, embedding, retrieval, and generation each offer a wide range of strategies and parameters. We wanted an interactive platform where users can experiment with every knob and dial in a RAG pipeline, making the complex simple through visualization and hands-on experimentation.

What it does

The RAG Test Suite is a comprehensive testing and educational platform that provides:

🔧 Interactive RAG Component Testing

  • Chunking Playground: Test different text segmentation strategies (fixed-size, hierarchical, sentence-based, paragraph-based) with real-time parameter adjustment and visual feedback
  • Embedding Laboratory: Visualize how text transforms into vectors using real AWS Bedrock models (Titan, Cohere) with 2D/3D scatter plots showing semantic relationships
  • Query Engine: Experiment with multiple similarity metrics (cosine, euclidean, manhattan, dot product) and advanced reranking strategies (cross-encoder, BM25 hybrid, LLM-based)
  • Generation Studio: See how retrieved context affects LLM responses with prompt construction visualization and guardrail monitoring
  • Automated Evaluation Pipeline: Evaluate the RAG pipeline on faithfulness, answer relevancy, context precision, and context recall
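
The similarity metrics listed for the Query Engine can rank the same documents differently, which is the main reason to compare them side by side. The following is a minimal plain-Python sketch of the four metrics, not the project's actual code: the names `METRICS` and `rank` and the toy vectors are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity: compares direction only."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Each entry is (function, higher_is_better).
# Similarities rank descending, distances rank ascending.
METRICS: Dict[str, Tuple[Callable[[Vector, Vector], float], bool]] = {
    "cosine": (cosine, True),
    "dot": (dot, True),
    "euclidean": (euclidean, False),
    "manhattan": (manhattan, False),
}


def rank(query: Vector, docs: List[Vector], metric: str) -> List[int]:
    """Return document indices ordered from best to worst match."""
    fn, higher_is_better = METRICS[metric]
    scores = [fn(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=higher_is_better)


query = [1.0, 0.0]
docs = [[10.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
for name in METRICS:
    print(name, rank(query, docs, name))
```

With these toy vectors, cosine and dot product rank the long vector `[10.0, 0.0]` first because it points in the same direction as the query, while the two distance metrics rank it last because it is far away in absolute terms.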

How we built it

Frontend Architecture (React + TypeScript)

  • Modular Component System: Each RAG stage has its own interactive component with real-time parameter controls
  • D3.js Visualizations: Custom interactive visualizations for embedding spaces, similarity networks, and hierarchical chunk relationships
  • Real-time State Management: Zustand for seamless data flow between components with instant visual feedback

Python Backend (FastAPI + LangChain)

  • AWS Bedrock Integration: Direct integration with production-grade embedding models (Titan, Cohere) and LLMs (Claude 3, Titan Text)
  • Advanced Chunking: LangChain's RecursiveCharacterTextSplitter with hierarchical parent-child relationships
  • FAISS Vector Search: High-performance similarity search with multiple distance metrics
  • Guardrails & Safety: Content validation, token limit monitoring, and relevance scoring
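
To illustrate what a hierarchical parent-child chunk relationship looks like as data, here is a plain-Python sketch. It deliberately does not use LangChain's RecursiveCharacterTextSplitter; it splits by fixed character windows instead, and the names `Chunk`, `split_fixed`, and `hierarchical_chunks` are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a top-level (parent) chunk


def split_fixed(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces


def hierarchical_chunks(
    text: str, parent_size: int, child_size: int, child_overlap: int = 0
) -> List[Chunk]:
    """Build large parent chunks, then split each one into smaller children."""
    chunks: List[Chunk] = []
    for p, parent_text in enumerate(split_fixed(text, parent_size)):
        parent_id = f"p{p}"
        chunks.append(Chunk(parent_id, parent_text))
        children = split_fixed(parent_text, child_size, child_overlap)
        for c, child_text in enumerate(children):
            chunks.append(Chunk(f"{parent_id}.c{c}", child_text, parent_id))
    return chunks


for chunk in hierarchical_chunks("abcdefghijklmnopqrst", 10, 4, child_overlap=1):
    print(chunk.chunk_id, repr(chunk.text), chunk.parent_id)
```

Each child keeps a `parent_id`, so a visualization or retriever can walk from a small matched chunk back to the larger passage it came from.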

Agentic RAGAS System

  • LLM-Powered Code Generation: Uses Claude 3 to interpret natural language evaluation requirements
  • RAGAS Framework Integration: Automatically generates evaluation pipelines using the RAGAS library
  • Synthetic Data Pipeline: Creates realistic test datasets with controlled variations and difficulty levels
  • Evaluation Orchestration: Manages end-to-end evaluation workflows from data generation to metric calculation
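
Two of the metrics the evaluation pipeline reports, context precision and context recall, can be sketched with simple ground-truth labels. This is a simplified, label-based illustration and not the RAGAS implementation, which relies on an LLM to judge relevance; the function names and chunk IDs below are assumptions for the example.

```python
from typing import List, Set


def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at every rank k that holds a relevant chunk.

    Rewards rankings that place relevant chunks near the top.
    """
    hits = 0
    total = 0.0
    for k, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant chunks that appear anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


retrieved = ["c1", "c7", "c3", "c9"]  # ranked retrieval output (toy data)
relevant = {"c1", "c3", "c5"}         # hypothetical ground-truth labels

print(round(context_precision(retrieved, relevant), 3))
print(round(context_recall(retrieved, relevant), 3))
```

Here two of the three relevant chunks are retrieved (recall of about 0.667), and they sit at ranks 1 and 3, giving a precision of (1/1 + 2/3) / 2, or about 0.833.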

Challenges we ran into

Real-time Performance Optimization

  • AWS Bedrock Rate Limits: Implemented intelligent caching and request batching to handle real-time parameter adjustments without hitting API limits
  • Visualization Performance: Large embedding spaces with hundreds of points required Canvas-based rendering and level-of-detail optimization
  • Memory Management: Frequent re-computations for parameter changes needed careful cleanup and efficient data structure reuse
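
The caching and batching idea can be sketched as follows. The write-up does not describe the actual implementation, so this is only one plausible shape: `CachedEmbedder` is a hypothetical class, and `embed_batch` stands in for whatever function issues the real embedding requests.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

# Hypothetical signature for the function that calls the embedding service.
EmbedBatchFn = Callable[[List[str]], List[List[float]]]


class CachedEmbedder:
    """Embed texts while skipping ones already seen and grouping the rest."""

    def __init__(self, embed_batch: EmbedBatchFn, model_id: str, batch_size: int = 16):
        self._embed_batch = embed_batch
        self._model_id = model_id
        self._batch_size = batch_size
        self._cache: Dict[str, List[float]] = {}

    def _key(self, text: str) -> str:
        # Include the model ID so vectors from different models never collide.
        raw = f"{self._model_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        # Collect texts that are not cached yet, dropping duplicates.
        missing: Dict[str, str] = {}
        for text in texts:
            key = self._key(text)
            if key not in self._cache and key not in missing:
                missing[key] = text

        # Send the uncached texts in groups of at most `batch_size`.
        keys = list(missing)
        for i in range(0, len(keys), self._batch_size):
            batch_keys = keys[i:i + self._batch_size]
            vectors = self._embed_batch([missing[k] for k in batch_keys])
            for key, vector in zip(batch_keys, vectors):
                self._cache[key] = vector

        return [self._cache[self._key(text)] for text in texts]


calls: List[List[str]] = []


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real API call: records the batch, returns toy vectors."""
    calls.append(list(batch))
    return [[float(len(text))] for text in batch]


embedder = CachedEmbedder(fake_embed_batch, model_id="demo-model", batch_size=2)
print(embedder.embed(["a", "bb", "a", "ccc"]))
print(embedder.embed(["bb", "dddd"]))
print(len(calls))
```

In this run, six requested texts result in three backend calls, because repeated texts are served from the cache and new ones are grouped. When a user nudges a single parameter, most chunks are unchanged, so a cache like this avoids re-embedding them.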

Educational UX Complexity

  • Cognitive Load: Balancing comprehensive functionality with an intuitive user experience, which we addressed through progressive disclosure and guided tours
  • Visual Clarity: Making complex mathematical concepts (vector similarities, dimensionality reduction) visually understandable through interactive scatter plots and connection networks

Agentic System Reliability

  • Code Generation Accuracy: Ensuring generated RAGAS code is syntactically correct and semantically meaningful required extensive prompt engineering and validation
  • Evaluation Consistency: Making synthetic datasets realistic while maintaining evaluation reliability across different domains and use cases
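
One inexpensive validation step for LLM-generated code is to parse it before running it. The sketch below uses Python's standard `ast` module to reject code that does not parse or that imports modules outside an allow-list. It is an assumed example of such a check, not the project's validator; `validate_generated_code` and the allow-list contents are hypothetical.

```python
import ast
from typing import List, Set


def validate_generated_code(source: str, allowed_modules: Set[str]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed.

    The code is only parsed, never executed or imported.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. "ragas" in "ragas.metrics".
            if name.split(".")[0] not in allowed_modules:
                problems.append(f"import of '{name}' is not allowed")
    return problems


allowed = {"ragas", "datasets"}  # hypothetical allow-list
print(validate_generated_code("from ragas import evaluate\n", allowed))
print(validate_generated_code("import os\n", allowed))
print(validate_generated_code("def broken(:\n    pass\n", allowed))
```

A check like this catches syntax errors and unexpected imports, but it says nothing about whether the code is semantically meaningful; that still needs execution in a sandbox or review.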

Accomplishments that we're proud of

🎯 Educational Impact

  • Interactive Learning: Created the first comprehensive visual RAG education platform where users can see every step of the pipeline in action
  • Real Production Models: Integration with actual AWS Bedrock models means users learn with the same tools they'll use in production
  • Hierarchical Visualization: Pioneered interactive parent-child chunk relationship visualization that makes complex document structures intuitive

🚀 Technical Innovation

  • Multi-Modal Visualization: Seamless integration of text processing, vector mathematics, and interactive graphics in a single coherent interface
  • Production-Ready Architecture: Built with scalability in mind, so it can handle concurrent users and large document processing

🔬 Research Contribution

  • Evaluation Democratization: Made advanced RAG evaluation accessible to developers without deep ML expertise
  • Parameter Exploration: Enabled systematic exploration of RAG parameter spaces that was previously manual and time-consuming

What we learned

🧠 RAG System Complexity

  • Parameter Interdependence: Discovered how chunking strategies dramatically affect downstream embedding quality and retrieval accuracy
  • Evaluation Challenges: Learned that traditional metrics often miss nuanced quality aspects that only human evaluation or sophisticated synthetic datasets can capture
  • Model Behavior: Gained deep insights into how different embedding models (Titan vs Cohere) perform across various text types and domains

🎨 Educational Technology

  • Visualization Psychology: Learned that interactive exploration is far more effective than static explanations for complex technical concepts
  • Progressive Complexity: Discovered the importance of layered learning - starting simple and gradually revealing complexity as users gain confidence

🤖 Agentic System Design

  • Prompt Engineering: Mastered the art of creating prompts that generate reliable, executable code from natural language descriptions
  • Evaluation Framework Integration: Learned to seamlessly integrate multiple evaluation frameworks (RAGAS, custom metrics) into a unified system

What's next for RAG Test Suite

🔮 Advanced Agentic Capabilities

  • Multi-Agent Evaluation: Deploy specialized agents for different evaluation aspects (factual accuracy, coherence, relevance) that collaborate on comprehensive assessments
  • Adaptive Learning: Implement reinforcement learning to improve synthetic dataset generation based on evaluation results and user feedback
  • Domain Specialization: Create domain-specific evaluation agents (medical, legal, technical) with specialized knowledge and evaluation criteria

🌐 Platform Expansion

  • Collaborative Features: Multi-user workspaces where teams can share configurations, compare results, and collaborate on RAG system optimization
  • Integration Ecosystem: APIs and plugins for popular RAG frameworks (LlamaIndex, Haystack, LangChain) to bring the testing capabilities into existing workflows
  • Cloud Deployment: Fully managed SaaS version with enterprise features, team management, and advanced analytics

📊 Advanced Analytics

  • Performance Benchmarking: Comprehensive benchmarking suite comparing different RAG configurations across standardized datasets
  • Cost Optimization: Intelligent recommendations for balancing performance, accuracy, and cost based on use case requirements
  • A/B Testing Framework: Built-in experimentation platform for systematic RAG system optimization

🎓 Educational Evolution

  • Certification Program: Structured learning paths with assessments and certifications for RAG system design and optimization
  • Research Integration: Direct integration with academic research, allowing researchers to test new methods and share results with the community
  • Industry Case Studies: Real-world case studies and best practices from successful RAG implementations across different industries

Key Features Summary

Interactive RAG Pipeline Components

  1. Chunking Module

    • Multiple strategies: Fixed-size, Hierarchical, Sentence-based, Paragraph-based
    • Real-time parameter adjustment
    • Visual chunk boundary highlighting
    • Parent-child relationship visualization
  2. Embedding Module

    • AWS Bedrock integration (Titan, Cohere models)
    • Vector space visualization in 2D
    • Interactive similarity exploration
    • Dimensionality reduction (t-SNE, PCA)
  3. Retrieval Module

    • Multiple similarity metrics
    • Advanced reranking strategies
    • Real-time query processing
    • Visual result highlighting
  4. Generation Module

    • Prompt construction visualization
    • Context window management
    • Guardrail monitoring
    • Response quality analysis
  5. Evaluation Module

    • LLM-as-judge evaluation
    • Runs at the end of the pipeline and explains each evaluation result

The RAG Test Suite represents a paradigm shift from black-box RAG development to transparent, interactive, and scientifically rigorous RAG system design and evaluation.
