What Inspired This Project

As a DevOps engineer, I found myself constantly searching through internal documentation, troubleshooting guides, and knowledge bases to solve infrastructure challenges. Traditional search tools often returned irrelevant results or outdated information, making it difficult to find the right solutions quickly.

I was inspired to build this RAG system after seeing how AI-powered search could transform how we access and utilize organizational knowledge. The idea was to create a system that could understand the intent behind queries and provide contextual, accurate answers based on our actual documentation and runbooks.

What I Learned

Building this RAG system taught me several key concepts that are crucial for DevOps engineers working with AI:

RAG Architecture Fundamentals

  • Embeddings: How text gets converted to numerical vectors for semantic search
  • Vector Databases: Why Elasticsearch with HNSW indexing is powerful for similarity search
  • Retrieval Strategies: Balancing recall vs. precision in document retrieval
  • Token Management: Staying within model token limits (20,000 tokens for Vertex AI embeddings)
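The similarity math underneath semantic search is worth seeing once. A minimal pure-Python sketch of cosine similarity between two embedding vectors (in practice the vector store computes this, at scale, over HNSW-indexed vectors):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: 1.0 for identical
    directions, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```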

Production Considerations

  • Scalability: Auto-scaling Cloud Run services with proper resource allocation
  • Cost Management: Hybrid local/GCP deployment to minimize operational costs
  • Monitoring: Comprehensive logging and metrics for system health
  • Security: VPC networking and IAM roles for secure cloud deployments

DevOps Integration Patterns

  • Infrastructure as Code: Terraform for reproducible GCP deployments
  • CI/CD: Cloud Build for automated Docker image building and deployment
  • Container Orchestration: Docker Compose for local development, Cloud Run for production
  • Configuration Management: Environment-specific configs with proper secrets handling
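For environment-specific configs, the pattern I used boils down to reading everything from environment variables so secrets never land in code. A simplified sketch (the variable names and defaults here are illustrative, not the project's exact ones):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Environment-specific settings; secret values arrive via env vars."""
    environment: str
    elasticsearch_url: str
    llm_provider: str

def load_settings() -> Settings:
    env = os.environ.get("APP_ENV", "local")
    return Settings(
        environment=env,
        # Secrets and endpoints are injected through the environment
        # (e.g. Secret Manager -> env vars on Cloud Run), never committed.
        elasticsearch_url=os.environ.get("ELASTICSEARCH_URL", "http://localhost:9200"),
        llm_provider=os.environ.get("LLM_PROVIDER", "ollama" if env == "local" else "vertex"),
    )
```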

How I Built This Project

Phase 1: Core RAG Implementation

# Key architectural decisions:
# 1. Multi-provider abstraction for LLMs and embeddings
# 2. Elasticsearch as vector store with HNSW indexing
# 3. Document chunking with overlap for better retrieval
# 4. RESTful API design with FastAPI

Phase 2: Production Deployment

  • Local Development: Docker Compose with Ollama for cost-effective testing
  • GCP Production: Cloud Run services with Vertex AI for enterprise-grade performance
  • Infrastructure: Terraform for reproducible, version-controlled deployments

Phase 3: DevOps Integration

  • Monitoring: Comprehensive metrics and logging
  • Cost Optimization: Pause/resume scripts for development environments
  • Documentation: Complete guides for local and production deployment

Challenges I Faced

Token Limit Management

Challenge: Vertex AI embeddings have a 20,000 token limit, but some documents exceeded this during ingestion.

Solution: Implemented intelligent chunking with token validation:

import tiktoken

def count_tokens(text: str) -> int:
    """Accurate token counting with tiktoken."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def truncate_text(text: str, target_tokens: int) -> str:
    """Truncate text to at most target_tokens tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return encoding.decode(encoding.encode(text)[:target_tokens])

# Automatic truncation for oversized chunks (conservative buffer below the 20k limit)
if count_tokens(text) > 15000:
    text = truncate_text(text, target_tokens=15000)

Hybrid Local/Cloud Deployment

Challenge: Balancing development efficiency with production capabilities.

Solution: Created dual deployment modes:

  • Local: Ollama + sentence-transformers for rapid iteration
  • Production: Vertex AI for enterprise-grade performance
  • Unified Interface: Same API regardless of backend
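The "unified interface" comes from a small provider abstraction: the API code talks to one interface and a factory picks the backend. A sketch under assumed names (the real classes call sentence-transformers and Vertex AI; here the local one is stubbed so the example is self-contained):

```python
from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    """Common interface so the API never cares which backend is active."""
    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class LocalEmbeddings(EmbeddingProvider):
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Real code would call sentence-transformers here;
        # stubbed to keep the sketch runnable.
        return [[float(len(t))] for t in texts]

class VertexEmbeddings(EmbeddingProvider):
    def embed(self, texts: list[str]) -> list[list[float]]:
        raise NotImplementedError("calls the Vertex AI embeddings API in production")

def get_embedding_provider(mode: str) -> EmbeddingProvider:
    return LocalEmbeddings() if mode == "local" else VertexEmbeddings()
```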

Infrastructure Complexity

Challenge: Managing VPC networking, IAM roles, and service dependencies in GCP.

Solution: Infrastructure as Code with Terraform:

# VPC connector for Cloud Run → GCE communication
resource "google_vpc_access_connector" "connector" {
  name = "rag-vpc-connector"
  ip_cidr_range = "10.8.0.0/28"
  network = "default"
  region = "us-central1"
}

Cost Optimization

Challenge: GCP costs can escalate quickly with always-on services.

Solution: Implemented cost management strategies:

  • Pause/Resume Scripts: Stop Elasticsearch VM when not needed
  • Auto-scaling: Cloud Run scales to zero when idle
  • Local Development: Complete local stack for testing

DevOps Applications

This RAG system demonstrates several DevOps principles:

Observability

  • Metrics: Query latency, satisfaction rates, cost tracking
  • Logging: Structured logging with correlation IDs
  • Health Checks: Service health monitoring and alerting
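Structured logging with correlation IDs is simple to wire up with the standard library. A minimal sketch (field names are illustrative):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying a correlation ID."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("rag")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Tag every log line for one request with the same ID,
# so a single query can be traced across services.
logger.info("query received", extra={"correlation_id": str(uuid.uuid4())})
```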

Reliability

  • Error Handling: Graceful degradation and retry logic
  • Circuit Breakers: Protection against cascading failures
  • Data Consistency: Proper indexing and change detection
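The retry logic reduces to a small wrapper with exponential backoff. A sketch (a full circuit breaker would additionally count failures and short-circuit calls while the breaker is open):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(); on failure, retry with exponential backoff,
    re-raising the error on the final attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

In this system, a wrapper like this would guard the calls most likely to fail transiently: Elasticsearch queries and embedding/LLM API requests.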

Scalability

  • Horizontal Scaling: Auto-scaling Cloud Run services
  • Performance: Optimized vector search with HNSW indexing
  • Efficiency: Batch processing and connection pooling
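Batch processing here mostly means grouping chunks before each embedding call, so ingestion makes one request per batch instead of one per chunk. A trivial helper sketch:

```python
def batched(items: list, size: int) -> list[list]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Embed documents a batch at a time instead of one request per chunk
for batch in batched(["doc1", "doc2", "doc3"], size=2):
    pass  # embeddings = embed(batch)
```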

Built With

  • artifact-registry
  • cloud-build
  • cloud-run
  • compute-engine
  • elasticsearch
  • fastapi
  • gcp
  • iam
  • python
  • streamlit
  • terraform
  • vertex
  • vpc