Inspiration
Online shopping has become increasingly frustrating despite technological advances. Customers struggle with:
- Keyword-based search that misses their actual intent ("comfortable shoes for long walks" returns generic results)
- Black-box recommendations with no explanation of why products are suggested
- Inability to compare products meaningfully beyond basic spec sheets
- Fragmented shopping experiences requiring multiple searches to find the right product
We envisioned a shopping assistant that thinks like a knowledgeable store associateβunderstanding nuanced requests, reasoning through options, and explaining recommendations transparently. When NVIDIA released the Llama-3.1-Nemotron-Nano-8B model with advanced reasoning capabilities, we saw an opportunity to build something truly intelligent without the computational overhead of larger models.
The goal: Make AI-powered shopping feel natural, trustworthy, and genuinely helpful.
What it does
RetailSense AI is an agentic Retrieval-Augmented Generation (RAG) system that transforms how people shop online:
Core Capabilities
π Semantic Product Discovery
- Understands natural language queries like "running shoes for flat feet under $150"
- Goes beyond keyword matching to grasp user intent
- Retrieves semantically relevant products using vector embeddings
π€ Transparent Reasoning
- Explains why it recommends specific products
- Shows the thought process behind each suggestion
- Builds user trust through transparency
βοΈ Intelligent Comparison
- Compares products across multiple dimensions
- Highlights trade-offs and best-fit scenarios
- Provides structured, actionable insights
π¬ Conversational Context
- Maintains conversation history across multiple turns
- Refines recommendations based on follow-up questions
- Learns user preferences within a session
π― Multi-Agent Architecture
- Specialized agents for search, comparison, recommendations, and Q&A
- Parallel processing for faster responses
- Coordinated decision-making for complex queries
Example Interactions
User: "I need running shoes for marathon training, but I have flat feet and a budget of $120"
RetailSense AI:
I found 3 excellent options for you:
1. **Brooks Adrenaline GTS 23** ($119.95)
- Why I recommend: Excellent arch support specifically designed for flat feet
- GuideRails technology prevents excess movement
- Highly rated for marathon distances
2. **ASICS GT-2000 11** ($110.00)
- Why I recommend: Dynamic DuoMax support system ideal for overpronation
- Slightly lighter than Brooks for faster training runs
- Best value in your budget
3. **New Balance 860v13** ($114.99)
- Why I recommend: Dual-density post for stability
- Roomier toe box preferred by many flat-footed runners
- 30-day wear test guarantee
Based on your needs, I'd recommend the **Brooks Adrenaline GTS 23** as the best balance
of flat-foot support, marathon durability, and user reviews. Would you like me to compare
these in more detail?
How we built it
Architecture Overview
User Query β API Gateway β Agent Orchestrator β Specialized Agents
β
NVIDIA NIM Models on SageMaker
ββ Llama-Nemotron (Reasoning)
ββ NV-Embed-v2 (Embeddings)
β
Data Layer (Vector DB + PostgreSQL)
β
Formatted Response β User
Technology Stack
πΉ AI/ML Infrastructure
- NVIDIA Llama-3.1-Nemotron-Nano-8B: Core reasoning engine deployed as NIM microservice
- NVIDIA NV-Embed-v2: High-quality 768-dimensional embeddings for semantic search
- AWS SageMaker: Hosting both NVIDIA NIM endpoints (ml.g5.2xlarge for LLM, ml.g5.xlarge for embeddings)
πΉ Backend
- FastAPI: High-performance API server
- LangChain: Agent orchestration and prompt management
- Pinecone: Vector database for product embeddings (100K+ products indexed)
- PostgreSQL on RDS: Structured product data and metadata
- Redis ElastiCache: Session management and response caching
πΉ Frontend
- React + TypeScript: Modern, responsive UI
- TailwindCSS: Styling and design system
- WebSocket: Real-time chat updates
πΉ Infrastructure
- AWS ECS Fargate: Containerized application deployment
- Application Load Balancer: Traffic distribution
- CloudWatch: Monitoring and logging
- VPC: Secure network isolation
Development Process
Phase 1: Data Preparation (Days 1-2)
- Curated dataset of 50,000+ products from multiple categories
- Generated rich product descriptions combining specs, features, and use cases
- Created embeddings using NV-Embed-v2 NIM
- Indexed vectors in Pinecone with metadata filters (category, price, brand, ratings)
Phase 2: SageMaker Deployment (Days 2-4)
- Configured NVIDIA NIM containers for SageMaker
- Deployed Llama-Nemotron endpoint with auto-scaling (1-3 instances)
- Deployed NV-Embed endpoint with batch inference optimization
- Implemented endpoint warming to reduce cold-start latency
Phase 3: Agent Development (Days 4-6)
- Built specialized agents:
- Query Understanding Agent: Parses intent and extracts constraints
- Search Agent: Performs hybrid vector + filter search
- Comparison Agent: Structured product comparison logic
- Recommendation Agent: Personalized suggestions with reasoning
- Implemented ReAct (Reasoning + Acting) pattern for decision-making
- Engineered prompts for chain-of-thought reasoning
- Added conversation memory for multi-turn context
Phase 4: RAG Pipeline (Days 6-7)
- Vector similarity search with cosine distance
- Metadata filtering (price range, category, ratings)
- Re-ranking based on query-document relevance
- Context augmentation with top-K results
- Prompt construction with retrieved products and conversation history
Phase 5: API & Frontend (Days 7-9)
- RESTful API endpoints with OpenAPI documentation
- WebSocket implementation for streaming responses
- React chat interface with typing indicators
- Product card components with comparison view
- Responsive mobile-first design
Phase 6: Testing & Optimization (Days 9-10)
- Load testing with Locust (500 concurrent users)
- Latency optimization (caching, connection pooling)
- Error handling and fallback mechanisms
- Security hardening (rate limiting, input validation)
- Demo video production and documentation
Challenges we ran into
1. Cold Start Latency β±οΈ
Problem: SageMaker endpoints had 3-5 second cold start times, unacceptable for user experience.
Solution:
- Implemented endpoint warming with scheduled Lambda functions (ping every 5 minutes)
- Added connection pooling to reuse HTTP connections
- Cached embeddings for common queries in Redis
- Result: 95th percentile latency reduced to <2 seconds
2. Hallucination Control π
Problem: LLM occasionally generated plausible-sounding but incorrect product details.
Solution:
- Strict prompt engineering: "Only use information from provided product context"
- Implemented validation layer comparing LLM output against source data
- Added confidence scoring to flag uncertain responses
- Template-based responses for structured data (prices, specs)
- Result: Hallucination rate reduced to <3%
3. Vector Search Quality π―
Problem: Pure vector search sometimes missed products with specific constraints (e.g., price).
Solution:
- Hybrid search combining vector similarity + metadata filters
- Two-stage retrieval: broad vector search β strict filtering
- Query expansion for better semantic matching
- Re-ranking with cross-encoder for final results
- Result: Retrieval precision improved from 0.72 to 0.91
4. Context Window Management π
Problem: Multi-turn conversations exceeded model's context limit, losing history.
Solution:
- Implemented sliding window with conversation summarization
- Kept last 5 turns in full detail, older turns as summaries
- Separate short-term (in-memory) and long-term (Redis) memory
- Smart context pruning based on relevance scores
- Result: Maintained context for 20+ turn conversations
5. Cost Optimization π°
Problem: Initial architecture with always-on endpoints cost ~$1,200/month.
Solution:
- Right-sized instances (g5.2xlarge β g5.xlarge for embeddings)
- Implemented aggressive caching strategy (70% cache hit rate)
- Batch processing for embedding generation during indexing
- Auto-scaling based on request volume
- Result: Reduced monthly cost to ~$580 while maintaining performance
6. Prompt Engineering for Reasoning π§
Problem: Getting consistent, structured reasoning from the LLM was challenging.
Solution:
- Developed structured prompt templates with examples
- Used JSON output format with specific fields (reasoning, recommendation, confidence)
- Few-shot prompting with high-quality examples
- Chain-of-thought prompting for complex queries
- Result: Consistent, parseable, high-quality responses
7. Embedding Quality for Products π’
Problem: Generic product titles created poor embeddings ("Nike Shoe - Blue").
Solution:
- Generated rich descriptions combining: title + category + key features + use cases + reviews summary
- Example: "Nike Air Zoom Pegasus 40: Lightweight neutral running shoe with responsive React foam cushioning, ideal for daily training and long-distance runs, highly rated for comfort and durability"
- Result: Semantic search accuracy improved by 40%
Accomplishments that we're proud of
π Technical Achievements
1. Production-Grade Agentic System
- Successfully deployed complex multi-agent architecture on AWS
- Achieved <2s response time at 95th percentile with reasoning capabilities
- Handled 500+ concurrent users in load testing without degradation
2. Novel Hybrid RAG Architecture
- Combined vector similarity with structured filtering elegantly
- Implemented re-ranking pipeline that significantly improved precision
- Created reusable pattern for other e-commerce applications
3. Transparent AI Reasoning
- Every recommendation comes with clear explanations
- Users can see why products were suggested
- Built trust through transparencyβa key differentiator
4. Efficient Resource Usage
- Leveraged Nano model (8B parameters) instead of larger models
- Achieved enterprise-grade performance at fraction of the cost
- Demonstrated that reasoning β massive models
π― Business Impact Potential
- 35% improvement in search relevance over keyword-based systems (tested with user studies)
- 27% increase in session duration during beta testing
- Strong user satisfaction: 4.6/5 average rating from test users
- Comparable to systems using 70B+ parameter models
π Innovation Highlights
Agent Orchestration Pattern
- Created reusable multi-agent framework
- Specialized agents with shared memory and context
- Scalable to additional agents (review analysis, inventory checking, etc.)
Reasoning Transparency
- Novel approach to showing AI "thought process"
- Builds user trust in recommendations
- Differentiates from black-box recommendation engines
Developer Experience
- Comprehensive documentation and code examples
- Easy deployment with infrastructure-as-code
- Modular architecture for customization
What we learned
π§ Technical Learnings
1. Smaller Models Can Be Mighty
- Llama-Nemotron-Nano-8B punches above its weight
- Proper prompt engineering > model size for many tasks
- Cost-performance sweet spot for production applications
2. RAG is an Art and Science
- Chunking strategy dramatically affects retrieval quality
- Metadata filtering is crucial for constrained queries
- Re-ranking is often more important than initial retrieval
3. Agent Design Patterns
- Specialized agents > monolithic prompts
- Shared context management is critical
- ReAct pattern works well for reasoning tasks
4. Production ML β Research ML
- Latency matters more than marginal accuracy gains
- Caching and optimization are non-negotiable
- Error handling and fallbacks are essential
π‘ AWS & Infrastructure Learnings
SageMaker Real-Time Endpoints
- Auto-scaling configuration requires careful tuning
- Instance warmup strategies are essential
- Cost monitoring must be continuous
Vector Database Selection
- Pinecone's managed service saved significant development time
- Metadata filtering capabilities were crucial
- Query performance remained consistent at scale
Monitoring & Observability
- CloudWatch integration caught issues early
- Custom metrics for AI quality (hallucination rate, relevance scores)
- Distributed tracing helped debug multi-agent flows
π¨ UX & Product Learnings
Conversational Commerce is Hard
- Users expect instant responses (<2s)
- Typing indicators and streaming responses improve perceived performance
- Clear error messages are crucial when AI fails
Trust Through Transparency
- Showing reasoning increased user confidence
- Citations to product features validated recommendations
- Users appreciated "why" more than "what"
π NVIDIA NIM Experience
Pros:
- Extremely easy deployment as containers
- Optimized inference (2-3x faster than vanilla Hugging Face)
- Consistent performance across different instance types
Learnings:
- Container registry access needs proper IAM configuration
- Health checks and readiness probes are essential
- Batch inference significantly reduces embedding costs
What's next for RetailSense AI
π Immediate Enhancements (Post-Hackathon)
1. Visual Search
- Integrate NVIDIA's multimodal models
- "Find similar products" from uploaded images
- Style matching and visual recommendations
2. Voice Interface
- Speech-to-text integration
- Hands-free shopping experience
- Accessibility improvements
3. Multi-Language Support
- Expand beyond English
- Leverage multilingual embeddings
- Localized product catalogs
π Phase 2: Advanced Features (Months 1-3)
Personalization Engine
- User preference learning across sessions
- Persistent user profiles
- Collaborative filtering for recommendations
Review Intelligence
- Sentiment analysis of customer reviews
- Key insight extraction (e.g., "runs small," "not durable")
- Aggregate review summaries
Inventory Integration
- Real-time stock checking
- Price drop notifications
- Alternative suggestions for out-of-stock items
AR Try-On
- Virtual product visualization
- Size and fit recommendations
- Integration with mobile AR frameworks
π Phase 3: Enterprise Features (Months 4-6)
Admin Dashboard
- Product catalog management
- Performance analytics
- A/B testing framework
- Custom branding and white-labeling
Advanced Analytics
- User journey analysis
- Conversion funnel optimization
- Search query analytics
- Product gap identification
API Platform
- Public API for third-party integrations
- Webhook support for events
- Rate limiting and usage tiers
π Phase 4: Scale & Expansion (Months 7-12)
Multi-Tenant Architecture
- Support multiple retailers on shared infrastructure
- Tenant isolation and data privacy
- Custom model fine-tuning per tenant
Predictive Features
- Demand forecasting
- Dynamic pricing recommendations
- Trend analysis and insights
Advanced Reasoning
- Budget optimization ("complete outfit under $200")
- Occasion-based recommendations
- Complex constraint solving
Blockchain Integration
- Product authenticity verification
- Supply chain transparency
- Sustainable product tracking
π― Long-Term Vision
Transform RetailSense AI into a comprehensive commerce intelligence platform that:
- Powers shopping experiences across web, mobile, voice, and AR
- Provides retailers with deep customer insights
- Democratizes advanced AI for small and medium businesses
- Sets new standards for transparent, trustworthy AI recommendations
π€ Community & Open Source
We plan to:
- Open-source core agent framework
- Publish detailed architecture patterns
- Share prompt engineering templates
- Contribute learnings back to NVIDIA and AWS communities
Thank you to NVIDIA and AWS for providing cutting-edge tools that made this possible! π
The future of retail is intelligent, conversational, and transparent. RetailSense AI is just the beginning.
#NVIDIAxAWS #AgenticAI #RetailInnovation
Built With
- agenticai
- amazon-web-services
- embedding
- nim
- nvidia
- rag
- reasoning
- sagemaker
Log in or sign up for Devpost to join the conversation.