AIVA — AI Virtual Assistant for Autonomous Shopping
💡 Inspiration
The idea for AIVA was born from a simple frustration — online shopping is overwhelming. With thousands of products across multiple platforms, comparing prices, reading reviews, and finding the best deal within a budget is a time-consuming, repetitive task that shouldn't require human effort in 2026.
We asked ourselves: What if an AI agent could do the entire shopping process for you — from understanding what you want through a simple voice command, to finding the best product, to actually adding it to your cart?
The vision was clear — build an autonomous shopping agent that doesn't just recommend products but actually acts on your behalf across real e-commerce platforms. Not a chatbot that gives links. Not a price tracker that sends alerts. A true end-to-end agent that shops like a human but thinks like an AI.
🧠 What We Learned
Technical Learnings
1. Prompt Engineering is an Art
Getting Google Gemini to reliably extract structured data (intent, budget, category) from casual conversational input required extensive prompt iteration. We learned that providing explicit output format instructions and few-shot examples dramatically improved accuracy, from ~60% to ~95%.
2. Browser Automation is Fragile
E-commerce websites constantly change their DOM structure. A CSS selector that works today breaks tomorrow. We learned to build multi-strategy element detection that tries up to 20 different selectors in sequence until one works:
selectors = [
    "#add-to-cart-button",               # Primary
    "input[name='submit.add-to-cart']",  # Fallback
    "button[data-action='add-to-cart']", # Alternative
    # ... 17 more fallbacks
]
3. Voice Recognition Accuracy Depends on Environment
Google Speech API accuracy drops significantly with background noise. We implemented ambient noise calibration at startup:
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=2)
    # Energy threshold dynamically set: ~45-300 range
4. Budget Extraction is a Natural Language Problem
Users express budgets in wildly different ways. We built a regex pipeline supporting 6 patterns:
$$P_{match} = \bigcup_{i=1}^{6} \text{regex}_i(query)$$
Where patterns cover: "under ₹50k", "below 60000", "budget of 70k", "max 80000", "within 50,000", "up to ₹45000"
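A minimal sketch of what such a pipeline could look like, covering the six phrasings listed above. The names `BUDGET_PATTERNS` and `extract_budget` are hypothetical, and returning the first match (rather than a union of all matches) is a simplifying assumption, not necessarily how AIVA resolves overlaps.

```python
import re
from typing import Optional

# Amount: optional rupee marker, digits with optional commas, optional "k" suffix.
_AMOUNT = r"(?:₹|rs\.?\s*)?\s*(?P<num>\d[\d,]*(?:\.\d+)?)\s*(?P<k>k)?\b"

# One pattern per trigger phrase (the six phrasings from the list above).
_KEYWORDS = ["under", "below", "budget of", "max", "within", "up to"]
BUDGET_PATTERNS = [
    re.compile(rf"\b{kw}\s+{_AMOUNT}", re.IGNORECASE) for kw in _KEYWORDS
]


def extract_budget(query: str) -> Optional[int]:
    """Return the budget in rupees from the first pattern that matches."""
    for pattern in BUDGET_PATTERNS:
        match = pattern.search(query)
        if match:
            value = float(match.group("num").replace(",", ""))
            if match.group("k"):  # "50k" means 50 thousand
                value *= 1000
            return int(value)
    return None  # no budget mentioned


if __name__ == "__main__":
    for q in ["laptop under ₹50k", "phone within 50,000", "any good headphones"]:
        print(q, "->", extract_budget(q))
```

Keeping the amount grammar in one shared fragment means a new trigger phrase is a one-word addition to `_KEYWORDS`.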
Product Learnings
- Users trust voice over typing for quick shopping tasks
- "Why this product?" reasoning from AI builds more trust than just showing results
- Cart addition is the "wow moment" — seeing the browser move autonomously is powerful
🔨 How We Built It
Architecture Overview
AIVA is built as a 4-layer autonomous agent:
Voice/Text Input → AI Processing → Product Search → Browser Automation
Layer 1: User Interface
We built two parallel interfaces for different use cases:
Streamlit Web UI: for chat-based shopping from a browser. Streamlit's chat_input() and session_state made it possible to build a conversational interface in under 100 lines of UI code.
Tkinter Desktop GUI: for voice-first shopping. Tkinter is part of Python's standard library, so the GUI layer itself adds no third-party dependency.
Layer 2: AI Processing (Google Gemini 2.5 Flash)
Every user query passes through Gemini for understanding:
prompt = f"""
Analyze this shopping query: "{user_input}"
Extract:
1. Product type/category
2. Budget (if mentioned)
3. Brand preference (if any)
4. Key features requested
"""
response = model.generate_content(prompt)
Gemini also generates recommendation reasoning — explaining why a particular product is the best match. This transformed AIVA from a search tool into a shopping advisor.
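The explicit-output-format and few-shot lesson from our prompt iteration can be sketched as follows. This is an illustration, not AIVA's actual prompt: the JSON keys, the example query, and the helper names `build_prompt` and `parse_reply` are all assumptions. The code only builds the prompt and parses a reply string, so it runs without an API key.

```python
import json
from typing import Any

# One worked example (few-shot) showing the exact output shape we want.
FEW_SHOT = (
    'Query: "gaming laptop under 70k from asus"\n'
    'JSON: {"category": "laptop", "budget": 70000, '
    '"brand": "asus", "features": ["gaming"]}'
)


def build_prompt(user_input: str) -> str:
    """Ask for a fixed JSON shape and include one worked example."""
    return (
        "Analyze this shopping query. Reply with ONLY a JSON object with keys "
        '"category" (string), "budget" (integer or null), '
        '"brand" (string or null) and "features" (list of strings).\n\n'
        f"Example:\n{FEW_SHOT}\n\n"
        f"Query: {json.dumps(user_input, ensure_ascii=False)}\nJSON:"
    )


def parse_reply(text: str) -> dict[str, Any]:
    """Pull the JSON object out of a model reply, tolerating extra text."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    data = json.loads(text[start:end + 1])
    # Fill in missing keys so callers always see the same shape.
    return {
        "category": data.get("category"),
        "budget": data.get("budget"),
        "brand": data.get("brand"),
        "features": data.get("features") or [],
    }
```

In use, the string from `build_prompt(user_input)` would be passed to `model.generate_content(...)` and the reply's `.text` handed to `parse_reply`.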
Layer 3: Product Search & Ranking
We implemented a multi-factor scoring algorithm:
$$\text{Score}(p) = 0.4 \cdot R_{relevance} + 0.3 \cdot R_{value} + 0.2 \cdot R_{rating} + 0.1 \cdot R_{popularity}$$
Where:
- $R_{relevance}$ = keyword match ratio between query and product title
- $R_{value}$ = $1 - \frac{price}{budget}$ (how much budget is left)
- $R_{rating}$ = normalized product rating $\frac{rating}{5.0}$
- $R_{popularity}$ = normalized review count
Products scoring below threshold $\theta = 0.2$ are filtered out.
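The formula translates almost directly into code. The sketch below is one possible reading: the `Product` fields, clamping each factor to [0, 1], and normalizing review count against the largest count in the candidate set are assumptions, since the writeup does not pin those details down.

```python
from dataclasses import dataclass

THRESHOLD = 0.2  # theta: products scoring below this are dropped


@dataclass
class Product:
    title: str
    price: float
    rating: float       # 0 to 5 stars
    review_count: int


def _clamp(x: float) -> float:
    return max(0.0, min(1.0, x))


def score(p: Product, query: str, budget: float, max_reviews: int) -> float:
    """Weighted sum: 0.4 relevance + 0.3 value + 0.2 rating + 0.1 popularity."""
    query_words = set(query.lower().split())
    title_words = set(p.title.lower().split())
    relevance = len(query_words & title_words) / len(query_words) if query_words else 0.0
    value = _clamp(1 - p.price / budget) if budget > 0 else 0.0
    rating = _clamp(p.rating / 5.0)
    popularity = _clamp(p.review_count / max_reviews) if max_reviews > 0 else 0.0
    return 0.4 * relevance + 0.3 * value + 0.2 * rating + 0.1 * popularity


def rank(products: list[Product], query: str, budget: float) -> list[Product]:
    """Score every product, drop those below THRESHOLD, best first."""
    max_reviews = max((p.review_count for p in products), default=0)
    scored = [(score(p, query, budget, max_reviews), p) for p in products]
    kept = [(s, p) for s, p in scored if s >= THRESHOLD]
    return [p for s, p in sorted(kept, key=lambda sp: sp[0], reverse=True)]
```

For a laptop at ₹40,000 against a ₹50,000 budget, $R_{value} = 1 - 40000/50000 = 0.2$, so a well-reviewed, fully relevant product still gets most of its score from relevance and rating rather than from price alone.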
Layer 4: Browser Automation (Selenium)
The most complex layer. Selenium drives a real Chrome browser to:
- Navigate to Amazon/Flipkart
- Search for products
- Extract live product data from the page
- Select the AI-recommended product
- Add to cart autonomously
The key challenge was reliability. We built an explicit wait strategy using WebDriverWait:
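from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "add-to-cart-button"))
)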
Voice Pipeline
The voice system chains three technologies:
$$\text{Audio} \xrightarrow{\text{PyAudio}} \text{Stream} \xrightarrow{\text{Google Speech API}} \text{Text} \xrightarrow{\text{Gemini}} \text{Action}$$
And for feedback:
$$\text{Response} \xrightarrow{\text{pyttsx3}} \text{Speech Output}$$
⚡ Challenges We Faced
1. Anti-Bot Detection
Amazon and Flipkart have sophisticated bot detection. Our Selenium automation was initially blocked within minutes. We solved this by:
- Using realistic User-Agent headers
- Adding randomized delays between actions (1-3 seconds)
- Starting the browser in maximized mode (bots typically use headless/small windows)
- Handling cookie consent and popup dialogs programmatically
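The randomized delay between actions is the simplest of these to show. A small sketch, where the helper name `human_pause` and the injectable `sleep` parameter (there only to make the helper testable) are assumptions:

```python
import random
import time
from typing import Callable


def human_pause(
    min_s: float = 1.0,
    max_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in [min_s, max_s] and return it."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

Calling `human_pause()` between the search, select, and add-to-cart steps spaces the actions 1-3 seconds apart instead of firing them back to back.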
2. Dynamic DOM Structures
E-commerce sites use dynamic class names that change on every deployment. A selector like div.sg-col-inner breaks frequently. Our solution: 20-selector fallback chains per action, tested weekly.
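A fallback chain of this kind can be sketched in a few lines. The helper `first_match` is hypothetical; it takes the lookup function as a parameter so the same logic works with Selenium or with a test double.

```python
from typing import Callable, Iterable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def first_match(
    selectors: Iterable[str],
    lookup: Callable[[str], T],
    not_found: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[Tuple[str, T]]:
    """Try each selector in order; return (selector, element) for the first hit.

    With Selenium, lookup could be:
        lambda css: driver.find_element(By.CSS_SELECTOR, css)
    and not_found could be narrowed to (NoSuchElementException,).
    """
    for selector in selectors:
        try:
            return selector, lookup(selector)
        except not_found:
            continue  # this selector is stale; fall through to the next one
    return None  # every selector failed


if __name__ == "__main__":
    # Stand-in for a page where only the second selector still exists.
    page = {"input[name='submit.add-to-cart']": "<button>"}
    chain = ["#add-to-cart-button", "input[name='submit.add-to-cart']"]
    print(first_match(chain, page.__getitem__, not_found=(KeyError,)))
```

Returning the selector that worked, not just the element, makes it easy to log which fallbacks are actually being hit and to reorder the chain over time.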
3. Voice Recognition in Noisy Environments
Initial voice accuracy was ~70% in real-world conditions. We implemented:
- 2-second ambient noise calibration at startup
- Dynamic energy threshold adjustment
- Pause threshold tuning (0.8s) to avoid cutting off mid-sentence
4. Budget Extraction Edge Cases
Users say budgets in unpredictable ways:
- "around fifty thousand" (words, not numbers)
- "50-60k range" (ranges)
- "not more than 50000" (negation-based)
- "₹50,000/-" (Indian currency formatting)
Building a regex system that handles all these required multiple iterations and real-user testing.
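To illustrate, here is a sketch that handles the four edge cases listed above. It is deliberately narrow: the function name `budget_bounds`, the tens-only word table, treating "around" as a ceiling, and picking the largest number when several appear are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Tens only, enough for phrases like "fifty thousand" (illustrative subset).
_WORDS = {"ten": 10, "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
          "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

_RANGE = re.compile(r"(\d[\d,]*)\s*(k)?\s*(?:-|to)\s*(\d[\d,]*)\s*(k)?\b", re.I)
_WORD_THOUSAND = re.compile(r"\b(" + "|".join(_WORDS) + r")\s+thousand\b", re.I)
_SINGLE = re.compile(r"(\d[\d,]*)\s*(k)?\b", re.I)


def _to_int(digits: str, k: Optional[str]) -> int:
    value = int(digits.replace(",", ""))
    return value * 1000 if k else value


def budget_bounds(text: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) in rupees; low is 0 when only a ceiling is given."""
    m = _RANGE.search(text)
    if m:
        # In "50-60k" the suffix on the upper bound applies to both ends.
        k_low = m.group(2) or m.group(4)
        low = _to_int(m.group(1), k_low)
        high = _to_int(m.group(3), m.group(4))
        return (min(low, high), max(low, high))
    m = _WORD_THOUSAND.search(text)
    if m:
        return (0, _WORDS[m.group(1).lower()] * 1000)
    # Plain amounts, including Indian formatting such as "50,000/-".
    amounts = [_to_int(m.group(1), m.group(2)) for m in _SINGLE.finditer(text)]
    if amounts:
        return (0, max(amounts))
    return None
```

Returning a (low, high) pair instead of a single number lets range queries and ceiling-only queries flow through the same downstream filter.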
5. Cross-Platform Consistency
Amazon and Flipkart have completely different page structures, search mechanisms, and cart flows. We built a platform adapter pattern — a common interface with platform-specific implementations:
class AmazonAdapter:
    search_box_selector = "#twotabsearchtextbox"
    add_to_cart_selector = "#add-to-cart-button"

class FlipkartAdapter:
    search_box_selector = "input[name='q']"
    add_to_cart_selector = "button._2KpZ6l"
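One way to make the shared interface explicit is a small base class that owns the common flow while subclasses supply only the selectors. This is a sketch: the base class `PlatformAdapter`, the `cart_steps` method, the step tuples, and the `base_url` values are hypothetical, and the real flow has more steps (such as selecting the recommended product).

```python
from abc import ABC
from typing import List, Tuple

Step = Tuple[str, str, str]  # (action, target, text)


class PlatformAdapter(ABC):
    """Common interface: subclasses provide selectors, the flow is shared."""
    base_url: str
    search_box_selector: str
    add_to_cart_selector: str

    def cart_steps(self, query: str) -> List[Step]:
        """Describe the search-and-add flow as platform-neutral steps."""
        return [
            ("open", self.base_url, ""),
            ("type", self.search_box_selector, query),
            ("click", self.add_to_cart_selector, ""),
        ]


class AmazonAdapter(PlatformAdapter):
    base_url = "https://www.amazon.in"  # assumed URL
    search_box_selector = "#twotabsearchtextbox"
    add_to_cart_selector = "#add-to-cart-button"


class FlipkartAdapter(PlatformAdapter):
    base_url = "https://www.flipkart.com"  # assumed URL
    search_box_selector = "input[name='q']"
    add_to_cart_selector = "button._2KpZ6l"


if __name__ == "__main__":
    for adapter in (AmazonAdapter(), FlipkartAdapter()):
        print(type(adapter).__name__, adapter.cart_steps("wireless mouse"))
```

With this shape, supporting another store means adding one subclass with its selectors; the code that executes the steps does not change.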
6. Keeping It Free
A core constraint was zero cost. Every technology choice was made to stay within free tiers:
- Google Gemini: 60 requests/minute free
- Google Speech API: Free tier sufficient for demo
- Selenium: Open source
- Streamlit: Free deployment
- No database hosting costs (in-memory storage)
This constraint actually forced better engineering decisions — simpler architecture, fewer dependencies, faster performance.
🏁 Final Thoughts
Building AIVA taught us that the gap between "AI chatbot" and "AI agent" is enormous. A chatbot tells you what to buy. An agent buys it for you. That difference required solving real-world problems — browser automation, voice processing, anti-bot detection — that no amount of prompt engineering alone could handle.
The result is an AI that truly acts on your behalf, saving hours of comparison shopping and reducing it to a single voice command.
Built With
- amazon.in
- beautifulsoup4
- bigbasket
- chrome
- flipkart.com
- google-cloud
- google-gemini-2.5-flash-api
- google-speech-recognition-api
- in-memory storage (native data structures)
- lxml
- numpy
- pyaudio
- python
- python-dotenv
- pyttsx3
- requests
- selenium
- speechrecognition
- streamlit
- tkinter
- urllib
- webdriver-manager
