AIVA — AI Virtual Assistant for Autonomous Shopping

💡 Inspiration

The idea for AIVA was born from a simple frustration — online shopping is overwhelming. With thousands of products across multiple platforms, comparing prices, reading reviews, and finding the best deal within a budget is a time-consuming, repetitive task that shouldn't require human effort in 2026.

We asked ourselves: What if an AI agent could do the entire shopping process for you — from understanding what you want through a simple voice command, to finding the best product, to actually adding it to your cart?

The vision was clear — build an autonomous shopping agent that doesn't just recommend products but actually acts on your behalf across real e-commerce platforms. Not a chatbot that gives links. Not a price tracker that sends alerts. A true end-to-end agent that shops like a human but thinks like an AI.

🧠 What We Learned

Technical Learnings

1. Prompt Engineering is an Art

Getting Google Gemini to reliably extract structured data (intent, budget, category) from casual conversational input required extensive prompt iteration. We learned that providing explicit output format instructions and few-shot examples dramatically improved accuracy from ~60% to ~95%.
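To make the idea concrete, here is a minimal sketch of a prompt that combines an explicit output format with few-shot examples. It is not our exact prompt: the example queries, the JSON keys, and the helper names `build_prompt` and `parse_reply` are assumptions chosen for illustration.

```python
import json

# Hypothetical few-shot examples (illustrative only, not the project's real prompt).
FEW_SHOT = [
    ("I need a laptop under 50k for coding",
     {"category": "laptop", "budget": 50000, "intent": "buy"}),
    ("show me some running shoes",
     {"category": "running shoes", "budget": None, "intent": "browse"}),
]


def build_prompt(user_input: str) -> str:
    """Build a prompt with explicit format instructions plus worked examples."""
    lines = [
        "Extract the shopping intent from the query.",
        'Reply with ONLY a JSON object with keys "category", "budget", "intent".',
        "Use null for budget when none is mentioned.",
        "",
    ]
    for query, answer in FEW_SHOT:
        lines.append(f"Query: {query}")
        lines.append(f"JSON: {json.dumps(answer)}")
        lines.append("")
    lines.append(f"Query: {user_input}")
    lines.append("JSON:")
    return "\n".join(lines)


def parse_reply(text: str) -> dict:
    """Pull the JSON object out of a model reply, tolerating surrounding text."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    data = json.loads(text[start:end + 1])
    # Keep only the expected keys; missing keys become None.
    return {key: data.get(key) for key in ("category", "budget", "intent")}
```

The string returned by `build_prompt` would be sent to the model, and `parse_reply` turns the reply back into a dictionary even when the model wraps the JSON in extra text.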

2. Browser Automation is Fragile

E-commerce websites constantly change their DOM structure. A CSS selector that works today breaks tomorrow. We learned to build multi-strategy element detection, trying 20 different selectors sequentially until one works:

selectors = [
    "#add-to-cart-button",           # Primary
    "input[name='submit.add-to-cart']",  # Fallback
    "button[data-action='add-to-cart']", # Alternative
    # ... 17 more fallbacks
]
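A fallback chain like this can be walked with a small helper. The sketch below is one way to do it, not our exact code: `first_match` is a hypothetical name, and the lookup is passed in as a callable so the helper works with any finder function.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def first_match(find: Callable[[str], T], selectors: Iterable[str]) -> Optional[T]:
    """Try each selector in order; return the first element found, else None."""
    for selector in selectors:
        try:
            element = find(selector)
        except Exception:
            # With Selenium this would typically be NoSuchElementException.
            continue
        if element is not None:
            return element
    return None
```

With Selenium, the finder could be `lambda s: driver.find_element(By.CSS_SELECTOR, s)`, so the call becomes `first_match(finder, selectors)`.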

3. Voice Recognition Accuracy Depends on Environment

Google Speech API accuracy drops significantly with background noise. We implemented ambient noise calibration at startup:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=2)
    # Energy threshold dynamically set: ~45-300 range

4. Budget Extraction is a Natural Language Problem

Users express budgets in wildly different ways. We built a regex pipeline supporting 6 patterns:

$$P_{match} = \bigcup_{i=1}^{6} \text{regex}_i(query)$$

Where patterns cover: "under ₹50k", "below 60000", "budget of 70k", "max 80000", "within 50,000", "up to ₹45000"
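The six phrasings above can be covered with one shared amount pattern and six prefixes. This is a simplified sketch rather than our production pipeline: the `AMOUNT` pattern, the `extract_budget` name, and the choice to return the first match (instead of the union of all matches) are assumptions.

```python
import re
from typing import Optional

# Optional currency marker, digits with optional commas/decimals, optional "k".
AMOUNT = r"(?:₹|rs\.?)?\s*(\d[\d,]*(?:\.\d+)?)\s*(k)?\b"

# One prefix per phrasing listed above.
PREFIXES = ["under", "below", "budget of", "max", "within", "up to"]
PATTERNS = [
    re.compile(rf"\b{re.escape(prefix)}\s+{AMOUNT}", re.IGNORECASE)
    for prefix in PREFIXES
]


def extract_budget(query: str) -> Optional[int]:
    """Return the first budget found in the query as an integer, or None."""
    for pattern in PATTERNS:
        match = pattern.search(query)
        if match:
            value = float(match.group(1).replace(",", ""))
            if match.group(2):  # a trailing "k" means thousands
                value *= 1000
            return int(value)
    return None
```

For example, `extract_budget("laptop under ₹50k")` and `extract_budget("within 50,000")` both normalize to the same integer budget.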

Product Learnings

Users trust voice over typing for quick shopping tasks
"Why this product?" reasoning from AI builds more trust than just showing results
Cart addition is the "wow moment" — seeing the browser move autonomously is powerful

🔨 How We Built It

Architecture Overview

AIVA is built as a 4-layer autonomous agent:

Voice/Text Input → AI Processing → Product Search → Browser Automation

Layer 1: User Interface

We built two parallel interfaces for different use cases:

Streamlit Web UI: for chat-based shopping from a browser. Streamlit's chat_input() and session_state made it possible to build a conversational interface in under 100 lines of UI code.

Tkinter Desktop GUI: for voice-first shopping. Tkinter is part of Python's standard library, so the GUI layer itself adds no third-party dependency.

Layer 2: AI Processing (Google Gemini 2.5 Flash)

Every user query passes through Gemini for understanding:

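import os
import google.generativeai as genai  # assumes the google-generativeai SDK

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # env var name is an assumption
model = genai.GenerativeModel("gemini-2.5-flash")

prompt = f"""
Analyze this shopping query: "{user_input}"
Extract:
1. Product type/category
2. Budget (if mentioned)
3. Brand preference (if any)
4. Key features requested
"""
response = model.generate_content(prompt)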

Gemini also generates recommendation reasoning — explaining why a particular product is the best match. This transformed AIVA from a search tool into a shopping advisor.

Layer 3: Product Search & Ranking

We implemented a multi-factor scoring algorithm:

$$\text{Score}(p) = 0.4 \cdot R_{relevance} + 0.3 \cdot R_{value} + 0.2 \cdot R_{rating} + 0.1 \cdot R_{popularity}$$

Where:

$R_{relevance}$ = keyword match ratio between query and product title
$R_{value}$ = $1 - \frac{price}{budget}$ (the fraction of the budget left unspent, so cheaper products score higher)
$R_{rating}$ = normalized product rating $\frac{rating}{5.0}$
$R_{popularity}$ = normalized review count

Products scoring below threshold $\theta = 0.2$ are filtered out.
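The formula translates almost directly into code. The sketch below is illustrative, not our exact implementation: the `Product` fields, whitespace-token matching for relevance, max-based normalization of review counts, and clamping $R_{value}$ to zero for over-budget items are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

THRESHOLD = 0.2  # products scoring below this are filtered out


@dataclass
class Product:
    title: str
    price: float
    rating: float       # 0.0 to 5.0
    review_count: int


def relevance(query: str, title: str) -> float:
    """Share of query words that also appear in the product title."""
    q = set(query.lower().split())
    t = set(title.lower().split())
    return len(q & t) / len(q) if q else 0.0


def score(p: Product, query: str, budget: Optional[float], max_reviews: int) -> float:
    """Weighted sum: 0.4 relevance + 0.3 value + 0.2 rating + 0.1 popularity."""
    # Assumption: no budget or an over-budget price contributes zero value.
    value = max(0.0, 1 - p.price / budget) if budget and budget > 0 else 0.0
    popularity = p.review_count / max_reviews if max_reviews else 0.0
    return (0.4 * relevance(query, p.title) + 0.3 * value
            + 0.2 * (p.rating / 5.0) + 0.1 * popularity)


def rank(products: List[Product], query: str,
         budget: Optional[float]) -> List[Tuple[float, Product]]:
    """Score every product, drop those below the threshold, sort best first."""
    max_reviews = max((p.review_count for p in products), default=0)
    scored = [(score(p, query, budget, max_reviews), p) for p in products]
    kept = [(s, p) for s, p in scored if s >= THRESHOLD]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Because relevance carries the largest weight, a product whose title matches none of the query words can rarely clear the threshold on rating and popularity alone.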

Layer 4: Browser Automation (Selenium)

The most complex layer. Selenium drives a real Chrome browser to:

Navigate to Amazon/Flipkart
Search for products
Extract live product data from the page
Select the AI-recommended product
Add to cart autonomously

The key challenge was reliability. We built an explicit wait strategy using WebDriverWait:

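from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "add-to-cart-button"))
)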

Voice Pipeline

The voice system chains three technologies:

$$\text{Audio} \xrightarrow{\text{PyAudio}} \text{Stream} \xrightarrow{\text{Google Speech API}} \text{Text} \xrightarrow{\text{Gemini}} \text{Action}$$

And for feedback:

$$\text{Response} \xrightarrow{\text{pyttsx3}} \text{Speech Output}$$

⚡ Challenges We Faced

1. Anti-Bot Detection

Amazon and Flipkart have sophisticated bot detection. Our Selenium automation was initially blocked within minutes. We solved this by:

Using realistic User-Agent headers
Adding randomized delays between actions (1-3 seconds)
Starting the browser in maximized mode (bots typically use headless/small windows)
Handling cookie consent and popup dialogs programmatically
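The randomized delay is the easiest of these measures to show in isolation. Here is a minimal sketch, assuming hypothetical helper names `human_pause` and `run_with_pauses`; the sleep function is injectable so the logic can be exercised without actually waiting.

```python
import random
import time
from typing import Callable, Iterable


def human_pause(low: float = 1.0, high: float = 3.0,
                sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep for a random duration between low and high seconds."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def run_with_pauses(actions: Iterable[Callable[[], None]],
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Run browser actions in order, pausing a random 1-3 seconds after each."""
    for action in actions:
        action()
        human_pause(sleep=sleep)
```

Each action would be a small callable wrapping one Selenium step, such as typing a query or clicking a button.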

2. Dynamic DOM Structures

E-commerce sites use dynamic class names that change on every deployment. A selector like div.sg-col-inner breaks frequently. Our solution: 20-selector fallback chains per action, tested weekly.

3. Voice Recognition in Noisy Environments

Initial voice accuracy was ~70% in real-world conditions. We implemented:

2-second ambient noise calibration at startup
Dynamic energy threshold adjustment
Pause threshold tuning (0.8s) to avoid cutting off mid-sentence

4. Budget Extraction Edge Cases

Users say budgets in unpredictable ways:

"around fifty thousand" (words, not numbers)
"50-60k range" (ranges)
"not more than 50000" (negation-based)
"₹50,000/-" (Indian currency formatting)

Building a regex system that handles all these required multiple iterations and real-user testing.
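To illustrate how these edge cases can be normalized, here is a simplified sketch rather than our actual regex system. The function name `budget_ceiling`, the tiny number-word table, and the decision to treat a range's upper end as the budget are assumptions.

```python
import re
from typing import Optional

# Deliberately tiny vocabulary; a real system would need far more coverage.
WORDS = {"ten": 10, "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
         "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "lakh": 100_000}


def _to_int(number: str, suffix: Optional[str]) -> int:
    """Convert '50,000' or '60' plus an optional 'k' suffix to an integer."""
    value = float(number.replace(",", ""))
    return int(value * 1000) if suffix else int(value)


def budget_ceiling(query: str) -> Optional[int]:
    """Return the maximum spend implied by the query, or None if none is found."""
    text = query.lower()
    # Ranges such as "50-60k": keep the upper bound as the ceiling.
    m = re.search(r"(\d[\d,]*)\s*(k)?\s*(?:-|to)\s*(\d[\d,]*)\s*(k)?\b", text)
    if m:
        return _to_int(m.group(3), m.group(4))
    # Spelled-out amounts such as "fifty thousand".
    m = re.search(rf"\b({'|'.join(WORDS)})\s+({'|'.join(MULTIPLIERS)})\b", text)
    if m:
        return WORDS[m.group(1)] * MULTIPLIERS[m.group(2)]
    # Plain digits, which also covers "₹50,000/-" and "not more than 50000".
    m = re.search(r"(\d[\d,]*)\s*(k)?\b", text)
    if m:
        return _to_int(m.group(1), m.group(2))
    return None
```

Negation-based phrasing needs no special handling here, because "not more than 50000" still names the ceiling; only the number has to be extracted.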

5. Cross-Platform Consistency

Amazon and Flipkart have completely different page structures, search mechanisms, and cart flows. We built a platform adapter pattern — a common interface with platform-specific implementations:

class AmazonAdapter:
    search_box_selector = "#twotabsearchtextbox"
    add_to_cart_selector = "#add-to-cart-button"

class FlipkartAdapter:
    search_box_selector = "input[name='q']"
    add_to_cart_selector = "button._2KpZ6l"
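One way to make the common interface explicit is a typing `Protocol` plus a small registry. This is a sketch under assumptions: `PlatformAdapter`, `get_adapter`, and `plan_purchase` are hypothetical names, and the adapter classes are repeated only so the example runs on its own.

```python
from typing import List, Optional, Protocol, Tuple


class PlatformAdapter(Protocol):
    """Attributes every platform adapter is expected to provide."""
    search_box_selector: str
    add_to_cart_selector: str


class AmazonAdapter:
    search_box_selector = "#twotabsearchtextbox"
    add_to_cart_selector = "#add-to-cart-button"


class FlipkartAdapter:
    search_box_selector = "input[name='q']"
    add_to_cart_selector = "button._2KpZ6l"


ADAPTERS = {"amazon": AmazonAdapter, "flipkart": FlipkartAdapter}


def get_adapter(platform: str) -> PlatformAdapter:
    """Look up an adapter by platform name (case-insensitive)."""
    try:
        return ADAPTERS[platform.lower()]()
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}") from None


def plan_purchase(adapter: PlatformAdapter,
                  query: str) -> List[Tuple[str, str, Optional[str]]]:
    """Describe the steps platform-agnostically: (action, selector, text)."""
    return [
        ("type", adapter.search_box_selector, query),
        ("click", adapter.add_to_cart_selector, None),
    ]
```

The calling code only ever touches the two shared attribute names, so supporting another store means adding one class and one registry entry.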

6. Keeping It Free

A core constraint was zero cost. Every technology choice was made to stay within free tiers:

Google Gemini: free tier (we planned around 60 requests/minute; actual limits vary by model and change over time)
Google Speech API: Free tier sufficient for demo
Selenium: Open source
Streamlit: Free deployment
No database hosting costs (in-memory storage)

This constraint actually forced better engineering decisions — simpler architecture, fewer dependencies, faster performance.

🏁 Final Thoughts

Building AIVA taught us that the gap between "AI chatbot" and "AI agent" is enormous. A chatbot tells you what to buy. An agent buys it for you. That difference required solving real-world problems — browser automation, voice processing, anti-bot detection — that no amount of prompt engineering alone could handle.

The result is an AI that truly acts on your behalf, saving hours of comparison shopping and reducing it to a single voice command.

Built With

amazon.in
beautifulsoup4
bigbasket
chrome
flipkart.com
google-cloud
google-gemini-2.5-flash-api
google-speech-recognition-api
in-memory database (native data structures)
lxml
numpy
pyaudio
python
python-dotenv
pyttsx3
requests
selenium
speechrecognition
streamlit
tkinter
urllib
webdriver-manager

Updates

Naman Mishra started this project — Feb 15, 2026 11:19 PM EST
