Bella: See It. Want It. Buy It.

Hands-free shopping with Meta Ray-Ban smart glasses, powered by a multi-model AI pipeline and autonomous checkout.


Inspiration

Shopping should be as simple as looking at something and saying "I want that." But today, identifying a product you see in the wild (a stranger's sunglasses, your crush's bucket hat) requires awkward Googling, squinting at logos, and hoping for the best.

We built Bella to close the gap between seeing a product and owning it. With Meta Ray-Ban smart glasses, you just look at an item and ask. Our AI agent identifies the exact brand and model, finds the best prices across retailers, and can even complete the purchase for you — entirely hands-free.


What It Does

Just look at any product and speak naturally:

“Bella, what bag is that?”
“Find me those shoes.”
“Buy it.”

Bella identifies the exact item, compares listings across retailers, and autonomously completes the purchase.

No searching. No typing. No hassle.

1. Glasses Mode (Hands-Free) Look at any product with your Ray-Bans and speak naturally, e.g. "Find me those shoes." Bella identifies the exact item, compares listings across retailers, and can autonomously complete the purchase, all within seconds.

2. Phone Mode (Camera Scan) Open the mobile app, snap a photo of any product, and instantly get brand identification and price comparison. Tap to view details, add to cart, or open the retailer link directly.

3. Autonomous Checkout (Browserbase Agent) When you're ready to buy, our Browserbase-powered browser agent navigates to the retailer, adds items to cart, fills in shipping and payment details, and completes the purchase autonomously. Just pick a card and confirm through voice or through the app.


How We Built It

The Detection Pipeline

Product identification is hard. Ask a generic vision model about a photo of Gigi Hadid wearing her Prada shades and you get "sunglasses," not "Prada SPR 17WS Cat-Eye." We solved this with a three-stage pipeline:

Stage 1: Object Localization (Grounding DINO) When the user provides a text hint (voice: "the blue earrings on the right," or auto-detected: "product"), Grounding DINO (a custom model we hosted using Runpod) locates the specific object in the image and crops it. For crowded scenes like store shelves with dozens of products, we generate multiple candidate crops and let GPT-4o pick the best match.

Stage 2: Brand Identification (GPT-4o) GPT-4o analyzes the cropped image with vision and identifies the specific brand, model, color, and material, returning a precise product name like "Hermes Birkin 25 Togo Leather Gold Hardware" that drives an accurate Google Shopping search.
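As a sketch of what this stage's request might look like, the crop can be packaged into a standard GPT-4o vision message. The prompt wording and helper name here are our illustration, not the team's exact code:

```python
# Hedged sketch of the Stage 2 request: ask GPT-4o to name the exact product
# in a cropped image. Prompt text and function name are illustrative.
import base64

def build_brand_id_messages(jpeg_bytes: bytes, hint: str = "product") -> list[dict]:
    """Build a chat.completions message list for a GPT-4o vision request."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Identify the exact brand, model, color, and material of this {hint}. "
                     "Answer with one precise product name suitable for a shopping search."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

# The resulting list is passed as `messages` to
# client.chat.completions.create(model="gpt-4o", messages=...).
```

A precise answer like "Hermes Birkin 25 Togo Leather Gold Hardware" is what makes the downstream shopping search accurate.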

Stage 3: Web Intelligence (SerpAPI) SerpAPI retrieves relevant shopping links for the identified product, returning direct purchase URLs and marketplace listings. This connects the scanned item to real-time buying options online, enabling instant add-to-cart functionality.

Stages 2 and 3 run in parallel via a ThreadPoolExecutor, cutting detection time by ~2 seconds. The pipeline also degrades gracefully: if any stage fails, the next one picks up with the best information gathered so far.
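A minimal sketch of this fan-out, with stubs standing in for the real GPT-4o and SerpAPI calls (function names and the exact data flow are illustrative):

```python
# Hedged sketch: run two detection stages concurrently instead of back-to-back.
from concurrent.futures import ThreadPoolExecutor

def identify_brand(crop: bytes) -> str:
    # Stage 2 stub: the real pipeline sends the crop to GPT-4o vision.
    return "Hermes Birkin 25 Togo Leather Gold Hardware"

def search_listings(query: str) -> list[dict]:
    # Stage 3 stub: the real pipeline queries SerpAPI / Google Shopping.
    return [{"title": query, "price": "$9,500"}]

def run_parallel_stages(crop: bytes, coarse_label: str) -> tuple[str, list[dict]]:
    """Kick off brand ID and a coarse shopping search at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        brand_future = pool.submit(identify_brand, crop)
        listings_future = pool.submit(search_listings, coarse_label)
        return brand_future.result(), listings_future.result()
```

Because both stages are network-bound API calls, threads (rather than processes) are enough to overlap their latency.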

Meta Ray-Ban Glasses

We hacked the Ray-Ban glasses to expose a live, persistent feed of what the user sees, along with continuous audio. We start a Messenger video call from the glasses and parse the feed in the browser using Chrome video and audio capture. The latest frame, together with the user's audio, is sent to our LiveKit agent for real-time processing to identify user needs. To send commands back to the glasses, a subscriber overrides the system microphone and relays the agent's audio into the call.

Voice Agent (LiveKit + OpenAI Realtime)

The voice shopping experience is built on LiveKit's WebRTC infrastructure with OpenAI's Realtime API (GPT-4o audio model, "marin" voice). The agent:

Listens: continuously reads video frames from the glasses' stream and stores the latest one in memory. It also receives the user's voice via LiveKit audio.

Sees and Understands: the agent converts the latest frame to JPEG and uses OpenAI Realtime to process the user's speech. When the user says something like "how much is this water bottle?", the model triggers a function tool call (search_product) that uses our vision pipeline.

Speaks: the agent reads back the results conversationally ("I found three options...") through LiveKit audio. These results come from the tool calls and additional reasoning from the model.

Acts: the user can say "add option 2 to my cart" or "buy it," triggering add_to_cart (saves to Supabase and shows in the user's mobile app) or buy_item (kicks off autonomous purchasing via Browserbase/Stagehand and navigates the Visa payment portal).

The agent runs with LiveKit Noise Cancellation (Krisp) for clear audio in noisy environments like stores.

Autonomous Purchase (Browserbase + Stagehand)

When the user is ready to checkout, our Browserbase agent takes over. Using Stagehand's AI-powered browser automation with GPT-4o as the decision model, the agent:

  1. Opens a cloud browser session on Browserbase
  2. Navigates to the retailer
  3. Searches for each product by title
  4. Adds items to cart (respecting quantities)
  5. Fills shipping information
  6. Enters payment details from the user's selected card
  7. Completes checkout and confirms the order

The entire flow is autonomous: the agent understands web UIs, handles dynamic pages, and adapts to different store layouts.
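In spirit, the numbered flow above reduces to a step plan handed to an instruction-following browser agent. In this sketch the `act` callable stands in for Stagehand-style natural-language actions; all names and strings are illustrative:

```python
# Hedged sketch: drive an instruction-following browser agent through checkout.
# `act` is a stand-in for a natural-language browser action; names are ours.
from typing import Callable

def run_checkout(act: Callable[[str], None], items: list[dict],
                 shipping: dict, card: dict) -> None:
    act("go to the retailer home page")
    for item in items:
        act(f"search for '{item['title']}'")
        act(f"add {item.get('quantity', 1)} of the top result to the cart")
    act("open the cart and proceed to checkout")
    act(f"fill the shipping form with {shipping}")
    act(f"enter card ending in {card['last4']} and submit payment")
    act("confirm the order and capture the confirmation number")

# Usage with a recording stub in place of a real browser agent:
log: list[str] = []
run_checkout(log.append, [{"title": "Prada sunglasses"}],
             {"city": "SF"}, {"last4": "4242"})
# log now holds the seven natural-language steps in order
```

Expressing each step in natural language is what lets the underlying agent adapt to different store layouts instead of relying on brittle CSS selectors.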

Visa Payment Integration

We use Visa’s Acceptance API to authorize real payment transactions when users check out through the app. The mobile app displays the user's Visa-branded cards; at checkout, it sends the card details to the backend, where a FastAPI route verifies the credentials and processes the payment in a sandbox environment. This keeps payment automation simple.
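A minimal sketch of the mock-fallback idea behind the sandbox route (the test-card set, field names, and response shape here are our assumptions, not Visa's actual API):

```python
# Hedged sketch of the sandbox authorize logic: approve known test cards
# locally, and only send unknown cards to the real sandbox API.
SANDBOX_TEST_CARDS = {"4111111111111111", "4242424242424242"}  # assumed test PANs

def authorize_payment(card_number: str, amount_cents: int) -> dict:
    """Mock-first authorization for demo safety; response shape is illustrative."""
    if card_number in SANDBOX_TEST_CARDS:
        return {"status": "AUTHORIZED", "amount": amount_cents, "mock": True}
    # Real flow (not shown): sign the request with Cybersource HTTP Signature
    # authentication and POST it to the sandbox payments endpoint.
    return {"status": "DECLINED", "reason": "unknown card in mock mode", "mock": True}
```

The mock fallback means demos never hang on a sandbox outage, while real test cards still exercise the full signing path.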

Real-Time Multi-Device Sync

Cart state syncs in real-time across all devices (glasses, phone, web) via Supabase Postgres with realtime subscriptions. When the voice agent adds an item through the glasses, the phone app updates instantly. Scan history and purchase records persist across sessions.


Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        META RAY-BAN GLASSES                         │
│                     (POV camera + microphone)                       │
│                  Casts livestream to browser tab                     │
└─────────────────────────────┬───────────────────────────────────────┘
                              │ Screen capture (getDisplayMedia)
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  meta-glasses-extract (Next.js)                      │
│                                                                     │
│  • Captures glasses video feed via getDisplayMedia                   │
│  • Captures audio (window audio or mic)                              │
│  • Publishes video + audio tracks to LiveKit room                    │
│  • Receives agent voice (routed to selected output device)           │
│  • Displays shopping results via LiveKit data channels               │
└────────────────────────────┬────────────────────────────────────────┘
                             │ LiveKit (WebRTC)
                             │ Video + Audio tracks
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   LiveKit Agent — "Bella" (Python)                   │
│                                                                     │
│  • OpenAI Realtime API for voice conversation                        │
│  • Continuously buffers latest video frame from glasses              │
│  • On user voice command → grabs frame → calls backend               │
│  • Tools:                                                            │
│    search_product  →  POST /pipeline (frame + text query)            │
│    add_to_cart     →  POST /add-to-cart                              │
│    get_cart        →  GET  /cart                                      │
│    buy_item        →  POST /buy (autonomous purchase)                │
│  • Speaks results back via LiveKit audio                             │
│  • Pushes structured data to web UI via LiveKit data channels        │
└────────────────────────────┬────────────────────────────────────────┘
                             │ HTTP
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     FastAPI Backend (Python)                         │
│                                                                     │
│  POST /pipeline ─────────────────────────────────────────────────── │
│  │                                                                  │
│  ├─ 1. Grounding DINO (RunPod GPU) — locate object by text query    │
│  │      └─ GPT-4o picks best crop if multiple candidates            │
│  ├─ 2. GPT-4o — brand + product name identification                 │
│  ├─ 3. SerpAPI — Google Shopping price search                       │
│  └─ 4. Supabase — upload images + save to glasses_captures          │
│                                                                     │
│  POST /add-to-cart ──► Supabase cart table                          │
│  GET  /cart ──► Supabase cart query                                 │
│  POST /buy ──► Browserbase/Stagehand autonomous purchase            │
│  POST /payment/authorize ──► Visa Acceptance / Cybersource          │
└───────┬──────────────┬──────────────┬──────────────┬────────────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
┌──────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────────┐
│  RunPod GPU  │ │   Supabase   │ │ Browserbase │ │ Visa/Cybersource │
│              │ │              │ │ + Stagehand │ │                  │
│ Grounding    │ │ • cart       │ │             │ │ • HTTP Signature │
│ DINO (tiny)  │ │ • glasses_   │ │ Cloud       │ │   authentication │
│ on RTX 4090  │ │   captures   │ │ browser     │ │ • Sandbox API    │
│ ~5s warm     │ │ • purchase_  │ │ agent       │ │ • Mock fallback  │
│              │ │   history    │ │ (GPT-4o)    │ │   for test cards │
└──────────────┘ │ • Storage    │ │             │ └──────────────────┘
                 │   bucket     │ │ Navigates   │
                 └──────────────┘ │ mock store  │
                        ▲         └──────┬──────┘
                        │                │
                        │                ▼
                        │         ┌─────────────┐
                        │         │  Mock Store  │
                        │         │  (Next.js)   │
                        │         │              │
                        │         │ Amazon-style │
                        │         │ e-commerce   │
                        │         │ cart +       │
                        │         │ checkout     │
                        │         └─────────────┘
                        │
┌───────────────────────┴─────────────────────────────────────────────┐
│                   Expo Mobile App (React Native)                     │
│                                                                     │
│  • Product scanning via phone camera                                 │
│  • Shopping results display                                          │
│  • Cart management with Visa card UI                                 │
│  • Checkout → POST /payment/authorize (Visa API)                     │
│  • Purchase history (Supabase)                                       │
└─────────────────────────────────────────────────────────────────────┘

Challenges We Ran Into

Brand-specific detection was inaccurate at first. We tried multiple approaches before landing on the three-stage pipeline: Grounding DINO for spatial precision, GPT-4o for brand expertise, and SerpAPI for web intelligence. The key insight was that each model excels at a different aspect of identification: Grounding DINO is good at locating general items with CV, while GPT-4o is good at naming specific ones. We also made the request multimodal, pairing the image with a text hint to give more context about the item.

Running DINO was too slow. Running the Grounding DINO model locally ate up 2-3 minutes per call. To speed it up, we hosted the model on RunPod, where GPU acceleration brings a call down to ~30 seconds. We hit some initial deployment problems, which turned out to be caused by the Python virtual environment's dependencies.

Processing the Meta glasses feed. The SDK Meta provides only supports Swift, was buggy, lagged severely, and lacked up-to-date documentation, so we wasted numerous hours trying to implement it. Ultimately, we hacked the glasses and used a web-server streaming approach, which had minimal lag and let us stream data in real time.

Payment sandboxes. We wanted automated payments, but we didn't want the agent actually buying items. To fix this, we built an entirely independent mock store, deployed on the internet, that uses Visa's payment API. Our initial Browserbase runs were also hitting CAPTCHAs; routing purchases through the mock store let us bypass these CAPTCHAs and bot-verification checks.

iOS HEIC format handling. iPhones capture photos in HEIC format by default. Our pipeline needed pillow-heif support to handle these natively without forcing JPEG conversion at the capture layer.
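The fix is a one-time opener registration at startup, roughly as follows (guarded here so the sketch also runs without the package installed):

```python
# Hedged sketch: register HEIF support so PIL can open iPhone .heic captures
# directly. Assumes the pillow-heif package is installed (pip install pillow-heif).
try:
    from pillow_heif import register_heif_opener
    register_heif_opener()  # after this, PIL.Image.open("photo.heic") works
    HEIF_SUPPORTED = True
except ImportError:
    HEIF_SUPPORTED = False  # fall back to requiring JPEG at the capture layer
```

With the opener registered, the rest of the pipeline treats HEIC like any other image format.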


Accomplishments We're Proud Of

  • End-to-end hands-free flow: Look at a product with your glasses, say "buy it," and have it purchased without touching a screen.
  • Accurate brand detection: Correctly identifies luxury brands (Hermes Birkin, Prada sunglasses, Grand Seiko watches) and niche products (specific Cetaphil skincare on a crowded CVS shelf)
  • Sub-45-second pipeline: From photo capture to price comparison results, with parallel API execution and hosted AI deployments using Runpod
  • Autonomous checkout: A browser agent that actually navigates stores, fills forms, and completes purchases using Browserbase

What We Learned

  • Multimodal AI pipelines are more powerful than any single model. Grounding DINO, GPT-4o, and SerpAPI each contribute something the others can't: spatial localization, brand reasoning, and live web data.
  • Browser automation with Browserbase (Stagehand) is surprisingly capable but needs careful prompt engineering along with context to handle edge cases
  • Real-time voice interfaces require much stricter prompt engineering than text. The agent must never switch languages, must stay concise, and must handle ambient noise gracefully.
  • Hardware integration (Meta glasses) adds significant complexity but creates a genuinely differentiated UX that can't be replicated in software alone.

What's Next

We see this product going beyond just shopping for day-to-day items. 24/7 AR glasses can be a new way to interact with the physical world.

Already, the system can automate routine tasks, from reordering household items to scheduling services and managing errands based on what the user hears and sees.

More importantly, we see this technology having massive implications for blue-collar and frontline workers. Because the glasses understand what the wearer sees and hears in real time with low latency, they can give insight on equipment repair, warehouse logistics, construction workflows, and even patient metadata.

Ultimately, we see the glasses becoming the assistant that lives on you at all times. This will create a new type of agent: one that actually understands your problems and gives you personalized insight on context you may not even be aware of.


Built With

AI/ML: Grounding DINO (IDEA-Research), GPT-4o (OpenAI), OpenAI Realtime API

AI Hosting: Runpod

Hardware: Meta Ray-Ban Smart Glasses

Voice: LiveKit (WebRTC), OpenAI Realtime, LiveKit Noise Cancellation (Krisp)

Browser Automation: Browserbase, Stagehand

Payments: Visa Cybersource Sandbox

Backend: FastAPI (Python), SerpAPI (Google Shopping)

Frontend: Expo (React Native), TypeScript, React Navigation

Infrastructure: Supabase (PostgreSQL + Storage + Realtime), LiveKit Cloud

Dev Tools: PyTorch, Transformers (HuggingFace), Pillow, httpx
