Most AI assistants today are trapped inside a text box. They can answer questions, but they cannot act within the user's workflow. I wanted to build an agent that goes beyond explanation: an AI that sees your screen, understands context, and takes action in real time. The goal was to move human-AI interaction from static responses to autonomous, context-aware execution. Specifically, I was inspired to apply this concept to online shopping, where users often spend significant time comparing prices and searching for deals across multiple platforms. This led to the idea of Phoenix Shopping Sniper: an AI that doesn't just find products but navigates the web to secure the best deals.

How I Built It: The project is a full-stack application built around a multimodal pipeline with four main parts.

Interactive AI Questionnaire: A React-based frontend in which an AI consultant narrows down the user's needs before the search begins.

Phoenix Engine v11: A Python-based automation core that uses Playwright and Set-of-Mark (SoM) visual tagging to navigate Amazon, eBay, and Walmart autonomously.

Gemini Multimodal Intelligence: gemini-2.0-flash handles real-time sentiment analysis, smart alternatives, and deal recommendations.

Modern Backend: tRPC v11 and Express provide type-safe API communication, with Drizzle ORM and SQLite for data management.

Challenges I Faced: One of the biggest challenges was turning the model's visual interpretation of a page into structured actions that can be executed. This required designing a strict action schema for the Phoenix Engine to prevent "hallucinated" clicks, meaning clicks on elements that do not exist, across diverse e-commerce layouts. Keeping the agent responsive in real time while executing autonomous browser actions safely and accurately was a second major hurdle, which I addressed with circuit breakers and error isolation.
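The idea of a strict action schema can be illustrated with a minimal sketch. This is not the Phoenix Engine's actual schema: the `Action` fields, the `ALLOWED_KINDS` vocabulary, and the `parse_action` helper are all hypothetical names chosen for illustration. The sketch assumes the model replies with a JSON object and that the engine knows which Set-of-Mark ids are currently tagged on the page, so any action that names an unknown mark is rejected before it reaches the browser.

```python
import json
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical action vocabulary; the real engine may support other kinds.
ALLOWED_KINDS = {"click", "type", "scroll", "done"}


@dataclass(frozen=True)
class Action:
    kind: str
    mark_id: Optional[int] = None  # Set-of-Mark label of the target element
    text: Optional[str] = None     # text to enter, for "type" actions only


class InvalidAction(ValueError):
    """Raised when the model's output does not describe an executable action."""


def parse_action(raw: str, visible_marks: Set[int]) -> Action:
    """Validate a model reply against the schema and the marks on the page."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidAction(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidAction("action must be a JSON object")

    kind = data.get("kind")
    if not isinstance(kind, str) or kind not in ALLOWED_KINDS:
        raise InvalidAction(f"unknown action kind: {kind!r}")

    mark_id = None
    text = None
    if kind in {"click", "type"}:
        mark_id = data.get("mark_id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(mark_id, int) or isinstance(mark_id, bool):
            raise InvalidAction("mark_id must be an integer")
        if mark_id not in visible_marks:
            # The model referred to an element that was never tagged.
            raise InvalidAction(f"mark {mark_id} is not on the page")
    if kind == "type":
        text = data.get("text")
        if not isinstance(text, str):
            raise InvalidAction("type action requires a text string")

    return Action(kind=kind, mark_id=mark_id, text=text)
```

With marks `{1, 2, 3}` on the page, `parse_action('{"kind": "click", "mark_id": 3}', {1, 2, 3})` returns an `Action`, while the same call with `"mark_id": 9` raises `InvalidAction`, so the caller can ask the model again instead of clicking blindly.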

What I Learned:

Action > Reasoning: Multimodal AI becomes far more useful when paired with structured execution. Agents must move beyond text to deliver real-world value.

Architecture Matters: At scale, a well-designed system architecture (like the one used here with tRPC and Phoenix Engine) is more critical than prompt engineering.

Resilience is Key: Building autonomous agents requires a "fail-safe" mindset, which led me to implement transparent local storage fallbacks and per-site circuit breakers.
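A per-site circuit breaker can be sketched as follows. This is an illustrative version under stated assumptions, not the project's code: the `CircuitBreaker` class, the `run_for_site` helper, and the threshold and cooldown defaults are hypothetical. Each site gets its own breaker, so repeated failures on one store stop requests to that store for a cooldown period without affecting the others.

```python
import time
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after max_failures consecutive failures, then allows a retry after cooldown_s."""

    def __init__(
        self,
        max_failures: int = 3,       # assumed default, tune per site
        cooldown_s: float = 60.0,    # assumed default, tune per site
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls pass through
        # Open: block until the cooldown has elapsed, then allow a trial call.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def run_for_site(
    site: str,
    task: Callable[[], T],
    breakers: Dict[str, CircuitBreaker],
    fallback: Optional[T] = None,
) -> Optional[T]:
    """Run task under the breaker for site; return fallback if blocked or failing."""
    breaker = breakers.setdefault(site, CircuitBreaker())
    if not breaker.allow():
        return fallback
    try:
        result = task()
    except Exception:
        # Error isolation: a failure on one site never propagates to the others.
        breaker.record_failure()
        return fallback
    breaker.record_success()
    return result
```

A caller would keep one `breakers` dictionary for the whole search and wrap each site's scraping function, for example `run_for_site("ebay", search_ebay, breakers, fallback=[])`, where `search_ebay` stands in for whatever function performs the search on that site.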
