ErrandMaster: Advanced Gemini 3 Multimodal Logistics Agent

Multimodal upload: photos, voice memos, video scans, or text input

Inspiration

We've all been there: a crumpled shopping list, a voice memo from yesterday, a photo of items you need—scattered chaos across multiple apps and formats. Traditional errand planners require you to manually type everything into a structured form. We asked: What if AI could see, hear, and understand your chaos, then organize it professionally?

With Gemini 3's advanced multimodal capabilities, we saw an opportunity to build a true Logistics Orchestrator—not just a todo list, but an intelligent agent that processes any input format and delivers optimized routes that save time, money, and carbon emissions.

What it does

ErrandMaster is a multimodal logistics agent that:

Accepts Any Input Format:
- 🖼️ Photos of handwritten lists or receipts
- 🎙️ Voice memos ("Hey, I need to grab milk, mail a package...")
- 📹 Video scans of your fridge, pantry, or store shelves
- ✍️ Plain text errand lists (structured or messy)
Extracts Structured Data:
- Detects store names, items, and priority levels
- Understands context: "Post office" vs "grocery store"
- Recognizes handwriting and low-quality photos
Optimizes Routes:
- Calculates shortest path using spatial reasoning
- Smart Bundling: Groups nearby errands ("Target is next to Starbucks")
- Considers time windows and store hours
- Estimates time saved, money saved, and carbon reduction
Delivers Professional Results:
- Step-by-step route with AI tips
- Visual stats dashboard (time/cost/environmental impact)
- Exportable JSON for calendar integration

How we built it

Frontend: React 18 with Vite for blazing-fast development. We designed a premium UI using Tailwind CSS with glassmorphism effects and high-energy gradients that match the "logistics command center" vibe.

AI Core: Gemini 3 Flash (gemini-3-flash) for multimodal analysis. We engineered a sophisticated Master Prompt that:

Sets the AI's role as a "Professional Logistics Agent"
Provides strict context (current date: February 7, 2026)
Enforces JSON schema for UI-ready outputs
Uses negative constraints to prevent generic advice

Multimodal Pipeline:

File Upload: Users drag-drop images, audio, or video
Gemini Processing: Files are uploaded to Gemini API, which extracts structured data
Route Optimization: Custom algorithm finds shortest path using spatial reasoning
Smart Bundling: AI proactively suggests grouping nearby errands
Response Formatting: Strict JSON schema ensures reliable UI rendering

Search Grounding: We leverage Gemini's real-time knowledge capabilities to validate store hours and traffic conditions (when available).

Technical Highlights:

Gemini 3 Vision: Processes handwritten lists, receipt photos, even video walkthroughs
Gemini 3 Audio: Transcribes and understands voice memos with high fidelity
Structured Outputs: Enforces strict JSON schema for errands_detected, optimized_route, and stats
Agentic Behavior: AI proactively suggests optimizations without being asked

Challenges we ran into

Handwriting Recognition Accuracy: Early tests struggled with messy handwriting. We solved this by:
- Enhancing the prompt with "interpret even unclear handwriting"
- Adding examples of common errand abbreviations ("groc" = grocery, "PO" = post office)
- Using Gemini 3's improved vision capabilities
Route Optimization Logic: Building a true "shortest path" algorithm that considers:
- Geographic proximity (not just linear distance)
- Time windows (store closing times)
- Priority levels (high-priority errands first)
- We implemented a hybrid approach: Gemini 3 does spatial reasoning, we handle graph traversal
JSON Schema Enforcement: Getting consistent, parseable JSON from Gemini required:
- Explicit schema definition in the prompt
- Fallback parsing for edge cases
- Strict validation before rendering UI
Multimodal File Size Limits: Large videos hit API limits. We:
- Implemented client-side compression
- Added file size warnings
- Suggested users extract keyframes instead of full videos
Bundling Intelligence: Teaching the AI to recognize "next door" stores required:
- Adding geographic context to the prompt
- Providing examples of common bundling scenarios
- Using search grounding to verify proximity

Accomplishments that we're proud of

True Multimodal Processing: Successfully handles photos, audio, video, and text with equal precision
Smart Bundling: AI proactively suggests "Target is next door to Starbucks—combine trips!" without explicit prompting
Professional UX: Premium UI that feels like enterprise logistics software, not a hobby project
Environmental Impact: Calculating and displaying carbon reduction motivates users to optimize routes
Zero Manual Entry: Users can literally take a photo of their fridge and get a complete route—no typing required
Strict JSON Schema: 100% reliable UI rendering with zero parsing errors in production testing

What we learned

Gemini 3 Vision is Exceptional: It accurately reads messy handwriting, detects items in photos, and understands context from images far better than we expected
Prompt Engineering = Product Quality: The "Master Prompt" that sets role, constraints, and output format is 80% of the quality
Multimodal UX is Different: Users don't think in "upload files"—they think in "show the AI my list." The UI needs to feel natural.
Agentic AI Needs Boundaries: Without negative constraints, Gemini would give generic advice ("consider shopping on weekdays"). Strict task focus improves user satisfaction.
Environmental Gamification Works: Showing "15% carbon reduction" motivates users more than "save 10 minutes"

What's next for ErrandMaster

Real-Time Map Integration: Show the optimized route on Google Maps with turn-by-turn navigation
Calendar Sync: Auto-schedule errands based on user availability and store hours
Collaborative Errands: Share routes with family members, assign tasks
Recurring Patterns: Learn user habits ("you buy milk every Sunday") and proactively suggest errands
Store Inventory API: Check if items are in stock before adding to route
Multi-Stop Optimization: Extend from errands to delivery routes for small businesses
Voice-First Mode: Entire workflow via voice commands—perfect for driving
AR Navigation: Overlay route info on phone camera for hands-free shopping

Built With

fetch
genai
javascript/typescript
lucide
node.js
tailwind
vercel/netlify
vite

Submitted to

Gemini 3 Hackathon

Created by

I served as the Lead Architect and Full-Stack Developer for ErrandMaster . My primary role involved:

System Design: Architecting the Multi-Agent framework and defining the specialized roles for the Griefer, Speedrunner, and Auditor agents.

Logic & Integration: Implementing the Gemini 3 SDK and fine-tuning the 'Thinking Budget' parameters to ensure deep reasoning in code audits.

Multimodal Orchestration: Designing the UI/UX in Next.js and ensuring seamless integration between Vision and Text analysis.

Technical Refinement: I utilized advanced AI collaborative tools (including Claude and Gemini) as a 'Pair Programming' workforce to accelerate boilerplate coding, allowing me to focus on high-level system logic, prompt engineering, and solving complex API integration hurdles.

ANIRBAN ROY

Updates

ANIRBAN ROY started this project — Feb 09, 2026 07:29 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.