🌱 Inspiration

Robots don't fail because models are weak.
They fail because training data doesn't cover edge cases.

Most perception pipelines train on flat, single-view images.
That works in labs, but it breaks in the real world.

Rare objects blend into environments.
Shadows become part of objects.
Labels merge into boxes.

We built Harvest AI to turn world models into a data factory:

  • 🌍 Real locations
  • 🧱 Explorable 3D worlds
  • 📸 Consistent multi-view imagery
  • 🧠 Verified edge-case labels at scale

🧠 What It Does

Harvest AI generates edge-case training data from world models using a multi-stage pipeline:

🗺️ Location Capture via Google Maps

Users click anywhere on a photorealistic 3D Google Map. The system captures satellite imagery from four cardinal directions (0°, 90°, 180°, 270°) using the Google Maps Static API and resolves place names via the Geocoding API.
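The capture step reduces to URL construction against the public Maps Static and Geocoding endpoints. A minimal sketch (zoom, size, and the key are placeholders; the four azimuths are recorded as metadata alongside each capture rather than as request parameters):

```python
# Hedged sketch of the capture URLs; endpoint paths and parameter names
# follow the public Google Maps Platform documentation.
from urllib.parse import urlencode

STATIC_MAPS = "https://maps.googleapis.com/maps/api/staticmap"
GEOCODE = "https://maps.googleapis.com/maps/api/geocode/json"
AZIMUTHS = [0, 90, 180, 270]  # stored as per-capture metadata

def static_map_url(lat: float, lng: float, key: str, zoom: int = 18) -> str:
    """Top-down satellite tile centred on the clicked point."""
    return STATIC_MAPS + "?" + urlencode({
        "center": f"{lat},{lng}",
        "zoom": zoom,
        "size": "640x640",
        "maptype": "satellite",
        "key": key,
    })

def geocode_url(lat: float, lng: float, key: str) -> str:
    """Reverse-geocode the clicked point to a human-readable place name."""
    return GEOCODE + "?" + urlencode({"latlng": f"{lat},{lng}", "key": key})
```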

🌍 World Model Generation

Captured images and azimuth metadata are uploaded to World Labs, which generates an explorable 3D world from the multi-view input.

🔄 Multi-View Extraction

The world's panoramic render is projected into multiple perspective views using yaw and pitch sweeps, producing near-360° coverage.

🎯 Object Detection with Judge Verification

A reference object image is matched against every extracted view using GPT-5.2 Vision routed through the Keywords AI gateway.
Each bounding box is verified by a second GPT-5.2 judge call that confirms, corrects, or removes detections, with up to two correction iterations per box.

🧩 Optional Product Placement

A product image can be composited into the scene using Gemini for AI-aware placement with proper lighting and perspective, or a deterministic ground-plane fallback.
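The deterministic fallback can be as simple as an alpha-composite onto the lower band of the frame. A minimal Pillow sketch (function and parameter names are ours, not the project's):

```python
# Illustrative ground-plane fallback: scale the product cutout and paste
# it, with its alpha mask, near the bottom of the scene.
from PIL import Image

def place_on_ground(scene: Image.Image, product: Image.Image,
                    scale: float = 0.2, x_frac: float = 0.5) -> Image.Image:
    scene = scene.convert("RGBA")
    w = max(1, int(scene.width * scale))
    h = max(1, int(product.height * w / product.width))
    product = product.convert("RGBA").resize((w, h))
    # Anchor the product's bottom edge near the frame bottom, which
    # approximates the ground plane in a perspective view.
    x = int(scene.width * x_frac - w / 2)
    y = int(scene.height * 0.9 - h)
    scene.paste(product, (x, y), product)
    return scene.convert("RGB")
```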


⚙️ How It Works

πŸ—ΊοΈ Google Maps β†’ Satellite Capture

The frontend renders Google Maps 3D using the gmp-map-3d web component. On click, the app fetches four directional satellite images via the Static Maps API and resolves the location name via the Geocoding API.

🌍 World Labs Generation

Each satellite image is uploaded via signed URLs to the World Labs API. Explicit azimuth angles preserve view consistency. The backend polls the World Labs operation endpoint until the world is ready.
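The polling half of this step is generic. A sketch with the World Labs request injected as a callable, since the operation endpoint's exact shape is not reproduced here (state names are illustrative):

```python
# Generic poll-until-ready loop; `fetch_status` wraps the actual call to
# the World Labs operation endpoint and returns "pending"/"done"/"failed".
import time
from typing import Callable

def poll_until_ready(fetch_status: Callable[[], str],
                     interval_s: float = 5.0,
                     timeout_s: float = 600.0) -> bool:
    """Return True when the world is ready, False on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = fetch_status()
        if state == "done":
            return True
        if state == "failed":
            return False
        time.sleep(interval_s)
    return False
```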

πŸ“ Panorama β†’ Perspective Views

The panoramic render is downloaded and projected into configurable perspective views using yaw and pitch sweep parameters.
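Under the hood this is an equirectangular-to-pinhole resampling: cast a ray for each output pixel, rotate it by the view's yaw and pitch, and look up the panorama at that ray's longitude/latitude. A compact NumPy sketch of the idea (nearest-neighbour sampling; the axis conventions are ours):

```python
# Illustrative panorama -> perspective projection for one yaw/pitch view.
import numpy as np

def pano_to_perspective(pano: np.ndarray, yaw_deg: float, pitch_deg: float,
                        fov_deg: float = 90.0, out_hw=(256, 256)) -> np.ndarray:
    H, W = out_hw
    f = 0.5 * W / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    # Camera-space ray per output pixel (x right, y down, z forward).
    xs = np.arange(W) - W / 2 + 0.5
    ys = np.arange(H) - H / 2 + 0.5
    x, y = np.meshgrid(xs, ys)
    z = np.full_like(x, f)
    v = np.stack([x, y, z], axis=-1)
    v /= np.linalg.norm(v, axis=-1, keepdims=True)
    # Rotate rays by pitch (about x), then yaw (about y).
    p, q = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(q), 0, np.sin(q)], [0, 1, 0], [-np.sin(q), 0, np.cos(q)]])
    v = v @ (Ry @ Rx).T
    # Ray direction -> panorama longitude/latitude -> pixel coordinates.
    lon = np.arctan2(v[..., 0], v[..., 2])      # [-pi, pi]
    lat = np.arcsin(np.clip(v[..., 1], -1, 1))  # [-pi/2, pi/2]
    ph, pw = pano.shape[:2]
    col = ((lon / (2 * np.pi) + 0.5) * pw).astype(int) % pw
    row = np.clip(((lat / np.pi + 0.5) * ph).astype(int), 0, ph - 1)
    return pano[row, col]
```

Sweeping `yaw_deg` in steps (and a couple of pitch bands) is what yields the near-360° set of views described above.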

🔍 Keywords AI Gateway + Detection Pipeline

All GPT-5.2 Vision calls are routed through the Keywords AI gateway, providing centralized logging, token tracking, latency metrics, and workflow tracing. Detection prompts return bounding boxes as structured JSON.
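Because the gateway speaks the OpenAI chat format, a detection call reduces to building a multimodal message with the view image inlined as a base64 data URL. A sketch (the base URL is assumed from Keywords AI's published OpenAI-compatible endpoint; model name and prompt are configuration):

```python
# Message builder for vision calls sent through an OpenAI-compatible
# gateway; any such client can send this payload with
# response_format={"type": "json_object"} to get boxes back as JSON.
import base64

GATEWAY_BASE_URL = "https://api.keywordsai.co/api"  # assumed; verify in docs

def vision_messages(prompt: str, image_bytes: bytes,
                    mime: str = "image/png") -> list:
    """Pair the detection prompt with the view image as a data URL."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
    return [{"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": data_url}},
    ]}]
```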

🧑‍⚖️ Judge Iteration System

Each bounding box is verified by a judge agent (GPT-5.2 via Keywords AI). The judge receives:

  • Reference object image
  • Cropped bounding-box region
  • Full scene with the box drawn

The judge returns:

  • CORRECT → keep
  • INCORRECT → corrected coordinates for re-judging
  • NOT_FOUND → remove false positive

The loop runs up to two correction iterations per detection.
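The control flow above can be isolated from the LLM calls by injecting the judge as a callable. A sketch using the verdict names from this section (what happens when the iteration budget runs out is our assumption):

```python
# Judge-loop control flow; `judge` wraps one GPT-5.2 judge call and
# returns (verdict, corrected_box_or_None).
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # x1, y1, x2, y2 pixel coordinates

def judge_box(box: Box,
              judge: Callable[[Box], Tuple[str, Optional[Box]]],
              max_iters: int = 2) -> Optional[Box]:
    """Return the accepted box, or None when the detection is rejected."""
    for _ in range(max_iters):
        verdict, corrected = judge(box)
        if verdict == "CORRECT":
            return box                 # keep
        if verdict == "NOT_FOUND":
            return None                # drop false positive
        box = corrected                # INCORRECT: re-crop and re-judge
    return None  # budget spent without agreement (our choice, not the source's)
```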

📡 Real-Time Streaming

The entire pipeline streams progress to the frontend via Server-Sent Events (SSE). A Gateway Log UI panel shows every LLM call in real time:

  • Call type (DETECT / JUDGE)
  • Model
  • Latency
  • Token counts
  • Color-coded judge verdicts
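The framing itself is simple: each gateway event becomes one JSON `data:` frame, yielded from a generator that the backend wraps in FastAPI's StreamingResponse with media_type="text/event-stream". A sketch with illustrative field names:

```python
# SSE framing for pipeline progress events (field names are illustrative).
import json

def sse_frame(event: dict) -> str:
    """Serialise one event as a Server-Sent Events data frame.

    A blank line terminates each frame, per the SSE wire format.
    """
    return f"data: {json.dumps(event)}\n\n"

def pipeline_events():
    # In the real pipeline these come from live DETECT/JUDGE gateway calls.
    yield sse_frame({"type": "DETECT", "model": "gpt-5.2", "latency_ms": 812})
    yield sse_frame({"type": "JUDGE", "verdict": "CORRECT", "tokens": 164})
```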

🗄️ Supabase Storage

Generated worlds, extracted images, and metadata are stored in Supabase. Images are organized by world ID, and world records persist in Postgres for reuse across sessions.
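A sketch of the layout convention (bucket name and path scheme are ours): keying every artefact by world ID is what makes views and metadata re-fetchable across sessions, e.g. via supabase-py's `client.storage.from_("worlds").upload(path, data)`.

```python
# Illustrative storage-key builder; categories are examples, not the
# project's actual folder names.
def storage_path(world_id: str, kind: str, name: str) -> str:
    """Build a bucket key like 'worlds/<id>/views/view_012.png'."""
    assert kind in {"views", "panorama", "meta"}
    return f"worlds/{world_id}/{kind}/{name}"
```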

🎨 Lovable

Used for rapid frontend scaffolding and UI prototyping.


🚧 Challenges We Ran Into

  • Maintaining view consistency across multi-image world generation and panorama-to-perspective extraction
  • Handling base64-encoded images reliably through the Keywords AI gateway
  • Building a judge iteration loop that re-crops and re-judges without compounding errors
  • Real-time SSE streaming for long-running pipelines with dozens of LLM calls
  • Dependency conflicts between keywordsai-tracing and OpenTelemetry on Python 3.9

πŸ† Accomplishments We’re Proud Of

  • End-to-end pipeline: real-world location → verified training dataset
  • Judge system that catches and corrects bad bounding boxes without retraining
  • Full observability of every LLM call via Keywords AI
  • Real-time Gateway Log UI with token usage, latency, and verdicts
  • All outputs stored and reusable via Supabase
  • Inline prompt fallback for immediate usability without managed prompts

🧰 Built With

  • World Labs API
  • Google Maps Platform (Maps JavaScript API, Static Maps API, Geocoding API)
  • Keywords AI (Gateway, Tracing, Prompt Management)
  • OpenAI GPT-5.2 Vision
  • Google Gemini
  • Supabase (Postgres + Storage)
  • Lovable
  • React, Vite, Tailwind CSS v4
  • FastAPI, Python
  • Server-Sent Events (SSE)

📚 What We Learned

Edge cases are a data problem, not a model problem.
Multi-view context is the missing layer between simulation and reality.

The detect → judge verification pattern generalizes beyond robotics.
Any vision pipeline benefits from a second-pass verifier.

Centralized LLM gateways are essential. Without tracing and logging, debugging multi-agent pipelines is guesswork.


🚀 What's Next

  • True 3D mesh ingestion
  • Physics-aware product placement for robotic manipulation tasks
  • Continuous video-based multi-view synthesis
  • Deeper integration with robotics simulators and perception training pipelines
  • Keywords AI managed prompts for A/B testing detection and judge logic
