🌱 Inspiration
Robots don't fail because models are weak.
They fail because training data doesn't cover edge cases.
Most perception pipelines train on flat, single-view images.
That works in labs, but it breaks in the real world.
Rare objects blend into environments.
Shadows become part of objects.
Labels merge into boxes.
We built Harvest AI to turn world models into a data factory:
- 📍 Real locations
- 🧱 Explorable 3D worlds
- 📸 Consistent multi-view imagery
- 🧐 Verified edge-case labels at scale
🧠 What It Does
Harvest AI generates edge-case training data from world models using a multi-stage pipeline:
🗺️ Location Capture via Google Maps
Users click anywhere on a photorealistic 3D Google Map. The system captures satellite imagery from four cardinal directions (0°, 90°, 180°, 270°) using the Google Maps Static API and resolves place names via the Geocoding API.
🌍 World Model Generation
Captured images and azimuth metadata are uploaded to World Labs, which generates an explorable 3D world from the multi-view input.
🔄 Multi-View Extraction
The world's panoramic render is projected into multiple perspective views using yaw and pitch sweeps, producing near-360° coverage.
🎯 Object Detection with Judge Verification
A reference object image is matched against every extracted view using GPT-5.2 Vision routed through the Keywords AI gateway.
Each bounding box is verified by a second GPT-5.2 judge call that confirms, corrects, or removes detections, with up to two correction iterations per box.
🧩 Optional Product Placement
A product image can be composited into the scene using Gemini for AI-aware placement with proper lighting and perspective, or a deterministic ground-plane fallback.
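As a rough illustration of the non-Gemini path, here is a minimal sketch of a deterministic ground-plane composite using Pillow. The function name, scale fractions, and assumed ground line are illustrative placeholders, not the project's actual values.

```python
# Hypothetical sketch of a ground-plane fallback: paste a cut-out product
# near an assumed ground line, scaled relative to scene height.
from PIL import Image

def place_on_ground(scene: Image.Image, product: Image.Image,
                    height_frac: float = 0.25, x_frac: float = 0.5) -> Image.Image:
    scene = scene.convert("RGBA")
    product = product.convert("RGBA")  # alpha channel from background removal

    # Scale the product to a fixed fraction of scene height, keeping aspect.
    target_h = int(scene.height * height_frac)
    target_w = int(product.width * target_h / product.height)
    product = product.resize((target_w, target_h))

    # Anchor the product's bottom edge on an assumed ground line at 90% height.
    x = int(scene.width * x_frac - target_w / 2)
    y = int(scene.height * 0.9) - target_h
    scene.alpha_composite(product, (max(x, 0), max(y, 0)))
    return scene.convert("RGB")
```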
⚙️ How It Works
🗺️ Google Maps → Satellite Capture
The frontend renders Google Maps 3D using the gmp-map-3d web component. On click, the app fetches four directional satellite images via the Static Maps API and resolves the location name via the Geocoding API.
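A minimal sketch of the backend half of this step follows. Note the Static Maps API has no camera-heading parameter for satellite tiles, so this sketch offsets the capture center along each cardinal bearing and records the bearing as azimuth metadata; that geometry is an assumption for illustration, not the app's exact logic.

```python
# Sketch: four satellite tiles plus a reverse-geocoded place name.
import math
import requests

STATIC_URL = "https://maps.googleapis.com/maps/api/staticmap"
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def capture_location(lat: float, lng: float, key: str):
    # Reverse-geocode the clicked point to a human-readable place name.
    geo = requests.get(GEOCODE_URL, params={"latlng": f"{lat},{lng}", "key": key}).json()
    place = geo["results"][0]["formatted_address"] if geo.get("results") else "unknown"

    views = []
    for bearing in (0, 90, 180, 270):  # azimuths forwarded to World Labs later
        step = 100 / 111_320  # ~100 m expressed in degrees of latitude
        view_lat = lat + step * math.cos(math.radians(bearing))
        view_lng = lng + step * math.sin(math.radians(bearing)) / math.cos(math.radians(lat))
        img = requests.get(STATIC_URL, params={
            "center": f"{view_lat},{view_lng}", "zoom": 18,
            "size": "640x640", "maptype": "satellite", "key": key,
        }).content
        views.append({"azimuth": bearing, "image": img})
    return place, views
```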
🌍 World Labs Generation
Each satellite image is uploaded via signed URLs to the World Labs API. Explicit azimuth angles preserve view consistency. The backend polls the World Labs operation endpoint until the world is ready.
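The polling half of this step might look like the sketch below. The operation URL, `done` flag, and `result` field are placeholders; only the poll-until-ready shape comes from this write-up, so consult the World Labs API docs for the real contract.

```python
# Generic poll-until-ready loop; endpoint shape and field names are placeholders.
import time
import requests

def wait_for_world(operation_url: str, api_key: str,
                   interval_s: float = 5.0, timeout_s: float = 1800.0) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        op = requests.get(operation_url, headers=headers, timeout=30).json()
        if op.get("done"):       # placeholder completion flag
            return op["result"]  # e.g. world id / render asset URLs
        time.sleep(interval_s)
    raise TimeoutError("World Labs generation did not finish in time")
```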
🔄 Panorama → Perspective Views
The panoramic render is downloaded and projected into configurable perspective views using yaw and pitch sweep parameters.
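Equirectangular-to-perspective projection is a standard technique, so a self-contained sketch with NumPy and OpenCV (both in the stack below) is shown here; output size, FOV, and the sweep schedule are illustrative values, not the project's exact parameters.

```python
# Sketch of equirectangular -> perspective projection with a yaw/pitch sweep.
import cv2
import numpy as np

def equirect_to_perspective(pano, yaw_deg, pitch_deg, fov_deg=90, size=(640, 640)):
    w, h = size
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)  # pinhole focal length (px)

    # Pixel grid -> unit rays in camera space (x right, y down, z forward).
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.stack([(xs - w / 2) / f, (ys - h / 2) / f, np.ones((h, w))], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate rays by pitch (about x), then yaw (about y).
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    rays = rays @ (ry @ rx).T

    # Rays -> panorama pixel coordinates via longitude/latitude.
    lon = np.arctan2(rays[..., 0], rays[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))  # [-pi/2, pi/2]
    ph, pw = pano.shape[:2]
    map_x = ((lon / np.pi + 1) / 2 * pw).astype(np.float32)
    map_y = ((2 * lat / np.pi + 1) / 2 * ph).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_WRAP)

# Yaw/pitch sweep over the panoramic render for near-360° coverage.
pano = cv2.imread("world_render.png")  # equirectangular render (placeholder path)
views = [equirect_to_perspective(pano, yaw, pitch)
         for yaw in range(0, 360, 45) for pitch in (-20, 0, 20)]
```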
🔌 Keywords AI Gateway + Detection Pipeline
All GPT-5.2 Vision calls are routed through the Keywords AI gateway, providing centralized logging, token tracking, latency metrics, and workflow tracing. Detection prompts return bounding boxes as structured JSON.
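Since Keywords AI exposes an OpenAI-compatible endpoint, a single detection call might look like the sketch below. The base URL follows Keywords AI's documented gateway pattern; the prompt, JSON schema, and helper are illustrative, and the model id is taken from this write-up.

```python
# Sketch of one detection call routed through the Keywords AI gateway.
import base64
import json
from openai import OpenAI

client = OpenAI(
    api_key="<KEYWORDS_AI_API_KEY>",
    base_url="https://api.keywordsai.co/api/",  # gateway handles logging/tracing
)

def detect(scene_png: bytes, reference_png: bytes) -> list:
    to_url = lambda b: "data:image/png;base64," + base64.b64encode(b).decode()
    resp = client.chat.completions.create(
        model="gpt-5.2",  # model name as used in this write-up
        response_format={"type": "json_object"},  # structured JSON boxes
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                 'Find every instance of the reference object in the scene. '
                 'Return JSON: {"boxes": [{"x": 0, "y": 0, "w": 0, "h": 0}]}'},
                {"type": "image_url", "image_url": {"url": to_url(reference_png)}},
                {"type": "image_url", "image_url": {"url": to_url(scene_png)}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)["boxes"]
```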
🧑‍⚖️ Judge Iteration System
Each bounding box is verified by a judge agent (GPT-5.2 via Keywords AI). The judge receives:
- Reference object image
- Cropped bounding-box region
- Full scene with the box drawn
The judge returns:
- CORRECT → keep
- INCORRECT → corrected coordinates for re-judging
- NOT_FOUND → remove false positive
Runs up to two iterations per detection; a condensed sketch of the loop follows.
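In the sketch, `judge_once` stands in for the actual GPT-5.2 judge call through the gateway, and the handling of a still-incorrect box after the final iteration is an assumption the write-up does not pin down.

```python
# Sketch of the judge iteration loop; judge_once(scene, reference, box)
# is a stand-in for the real LLM call and returns (verdict, corrected_box).
MAX_ITERS = 2

def verify_detection(scene, reference, box, judge_once):
    for _ in range(MAX_ITERS):
        verdict, corrected = judge_once(scene, reference, box)
        if verdict == "CORRECT":
            return box        # keep the detection
        if verdict == "NOT_FOUND":
            return None       # drop the false positive
        box = corrected       # INCORRECT: re-crop and re-judge next pass
    return box                # assumption: keep the last correction
```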
📡 Real-Time Streaming
The entire pipeline streams progress to the frontend via Server-Sent Events (SSE); a minimal endpoint sketch follows the list below. A Gateway Log UI panel shows every LLM call in real time:
- Call type (DETECT / JUDGE)
- Model
- Latency
- Token counts
- Color-coded judge verdicts
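A stripped-down version of such an endpoint in FastAPI might look like this; the route name, event type, and payload fields are illustrative, not the project's actual schema.

```python
# Minimal FastAPI SSE endpoint streaming pipeline progress events.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def pipeline_events():
    # The real pipeline yields one event per stage / LLM call as it happens.
    for stage in ("capture", "world", "views", "detect", "judge"):
        await asyncio.sleep(0)  # yield control so events flush promptly
        payload = json.dumps({"stage": stage, "status": "done"})
        yield f"event: progress\ndata: {payload}\n\n"  # SSE wire format

@app.get("/pipeline/stream")
async def stream_pipeline():
    return StreamingResponse(pipeline_events(), media_type="text/event-stream")
```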
🗄️ Supabase Storage
Generated worlds, extracted images, and metadata are stored in Supabase. Images are organized by world ID, and world records persist in Postgres for reuse across sessions.
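With supabase-py, the persistence step could be sketched as below; the bucket and table names are assumptions, not necessarily the project's.

```python
# Sketch of storing a view image plus its Postgres record with supabase-py.
from supabase import create_client

sb = create_client("<SUPABASE_URL>", "<SUPABASE_SERVICE_KEY>")

def save_view(world_id: str, view_name: str, png: bytes, meta: dict) -> str:
    # Images are grouped by world id inside one storage bucket.
    path = f"{world_id}/{view_name}.png"
    sb.storage.from_("worlds").upload(path, png, {"content-type": "image/png"})
    # A row per view in Postgres lets worlds be reused across sessions.
    sb.table("world_views").insert({"world_id": world_id, "path": path, **meta}).execute()
    return path
```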
🎨 Lovable
Used for rapid frontend scaffolding and UI prototyping.
🚧 Challenges We Ran Into
- Maintaining view consistency across multi-image world generation and panorama-to-perspective extraction
- Handling base64-encoded images reliably through the Keywords AI gateway
- Building a judge iteration loop that re-crops and re-judges without compounding errors
- Real-time SSE streaming for long-running pipelines with dozens of LLM calls
- Dependency conflicts between keywordsai-tracing and OpenTelemetry on Python 3.9
🏆 Accomplishments We're Proud Of
- End-to-end pipeline: real-world location → verified training dataset
- Judge system that catches and corrects bad bounding boxes without retraining
- Full observability of every LLM call via Keywords AI
- Real-time Gateway Log UI with token usage, latency, and verdicts
- All outputs stored and reusable via Supabase
- Inline prompt fallback for immediate usability without managed prompts
🧰 Built With
- World Labs API
- Google Maps Platform (Maps JavaScript API, Static Maps API, Geocoding API)
- Keywords AI (Gateway, Tracing, Prompt Management)
- OpenAI GPT-5.2 Vision
- Google Gemini
- Supabase (Postgres + Storage)
- Lovable
- React, Vite, Tailwind CSS v4
- FastAPI, Python
- Server-Sent Events (SSE)
📚 What We Learned
Edge cases are a data problem, not a model problem.
Multi-view context is the missing layer between simulation and reality.
The detect → judge verification pattern generalizes beyond robotics.
Any vision pipeline benefits from a second-pass verifier.
Centralized LLM gateways are essential. Without tracing and logging, debugging multi-agent pipelines is guesswork.
🚀 What's Next
- True 3D mesh ingestion
- Physics-aware product placement for robotic manipulation tasks
- Continuous video-based multi-view synthesis
- Deeper integration with robotics simulators and perception training pipelines
- Keywords AI managed prompts for A/B testing detection and judge logic
Built With
- axios
- fastapi
- framer-motion
- google-gemini
- google-genai
- javascript
- keywords-ai
- lovable
- numpy
- onnx-runtime
- openai-api
- openai-gpt-4o
- openai-gpt-5.2-vision
- opencv
- pillow
- python
- react
- react-three/drei
- react-three/fiber
- rembg
- shadcn-ui
- sse
- supabase
- tailwind-css
- tailwindcss
- three.js
- trae
- typescript
- ultralytics
- uvicorn
- vite
- worldlabs
- worldlabs-api

