Inspiration

We have a problem: too many small physical objects, not enough brain space to remember where they are. The solution? Treat finding a physical object on a desk exactly like querying a database.

What it does

Place an item on any slot of the carousel. A camera detects it, and the system encodes the image into a vector embedding with CLIP, strips the background, and displays the item in a web UI grid. Want it back? Just say so: the system uses speech recognition to match your request to the right item and physically spins the turntable to that slot. You can also click a cell in the grid directly. No typing, no sorting, no labels. Just put things down and talk to get them back.
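To make the "spins the turntable to that slot" step concrete, here is a minimal sketch of how a target slot could be turned into a signed stepper move along the shorter direction. The function name, the 8-slot layout, and the 200 steps per revolution are all assumptions for illustration, not the values used in the actual build.

```python
def steps_to_slot(current_slot: int, target_slot: int,
                  num_slots: int = 8, steps_per_rev: int = 200) -> int:
    """Return a signed step count for the shortest rotation to target_slot.

    Positive means forward, negative means backward.
    num_slots and steps_per_rev are hypothetical defaults.
    """
    if not (0 <= current_slot < num_slots and 0 <= target_slot < num_slots):
        raise ValueError("slot index out of range")
    # Number of slots to travel when rotating forward only.
    forward = (target_slot - current_slot) % num_slots
    # Go backward instead when that path is strictly shorter.
    delta = forward if forward <= num_slots - forward else forward - num_slots
    return round(delta * steps_per_rev / num_slots)


if __name__ == "__main__":
    print(steps_to_slot(0, 2))  # two slots forward
    print(steps_to_slot(0, 7))  # one slot backward
```

If `steps_per_rev` is not a multiple of `num_slots`, the rounding here would accumulate drift over many moves, so a real controller would probably track the absolute step position instead of summing relative moves.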

How we built it

The turntable itself is custom 3D-printed, driven by a stepper motor connected to a microcontroller. The backend is a FastAPI server running OpenCV for camera monitoring and CLIP for image encoding and matching. Speech input is captured and transcribed, then matched against the CLIP vector inventory using cosine similarity to find the right slot.
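The matching step can be sketched roughly as follows. This assumes the transcribed request has already been embedded into the same vector space as the stored item embeddings (for example with CLIP's text encoder), and it uses tiny hand-written vectors in place of real CLIP output. The names `best_slot` and `min_score`, and the 0.2 cutoff, are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_slot(query: Sequence[float],
              inventory: Dict[int, Sequence[float]],
              min_score: float = 0.2) -> Optional[Tuple[int, float]]:
    """Return (slot, score) for the most similar stored embedding.

    Returns None when the inventory is empty or nothing reaches min_score,
    so the turntable does not spin to a poor match.
    """
    best: Optional[Tuple[int, float]] = None
    for slot, embedding in inventory.items():
        score = cosine_similarity(query, embedding)
        if best is None or score > best[1]:
            best = (slot, score)
    if best is None or best[1] < min_score:
        return None
    return best


if __name__ == "__main__":
    # Toy 3-dimensional embeddings keyed by slot number.
    inventory = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.6, 0.8, 0.0]}
    print(best_slot([0.0, 1.0, 0.1], inventory))
```

A linear scan is plenty for a handful of slots. The cutoff matters more than the search strategy, since a request for something not on the carousel should fail rather than fetch the least-wrong item.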

Challenges we ran into

Getting CLIP's zero-shot labels to be consistently useful on arbitrary household objects was trickier than expected. The image processing also required careful calibration to avoid false triggers from lighting changes and backgrounds.
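One common way to cut down on lighting-related false triggers is to compare frames after removing each frame's average brightness, and to require the change to persist for several frames before reacting. The sketch below illustrates that idea only and is not necessarily how the calibration was done here. It uses flat lists of grayscale values rather than OpenCV arrays, and `SlotChangeDetector`, `threshold`, and `hold_frames` are hypothetical names and values.

```python
from typing import Sequence


def normalized_diff(reference: Sequence[float], frame: Sequence[float]) -> float:
    """Mean absolute pixel difference after subtracting each frame's mean.

    A uniform brightness shift (lights turned up or down) cancels out.
    """
    if len(reference) != len(frame) or not frame:
        raise ValueError("frames must be non-empty and the same size")
    ref_mean = sum(reference) / len(reference)
    frame_mean = sum(frame) / len(frame)
    total = sum(abs((f - frame_mean) - (r - ref_mean))
                for r, f in zip(reference, frame))
    return total / len(frame)


class SlotChangeDetector:
    """Reports a change only after it persists for hold_frames frames."""

    def __init__(self, reference: Sequence[float],
                 threshold: float = 10.0, hold_frames: int = 3) -> None:
        self.reference = list(reference)
        self.threshold = threshold      # hypothetical value, needs tuning
        self.hold_frames = hold_frames  # hypothetical value, needs tuning
        self._streak = 0

    def update(self, frame: Sequence[float]) -> bool:
        """Feed one frame; return True once the change has held long enough."""
        if normalized_diff(self.reference, frame) > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # a single quiet frame resets the count
        return self._streak >= self.hold_frames


if __name__ == "__main__":
    empty_slot = [100.0] * 16
    detector = SlotChangeDetector(empty_slot)
    print(detector.update([130.0] * 16))  # uniformly brighter frame
    with_object = [100.0] * 12 + [220.0] * 4
    print([detector.update(with_object) for _ in range(3)])
```

Mean subtraction only handles uniform brightness changes. Shadows or glare that affect part of the frame would still need a per-region threshold or a slowly updated reference image.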

Accomplishments that we're proud of

You put a pen down, it appears on screen with its background removed, and you can say "pen" and the turntable spins to it. Watching the full loop work end-to-end for the first time, with no manual input at any step, felt really nice.

What we learned

Getting hardware to work is painful but also very rewarding!

What's next for Fetch

More slots, smarter labeling (fine-tuned CLIP for richer descriptions), and multi-turntable support, essentially turning it into an automated warehouse system.
