Inspiration

Almost every workspace — warehouses, workshops, classrooms, and small businesses — runs into the same frustrating problem:

“Where did that thing go?”

Traditional inventory systems rely on barcodes, scanners, and manual updates. While they work in theory, they depend heavily on people consistently scanning items or updating spreadsheets. In practice, that rarely happens. Items get moved quickly, updates are forgotten, and the system becomes unreliable.

We wanted to explore a simpler approach: what if inventory systems didn’t depend on people remembering to log items at all?

What if the system could simply watch what you were doing and remember for you?

That idea became Storganize — an AI-powered, hands-free inventory system where users hold up an item, place it in a zone, and later ask where it is using natural language.

What it does

Storganize allows users to track and locate items without scanning barcodes, typing entries, or manually updating logs.

A fixed camera observes the workspace. When a user holds up an item, the system identifies it using computer vision. The user then places the item in a predefined zone, and the system automatically records where it was placed.

Later, users can simply ask:

“Where is the blue toolbox?”

The system highlights the correct storage zone on the screen and reads the location aloud.

The result is a completely hands-free inventory experience.

How we built it

Storganize runs entirely in the browser and combines computer vision, voice interaction, and spatial tracking.

When motion is detected in the camera feed, the system captures a high-quality frame and sends it to Google's Gemini vision model to identify the object. The AI returns structured information such as the item name, category, and distinguishing features.
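A minimal sketch of that identification step, assuming the @google/generative-ai JavaScript SDK and a JSON-only prompt (the model name, key handling, and exact response fields are illustrative):

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Structured description we ask the model to return (fields are illustrative).
interface ItemInfo {
  name: string;
  category: string;
  features: string[];
}

const GEMINI_API_KEY = "YOUR_KEY"; // placeholder; key handling is up to the app
const genAI = new GoogleGenerativeAI(GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

// Identify the held-up item from a single base64-encoded JPEG frame.
async function identifyItem(base64Jpeg: string): Promise<ItemInfo> {
  const prompt =
    "Identify the object the user is holding up. Respond with JSON only: " +
    '{"name": string, "category": string, "features": string[]}';

  const result = await model.generateContent([
    prompt,
    { inlineData: { mimeType: "image/jpeg", data: base64Jpeg } },
  ]);

  // Assumes the model follows the JSON-only instruction.
  return JSON.parse(result.response.text().trim()) as ItemInfo;
}
```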

While placing the item, MediaPipe Hands tracks the user’s hand position in real time. When the hand remains within a predefined zone long enough, the item is automatically assigned to that location.
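A minimal sketch of the dwell-time logic, assuming normalized hand coordinates coming from MediaPipe Hands; the zone shape, dwell duration, and callback are illustrative:

```typescript
// A storage zone, stored in normalized (0..1) coordinates so it is
// independent of the rendered video size.
interface Zone {
  id: string;
  x: number;      // left edge, 0..1
  y: number;      // top edge, 0..1
  width: number;  // 0..1
  height: number; // 0..1
}

const DWELL_MS = 1500; // how long the hand must stay inside a zone (illustrative)

let currentZoneId: string | null = null;
let enteredAt = 0;

// Called on every hand-tracking result; (handX, handY) is a normalized
// landmark such as the wrist or middle-finger knuckle.
function updatePlacement(
  handX: number,
  handY: number,
  zones: Zone[],
  onPlace: (zoneId: string) => void,
): void {
  const zone = zones.find(
    (z) =>
      handX >= z.x && handX <= z.x + z.width &&
      handY >= z.y && handY <= z.y + z.height,
  );

  if (!zone) {
    currentZoneId = null;
    return;
  }

  if (zone.id !== currentZoneId) {
    // Entered a new zone: restart the dwell timer.
    currentZoneId = zone.id;
    enteredAt = performance.now();
  } else if (performance.now() - enteredAt >= DWELL_MS) {
    // Hand has stayed inside long enough: assign the item to this zone.
    onPlace(zone.id);
    currentZoneId = null;
  }
}
```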

To interact with the system, users can speak natural language commands such as:

“Scan item”

“Where is the screwdriver?”

“Remove toolbox”

Voice responses are generated using ElevenLabs, providing natural spoken feedback.
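One way to wire this together in the browser, assuming the Web Speech API for recognition and the ElevenLabs text-to-speech REST endpoint; the command patterns, voice ID, and model ID below are placeholders:

```typescript
// Listen for a single spoken command (Web Speech API is assumed here;
// the prefixed constructor covers Chromium-based browsers).
function listenOnce(): Promise<string> {
  const Recognition =
    (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
  const recognizer = new Recognition();
  recognizer.lang = "en-US";
  return new Promise((resolve, reject) => {
    recognizer.onresult = (e: any) => resolve(e.results[0][0].transcript);
    recognizer.onerror = (e: any) => reject(e.error);
    recognizer.start();
  });
}

// Very small command router (patterns are illustrative).
function routeCommand(transcript: string): { action: string; item?: string } {
  const text = transcript.toLowerCase().trim();
  if (text.startsWith("scan")) return { action: "scan" };
  const where = text.match(/^where is (?:the )?(.+)$/);
  if (where) return { action: "locate", item: where[1] };
  const remove = text.match(/^remove (?:the )?(.+)$/);
  if (remove) return { action: "remove", item: remove[1] };
  return { action: "unknown" };
}

// Speak a response with ElevenLabs text-to-speech (voice ID is a placeholder).
async function speak(text: string, apiKey: string, voiceId: string): Promise<void> {
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }),
    },
  );
  const audio = new Audio(URL.createObjectURL(await res.blob()));
  await audio.play();
}
```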

To make the system efficient, we implemented a local frame quality gate. Instead of sending every camera frame to the AI model, the system evaluates frames locally using motion detection and image quality checks. Out of hundreds of frames processed, only a few are actually sent to the vision model, dramatically reducing API usage.
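A minimal sketch of the motion half of that gate, using downscaled grayscale frame differencing; the sample size and threshold are illustrative, and a blur or brightness check can be layered on the same frame before it is sent:

```typescript
// Decide locally whether a captured frame is worth sending to the vision model.
// Motion is estimated by differencing downscaled grayscale frames.
const MOTION_THRESHOLD = 12; // mean absolute gray difference (0..255), illustrative
const SAMPLE_W = 64;
const SAMPLE_H = 36;

let previousGray: Uint8ClampedArray | null = null;

// Downscale the current video frame and convert it to grayscale.
function toGray(video: HTMLVideoElement): Uint8ClampedArray {
  const canvas = document.createElement("canvas");
  canvas.width = SAMPLE_W;
  canvas.height = SAMPLE_H;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(video, 0, 0, SAMPLE_W, SAMPLE_H);
  const { data } = ctx.getImageData(0, 0, SAMPLE_W, SAMPLE_H);
  const gray = new Uint8ClampedArray(SAMPLE_W * SAMPLE_H);
  for (let i = 0; i < gray.length; i++) {
    const o = i * 4;
    gray[i] = 0.299 * data[o] + 0.587 * data[o + 1] + 0.114 * data[o + 2];
  }
  return gray;
}

// Returns true only when there is enough motion since the last sampled frame.
function shouldSendFrame(video: HTMLVideoElement): boolean {
  const gray = toGray(video);
  if (!previousGray) {
    previousGray = gray;
    return false;
  }
  let diffSum = 0;
  for (let i = 0; i < gray.length; i++) {
    diffSum += Math.abs(gray[i] - previousGray[i]);
  }
  previousGray = gray;
  return diffSum / gray.length > MOTION_THRESHOLD;
}
```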

Challenges we ran into

One of the biggest challenges was efficiency. Continuously streaming camera frames to an AI model would be expensive and introduce unnecessary latency. To solve this, we built a motion detection system and frame quality filter that determines when a frame is actually worth sending to the model.

Another challenge was spatial accuracy. Since the system relies on a camera feed, we needed a way to ensure that items were assigned to the correct storage zones. We solved this by storing zones using normalized coordinates so they remain aligned regardless of screen size.
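A sketch of that mapping, assuming rectangular zones stored as fractions of the frame:

```typescript
// Convert a zone stored in normalized (0..1) coordinates to pixel coordinates
// for the current video element, so highlights stay aligned at any screen size.
interface PixelRect { x: number; y: number; width: number; height: number }

function zoneToPixels(
  zone: { x: number; y: number; width: number; height: number },
  video: HTMLVideoElement,
): PixelRect {
  return {
    x: zone.x * video.clientWidth,
    y: zone.y * video.clientHeight,
    width: zone.width * video.clientWidth,
    height: zone.height * video.clientHeight,
  };
}
```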

We also introduced a simple depth estimation technique using palm size as a proxy for distance. After calibration, the system can determine whether a user’s hand is actually inside a zone rather than just overlapping it in 2D.
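A sketch of that calibration, assuming MediaPipe hand landmarks in normalized image coordinates, with the wrist-to-middle-knuckle distance as the palm-size proxy; the tolerance is illustrative:

```typescript
// Landmarks are MediaPipe Hands points in normalized image coordinates.
interface Landmark { x: number; y: number }

// Apparent palm size: distance from wrist (index 0) to middle-finger MCP (index 9).
function palmSize(landmarks: Landmark[]): number {
  const wrist = landmarks[0];
  const mcp = landmarks[9];
  return Math.hypot(mcp.x - wrist.x, mcp.y - wrist.y);
}

// Calibration: record the palm size once while the hand is held at a known
// reference distance from the camera (e.g. at the shelf plane).
let referencePalmSize: number | null = null;

function calibrate(landmarks: Landmark[]): void {
  referencePalmSize = palmSize(landmarks);
}

// Apparent size shrinks roughly in proportion to distance, so
// referenceSize / currentSize approximates currentDistance / referenceDistance;
// values near 1 mean the hand is at the calibrated shelf distance.
function relativeDepth(landmarks: Landmark[]): number | null {
  if (!referencePalmSize) return null;
  return referencePalmSize / palmSize(landmarks);
}

// A hand counts as "inside" a zone only if it overlaps the zone in 2D AND its
// relative depth is close to the calibrated distance (tolerance is illustrative).
function depthMatches(landmarks: Landmark[], tolerance = 0.25): boolean {
  const d = relativeDepth(landmarks);
  return d !== null && Math.abs(d - 1) <= tolerance;
}
```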

Accomplishments that we're proud of

One of our biggest accomplishments was successfully building a fully functional hands-free inventory prototype within a short hackathon timeframe. Instead of relying on traditional barcode systems or manual input, we created a workflow where users can simply show an item, place it in a zone, and later ask the system where it is using natural language.

We are especially proud of the efficiency architecture we designed. Rather than continuously sending camera frames to an AI model, we implemented a local motion detection and frame-quality gating system that filters frames in the browser. Out of hundreds of frames analyzed, only a few high-quality frames are sent to the AI model, significantly reducing API usage and making the system more scalable.

Another accomplishment was integrating multiple complex technologies into a seamless experience. We combined computer vision for item recognition, real-time hand tracking for spatial placement, voice commands for interaction, and visual zone highlighting to create an intuitive interface that requires no manual data entry.

Finally, we are proud of turning an idea into a working system that demonstrates how AI and edge processing can simplify real-world workflows, making inventory management faster, more natural, and easier for small teams and businesses.

What we learned

Building Storganize taught us how powerful edge processing can be when combined with AI.

By handling motion detection, frame filtering, and hand tracking locally in the browser, we were able to create a system that feels responsive while significantly reducing API calls.

We also learned that successful AI applications depend not only on the model itself, but on good interaction design. Voice commands, visual feedback, and simple workflows made the system feel intuitive rather than technical.

What's next for Storganize

Storganize currently works as a single-camera prototype, but the concept can scale much further.

Future improvements could include:

Cloud-synced inventories across multiple devices

Multi-camera environments for larger storage spaces

Improved object recognition for more complex inventories

Automatic detection when items are removed or relocated

Our goal is to make inventory systems effortless, allowing people to focus on their work instead of tracking where things are.
