We started with Shelf Sight, a web app that uses hosted LLM endpoints (such as Gemini) to help visually impaired shoppers. It analyzes the scene to identify products, read prices, spot discount labels, and extract key packaging details, then lets the user ask natural-language questions and hear spoken answers through their earphones. We saw online that many visually impaired shoppers have trouble reading product information, especially the price when it is printed in small type.

Afterwards, we realized we could run most of the useful functionality locally, for free, on an iPhone.

What follows is our proof that utilities for visually impaired people don't have to be costly or slow, and that they can (and should) be completely private and free to run.

We prioritized local inference as much as possible, so the following two utilities are completely private and run even on 2020 iPhones, with a maximum delay of 3 seconds.

pepperonipizza and itemfinder: both of these run on iOS and are built with Swift and MLX. They share an object detection backbone called YOLOE, a real-time, open-vocabulary variant of YOLO. It can process more than 30 frames per second on an iPhone 13 (and probably far more on newer models).

itemfinder is built directly on YOLOE, and it helps visually impaired people find their items. We encoded about 1000 popular items into our YOLOE detector, and the user simply says the name of the one they want. Haptic feedback then reflects how close the user is to the item and how directly the phone is pointing at it.
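One way the haptic feedback could be derived from a detection is sketched below. This is a rough illustration in Python, not the app's Swift code: the `Detection` format, the `near_area` threshold, and the 50/50 weighting are all assumptions. It treats the size of the bounding box as a stand-in for proximity and the distance of the box centre from the frame centre as a stand-in for how well the phone is aimed.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical bounding box, normalised to [0, 1] image coordinates."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def haptic_intensity(det: Detection,
                     near_area: float = 0.5,
                     proximity_weight: float = 0.5) -> float:
    """Return a vibration strength in [0, 1] for one detection.

    near_area and proximity_weight are illustrative tuning knobs,
    not values taken from the app.
    """
    # Alignment: 1.0 when the box centre sits at the frame centre,
    # falling to 0.0 when it sits in a corner of the frame.
    cx = det.x + det.w / 2
    cy = det.y + det.h / 2
    offset = math.hypot(cx - 0.5, cy - 0.5) / math.hypot(0.5, 0.5)
    alignment = 1.0 - min(offset, 1.0)

    # Proximity: a box covering near_area of the frame counts as "close".
    proximity = min((det.w * det.h) / near_area, 1.0)

    return proximity_weight * proximity + (1.0 - proximity_weight) * alignment
```

On a phone, a value like this would typically drive the strength or rate of the vibration, so the feedback ramps up as the user closes in and centres the item.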

pepperonipizza is slightly more complicated. It uses a version of YOLOE constrained to detect only price tags. When the user hovers over a price tag, YOLOE detects it, crops it out, and passes the crop to a tiny Qwen3.5 model (0.8 billion parameters) that runs on the iPhone in under about 1 GB of RAM (iPhone apps get roughly 2 GB to work with). The model's output is constrained: it can only emit digits and a "." character, so it can only output prices. This makes it quite reliable even with such a small model. The price is displayed in landscape mode in massive characters to make it more visible. We estimate about 90% accuracy (sorta impressive considering it's running on an iPhone 😭).

We faced challenges at first when we tried plain OCR, which failed miserably because Coles and Woolworths price tags somehow don't have a decimal point, so a box of cereal would come out as something like 200 dollars.
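The output constraint can be pictured as a mask over the model's next-token scores. The sketch below is a minimal Python illustration under stated assumptions, not the app's Swift/MLX code: the toy vocabulary, the `next_logits` callable standing in for the model, and the function names are all hypothetical. At each step, every token that contains anything other than digits or "." is set to negative infinity before the best token is picked.

```python
import math
from typing import Callable, List, Sequence, Set

ALLOWED_CHARS = set("0123456789.")


def allowed_token_ids(vocab: Sequence[str], eos_id: int) -> Set[int]:
    """Ids of tokens made only of digits and '.', plus the end-of-sequence id."""
    return {
        i for i, tok in enumerate(vocab)
        if i == eos_id or (tok and all(c in ALLOWED_CHARS for c in tok))
    }


def constrained_greedy_decode(next_logits: Callable[[List[int]], List[float]],
                              vocab: Sequence[str],
                              eos_id: int,
                              max_tokens: int = 8) -> str:
    """Greedy decoding where disallowed tokens can never be chosen.

    next_logits(prefix_ids) is a stand-in for the real model: it returns
    one score per vocabulary entry given the tokens emitted so far.
    """
    allowed = allowed_token_ids(vocab, eos_id)
    prefix: List[int] = []
    for _ in range(max_tokens):
        logits = next_logits(prefix)
        # Mask: anything outside the allowed set becomes impossible.
        masked = [s if i in allowed else -math.inf for i, s in enumerate(logits)]
        best = max(range(len(masked)), key=masked.__getitem__)
        if best == eos_id:
            break
        prefix.append(best)
    return "".join(vocab[i] for i in prefix)


# Toy stand-in model: it "wants" to say "$" and "each", but cannot.
toy_vocab = ["<eos>", "$", "2", ".", "00", "each"]
toy_scores = [
    [0.0, 9.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 4.0, 1.0, 6.0],
    [0.0, 0.0, 0.0, 0.0, 3.0, 5.0],
    [2.0, 0.0, 0.0, 0.0, 0.0, 7.0],
]
print(constrained_greedy_decode(lambda p: toy_scores[len(p)], toy_vocab, eos_id=0))
# prints 2.00
```

The appeal of this design is that the small model never has to learn to stay on format: even when its top choice is a currency symbol or a word, the mask leaves it nothing to emit but a number.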

ok that's it, bye bye
