In November 2023, Humane released a video titled "This is the Humane Ai Pin." In what was supposed to be an exciting tech demo of their industry-defining product, the presenter infamously held out a handful of almonds and asked the device, "how many grams of protein?" The device confidently responded, "these almonds have 15 grams of protein."

Wait a minute.
There are only like 12 almonds in his hand! That doesn't add up: 15 grams across a dozen almonds would be more than a gram of protein each, when a real almond has only about a quarter of a gram!
Thus the idea for this hackathon was born: can we make a better almond counter than an AI startup?
Spoiler alert: the answer is YES.
This wasn't an easy question to answer, however. First, we needed data. A lot of data. As far as we could tell, there's no readily accessible dataset of hands holding varying numbers of almonds, so we had to make our own. We built a tool to label these images by annotating each almond's location. With that ready, we embarked on an adventure (a walk to Piedmont) to photograph hands holding almonds against a plethora of backgrounds. When we returned, we painstakingly annotated all 1280 images.
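For the curious, the labeling boils down to click-to-annotate: one point per almond, saved alongside each image. Here's a minimal sketch of the idea in Python with OpenCV; the window name, the "n" key binding, and the almond_photos directory are illustrative, not our exact tool.

```python
import cv2
import json
from pathlib import Path

def on_click(event, x, y, flags, points):
    # Record one point per almond center.
    if event == cv2.EVENT_LBUTTONDOWN:
        points.append((x, y))

def annotate(image_path):
    """Click each almond; press 'n' to save the points and move on."""
    points = []
    img = cv2.imread(str(image_path))
    cv2.namedWindow("annotate")
    cv2.setMouseCallback("annotate", on_click, points)
    while True:
        canvas = img.copy()
        for x, y in points:
            cv2.circle(canvas, (x, y), 5, (0, 255, 0), -1)
        cv2.imshow("annotate", canvas)
        if cv2.waitKey(16) & 0xFF == ord("n"):
            break
    cv2.destroyWindow("annotate")
    # Save annotations next to the image, e.g. IMG_0042.json
    Path(image_path).with_suffix(".json").write_text(
        json.dumps({"points": points})
    )

# for path in sorted(Path("almond_photos").glob("*.jpg")):
#     annotate(path)
```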
Finally, we could start designing and training our model. We took an approach where a CNN estimates a density map of the almonds in an image; summing that map gives the almond count. After hours of training, re-architecting, and re-training, we achieved a mean absolute error of 1.5 on our validation set. You might be asking: why not just use a traditional computer vision approach with cv2 filters and edge detectors? The kicker is that our model has to work in any environmental condition, with any background behind the hand. That made the problem a gazillion times more complex and annoying, but we needed to win.
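If you haven't seen density-map counting before, the trick is that each annotated almond is splatted onto the target map as a unit-mass Gaussian blob, so the whole map integrates to the almond count; a fully convolutional network regresses that map (typically with a pixel-wise MSE loss), and inference just sums the prediction. A minimal sketch follows; our real architecture was different, and make_density_map / DensityNet are illustrative names.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
import torch
import torch.nn as nn

def make_density_map(points, height, width, sigma=8.0):
    """Turn (x, y) almond annotations into a training target.

    Each point contributes a unit-mass Gaussian blob, so the map
    sums to the number of almonds in the image.
    """
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        density[int(y), int(x)] += 1.0
    return gaussian_filter(density, sigma=sigma)

class DensityNet(nn.Module):
    """Tiny fully convolutional regressor: image in, density map out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.ReLU(),  # keep densities non-negative
        )

    def forward(self, x):
        return self.net(x)

# Counting is just integrating the predicted map:
# model = DensityNet()
# pred = model(image_tensor)   # shape (1, 1, H, W)
# count = pred.sum().item()
```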
To tie it all together, we threw together a simple frontend and web server to ingest images, run the model, and return the almond count along with its nutritional information.
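As a rough sketch of what the server does (we're assuming Flask here for brevity; the density_net module, the almond_counter.pt weights file, and the 0.28 g-per-almond constant are illustrative):

```python
from io import BytesIO

import torch
import torchvision.transforms.functional as TF
from flask import Flask, jsonify, request
from PIL import Image

from density_net import DensityNet  # the model sketched above (assumed module)

PROTEIN_PER_ALMOND_G = 0.28  # assumption, consistent with the napkin math below

app = Flask(__name__)
model = DensityNet()
model.load_state_dict(torch.load("almond_counter.pt", map_location="cpu"))
model.eval()

@app.route("/count", methods=["POST"])
def count_almonds():
    # Expect a multipart upload under the "image" field.
    img = Image.open(BytesIO(request.files["image"].read())).convert("RGB")
    x = TF.to_tensor(img).unsqueeze(0)      # (1, 3, H, W), scaled to [0, 1]
    with torch.no_grad():
        count = model(x).sum().item()       # integrate the density map
    return jsonify({
        "almonds": round(count, 1),
        "protein_g": round(count * PROTEIN_PER_ALMOND_G, 1),
    })

if __name__ == "__main__":
    app.run()
```

A client just POSTs a photo to /count and gets back the estimated almond count and a protein figure.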
Now back to our original question: we fed a screenshot from the Humane video into our model. The output? 13.9 almonds. Some napkin math (at roughly 0.28 g of protein per almond) tells us that's about 3.9 g of protein, nowhere near 15. No way. We just demolished an AI startup's performance in almond counting.
We also benchmarked against some other lightweight VLMs; compared with the best-performing one, moondream:1.8B, our model achieved higher accuracy on our validation dataset.
