Inspiration
I've always liked building fun stuff, so there was no single inspiration per se. I would classify myself as an AI/ML enthusiast with a particular interest in computer vision, and I've built other fun projects like a Where's Waldo? finder. This started out as me trying out Gemini 1.0 Pro Vision's capabilities, and as I tested more and more things, I realized I could build a product recommender using Gemini's multimodal capabilities. I started with clothing, as that was the simplest domain to begin with.
What it does
It takes an image and generates recommendations from a known dataset of images (in my case, clothing). Practical applications include clothing and furniture: if a customer sees an outfit on a social media/fashion influencer or a piece in an interior design catalogue, they can use this recommender tool to search for items similar to the "look" they want -- likely cheaper alternatives.
How we built it
It's written in Python, using the Gemini API for the multimodal and embedding pieces. I used Retrieval-Augmented Generation (RAG) to ground the answers in a specific image dataset: it finds matches by computing cosine similarity between the embedded descriptions of the dataset images and the embedded description of the submitted image.
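The retrieval step can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy 2-D vectors stand in for embeddings that would really come from the Gemini embedding API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_matches(query_emb: np.ndarray, catalog_embs: list, k: int = 3) -> list:
    """Return the indices of the k catalog items most similar to the query."""
    scores = [cosine_similarity(query_emb, emb) for emb in catalog_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy stand-ins for description embeddings of three catalog images.
catalog = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
query = np.array([1.0, 0.1])  # stand-in for the submitted image's description embedding
```

In the real pipeline, each element of `catalog` would be the embedding of a Gemini-generated description of a dataset image, computed once and cached.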
Challenges we ran into
Getting the prompts right -- asking Gemini the right questions to get the results I wanted. It isn't always consistent in its responses, which required me to write additional logic to filter out unwanted responses and regenerate the content (rinse and repeat) as needed.
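The filter-and-regenerate loop can be sketched like this. `generate` and `is_valid` are hypothetical placeholders for the project's actual Gemini call and response filter:

```python
def generate_with_retry(generate, is_valid, max_attempts: int = 3):
    """Call `generate` until the response passes `is_valid`, up to a retry limit.

    `generate` and `is_valid` are placeholders: in the real project, `generate`
    would wrap the Gemini API call and `is_valid` would reject malformed or
    off-topic responses.
    """
    for _ in range(max_attempts):
        response = generate()
        if is_valid(response):
            return response
    return None  # caller decides how to handle persistent failures
```

Capping the attempts keeps an inconsistent model from looping forever while still smoothing over the occasional bad response.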
I initially wanted to build this with Gemini 1.5 Pro, but being in Canada, I didn't get access to it until much later than most other people.
Accomplishments that we're proud of
I was able to create a functional minimum viable product in a relatively short amount of time while balancing work and family. I'm also participating solo.
What we learned
Prompting matters. I can write my own logic, but that only goes so far.
What's next for MR Stylist
I want to update the prototype to use Gemini 1.5 Pro and further improve the accuracy of the recommendations. Instead of matching only on cosine similarity between the image-description text embeddings, I also want to match on the image vector embeddings themselves. To achieve this, I'll first use Google's Vision API to detect and crop out individual pieces of clothing in the submitted image, compute cosine similarity on the image embeddings, combine that with the text-description similarity results, and feed everything into a ranking algorithm that produces a single answer for each article of clothing.
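The planned score fusion could look something like this. The weighted-sum approach and the 0.5/0.5 weights are my illustrative assumptions, not a decided design; in practice the weights would be tuned:

```python
def fuse_scores(text_sims: dict, image_sims: dict,
                w_text: float = 0.5, w_image: float = 0.5) -> list:
    """Combine text-description and image-embedding similarities into one ranking.

    `text_sims` and `image_sims` map catalog item IDs to cosine-similarity
    scores from the two channels. Weights are illustrative placeholders.
    """
    combined = {
        item: w_text * text_sims[item] + w_image * image_sims.get(item, 0.0)
        for item in text_sims
    }
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)
```

A simple weighted sum is just one option; a learned ranker could replace it later without changing the per-channel similarity code.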
I would also love to deploy this on GCP, as I have with my other mini-projects.