Inspiration

I like building fun stuff, so there was no single inspiration per se. I'd classify myself as an AI/ML enthusiast with a particular interest in Computer Vision, and I've built other fun projects like a Where's Waldo? finder. This started out as me trying out Gemini Pro 1.0 Vision's capabilities, and as I tested more and more things, I realized I could build a product recommender using Gemini's multimodal capabilities. I started with clothing because it was the simplest place to begin.

What it does

It takes an image and generates recommendations from a known dataset of images (in my case, clothing). Practical applications would be clothing and furniture: if a customer sees a social media/fashion influencer or something in an interior design catalogue, they can use this recommender tool to search for items similar to the "look" they want -- likely cheaper alternatives.

How we built it

It's written in Python, and I used the Gemini API for the multimodal and embedding pieces. I used Retrieval-Augmented Generation (RAG) to ground the answers to a specific data (image) set. It finds matches using cosine similarity between the descriptions of the RAG image dataset and the description of the submitted image.
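The matching step can be sketched roughly like this. The item names and the tiny 2-D vectors below are placeholders for illustration; in the real pipeline each embedding would come from Gemini's embedding model applied to the image descriptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_matches(query_emb, catalog, k=3):
    """Rank catalog items by similarity to the query embedding.

    catalog is a list of (item_id, embedding) pairs -- in practice,
    precomputed Gemini text embeddings of each item's description.
    """
    scored = [(cosine_similarity(query_emb, emb), item_id)
              for item_id, emb in catalog]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]
```

Precomputing the catalog embeddings once and only embedding the submitted image's description at query time keeps the per-request cost down to a single embedding call plus a linear scan.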

Challenges we ran into

Getting the prompts right -- asking Gemini the right questions to get the results I wanted. It's not always consistent in returning the results I wanted, so I had to write additional logic to filter out unwanted responses and regenerate the content (rinse and repeat) as needed.
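The filter-and-regenerate loop amounts to something like the sketch below. `generate_fn` and `is_valid` are placeholders for the actual Gemini call and the response-format check, not real API names.

```python
def generate_with_retry(generate_fn, is_valid, max_attempts=3):
    """Call the model, validate the response, regenerate if it fails.

    generate_fn: zero-argument callable that returns a model response
                 (a placeholder for the real Gemini call).
    is_valid:    callable that checks the response has the expected shape.
    Returns the first valid response, or None after max_attempts tries.
    """
    for _ in range(max_attempts):
        response = generate_fn()
        if is_valid(response):
            return response
    return None  # caller decides how to handle a total failure
```

Capping the attempts matters: without `max_attempts`, a prompt the model consistently answers badly would loop (and bill) forever.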

I initially wanted to build this with Gemini Pro 1.5, but being in Canada, I couldn't get access to it until much later than most other people.

Accomplishments that we're proud of

I was able to create a functional minimum viable product in a relatively short amount of time while balancing work and family. I'm also participating in this solo.

What we learned

Prompting matters. I can write my own logic, but that only goes so far.

What's next for MR Stylist

I want to update the prototype to use Gemini Pro 1.5 and further improve the accuracy of the recommendations. I want to match not only on cosine similarity of the image-description text embeddings, but also on the image vector embeddings themselves. To achieve this, I would first use Google's Vision API to detect and crop out individual pieces of clothing in the submitted image, compute a cosine similarity for each crop, combine those scores with the text-description similarity results, and feed everything into a ranking algorithm to produce a single answer for each article of clothing.
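One simple way to fuse the two signals is a weighted blend, sketched below. The 50/50 weighting and the candidate names are assumptions for illustration; the weight would need tuning against real results.

```python
def combined_score(text_sim, image_sim, text_weight=0.5):
    """Blend text-description and image-vector cosine similarities.

    text_weight=0.5 is a placeholder default, not a tuned value.
    """
    return text_weight * text_sim + (1.0 - text_weight) * image_sim

def rank_candidates(candidates, text_weight=0.5):
    """Produce a single ranked list per article of clothing.

    candidates: list of (item_id, text_sim, image_sim) tuples, where the
    similarities would come from the text-embedding and image-embedding
    matching passes respectively.
    """
    scored = [(combined_score(t, i, text_weight), item_id)
              for item_id, t, i in candidates]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored]
```

A learned ranker (or even per-category weights) could replace the fixed blend later, but a linear combination is a reasonable first step for comparing the two similarity sources on the same scale.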

I also would love to deploy this on GCP like I have with my other mini-projects.

Built With

Python, Gemini API