Inspiration
This project is inspired by traditional machine learning use cases.
What it does
It simplifies the traditional process of image classification by removing the need to train a model to recognize images and learn the data associated with the images.
In this project I use Gemini Pro Vision as a multimodal tool to classify 2 categories of objects using zero-shot and few-shot prompts.
Category 1 - Minerals were classified using one image per prompt while gradually expanding on the prompt to show the difference in responses. I started with zero-shot prompting and ended with a few-shot prompt with 1 formatted example. In the few-shot prompt it was not necessary to give a picture example with the formatted example to receive an accurate answer. In each case Gemini was able to provide correct features for each mineral.
Category 2 - Cars were classified using 1 few-shot prompt and 2 examples. The required features and answers were given in the provided template format with the example images. Gemini used these examples to give the correct features for the final image.
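The few-shot prompt for the cars could be assembled roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name, the feature names (make, model, body style), the example values, and the file URIs are all hypothetical placeholders, and the prompt is represented as a plain list of (kind, value) pairs rather than any SDK type.

```python
def build_few_shot_prompt(examples, query_image, fields):
    """Interleave example images with formatted answers, then append the query image.

    examples: list of (image_uri, answers) pairs; answers maps field name -> value.
    query_image: URI of the image to classify.
    fields: ordered list of field names that define the answer template.
    """
    parts = [("text", "Describe each image using this template: " + ", ".join(fields))]
    for image_uri, answers in examples:
        parts.append(("image", image_uri))
        # Format the known answer in the same template the model should follow.
        formatted = "\n".join(f"{field}: {answers[field]}" for field in fields)
        parts.append(("text", formatted))
    parts.append(("image", query_image))
    # End with the first field name and no value, so the model completes the template.
    parts.append(("text", f"{fields[0]}:"))
    return parts


# Hypothetical example data; the URIs and feature values are placeholders.
fields = ["make", "model", "body style"]
examples = [
    ("gs://example-bucket/car1.jpg",
     {"make": "Toyota", "model": "Corolla", "body style": "sedan"}),
    ("gs://example-bucket/car2.jpg",
     {"make": "Ford", "model": "F-150", "body style": "pickup truck"}),
]
prompt = build_few_shot_prompt(examples, "gs://example-bucket/car3.jpg", fields)
```

Ending the prompt on an unfinished template line is one common way to nudge the model into answering in the same format as the examples.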
How I built it
I uploaded JPEG files to Google Cloud Storage and used the quickstart template to run the Gemini Pro Vision API. I tested prompts on various images until the responses provided the information I had requested about the uploaded images.
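For reference, a call of this kind might look like the sketch below. It assumes the Vertex AI Python SDK (the google-cloud-aiplatform package) is installed and authenticated. The project ID, region, bucket, file name, and prompt text are hypothetical placeholders, and depending on the SDK version the import may need to come from vertexai.preview.generative_models instead.

```python
def gcs_uri(bucket, blob_name):
    """Build the gs:// URI that points at an object in a Cloud Storage bucket."""
    return f"gs://{bucket}/{blob_name}"


def classify_image(project_id, location, image_uri, prompt_text):
    """Send one JPEG image plus a text prompt to Gemini Pro Vision and return the reply text."""
    # Imported here so the helper above can be used without the SDK installed.
    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-pro-vision")
    image = Part.from_uri(image_uri, mime_type="image/jpeg")
    response = model.generate_content([image, prompt_text])
    return response.text


# Placeholder values; replace with a real project, bucket, and file.
uri = gcs_uri("example-bucket", "minerals/quartz.jpg")
# answer = classify_image("example-project", "us-central1", uri,
#                         "Identify the mineral and list its color, luster, and hardness.")
```

The call itself is left commented out because it needs real credentials and a real bucket to run.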
Challenges I ran into
I had some trouble getting used to the cloud environment, but once I was able to create a storage bucket and enable the APIs I was ready to go.
In most machine learning use cases, the number of input/output pairs can reach the hundreds or thousands. As the number of images increases, it becomes difficult to tell whether every response from a generative tool is correct, because of hallucinations.
Hallucinations are responses created by generative AI that look correct, but are not actually true.
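One way to keep hallucinations in check as the image count grows is to hand-label a sample of images and compare those labels against the model's answers. The sketch below assumes the responses have already been parsed into dictionaries of feature name to value. The function name and the sample data are hypothetical.

```python
def score_responses(predictions, labels):
    """Compare predicted features with hand-labeled ones.

    predictions, labels: dicts mapping image name -> {feature: value}.
    Returns (accuracy, mismatches), where mismatches lists
    (image, feature, expected, predicted) for every disagreement.
    """
    total = 0
    correct = 0
    mismatches = []
    for image, expected in labels.items():
        predicted = predictions.get(image, {})
        for feature, value in expected.items():
            total += 1
            guess = predicted.get(feature)
            # Case-insensitive comparison so "Quartz" and "quartz" count as a match.
            if guess is not None and guess.strip().lower() == value.strip().lower():
                correct += 1
            else:
                mismatches.append((image, feature, value, guess))
    accuracy = correct / total if total else 0.0
    return accuracy, mismatches


# Hypothetical labels and model outputs for two images.
labels = {
    "quartz.jpg": {"mineral": "quartz", "luster": "vitreous"},
    "pyrite.jpg": {"mineral": "pyrite", "luster": "metallic"},
}
predictions = {
    "quartz.jpg": {"mineral": "Quartz", "luster": "vitreous"},
    "pyrite.jpg": {"mineral": "gold", "luster": "metallic"},
}
accuracy, mismatches = score_responses(predictions, labels)
print(accuracy)    # 0.75
print(mismatches)  # [('pyrite.jpg', 'mineral', 'pyrite', 'gold')]
```

Each mismatch is a candidate hallucination to inspect by hand.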
Accomplishments that I'm proud of
I'm proud to submit an entry to my first Hackathon : )
What I learned
I learned how to use a multimodal API and run a Colab notebook in GCP.
What's next
Image classification has a wide range of applications across various industries such as crop monitoring in agriculture, defect detection in manufacturing, product categorization in retail, and content monitoring in entertainment.
Multimodal AI tools will revolutionize fields that use image classification, much as generative AI did for Natural Language Processing use cases such as semantic analysis in marketing.
Although there are advantages, the risks should still be considered. As the number of input images increases, it becomes difficult to verify whether each output is correct. Regular evaluation of generative responses and analysis of the impact on multiple communities are needed to implement the tool responsibly in real-world use cases.