Inspiration
We were inspired to create a versatile tool that combines image and video understanding with natural language processing. We wanted to build something that could not only answer questions about media but also translate those answers into multiple languages.
What it does
Our Flask app uses Google's Gemini model to answer questions about media. Users enter a prompt and upload an image or video, and the app generates a text response based on both the prompt and the uploaded media. Users can also translate these responses into other languages, again using Gemini's translation capabilities.
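The app works in two steps: it answers a question about the uploaded media, then optionally translates that answer. Below is a minimal sketch of that flow. It assumes translation is a second, text-only Gemini request. The names `ask`, `generate`, `answer_question`, and `build_translation_prompt` are hypothetical. The real Gemini calls are passed in as functions so the sketch runs without credentials.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the real Gemini calls (not part of any SDK):
# ask(prompt, media_uri) answers a question about one uploaded image or video;
# generate(text) sends a text-only prompt and returns the model's reply.
AskFn = Callable[[str, str], str]
GenerateFn = Callable[[str], str]


def build_translation_prompt(text: str, target_language: str) -> str:
    """Wrap an earlier answer in an instruction to translate it."""
    return (
        f"Translate the following text into {target_language}. "
        "Reply with the translation only.\n\n"
        f"{text}"
    )


def answer_question(
    prompt: str,
    media_uri: str,
    ask: AskFn,
    generate: GenerateFn,
    target_language: Optional[str] = None,
) -> str:
    """Answer a question about media, optionally translating the answer."""
    answer = ask(prompt, media_uri)
    if not target_language:
        return answer  # No translation requested: return the original answer.
    # Second request: translate the answer produced by the first one.
    return generate(build_translation_prompt(answer, target_language))


if __name__ == "__main__":
    # Fake model functions for a local demo; real ones would call Gemini.
    def fake_ask(prompt: str, uri: str) -> str:
        return "A red car parked on a street."

    def fake_generate(text: str) -> str:
        return "(translated) " + text.splitlines()[-1]

    print(answer_question("What is in this image?", "gs://example-bucket/car.jpg",
                          fake_ask, fake_generate, target_language="French"))
```

Keeping translation as a separate request means the original answer can be shown first, and a second language can be requested later without re-sending the media.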
How we built it
We built the app with Flask as the web framework and integrated Google's Gemini model for both image and video understanding and language translation. The app uses Google Cloud Storage to store and retrieve uploaded media, and Vertex AI to manage and deploy the machine learning models.
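Gemini on Vertex AI can read media directly from a Cloud Storage `gs://` URI. An app like this can therefore store each upload in a bucket and pass the model a URI instead of raw bytes. The helpers below are a hypothetical sketch of the naming step. They assume each upload is saved under a unique object name. `safe_object_name`, `gcs_uri`, and the `uploads/` prefix are illustrative and not taken from the project. The upload itself would go through the `google-cloud-storage` client, for example `bucket.blob(name).upload_from_file(...)`. The resulting URI could then be wrapped with `Part.from_uri(...)` from the Vertex AI SDK.

```python
import re
import uuid


def safe_object_name(filename: str, prefix: str = "uploads") -> str:
    """Turn a user-supplied filename into a unique, path-safe object name."""
    # Drop any directory components the client may have sent.
    base = filename.replace("\\", "/").rsplit("/", 1)[-1]
    # Keep a conservative character set, strip leading dots, and fall back
    # to "file" if nothing usable is left.
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".") or "file"
    # A random prefix keeps two uploads named "photo.jpg" from overwriting each other.
    return f"{prefix}/{uuid.uuid4().hex}-{base}"


def gcs_uri(bucket: str, object_name: str) -> str:
    """Build the gs:// URI that identifies an object in Cloud Storage."""
    return f"gs://{bucket}/{object_name}"


if __name__ == "__main__":
    # "example-bucket" is a placeholder bucket name.
    print(gcs_uri("example-bucket", safe_object_name("holiday video.mp4")))
```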
Challenges we ran into
One of the main challenges we faced was fine-tuning the Gemini model to produce accurate, coherent responses across diverse prompts and images. Handling file uploads and managing media data within the app also posed technical hurdles.
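File uploads caused technical hurdles, but the specifics are not given here. One common safeguard is to check each upload against an allow-list of media types and a size cap before storing it. The sketch below shows that approach. The accepted formats, the 20 MB limit, and the names `ALLOWED_MEDIA`, `validate_upload`, and `UnsupportedMedia` are assumptions, not the project's actual rules. The returned MIME type is the kind of value a Gemini request needs alongside the media.

```python
import os

# Assumed allow-list mapping file extensions to MIME types (illustrative only).
ALLOWED_MEDIA = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
}
MAX_BYTES = 20 * 1024 * 1024  # Hypothetical 20 MB cap.


class UnsupportedMedia(ValueError):
    """Raised when an upload should be rejected."""


def validate_upload(filename: str, size: int) -> str:
    """Return the MIME type for an accepted upload, or raise UnsupportedMedia."""
    ext = os.path.splitext(filename)[1].lower()  # Case-insensitive extension check.
    mime = ALLOWED_MEDIA.get(ext)
    if mime is None:
        raise UnsupportedMedia(f"unsupported file type: {ext or '(none)'}")
    if size <= 0:
        raise UnsupportedMedia("empty file")
    if size > MAX_BYTES:
        raise UnsupportedMedia("file too large")
    return mime
```

In a Flask route, the filename would come from the `FileStorage` object in `request.files`. Flask's `MAX_CONTENT_LENGTH` setting can also reject oversized requests before they reach this check.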
Accomplishments that we're proud of
We're proud of successfully integrating both media question answering and language translation functionalities into a single, user-friendly interface. Overcoming the technical challenges and achieving a seamless user experience was a significant accomplishment for us.
What we learned
Through building this app, we gained valuable experience in working with machine learning models, especially in the context of natural language processing and computer vision. We also learned how to effectively utilize cloud services like Google Cloud Storage and Vertex AI to build scalable and efficient applications.
What's next for Smart Image Question Answer with Language Translation
In the future, we aim to further enhance the accuracy and robustness of our media question answering system by exploring more advanced machine learning techniques. Additionally, we plan to expand the language translation feature to support a wider range of languages and improve the overall performance and responsiveness of the app.
Built With
- ai
- api
- css3
- docker
- flask
- gcloud
- gcp
- gemini
- generative-ai
- git
- github
- google-cloud
- html5
- javascript
- machine-learning
- multimodal
- natural-language-processing
- python
- rest
- vertex-ai
- yaml