Inspiration
Ever since childhood, we have been fascinated by the idea of having a personal AI assistant like Jarvis from the Iron Man movies: an intelligent system that could see, hear, understand commands, and carry out tasks efficiently. That fascination drove us to create an AI application that could be the ultimate helper and digital companion. Juggling multiple responsibilities, we longed for an assistant that could anticipate our needs and offer solutions. We were also motivated by the desire to help people with visual or auditory impairments by developing an AI that could serve as their eyes and ears, bridging the gap between them and the world. After relentless effort, overcoming obstacles in machine learning and the limits of our tools, our dream became a reality. The application can perceive its surroundings through computer vision and audio processing, comprehend spoken commands, and engage in natural language conversation. Seeing our creation's potential to profoundly impact lives, turning challenges into opportunities, fills us with deep appreciation and modest pride. What started as a youthful fantasy has evolved into a tool that can uplift people, tear down barriers, and pave the way for a more inclusive, accessible world.
What it does
This AI assistant answers user queries through visual perception: it observes the user through the camera, listens to their questions, and responds accordingly.
How we built it
We built it on the Gemini Pro Vision model, granting it access to the camera and microphone so it can see, listen, and speak. We used TypeScript, JavaScript, and Node.js to build the application; a minimal sketch of the core model call is shown below.
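Here is a minimal sketch of that call, assuming the official @google/generative-ai Node SDK; the askAssistant helper, its inputs (a base64-encoded camera frame and a transcribed question), and the GEMINI_API_KEY environment variable are illustrative, not our exact code.

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Illustrative: the API key is read from the environment.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");
const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

// Send a captured camera frame plus the user's spoken question to the
// vision model and return its text answer.
async function askAssistant(frameBase64: string, question: string): Promise<string> {
  const result = await model.generateContent([
    question,
    { inlineData: { data: frameBase64, mimeType: "image/jpeg" } },
  ]);
  return result.response.text();
}
```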
Challenges we ran into
- The first challenge was integrating the Web Speech API. After studying the documentation and seeking assistance from Bard, we overcame this obstacle.
- The second challenge was connecting the application to the camera and microphone, which we resolved by studying other codebases (see the browser-side sketch after this list).
- The third challenge was reducing the response latency that occurs between the image analysis and the answer to the user's submitted query.
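As a rough illustration of the second challenge, here is a browser-side sketch using getUserMedia and the Web Speech API; the "camera" element id and the askAssistant call (sketched above) are assumptions, not our exact code.

```typescript
// Request camera + microphone access and show the live feed.
const video = document.getElementById("camera") as HTMLVideoElement;
video.srcObject = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });

// Web Speech API recognition (prefixed in Chromium-based browsers).
const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new Recognition();

recognition.onresult = async (event: any) => {
  const question: string = event.results[0][0].transcript;

  // Capture the current video frame as a base64 JPEG for the vision model.
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d")!.drawImage(video, 0, 0);
  const frameBase64 = canvas.toDataURL("image/jpeg").split(",")[1];

  const answer = await askAssistant(frameBase64, question); // see sketch above
  speechSynthesis.speak(new SpeechSynthesisUtterance(answer)); // speak the reply
};
recognition.start();
```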
Accomplishments that we're proud of
- Our primary accomplishment with this project is creating a working software version of the AI assistant. The application is robust and has the potential to change how we perceive the world and to support learning.
- We encourage you to experience it firsthand by visiting the link below (remember to obtain your own Gemini API key; the app only works on desktop devices): link
What we learned
- We acquired proficiency in a new programming language, TypeScript.
- We gained substantial knowledge about computer vision.
- We built our own transformer model.
- We learned to integrate web applications with machine learning technologies.
- We started thinking about how an AI model can reason.
What's next for Multimodal_Interaction_System (MMIS)
- The first upgrade will enable the system to comprehend and communicate in multiple languages.
- The second enhancement will grant the system access to the user's device.
- The third improvement will involve integrating the system with additional vision-based large language models (LLMs).
- The fourth upgrade will transform the system into a versatile hardware device.