Project Story

Inspiration

Our inspiration for this project stemmed from a deep desire to make a meaningful impact on the lives of those who are deaf/mute. We were driven by the United Nations Sustainable Development Goals (SDGs), particularly the goal of ensuring inclusive and equitable quality education and promoting lifelong learning opportunities for all. We saw an opportunity to leverage technology to bridge communication barriers and empower individuals with disabilities to express themselves and engage more fully in society.

What It Does

Our app is a sophisticated Sign-Language application designed to enhance communication between individuals who are deaf/mute and those not proficient in sign language. As a powerful educational platform, it enables users to learn basic sign language gestures, practice their skills, and interact in real time. Users can sign in ASL over live video or in recorded videos selected from their phone gallery, and the app translates these gestures into text. The app also supports searching for ASL learning materials, giving users access to a wealth of resources directly within the app.

How We Built It

Our development journey began with comprehensive research into existing sign language resources and apps. With structured guidance from our experienced team members and mentor, we meticulously outlined the user journey and defined crucial app features and functionalities. To keep our project organized and on track, we utilized Trello for project management, which helped us coordinate tasks and deadlines effectively across the team.

Development Stack

  • Front-End: Developed using Android, with user interface designs created in Figma to ensure an intuitive user experience.
  • Back-End: Built with Django REST Framework, handling data operations and interactions seamlessly.
  • Cloud Services: Integrated with Google Cloud Services, including Vertex AI and Agent Builder, to enhance our application's capabilities.

Technical Implementation

Video Processing and Translation

  1. Encoding: Videos are encoded in base64 on the mobile app and sent to the server.
  2. Initial Translation: Our server processes these videos with Google’s Gemini Pro Vision model to extract the sequence of actions performed by the signer, using few-shot prompting to refine the output.
  3. Sequence Translation: The described actions are translated into text by a second Gemini Pro model, fine-tuned with Vertex AI’s supervised fine-tuning library. This model, trained on 25 examples across 5 classes (Help, Thank you, Good morning, Afternoon, and Deaf), maps sign language actions to text.
  4. Model Training Metrics: Fine-tuning used a batch size of 4 over 5 epochs, reducing total loss from 11 to 5.2. Although initially trained on a limited dataset, the framework is designed to scale with additional examples and epochs.
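The pipeline above can be sketched in Python on the server side. This is a minimal illustration, not our production code: the example action/phrase pairs, prompt wording, and model identifiers are assumptions, and the actual Gemini calls (which need the Vertex AI SDK and cloud credentials) are shown as comments.

```python
import base64

# The real server would also import the Vertex AI SDK (exact paths assumed):
# from vertexai.generative_models import GenerativeModel, Part


def decode_video(video_b64: str) -> bytes:
    """Step 1 (server side): undo the base64 encoding done in the mobile app."""
    return base64.b64decode(video_b64)


def build_few_shot_prompt() -> str:
    """Step 2: few-shot prompt asking the vision model for an action sequence.
    The example pairs below are illustrative, not our curated set."""
    examples = [
        ("flat hand moves forward and down from the chin", "Thank you"),
        ("one hand rises from under the other flat palm", "Good morning"),
    ]
    lines = ["Describe the signer's hand shapes and movements as a short action sequence."]
    for actions, phrase in examples:
        lines.append(f"Actions: {actions} -> Phrase: {phrase}")
    return "\n".join(lines)


# Steps 2-3 then look roughly like this (TUNED_MODEL_ENDPOINT comes from the
# fine-tuning job; names are assumptions):
# video = decode_video(request_body["video"])
# vision = GenerativeModel("gemini-pro-vision")
# actions = vision.generate_content(
#     [Part.from_data(data=video, mime_type="video/mp4"), build_few_shot_prompt()]).text
# tuned = GenerativeModel(TUNED_MODEL_ENDPOINT)
# text = tuned.generate_content(actions).text
```

Splitting the work into a describe step and a translate step keeps the fine-tuned model's job narrow: it only has to map textual action descriptions to one of the trained phrases.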

Search Functionality

Using Vertex AI Agent Builder, we developed a Vertex AI Search Agent backed by a Google Cloud Storage Data Store. This store houses ASL learning materials mined from startasl.com and processed into JSON Lines format. This approach gave us a robust, low-code retrieval system that enhances our app’s educational value.
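For context, a JSON Lines data store holds one self-contained JSON object per line, so each record can be indexed as a separate document. A minimal sketch of producing such a file is below; the field names and records are invented stand-ins for the processed startasl.com materials, not the exact schema we used.

```python
import json

# Hypothetical records standing in for the processed ASL learning materials;
# field names here are assumptions, not the exact Data Store schema.
materials = [
    {"id": "asl-greetings", "title": "Common ASL Greetings",
     "content": "How to sign hello, good morning, and thank you."},
    {"id": "asl-alphabet", "title": "Fingerspelling the Alphabet",
     "content": "Handshapes for each letter of the ASL alphabet."},
]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line, no outer array."""
    return "\n".join(json.dumps(r) for r in records)
```

The resulting file is uploaded to the Cloud Storage bucket that backs the Data Store, after which the Search Agent can retrieve individual records by relevance.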

Challenges We Ran Into

The foremost challenge we faced was managing time constraints, both in terms of project deadlines and the duration required for model training and fine-tuning. The process of fine-tuning the Gemini model to accurately translate ASL into text was particularly time-consuming, demanding numerous iterations to optimize performance. Additionally, our development journey saw various experimental phases where we tested different components to refine the app’s functionality:

  1. Component Experimentation: Initially, we planned to integrate a fully functional chat feature using Vertex AI Conversational Agent. However, through testing and development, we pivoted to using the Vertex AI Search Agent, which better suited our needs for streamlined search functionality.
  2. Storage and Retrieval Shifts: In our quest to perfect the search feature for ASL materials, we initially considered using BigQuery but eventually transitioned to Google Cloud Storage. This change was driven by the need for a more efficient and scalable solution to store and retrieve educational content.

Accomplishments We're Proud Of

Despite the hurdles, we achieved significant milestones that underscore our team's resilience and innovation:

  1. Successful Gemini Model Fine-tuning: We managed to fine-tune a Gemini model that adeptly translates sign language into text, a core feature of our app that enhances its educational and communicative value.
  2. Leveraging New Technologies: Our adoption and implementation of the recently announced Vertex AI Agent Builder was a game-changer, significantly accelerating the development process and enhancing the app's functionality.
  3. Functional Sign-Language App Development: We developed a fully functional Sign-Language app that not only serves as a powerful educational tool but also facilitates easier communication for individuals who are deaf/mute. This app stands to make a considerable impact in fostering inclusivity and aiding communication.

Our journey through this project, punctuated by both challenges and achievements, highlights our commitment to innovation and social impact. We are excited about the future potential of our app to contribute positively to the community and further the cause of accessible communication technology.

Built With

  • android-studio
  • backend-api-python-3.11
  • data-preprocessing
  • django
  • django-rest-framework
  • figma
  • gemini-1.5-pro
  • git
  • github
  • google-cloud-cloud-storage
  • google-cloud-vertex-ai
  • google-cloud-vertex-ai-agent-builder
  • intellij-ide-frontend-android
  • java-17
  • json
  • langchain4j
  • postman
  • pycharm
  • rag
  • retrofit-library
  • trello