Inspiration
Our inspiration comes from a personal desire for self-betterment and a wish to offer additional support to today's youth. In our professional and personal lives, we constantly see ourselves and our peers ending up in compromising emotional states, subjected to unnecessary (and in some cases imaginary) stress stemming from a feeling of inadequacy. It's hard not to take these emotions personally in a world that is so fast-paced and interconnected.
That is why we believe it is important to use emerging technologies to focus on humanity's mental health as much as on its advancement. Thus, this project was born.
What it does
Users provide pre-recorded video input to the system, which analyzes both the visual and auditory portions, then gives the user an assessment of what they are feeling, along with resources for help (should help be needed) and a schedule for checking in on the user.
How we built it
We designed two parallel systems to analyze the visual and auditory portions of the video, which feed into an agent that combines and analyzes their two outputs.
The project uses Google Gemini as the LLM backbone of the architecture.
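The combining step can be sketched in plain Python. This is a hypothetical illustration, not our actual agent code: `combine_assessments`, the weighting scheme, and the emotion labels are all made up for the example; in the real system the merge is performed by a Gemini-backed agent rather than a fixed formula.

```python
def combine_assessments(visual: dict, auditory: dict, visual_weight: float = 0.5) -> dict:
    """Merge per-emotion scores from the visual and auditory analyzers.

    Each input maps emotion labels to confidence scores in [0, 1].
    A simple weighted average stands in for the LLM agent that
    actually reconciles the two analyses in our system.
    """
    emotions = set(visual) | set(auditory)
    combined = {
        e: visual_weight * visual.get(e, 0.0) + (1 - visual_weight) * auditory.get(e, 0.0)
        for e in emotions
    }
    # The dominant emotion drives which resources and check-in schedule to suggest.
    dominant = max(combined, key=combined.get)
    return {"scores": combined, "dominant_emotion": dominant}
```

For example, `combine_assessments({"calm": 0.8, "stress": 0.2}, {"calm": 0.4, "stress": 0.6})` averages the two views and reports `"calm"` as dominant.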
The auditory system uses a variety of techniques for the analysis, ranging from pitch detection to Mel-Frequency Cepstral Coefficients (MFCCs), drawn from the field of AI audio analysis.
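To give a sense of what MFCC extraction involves, here is a minimal NumPy-only sketch of the standard pipeline (framing, windowing, power spectrum, mel filterbank, log, DCT). It is a textbook illustration, not our production code; the parameter values are common defaults, and a real system would use an audio library instead.

```python
import numpy as np

def mel_filterbank(n_filters: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """Return an array of shape (num_frames, n_coeffs)."""
    # Slice the signal into overlapping Hamming-windowed frames.
    frames = np.array([
        signal[start:start + n_fft] * np.hamming(n_fft)
        for start in range(0, len(signal) - n_fft + 1, hop)
    ])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Log mel-filterbank energies, then a DCT-II to decorrelate them.
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    n = energies.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n_coeffs))
    return energies @ dct
```

The resulting coefficients summarize the spectral shape of each short frame, which is what downstream emotion analysis operates on.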
The visual system is less involved, as computer vision capabilities are already built into many LLMs, including Gemini. A couple of agents focus on different parts of the body, such as the face and posture, to assess the user's emotional state from these physical cues.
Challenges we ran into
Data: Our system can in theory consume massive amounts of data, and analyzing pre-recorded videos could far exceed the capacity of our GCP server. It's no surprise to us now that large-scale AI companies suffer from this problem as well, given that we struggled to define a data threshold even in our small environment.
Speed: Speed of analysis is constrained not by our personal hardware but by that of the GCP server, which could be reduced to an ineffective device in the event of a GCP outage or by budget constraints. Additionally, the length of a video greatly affects both upload time to the server and analysis time, which could cause user-satisfaction issues.
Session State: Saving and creating variables in session memory is not an entirely intuitive process, and given our limited knowledge of the product itself, we had to study Google's examples that use their pre-built "memorize" function. However, we are certain that we missed something in our interpretation of the product and its potential for manipulating data. Given appropriate time, we are confident we would find the easier way Google has provided to use the session state.
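The pattern we studied can be illustrated with a plain-Python stand-in. This is a hypothetical sketch, not Google's actual code: in the real ADK samples, "memorize" is a tool function whose final argument is a `ToolContext`, and writes go to its `.state` mapping, which ADK persists across turns; here an ordinary dict plays that role so the idea is runnable on its own.

```python
def memorize(key: str, value: str, state: dict) -> dict:
    """Store a value in session state under a key.

    Stand-in for the "memorize"-style tool in Google's ADK examples,
    where `state` would be the ToolContext's persistent session state
    rather than a local dict.
    """
    state[key] = value
    return {"status": f"Stored '{key}'."}
```

The agent calls such a tool whenever the conversation produces something worth remembering (e.g. a scheduled check-in), and later turns read it back out of the same state.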
Accomplishments that we're proud of
We successfully implemented a working version of this, although it does not fully meet our hopes in terms of performance. We are also proud of our team's communication, including the honesty of a former teammate who candidly described his inability to participate due to time constraints.
What we learned
We learned how powerful these tools are going to become in the coming years. This project was incredibly low-code, requiring mostly prompt engineering and an understanding of core concepts in audio/video analysis to reach an MVP we found acceptable. This level of abstraction over what is normally a very in-depth programming and domain-knowledge task is unprecedented, and a working knowledge of these tools will become necessary in the coming years.
What's next for Therapy Bot For Video Usage
We think it would be cool to flesh this out for real-time usage rather than pre-recorded videos. That would not only save us storage space, but also let users get instant feedback. An audio-only input could be useful as well, integrated into systems like Amazon Alexa or Google Home.
Built With
- adk
- gcp
- python