Inspiration

Whispr grew out of a desire to respond to the growing prevalence of dementia and Alzheimer's disease. A close friend, Arjun, has a grandparent living with dementia, and that personal experience showed us the challenges families face when they lose loved ones to this condition. People increasingly rely on their phones to capture life's moments in photos and videos, yet these memories often go unvisited and fade with time. We want Whispr to serve families who have lost loved ones, and people who have dementia, so they can relive the lives of those they lost. Whispr aims to revitalize these captured moments, particularly those associated with travel and specific locations, giving people a way to reconnect with and relive cherished experiences. The project seeks to facilitate the "transportation" of memories and culture, bridging the gap between the past and the future.

What it does

Whispr is a platform for recording and sharing ephemeral, location-tied memories. Users post "whispers," short-form audio and video clips tagged with a location. These anonymous, location-specific whispers offer another way to visit a place: through other people's personal experiences. Whispr lets users leave behind pieces of their experience so that others can visit and take part in a collective, dynamic memory-scape.

How we built it

Whispr is built on location-based content sharing, enhanced by AI-driven analysis. In the core workflow, users upload "whispers" (audio or video) tied to their current location. Each whisper is then processed to extract and categorize its emotional content.

A critical component of Whispr is its sentiment analysis system. We developed a custom AI model, drawing inspiration from research on manipulating individual vectors within images to analyze sentiment. Our model combines Natural Language Processing (NLP) and Speech Emotion Recognition (SER). Here is a more detailed look at how we represent the data as vectors and process it.

When a user uploads a video, we analyze each frame to extract visual information, such as the dominant colors, the types of textures present (e.g., smooth, rough), and the objects detected (e.g., people, trees, buildings). We quantify each of these observations. For color, for example, we might measure the amount of red, green, and blue in a frame. Each measurement becomes one component of a list of numbers, which in mathematics is called a "vector." If we measure 'n' different visual features, we represent the video's visual content as a vector v in 'n'-dimensional space. We write this as:

\[ v \in \mathbb{R}^{n} \]
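As a rough illustration of how such a visual feature vector might be assembled, the sketch below computes only the color features mentioned above, so n = 3. The pixel representation and the function names `frame_color_features` and `video_vector` are assumptions made for this example, not Whispr's actual pipeline; texture and object features are left out.

```python
from statistics import fmean

Pixel = tuple[int, int, int]  # (red, green, blue), each in 0..255


def frame_color_features(frame: list[Pixel]) -> list[float]:
    """Mean red, green, and blue intensity of one frame, scaled to [0, 1]."""
    return [fmean(p[c] for p in frame) / 255.0 for c in range(3)]


def video_vector(frames: list[list[Pixel]]) -> list[float]:
    """Average the per-frame features into one vector v with n = 3 components."""
    per_frame = [frame_color_features(f) for f in frames]
    return [fmean(column) for column in zip(*per_frame)]


# Two tiny hypothetical frames of two pixels each.
frames = [
    [(255, 0, 0), (255, 0, 0)],  # an all-red frame
    [(0, 0, 255), (0, 0, 255)],  # an all-blue frame
]
v = video_vector(frames)  # a list of 3 real numbers: one point in R^3
```

Averaging over frames is only one possible way to summarize a video; a real system might keep per-frame vectors or use learned features instead.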

The symbol \(\mathbb{R}^{n}\) tells us that v has 'n' components and that each one is a real number. So v is essentially a list of 'n' numbers describing the visual characteristics of the video.

Similarly, when a user uploads audio, we extract several characteristics of the sound:

Pitch: how high or low the sound is.
Cadence: the rhythm and flow of speech.
Energy: the loudness or intensity of the audio.

We also process the text of what is being said, as explained below. We end up with 'm' different acoustic features, which we organize into another vector:

\[ a \in \mathbb{R}^{m} \]
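To make the acoustic features concrete, here is a minimal sketch that computes two of them from raw samples: energy as root-mean-square amplitude, and a crude pitch estimate from the zero-crossing rate. Both estimators are stand-ins chosen for simplicity, and the function names are hypothetical; the source does not say how Whispr measures pitch or energy, and cadence is omitted here.

```python
import math


def rms_energy(samples: list[float]) -> float:
    """Root-mean-square amplitude, a simple measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples: list[float], sample_rate: int) -> float:
    """Rough pitch estimate in Hz: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if (prev < 0) != (cur < 0)
    )
    duration_s = len(samples) / sample_rate
    return crossings / (2 * duration_s)


# One second of a synthetic 100 Hz tone with amplitude 0.5 (illustrative input).
sample_rate = 8000
tone = [
    0.5 * math.sin(2 * math.pi * 100 * k / sample_rate + 0.1)
    for k in range(sample_rate)
]

pitch_hz = zero_crossing_pitch(tone, sample_rate)
energy = rms_energy(tone)
partial_a = [pitch_hz, energy]  # two components of the acoustic vector a
```

Zero-crossing counts are only reliable for clean, single-pitch signals; real speech would likely need a more robust method such as autocorrelation.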

This vector a is a list of 'm' numbers, each representing one aspect of the audio. To determine the emotional content of a whisper, our model needs to take these two vectors (v and a) and map them to a specific emotion. We express this mapping using a function:

\[ f : \mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathcal{S} \]

Let's break down what this equation means:

\(f\): the function, or process, that our model learns. It is like a set of rules the model uses to make a decision.
\(\mathbb{R}^{n} \times \mathbb{R}^{m}\): the input to the function. The function takes two vectors as input: the visual feature vector v (from the video) and the acoustic feature vector a (from the audio). The \(\times\) symbol indicates that the input is a pair, one vector from each of the two spaces.
\(\mathcal{S}\): the output of the function, the set of all possible sentiment categories. For example, \(\mathcal{S}\) might be {'happy', 'sad', 'angry', 'neutral'}. The function \(f\) assigns each pair of input vectors to one of these categories.

Our custom AI model is trained to learn this function \(f\). Here is the process.

First, the audio from the user's whisper is converted into text using speech-to-text technology. Then we use NLP techniques to extract meaningful information from the text. One common technique is "word embeddings": each word is represented by a vector whose numbers capture the word's meaning and how it is used in context. For example, the words "happy" and "joyful" have vectors that are close to each other in this vector space. These word vectors are then processed to calculate a sentiment score (e.g., a value between -1 and 1, where 1 is very positive, -1 is very negative, and 0 is neutral).

Next, we analyze the raw audio signal to extract acoustic features such as pitch, cadence, and energy. We combine these audio-derived features with the sentiment score from the NLP step to form our final acoustic feature vector a. So a contains information about both how something is said (e.g., tone of voice) and what is said (e.g., the words used).
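The word-embedding idea and the assembly of the final vector a can be sketched as follows. Everything here is illustrative: the 2-D embeddings are hand-picked toy values rather than learned ones, and the scoring rule (similarity to a positive anchor word minus similarity to a negative one) is an assumption standing in for whatever the actual NLP step computes.

```python
import math

# Hypothetical 2-D embeddings; real embeddings are learned and much larger.
EMBEDDINGS = {
    "happy": (0.9, 0.1),
    "joyful": (0.8, 0.2),
    "sad": (-0.9, 0.1),
    "trip": (0.1, 0.9),
}


def cosine(u: tuple[float, float], w: tuple[float, float]) -> float:
    """Cosine similarity: near 1 for similar directions, near -1 for opposite."""
    dot = sum(x * y for x, y in zip(u, w))
    return dot / (math.hypot(*u) * math.hypot(*w))


def sentiment_score(text: str, positive: str = "happy", negative: str = "sad") -> float:
    """Score in [-1, 1]; unknown words are ignored, empty input is neutral."""
    words = [t for t in text.lower().split() if t in EMBEDDINGS]
    if not words:
        return 0.0
    pos, neg = EMBEDDINGS[positive], EMBEDDINGS[negative]
    scores = [
        (cosine(EMBEDDINGS[t], pos) - cosine(EMBEDDINGS[t], neg)) / 2 for t in words
    ]
    return sum(scores) / len(scores)


def acoustic_vector(pitch_hz: float, cadence: float, energy: float, text: str) -> list[float]:
    """Final vector a: how it was said (audio) plus what was said (text)."""
    return [pitch_hz, cadence, energy, sentiment_score(text)]


# Illustrative values for the audio features; here m = 4.
a = acoustic_vector(pitch_hz=180.0, cadence=3.2, energy=0.4, text="a joyful trip")
```

With these toy values, "happy" and "joyful" point in nearly the same direction, so their cosine similarity is close to 1, while "happy" and "sad" are close to -1.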
Finally, the model is trained by adjusting its internal parameters: it compares its predictions against a large number of "training examples" for which we know the correct sentiment. Through this machine learning process, the model gradually improves its ability to map the input vectors v and a to the correct sentiment category in \(\mathcal{S}\).
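The training loop described above can be illustrated with a multi-class perceptron, one of the simplest models that adjusts parameters whenever a prediction disagrees with a labeled example. This is a stand-in chosen for brevity, not the architecture Whispr uses; the feature layout, labels, and data are all made up for the example, and v and a are assumed to be concatenated into a single input list.

```python
SENTIMENTS = ["happy", "sad"]  # a reduced, illustrative version of the set S


def predict(weights: dict[str, list[float]], x: list[float]) -> str:
    """Return the sentiment whose weight vector scores the input highest."""
    scores = {s: sum(w_i * x_i for w_i, x_i in zip(w, x)) for s, w in weights.items()}
    return max(scores, key=scores.get)


def train(examples, n_features: int, epochs: int = 10) -> dict[str, list[float]]:
    """Perceptron rule: on a mistake, pull the true class toward x, push the guess away."""
    weights = {s: [0.0] * n_features for s in SENTIMENTS}
    for _ in range(epochs):
        for x, label in examples:
            guess = predict(weights, x)
            if guess != label:
                for i, x_i in enumerate(x):
                    weights[label][i] += x_i
                    weights[guess][i] -= x_i
    return weights


# Toy inputs: [brightness (from v), energy, text sentiment (from a), bias term].
examples = [
    ([0.9, 0.8, 0.7, 1.0], "happy"),
    ([0.8, 0.6, 0.9, 1.0], "happy"),
    ([0.2, 0.3, -0.8, 1.0], "sad"),
    ([0.3, 0.2, -0.6, 1.0], "sad"),
]
weights = train(examples, n_features=4)
prediction = predict(weights, [0.85, 0.7, 0.8, 1.0])  # an unseen, upbeat whisper
```

A perceptron only learns linear boundaries; it is used here because each update is easy to follow, whereas a production model would more plausibly be a neural network trained by gradient descent.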

Challenges we ran into

Developing a robust and accurate sentiment analysis model presented several challenges. The first was data variability: whispers differ widely in audio and video quality, background noise, and speaker characteristics, which makes it hard to extract consistent, reliable features. The second was the subjectivity of emotion: emotional expression varies from person to person, and accurately categorizing subtle nuances is a complex task. Third, uploaded whispers need to be processed quickly and efficiently, which required optimizing both the AI model and the overall system architecture. Finally, balancing user anonymity against the need to moderate content and ensure a positive user experience was another significant hurdle.

Accomplishments that we're proud of

Despite these challenges, we reached several significant milestones. We designed and trained a custom AI model that combines visual and acoustic analysis for better sentiment detection. Combining SER and NLP gives our system a more comprehensive and accurate interpretation of the emotional content of whispers than single-modality approaches. We also created a unique platform that integrates location data with user-generated content in a natural way, evoking a sense of togetherness and shared experience. Last but not least, we built a functional prototype that demonstrates the technical feasibility of gathering, processing, and exchanging location-based memories in an ephemeral and anonymous manner.

What we learned

Developing Whispr taught us several things. We found that combining multiple data modalities (visual and audio) significantly improves the accuracy and robustness of sentiment analysis. We also realized that location plays a significant role in shaping the meaning and impact of shared memories. Designing a system that respects user anonymity while supporting content moderation and community welfare is a delicate balance. Finally, we learned that building an AI model from scratch is a lot of work, including data collection, data preprocessing, model selection, training, and testing.

What's next for Whispr

We see several key areas for further development. We plan to refine our sentiment analysis model, potentially incorporating more advanced deep learning techniques and expanding the range of recognized emotions. We aim to enhance the user interface and add features such as interactive maps, advanced search filters, and personalized whisper recommendations. We will also focus on optimizing the system architecture so that it scales to a growing volume of whispers. We hope to foster a vibrant community of users who contribute to and engage with whispers, creating a rich and evolving tapestry of shared memories. Additionally, we are exploring augmented reality (AR) and virtual reality (VR) technologies to create more immersive and interactive experiences.
