Staying connected with friends and family is more important now than ever. For many of us, video calling is the only safe way to have a face-to-face conversation with the people we care about. Yet despite the plethora of available tools, it can still be hard to connect with less tech-savvy users.
FaceMe aims to make it easier to stay in touch by making video calling more accessible, using clear visual indicators and audio cues to help our users connect with their loved ones.
What it does
FaceMe combines the OpenCV library and the Google Cloud Vision API to detect the user's face. On the OpenCV side, it uses a Haar feature-based cascade classifier along with a pre-trained neural network to generate bounding boxes for faces detected in the frame. FaceMe then merges these results with the face vertices returned by Google Cloud's Vision API.
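Before the two sets of detections can be merged, they have to be in the same shape. The sketch below assumes the Vision API's face vertices have already been extracted as (x, y) integer pairs, and collapses each polygon into the (x, y, width, height) rectangle format that OpenCV's `detectMultiScale` returns. The function names are hypothetical and not taken from FaceMe's code.

```python
from typing import Iterable, List, Sequence, Tuple

# (x, y, width, height): the rectangle format OpenCV's detectMultiScale returns.
Rect = Tuple[int, int, int, int]


def vertices_to_rect(vertices: Iterable[Tuple[int, int]]) -> Rect:
    """Collapse a polygon's (x, y) vertices into an axis-aligned (x, y, w, h) box."""
    pts = list(vertices)
    if not pts:
        raise ValueError("need at least one vertex")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)


def merge_detections(
    cv_rects: Sequence[Sequence[int]],
    vision_polys: Sequence[Sequence[Tuple[int, int]]],
) -> List[Rect]:
    """Hypothetical helper: put both detectors' results into one list of rectangles."""
    merged: List[Rect] = [(r[0], r[1], r[2], r[3]) for r in cv_rects]
    merged.extend(vertices_to_rect(poly) for poly in vision_polys)
    return merged
```

The merged list still contains duplicates (the same face seen by both detectors), which is why an overlap-removal step is needed afterwards.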
FaceMe uses the bounding boxes to determine where the user is relative to the camera. When the user starts to drift away from the center of the frame, the interface warns them and guides them back with handy little arrows. On-screen cues also tell the user when they are too close to or too far from the camera. Back up, grandpa! Sound cues alert the user when they leave or re-enter the frame.
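A minimal sketch of how a single face box could be turned into these cues. The thresholds and cue names are hypothetical values chosen for illustration, not FaceMe's actual settings. Arrow directions are given in image coordinates (pointing from the face back toward the center); whether that corresponds to the user's left or right depends on whether the preview is mirrored.

```python
from typing import List, Optional, Tuple

# Hypothetical thresholds, for illustration only.
CENTER_TOLERANCE = 0.15  # allowed offset of the face center, as a fraction of the frame
TOO_CLOSE_AREA = 0.35    # face box covers more than 35% of the frame
TOO_FAR_AREA = 0.03      # face box covers less than 3% of the frame


def guidance(face: Optional[Tuple[int, int, int, int]], frame_w: int, frame_h: int) -> List[str]:
    """Return on-screen cues for one (x, y, w, h) face box; an empty list means 'all good'."""
    if face is None:
        return ["no face"]  # e.g. play the "left the screen" sound
    x, y, w, h = face
    cues: List[str] = []
    # Offset of the face center from the frame center, normalized to [-0.5, 0.5].
    dx = (x + w / 2) / frame_w - 0.5
    dy = (y + h / 2) / frame_h - 0.5
    # Arrows point from the face back toward the center (image coordinates).
    if dx < -CENTER_TOLERANCE:
        cues.append("arrow right")
    elif dx > CENTER_TOLERANCE:
        cues.append("arrow left")
    if dy < -CENTER_TOLERANCE:
        cues.append("arrow down")
    elif dy > CENTER_TOLERANCE:
        cues.append("arrow up")
    # The fraction of the frame the face fills is a cheap proxy for distance.
    area = (w * h) / (frame_w * frame_h)
    if area > TOO_CLOSE_AREA:
        cues.append("back up")
    elif area < TOO_FAR_AREA:
        cues.append("move closer")
    return cues
```

For a 640x480 frame, a 100x100 face at (0, 190) sits far left of center and yields `["arrow right"]`.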
We added voice commands by interfacing with Google's Speech-to-Text API, so users can issue commands by speaking instead of navigating a cluttered graphical interface. FaceMe uses the PyAudio library to listen for keywords such as "Mute," "Unmute," and "Start/Stop." To protect our users' privacy, each audio file is deleted as soon as it has been transcribed to text.
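A minimal sketch of the keyword step and the delete-after-transcription pattern. The `transcribe` callback is a hypothetical stand-in for whatever call sends the recording to Google's Speech-to-Text API, and the sketch assumes "Start/Stop" means two separate words. Matching whole words (rather than substrings) keeps "unmute" from also being read as "mute".

```python
import os
import re
import tempfile
from typing import Callable, Optional

# Keywords from the description above; "start" and "stop" are assumed to be separate commands.
COMMANDS = {"mute", "unmute", "start", "stop"}


def parse_command(transcript: str) -> Optional[str]:
    """Return the first command keyword in a transcript, matching whole words only."""
    for word in re.findall(r"[a-z]+", transcript.lower()):
        if word in COMMANDS:
            return word
    return None


def transcribe_and_discard(audio: bytes, transcribe: Callable[[str], str]) -> Optional[str]:
    """Write audio to a temp file, transcribe it (hypothetical callback), then always delete it."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio)
        return parse_command(transcribe(path))
    finally:
        os.remove(path)  # the recording never outlives this call, even if transcription fails
```

Putting the deletion in `finally` means the file is removed on the error path too, not only after a successful transcription.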
How we built it
Once we had finalized the control flow and key features of our application, we took a bottom-up design approach. Team members split into sub-groups, each implementing a specific sub-feature (such as the speech-to-text integration), and wrapped their code in classes with clear, easily accessible methods. This made the integration process go smoothly.
Once our individual features were implemented, we tested a wide variety of cases, both normal use and edge cases, to make sure our program could adapt to varied user input. As we integrated function calls (such as recognizing whether a user is too close to the camera), we tested the code regularly to confirm that each feature worked as intended; when one didn't, we quickly made changes to the relevant subclasses. To practice proper version control and avoid getting bogged down in merge conflicts, everyone worked on their own branch instead of continuously committing to the main branch.
What's next for FaceMe
Our goal is to integrate FaceMe into existing video call services like Zoom and Google Hangouts to help people connect all around the world. There are still improvements we would like to make, including speeding up FaceMe’s facial recognition, running speech detection locally (as many modern apps do), and providing a better visual interface. Once we have a working prototype, we would like to conduct real-time user tests and let the feedback guide FaceMe’s development.
As we keep developing FaceMe, one of our goals is to add speech generation so that the application can talk to the user, allowing FaceMe to help visually impaired people connect with their loved ones as well. During this time of quarantine, we realized that video conferencing technology isn’t equally usable by everyone; the goal of FaceMe is to eliminate that barrier.
Challenges we ran into
Tenzin: For me personally, learning how to authorize myself and work with the Google API was a surprisingly difficult and involved process. In the end, we relied on the SpeechRecognition library to interface between our application and the Google API. Hopefully I can get more experience with it in the future.
Kelvin: At one point, the OpenCV recognition became quite faulty. It rarely recognized our faces and registered practically everything else as a face. It once thought every window blind was a separate face! Luckily (after a painful bout of debugging), we remedied this issue by removing a conditional and altering a greyscale parameter.
Nile: As an OSX user, I ran into several compatibility issues. During FaceMe's development, I had to set up a dedicated Python environment separate from my other projects and had trouble getting all the packages I needed into the right directories. The voice recognition library was also particularly difficult to get working on my machine, so given our development timeline, it made more sense to centralize prototyping on our team’s Windows machines.
Nikola: I ran into some of the same issues as Nile. Configuring a variety of computer vision and voice recognition libraries across multiple development environments on different platforms took real work.
Devesh: One challenge was encoding the NumPy image matrix as a base64 string that the Cloud Vision API could process, and then combining the face polygons returned by the API call with the results generated by the OpenCV face cascade. I also wasn't sure how to keep audio processing from the microphone from slowing down face processing, and vice versa, while still giving quick feedback on user commands.
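A minimal sketch of the encoding step, assuming the request goes to the Vision API's `images:annotate` REST endpoint, whose JSON body carries image bytes as a base64 string. The function name and the `max_results` default are hypothetical. In a live pipeline, the JPEG bytes could come from the NumPy frame via `cv2.imencode(".jpg", frame)[1].tobytes()`; here they are passed in directly so the sketch runs without OpenCV.

```python
import base64
import json


def build_face_detection_request(jpeg_bytes: bytes, max_results: int = 5) -> str:
    """Hypothetical helper: build a JSON body for a Vision API images:annotate request."""
    # JSON cannot carry raw bytes, so the encoded frame is sent as a base64 string.
    content = base64.b64encode(jpeg_bytes).decode("ascii")
    body = {
        "requests": [
            {
                "image": {"content": content},
                "features": [{"type": "FACE_DETECTION", "maxResults": max_results}],
            }
        ]
    }
    return json.dumps(body)
```

Encoding the frame as JPEG before base64 keeps the payload much smaller than sending the raw pixel matrix.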
Accomplishments that we're proud of
Tenzin: Perhaps it's because of how much I struggled with it, but I'm really proud of how we got the speech recognition commands to work. Other than that, I think our team did a great job with the facial recognition, and it's great that they did, as it's essential to our project.
Kelvin: I’m particularly proud of the visual and audio cues. Not only do they adapt dynamically to the size of the user's screen, they also respond to the location (or absence) of the user’s face. It’s a cool, live-responding feature.
Nile: I was happy with the responsive sizing and color grading of the indicator arrows. Although the implementation was rather simple, it added a nice aesthetic and user-friendly experience to the overall interface.
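One way to get the responsive sizing and color grading described above is to scale arrow length with the frame size and blend the color from green to red as the face drifts further off-center. This is a hypothetical sketch, not FaceMe's implementation; the size constants are assumptions, and the color is returned in BGR order because that is what OpenCV drawing functions such as `cv2.arrowedLine` expect.

```python
from typing import Tuple


def arrow_style(offset: float, frame_w: int, frame_h: int) -> Tuple[int, Tuple[int, int, int]]:
    """Hypothetical helper: size and color an arrow by how far the face is off-center.

    offset: distance of the face center from the frame center as a fraction of the
    frame (0.0 = centered, 0.5 = at the edge). Returns (length_px, bgr_color).
    """
    t = min(max(offset / 0.5, 0.0), 1.0)  # clamp severity to [0, 1]
    # Scale with the smaller frame dimension so arrows fit any window size.
    base = min(frame_w, frame_h)
    length = round(base * (0.05 + 0.15 * t))  # assumed range: 5% to 20% of the frame
    # Blend green -> red in BGR order.
    color = (0, int(255 * (1 - t)), int(255 * t))
    return length, color
```

In a 640x480 frame, a barely off-center face gets a short green arrow (24 px), while a face at the edge gets a 96 px red one.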
Nikola: One of my favorite parts of FaceMe was how quickly we got comfortable with a variety of computer vision tools and then built them into a communicative interface that gives responsive feedback and is visually pleasing.
Devesh: One thing I was proud of was developing a communication system between the microphone audio processor and the face processing code. I also figured out how to use non-maximum suppression to combine results from the Vision API and OpenCV so that we don’t end up with overlapping bounding boxes.
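A minimal sketch of greedy non-maximum suppression over (x, y, w, h) boxes: keep the highest-scoring box, discard any remaining box that overlaps a kept box by more than an intersection-over-union threshold, and repeat. Where the scores come from is an assumption here (for example, a detector's confidence, or a fixed value per source), as is the 0.4 threshold. OpenCV also ships a ready-made version as `cv2.dnn.NMSBoxes`.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.4
) -> List[Box]:
    """Greedy NMS: visit boxes by descending score, keeping those that don't overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[Box] = []
    for i in order:
        if all(iou(boxes[i], k) <= iou_threshold for k in kept):
            kept.append(boxes[i])
    return kept
```

With boxes `[(10, 10, 100, 100), (15, 12, 100, 100), (300, 300, 50, 50)]` and scores `[0.9, 0.8, 0.7]`, the second box overlaps the first heavily (IoU about 0.87) and is dropped, leaving one box per face.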
What we learned
Tenzin: I learned that you just have to get out there and draft things, even if you aren't sure it's gonna work. Sometimes with the design process, prototyping even if you're completely lost is a good way to find out what works, and then move forward with your project.
Kelvin: I’d never used computer vision, speech recognition, or audio output before. Implementing those features and dealing with computer vision data was certainly an educational experience. I also learned how to make things run asynchronously!
Nile: This was my first collaborative project with other programmers, and using version control to coordinate with others’ work was fascinating. I’d always used GitHub for my own personal projects, but I hadn’t used this much of git’s functionality until now.
Nikola: I think something we definitely learned about the rapid development process of a hackathon is that you have to start small. We were able to quickly implement our barebones functionality and then iterate from there, which helped relieve some stress.
Devesh: Through this project, I learned about many different computer vision tools and realized just how much existing work has gone into detecting human faces effectively. I also learned about multithreading and locking, since I had the audio processor running on a separate thread from the main thread and used a thread-safe queue to communicate between the two threads.
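A minimal sketch of that pattern, using Python's `queue.Queue` (which does its own locking internally) to pass commands from an audio thread to the main loop. The function names, the `None` end-of-stream sentinel, and the short sleep standing in for per-frame work are all assumptions for illustration. The main loop drains the queue with `get_nowait`, so waiting for speech never stalls face processing.

```python
import queue
import threading
import time
from typing import Iterable, List, Optional

CommandQueue = "queue.Queue[Optional[str]]"


def audio_worker(recognized: Iterable[str], commands: CommandQueue) -> None:
    """Hypothetical stand-in for the audio thread: publish each command, then a None sentinel."""
    for cmd in recognized:
        commands.put(cmd)
    commands.put(None)  # tells the main loop the audio thread is finished


def drain_commands(commands: CommandQueue) -> List[Optional[str]]:
    """Collect every command waiting right now without blocking the video loop."""
    pending: List[Optional[str]] = []
    while True:
        try:
            pending.append(commands.get_nowait())
        except queue.Empty:
            return pending


def run_demo(recognized: List[str]) -> List[str]:
    """Run one audio thread alongside a simulated frame loop; return the commands handled."""
    commands: CommandQueue = queue.Queue()
    worker = threading.Thread(target=audio_worker, args=(recognized, commands), daemon=True)
    worker.start()
    handled: List[str] = []
    done = False
    while not done:
        time.sleep(0.01)  # stands in for processing one video frame (detection, cues)
        for cmd in drain_commands(commands):
            if cmd is None:
                done = True
            else:
                handled.append(cmd)  # e.g. toggle mute here
    worker.join()
    return handled
```

Because the queue is FIFO, commands are handled in the order they were spoken, and neither thread ever touches the other's data directly.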