Zinteract

Inspiration

COVID-19 has made video conferencing platforms like Zoom ubiquitous, as these platforms remain one of the few means for isolated individuals to enjoy genuine interactions with other people. This, however, raises concerns about accessibility. Why should individuals who struggle with using a keyboard and mouse, due to disability or other circumstances, be deprived of speaking face-to-face with other people?

What it does

Zinteract aims to solve this problem: video conferencing now carries much of our valuable human interaction, yet the accessibility of these platforms is limited. Our solution offers gesture-based control of Zoom functions (such as reactions, raise hand, toggle audio input, and toggle video input) and automatic dictation of Zoom chat messages. This means that the user can simply make a V-symbol (or any of the several available gestures) to send a Thumbs Up reaction or mute their audio. They can also make a gesture to begin dictation of their speech, which is automatically typed and sent into the Zoom chat box.

How we built it

To enable the gesture-based control of Zoom functions, we computed landmarks of the user's hand using a TensorFlow model published by Google MediaPipe, and then used heuristics based on those landmarks to estimate gestures. We then associated particular gestures with Zoom actions, hooking into Zoom via a GUI automation testing framework and running the appropriate bridge function whenever a particular gesture is detected. To automate dictation of chat messages, the user first performs an associated gesture to begin dictation. We then stream audio to Google Cloud Platform's Speech-to-Text API, receive the transcription, and type the transcription into Zoom chat via a bridge function.
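To illustrate the landmark-heuristic step, here is a minimal sketch of how a V-symbol might be recognized and mapped to an action. It is not our exact code. It assumes the 21-point hand landmark layout used by MediaPipe (fingertips at indices 8, 12, 16, 20 and the corresponding PIP joints at 6, 10, 14, 18), an upright hand, and normalized image coordinates where y grows downward. The thumb is ignored for simplicity. The names classify, dispatch, and synthetic_hand, the gesture labels, and the contents of the action table are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Landmark = Tuple[float, float]  # normalized (x, y); y grows downward in the image

# (tip, PIP joint) indices for the four non-thumb fingers in the 21-point hand model.
FINGERS: Dict[str, Tuple[int, int]] = {
    "index": (8, 6),
    "middle": (12, 10),
    "ring": (16, 14),
    "pinky": (20, 18),
}


def extended_fingers(landmarks: List[Landmark]) -> Set[str]:
    """Fingers whose tip is higher in the image than its PIP joint (upright hand assumed)."""
    return {
        name
        for name, (tip, pip) in FINGERS.items()
        if landmarks[tip][1] < landmarks[pip][1]
    }


def classify(landmarks: List[Landmark]) -> Optional[str]:
    """Map one frame of hand landmarks to a gesture label, or None if unrecognized."""
    if len(landmarks) != 21:
        return None  # no hand, or an incomplete detection
    up = extended_fingers(landmarks)
    if up == {"index", "middle"}:
        return "v_sign"
    if up == {"index", "middle", "ring", "pinky"}:
        return "open_palm"
    if not up:
        return "fist"
    return None


def dispatch(gesture: Optional[str], actions: Dict[str, Callable[[], None]]) -> bool:
    """Run the bridge function registered for a gesture; return True if one ran."""
    handler = actions.get(gesture) if gesture is not None else None
    if handler is None:
        return False
    handler()
    return True


def synthetic_hand(up: Set[str]) -> List[Landmark]:
    """Build fake landmarks with the given fingers extended (for demonstration only)."""
    points: List[Landmark] = [(0.5, 0.9)] * 21
    for name, (tip, pip) in FINGERS.items():
        points[pip] = (0.5, 0.5)
        points[tip] = (0.5, 0.3) if name in up else (0.5, 0.7)
    return points
```

In use, each video frame's landmarks would be passed through classify, and the result handed to dispatch along with a table such as {"v_sign": send_thumbs_up}, where send_thumbs_up stands in for a bridge function that drives the Zoom GUI. A real pipeline would likely also require a gesture to persist for several frames before firing, to avoid triggering on a single noisy detection.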

Challenges we ran into

We faced problems with parallel video input, as both the gesture recognition service and Zoom required access to the webcam feed. To work around this limitation, we used a desktop application made by Snap called Snap Cam, which creates a secondary camera device that emulates the primary one, allowing us to assign one camera to Zoom and the other to gesture recognition.

In addition, we ran into the issue of scoping our project, which initially had more features such as automatically muting upon detecting an unrecognized voice using GCP diarization. However, we quickly realized our solution would not be in a workable state by the deadline if we pursued every feature, so we made the decision to prioritize some features and eliminate others.

Finally, some of our team members had issues setting up dependencies on their local machines, so to circumvent this, we set up two AWS Lightsail instances for them to SSH into and work from remotely.

Accomplishments that we're proud of

Coordinating the various moving parts of this application required networking protocols such as TCP and message-passing libraries such as ZeroMQ, which none of our group members had prior experience with. Similarly, we needed to interface with GCP APIs, which were also new to us. Learning these technologies during the event is something we are very proud of.