The computer mouse was a revolutionary invention in its time. However, with advances in computer vision and voice recognition, we can now bring digital interfaces into a new generation. Our project, HandsfreeTV, focuses on discovering new ways to control entertainment consumption without the need for tactile hardware.

What it does

HandsfreeTV was created with the intent of giving everyone a hands-free experience while watching entertainment. The target platform we chose for the scope of this hackathon was YouTube, as it is widespread and readily available.

We approached this idea by combining two previously disjoint user interfaces: voice and gestures. Only together can these two modules provide users with a natural and comprehensive way to interact with today’s digital entertainment. The packages we created are easily transferable to other systems, such as smart home devices and other entertainment platforms.

As of now, HandsfreeTV listens for spoken requests and searches YouTube for the videos users ask for. HandsfreeTV also uses a camera to recognize hand gestures that trigger various actions (such as play/pause, volume up/down, and more)!

How we built it

For the hand tracking, we used Google's open source MediaPipe project along with their Hand Landmark Model to create a pipeline to detect and track hands/fingers. To classify a hand pose as an action, we then created a dataset of over 15,000 images of different hand poses. Those images were fed into our own deep convolutional neural network built using Google's TensorFlow and trained on GCP’s Compute Engine, creating a model that classifies hand poses as actions.
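To illustrate the last step, here is a minimal sketch of how a classifier's output could be turned into an action. It is not our actual code: the action names in `ACTIONS`, the function `pose_to_action`, and the 0.8 confidence threshold are all assumptions made for the example, and the input is assumed to be one softmax-style probability per class.

```python
from typing import Optional, Sequence

# Hypothetical action labels; the real class list is not given in this write-up.
ACTIONS = ["play_pause", "volume_up", "volume_down", "no_action"]


def pose_to_action(probabilities: Sequence[float], threshold: float = 0.8) -> Optional[str]:
    """Map per-class probabilities to an action name, or None if nothing should fire.

    The threshold is an assumed safeguard against acting on uncertain frames.
    """
    if len(probabilities) != len(ACTIONS):
        raise ValueError("expected one probability per action class")

    # Index of the most likely class (argmax).
    best = max(range(len(ACTIONS)), key=lambda i: probabilities[i])

    # Ignore low-confidence predictions and the explicit "no_action" class.
    if probabilities[best] < threshold or ACTIONS[best] == "no_action":
        return None
    return ACTIONS[best]
```

Returning `None` for uncertain frames means a hand that is merely passing through the camera view does not trigger anything.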

For the voice tracking, we leveraged Google Cloud's Speech-to-Text service to perform speech transcription. We used a streaming technique so our application is continuously ready to accept and process user input.
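Once a transcript comes back from the speech service, it still has to be turned into something YouTube understands. The sketch below shows one way this could work, using only the Python standard library. The trigger phrase `"search for "` and the function name `transcript_to_search_url` are assumptions for illustration; the actual phrases HandsfreeTV listens for may differ.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical trigger phrase; chosen only for this example.
SEARCH_PREFIX = "search for "


def transcript_to_search_url(transcript: str) -> Optional[str]:
    """Turn a transcript like 'Search for cat videos' into a YouTube results URL.

    Returns None when the transcript is not a search request.
    """
    text = transcript.strip().lower()
    if not text.startswith(SEARCH_PREFIX):
        return None

    query = text[len(SEARCH_PREFIX):].strip()
    if not query:
        return None

    # urlencode escapes spaces and special characters in the query.
    return "https://www.youtube.com/results?" + urlencode({"search_query": query})
```

The resulting URL can then be opened by whatever component drives the browser.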

To control YouTube in the browser programmatically, we used Selenium to simulate traditional input sources.
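As a rough sketch of what simulating input can look like, the code below maps action names to key presses and sends them to a page element. The action names, `selenium_key_map`, and `dispatch` are hypothetical names for this example. It relies on YouTube's keyboard shortcuts (`k` toggles play/pause; the up and down arrows change the volume when the player has focus) and on the `send_keys` method that Selenium elements provide.

```python
from typing import Dict, Protocol


class SupportsSendKeys(Protocol):
    """Anything with a Selenium-style send_keys method (e.g. a WebElement)."""

    def send_keys(self, *value: str) -> None: ...


def selenium_key_map() -> Dict[str, str]:
    """Build the action-to-key table; Selenium is imported lazily, only when needed."""
    from selenium.webdriver.common.keys import Keys

    return {
        "play_pause": "k",
        "volume_up": Keys.ARROW_UP,
        "volume_down": Keys.ARROW_DOWN,
    }


def dispatch(action: str, target: SupportsSendKeys, key_map: Dict[str, str]) -> bool:
    """Send the key for `action` to `target`; return False for unknown actions."""
    key = key_map.get(action)
    if key is None:
        return False
    target.send_keys(key)
    return True
```

With a real browser session, `target` would be an element located through the Selenium driver (for example the video player), and `key_map` would come from `selenium_key_map()`.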

Challenges we ran into

A major issue we ran into was training our own gesture detection model. We hit many errors while setting up a GPU instance and installing the drivers and programs needed to leverage its compute power. In the end, we had to fall back to training our machine learning models on CPU instances, which was significantly slower. As a result, we could not iterate on our model as much as we’d have hoped.

Concurrency was another common issue. The hand-tracking software we used (MediaPipe) ran in its own process, so we needed a way to send data from the tracking software to our main application. We solved this with a rudimentary hack: queueing up files to avoid read-write conflicts. In addition, we needed to asynchronously track the two main input components, gesture recognition and voice control. To solve this problem, we used a pub-sub architecture in a multi-threaded application.
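For readers unfamiliar with the pattern, here is a minimal sketch of a pub-sub setup in a multi-threaded Python program. It is an illustration, not our implementation: the `Broker` class, the topic names `"gesture"` and `"voice"`, and the `demo` function are all made up for the example. Producer threads publish events onto a thread-safe queue, and a single dispatcher thread hands each event to the handlers subscribed to its topic.

```python
import queue
import threading
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Tiny in-process publish/subscribe broker backed by a thread-safe queue."""

    def __init__(self) -> None:
        self._events: queue.Queue = queue.Queue()
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: str) -> None:
        # Safe to call from any thread; queue.Queue does the locking.
        self._events.put((topic, payload))

    def run(self) -> None:
        """Deliver events until stop() is called."""
        while True:
            item = self._events.get()
            if item is None:  # sentinel placed by stop()
                break
            topic, payload = item
            for handler in self._subscribers.get(topic, ()):
                handler(payload)

    def stop(self) -> None:
        self._events.put(None)


def demo() -> List[str]:
    """Publish one gesture event and one voice event from separate threads."""
    broker = Broker()
    handled: List[str] = []
    broker.subscribe("gesture", lambda p: handled.append("gesture:" + p))
    broker.subscribe("voice", lambda p: handled.append("voice:" + p))

    dispatcher = threading.Thread(target=broker.run)
    dispatcher.start()

    producers = [
        threading.Thread(target=broker.publish, args=("gesture", "play_pause")),
        threading.Thread(target=broker.publish, args=("voice", "search for cat videos")),
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    broker.stop()      # queued after both events, so both are delivered first
    dispatcher.join()
    return handled
```

Because all handlers run on the dispatcher thread, they do not need their own locks, and the gesture and voice producers never block each other. The order in which the two events arrive is not guaranteed.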

Accomplishments that we're proud of

We are very proud of developing an end-to-end product that may very well be the future of user interfaces. As we are both rookie hackers, delivering a finished project is something we are very pleased with.

In addition, we are proud of…

What we learned

We gained much technical knowledge in areas such as concurrency and computer vision. We also learned a lot in the process of training our custom CNN from scratch. In addition to hard skills, we learned a great deal about the necessary considerations and foresight needed to build a complete product.

What's next for HandsfreeTV

There is much to build on top of HandsfreeTV. Since there are still a few quirks in our hand detection model, that would be the first place to improve. Another early target is the general robustness of HandsfreeTV. Going further, it would be amazing to test HandsfreeTV on platforms such as Netflix, Hulu and Disney+.

