RSI (repetitive strain injury) affects more than 3 million people per year. It affects the muscles and tendons, and is often found in people who do office work or make excessive use of devices such as computers and phones. While the risk of RSI can be reduced with proper technique and adequate breaks, it's difficult to treat. For many whose livelihoods depend on the constant use of technology, RSI can significantly impair their ability to do productive work.

Here's one workaround from Emily Shea, who describes her experience with RSI and how she built a tool that uses voice commands to write code and navigate around her desktop.

What it does

I wanted to build on the idea of hands-free input, so I came up with a camera-based approach that visually tracks and recognizes hand movements, to augment an RSI-impaired user's ability to control the computer effectively.

I've implemented two kinds of hand movements. The first is based on location alone: whenever a hand appears in one of a set of designated bounding boxes, the custom command corresponding to that box is triggered.

The second movement is slightly more complicated, and involves two hands. When the hands are detected to be moving closer together, a "zoom in" command is triggered, such as zooming in on a page. When the hands are detected to be moving apart, a "zoom out" command is triggered, such as zooming out of a page.
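One way to turn the two-hand movement into a discrete command is to watch how the distance between the hand centers changes over a few frames. The sketch below is only an illustration of that idea, not the project's actual code: the class name `ZoomGestureDetector`, the `window` size, and the `threshold` value are all assumptions.

```python
import math
from collections import deque


class ZoomGestureDetector:
    """Tracks the distance between two hand centers across recent frames."""

    def __init__(self, window=5, threshold=0.15):
        # window: number of recent frames compared (assumed value)
        # threshold: relative change in distance that counts as a gesture (assumed value)
        self.distances = deque(maxlen=window)
        self.threshold = threshold

    def update(self, hand_a, hand_b):
        """Take two (x, y) hand centers; return 'zoom_in', 'zoom_out', or None."""
        self.distances.append(math.dist(hand_a, hand_b))
        if len(self.distances) < self.distances.maxlen:
            return None  # not enough history yet
        oldest, newest = self.distances[0], self.distances[-1]
        if oldest == 0:
            return None  # avoid dividing by zero when the centers coincide
        change = (newest - oldest) / oldest
        if change <= -self.threshold:
            self.distances.clear()  # start fresh so one motion fires once
            return "zoom_in"  # hands moved closer together
        if change >= self.threshold:
            self.distances.clear()
            return "zoom_out"  # hands moved apart
        return None
```

Calling `update` once per frame with the centers of the two detected hand boxes would return a command only when the distance has changed by more than the threshold across the window, which filters out small jitter in the detections.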

In addition to training and implementing the core model, I also wrote a basic command register which users can use to define their custom commands and bounding boxes. This way, they have flexibility over how the tool can be used.
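To give a feel for what such a register could look like, here is a minimal sketch. It assumes box coordinates normalized to the range 0 to 1 and a decorator-based API; the names `Region`, `CommandRegister`, `register`, and `dispatch` are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    name: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized (assumed)
    action: Callable[[], None]

    def contains(self, x, y):
        x_min, y_min, x_max, y_max = self.box
        return x_min <= x <= x_max and y_min <= y <= y_max


class CommandRegister:
    """Maps user-defined bounding boxes to user-defined commands."""

    def __init__(self):
        self.regions: List[Region] = []

    def register(self, name, box):
        """Decorator that binds a function to a bounding box."""
        def decorator(func):
            self.regions.append(Region(name, box, func))
            return func
        return decorator

    def dispatch(self, x, y):
        """Run the action of every region containing (x, y); return their names."""
        fired = []
        for region in self.regions:
            if region.contains(x, y):
                region.action()
                fired.append(region.name)
        return fired


commands = CommandRegister()
log = []


@commands.register("scroll_up", box=(0.0, 0.0, 0.3, 0.3))
def scroll_up():
    log.append("scroll_up")  # placeholder; a real action might send a keypress
```

With this shape, the detection loop only needs to call `commands.dispatch(x, y)` with each detected hand center, and users can add behavior by registering more functions without touching the loop.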

Finally, I was able to add a GPU backend through a distributed server system, increasing the frame rate to about 25 fps.

How I built it

I decided to use SSD (single-shot detector) for its fast frame processing speed. I extended an existing implementation of SSD and trained it on Azure's Machine Learning Service. To detect hands properly, I trained a base model on the EgoHands dataset over a couple of days.

Challenges I ran into

Because the detection happened in real time, and because I wanted the model to be light enough to run on a laptop, I had to try various architectures to determine what would work best. I trained two types of base models for SSD: MobileNet-v1 and MobileNet-v2-lite. In the end, I picked MobileNet-v1 because my benchmarks showed that, on average, it was faster by ~0.1 seconds per frame, and I needed every fps increase I could get.
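A per-frame comparison like this can be made with a small timing helper. The sketch below is an assumption about how such a benchmark might be set up, not the one actually used: `seconds_per_frame` is a hypothetical helper, and the two `fake_*` functions are stand-ins that sleep instead of running real models.

```python
import statistics
import time


def seconds_per_frame(infer, frames, warmup=2):
    """Return the mean wall-clock time `infer` takes per frame."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are not timed
    timings = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Stand-ins for two candidate models (hypothetical; they only sleep).
def fake_fast_model(frame):
    time.sleep(0.002)


def fake_slow_model(frame):
    time.sleep(0.02)


frames = [None] * 10  # placeholder frames; real ones would be webcam images
fast = seconds_per_frame(fake_fast_model, frames)
slow = seconds_per_frame(fake_slow_model, frames)
print(f"fast: {fast:.4f} s/frame, slow: {slow:.4f} s/frame")
```

Replacing the stand-ins with the real inference calls and feeding both the same recorded frames keeps the comparison fair.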

Additionally, because the images fed into the model were read directly from the webcam, I needed a way to reduce latency between the video capture and the network as well. I borrowed some ideas from this post: use a separate thread for video capture, and move the model into its own process.
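The capture-thread half of that idea can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `LatestFrameReader` is a hypothetical name, it accepts any object whose `read()` returns an `(ok, frame)` pair (the shape that OpenCV's `cv2.VideoCapture.read()` uses), and `FakeCamera` is a stand-in so the example runs without a webcam. Moving the model into its own process is not shown.

```python
import threading
import time


class LatestFrameReader:
    """Reads frames on a background thread and keeps only the newest one."""

    def __init__(self, source):
        # source: any object with read() -> (ok, frame), e.g. cv2.VideoCapture(0)
        self.source = source
        self.lock = threading.Lock()
        self.frame = None
        self.running = False
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self.running = True
        self.thread.start()
        return self

    def _loop(self):
        while self.running:
            ok, frame = self.source.read()
            if not ok:
                break  # the source has no more frames
            with self.lock:
                self.frame = frame  # overwrite, so stale frames never queue up

    def read(self):
        """Return the most recent frame, or None if nothing was captured yet."""
        with self.lock:
            return self.frame

    def stop(self):
        self.running = False
        self.thread.join(timeout=1.0)


class FakeCamera:
    """Stand-in for a webcam (hypothetical): yields an increasing counter."""

    def __init__(self):
        self.count = 0

    def read(self):
        time.sleep(0.001)  # simulate capture delay
        self.count += 1
        return True, self.count
```

Because the reader always hands back the newest frame, a slow inference step skips frames instead of falling further and further behind the live video.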

Accomplishments that I'm proud of

I was able to get it working with good responsiveness. The application runs smoothly at >25 frames per second on a MacBook with an i5, using a distributed system backend with a GPU.

What I learned

I learned how to use Azure Machine Learning to spin up an instance, and how to train and load saved models in PyTorch. I was able to build on top of the model without significantly impacting the latency (reading in from a webcam, processing locations, manipulating keypresses and actions on macOS). I learned about keeping the software flexible enough that users can extend and modify it as needed. I also learned about distributed system architecture, ZMQ, and message serialization and deserialization for sending and receiving information across a network.

What's next for look ma no hands

I can probably try to reduce the latency further, which would make more complex hand gestures possible and increase the number of functions that can be implemented.
