The Idea

We set out to create a live, direct sign-to-speech tool using two Myo armbands. A Myo armband is a small sensor device worn on the forearm: it detects acceleration and orientation, along with a coarse approximation of the electrical signals from the forearm muscles that fire when the fingers flex. With a band on each arm, we envisioned signers communicating with the entire world through deliberate, careful gestures.

We had two use-cases in mind. The first was for use by the deaf community in their day-to-day lives, so that people who can't sign could engage with them live, reacting to the words they sign and looking them in the eye, rather than resorting to pen and paper or avoiding the interaction altogether. The second was more pedagogical: both those who interact with the deaf community and those interested in practicing and honing their ASL could learn and cement their signs through feedback.

What it does

After 1.5 days of work, Gesture can learn a small collection of 3-5 words and distinguish between them with roughly 80-95% accuracy, as long as the signs are not too 'close' to one another (e.g. "mother" and "father"). With more refinement of our input features and machine learning parameters, we think this could improve significantly.

How we built it

To build Gesture, we began with a few proofs of concept on very sparse, simple data, classifying the signs "father" and "sorry". With some success there, we forged ahead and built a full-fledged tool for training and interpreting.

First, we wrote a Python script that "listens" to your arm movements and chunks them into "gestures" at moments of stillness, the same way speech is chunked into utterances between moments of silence. Each gesture can then be saved to a file as an example of a word. Say we were training Myo to recognize the word "dog": by running this script and repeatedly signing "dog" with a small break in between, we could commit 50+ instances of the sign to a file in a couple of minutes.

Then we wrote the main Gesture tool, which has two modes. In both, it first constructs a model from the training data and the desired words. Depending on the mode, it then either runs against previously recorded gestures held out for testing, to see how the model performs, or goes into "live" interpreting mode, listening for gestures and speaking aloud its best guess for each sign. We used scikit-learn for the modeling and prediction, using support vector machines. We learned a decent amount about ML and classification, but would need to learn more to refine our approach.
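The pipeline above can be sketched roughly as follows. This is an illustrative reconstruction, not our actual code: the stillness threshold, resampling scheme, and toy data are all invented for the example, but the shape of it (segment at stillness, resample to a fixed length, classify with a scikit-learn SVM) matches what we built.

```python
import numpy as np
from sklearn.svm import SVC

def segment_gestures(motion, stillness_thresh=0.05, min_still=5):
    """Split a 1-D stream of motion magnitudes into gestures, treating
    runs of near-zero motion as boundaries, the way silence separates
    spoken utterances."""
    gestures, current, still_run = [], [], 0
    for m in motion:
        if abs(m) < stillness_thresh:
            still_run += 1
            if still_run >= min_still and current:
                gestures.append(current)
                current = []
        else:
            still_run = 0
            current.append(m)
    if current:
        gestures.append(current)
    return gestures

def to_fixed_length(gesture, n=10):
    """Resample a variable-length gesture to n samples so every example
    has the same feature dimension for the SVM."""
    idx = np.linspace(0, len(gesture) - 1, n).round().astype(int)
    return np.asarray(gesture, dtype=float)[idx]

# Training: each row is one resampled recording of a known sign.
X = [to_fixed_length(g) for g in [[1, 2, 3, 2, 1], [1, 2, 2, 1],
                                  [5, 6, 7, 6], [6, 7, 7, 6, 5]]]
y = ["sorry", "sorry", "father", "father"]
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)

# "Live" mode: segment the incoming stream, classify each gesture.
stream = [0] * 6 + [1, 2, 3, 2, 1] + [0] * 6
for g in segment_gestures(stream):
    print(clf.predict([to_fixed_length(g)])[0])
```

In the real tool the stream came from the Myo's orientation and accelerometer readings rather than a toy list, and saved gesture files stood in for the inline training data.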

Challenges we ran into

The most disappointing challenge came near the very end of the hackathon, when we began work on coordinating two Myos instead of one. It slowly dawned on us that, with the Python bindings we were using to wrap the C++ SDK, this would not be possible. The Myo team had, for some unknown reason, removed the API functionality for determining each band's MAC address in late 2014, and the wrapper around the C++ code would not return consistent objects, instantiating new copies seemingly at random, which prevented us from comparing memory addresses to determine whether two objects referred to the same band. This meant that in each word-training session, the right and left arms might be inadvertently swapped, which would completely derail training given our small data sets.

Other challenges included not discovering the Python bindings for a while and slogging through the C++; the switch sped us up immensely. We also struggled to create good training data, as it was difficult to determine what was worth feeding into the model. Should we include the "pose" Myo opaquely predetermines for us (e.g. "fist", "extended"), the raw electrical forearm data from which it derives the pose, or neither? Should we include both orientation and gyroscope data? For that matter, were those two exposed properties even truly different? Should the time it takes to make a gesture be a factor? Without a few more days, or more data than we could produce on our own, we couldn't answer these questions definitively, though we tried to intuit what would and wouldn't be effective.

We toyed early on with the idea of splitting the Myo data stream into two distinct parts, the 'finger' data from the electrical signals and the 'arm' data from orientation/acceleration, and then combining the two results to determine the sign. This seemed attractive in theory, but problems included syncing the timing of the two data streams and handling multiple finger movements for one arm movement, and vice versa. After experimenting, we decided to forgo the electrical data altogether and focus on quality arm recognition. We similarly excluded the pose information, since it wasn't consistently accurate enough to be useful. Myo's margin of error is completely acceptable for, say, pausing a video with your hand, where no user minds giving it a few tries, but not for live communication. We did eventually choose to include time, as it gave us a decent proxy for distinguishing two gestures that were otherwise similar but consistently took different amounts of time. What we lost in the signer's flexibility to express themselves with inflection, slowly or quickly, we gained in precision.
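The feature vector these decisions led to can be sketched as follows. Again this is illustrative rather than our exact code: the function name, channel counts, and bin count are assumptions, but the idea is the one described above, keeping orientation and gyroscope data, dropping EMG and pose, and appending duration as a tempo feature.

```python
import numpy as np

def gesture_features(orientation, gyro, duration_s, n_bins=10):
    """Build a fixed-length feature vector for one gesture: resample the
    orientation and gyroscope streams to n_bins time steps each (dropping
    the EMG/pose data entirely), then append the gesture's duration so
    signs that differ mainly in tempo stay distinguishable."""
    def resample(stream, n):
        stream = np.asarray(stream, dtype=float)
        idx = np.linspace(0, len(stream) - 1, n).round().astype(int)
        return stream[idx].ravel()  # flatten (n, channels) to 1-D
    return np.concatenate([resample(orientation, n_bins),
                           resample(gyro, n_bins),
                           [duration_s]])
```

For example, a 0.8 s gesture with 20 quaternion samples (4 channels) and 20 gyroscope samples (3 channels) becomes a 10·4 + 10·3 + 1 = 71-dimensional vector, the same length for every gesture regardless of how long it took to sign.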

Accomplishments that we're proud of

We're immensely proud that we were able to successfully translate signs in real time; for most of the hackathon it was hard to say whether it would work at all. We're also proud of the efficient testing and training tool we created, which lets any user sit down for five minutes and build a model that can turn their own personal body language into spoken words.

What's next for Gesture

There are a few big-picture next steps. First and foremost, we want to hear feedback, advice, and criticism from actual signers, as neither team member has any real experience with ASL beyond our research and work for this project. We'd then like to revisit one of our initial hypotheses, separating the noisier electrical data from the more consistent arm data, in the hope that in isolation the minute differences will stand out in relief and classify better. Next, we'd like to build a fairly large test data set over a day or two, using about 10 words of varying similarity with ~100 recordings each, and rigorously test various parameters for our modeling toolkit to find a good baseline that doesn't overfit. Finally, we'd like to train the model on someone who actually signs and see whether it works for them. Long term, if all this proved successful, we'd hook it up to a simple phone app or webpage that speaks aloud for a signer wherever they are, allowing them to communicate freely, simply, and in real time with anyone.
