We started the hackathon with no real ideas for what to do. We looked at all sorts of fields and applications of Computer Science, ML and Computer Vision being two of the coolest ones. While we were exploring various avenues, we stumbled upon a Google-built, pre-trained hand-pose model for tracking hands in real time through a camera. This immediately inspired us, and we jumped to the idea of a remote input device: all of us thought it would be an amazing way to interact with your computer, a fun project, and a great proof of concept for this cool idea.
What it does
We built a gesture-based input device that mimics certain keyboard and mouse inputs, which makes navigating the web feel more natural and enables presenters to manage the flow of a presentation without the need for a clicker. To do so, we derive 21 key points from a webcam feed of the user using the MediaPipe pip package, which uses ML under the hood. We identify gestures with a mathematical model that operates on the 3D points provided by this package, and we deliver those gestures as simulated inputs to the computer through Pynput, another Python pip package.
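To make the data flow concrete, here is a minimal, hypothetical sketch of one frame of that pipeline. Neither MediaPipe nor a webcam is needed to see the shape of the data: a frame is just 21 (x, y, z) landmarks, with indices following MediaPipe's hand model (0 is the wrist, 8 is the index fingertip); the fake frame below stands in for real tracker output.

```python
# Hypothetical sketch: MediaPipe supplies 21 (x, y, z) landmarks per frame,
# with x and y normalized to [0, 1]. We fake one frame here to show the
# shape of the data the gesture model consumes.
INDEX_TIP = 8  # landmark index of the index fingertip in MediaPipe's model

def index_tip(landmarks):
    """Return the (x, y, z) position of the index fingertip for one frame."""
    return landmarks[INDEX_TIP]

# Fake frame standing in for real tracker output.
frame = [(0.5, 0.5, 0.0)] * 21
frame[INDEX_TIP] = (0.48, 0.31, -0.02)

x, y, z = index_tip(frame)  # the point the scrolling logic tracks
```

In the real pipeline, this point would then be turned into a scroll or keypress via Pynput's mouse and keyboard controllers.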
How we built it, and the challenges we encountered
Combined with the 3D hand models provided by the MediaPipe API, we decided to use Pynput, as it would give us OS-agnostic, granular control over keyboard and mouse inputs. With this idea, we started out by mapping the movement of an index finger to a scroll on a webpage, just to prove that the idea worked. However, we encountered many issues with this approach: the singular point describing the position of the index finger was unstable, both because the model couldn't be 100% certain about an absolute 3D point and because of human error (namely, slightly shaking hands). Our next immediate thought was to dampen the jitter to deal with errors, as well as to scale the scroll input accordingly. In order to do this, in our first iteration, we used a sigmoid activation function. This turned out to be error-prone: most of our input data was centered around 0 with a standard deviation of around 0.1, so our sigmoid function was outputting values that were all too high. To try and combat this, we decided to normalize the data. We came up with a bijective function f that mapped our input range of (0.0, 1.0) to a co-domain of (-infinity, infinity) to match the domain of the sigmoid function; this too did not work, as the functions we chose couldn't deal with the small, jittery errors that came from the model, and caused our sigmoid function to output massive scroll values.
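The failure mode is easy to reproduce with a few lines of math. This sketch (not our actual hackathon code) shows why raw jitter fed straight into a sigmoid looks like constant scrolling, and why a logit-style bijection onto (-infinity, infinity) amplifies values near the edges of its domain:

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Finger-position deltas between frames were tiny and centered near 0
# (std dev ~0.1), so sigmoid(delta) hovers around 0.5 even when the hand
# is effectively still -- a constant, too-high scroll signal.
small_jitter = [0.01, -0.02, 0.05, -0.03]
outputs = [sigmoid(d) for d in small_jitter]  # all close to 0.5

def logit(p):
    """Inverse sigmoid: one bijection from (0, 1) onto (-inf, inf).

    Values near the edges of (0, 1) map to arbitrarily large magnitudes,
    so jittery inputs near the boundary produce huge scroll values.
    """
    return math.log(p / (1.0 - p))
```

This is why normalizing into the sigmoid's domain didn't help on its own: the jitter survives (or is amplified by) the composed mapping.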
This led us to our final model, in which we chose a simpler data-normalization function and applied graph transformations so that small errors and large outliers were ignored and the scroll input was scaled accordingly.
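The writeup doesn't spell out the exact transformations, but one plausible shape for that final mapping, assuming a dead zone for jitter, a clamp for outliers, and linear scaling in between (the function name and constants here are hypothetical), is:

```python
def scroll_amount(delta, dead_zone=0.02, cap=0.3, gain=50.0):
    """Map a raw finger displacement to scroll clicks.

    Hypothetical reconstruction of the final mapping: displacements inside
    the dead zone (jitter) produce no scroll, displacements beyond `cap`
    (outliers) are clamped, and everything in between scales linearly.
    """
    if abs(delta) < dead_zone:
        return 0  # ignore small, jittery errors
    clipped = max(-cap, min(cap, delta))  # ignore large outliers
    return int(gain * clipped)

scroll_amount(0.01)   # jitter   -> 0
scroll_amount(0.1)    # normal   -> 5
scroll_amount(2.0)    # outlier  -> clamped to 15
```

Unlike the sigmoid, a piecewise-linear map like this is flat exactly where the noise lives and bounded exactly where the outliers live.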
Upon finalizing our scrolling feature set, we implemented the next set of features from our initial design: keyboard shortcuts. We wanted to support specific shortcuts, like the left/right arrow keys, or back/forward in a browser's history. To detect the gestures that corresponded to these keybinds, we measured the distances between certain salient points on the hand, applied averages, and conducted loads of iterative testing to determine crucial thresholds; we also calculated the speed of the hand in the camera frame to decide whether a gesture had been made.
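A sketch of the kind of geometric checks described above, using plain (x, y, z) tuples; the specific gesture, threshold values, and function names are hypothetical, not our exact ones:

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y, z) landmarks."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def is_pinch(thumb_tip, index_tip, threshold=0.05):
    """Hypothetical gesture check: thumb and index fingertips close together."""
    return dist(thumb_tip, index_tip) < threshold

def hand_speed(prev_wrist, wrist, dt):
    """Wrist displacement per second, used to gate fast swipe gestures."""
    return dist(prev_wrist, wrist) / dt

is_pinch((0.50, 0.50, 0.0), (0.52, 0.51, 0.0))            # True
hand_speed((0.2, 0.5, 0.0), (0.6, 0.5, 0.0), dt=0.1)      # ~4.0 units/s
```

Thresholds like `0.05` are exactly the kind of constant that had to be tuned through iterative testing, since they depend on hand size and camera distance.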
Accomplishments that we're proud of
We as a team are very proud of the level of mathematical precision we incorporated: we utilized a proper activation function and data-normalization techniques, along with a fast, iterative testing process that let us prototype quickly. Overall, it was an amazing project; none of us ever expected to achieve the level of granularity that we did, and it was a fantastic learning experience. Oh, and SigLo is a dope name! (For those who don't know, our name comes from the words "Sign" and "Holo", creating "SigLo"!)
What we learned
For the first time, we as a team used a lot of math in new, challenging contexts, such as scaling and normalization of data (utilizing activation functions like the sigmoid function). It was a very exciting experience to apply something we often learn in mathematics but rarely get to use in real applications. We also got to use a cutting-edge Python library with very little documentation, meaning we had to experiment a lot to get a clear understanding of how it functions. More generally, we all learned a little bit more about Computer Vision and its amazing applications in a variety of fields, like gesture recognition. We had a lot of fun, and we plan to continue developing SigLo in the near future!
What's next for SigLo
Thinking ahead to the future, instead of using hard-coded thresholds for every gesture, we want to pursue one of two potential avenues. The first involves ML: training our own neural network. For this, we would label data using the 21 landmarks, build a dataset that maps those landmarks to gestures, and use categorical classification to identify each individual gesture. The second potential solution would store frames as "snapshots" in n-long vectors; if the vector for the current gesture is close enough to a registered gesture (measured using the error vector between the current gesture and its projection onto the span of the registered gesture), we would activate the motion corresponding to it.
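The snapshot idea can be sketched with basic linear algebra: project the current gesture vector onto the span of a registered gesture and measure the residual. A minimal illustration (the vectors here are made-up stand-ins, not real gesture data):

```python
import math

def projection_error(current, registered):
    """Norm of the component of `current` orthogonal to span(registered).

    Sketch of the snapshot matching idea: project the current n-long
    gesture vector onto the line spanned by a registered gesture and
    measure the residual. A small residual means the vectors point in
    nearly the same direction -- i.e., likely the same gesture.
    """
    dot = sum(c * r for c, r in zip(current, registered))
    norm_sq = sum(r * r for r in registered)
    proj = [dot / norm_sq * r for r in registered]
    residual = [c - p for c, p in zip(current, proj)]
    return math.sqrt(sum(e * e for e in residual))

registered = [1.0, 0.0, 0.0]                     # stored "snapshot" vector
projection_error([0.9, 0.05, 0.0], registered)   # small residual -> match
projection_error([0.0, 1.0, 0.0], registered)    # large residual -> no match
```

A nice property of this measure is that it is insensitive to the overall magnitude of the registered vector, so a gesture made closer to or farther from the camera would still project onto the same line.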
Beyond that, we hope to build up SigLo further to be used on devices beyond just laptops, such as TVs, consoles, and more!
[Hacker] Andrey Piterkin, [Hacker] Nishil Patel, [Hacker] Ryan Saperstein, [Hacker] Suraj Ramchandran