Blink-147
Authors
Kana Takizawa (ktakiza1), Navya Sahay (nsahay), Dixi Han (dhan30), Kara Wong (kwong60)
Important Links
Introduction
What problem are you trying to solve and why?
Total paralysis patients often rely on eye-based communication, using different eye movements to relay basic needs such as food and personal well-being. To accurately translate their messages and make this form of communication more accessible, we propose an attention-based CNN model to first detect eye location within facial images. We then use CNNs to determine and classify specific eye movements relevant to a predefined eye alphabet (“up”, “left”, “right”, and “down”) within the localized eye images.
If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.
The paper’s objectives are to develop a computational eye-based communication system for patients with motor neuron disorders, especially in low-income countries. This architecture, Blink-To-Live, is not software or hardware-specific, serving as a cost-efficient alternative to comparable sensor-based systems. Blink-To-Live establishes an eye gesture alphabet such that speech-impaired patients are able to express their basic needs.
We chose this paper as we are all interested in nonverbal communication and how technology could universalize what would otherwise be marginalized languages. We were impressed by the ingenuity of this paper, and we were inspired by the use of deep learning in medical technology. However, when we tried the original Blink-to-Live model on our own faces, we noticed that Blink to Live was unable to recognize diverse shapes of eyes and classify them accurately. Thus, we hoped to use a more diverse dataset and a simpler CNN-based model, rather than the computer vision model used in the paper, to improve on the performance of eye language recognition models for a broader range of users.
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? etc.
This problem is a classification of eye-positioning.
Related Work
Are you aware of any, or is there any prior work that you drew on to do your project?
What this paper does is use a facial landmark detector library (Dlib: https://github.com/davisking/dlib-models) to get 68 landmark points on faces and determine among right, left, up, and blink based on the relative location of pupils to the edges of eyes. Dlib seems like a CNN-based pre-trained library, as described below.
"This model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half." -davisking, GitHub
Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.
The paper, “Eye-Tracking For Everyone” proposes the use of CNNs to create a dataset of eye-locations using facial images. The paper details the mechanisms of the eye-tracker model that its authors build. First, it delves into the methods behind creating the GazeCapture dataset and then trains a convolutional neural network on the dataset to predict gaze. The paper’s dataset is the one we will be using to train our model to detect the right location of eyes in a face through encoders.
Data
What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).
We will be using the Closed Eyes in the Wild dataset, which contains facial images of open eyes and closed eyes across 2423 individuals. The 1192 "blink" images were collected from the Internet, while the 1231 open-eyed images were selected from the Labeled Faces in the Wild dataset which were also extracted from the Internet. The only expectation of these images is that they pass the Viola-Jones face detector, although this may introduce bias via the methods that the Viola-Jones detector utilizes.
How big is it? Will you need to do significant preprocessing?
For the open-eye images, we looked at every image, determined if the eye movement was relevant to our project (either left, right, up, or blink), and manually labeled them. We created a pickle file and dumped the images and labels into it after shuffling them while preserving the correct image-label association.
Before any data augmentation, the data we used ultimately contained 1493 images: 245 right, 126 up, 222 left, and 900 blinks. In terms of experimenting with data augmentation, a moderate amount of preprocessing was required to downsample "blink" images and augment (crop, contrast, saturation) all other images. However, this level of preprocessing is doable as the data otherwise is more or less cleaned up. We then split the data at a 3:1 ratio between training and testing, normalize each pixel value to 0-1, convert the labels to the assigned integers, and one-hot encode them.
Methodology
What is the architecture of your model?
We utilized a CNN with attention, so our architecture consisted of three main components: 1) convolutional layers, 2) attention blocks, and 3) a feedforward network.
The input image was first passed through four convolutional layers, each with 64 filters, in order to extract facial features from the images that are relevant to predicting the eye location in the images. LeakyRelu activation and max pooling were applied to each layer to optimize training.
The Attention Blocks consist of attention maps constructed from local feature vectors that are projected onto global features. The local features are intermediate outputs resulting from the first three convolutional layers. The global feature is the output from the last convolutional layer, serving as a “global guidance” because this last convolutional layer has the most compressed information across the whole image. The three attention maps calculated from each of the local features are then concatenated together to form the final feature vector and passed through a normalization layer.
Finally, the feature vectors from the Attention Blocks are passed into a feed forward network consisting of three dense layers, each with LeakyRelu activation. These layers produce the logits that are then used to classify an image as either “up”, “left”, “right”, or “blink”.
How are you training the model?
We are training the model on the Closed Eyes in the Wild dataset. We use CNNs for classification and incorporate attention to focus on relevant features. After concatenating these representations into a feature vector, a feed-forward neural network is used to train the model.
If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.
We think that the hardest part about implementing the model will be the attention portion as the CNN portion seems relatively standard. The attention portion, on the other hand, will be a challenge as the feature extraction we are hoping to implement will be specific to our dataset. We are also uncertain about how to maintain the proper balance between dimensionality reduction and information retention, so we will be using some visualization techniques (CNN visualization, confusion matrix, accuracy/loss graphs) to better understand the mechanics of our model.
Metrics
What constitutes “success?”
We define success as predicting the right movement of the eyes (left, right, up, or blink).
What experiments do you plan to run?
We spare some of the datasets for testing. We also plan to do data augmentation, such as shifting or flipping facial images, to train it to deal with various images. In addition, we are going to test the model on several images we take on our own.
For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
We will use accuracy to measure the performance of the model. It is computed as the proportion of correct predictions to total predictions of eye movements.
Explain how you will assess your model’s performance. What are your base, target, and stretch goals?
We assess our model performance based on the accuracy of the classification.
Base: implement a model that runs comparably on training and test datasets
Target: given eye images, it can determine if the movement is right, left, up, or blink at 60% accuracy
Stretch: adding more layers to improve accuracy while considering computing capabilities within what is realistic for the resources we have
Ethics
What broader societal issues are relevant to your chosen problem space?
Individuals who have lost their natural speaking abilities require modified language that utilizes physical traits such as facial gestures or brain signals as a form of communication. Most proposed languages, including those utilizing eye alphabets, require specialized hardware with sensors, such as eye trackers, eyeglasses with infrared, and eye gaze keyboards. Most patients with speech impairments use these modified languages to communicate with their caregivers, who may not have access to the medical technology needed to implement the necessary communication systems. Regular use of these medical devices require that caretakers undergo training to learn to operate the machines. Additionally, specialized hardware can be especially expensive and cannot be easily purchased. In fact, accessibility of medical technology and equipment is a wide-spread issue, especially in terms of cost and complexity. Healthcare providers need a certain level of technological literacy in order to operate and maintain medical equipment that may be challenging to acquire.
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
The datasets that we are using is the Closed Eyes in the Wild dataset for both feature extraction and eye position detection. These datasets were collected directly from the Internet, so it may not be representative of all individuals who use eye-based communication methods. Due to selection bias, the individuals who chose these images for the data collection process may tend to be individuals who have strong opinions about eye-based languages or those who are more confident in the images that they select. Since medical technology has historically been more accessible to more wealthy, affluent, and educated individuals, this means that this subset of the population is more likely to have experiences with and opinions on eye-based language equipment and technology. Therefore, individuals from this subset of the population are more likely to have submitted images for the data collection process for the Closed Eyes in the Wild datasets. However, patients who require eye-based communication methods will have all sorts of socioeconomic backgrounds, which are correlated with factors such as gender and race. Therefore, groups that are historically underrepresented in medicine and are disadvantaged, such as minority racial groups or women, may not be well represented in the datasets.
Built With
- tensorflow
Log in or sign up for Devpost to join the conversation.