VisionGuard

VisionGuard: A real-time detector for the states of drivers

Team Members: Mike Li (mli252), Ella Liang (yliang63), Jessica Yang (wyang74), Louis Zhang (szhan174)

GitHub Link: https://github.com/DZXL/VisionGuard-DL-Brown.git

Introduction

1.1. Motivation

Advances in autonomous driving have significantly enhanced vehicle safety by reducing the likelihood of accidents caused by human error. However, current systems require manual activation, which may not be feasible when the driver is drowsy, ill, or otherwise impaired. For example, one of our teammates, Mike, unintentionally fell asleep while driving his Tesla during a business trip. Fortunately, he had engaged the autopilot system while he was barely awake, which averted a potentially catastrophic accident. Inspired by this alarming experience, we decided to develop a deep learning model that can autonomously detect when a driver is not in a state to drive and activate the vehicle’s autopilot system accordingly, thereby protecting both the driver and other road users.

1.2. Related Work

There have been some recent explorations in combining human condition detection and deep learning models. For example, in 2019, Jiankang Deng and colleagues at Imperial College London presented RetinaFace, a robust single-stage face detector that performs pixel-wise face localization on faces of various scales (Deng et al., 2019). In addition, Florez et al. (2023) approached driver drowsiness detection by focusing on the eye region, using MediaPipe together with three neural networks (InceptionV3, VGG16, and ResNet50V2), and reported an accuracy of 99.71%.

Although research in this area has been carried out by renowned institutes and has reached high accuracy rates, to our knowledge there is not yet an end-to-end model that can be applied in real-life scenarios and that judges from more than one angle whether the driver is in a suitable state to drive. For example, it may be inappropriate to drive not only when drowsy, but also when emotional or unwell. Therefore, we decided to develop a multimodal model that combines a Convolutional Neural Network (CNN) and Gated Recurrent Units (GRUs) to give a more encompassing view of whether the driver is fit to drive.

Preprocessing and Data Management

Our model has two types of data inputs. The first is image data from four datasets: the ‘Driver Drowsiness Dataset’ (Nasri, 2020), the ‘Drunk Face’ dataset (Roboflow Universe, 2022), the ‘FER-2013’ dataset for emotional analysis (Sambare et al., 2020), and the ‘Pain E-motion Faces Database’ (Fernandes-Magalhaes et al., 2023). The second is EEG data from the ‘Confused student EEG brainwave data’ dataset on Kaggle (Wang, 2019). The image datasets were used to train the model to identify four abnormal driver states from facial expressions: ‘Drowsy’, ‘Drunk’, ‘Angry’, and ‘Painful’. We consider the driver unfit to drive when such expressions appear on their face. The EEG dataset was used to train the model to recognize whether the driver is conscious or unconscious from their EEG brainwave data, since it is not safe to drive while unconscious. The two types of data were preprocessed differently.

2.1. Image data

We built our own image dataset by extracting 1,500 images from each of the four image datasets mentioned above and labeling them ‘Drowsy’, ‘Drunk’, ‘Angry’, and ‘Painful’ respectively. Additionally, we included 4,000 images categorized as ‘normal’ to depict drivers in a neutral state, for a total of 10,000 images.

The dataset is divided into training and testing subsets, containing 80% and 20% of the data respectively. To improve the robustness and generalization of the model, we used TensorFlow’s ‘ImageDataGenerator’ to augment the training data. The augmentations include rotation, shifting, shearing, scaling, and horizontal flipping. These transformations help the model recognize the target abnormal states under different conditions and orientations. For the testing dataset, we limited preprocessing to normalization so that the model is evaluated on unaltered images in a controlled test environment.
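As a rough illustration of the 80/20 split on the class sizes given above, the following sketch performs a stratified split in plain Python. The report does not say whether the split was stratified, so that choice, the placeholder image identifiers, and the function name `stratified_split` are all assumptions made for this example.

```python
import random
from collections import Counter

# Class sizes from the text: 1,500 images per abnormal state plus 4,000 'normal'.
CLASS_SIZES = {"Drowsy": 1500, "Drunk": 1500, "Angry": 1500, "Painful": 1500, "normal": 4000}


def stratified_split(class_sizes, test_fraction=0.2, seed=0):
    """Split (image_id, label) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label, count in class_sizes.items():
        # Placeholder identifiers stand in for real image file paths.
        items = [(f"{label}_{i:04d}", label) for i in range(count)]
        rng.shuffle(items)
        n_test = round(count * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


train_set, test_set = stratified_split(CLASS_SIZES)
train_counts = Counter(label for _, label in train_set)
test_counts = Counter(label for _, label in test_set)
print(len(train_set), len(test_set))
```

With these class sizes the split yields 8,000 training and 2,000 testing images, and every class keeps the same 80/20 ratio, which matters here because ‘normal’ is much larger than the other classes.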

2.2. EEG data

The second dataset consists of EEG recordings with over 12,000 rows of data, corresponding to 100 unique one-minute intervals collected while the subjects watched a video, i.e. roughly two rows per second. Preprocessing this dataset involved extracting features and targets from the recordings in order to analyze cognitive states through the ‘Meditation’ and ‘Attention’ signals. Specifically, for each 60-second segment, the first 20 rows (representing the initial 10 seconds) were selected as feature rows and the next 10 rows (representing the following 5 seconds) as target rows; the rest of the segment was split in the same way until the whole segment had been covered. For each sample we track the ‘Meditation’ and ‘Attention’ metrics, which help determine whether the subject is conscious. Samples with low average scores in these metrics were labeled ‘1’, indicating significant lapses in attention or calmness, while the rest were labeled ‘0’.

Similar to the image dataset, the EEG dataset was split into 80% for training and 20% for testing. This division is intended to ensure that the model is trained on a representative sample of the data and validated on an independent subset, so that its predictive accuracy can be assessed reliably.
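The windowing and labeling of the EEG recordings described above can be sketched as follows. This is a minimal illustration, not the project's code: the row format (dictionaries with ‘Attention’ and ‘Meditation’ keys), the threshold value, and the rule of averaging the two scores together are assumptions, since the report does not state them.

```python
from statistics import mean

FEATURE_ROWS = 20  # 10 seconds at about 2 rows per second
TARGET_ROWS = 10   # the following 5 seconds
LOW_SCORE_THRESHOLD = 40.0  # hypothetical cut-off; the report does not give the value


def make_samples(segment, n_feat=FEATURE_ROWS, n_target=TARGET_ROWS):
    """Cut one recording segment into consecutive (features, target) blocks."""
    step = n_feat + n_target
    samples = []
    for start in range(0, len(segment) - step + 1, step):
        features = segment[start:start + n_feat]
        target = segment[start + n_feat:start + step]
        samples.append((features, target))
    return samples


def label_target(target_rows, threshold=LOW_SCORE_THRESHOLD):
    """Return 1 when the average of Attention and Meditation is low, else 0."""
    avg = mean((row["Attention"] + row["Meditation"]) / 2 for row in target_rows)
    return 1 if avg < threshold else 0


# Synthetic one-minute segment (120 rows): alert for the first half, lapsing afterwards.
segment = [{"Attention": 70.0, "Meditation": 70.0} for _ in range(60)]
segment += [{"Attention": 20.0, "Meditation": 20.0} for _ in range(60)]

samples = make_samples(segment)
labels = [label_target(target) for _, target in samples]
print(len(samples), labels)
```

A 120-row segment produces four non-overlapping samples of 30 rows each; overlapping windows would give more training samples but are not what the text describes.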

Methodology

In our model, we integrated a CNN and GRUs to leverage their complementary strengths in feature extraction and sequence processing. The CNN processes the spatial information in visual inputs, which is crucial for detecting immediate physical indicators of the driver’s state, such as eye and mouth movements. Meanwhile, GRUs are well suited to sequential EEG data, capturing ongoing neurological patterns that reflect changes over time. This integration is intended to improve the accuracy and dependability of our system in detecting critical states such as drowsiness or impairment, and to support real-time processing so that the system can act swiftly, for example by activating safety alerts or initiating autonomous control. In addition, because GRUs learn from sequences, the system could be adapted to an individual’s driving behavior, which may further improve its accuracy and reliability.
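The report does not spell out how the outputs of the two branches are combined into a single decision. One simple possibility, shown here purely as an assumption, is an OR rule: flag the driver when the CNN's most likely class is one of the four abnormal states, or when the GRU branch predicts a lapse with high enough probability. The function name `should_hand_over` and the 0.5 threshold are hypothetical.

```python
ABNORMAL_STATES = {"Drowsy", "Drunk", "Angry", "Painful"}


def should_hand_over(cnn_probs, eeg_lapse_prob, threshold=0.5):
    """Return True when either branch suggests the driver should not keep driving.

    cnn_probs: mapping from class name to softmax probability (image branch).
    eeg_lapse_prob: probability of label '1' from the EEG branch.
    """
    top_state = max(cnn_probs, key=cnn_probs.get)
    face_flag = top_state in ABNORMAL_STATES
    eeg_flag = eeg_lapse_prob >= threshold
    return face_flag or eeg_flag


calm = {"normal": 0.70, "Drowsy": 0.10, "Drunk": 0.05, "Angry": 0.10, "Painful": 0.05}
drowsy = {"normal": 0.30, "Drowsy": 0.60, "Drunk": 0.04, "Angry": 0.03, "Painful": 0.03}

print(should_hand_over(calm, 0.2))
print(should_hand_over(drowsy, 0.1))
print(should_hand_over(calm, 0.8))
```

An OR rule errs on the side of caution: either signal alone is enough to trigger an alert, at the cost of more false alarms than a rule that requires both branches to agree.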

The rest of this section elaborates on the technical details of the CNN and GRU parts of our model.

3.1. CNN

Our CNN architecture (Figure 1) is designed to process image data through a multi-layered approach. The model starts with a convolutional layer named ‘conv2d_13’ that contains 32 filters of size 3×3, which extract low-level features such as edges. Following this layer, batch normalization is applied to stabilize and accelerate learning by normalizing the activations of the preceding layer, which helps maintain learning efficiency across the network. The architecture then incorporates a max pooling layer, which reduces the spatial dimensions while preserving the important features of the input data.

The model then feeds the information into three more convolutional layers, namely ‘conv2d_14’, ‘conv2d_15’, and ‘conv2d_16’, each followed by a batch normalization layer and a max pooling layer. In ‘conv2d_16’, the number of filters increases to 256, which allows the network to capture the more intricate patterns needed for accurate predictions.

After the convolutional stages, the feature maps are flattened and passed through two dense layers, the last of which applies the softmax function to compute the prediction probabilities.

Figure 1 CNN architecture of our model
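To make the effect of the convolution and pooling stages concrete, the sketch below traces the feature-map size and the number of convolution weights through the four stages. Several details are assumptions, because the report only states 32 filters of size 3×3 for the first layer and 256 filters for the last: a 48×48 single-channel input, filter counts of 64 and 128 for the two middle layers, 3×3 kernels throughout, ‘same’ padding with stride 1, and 2×2 max pooling.

```python
def trace_cnn(input_hw=48, input_channels=1, filters=(32, 64, 128, 256), kernel=3, pool=2):
    """Return one (layer name, output size, channels, conv parameters) tuple per stage."""
    size, channels = input_hw, input_channels
    stages = []
    for index, n_filters in enumerate(filters, start=13):
        # 'same' padding with stride 1 keeps the spatial size unchanged.
        # Each filter has kernel*kernel*channels weights plus one bias.
        params = (kernel * kernel * channels + 1) * n_filters
        # Batch normalization leaves the shape unchanged; pooling divides the size.
        size //= pool
        channels = n_filters
        stages.append((f"conv2d_{index}", size, channels, params))
    return stages


stages = trace_cnn()
for name, size, channels, params in stages:
    print(f"{name}: output {size}x{size}x{channels} after pooling, {params} conv parameters")

_, last_size, last_channels, _ = stages[-1]
flattened = last_size * last_size * last_channels
print("flattened features:", flattened)
```

Under these assumptions the spatial size halves at each stage (48, 24, 12, 6, 3) while the channel count grows, so the dense layers receive a 2,304-element vector.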

3.2. GRUs

The other part of our model uses GRUs (Figure 2) to predict whether the driver is conscious or unconscious, based on the ‘Attention’ and ‘Meditation’ signals in the input dataset. The GRU component forecasts the next 5 seconds by analyzing the EEG data in each ‘feature’ sample, which spans 10 seconds. It focuses on the driver’s attention level and provides insight into the driver’s current capacity to operate the vehicle safely.

Figure 2 Architecture for GRUs of our model
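For readers unfamiliar with GRUs, the following sketch shows a single GRU update in plain Python and runs it over one 10-second feature window. It only illustrates the standard gating mechanism: the weights are random and untrained, the hidden size of 4 and the choice of 2 input features per row are assumptions, and the prediction layer that would sit on top of the final state is omitted. The actual layer configuration of the project is the one in Figure 2.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU update: the gates decide how much of the previous state to keep."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    # Update gate z and reset gate r, both in (0, 1).
    z = [sigmoid(a + b + c) for a, b, c in zip(matvec(Wz, x), matvec(Uz, h_prev), bz)]
    r = [sigmoid(a + b + c) for a, b, c in zip(matvec(Wr, x), matvec(Ur, h_prev), br)]
    # Candidate state, computed from the input and the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b + c) for a, b, c in zip(matvec(Wh, x), matvec(Uh, gated), bh)]
    # Convention used here: z close to 1 keeps the old state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def random_params(n_in, n_hidden, seed=0):
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = []
    for _ in range(3):  # update gate, reset gate, candidate state
        params += [mat(n_hidden, n_in), mat(n_hidden, n_hidden), [0.0] * n_hidden]
    return params


# One feature window: 20 time steps (10 s) with 2 synthetic features per row.
rng = random.Random(1)
sequence = [[rng.random(), rng.random()] for _ in range(20)]
params = random_params(n_in=2, n_hidden=4)
h = [0.0] * 4
for x in sequence:
    h = gru_step(x, h, params)
print(h)
```

The final hidden state summarizes the whole window; in a trained model a dense layer would map it to the prediction for the following 5 seconds.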

In addition, to make our model easier to tune, we integrated Weights & Biases (Wandb) for automatic hyperparameter adjustment (Figure 3). This integration searches over the model’s hyperparameters across training runs, using the logged metrics as feedback to improve model performance. Using Wandb streamlines the training process by automating tedious manual adjustments, and it helped improve the accuracy and reliability of the model’s predictions.

Figure 3 Parameter adjustment by Wandb
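As an illustration of what such a search looks like, the sketch below defines a search space in the dictionary layout used by Weights & Biases sweeps (`method`, `metric`, `parameters`) and enumerates the combinations a grid search would try. The hyperparameter names and values are hypothetical, since the report does not list which ones were tuned. With the `wandb` package installed and configured, a dictionary of this shape is what `wandb.sweep` accepts and `wandb.agent` then executes; those calls are left out so that the example runs offline.

```python
from itertools import product

# Hypothetical search space; the report does not list the tuned hyperparameters.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-2, 1e-3, 1e-4]},
        "batch_size": {"values": [32, 64]},
        "dropout": {"values": [0.3, 0.5]},
    },
}


def grid_runs(config):
    """List every hyperparameter combination a grid search would try."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


runs = grid_runs(sweep_config)
print(len(runs))
print(runs[0])
```

A grid over three learning rates, two batch sizes, and two dropout rates gives 12 training runs; for larger spaces a random or Bayesian search method is usually a cheaper choice than a full grid.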

Results

After multiple rounds of assessing and refining our models, we reached an accuracy of 80.43% for the CNN part of our model (Figure 4) and over 80% for the GRU part (Figure 5).

Figure 4 CNN accuracy

Figure 5 GRU accuracy

Challenges

Most of the challenges we faced were related to data. We were not able to collect a single, more encompassing dataset that describes the overall condition of drivers, including their physiological and emotional states. As a result, we could only train our model on separate datasets from different sources.

Moreover, we were not able to collect data covering a wider range of demographics. People in different regions may have different driving habits, and our current datasets do not capture these differences.

Reflections and discussions

6.1. The reflection questions

How do you feel your project ultimately turned out? How did you do relative to your base/target/stretch goals?

We feel the project ultimately turned out quite well. We achieved our target goals of integrating multiple models into our framework and processing different types of data to detect the driver’s state. In the future, we would like to further improve our algorithms and their computational efficiency, as discussed in the next section.

Did your model work out the way you expected it to?

The model worked out the way we expected: it can process different types of data and return a definite judgment on whether it is appropriate for the driver to keep driving.

How did your approach change over time? What kind of pivots did you make, if any? What would you have done differently if you could do your project over again?

We pivoted on our use of datasets. We initially used only one dataset. After further research, however, we wanted to make the model more encompassing, so we included four additional datasets to train it.

What do you think you can further improve on if you had more time?

We could further improve the algorithm, its computational efficiency, and the variety of the training and testing data. We discuss this further in the next section.

What are your biggest takeaways from this project/what did you learn?

First, we were all very excited to have the opportunity to apply what we learned in the Deep Learning course to a real-world problem that interests us. Second, we found that a multimodal model lets us build a more encompassing algorithm that processes different types of data and returns more accurate results. Third, we realized how powerful deep learning algorithms can be: if personal data is leaked and used for unethical purposes, it can cause great harm to society. Therefore, deep learning algorithms should only be used for social good.

6.2. Future work

In the longer term, we plan to integrate more features, such as additional physiological indicators including heart rate, blood pressure, and skin conductivity. We also plan to implement a feedback system that refines our algorithm based on users' feedback.

6.3. Extended application of our model

The application of our model is not limited to the traffic and vehicle field. It can be used for healthcare monitoring, e.g. monitoring a patient's condition to see whether they are feeling tired, and for workforce management, e.g. detecting whether workers are drowsy in environments such as construction sites or factories and sending timely signals to prevent accidents. It could also be applied to smart home systems to help monitor residents’ health and respond to abnormal signals.

Conclusions

In conclusion, we developed a model that uses a CNN and GRUs to detect whether the driver of a vehicle is fit to drive, and we reached an accuracy of over 80%. Our model has an end-to-end structure that makes it readily applicable, and it can contribute to the safety of society.

References

Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., and Zafeiriou, S. (2019). 'RetinaFace: Single-stage Dense Face Localisation in the Wild', arXiv:1905.00641. Available at: https://arxiv.org/abs/1905.00641 (Accessed: 25 April 2024).

Fernandes-Magalhaes, R., Carpio, A., Ferrera, D., Van Ryckeghem, D., Peláez, I., Barjola, P., De Lahoz, M.E., Martín-Buro, M.C., Hinojosa, J.A., Van Damme, S., Carretié, L. and Mercado, F. (2023) 'Pain E‑motion Faces Database (PEMF): Pain‑related micro‑clips for emotion research', Behavior Research Methods, 55, pp. 3831–3844. Available at: https://doi.org/10.3758/s13428-022-01992-4 (Accessed: 25 April 2024).

Florez et al. (2023) 'Approach to detect drowsiness in drivers', Applied Sciences, 13(13). Available at: https://www.mdpi.com/2076-3417/13/13/7849 (Accessed: 25 April 2024)

Nasri, I. (2020). Driver Drowsiness Dataset (DDD) [Data set]. Kaggle. Available at: https://www.kaggle.com/datasets/ismailnasri20/driver-drowsiness-dataset-ddd (Accessed: 27 April 2024).

Roboflow Universe. (2022). Drunk Face [Data set]. Roboflow Universe. Available at: https://universe.roboflow.com/new-workspace-8swzs/drunk (Accessed: 25 April 2024).

Sambare, M. et al. (2020). FER2013 [Data set]. Kaggle. Available at: https://www.kaggle.com/datasets/msambare/fer2013 (Accessed: 25 April 2024).

Wang, H. (2019). Confused EEG [Data set]. Kaggle. Available at: https://www.kaggle.com/datasets/wanghaohan/confused-eeg (Accessed: 25 April 2024).