Face Emotion Recognition System
Ilia Zaitsev
April 4, 2017
1 Introduction
The image recognition field of machine learning is one of the most rapidly growing ones.
Nowadays computers can recognize and describe images, identify faces, track moving
objects, and perform many other image processing tasks. A recent breakthrough achieved
by scientists from DeepMind (Google) allows computers to play classic Atari arcades using
only raw screen data and gamepad controls [1].
There is a lot of data available on the web today: texts, images, audio records. It
is possible to successfully implement sophisticated data analysis and prediction systems
based on terabytes and petabytes of media data.
In this project the author focuses on one subdomain of the image recognition field:
face analysis. More precisely, the project's goal is the implementation of an emotion
recognition system that can tell which dominant emotion (happiness, anger, anxiety, etc.)
a person depicted in a photo expresses. Modern computer systems use emotion recognition
to detect a human's mood and predict behaviour. For example, security systems in airports
could analyse passengers' faces to check whether some person is suspiciously anxious and
should be checked by airport security staff. Another possible purpose is to estimate how
a user reacts to an intelligent system's work: an AI robot could analyse a human's face to
understand whether it performs well or correctly performs the task it was asked to do.
Actually, facial expression is one of the most important ways people interact with each
other, so it would definitely be helpful for a machine to understand this non-verbal way
of communication.
Also, the problem of recognising human emotions is definitely solvable. One of the
prominent examples is Affectiva, a company that develops products in the area of emotion
recognition. Another example is the recent EmotioNet Challenge, a competition where
several research groups competed in developing emotion recognition systems. The winner's
final score was around 60% for basic and compound emotion classification.
2 Definition
This section provides a preliminary discussion of the specific problem selected to be
solved using deep learning techniques. The next paragraphs briefly formulate the problem
and a possible solution, as well as the quality metrics that can be used to validate the
provided solution.
Figure 1: Image recognition as supervised learning problem.
2.1 Problem Statement
The problem is to implement a system that could recognise a human's dominant emotion
from a picture of their face. It is clear that this problem could be solved using the
supervised learning approach: if there is an appropriately labelled dataset of human facial
expressions, then any classification technique could be used to predict emotions for
unlabelled faces.

Figure 1 shows the conceptual idea behind image recognition: each image pixel is
treated as a feature.
The only issue is that image data can be quite high-dimensional, and neighbouring
pixels are correlated with each other. Therefore, even if simple ML models could be
applied to this problem, their results would probably not be as good as possible. One
of the more advanced approaches to this problem is the application of Deep Learning
(DL) methods and Artificial Neural Networks (ANN). There are many examples [2, 3, 4]
where the authors use ANNs to solve or improve the solution of the emotion
classification task.
(a) Network with fully connected layers
(b) Network with convolution layer
Figure 2: Difference between fully connected and convolutional layers
The most successful ANN architectures in image classification are convolutional
networks (ConvNets), i.e. networks that, alongside fully connected layers, perform
operations called convolutions. Figure 2 schematically shows the difference between the
two types of layers. Essentially, ConvNets extract a set of different features from
the image, and each layer represents more abstract features than the previous one (i.e.
from pixels to edges, from edges to angles, etc.) [8]. Therefore, applying convolutional
neural networks (CNN) seems like a reasonable approach to the emotion classification
task.

Figure 3: Confusion matrix example
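To make the convolution operation concrete, the following sketch implements a single "valid" convolution pass in plain NumPy (as in most DL frameworks, it is technically a cross-correlation; the function name `conv2d_valid` and the toy edge-detecting kernel are illustrative, not from the project code):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image ('valid' padding) and return
    the map of dot products, one per kernel position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes left-to-right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0                    # right part bright, left part dark
kernel = np.array([[1.0, -1.0]] * 2)  # 2x2 vertical-edge detector
feature_map = conv2d_valid(image, kernel)
# feature_map is nonzero only along the dark/bright boundary
```

Stacking several such learned kernels, with nonlinearities in between, is what lets each convolutional layer express progressively more abstract features.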
Since the problem belongs to the field of supervised learning, the effectiveness of its
solution can be measured using the common metrics applied to estimate classification
quality. Measures like precision, recall, and ROC curves can be applied to check how well
the model performs. Also, once trained, the model can easily be transferred to other
machines and applied to other datasets, so both the problem and its solution can be
replicated and verified as many times as needed.
2.2 Evaluation Metrics
Since the objective is to train a classifier, standard classification metrics can be
used. For example, consider the matrix depicted in Figure 3. Let's pretend that this
matrix is the result of classifying some dataset that contains only 3 emotions. Using
this matrix, the True Positive (TP), True Negative (TN), False Positive (FP), and False
Negative (FN) values can be derived for each emotion. The TP value shows how many
records with a specific emotion are recognised correctly. The TN value shows how many
records that do not belong to a certain emotion were indeed not classified as that
emotion. The remaining two values are Type I and Type II errors, respectively, which
may be familiar to anyone who works with statistical methods and hypothesis testing.
Using the aforementioned values, several metrics can be calculated to estimate
classifier performance. For example, here are the formulas for the precision and recall
metrics:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
Another example is the F1-score, calculated as the harmonic mean of precision and
recall:

F1 = 2 × TP / (2 × TP + FP + FN)
Based on the original values and derived scores, several other metrics can be
calculated. Each of them should be familiar to anyone working with data classification
and statistical analysis, so they can easily be verified.
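As an illustration, these formulas can be computed directly from a confusion matrix. The sketch below uses a made-up 3-emotion matrix in the spirit of Figure 3 (the numbers are invented for the example):

```python
import numpy as np

# Hypothetical confusion matrix for 3 emotions: rows = true class, cols = predicted.
cm = np.array([[50, 10,  5],
               [ 8, 60, 12],
               [ 2,  9, 44]])

tp = np.diag(cm).astype(float)   # correctly classified, per class
fp = cm.sum(axis=0) - tp         # predicted as the class, but actually another one
fn = cm.sum(axis=1) - tp         # belong to the class, but predicted as another one

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
```

Averaging these per-class vectors (plain or weighted by class support) gives the single-number summaries usually quoted in classification reports.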
Another possible approach is to apply cluster analysis and data segmentation
techniques and to calculate similarity metrics based on distances between examples. This
approach makes it possible to check how well the trained classifier works on new
examples, i.e. whether images marked with the same emotion are close to each other in
terms of some similarity metric or not.
3 Analysis
This section provides an overview of the selected dataset. It is followed by a brief
description of the selected classification approaches, as well as the definition of the
benchmarking models used to set up a performance threshold.
3.1 Data Exploration
The explored dataset is taken from the Facial Expression Recognition Challenge [5]
hosted on Kaggle. This dataset contains 35,887 examples of 48 × 48 images depicting
human faces that express certain emotions, along with textual labels naming those
emotions. Figure 4 shows 16 examples taken from the dataset.

Figure 5 shows the list of emotions represented in the dataset and the number of
images labelled with each emotion. The dataset is "unbalanced", i.e. some emotions are
represented more widely while others have far fewer examples. This property of the
dataset should be considered when applying machine learning techniques.
It should also be noted that some images contain watermarks of the sources they were
taken from, and some others are actually not human photos but cartoons or faces
generated with computer graphics. An attempt was made to remove such examples using
the OpenCV cascade face recognizer, but this simple approach didn't help much: a lot of
valid samples were filtered out. Therefore, the decision was made to keep all available
samples as they are. Moreover, these outliers could help make the classifier more
robust, especially in the case of deep networks.
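For illustration, the label frequencies behind a plot like Figure 5 take only a few lines to compute, assuming the standard fer2013.csv layout with `emotion`, `pixels`, and `Usage` columns (a tiny inline sample stands in for the real file here):

```python
import csv
import io
from collections import Counter

# The Kaggle dataset ships as a CSV with columns: emotion, pixels, Usage.
# A tiny inline sample stands in for the real fer2013.csv file.
sample = io.StringIO(
    "emotion,pixels,Usage\n"
    "3,70 80 82,Training\n"
    "0,151 150 147,Training\n"
    "3,24 32 36,PublicTest\n"
)

# Label indices 0..6, following the challenge description.
names = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

counts = Counter(names[int(row["emotion"])] for row in csv.DictReader(sample))
# counts -> Counter({'Happy': 2, 'Angry': 1})
```

Running the same loop over the full file is what reveals the imbalance mentioned above (e.g. the Disgust class is heavily underrepresented).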
Some other datasets were also considered. One of them was the Cohn-Kanade Expression
Database [6]. This dataset contains several hundred series of faces with emotion labels.
Each sequence contains pictures of a face expressing a certain emotion, starting from a
neutral state and finishing with the emotional peak. The dataset was used in [2] to
solve a similar problem. But as mentioned above, it seems too small in the context of
deep learning methods while having quite a large volume (≈ 2 GB).

Figure 4: Samples from faces dataset

Figure 5: Represented emotions counts
3.2 Algorithms and Techniques
As already mentioned, one of the most successful approaches to image recognition is
the usage of the DL framework and ANNs. The DL approach uses nonlinear approximators
to extract features from objects. Image data is known to have a lot of features: the
higher a specific image's resolution, the more pixels it has and therefore the more
features. Other machine learning techniques can have quite a rough time trying to
classify datasets with a high number of features. Also, image features are obviously
correlated with each other (neighbouring pixels will most probably have similar colors
and intensities) and hierarchical (i.e. from pixel colors to edges and outlines, and on
to separate objects).

Therefore, the key idea is to choose one of the established CNN architectures that
will allow automatically extracting the most prominent features from images, and to use
it to classify face emotions.
The solution can easily be reproduced by choosing the same network architecture and
the same network parameters (i.e. weights and biases). Also, classification tasks have a
clear definition of the performance measure: basically, the number of correctly
classified records.
3.3 Benchmark Models
It is important to have a couple of "baseline" models that provide a lower bound for
the classification accuracy achievable on the selected dataset. The following models
were selected as such baseline solutions:
• Multiclass logistic regression (one-vs-all approach)
• Extremely randomized trees
The basic logistic model is used for binary classification tasks: it estimates how
probable it is that a specific sample belongs to a specific class, and if the probability
is high enough, the sample is assigned to the class. Since there are 7 emotions in the
analysed dataset, a single logistic model is not enough. The one-vs-all approach handles
this issue by training 7 classifiers, each intended to recognise a single emotion class
as the positive result and all the rest as negative. Therefore, for each dataset sample,
these classifiers together give a probability vector, and the emotion class with the
highest probability is accepted as the classification result.
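The voting step can be sketched as follows: given the 7 per-class probabilities produced by the binary classifiers (the numbers below are invented for the example), the predicted emotion is simply the argmax:

```python
import numpy as np

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Hypothetical outputs of 7 one-vs-all classifiers for 2 samples:
# probs[i, k] = probability that sample i belongs to emotion k.
probs = np.array([
    [0.10, 0.05, 0.20, 0.85, 0.30, 0.15, 0.40],
    [0.70, 0.10, 0.60, 0.05, 0.65, 0.10, 0.20],
])

# The emotion whose binary classifier is most confident wins.
predicted = [EMOTIONS[k] for k in probs.argmax(axis=1)]
# predicted -> ['Happy', 'Angry']
```

Note that the 7 probabilities need not sum to one, since each comes from an independent binary model; only their relative order matters here.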
Ensemble classification methods are based on the notion of weak learners. A weak
learner is a classification model whose accuracy is only slightly higher than that of a
random classifier. The key idea is that if one brings together a huge number of weak
learners, then together they will perform much better. A decision tree is one possible
weak learner, and a random forest is a collection of decision trees, each trained on a
random subset of the data features. Extremely randomized trees bring even more
randomness into the training process and essentially consist of "randomizing strongly
both attribute and cut-point choice while splitting a tree node" [7].
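A quick simulation illustrates why combining weak learners works, assuming (unrealistically) independent learners that are each correct 55% of the time:

```python
import random

random.seed(0)

def accuracy(n_learners, n_trials=2000, p=0.55):
    """Estimate how often a majority vote over n_learners independent
    weak learners (each correct with probability p) is itself correct."""
    correct = 0
    for _ in range(n_trials):
        votes = sum(random.random() < p for _ in range(n_learners))
        if votes > n_learners // 2:  # majority voted for the true class
            correct += 1
    return correct / n_trials

single = accuracy(1)       # close to 0.55, as expected
ensemble = accuracy(101)   # well above 0.8 for independent learners
```

Real trees in a forest are correlated, so the gain is smaller in practice; randomizing features and cut-points, as extremely randomized trees do, is precisely a way to reduce that correlation.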
Both of these techniques are able to solve classification problems and are quite often
used as basic benchmark solutions (many Kaggle competitions include the results of
applying these models as threshold values). The accuracy of both models can be clearly
expressed using a confusion matrix and accuracy scores, and will be used as a basic
value that should be exceeded by the model built with the DL approach. Also, since the
dataset was taken from Kaggle, the competition's leaderboard can be taken into account
as well.
4 Methodology
This section describes the applied data preprocessing steps, the classifier
implementation, and the validation process. The benchmark models' results are compared
with the deep networks' accuracy, and ROC curve plots and other classification metrics
are provided to compare the quality of the solutions.
4.1 Data Preprocessing
The benchmarking models accept image data as "plain" arrays, i.e. each image (recall
that it has a 48 × 48 shape) is represented as a vector:

x_img = (x_1, x_2, ..., x_2304)
And the only data preprocessing step used with the benchmarking models is the min-max
transformation applied to each pixel (i.e. each element of the vector), defined as
follows:

x_tr^(i) = (x_img^(i) − min x_img) / (max x_img − min x_img)

This transformation scales each element into the range [0, 1]. (Actually, in the case of
the grayscale 8-bit images used in this project, the transformation works the same way
as simply dividing by 255.) The same preprocessing was applied before training the
feedforward networks.
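A minimal sketch of this step (illustrative names, a random stand-in image) also shows the equivalence with dividing by 255 when the full intensity range is present:

```python
import numpy as np

rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=2304, dtype=np.uint8)  # flattened 48x48 image
img[0], img[1] = 0, 255  # ensure the full 0..255 intensity range is present

# Generic min-max scaling into [0, 1] ...
x = img.astype(np.float64)
minmax = (x - x.min()) / (x.max() - x.min())

# ... which, for 8-bit images spanning the full 0..255 range,
# coincides with simply dividing by 255.
by_255 = x / 255.0
```

For images that do not span the full range, the two variants differ; dividing by 255 has the advantage of applying the exact same mapping to every image.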
But the convolutional neural networks (CNN) used at the final stage were supplied with
more involved preprocessing steps. First of all, a CNN uses spatial information to
achieve good prediction results. Therefore, images should be reshaped back into their
original 2D format:
X_cnn = | x_{1,1}   x_{1,2}   ...  x_{1,48}  |
        | x_{2,1}   x_{2,2}   ...  x_{2,48}  |
        |   ...       ...     ...    ...     |
        | x_{48,1}  x_{48,2}  ...  x_{48,48} |
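The reshaping step is a one-liner in NumPy; the extra trailing channel axis shown here assumes a Keras-style `channels_last` input layout:

```python
import numpy as np

flat = np.arange(2304, dtype=np.float32)  # a stand-in for one scaled image vector

# Benchmark models consume the flat vector; CNNs need the spatial 2D layout back.
# Keras-style convolution layers also expect an explicit channel axis, so the
# final shape is (48, 48, 1) for a single grayscale image.
as_2d = flat.reshape(48, 48)
as_cnn_input = flat.reshape(48, 48, 1)

# Row-major reshaping preserves pixel order: element (i, j) is flat[i * 48 + j].
```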
The second preprocessing step is data augmentation, described in [8, 9]. (Following
[8], the data augmentation process could be treated as a sort of regularization
technique.) The key idea of data augmentation is to introduce small perturbations into
the dataset, like random shifts, scaling, cropping, etc., which should shrink the
network's variance and make it more robust to noise. This step wasn't used with every
deep model, but the final selected model was trained with this technique.
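In practice this step is usually delegated to a library (e.g. Keras' ImageDataGenerator); the hand-rolled sketch below shows just the random-shift part of the idea (`random_shift` is an illustrative name, not from the project code):

```python
import numpy as np

rng = np.random.default_rng(7)

def random_shift(img, max_shift=4):
    """Return a copy of a 2D image randomly shifted by up to max_shift
    pixels along each axis, padding the uncovered border with zeros."""
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    h, w = img.shape
    out = np.zeros_like(img)
    # Copy the overlapping region between the original and shifted frames.
    out[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = \
        img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    return out

face = np.ones((48, 48), dtype=np.float32)          # stand-in for one image
augmented = [random_shift(face) for _ in range(8)]  # 8 jittered variants
```

Each training epoch then sees slightly different versions of the same faces, which is exactly what makes the network less sensitive to small positional noise.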
As noted in section 3.1, an attempt was made to filter bad samples out of the dataset,
but this idea was rejected and no samples were dropped.
4.2 Implementation and Selecting the Best Model
Before any deep model was trained, the aforementioned benchmarking models were applied
to the dataset. Figure 6 shows their evaluation results. To choose the optimal model,
grid search with cross-validation was applied. The best parameters found were used to
train each classifier on all available data excluding a small testing subset. As the
benchmark report shows, the extra trees performance exceeds that of logistic regression
and sets a good baseline.
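Conceptually, grid search just scores every combination of parameters and keeps the best one. The sketch below hand-rolls this with a made-up scoring function standing in for "mean cross-validation accuracy" (in the project this role is played by scikit-learn's GridSearchCV, which also handles the cross-validation itself):

```python
from itertools import product

# Parameter grid for a hypothetical logistic regression model.
grid = {
    "C": [0.01, 0.1, 1.0],   # regularization strength candidates
    "max_iter": [100, 200],
}

def cv_score(params):
    """Stand-in for 'mean cross-validation accuracy of a model trained
    with these parameters' (a made-up smooth function for the example)."""
    return 0.4 - abs(params["C"] - 0.1) + params["max_iter"] / 1000.0

# Enumerate every combination and keep the highest-scoring one.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(candidates, key=cv_score)
# best -> {'C': 0.1, 'max_iter': 200}
```

The exhaustiveness is also the weakness: the number of candidates grows multiplicatively with each added parameter, which is why grid search gets expensive quickly.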
The next step was the central part of this work: applying the neural network approach
to classify the dataset. Figure 7 shows all architectures used in this project. Most of
them were inspired by papers [2, 10] where the authors apply similar (but deeper and/or
wider) models. Some other architectures were also considered (i.e. AlexNet, VGG9,
VGG16, etc.), but these are too computationally expensive to be trained and tested
within the scope of this project. Ideas for the regularization and optimization
techniques, the selected activation functions, and optimal values for the ConvNet
parameters were taken (not counting a dozen internet sources) from [8] and [11]. The
most difficult part of the implementation process was to create a robust and easy-to-use
data processing and model training pipeline. To get reproducible and stable results, it
is important to save models during training and to preprocess training data in the same
way for each of the trained models.
The models basic-fc and deeper-fc could be treated as additional baseline performance
metrics. As expected, both of them have shown low accuracy and stopped improving after
several epochs; their accuracy didn't exceed the logistic model's performance.
Therefore, even when neural networks are applied, they can be too "shallow" or too
simple to compete even with linear models.
The models cnn1, cnn3, and cnn6 represent a sequence of ConvNets with increasing depth
and number of convolutional layers. One can imagine them as an evolution from a first
approximation of the classification task's solution to the final one.
Even the simplest model, namely cnn1, has shown a much better accuracy score (above
50%) than the feedforward models, though it has only one convolutional layer. But it
also suffers from overfitting: after approximately 15 epochs the validation loss stops
improving and its curve starts showing a clear U-shape.
The next model, cnn3, with increased depth and additional regularization layers (batch
normalization and more dropout), has shown better results, but still starts to slightly
overfit the data after approximately 40 epochs.

Figure 6: Benchmark models. Left: logistic regression. Right: extra trees

Figure 7: Deep learning architectures used to classify the emotions dataset. Before the
line: simple feedforward networks. After the line: convolutional architectures
The last model, cnn6, has shown further improved performance. Moreover, after 100
training epochs, its validation loss still continued decreasing. Therefore, this model
was selected as the best of the trained classifiers.

Figure 8 shows a performance comparison between these three models. Training and
validation loss plots, as well as some other metrics, are provided in Appendix A.
Figure 8: Deep learning models validation accuracy
4.3 Final Benchmark
To compare cnn6 performance with the benchmarking models described in section 3.3,
ROC curves and a confusion matrix were computed, as well as standard classification
performance metrics.

Figure 9 shows the best model's confusion matrix. Figure 10 shows ROC curves and a
classification accuracy report. Comparing these values with the benchmarking report, it
becomes clear that cnn6 shows better results than the baseline solutions.
As follows from the shown figures, the best recognized emotion classes are Happy,
Surprise, and Disgust. The worst recognized emotion is Fear. These results seem
interesting because the Disgust class was actually underrepresented.
Figure 9: Confusion matrix for cnn6 model
Figure 10: ROC-curves and classification metrics for cnn6 model
5 Conclusion
One could say that a picture is worth a thousand words. As a recap of this report,
figure 11 shows a few examples of random photos of people taken from the Internet via a
search engine, querying for a specific emotion class. Before classification, each photo
was passed through a face recognition cascade; the recognized face rectangle was then
cropped, converted to grayscale, and rescaled down to a 48 × 48 shape. Face 3 in the
aforementioned figure shows that the Sad emotion is one of the most ambiguous for the
classifier.
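The steps after face detection can be sketched in plain NumPy (the project used an OpenCV cascade and presumably cv2 for resizing; the luminosity weights and block averaging below are an illustrative substitute, and `to_model_input` is a hypothetical name):

```python
import numpy as np

def to_model_input(rgb_face, size=48):
    """Convert a cropped RGB face (H, W, 3) to a grayscale size x size image
    using luminosity weights and block averaging. H and W must be multiples
    of `size` for this simple sketch."""
    gray = rgb_face @ np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 weights
    h, w = gray.shape
    # Group pixels into (h//size) x (w//size) blocks and average each block.
    return gray.reshape(size, h // size, size, w // size).mean(axis=(1, 3))

face = np.random.default_rng(1).random((96, 96, 3))  # stand-in for a cropped face
small = to_model_input(face)                         # ready for the 48x48 classifier
```

A real pipeline would also apply the same intensity scaling used at training time before feeding the result to the network.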
Visual verification of the emotion recognition results shows that the classifier is
able to recognize several types of emotions in random photos. However, it should be
noted that the model was tested on quite clean photos with good contrast and a single
person demonstrating a clear facial expression. Nevertheless, the main objective of this
work was achieved: once again it was shown that even relatively small convolutional
neural networks with a simple architecture can successfully solve image classification
tasks. The achieved accuracy is reasonable for such a small network and limited training
time.
The possible ideas for further improvement are:
• A deeper/wider model with an increased number of training epochs
• Meta-parameter tuning: a different learning rate schedule, a different optimization
algorithm, etc.
• Filtering out bad examples and collecting better data in place of the dropped samples
• Automatic model architecture search using stochastic optimization techniques [13]
The most difficult part of training DL models is their computational expensiveness, as
well as the huge number of possible structures. The same is true for model selection:
grid search and cross-validation can take a lot of time to perform. The theoretical
background behind these techniques can also be quite challenging. But on the other hand,
there are many pre-trained models available online, a dozen DL libraries (Keras, Caffe,
Torch, etc.), and a huge number of tutorials and best practices that allow one to train
and use deep models even without much academic experience.
Figure 11: Emotion classification using the best model (cnn6). Each image was retrieved
from the Internet using a search engine with an emotion class name as part of the query.
6 References
1. V. Mnih, K. Kavukcuoglu, D. Silver et al. - Human-level control through deep
reinforcement learning
2. D. Duncan, G. Shine, C. English - Facial Emotion Recognition in Real Time
3. T. Rao, M. Xu, D. Xu - Learning Multi-Level Deep Representations for Image
Emotion Classification
4. X. Ma, Z. Wu, J. Jia et al. - Study on Feature Subspace of Archetypal Emotions for
Speech Emotion Recognition
5. Facial Expression Recognition Challenge - https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
6. Cohn-Kanade AU-Coded Expression Database - http://www.pitt.edu/~emotion/ck-spread.htm
7. P. Geurts, D. Ernst, L. Wehenkel - Extremely randomized trees
8. I. Goodfellow, Y. Bengio, A. Courville - Deep Learning
9. A. Krizhevsky - Learning Multiple Layers of Features from Tiny Images
10. A. Shcherbina - Tiny ImageNet Challenge
11. J. Brownlee - Deep Learning With Python
12. I. Goodfellow, J. Shlens, C. Szegedy - Explaining and Harnessing Adversarial Examples
13. E. Real et al. - Large-Scale Evolution of Image Classifiers
Appendix A
Figures 12, 13, and 14 show training and validation curves for each of the CNNs trained
in this project. Note that accuracy and categorical accuracy are essentially the same
metric: the first one is a default Keras metric and was included accidentally. For the
sake of transparent results reporting, it was not cut out of the pictures.
Figure 12: Training curves for cnn1 classifier
Figure 13: Training curves for cnn3 classifier
Figure 14: Training curves for cnn6 classifier