Face Emotion Recognition System
Ilia Zaitsev
April 4, 2017
1 Introduction
The image recognition field of machine learning is one of the most rapidly growing ones.
Nowadays computers can recognize and describe images, identify faces, track moving
objects, and perform many other image processing tasks. A recent breakthrough achieved
by scientists from DeepMind (Google) allows computers to play classic Atari arcades using
only raw screen data and gamepad controls [1].
There is a lot of data available on the web today: texts, images, audio records. It
is possible to successfully implement sophisticated data analysis and prediction systems
based on terabytes and petabytes of media data.
In this project the author focuses on one subdomain of the image recognition field:
face analysis. More precisely, the project's goal is the implementation of an emotion
recognition system that can tell which dominant emotion (happiness, anger, anxiety, etc.)
a person depicted in a photo expresses. Modern computer systems use emotion recognition
to detect a human's mood and predict behaviour. For example, security systems in airports
could analyse passengers' faces to check whether some person is suspiciously anxious and
should be checked by airport security staff. Another possible purpose is to estimate how
a user reacts to an intelligent system's work: an AI robot could analyse a human's face to
understand whether it performs well or correctly performs the task it was asked to do.
Actually, facial expression is one of the most important ways people interact with each
other, so it would definitely be helpful for a machine to understand this non-verbal way
of communication.
Also, the problem of recognising human emotions is definitely solvable. One of the
prominent examples is Affectiva, a company that develops products in the area of emotion
recognition. Another example is the recent EmotioNet Challenge, a competition where
several research groups competed in developing emotion recognition systems. The winner's
final score was around 60% for basic and compound emotion classification.
2 Definition
This section provides a preliminary discussion of the specific problem selected to be
solved using deep learning techniques. The next paragraphs briefly formulate the problem
and a possible solution, as well as the quality metrics that can be used to validate the
provided solution.
Figure 1: Image recognition as supervised learning problem.
2.1 Problem Statement
The problem is to implement a system that could recognise a human's dominant emotion
from a picture of their face. It is clear that this problem could be solved using the
supervised learning approach: if there is an appropriately labelled dataset of human facial
expressions, then any classification technique could be used to predict emotions for
unlabelled faces.

Figure 1 shows the conceptual idea behind image recognition: each image pixel is
treated as a feature.
The only issue is that image data can be quite high-dimensional, and neighbouring
pixels are correlated with each other. Therefore, even if simple ML models could be
applied to this problem, their results would probably not be as good as possible. One
of the more advanced approaches to this problem is the application of Deep Learning
(DL) methods and Artificial Neural Networks (ANN). There are many examples [2, 3, 4]
where the authors use ANNs to solve or improve the solution of the emotion
classification task.
(a) Network with fully connected layers
(b) Network with convolution layer
Figure 2: Difference between fully connected and convolutional layers
The most successful ANN architectures in image classification are convolutional
networks (ConvNets), i.e. networks that, alongside fully connected layers, perform
operations called convolutions. Figure 2 schematically shows the difference between the
two types of layers. Essentially, ConvNets extract a set of different features from
the image, and each layer represents more abstract features than the previous one (i.e.
from pixels to edges, from edges to angles, etc.) [8]. Therefore, applying convolutional
neural networks (CNN) seems like a reasonable approach to the emotion classification
task.

Figure 3: Confusion matrix example
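To make the convolution operation concrete, the following sketch implements a single "valid" convolution pass in plain NumPy (as in most DL frameworks, it is technically a cross-correlation; the function name `conv2d_valid` and the toy edge-detecting kernel are illustrative, not from the project code):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image ('valid' padding) and return
    the map of dot products, one per kernel position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes left-to-right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0                    # right part bright, left part dark
kernel = np.array([[1.0, -1.0]] * 2)  # 2x2 vertical-edge detector
feature_map = conv2d_valid(image, kernel)
# feature_map is nonzero only along the dark/bright boundary
```

Stacking several such learned kernels, with nonlinearities in between, is what lets each convolutional layer express progressively more abstract features.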
Since the problem belongs to the field of supervised learning, the effectiveness of its
solution can be measured using the common metrics applied to estimate classification
quality. Measures like precision, recall, and ROC curves can be applied to check how well
the model performs. Also, once trained, the model can easily be transferred to other
machines and applied to other datasets, so both the problem and its solution can be
replicated and verified as many times as needed.
2.2 Evaluation Metrics
Since the objective is to train a classifier, standard classification metrics can be
used. For example, consider the matrix depicted in Figure 3. Let's pretend that this
matrix is the result of classifying some dataset that contains only 3 emotions. Using
this matrix, the True Positive (TP), True Negative (TN), False Positive (FP), and False
Negative (FN) values can be derived for each emotion. The TP value shows how many
records with a specific emotion are recognised correctly. The TN value shows how many
records that do not belong to a certain emotion were indeed not classified as that
emotion. The remaining two values are Type I and Type II errors, respectively, which
may be familiar to anyone who works with statistical methods and hypothesis testing.
Using the aforementioned values, several metrics can be calculated to estimate
classifier performance. For example, here are the formulas for the precision and recall
metrics:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
Another example is the F1-score, calculated as the harmonic mean of precision and
recall:

F1 = 2 × TP / (2 × TP + FP + FN)
Based on the original values and derived scores, several other metrics can be
calculated. Each of them should be familiar to anyone working with data classification
and statistical analysis, so they can easily be verified.
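As an illustration, these formulas can be computed directly from a confusion matrix. The sketch below uses a made-up 3-emotion matrix in the spirit of Figure 3 (the numbers are invented for the example):

```python
import numpy as np

# Hypothetical confusion matrix for 3 emotions: rows = true class, cols = predicted.
cm = np.array([[50, 10,  5],
               [ 8, 60, 12],
               [ 2,  9, 44]])

tp = np.diag(cm).astype(float)   # correctly classified, per class
fp = cm.sum(axis=0) - tp         # predicted as the class, but actually another one
fn = cm.sum(axis=1) - tp         # belong to the class, but predicted as another one

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
```

Averaging these per-class vectors (plain or weighted by class support) gives the single-number summaries usually quoted in classification reports.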
Another possible approach is to apply cluster analysis and data segmentation
techniques and to calculate similarity metrics based on distances between examples. This
approach makes it possible to check how well the trained classifier works on new
examples, i.e. whether images marked with the same emotion are close to each other in
terms of some similarity metric or not.
3 Analysis
This section provides an overview of the selected dataset. It is followed by a brief
description of the selected classification approaches, as well as the definition of the
benchmarking models used to set up a performance threshold.
3.1 Data Exploration
The explored dataset is taken from the Facial Expression Recognition Challenge [5]
hosted on Kaggle. This dataset contains 35,887 examples of 48 × 48 images depicting
human faces that express certain emotions, along with textual labels naming those
emotions. Figure 4 shows 16 examples taken from the dataset.

Figure 5 shows the list of emotions represented in the dataset and the number of
images labelled with each emotion. The dataset is "unbalanced", i.e. some emotions are
represented more widely while others have far fewer examples. This property of the
dataset should be considered when applying machine learning techniques.
It should also be noted that some images contain watermarks of the sources they were
taken from, and some others are actually not human photos but cartoons or faces
generated with computer graphics. An attempt was made to remove such examples using
the OpenCV cascade face recognizer, but this simple approach didn't help much: a lot of
valid samples were filtered out. Therefore, the decision was made to keep all available
samples as they are. Moreover, these outliers could help make the classifier more
robust, especially in the case of deep networks.
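For illustration, the label frequencies behind a plot like Figure 5 take only a few lines to compute, assuming the standard fer2013.csv layout with `emotion`, `pixels`, and `Usage` columns (a tiny inline sample stands in for the real file here):

```python
import csv
import io
from collections import Counter

# The Kaggle dataset ships as a CSV with columns: emotion, pixels, Usage.
# A tiny inline sample stands in for the real fer2013.csv file.
sample = io.StringIO(
    "emotion,pixels,Usage\n"
    "3,70 80 82,Training\n"
    "0,151 150 147,Training\n"
    "3,24 32 36,PublicTest\n"
)

# Label indices 0..6, following the challenge description.
names = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

counts = Counter(names[int(row["emotion"])] for row in csv.DictReader(sample))
# counts -> Counter({'Happy': 2, 'Angry': 1})
```

Running the same loop over the full file is what reveals the imbalance mentioned above (e.g. the Disgust class is heavily underrepresented).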
Some other datasets were also considered. One of them was the Cohn-Kanade Expression
Database [6]. This dataset contains several hundred series of faces with emotion labels.
Each sequence contains pictures of a face expressing a certain emotion, starting from a
neutral state and finishing with the emotional peak. The dataset was used in [2] to
solve a similar problem. But as mentioned above, it seems too small in the context of
deep learning methods while having quite a large volume (≈ 2 GB).

Figure 4: Samples from faces dataset

Figure 5: Represented emotions counts
3.2 Algorithms and Techniques
As already mentioned, one of the most successful approaches to image recognition is
the usage of the DL framework and ANNs. The DL approach uses nonlinear approximators
to extract features from objects. Image data is known to have a lot of features: the
higher a specific image's resolution, the more pixels it has and therefore the more
features. Other machine learning techniques can have quite a rough time trying to
classify datasets with a high number of features. Also, image features are obviously
correlated with each other (neighbouring pixels will most probably have similar colors
and intensities) and hierarchical (i.e. from pixel colors to edges and outlines, and on
to separate objects).

Therefore, the key idea is to choose one of the established CNN architectures that
will allow automatically extracting the most prominent features from images, and to use
it to classify face emotions.
The solution can easily be reproduced by choosing the same network architecture and
the same network parameters (i.e. weights and biases). Also, classification tasks have a
clear definition of the performance measure: basically, the number of correctly
classified records.
3.3 Benchmark Models
It is important to have a couple of "baseline" models that provide a lower bound for
the classification accuracy achievable on the selected dataset. The following models
were selected as such baseline solutions:
• Multiclass logistic regression (one-vs-all approach)
• Extremely randomized trees
The basic logistic model is used for binary classification tasks: it estimates how
probable it is that a specific sample belongs to a specific class, and if the probability
is high enough, the sample is assigned to the class. Since there are 7 emotions in the
analysed dataset, a single logistic model is not enough. The one-vs-all approach handles
this issue by training 7 classifiers, each intended to recognise a single emotion class
as the positive result and all the rest as negative. Therefore, for each dataset sample,
these classifiers together give a probability vector, and the emotion class with the
highest probability is accepted as the classification result.
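The voting step can be sketched as follows: given the 7 per-class probabilities produced by the binary classifiers (the numbers below are invented for the example), the predicted emotion is simply the argmax:

```python
import numpy as np

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Hypothetical outputs of 7 one-vs-all classifiers for 2 samples:
# probs[i, k] = probability that sample i belongs to emotion k.
probs = np.array([
    [0.10, 0.05, 0.20, 0.85, 0.30, 0.15, 0.40],
    [0.70, 0.10, 0.60, 0.05, 0.65, 0.10, 0.20],
])

# The emotion whose binary classifier is most confident wins.
predicted = [EMOTIONS[k] for k in probs.argmax(axis=1)]
# predicted -> ['Happy', 'Angry']
```

Note that the 7 probabilities need not sum to one, since each comes from an independent binary model; only their relative order matters here.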
Ensemble classification methods are based on the notion of weak learners. A weak
learner is a classification model whose accuracy is only slightly higher than that of a
random classifier. The key idea is that if one brings together a huge number of weak
learners, then together they will perform much better. A decision tree is one possible
weak learner, and a random forest is a collection of decision trees, each trained on a
random subset of the data features. Extremely randomized trees bring even more
randomness into the training process and essentially consist of "randomizing strongly
both attribute and cut-point choice while splitting a tree node" [7].
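A quick simulation illustrates why combining weak learners works, assuming (unrealistically) independent learners that are each correct 55% of the time:

```python
import random

random.seed(0)

def accuracy(n_learners, n_trials=2000, p=0.55):
    """Estimate how often a majority vote over n_learners independent
    weak learners (each correct with probability p) is itself correct."""
    correct = 0
    for _ in range(n_trials):
        votes = sum(random.random() < p for _ in range(n_learners))
        if votes > n_learners // 2:  # majority voted for the true class
            correct += 1
    return correct / n_trials

single = accuracy(1)       # close to 0.55, as expected
ensemble = accuracy(101)   # well above 0.8 for independent learners
```

Real trees in a forest are correlated, so the gain is smaller in practice; randomizing features and cut-points, as extremely randomized trees do, is precisely a way to reduce that correlation.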
Both of these techniques are able to solve classification problems and are quite often
used as basic benchmark solutions (many Kaggle competitions include the results of
applying these models as threshold values). The accuracy of both models can be clearly
expressed using a confusion matrix and accuracy scores, and will be used as a basic
value that should be exceeded by the model built with the DL approach. Also, since the
dataset was taken from Kaggle, the competition's leaderboard can be taken into account
as well.
4 Methodology
This section describes the applied data preprocessing steps, the classifier
implementation, and the validation process. The benchmark models' results are compared
with the deep networks' accuracy, and ROC curve plots and other classification metrics
are provided to compare the quality of the solutions.
4.1 Data Preprocessing
The benchmarking models accept image data as "plain" arrays, i.e. each image (recall
that it has a 48 × 48 shape) is represented as a vector:

x_img = (x_1, x_2, ..., x_2304)
And the only data preprocessing step used with the benchmarking models is the min-max
transformation applied to each pixel (i.e. each element of the vector), defined as
follows:

x_tr^(i) = (x_img^(i) − min x_img) / (max x_img − min x_img)

This transformation scales each element into the range [0, 1]. (Actually, in the case of
the grayscale 8-bit images used in this project, the transformation works the same way
as simply dividing by 255.) The same preprocessing was applied before training the
feedforward networks.
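A minimal sketch of this step (illustrative names, a random stand-in image) also shows the equivalence with dividing by 255 when the full intensity range is present:

```python
import numpy as np

rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=2304, dtype=np.uint8)  # flattened 48x48 image
img[0], img[1] = 0, 255  # ensure the full 0..255 intensity range is present

# Generic min-max scaling into [0, 1] ...
x = img.astype(np.float64)
minmax = (x - x.min()) / (x.max() - x.min())

# ... which, for 8-bit images spanning the full 0..255 range,
# coincides with simply dividing by 255.
by_255 = x / 255.0
```

For images that do not span the full range, the two variants differ; dividing by 255 has the advantage of applying the exact same mapping to every image.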
But the convolutional neural networks (CNN) used at the final stage were supplied with
more involved preprocessing steps. First of all, a CNN uses spatial information to
achieve good prediction results. Therefore, images should be reshaped back into their
original 2D format:
X_cnn = | x_{1,1}   x_{1,2}   ...  x_{1,48}  |
        | x_{2,1}   x_{2,2}   ...  x_{2,48}  |
        |   ...       ...     ...    ...     |
        | x_{48,1}  x_{48,2}  ...  x_{48,48} |
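The reshaping step is a one-liner in NumPy; the extra trailing channel axis shown here assumes a Keras-style `channels_last` input layout:

```python
import numpy as np

flat = np.arange(2304, dtype=np.float32)  # a stand-in for one scaled image vector

# Benchmark models consume the flat vector; CNNs need the spatial 2D layout back.
# Keras-style convolution layers also expect an explicit channel axis, so the
# final shape is (48, 48, 1) for a single grayscale image.
as_2d = flat.reshape(48, 48)
as_cnn_input = flat.reshape(48, 48, 1)

# Row-major reshaping preserves pixel order: element (i, j) is flat[i * 48 + j].
```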
The second preprocessing step is data augmentation, described in [8, 9]. (Following
[8], the data augmentation process could be treated as a sort of regularization
technique.) The key idea of data augmentation is to introduce small perturbations into
the dataset, like random shifts, scaling, cropping, etc., which should shrink the
network's variance and make it more robust to noise. This step wasn't used with every
deep model, but the final selected model was trained with this technique.
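In practice this step is usually delegated to a library (e.g. Keras' ImageDataGenerator); the hand-rolled sketch below shows just the random-shift part of the idea (`random_shift` is an illustrative name, not from the project code):

```python
import numpy as np

rng = np.random.default_rng(7)

def random_shift(img, max_shift=4):
    """Return a copy of a 2D image randomly shifted by up to max_shift
    pixels along each axis, padding the uncovered border with zeros."""
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    h, w = img.shape
    out = np.zeros_like(img)
    # Copy the overlapping region between the original and shifted frames.
    out[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = \
        img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    return out

face = np.ones((48, 48), dtype=np.float32)          # stand-in for one image
augmented = [random_shift(face) for _ in range(8)]  # 8 jittered variants
```

Each training epoch then sees slightly different versions of the same faces, which is exactly what makes the network less sensitive to small positional noise.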
As noted in section 3.1, an attempt was made to filter bad samples out of the dataset,
but this idea was rejected and no samples were dropped.
4.2 Implementation and Selecting the Best Model
Before any deep model was trained, the aforementioned benchmarking models were applied
to the dataset. Figure 6 shows their evaluation results. To choose the optimal model,
grid search with cross-validation was applied. The best parameters found were used to
train each classifier on all available data excluding a small testing subset. As the
benchmark report shows, the extra trees performance exceeds that of logistic regression
and sets a good baseline.
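Conceptually, grid search just scores every combination of parameters and keeps the best one. The sketch below hand-rolls this with a made-up scoring function standing in for "mean cross-validation accuracy" (in the project this role is played by scikit-learn's GridSearchCV, which also handles the cross-validation itself):

```python
from itertools import product

# Parameter grid for a hypothetical logistic regression model.
grid = {
    "C": [0.01, 0.1, 1.0],   # regularization strength candidates
    "max_iter": [100, 200],
}

def cv_score(params):
    """Stand-in for 'mean cross-validation accuracy of a model trained
    with these parameters' (a made-up smooth function for the example)."""
    return 0.4 - abs(params["C"] - 0.1) + params["max_iter"] / 1000.0

# Enumerate every combination and keep the highest-scoring one.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(candidates, key=cv_score)
# best -> {'C': 0.1, 'max_iter': 200}
```

The exhaustiveness is also the weakness: the number of candidates grows multiplicatively with each added parameter, which is why grid search gets expensive quickly.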
The next step was the central part of this work: applying the neural network approach
to classify the dataset. Figure 7 shows all architectures used in this project. Most of
them were inspired by papers [2, 10] where the authors apply similar (but deeper and/or
wider) models. Some other architectures were also considered (i.e. AlexNet, VGG9,
VGG16, etc.), but these are too computationally expensive to be trained and tested
within the scope of this project. Ideas for the regularization and optimization
techniques, the selected activation functions, and optimal values for the ConvNet
parameters were taken (not counting a dozen internet sources) from [8] and [11]. The
most difficult part of the implementation process was to create a robust and easy-to-use
data processing and model training pipeline. To get reproducible and stable results, it
is important to save models during training and to preprocess training data in the same
way for each of the trained models.
The models basic-fc and deeper-fc could be treated as additional baseline performance
metrics. As expected, both of them have shown low accuracy and stopped improving after
several epochs; their accuracy didn't exceed the logistic model's performance.
Therefore, even when neural networks are applied, they can be too "shallow" or too
simple to compete even with linear models.
The models cnn1, cnn3, and cnn6 represent a sequence of ConvNets with increasing depth
and number of convolutional layers. One can imagine them as an evolution from a first
approximation of the classification task's solution to the final one.
Even the simplest model, namely cnn1, has shown a much better accuracy score (above
50%) than the feedforward models, though it has only one convolutional layer. But it
also suffers from overfitting: after approximately 15 epochs the validation loss stops
improving and its curve starts showing a clear U-shape.
The next model, cnn3, with increased depth and additional regularization layers (batch
normalization and more dropout), has shown better results, but still starts to slightly
overfit the data after approximately 40 epochs.

Figure 6: Benchmark models. Left: logistic regression. Right: extra trees

Figure 7: Deep learning architectures used to classify the emotions dataset. Before the
line: simple feedforward networks. After the line: convolutional architectures
The last model, cnn6, has shown further improved performance. Moreover, after 100
training epochs, its validation loss still continued decreasing. Therefore, this model
was selected as the best of the trained classifiers.

Figure 8 shows a performance comparison between these three models. Training and
validation loss plots, as well as some other metrics, are provided in Appendix A.
Figure 8: Deep learning models validation accuracy
4.3 Final Benchmark
To compare cnn6 performance with the benchmarking models described in section 3.3,
ROC curves and a confusion matrix were computed, as well as standard classification
performance metrics.

Figure 9 shows the best model's confusion matrix. Figure 10 shows ROC curves and a
classification accuracy report. Comparing these values with the benchmarking report, it
becomes clear that cnn6 shows better results than the baseline solutions.
As follows from the shown figures, the best recognized emotion classes are Happy,
Surprise, and Disgust. The worst recognized emotion is Fear. These results seem
interesting because the Disgust class was actually underrepresented.
Figure 9: Confusion matrix for cnn6 model
Figure 10: ROC-curves and classification metrics for cnn6 model
5 Conclusion
One could say that a picture is worth a thousand words. As a recap of this report,
figure 11 shows a few examples of random photos of people taken from the Internet via a
search engine, querying for a specific emotion class. Before classification, each photo
was passed through a face recognition cascade; the recognized face rectangle was then
cropped, converted to grayscale, and rescaled down to a 48 × 48 shape. Face 3 in the
aforementioned figure shows that the Sad emotion is one of the most ambiguous for the
classifier.
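The steps after face detection can be sketched in plain NumPy (the project used an OpenCV cascade and presumably cv2 for resizing; the luminosity weights and block averaging below are an illustrative substitute, and `to_model_input` is a hypothetical name):

```python
import numpy as np

def to_model_input(rgb_face, size=48):
    """Convert a cropped RGB face (H, W, 3) to a grayscale size x size image
    using luminosity weights and block averaging. H and W must be multiples
    of `size` for this simple sketch."""
    gray = rgb_face @ np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 weights
    h, w = gray.shape
    # Group pixels into (h//size) x (w//size) blocks and average each block.
    return gray.reshape(size, h // size, size, w // size).mean(axis=(1, 3))

face = np.random.default_rng(1).random((96, 96, 3))  # stand-in for a cropped face
small = to_model_input(face)                         # ready for the 48x48 classifier
```

A real pipeline would also apply the same intensity scaling used at training time before feeding the result to the network.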
Visual verification of the emotion recognition results shows that the classifier is
able to recognize several types of emotions in random photos. However, it should be
noted that the model was tested on quite clean photos with good contrast and a single
person demonstrating a clear facial expression. Nevertheless, the main objective of this
work was achieved: once again it was shown that even relatively small convolutional
neural networks with a simple architecture can successfully solve image classification
tasks. The achieved accuracy is reasonable for such a small network and limited training
time.
The possible ideas for further improvement are:
• A deeper/wider model with an increased number of training epochs
• Meta-parameter tuning: a different learning rate schedule, a different optimization
algorithm, etc.
• Filtering out bad examples and collecting better data in place of the dropped samples
• Automatic model architecture search using stochastic optimization techniques [13]
The most difficult part of training DL models is their computational expensiveness, as
well as the huge number of possible structures. The same is true for model selection:
grid search and cross-validation can take a lot of time to perform. The theoretical
background behind these techniques can also be quite challenging. But on the other hand,
there are many pre-trained models available online, a dozen DL libraries (Keras, Caffe,
Torch, etc.), and a huge number of tutorials and best practices that allow one to train
and use deep models even without much academic experience.
Figure 11: Emotion classification using the best model (cnn6). Each image was retrieved
from the Internet using a search engine with an emotion class name as part of the query.
6 References
1. V. Mnih, K. Kavukcuoglu, D. Silver et al. - Human-level control through deep
reinforcement learning
2. D. Duncan, G. Shine, C. English - Facial Emotion Recognition in Real Time
3. T. Rao, M. Xu, D. Xu - Learning Multi-Level Deep Representations for Image
Emotion Classification
4. X. Ma, Z. Wu, J. Jia et al. - Study on Feature Subspace of Archetypal Emotions for
Speech Emotion Recognition
5. Facial Expression Recognition Challenge - https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
6. Cohn-Kanade AU-Coded Expression Database - http://www.pitt.edu/~emotion/ck-spread.htm
7. P. Geurts, D. Ernst, L. Wehenkel - Extremely randomized trees
8. I. Goodfellow, Y. Bengio, A. Courville - Deep Learning
9. A. Krizhevsky - Learning Multiple Layers of Features from Tiny Images
10. A. Shcherbina - Tiny ImageNet Challenge
11. J. Brownlee - Deep Learning With Python
12. I. Goodfellow, J. Shlens, C. Szegedy - Explaining and Harnessing Adversarial Examples
13. E. Real et al. - Large-Scale Evolution of Image Classifiers
Appendix A
Figures 12, 13, and 14 show training and validation curves for each of the CNNs trained
in this project. Note that accuracy and categorical accuracy are essentially the same
metric: the first one is a default Keras metric and was included accidentally. For the
sake of transparent results reporting, it was not cut out of the pictures.
Figure 12: Training curves for cnn1 classifier
Figure 13: Training curves for cnn3 classifier
Figure 14: Training curves for cnn6 classifier