Hands-On AI Part 17: Emotion Recognition from Images Baseline Model

Published: 10/25/2017   Last Updated: 01/29/2019

A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers

In this article, we will be building a baseline convolutional neural network (CNN) model that is able to perform emotion recognition from images. Emotion recognition in our case is a binary classification problem with the goal of discriminating between positive and negative images.

All the code, notebooks, and other materials, including the Dockerfile*, can be found on GitHub*.

Or you can pull it from the repository:

git clone git@github.com:datamonsters/intel-ai-developer-journey.git
cd intel-ai-developer-journey/emotions


The first step in almost all machine learning tasks should be gaining an insight into the data. Let's do it.

Dataset Structure

Raw data can be downloaded from Dropbox* (in the Baseline.ipynb notebook all the steps of this section are done automatically). Initially the data is in a Zip* archive. Let's unpack it and look at the structure of the files.


All the images are stored inside the “dataset 50:50” folder and are distributed across two folders corresponding to their classes, Negative and Positive. Note that the problem is slightly imbalanced: 53 percent of the images are positive and only 47 percent are negative. Generally, the data in a classification problem are considered imbalanced when the numbers of samples in different classes differ greatly. There are a number of ways to deal with imbalanced data, such as undersampling, oversampling, and data reweighting. In our case, the imbalance is small and should not crucially influence the training procedure. We just need to remember that the naive classifier, which always predicts positive, will give around 53 percent accuracy on this dataset.
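As a rough sketch of this sanity check, the snippet below counts the files in each class folder and computes the accuracy of the naive majority-class classifier. The function names are our own, and the example counts are illustrative only (53 and 47 percent of a hypothetical 1,600 images), not the exact numbers from the dataset.

```python
from pathlib import Path


def count_images_per_class(dataset_root):
    """Count the files in each class sub-folder of dataset_root."""
    root = Path(dataset_root)
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }


def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the largest class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Illustrative counts only (assumption): 47% / 53% of 1,600 images.
example_counts = {"Negative": 752, "Positive": 848}
print(majority_baseline_accuracy(example_counts))  # 0.53
```

Any model we train should beat this number to be considered useful.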

Let's take a look at a couple of images of each class.

[Figure: three example images from the Negative class]

[Figure: three example images from the Positive class]


At first glance the images from different classes indeed differ from each other. But let's look more closely and try to find bad examples, that is, images from different classes that are similar.

For example, there are approximately 90 images of snakes marked as negative and around 40 very similar images of snakes labeled as positive.

Positive Snake

Negative Snake

The same duality takes place with spiders (130 negative and 20 positive), nude persons (15 negative and 45 positive), and some other classes. It seems that the labelling was done by different people, and individual perception of the same image may vary. Thus the labelling contains inherent inconsistency. These two images of snakes are virtually the same, yet different assessors assigned them to different classes. We can therefore conclude that it is hardly possible to reach 100 percent accuracy in this problem due to the nature of the task. We believe that a more realistic target is around 80 percent accuracy, judging by the fraction of similar images in different classes observed during the preliminary visual inspection.

Train and Validation Split

We always want to build as good a model as possible. But what is a good model? There might be many different criteria such as quality, execution time (training and inference), and memory consumption. While some of them can be easily and objectively measured (time, memory) others (quality) might be hard to determine. For example, your model can give 100 percent accuracy on training examples that were seen many times during training but fail on the examples that are new. This problem is called overfitting and is one of the most important in machine learning. There is also an underfitting problem when the model cannot learn from the data and gives poor predictions even on the training set.

To tackle the overfitting problem we use the hold-out sample technique. The idea of this technique is to split the initial data into two parts:

  • Training set, which usually constitutes the bigger part of the dataset and is used for training the model.
  • Validation set, which is usually a small fraction of the initial data. It is split off before any training, is never used in training, and is treated as a new sample to test on after the training has finished.

Using this technique we can see how well our model generalizes, that is, how well it works on previously unseen examples.

In this article, we set the ratio of the train and validation set sizes to 4:1. One more trick that we use is called stratification, which means that we split each class independently of all other classes. It allows us to keep the same balance between the sizes of the classes in the training and validation sets. Stratification implicitly follows the hypothesis that the distribution of the examples does not change over the data and will stay the same in new samples.

[Figure: stratified sampling schematic (from Wikipedia*)]

Let's illustrate the stratification concept using a simple example. Assume that we have four groups/classes of data with the corresponding number of objects in them: children (5), adolescents (10), adults (80), and seniors (5); see the figure above. Now we want to split the data into two samples in the proportion of 3/2. With stratified sampling we independently take objects from each group: 2 objects from the children group, 4 objects from adolescents, 32 objects from adults, and 2 objects from seniors. The new sample contains 40 objects, which is exactly 2/5 of the initial data. At the same time the balance among classes in the new sample corresponds to the balance in the initial data.
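The example above can be reproduced with a few lines of plain Python. This is a minimal sketch with our own function name, stratified_split; it is not the prepare_data function from utils.py, and it rounds each per-class sample size to the nearest integer.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction, seed=0):
    """Return (train_indices, holdout_indices), sampling each class separately."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, holdout = [], []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)
        # Each class contributes the same fraction of its own objects.
        n_holdout = round(len(indices) * holdout_fraction)
        holdout.extend(indices[:n_holdout])
        train.extend(indices[n_holdout:])
    return train, holdout


labels = (["children"] * 5 + ["adolescents"] * 10
          + ["adults"] * 80 + ["seniors"] * 5)
train_idx, holdout_idx = stratified_split(labels, holdout_fraction=2 / 5)
print(len(train_idx), len(holdout_idx))  # 60 40
```

The smaller sample ends up with 2, 4, 32, and 2 objects from the four groups, so the class proportions match those of the full data.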

All the steps described above are implemented in one function called prepare_data, which can be found in the utils.py Python* file. It downloads the data, splits them into training and validation sets using a fixed seed (for reproducibility), and redistributes them in the proper way between directories on the hard drive for future usage.

Preprocessing and Augmentation

In one of the previous articles, we described the preprocessing steps and possible reasons to use them in the form of data augmentation. Convolutional neural networks are quite complex models and require a lot of data to train. Here we have only 1600 examples, which is obviously not enough.

Therefore, we want to extend the dataset using data augmentation. As it was pointed out in the preprocessing article, Keras* provides us with the opportunity to augment the data on-the-fly while reading it from the hard drive. It can be done via the ImageDataGenerator class.


Here we create two instances of the generators. The first one is for training and applies many random transformations, such as rotation, shift, shearing, zoom, and horizontal flip, while reading the data from the disk and feeding them into the model. Therefore our model receives already transformed examples, and because the transformations are random, they are different every time the model sees the example. The second one is for validation and it only rescales images. The validation and training generators have only one transformation in common: rescaling. We want our data in the range [0, 1] instead of [0, 255] for numerical stability.
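To make the difference between the two pipelines concrete, here is a toy sketch in plain Python that operates on a nested list of pixel values. It is not how Keras implements ImageDataGenerator; the function names are hypothetical, and only one random transformation (horizontal flip) is shown for brevity.

```python
import random


def rescale(image, max_value=255):
    """Map pixel values in [0, max_value] to floats in [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def random_horizontal_flip(image, rng, probability=0.5):
    """Mirror the image left to right with the given probability."""
    if rng.random() < probability:
        return [row[::-1] for row in image]
    return image


def training_transform(image, rng):
    """Random augmentation followed by rescaling (training only)."""
    return rescale(random_horizontal_flip(image, rng))


def validation_transform(image):
    """Validation images are only rescaled, never randomly transformed."""
    return rescale(image)


rng = random.Random(42)
image = [[0, 128, 255],
         [10, 20, 30]]  # tiny grayscale stand-in for a real photo
print(training_transform(image, rng))  # may or may not be flipped
print(validation_transform(image))     # always the same output
```

Calling training_transform repeatedly on the same image can give different results, while validation_transform is deterministic, which keeps the validation metrics comparable between epochs.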

Model Architecture

Once the data investigation and preparation is done it's time to construct a model. As we have a small amount of data we want to build a relatively simple model to be able to train it appropriately and to not overfit. Let's try VGG* style architecture but with a smaller number of layers and filters.




The architecture consists of the following parts:

  • [Conv + Conv + Max pooling] x 2

    The first part contains two stacked convolutional layers with 64 filters (size 3 and stride 2) and max pooling (size 2 and stride 2) after them. This part is also usually called feature extractor because what filters effectively do is detection of meaningful features from the input (see Overview of Convolutional Neural Networks for Image Classification for details).

  • Flattening

    This part is needed because the output of the convolutional part is a 4D tensor with the dimensions (examples, height, width, and channels). But the usual dense layer expects a 2D tensor (examples, features) as an input. Thus we need to flatten the tensor along the last three axes to make them into one. In fact, it means that we treat each pixel of each feature map as a separate feature and flatten them into one vector. In the figure below there is an example of a 4x4 image with 128 channels flattened into one long vector of length 2048 (4 x 4 x 128).


  • [Dense + Dropout] x 2

    This is the classification part. It takes a flattened feature representation of images and tries to classify them as well as possible. It consists of two stacked dense/dropout blocks. We're already familiar with dense layers: they are the usual fully connected layers. But what is dropout? Dropout is a regularization technique that prevents overfitting. One of the possible signs of overfitting is weights that differ from each other by orders of magnitude. There are plenty of techniques to overcome this issue, including weight decay and dropout. The idea of dropout is to switch off random neurons during training (updating the list of the thrown-away neurons after every batch or epoch). This implicitly makes it much more difficult for weights to become very different, and thus regularizes the network.

    An example of applying dropout (figure is taken from Dropout: A Simple Way to Prevent Neural Networks from Overfitting):

  • Sigmoid unit

    The output layer should correspond to the problem setting. Here we have a binary classification task, thus we need to have one output neuron with a sigmoid activation function, which estimates the probability P of belonging to class number 1 (positive, in our case). Then the probability of belonging to class number 0 (negative) can easily be calculated as 1 - P.
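The last three building blocks from the list above can be sketched in plain Python. This is only an illustration with our own helper names, not the Keras implementation. For dropout we assume the "inverted" variant, in which the surviving activations are scaled up during training so that nothing needs to change at inference time.

```python
import math
import random


def flattened_size(height, width, channels):
    """Number of features per example after flattening a feature map."""
    return height * width * channels


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and scale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def sigmoid(z):
    """Squash a real-valued score into a probability (numerically stable)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


print(flattened_size(4, 4, 128))  # 2048

rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
print(dropout(hidden, rate=0.5, rng=rng))                  # some units zeroed
print(dropout(hidden, rate=0.5, rng=rng, training=False))  # unchanged

p_positive = sigmoid(0.3)       # P(class 1) for an arbitrary example score
p_negative = 1.0 - p_positive   # P(class 0)
```

Note that dropout is active only during training; at inference time all neurons are used.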

Training Setting and Parameters

We have chosen the architecture of the model and specified it using the Keras Python framework. One more thing to do before we can start training is to compile the model.


The compilation step configures the model for training. We need to specify three main parameters:

  • Optimizer. Here we use the default Adam* optimizer, which is a kind of stochastic gradient descent algorithm with momentum and adaptive learning rate (for more details see An overview of gradient descent optimization algorithms, a blog post by S. Ruder).
  • Loss. Our problem is a binary classification task, thus it is appropriate to use the binary cross entropy loss function.
  • Metrics. It is an optional argument through which we can specify additional metrics to trace during the training procedure. Here we want to trace accuracy in addition to the objective function.
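To show what the loss and the metric actually measure, here is a small sketch of binary cross entropy and accuracy in plain Python. The clipping constant eps and the 0.5 threshold are our assumptions for the illustration; Keras computes these quantities internally on tensors.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # made-up model outputs
print(binary_cross_entropy(y_true, y_prob))
print(accuracy(y_true, y_prob))  # 0.5
```

A model that outputs 0.5 for every example gets a cross entropy of ln 2 (about 0.693), which is a useful reference point when reading the training curves.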

Now we are ready to train the model. Note that the training procedure is done with the generators initialized in the previous section. Number of epochs is another hyperparameter that can be tuned, but here we just set it to 10. We also want to save both the model and the training history to be able to load it later.



Now let's see how well our model performs. First of all, let's take a look at the metrics evolution during training.


One can see that the validation cross entropy does not decrease and the validation accuracy does not increase over time. Moreover, accuracy on both the training and validation sets just fluctuates near the value of the random classifier. The final validation accuracy equals 55 percent, which is only slightly better than random.

Let's see how the predictions of the model are distributed among classes. For that purpose we build and visualize a confusion matrix using the corresponding function from the Sklearn* Python package.

Each cell in the confusion matrix has its own name:


  • True Positive Rate = TPR (top-right cell) is the fraction of positive (class 1, which is positive emotion in our case) examples classified correctly as positive.
  • False Positive Rate = FPR (bottom-right cell) is the fraction of negative examples (class 0, which is negative emotion) misclassified as positive.
  • True Negative Rate = TNR (bottom-left cell) is the fraction of negative examples classified correctly as negative.
  • False Negative Rate = FNR (top-left cell) is the fraction of positive examples misclassified as negative.

In our case TPR and FPR are both close to 1. It means that almost all the objects have been classified as positive. Therefore our model is not far from the naive baseline that always predicts the majority class (in our case positive).
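The four rates can be computed directly from the true and predicted labels. The sketch below uses the standard definitions, in which each rate is normalized by the number of examples in the true class. The function name is our own, and the code assumes that both classes are present in y_true.

```python
def confusion_rates(y_true, y_pred):
    """Return TPR, FNR, FPR, and TNR for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    positives = tp + fn  # all examples whose true class is 1
    negatives = fp + tn  # all examples whose true class is 0
    return {
        "TPR": tp / positives,
        "FNR": fn / positives,
        "FPR": fp / negatives,
        "TNR": tn / negatives,
    }


# A classifier that always answers "positive" has TPR = FPR = 1.
print(confusion_rates([1, 1, 0, 0], [1, 1, 1, 1]))
```

TPR and FNR always sum to 1, as do FPR and TNR, so two of the four numbers are enough to describe the classifier.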

One more metric that might be interesting to look at is a receiver operating characteristic (ROC) curve and area under curve (ROC AUC). The formal definition can be found on Wikipedia. Briefly, the ROC curve shows how well the binary classifier performs.

Our CNN classifier has a sigmoid unit as an output, which gives us the probability of the example being assigned to class 1. For now, let's assume that our classifier works well and assigns low probabilities for examples of class 0 (green color in the left figure below), and high probabilities to the examples of class 1 (blue color).


The ROC curve shows how TPR depends on FPR while the classification threshold moves from 0 to 1 (right figure, above). To understand what a threshold is, recall that we have the probability of belonging to class 1 for every example. But the probability is not yet the class label. One should compare it with the threshold to decide which class the example belongs to. For instance, if the threshold equals 1, then all the examples would be classified as class 0, because the probability can't be greater than 1 and both FPR and TPR would be equal to 0 (no samples are classified as positive). This corresponds to the leftmost point of the ROC curve. At the other end of this curve there is a point where the threshold equals 0, which means that all the samples are classified as class 1, and both TPR and FPR equal 1. Intermediate points show the behavior of the TPR/FPR dependency with the changing threshold.
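The threshold sweep described above can be sketched as follows. The labels and probabilities are made up for illustration, and an example is assigned to class 1 when its probability is strictly greater than the threshold, as in the text.

```python
def roc_points(y_true, y_prob, thresholds):
    """Return one (FPR, TPR) pair per threshold; class 1 if p > threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p > threshold)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p > threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]  # made-up sigmoid outputs
thresholds = [1.0, 0.75, 0.5, 0.25, 0.0]

points = roc_points(y_true, y_prob, thresholds)
for threshold, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is (0, 0) and the last one is (1, 1), which matches the two ends of the ROC curve.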

The diagonal line in this plot corresponds to the random classifier, and the better our classifier is, the closer its curve comes to the top-left corner. Thus an objective measure of the classifier quality is the area under the ROC curve, or ROC AUC. It should be as close to 1 as possible. An AUC equal to 0.5 corresponds to the random classifier.
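ROC AUC also has a convenient probabilistic reading: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it this way by comparing all pairs. This is a simple illustration on made-up data, not the algorithm used by Sklearn, and it is quadratic in the number of examples.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]  # made-up sigmoid outputs
print(roc_auc(y_true, y_score))  # 7 of 9 pairs are ranked correctly
```

A classifier that gives the same score to every example gets exactly 0.5, and a classifier that separates the classes perfectly gets 1.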

Our model (figure above) has an AUC equal to 0.57, which is only slightly above the 0.5 of a random classifier.


All these metrics tell us one fact—our model is only slightly better than random. There might be a couple of reasons for that but the main reasons are:

  • Very small amount of data to learn to extract representative features from the images. Even data augmentation did not help here.
  • Relatively (compared with other machine learning models) complex convolutional neural network model with a huge number of parameters.


In this article, we built a simple convolutional neural network model for emotion recognition from images. We used several data augmentation techniques at the training stage and evaluated the model with several metrics: accuracy, a confusion matrix, and ROC AUC. The model showed only slightly better results than random due to an insufficient amount of available data.

In the next article, we tackle the problem of a small dataset with a different approach called transfer learning, and see how it significantly improves the model performance by using weights trained on a related but different task.


