September 7, 2019

MSc Thesis

Repository

Emotion Recognition From Images Using Deep Learning

  The ability to predict emotions from static or dynamic images has advanced the fields of computer vision (CV) and robotics and remains a major research topic. Computer vision is the task of detecting and recognizing objects or persons in images or videos. One application is predicting the emotion shown on a person’s face, which is called facial emotion recognition (FER). In health care, for example, a device or robot can observe the state of a person and call an ambulance if that person appears to be ill. Other applications include playing music genres (e.g. soul, blues, rock) that match a person’s mood, detecting a suspicious person or a weapon on camera, and so on. There are plenty of other use cases where these technologies are applied, either in commercial applications or for research purposes. To extract visual information from images, convolutional neural networks (CNNs) are used. These networks comprise, among other things, filters (kernels) and have either a shallow or a deep layer structure.
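  To make the idea of stacked filters concrete, here is a minimal Keras sketch of a small CNN for 48×48 grayscale input; the layer sizes are illustrative assumptions, not the architecture used in this thesis.

```python
from tensorflow.keras import layers, models

# A small, illustrative CNN: each Conv2D layer learns a set of filters
# (kernels) that slide over the image and extract local visual features.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),            # downsample the feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),  # e.g. seven emotion classes
])
model.summary()
```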
  The comparison of the performance of two different CNNs on facial expression recognition (FER) tasks is the subject of this work. Both networks will be trained using a method called transfer learning, in which pre-trained networks are adapted and fine-tuned to predict seven emotion classes (angry, disgust, fear, happy, sad, surprise, and neutral). The dataset used to train and evaluate these networks is the FER2013 dataset from the Kaggle competition, which comprises more than 28,000 48×48 grayscale images and was created specifically for FER tasks. Unfortunately, there are fewer training examples than would be needed to train the networks from scratch. Because of that, only the training data will be augmented with additional flipped, rotated, and otherwise transformed images while the training process runs; a code sketch of this step follows the image below. The following image compares a FER pipeline with and without a CNN: the two separate stages, feature extraction and classification, are merged into a single step. In the classical method, feature extraction is handled either by a machine learning algorithm or by hand, which is error-prone.

Comparison of a FER pipeline with and without a CNN

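  As a rough sketch of the augmentation step, assuming the Kaggle CSV layout of FER2013 (one row per image with an integer emotion label and a space-separated pixels string), the training images can be expanded on the fly with Keras’ ImageDataGenerator. The transformation parameters below are illustrative assumptions, not the exact values used in the thesis.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

# The Kaggle FER2013 CSV holds one row per image: an integer 'emotion'
# label (0-6) and a space-separated string of 48x48 grayscale pixels.
df = pd.read_csv("fer2013.csv")
x = np.stack([np.array(p.split(), dtype="float32").reshape(48, 48, 1)
              for p in df["pixels"]])
y = to_categorical(df["emotion"], num_classes=7)

# Augment only the training data: random flips, rotations, and shifts
# produce additional variants of each image while training runs.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
train_flow = train_gen.flow(x, y, batch_size=64)
```

Horizontal flips and small rotations preserve the facial expression, so they are safe augmentations for FER; vertical flips, by contrast, would produce unrealistic faces.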
  In this project, two CNNs will be used: VGG16, one of the earlier deep CNN architectures, and the newer Inception v3 network, which is built from inception modules. The comparison will show whether the performance of an older architecture is similar to that of a newer one on FER tasks. The application will be developed on both operating systems, Ubuntu and Windows 10, and the programming language used is Python. Plenty of libraries will be applied in this work, among them NumPy, Matplotlib, OpenCV, dlib, scikit-learn, pandas, TensorFlow, and Keras.
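  A minimal sketch of the transfer-learning setup for the VGG16 branch is shown below, assuming the 48×48 grayscale inputs are replicated to three channels so they match the ImageNet-pretrained weights; the head layers and hyperparameters are illustrative, not the exact thesis configuration.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load VGG16 pre-trained on ImageNet, without its 1000-class head.
# The grayscale images are assumed to be stacked to three channels.
base = VGG16(weights="imagenet", include_top=False, input_shape=(48, 48, 3))
base.trainable = False  # freeze the convolutional base at first

# New classification head for the seven emotion classes.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Fine-tuning (sketch): once the new head has converged, unfreeze the
# top convolutional block and keep training with a lower learning rate.
```

Inception v3 can be wired up the same way via tensorflow.keras.applications.InceptionV3, with the caveat that Keras expects its inputs to be at least 75×75, so the images would need to be upscaled first.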
  Below you can see images of what the augmented FER2013 dataset looks like and the performance comparison of the training and test runs of both networks. The Python code is published on GitHub.

Augmented Dataset

Result of the performance comparison

Download
msc_thesis_2019_miis.pdf