Enhanced Real-Time Facial Expression Recognition Using Deep Learning
Abstract:
In the realm of facial expression recognition (FER), the identification and classification of seven universal emotional states, surprise, disgust, fear, happiness, neutrality, anger, and contempt, are of paramount importance. This research focuses on the application of convolutional neural networks (CNNs) for the extraction and categorization of these expressions. Over the past decade, CNNs have emerged as a significant area of research in human-computer interaction, surpassing previous methodologies with their superior feature learning capabilities. While current models demonstrate exceptional accuracy in recognizing facial expressions within controlled laboratory datasets, their performance significantly diminishes when applied to real-time, uncontrolled datasets. Challenges such as degraded image quality, occlusions, variable lighting, and alterations in head pose are commonly encountered in images sourced from unstructured environments like the internet. This study aims to enhance the recognition accuracy of FER by employing deep learning techniques to process images captured in real-time, particularly those of lower resolution. The objective is to augment the accuracy of FER in real-world datasets, which are inherently more complex and collected under less controlled conditions, compared to laboratory-collected data. The effectiveness of a deep learning-based approach to emotion detection in photographs is rigorously evaluated in this work. The proposed method is exhaustively compared with manual techniques and other existing approaches to assess its efficacy. This comparison forms the foundation for a subjective evaluation methodology, focusing on validation and end-user satisfaction. The findings conclusively demonstrate the method's proficiency in accurately recognizing emotions in both laboratory and real-world scenarios, thereby underscoring the potential of deep learning in the domain of facial emotion identification.
1. Introduction
Over the past decade, the field of artificial intelligence (AI), which aims to emulate human cognitive processes, has undergone significant advancements and encountered intriguing challenges. Among these, the analysis of subtle facial expressions represents a complex task. It is observed that the manifestation of a single emotion can vary considerably across individuals, influenced by factors such as ethnicity, age, or gender. Moreover, the interpretation of an individual's emotional state is subject to contextual variables including lighting, posture, and background. This paper delves into the intricacies of facial expression analysis in the era of AI, exploring the multitude of aspects impacting the accuracy of human emotion detection. Expression, encompassing a broad spectrum of behaviors, actions, thoughts, and feelings, is ultimately a subjective and intimate mental and physical state. The foundational work of Charles Darwin, particularly his book "The Expression of the Emotions in Man and Animals," laid the groundwork for early emotion studies. Subsequent research, notably by Ekman and Friesen in 1969, identified cross-cultural consistencies in emotional expressions, establishing six universal emotional states: happiness, sadness, anger, contempt, surprise, and fear [1], [2], [3].
Conversely, facial expression, a non-verbal form of communication, is crucial to human perception, behavior, and interaction. Facial expressions represent morphological alterations in the face [4], and it is estimated that only about 7% of the information conveyed is through words, with vocal intonation accounting for 55% and body language for 38%. The use and interpretation of body language and facial expressions often occur subconsciously, yet they play a vital role in effective communication. The increasing relevance of emotions in human-robot interaction (HRI) has sparked interest in equipping social robots with FER capabilities. HRI amalgamates disciplines such as social sciences, robotics, AI, and natural language processing [5]. This interdisciplinary approach underlines the growing need to understand and accurately interpret facial expressions, not only in human-to-human interactions but also in the evolving domain of human-robot communication.
Emotions are fundamental in HRI, rendering social robots an increasingly studied subject due to their potential in FER. The exploration of HRI necessitates a multidisciplinary approach, incorporating fields such as AI, robotics, natural language processing, design, and social sciences. Within this scope, facial recognition technologies are crucial yet encounter several limitations including restricted processing capabilities, speed, duration, and accuracy. Challenges in 2D, 3D, and temporal facial recognition methods are prevalent, primarily owing to spatial alterations, occlusions, lighting variances, and the intensive demand for computational resources. Efforts to refine classification accuracy have been observed, with some researchers opting to simplify methodologies by minimizing feature points or adopting a more objective approach. In the realm of computer vision, traditional machine learning techniques previously demonstrated efficacy but were hindered by their inability to process direct photo inputs. Contemporary face recognition systems continue to face challenges due to varying lighting conditions, backgrounds, and postures, which can significantly alter appearances and obstruct precise expression detection. The advent of deep learning has been pivotal in addressing these challenges, enhancing the recognition performance of the six core emotional expressions—sadness, disgust, anger, happiness, fear, and surprise. However, the application of deep learning models to faces captured under divergent conditions from the training dataset remains a significant limitation. A comparative analysis of traditional and deep learning techniques in facial recognition is presented in Figure 1.
In this study, a novel deep learning-based framework is introduced, designed to surmount the challenges inherent in real-time facial emotion recognition. The system employs deep learning algorithms for detection, coupled with CNNs for the extraction of features, thereby recognizing a spectrum of seven emotional states: happiness, sadness, anger, fear, surprise, disgust, and neutrality. The methodology incorporates current techniques while introducing several key enhancements:
Expanded recognition capability: The model is engineered to differentiate between seven emotional categories, thereby broadening its scope to capture a more extensive range of human emotions. This expansion enables a more precise and nuanced analysis of facial expressions.
Streamlined and resilient architecture: The system is designed with simplicity and robustness, facilitating effective real-time processing. This feature ensures the model's applicability in real-world scenarios without excessively taxing computational resources.
Enhanced accuracy: By leveraging advanced deep learning techniques, the model achieves elevated levels of accuracy in facial emotion detection. This improvement is critical for reliable outcomes, particularly in fields such as human-computer interaction, market research, and mental health assessments.
Rigorous evaluation and validation: The efficacy of the proposed model will be rigorously assessed using predefined datasets. This evaluation process is aimed at empirically demonstrating the system's proficiency and its capacity to yield valuable insights across various applications. The methodology outlined in this research encompasses several critical steps, each contributing to the development of an advanced real-time facial expression detection system:
Data collection: Initially, a comprehensive and varied dataset of facial expressions is compiled. This dataset is meticulously curated to ensure diversity and representativeness, laying a foundational basis for subsequent model training.
Model training and evaluation: Deep learning models are then rigorously trained on the assembled datasets. The focus of this training is to enhance the models' proficiency in accurately identifying the seven predefined emotional categories. Subsequent extensive testing is conducted to refine and validate the models' performance.
Application in real-time detection: Designed for practical, real-time scenarios, the system operates by selecting an image from the collected dataset as input. It then rapidly processes this image for emotion recognition, aiming to deliver prompt and reliable results.
In summation, the approach proposed in this study marks a significant advancement in real-time FER. It integrates innovative features, including simplicity, robustness, and high accuracy, making it a valuable asset in diverse applications where understanding human emotions is essential. These applications range from enhancing human-computer interaction to providing insights in market research and mental health assessments.
2. Literature Review
The implementation of facial recognition technology in smart devices has become increasingly prevalent, yet it imposes significant demands on storage and processing capabilities. In response to these challenges, a range of strategies and systems for expression recognition have been developed and are briefly reviewed herein:
Guo et al. [6] introduced an innovative approach utilizing DNNs with relativity learning (DNNRL). This method aims to contract the distances in the embedding space between samples representing the same expression, while concurrently expanding the gap between those of differing expressions. The training process involves the selection of an anchor, a positive sample (bearing the same expression as the anchor), and a negative sample (exhibiting a different expression). The core objective is to minimize the triplet loss, which effectively reduces the distance between the anchor and the positive instance in the embedding space, ensuring that it remains narrower than that between the anchor and the negative sample. DNNRL notably assigns greater weight to challenging instances based on the network's output, allowing for more nuanced learning. The efficacy of DNNRL has been validated using the SFEW and FER2013 datasets.
Feature extraction, a critical step following face detection in FER, is heavily dependent on the quality of the features extracted. Subtle or pronounced deformations in facial features such as eyebrows, lips, eyes, and nose can induce changes in facial expressions. Feature extraction methods are categorized into two types: non-geometric and geometric-based features [7]. Geometric feature extraction focuses on quantifying the size and position of facial features, including the nose, lips, forehead, chin, and eyes. These attributes are encapsulated within a facial geometry feature vector. Geometric feature extraction employs various geometric interactions, such as points, stretches, and angles, between these components to encode the features. In contrast, appearance-based feature extraction employs either a single image filter or a combination of filters applied to the entire image or specific regions to discern changes in texture and shape [8]. Furthermore, a range of computational models and methods for processing visual data are employed in feature extraction. These include tools like fuzzy logic and neural networks. Feature extraction strategies are broadly classified into four types: feature-based, appearance-based, template-based, and part-based approaches [9].
Li et al. [10] have explored the application of the k-nearest neighbor (KNN) strategy, augmented by center loss and locality-preserving loss (LP-loss), for clustering deep features and ensuring intra-class compactness. The employed deep locality-preserving CNN (DLP-CNN) maintains the local representation of each sample in the embedding space. During training, Euclidean distance is utilized to ascertain the KNN for each data point, aiming to minimize the sample's distance from the mean of its KNNs. The effectiveness of LP-loss has been evaluated using datasets such as CK+, SFEW, MMI, and RAF-DB. Center loss, while promoting intra-class compactness and consequently aiding in inter-class separation, may still permit overlap among feature regions in the embedding space. Building on this concept, Cai et al. [11] enhanced center loss by integrating an additional objective function. This modified center loss, termed as island loss, merges the original center loss with the pairwise cosine distance between class centers in the embedding space. The approach aims to increase cosine distance, thereby angularly separating the class centers. Island loss has been assessed using datasets including CK+, MMI, and Oulu-CASI. Recent advancements in facial emotion recognition have been significantly influenced by deep learning algorithms. Jain et al. [12] introduced single deep neural network (DNN) incorporating convolution layers and deep residual blocks. Lopes et al. [13], in a similar vein, presented a multiple CNN framework, complemented by a specialized image pre-processing stage for emotion recognition.
The application of FER in dynamic environments was addressed by Jain et al. [14] through the deployment of a hybrid convolution-recurrent neural network technique. A comparative analysis was conducted by Sajjanhar et al. [15] on the performance of pre-trained facial recognition algorithms, Visual Geometry Group (VGG)-facial and Inception, both initially developed for object detection. Wen et al. [16] employed a convolutional rectified linear layer as the initial layer in their CNN aggregate for facial emotion recognition, incorporating multiple hidden maxout layers to modify the architecture of each CNN. Despite notable advancements in the field of FER, research predominantly focuses on devising strategies to enhance outcomes presented in one or more datasets independently. The investigation into the impact of cross-dataset fine-tuning on performance was conducted by Zavarez et al. [17]. For this purpose, the VGG face deep CNN model was adapted for facial emotion recognition. Cross-dataset experiments were meticulously designed, utilizing one dataset as the test set while employing others for training, to ensure the reliability of the results. Wang et al. [18] proposed an innovative approach integrating FER technology with online course platforms. In this method, student facial expressions were captured using device cameras during an online course and processed through a FER algorithm (CNN model), categorizing them into eight emotional states: anger, disgust, fear, happiness, sadness, surprise, contempt, and neutral. This approach was tested in an online course using Tencent Meeting with 27 students, demonstrating consistent performance across diverse scenarios. The applicability of this concept extends beyond online educational settings, suggesting potential in various interactive environments.
Pise et al. [19] have applied contemporary deep learning models to the evolving field of automated emotion recognition within computational intelligence. This research demonstrates the integration of deep learning-based FER with architectural methods and databases, yielding highly accurate results. A diverse array of machine learning and deep learning methodologies are employed in this investigation. Saeed et al. [20] discussed a technique to enhance accuracy in facial recognition. Their proposed CNN method (fall detection-CNN), incorporating two hidden layers and four convolutional layers, serves as an automated framework. Utilizing the expanded Cohn-Kanade (CK+) dataset, which includes images portraying a range of emotions from various individuals, the process encompasses pre-processing, feature extraction, and categorization. The model's effectiveness is evaluated through metrics such as F1-score, recall, and precision, with respective values of 84.07%, 78.22%, and 94.09%. Additionally, numerous studies employing machine learning methods have contributed to this field [21], [22], [23], [24], [25], [26], [27]. Despite advancements, certain facial recognition approaches encounter challenges, including poor lighting, shadows, partial facial visibility, camera orientation issues, and lower recognition rates. This project aims to develop a CNN-based FER system enhanced with data augmentation. The proposed system is designed to classify the seven principal emotions, anger, contempt, fear, happiness, neutrality, sadness, and surprise, from visual data.
3. Proposed Methodology
Figure 2 delineates the architecture of the proposed emotion recognition model. The methodology comprises the following principal components:
(a) Data collection: This phase involves the accumulation of a diverse dataset, encompassing images that represent a range of emotions.
(b) Data preprocessing: The dataset undergoes classification, categorizing images into seven emotional states: anger, happiness, fear, disgust, neutrality, sadness, and surprise.
(c) Emotion prediction: Utilizing a deep learning model, emotion predictions are executed on the images.
(d) Performance evaluation: The final stage involves assessing the model's performance in accurately predicting emotions.
The data collection process was facilitated by a data acquisition layer, responsible for aggregating data from various online sources. This research utilized information gathered from links, data repositories, and additional internet resources. Figure 3 presents a selection of the data samples amassed for this study. The methodology incorporated two primary datasets: the FER-2013 [28] and a Random dataset [29]. The FER-2013 dataset comprises grayscale images, each measuring 48×48 pixels. It encompasses a training set of 28,000 labeled images, a development set consisting of 3,500 labeled images, and a test set with another 3,500 labeled images. This dataset encapsulates seven emotional states: happiness, sadness, anger, fear, surprise, disgust, and neutral. In contrast, the Random dataset includes a compilation of 350 images, both in color and grayscale, further categorized into six emotional categories: happiness, sadness, anger, fear, surprise, disgust, and neutral. Figure 3 showcases representative images from both datasets, illustrating the diversity and range of emotions covered.
In this phase of the research, the focus is on the utilization of deep learning models, specifically MobileNet, for real-time prediction of seven emotional categories: happiness, sadness, anger, fear, surprise, disgust, and neutrality. The MobileNet architecture is leveraged due to its efficiency in processing and reduced parameter count compared to conventional convolutional networks. Bounding boxes are employed to highlight the facial regions where emotions are detected. MobileNet, a variant of CNNs developed by Google, employs depth-separable convolutions, significantly reducing the number of parameters required. This reduction enables the deployment of DNNs on portable devices, making MobileNet an ideal foundation for compact and rapid classifiers. The architecture of MobileNet comprises several depth-separable convolutional layers, each consisting of a depth-wise convolution followed by a point-wise convolution. In total, a MobileNet architecture contains 28 layers when depth-wise and point-wise convolutions are considered separately. Furthermore, the adaptability of MobileNet is enhanced by the width multiplier hyperparameter, which allows for the adjustment of the network's complexity. Typically, a standard MobileNet comprises approximately 4.2 million parameters, with input dimensions of 224×224×3. Figure 4 presents the architectural diagram of MobileNet, highlighting its structural components [30], [31].
In the evaluation of object detection models, MobileNet is distinguished by its exceptional speed performance. Contrasting with its counterparts, which typically operate at a frame rate of 5 frames per second, MobileNet excels by achieving a remarkable 22 frames per second. This rapid processing capability significantly elevates MobileNet above other models in terms of efficiency. To illustrate this, consider the comparison with models such as regions with CNN features (R-CNN) and its enhanced version, Fast R-CNN. While these models exhibit higher accuracy rates, capturing more detailed information than MobileNet, they lag in processing speed. The defining advantage of MobileNet lies in its speed, making it a preferred option for applications where prompt and efficient object detection is crucial. This aspect is particularly vital in real-world scenarios where time-sensitive detection is paramount [32]. Figure 5 provides a comparative analysis of MobileNet against various object detection techniques, emphasizing the speed differential.
The experimental evaluation was conducted on a computer equipped with an Intel Core i5-6200U CPU, operating at 2.4 GHz and supported by 8 gigabytes of RAM. Python, chosen for its versatility and efficiency, served as the programming language for the implementation of the models. To assess the accuracy of the developed models, a comprehensive evaluation was performed using a test dataset with well-established target features. The model outputs were systematically compared against these known ground truths, facilitating a detailed analysis of their performance. A key instrument in this evaluation was the utilization of a confusion matrix. This matrix provided both a visual and numerical representation of the model's performance, indicating not only the predicted instances for each class but also the accuracy of these predictions. Moreover, various assessment parameters were calculated using specific mathematical formulae, further elucidating the models' effectiveness. These calculations and their corresponding formulae are detailed in Eq. (1), which offers a comprehensive view of the analytical methods employed in this study.
In this research, the model's performance was evaluated using a subjective assessment approach, wherein manually created images and frames depicting emotions were compared individually. The datasets utilized for testing and training encompassed the following emotional classes: happiness, sadness, anger, fear, surprise, disgust, and neutrality. Table 1 provides an extensive description of the dataset.
Classes | FER-2013 Dataset | Random Dataset |
Happiness | 879 | 50 |
Sadness | 594 | 50 |
Anger | 491 | 50 |
Fear | 528 | 50 |
Surprise | 416 | 50 |
Disgust | 55 | 50 |
Neutrality | 626 | 50 |
For the experimental analysis, images were sourced from online platforms, system directories, and those specifically collected for this study. While some images from the training data were used preliminarily to check for duplicates, the primary focus was on images not included in the training set. The proposed model's efficacy was tested across a range of image resolutions. A unique aspect of the assessment involved contrasting the proposed model with a manual approach, where an individual subjectively classified images into respective emotional categories. This method entailed manual predictions which, despite initial accuracy, exhibited uncertainty in certain cases due to behavioral similarities, such as mistaking an image of a newborn for fear when it might also be interpreted as surprise. Subsequently, these images were processed through the proposed model, and the outcomes from both methods were compared. This process constituted an image-level comparison. Figure 6 illustrates this comparison, showcasing how each method categorized images across the seven emotional classes: happiness, sadness, anger, fear, surprise, disgust, and neutrality.
As depicted in Figure 7, it was observed that the proposed model did not accurately identify certain emotional states. This limitation was primarily due to the visual similarities between different emotions. For instance, an image that predominantly exhibited characteristics of fear was erroneously classified under the category of surprise by the model. Similarly, another image, which ideally belonged to the disgust category, was incorrectly identified as sadness. This misclassification stemmed from the visual resemblance of the image to those typically associated with sadness, as perceived by the unaided eye. Figure 7 presents a selection of instances where the model's predictions were impeded by such behavioral similarities. These examples highlight the challenges faced in accurately distinguishing between emotions that share common visual traits.
The results of the proposed model, as depicted in Figure 8, demonstrated a remarkable accuracy of 100% during validation and 97.9% during training. These statistics indicate the model's successful generalization from the training dataset to the validation dataset. However, challenges arose when the model was applied to real-world images sourced from various platforms such as online resources, system directories, and captured photographs. These images encompassed a spectrum of emotional states: happiness, sadness, anger, fear, surprise, disgust, and neutrality.
The manual assessment method, which relied on human judgment to classify images, also faced difficulties in accurately identifying emotions, especially in instances where images exhibited similar emotional traits. For example, an image of a baby, which might appear fearful, could also be interpreted as showing surprise due to the ambiguity in facial expressions. This issue underscores the subjective nature of human visual perception in emotion recognition tasks. Discrepancies were noted between the model's predictions and manual assessments. The model, due to its focus on behavioral similarities, misclassified certain images into incorrect emotional categories. An image that visually suggested fear was sometimes predicted as surprise, while an image that initially appeared sad was classified as disgust. These misclassifications were attributed to the model's challenge in discerning subtle differences in emotional expressions. Figure 8 illustrates instances where the model struggled with reliable predictions due to the proximity of the emotional expressions depicted in the images. Despite its high accuracy in training and validation settings, the model's performance in real-world scenarios highlighted the intricacies of emotion recognition in photographs, particularly when dealing with minute variations and nuances.
In summary, while the model exhibited commendable performance during training and validation phases, it encountered difficulties in accurately discerning emotions in real-world images. The findings emphasize the significance of acknowledging the inherent challenges and limitations in emotion detection tasks, particularly with images exhibiting a range of similar emotional expressions. Further research and refinement may be necessary to enhance the model's capability in such scenarios.
4. Results
Table 2 presents a comparative analysis between the proposed MobileNet-V1 model and existing techniques in emotion recognition. The comparison, based on accuracy, is drawn from a range of studies and models:
Authors | Model | Accuracy |
Barsoum et al. [33] | VGG13(MV) | 83.86 % |
Li et al. [34] | TFE-JL | 84.29% |
Georgescu et al. [35] | CNNs and BOVM + global SVM | 87.76% |
Huang [36] | ResNet + VGG | 87.4% |
Wang et al. [18] | SCN + ResNet18 | 88.01% |
Nan et al. [37] | A-MobileNet | 88.11% |
Proposed model | MobileNet-V1 | 97. 9% |
The proposed MobileNet-V1 model demonstrates a significantly higher accuracy rate of 97.9%, surpassing the accuracy levels of other models cited in the comparative study. This superior accuracy rate is indicative of the model's effectiveness in emotion recognition tasks. The comparative analysis underscores the compactness and reliability of the proposed model, especially in contrast to the existing methodologies.
5. Conclusions
The method presented in this article for real-time emotion recognition incorporates user behavior analysis and is capable of differentiating between seven behavioral categories: happiness, sadness, neutrality, disgust, fear, surprise, and anger. Existing methods in the literature for content extraction and behavior recognition, while useful, are often hindered by high hardware demands and slow processing speeds. A detailed comparison of emotion recognition techniques is provided, demonstrating the simplicity, optimization, precision, and reliability of the proposed model relative to current methods. A key innovation in this study is the accurate and efficient extraction of seven distinct behaviors. The primary objective of developing an emotion detection system based on seven classifications using images or shots has been successfully achieved with the proposed approach. For assessment purposes, the experimental study utilized two datasets, FER2013 and Random datasets, both comprising images categorized into seven emotional states. In comparison to the subjective evaluation method, where an observer manually identifies the behavior from an image, the proposed model demonstrated superior performance. Extensive experiments have shown that the proposed method achieved an accuracy of 97.7% while maintaining high processing speed and efficiency. Future research directions include exploring additional classes, enhancing accuracy, and implementing real-time facial emotion recognition using cameras. A user-friendly interface is also planned for integration into an application utilizing the developed model.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare no conflict of interest.