Establishing SSD-MobileNetV2 as a Robust Baseline for Driver Drowsiness Detection Toward IoT-Ready in-Driving Safety Systems


Open Access

Research article


Yo Ceng Giap¹,²*, Muljono¹*, Affandy¹, Ruri Suko Basuki¹, Harun Al Azies¹, Deshinta Arrova Dewi³

¹ Faculty of Computer Science, Universitas Dian Nuswantoro, 50131 Semarang, Indonesia

² Faculty of Science and Technology, Universitas Buddhi Dharma, 15115 Banten, Indonesia

³ Faculty of Data Science and Information Technology, INTI International University, 71800 Nilai, Malaysia

International Journal of Transport Development and Integration | Volume 10, Issue 2, 2026 | Pages 441-453

https://doi.org/10.56578/ijtdi100208

Received: 01-12-2026, Revised: 05-20-2026, Accepted: 05-22-2026, Available online: 06-02-2026


Abstract:

Driver drowsiness is one of the major causes of road accidents, which underscores the need for accurate and efficient fatigue detection systems that work in practical in-vehicle environments. Although deep learning has driven significant progress in visual fatigue detection, many previous studies trained on a single dataset or tested in controlled environments. In this paper, we examine the reliability of lightweight driver-monitoring architectures for vision-based driver drowsiness detection on three heterogeneous public datasets, i.e., Yawning Detection Dataset (YawDD), Driver Drowsiness Dataset (DDD), and National Tsing Hua University Drowsy Driving Dataset (NTHU-DDD), which cover the different lighting conditions, facial characteristics, and head poses encountered in driving scenarios. Among the considered architectures, Single Shot Detector (SSD)-MobileNetV2 was the most consistent, yielding an accuracy of 92%, precision of 93%, recall of 92%, and F1-score of 92% while also being computationally lighter than the other considered architectures. The reliability of these results was statistically assessed using the McNemar test and 95% Confidence Intervals (CI). Our results show that SSD-MobileNetV2 could be a promising baseline for future lightweight driver-monitoring systems for heterogeneous driving environments.

Keywords: Driver drowsiness detection, IoT-ready safety systems, Lightweight deep learning, Intelligent transportation systems, In-vehicle monitoring, Transportation services, Vision-based driver monitoring

1. Introduction

Transportation safety remains a major global concern due to the high number of traffic accidents and fatalities reported annually. As reported by the World Health Organization (WHO) [1], road traffic accidents are among the top causes of mortality in the age group from 5 to 29 years old. It was estimated that in 2023, about 1.19 million people worldwide died due to road traffic accidents. Driver fatigue is one of the main reasons for these accidents. According to Rad et al. [2] and Williamson et al. [3], driver fatigue leads to decreased alertness and slowed reactions, which result in a higher risk of accidents and poorer driving performance, especially at night and during long periods of driving. Although traditional measures against driver fatigue have been proposed, including rest stops, safety campaigns [4], and technology-enabled learning systems, there remains a great need for effective fatigue detection systems [5].

Deep learning methods recently developed for the visual detection of driver drowsiness have been successful by analyzing eye movement, facial expressions, and head pose [6]. Yet, the performance of these models depends heavily on dataset quality, environment variation, and architectural complexity [6], [7], [8]. Some of the most common issues include changes in lighting conditions, facial features, camera view angles, head motion, and resolution, which may cause instability of models during driving operations. Thus, assessing the stability of such models under different driving conditions is an important requirement for intelligent transportation systems (ITS) applications.

In this research work, three different but related publicly available datasets, namely, Yawning Detection Dataset (YawDD), Driver Drowsiness Dataset (DDD), and National Tsing Hua University Drowsy Driving Dataset (NTHU-DDD), are used. YawDD provides different head poses and eye-state information, whereas DDD gives fatigue-related information [9]. In addition, the NTHU-DDD dataset consists of different fatigue states and also covers nighttime driving conditions [10]. Thus, by combining these datasets, we can have a heterogeneous evaluation environment.

For this investigation, three representative deep learning models are considered in order to establish a reliable benchmark model that can detect driver drowsiness in real time. The first is YOLO11n, known for its real-time inference capabilities [6]; the second is ResNet50, known for its strong feature extraction; and the third is Single Shot Detector (SSD)-MobileNetV2, known for its tradeoff between computational efficiency and detection accuracy [11].

However, even after considerable advancements in the use of deep learning techniques for detecting driver drowsiness, some issues related to robustness in various heterogeneous driving environments and implementation on constrained automotive platforms still remain important concerns for ITS. In terms of ITS, detection of driver drowsiness should not be seen only as a problem of computer vision but rather as an integral part of intelligent transportation safety and accident prevention systems [12]. Lightweight and reliable designs are crucial for real-time implementation on embedded and edge computing automotive platforms.

The study stands out in that it critically analyzes SSD-MobileNetV2 as a possible baseline model for real-time drowsiness detection of drivers through an evaluation of three different public datasets (YawDD, DDD, and NTHU-DDD). Additionally, McNemar’s Test and Confidence Intervals (CI) have been used to enhance the reliability analysis of the tested models. The results of this study will form a solid foundation for future research on IoT-enabled driver monitoring systems.

2. Related Work

2.1 Driver Drowsiness Detection in Intelligent Transportation Systems

Driver drowsiness detection is now part of intelligent transportation safety systems because of its significance in lowering the risk of accidents and facilitating real-time monitoring of drivers. The choice of appropriate datasets is crucial in a deep learning approach to driver drowsiness detection. Several public datasets have been widely used for fatigue detection research, including YawDD [6], [13], [14], [15], NITYMED [15], [16], Drozy [17], DDD [7], and NTHU-DDD [10], [18], [19]. Each dataset provides different visual characteristics, driver fatigue behaviors, and environmental conditions. However, these datasets also present limitations that may affect model performance in real-time deployment scenarios.

Researchers have evaluated various deep learning approaches using these datasets. Vijaypriya and Uma utilized the YawDD dataset with a Multi-Scale Convolutional Neural Network (MCNN) and achieved an accuracy of 98.38% [18]. Zhang et al. [19] evaluated a multi-granularity deep framework using the NTHU-DDD dataset and reported an accuracy of 90.05% with approximately 37 frames per second (FPS), demonstrating suitability for real-time fatigue detection. Devi et al. utilized the DDD dataset and achieved approximately 99% accuracy using a COOT-optimized deep learning framework for real-time driver drowsiness detection [9].

2.2 Vision-Based Driver Monitoring

Vision-based driver monitoring systems commonly utilize facial cues such as eye closure, yawning behavior, facial expressions, and head movement to identify fatigue conditions. In ITS, vision-based driver monitoring plays an important role in improving road safety by enabling real-time observation of driver alertness, fatigue symptoms, and behavioral conditions during vehicle operation. Various vision-based monitoring approaches have been proposed to support real-time driver fatigue detection, including Convolutional Neural Network (CNN) [20], [21], [22], YOLO (You Only Look Once) [6], Resnet50 [23], [24], SSD [11], [25], and Long Short-Term Memory (LSTM) [26], [27], [28] architectures.

Jahan developed a CNN-based deep learning model that detects drowsiness from eye conditions and reached an accuracy of 97.53% [8]. Such approaches demonstrate that visual monitoring methods can effectively capture fatigue-related facial features and driver behavior patterns in real-time driving environments.

As summarized in Table 1, previous studies that tested models on the DDD, NTHU-DDD, and YawDD datasets, or on a combination of two of them, reported accuracy values ranging from 72.5% to 99%. Most of the models that exceeded 90% accuracy relied on additional optimization. Among the models without optimization, the CNN trained on the YawDD dataset achieved the highest accuracy, at 93.83%. Overall, models that use combination techniques, such as MCNN + FSA and ConNN, tend to achieve better performance, particularly in terms of accuracy and the balance between precision and recall. Because this study evaluates lightweight architectures across three heterogeneous datasets without model-specific optimizations, the 92% accuracy represents a realistic, practically meaningful performance level for IoT-ready, real-time driver-monitoring systems. These comparative results indicate that lightweight driver-monitoring architectures can provide sufficiently stable performance for practical transportation safety applications, particularly under heterogeneous driving conditions.

Table 1. Comparative performance of vision-based driver-monitoring approaches for driver drowsiness detection

Model	Dataset	Dataset Size	Optimization	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)	Ref.
COOT	DDD	—	√	99	98	—	98	[9]
Resnet	DDD	41,790 images	—	85.4	86.1	84.3	85.2	[29]
MLP	NTHU-DDD	200 videos	—	81	—	—	—	[30]
KNN	NTHU-DDD	—	—	72.5	—	—	—	[31]
MCNN	NTHU-DDD	—	—	90.05	—	—	—	[19]
KNN	NTHU-DDD	—	—	89.86	96.35	89.18	92.63	[18]
NB	NTHU-DDD	—	—	87.92	95.82	86.89	91.14	[18]
MCNN + FSA	NTHU-DDD	—	√	98.26	99.45	98.10	98.77	[18]
SVM	YawDD	—	—	92.5	—	—	—	[31]
CNN	YawDD	—	—	93.83	—	—	—	[32]
SDM algorithm + CNN	YawDD	—	√	89.55	—	—	—	[33]
KNN	YawDD	—	—	87.69	77.27	86.82	81.77	[18]
NB	YawDD	—	—	83.42	70.41	82.52	75.99	[18]
MCNN + FSA	YawDD	—	√	98.38	97.06	97.84	97.45	[18]
MTCNN + DLIB	YawDD	100 videos	√	79	—	—	—	[34]
DLIB + LSTM	YawDD	100 videos	√	74	—	—	—	[34]
ConNN	YawDD + NTHU-DDD	96,817 images	√	98.81	—	—	—	[35]

Note: DDD = Driver Drowsiness Dataset; NTHU-DDD = National Tsing Hua University Driver Drowsiness Detection Dataset; YawDD = Yawning Detection Dataset; COOT = Co-Optimal Transport; KNN = K-Nearest Neighbors; MCNN = Multi-Scale Convolutional Neural Network; FSA = Feature Selection Algorithm; SVM = Support Vector Machine; SDM = Sleep Detection Model; CNN = Convolutional Neural Network; MTCNN = Multi-task Cascaded Convolutional Neural Network; DLIB = Dlib (an open-source C++ machine learning toolkit); LSTM = Long Short-Term Memory; MLP = Multi-Layer Perceptron; NB = Naïve Bayes. √ = optimization applied; — = not applied or not reported.

2.3 Lightweight Deep Learning for Edge Deployment

Lightweight deep learning architectures are increasingly important for embedded automotive systems and edge-based driver monitoring platforms due to latency and computational constraints. When choosing a deep learning model to detect driver fatigue, it is necessary to balance inference speed, computational efficiency, and detection accuracy.

In real-time transportation monitoring systems, computationally efficient architectures such as YOLO11n are advantageous because they enable fast inference under limited computational resources. However, YOLO11n is less effective at detecting small or overlapping objects, which can make it harder to identify the subtle facial features that indicate fatigue [6]. Resnet50 achieves high accuracy and is effective across a wide range of data, although it requires greater computational resources [23]; studies on image authenticity detection have also shown that ResNet achieves consistent and stable performance across heterogeneous datasets. SSD architectures offer a balance between speed and accuracy; however, they are less effective at detecting small objects [11]. Therefore, the selection of the appropriate lightweight architecture must be tailored to the specific requirements of real-time driver monitoring and the available hardware resources.

2.4 Cross-Dataset Generalization Challenges

One of the main challenges of deep learning driver drowsiness detection is the limited cross-dataset generalization. Previous work has often evaluated its models on a single dataset collected in a controlled environment, resulting in models that perform well on particular benchmarks but with less robustness to different visual environments. Different illumination, facial features, camera angles, head poses, and fatigue expressions of different datasets usually lead to the performance degradation of the models when they are confronted with unseen driving conditions.

This limitation becomes more and more important for real-world in-vehicle monitoring systems where the visual conditions are dynamic and unpredictable. Hence, it is necessary to evaluate lightweight architectures on heterogeneous datasets, such as DDD, NTHU-DDD, and YawDD, to verify the model stability, robustness, and practicality for intelligent transportation safety systems.

2.5 Research Gap and Contribution

Even though many deep learning approaches have been proposed for detecting drowsy drivers, almost all of them were evaluated on a single dataset in experimental settings. Therefore, there is a lack of research on how robust lightweight models are to variations in real-world driving scenarios. In addition, the current literature does not sufficiently emphasize efficient computation and reliable performance under real-world conditions.

From previous comparative evaluations and optimizations of lightweight machine learning and deep learning models, it has been shown that there is a need to balance between performance and computational efficiency in intelligent systems [36], [37]. Past comparative studies of computer vision applications have shown that lightweight and hybrid models can still perform strongly in terms of classification ability while lowering computation costs in embedded systems applications. There is, however, a lack of such studies that focus on heterogeneous driver monitoring and transportation safety edge applications.

Thus, this paper analyzes the effectiveness of three selected models, namely YOLO11n, ResNet50, and SSD-MobileNetV2, on three heterogeneous publicly available datasets, namely DDD, NTHU-DDD, and YawDD, which cover a wide variety of illumination conditions, facial features, and head poses. Moreover, a statistical evaluation based on McNemar’s Test and CI estimation is performed to improve the credibility of the comparison. The results are intended to inform the design of future driver monitoring systems.

3. Methodology

3.1 Research Framework

Figure 1 shows the entire framework that was employed in this study to detect driver drowsiness using lightweight deep learning frameworks. The process started with acquiring data from different public sources including DDD, NTHU-DDD, and YawDD, followed by frame extraction to get the images of drowsy and non-drowsy drivers. This was followed by pre-processing, which involved resizing and normalizing of images.

Figure 1. Methodological pipeline

To improve model robustness and avoid overfitting, augmentation methods such as flipping, zooming, and rotation were used. Three lightweight models, namely YOLO11n, Resnet50, and SSD-MobileNetV2, were subsequently trained with the selected hyperparameter values. Lastly, model performance was evaluated via confusion matrix analysis to test robustness and reliability under heterogeneous driving scenarios.

3.2 Dataset Description and Preparation

The initial step is to collect data suited to detecting drowsiness in drivers. The images used in the current study vary in driving conditions, fatigue level, illumination, and facial pose. Three heterogeneous datasets obtained from Kaggle were used: the Driver Drowsiness Dataset (DDD) [38], NTHU-DDD [39], and YawDD [40]. These datasets include images and videos of drowsy and nondrowsy drivers. The combined data were split into two distinct subsets, a training set and a validation set. The training set was used to train the deep learning models YOLO11n, Resnet50, and SSD-MobileNetV2.

Integrating the three datasets increased the variability of the combined dataset. Data augmentation was then applied to preserve data quality and variability and to avoid overfitting. Table 2 shows the number of image samples obtained after data augmentation for testing the three models:

Table 2. Distribution of image samples across drowsy and nondrowsy classes

Dataset	Drowsy	Nondrowsy	Total
DDD	83	83	166
NTHU-DDD	84	84	168
YawDD	83	83	166
Total	250	250	500

Note: Combined dataset of DDD, NTHU-DDD and YawDD; DDD = Driver Drowsiness Dataset; NTHU-DDD = National Tsing Hua University Driver Drowsiness Detection Dataset; YawDD = Yawning Detection Dataset.

3.3 Lightweight Deep Learning Architectures

In this research work, three deep learning models, namely YOLO11n, Resnet50, and SSD-MobileNetV2, are considered in order to develop a reliable baseline model for real-time driver drowsiness detection in intelligent transportation safety systems. These three models were selected because their differing design characteristics are relevant to embedded vision applications for driver monitoring. All three models were initialized with pre-trained weights before training.

The selection of YOLO11n is based on its lightweight architecture and real-time object detection feature, making it feasible to deploy in an embedded monitoring system where computational power is limited. With its ability to do fast inference, it can be deployed in a transportation safety system that requires instant driver state detection. But previous research revealed that a lightweight YOLO-based architecture faces difficulties in detecting facial fatigue due to visual challenges [6].

The ResNet50 network was used as a representative deep residual learning architecture, which can extract high-level visual features and maintain stable performance on different datasets [23]. Residual learning provides deep feature representations, which is advantageous for identifying facial fatigue characteristics. However, the architecture requires more computational resources than lightweight embedded systems typically provide.

SSD-MobileNetV2 integrates the SSD detection structure with the efficient MobileNetV2 backbone, combining computational efficiency with high detection accuracy [11]. This architecture is suitable for real-time deployment scenarios while maintaining stable detection performance across different driving conditions. Previous research indicated that SSD-based lightweight structures work well in real-time driver monitoring systems [25].

The comparison of these architectures enables evaluation of the trade-off between accuracy, computational efficiency, and deployment feasibility for intelligent transportation safety systems and embedded driver-monitoring environments.

3.4 Model Training Configuration

The tested architectures, which include YOLO11n, Resnet50, and SSD-MobileNetV2, were initialized and trained under similar experimental conditions to ensure a fair comparison between them. The experiments were performed using Google Colab with NVIDIA Tesla T4 GPU support.

Prior to training, all image samples were resized to 640 × 640 pixels and normalized for consistent inputs to the models. The dataset from DDD, NTHU-DDD, and YawDD was split into train and validation sets in an 80:20 proportion. In order to make the models more robust to noise and minimize overfitting, data augmentation, including flips, zooms, and rotations, was used while training the models.
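The 80:20 split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the split is stratified by class (the text does not say so), uses placeholder file names, and takes the 250 images per class from Table 2. The function name `stratified_split` and the fixed seed are hypothetical.

```python
import random

def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split (path, label) pairs into train and validation lists, class by class."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    by_label = {}
    for path, label in samples:
        by_label.setdefault(label, []).append((path, label))
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val

# Placeholder file names (assumption); 250 images per class as in Table 2.
samples = [(f"drowsy_{i:03d}.jpg", "drowsy") for i in range(250)]
samples += [(f"nondrowsy_{i:03d}.jpg", "nondrowsy") for i in range(250)]

train_set, val_set = stratified_split(samples)
print(len(train_set), len(val_set))  # prints: 400 100
```

Splitting per class keeps the drowsy and nondrowsy proportions identical in both subsets, which matters when recall on the drowsy class is the metric of interest.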

Training parameters of all models were set up using the Adam optimizer with a learning rate of 0.01, batch size 16, and 50 epochs. For consistency, the same training parameters were used in YOLO11n, Resnet50, and SSD-MobileNetV2 so that the performance difference between them could be due to architectural differences only. These parameters were chosen considering practicality as well as efficiency in terms of computational speed and reliability.
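For reference, the shared settings can be collected in a single configuration object so that all three models receive identical values. The sketch below is illustrative only: the class name `TrainConfig` and the helper `steps_per_epoch` are hypothetical, and the figure of 500 images assumes that the total in Table 2 is the pool that was split 80:20.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in Section 3.4, shared by all three models."""
    image_size: int = 640          # images resized to 640 x 640 pixels
    optimizer: str = "adam"
    learning_rate: float = 0.01
    batch_size: int = 16
    epochs: int = 50
    train_fraction: float = 0.8    # 80:20 train/validation split

    def steps_per_epoch(self, n_images: int) -> int:
        """Number of optimizer steps needed to see every training image once."""
        n_train = round(n_images * self.train_fraction)
        return math.ceil(n_train / self.batch_size)

cfg = TrainConfig()
steps = cfg.steps_per_epoch(500)   # 500 images is an assumption taken from Table 2
print(steps, steps * cfg.epochs)   # prints: 25 1250
```

Freezing the dataclass prevents one model's training script from silently changing a value that the other two also depend on.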

3.5 Performance Evaluation

Following the training phase, the model’s performance was analyzed based on confusion matrix criteria such as accuracy, precision, recall, and F1 score. These measures were adopted to analyze the performance of the evaluated architectures in discriminating between drowsy and non-drowsy drivers under various driving conditions. Recall and false negatives were particularly considered, because a failed identification of drowsiness could directly impact transportation safety systems.
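The four metrics can be computed directly from the cells of a binary confusion matrix. The sketch below is one possible way to do so and is not the authors' evaluation code. It reports support-weighted averages over the two classes, which is an assumption about the averaging scheme. The example takes TP = 50, FN = 0, and FP = 13 from the YOLO11n confusion matrix described in Section 4; the true-negative count of 47 is not stated in the text and is assumed here because, under weighted averaging, it reproduces the YOLO11n row of Table 4.

```python
def weighted_metrics(tp, fn, fp, tn):
    """Support-weighted metrics for a 2x2 confusion matrix (positive = drowsy)."""
    n_pos, n_neg = tp + fn, tn + fp
    total = n_pos + n_neg

    def ratio(num, den):
        return num / den if den else 0.0

    def f1(p, r):
        return ratio(2 * p * r, p + r)

    # Per-class precision and recall.
    prec_pos, rec_pos = ratio(tp, tp + fp), ratio(tp, n_pos)
    prec_neg, rec_neg = ratio(tn, tn + fn), ratio(tn, n_neg)

    # Each class is weighted by its share of the test samples.
    w_pos, w_neg = n_pos / total, n_neg / total
    return {
        "accuracy": (tp + tn) / total,
        "precision": w_pos * prec_pos + w_neg * prec_neg,
        "recall": w_pos * rec_pos + w_neg * rec_neg,
        "f1": w_pos * f1(prec_pos, rec_pos) + w_neg * f1(prec_neg, rec_neg),
    }

# TP, FN, FP from Section 4; TN = 47 is an assumption (see lead-in).
yolo = weighted_metrics(tp=50, fn=0, fp=13, tn=47)
for name, value in yolo.items():
    print(f"{name}: {value * 100:.2f}%")
```

With weighted averaging, recall always equals accuracy, which is consistent with the identical accuracy and recall values reported for each model.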

Besides the evaluation measures, statistical validation was performed using the McNemar Test and CI. The McNemar Test was used to determine the statistical significance of differences in the performance of the examined systems, whereas the CI was utilized to measure the reliability and consistency of the achieved results.

4. Results

In this section, we will explore in depth the results of tests performed to evaluate the performance of models for detecting drowsiness. In order to establish a reliable baseline, three representative models were tested: YOLO11n, Resnet50, and SSD-MobileNetV2. The results obtained from each of these models give us valuable information on how well they perform.

As shown in the confusion matrix in Figure 2, the YOLO11n algorithm was able to detect all drowsiness cases correctly (50/50), giving a perfect recall rate for the drowsy class. This indicates that the YOLO11n model is less prone to missing drowsiness, an aspect that is very important, especially in critical areas where safety is paramount. However, 13 samples of non-drowsy were falsely categorized as drowsy, indicating over-detection.

Figure 2. Confusion matrix: (a) YOLO11n; (b) Resnet50; and (c) SSD-MobileNetV2


The performance of the Resnet50 model was found to be moderate when differentiating between the two states of drivers. The model produces both false positive and false negative predictions, which implies poor stability in classifying drivers under heterogeneous driving conditions. Its training and validation losses decrease gradually throughout training but remain comparatively high (Table 3), which indicates slow convergence.

On the other hand, SSD-MobileNetV2 showed better generalization ability compared to Resnet50. This is because SSD-MobileNetV2 made predictions with higher accuracy on both the drowsy and non-drowsy classes while committing fewer classification errors. Furthermore, the training and validation loss kept reducing throughout the training period, suggesting better generalization ability. It can be observed that SSD-MobileNetV2 performs better under heterogeneous visual scenarios provided by DDD, NTHU-DDD, and YawDD datasets.

From the results, it is clear that SSD-MobileNetV2 is the best in terms of generalization and stability, whereas YOLO11n performs well in terms of precision but shows signs of overfitting late in training. Resnet50, despite its feature extraction strengths, converges slowly and yields the weakest classification results. SSD-MobileNetV2 and YOLO11n both demonstrate practical applicability for drowsiness detection, but SSD-MobileNetV2 provides a better balance between stability and generalization.

Table 3 shows how the three architectures performed in the same experiments by comparing their training loss and validation loss values at different epochs (1, 10, 20, 30, 40, and 50). As the number of epochs increases, the training loss decreases for all three models, which shows that all three fit the training data progressively better. However, the models differ in how well they generalize to the validation data.

Table 3. Train and validation loss

Epoch	YOLO11n Train Loss	YOLO11n Val Loss	Resnet50 Train Loss	Resnet50 Val Loss	SSD-MobileNetV2 Train Loss	SSD-MobileNetV2 Val Loss
1	1.6731	1.6633	0.9616	0.8708	1.5757	0.8071
10	0.6028	1.0908	0.6894	0.6907	0.3438	0.2521
20	0.4335	0.7331	0.6806	0.6636	0.1780	0.1344
30	0.3190	0.4516	0.6212	0.6324	0.1364	0.0861
40	0.2898	0.3930	0.6121	0.6009	0.0992	0.0585
50	0.2004	0.7858	0.5823	0.5711	0.0822	0.0438

The YOLO11n model’s training loss steadily dropped from 1.6731 at epoch 1 to 0.2004 at epoch 50. But its validation loss did not decrease as much. The validation loss decreased from 1.6633 (epoch 1) to 0.3930 (epoch 40), then increased to 0.7858 (epoch 50). This indicates that the model is overfitting: it adapts too closely to the training data and performs worse on the validation data.

Meanwhile, Resnet50 showed a steady downward trend. Train loss decreased from 0.9616 (epoch 1) to 0.5823 (epoch 50), and validation loss decreased from 0.8708 to 0.5711 over the same epochs. This indicates that training proceeds smoothly without overfitting, although both losses remain well above those of SSD-MobileNetV2, so the model's generalization capability is limited by comparison.

On the other hand, the SSD-MobileNetV2 model shows the best performance. Train loss decreases sharply from 1.5757 at epoch 1 to just 0.0822 at epoch 50. The validation loss also followed a positive trend, dropping dramatically from 0.8071 to 0.0438. This shows that SSD-MobileNetV2 not only learns well from the training data but also generalises well to the validation data, without any signs of overfitting. Based on these results, SSD-MobileNetV2 demonstrated the most balanced and stable overall performance in this training scenario.
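The loss trends discussed above can be checked mechanically from the values in Table 3. The following sketch is an illustration only: it uses just the six sampled epochs listed in the table, and the summary fields (`best_epoch`, `final_gap`, `val_rebound`) are hypothetical names chosen for this example.

```python
EPOCHS = [1, 10, 20, 30, 40, 50]

# Training and validation losses copied from Table 3.
LOSSES = {
    "YOLO11n": {
        "train": [1.6731, 0.6028, 0.4335, 0.3190, 0.2898, 0.2004],
        "val": [1.6633, 1.0908, 0.7331, 0.4516, 0.3930, 0.7858],
    },
    "Resnet50": {
        "train": [0.9616, 0.6894, 0.6806, 0.6212, 0.6121, 0.5823],
        "val": [0.8708, 0.6907, 0.6636, 0.6324, 0.6009, 0.5711],
    },
    "SSD-MobileNetV2": {
        "train": [1.5757, 0.3438, 0.1780, 0.1364, 0.0992, 0.0822],
        "val": [0.8071, 0.2521, 0.1344, 0.0861, 0.0585, 0.0438],
    },
}

def summarize(curves):
    """Locate the lowest validation loss and measure how far the last epoch drifts from it."""
    val, train = curves["val"], curves["train"]
    best = min(range(len(val)), key=val.__getitem__)
    return {
        "best_epoch": EPOCHS[best],                    # sampled epoch with lowest val loss
        "final_gap": round(val[-1] - train[-1], 4),    # val minus train loss at epoch 50
        "val_rebound": round(val[-1] - val[best], 4),  # rise in val loss after its minimum
    }

summary = {name: summarize(curves) for name, curves in LOSSES.items()}
for name, stats in summary.items():
    print(name, stats)
```

A positive `val_rebound` together with a large `final_gap` is the pattern described for YOLO11n, while both values stay at or below zero for Resnet50 and SSD-MobileNetV2.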

All reported evaluation metrics were calculated based on binary classification performance across the drowsy and non-drowsy classes.

The test results for the three models in Table 4 show clear differences in accuracy, precision, recall, and F1 score; processing time is reported separately in Figure 3. YOLO11n demonstrated strong precision, at 90.62%, although the confusion matrix in Figure 2 shows that 13 non-drowsy samples were still flagged as drowsy. Its accuracy and recall were both 88.18%, slightly behind SSD-MobileNetV2, and its F1 score of 88.14% indicates a good balance between precision and recall. YOLO11n therefore showed relatively stable detection capability under heterogeneous driving conditions, but its overall performance, stability, and processing efficiency remained below those of SSD-MobileNetV2 for practical embedded deployment scenarios.

Table 4. Results of testing three models

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
YOLO11n	88.18	90.62	88.18	88.14
Resnet50	74	75	74	74
Single Shot Detector (SSD)-MobileNetV2	92	93	92	92

Resnet50 was the weakest of the three models across the performance measures, which suggests that it is not the most appropriate architecture for real-time applications. It classified only 74% of the samples correctly, with a precision of 75% and a recall and F1 score of 74% each, which is below both other networks. Although its processing time is reasonable, Resnet50 is poorly suited to real-time applications because of its low accuracy and stability.

SSD-MobileNetV2, on the other hand, was the best-performing network across almost all measures. SSD-MobileNetV2 had the highest accuracy of 92%. The SSD-MobileNetV2 model also performed extremely well when it comes to precision and recall, achieving values of 93% and 92%, respectively. These numbers demonstrate that the SSD-MobileNetV2 is capable of object detection and classification with high accuracy.

Figure 3 shows that SSD-MobileNetV2 achieved the shortest overall processing duration among the evaluated architectures, requiring approximately 0.0058 hours (about 21 seconds), compared to YOLO11n (0.068 hours, about 245 seconds) and Resnet50 (0.0233 hours, about 84 seconds). The processing time was measured under the same experimental setup using Google Colab with NVIDIA Tesla T4 GPU support. These numbers give the total processing time over the experimental evaluation and not the inference latency for individual images. The results show that SSD-MobileNetV2 has less computational overhead and higher computational efficiency than the other models tested. The architecture therefore demonstrates potential for future embedded transportation safety applications.

Figure 3. Total experimental processing duration under identical Graphics Processing Unit (GPU)-based evaluation settings

Given the same experimental environment, the reported values should be interpreted as relative indicators of the computational workload rather than real-time inference latency measurements. The actual deployment performance in real transportation systems may be dependent on embedded hardware specifications, camera resolution, and streaming conditions. Future studies should evaluate the inference latency at the frame level, FPS performance, and energy efficiency on embedded automotive platforms such as NVIDIA Jetson or edge-based IoT devices.

To assess whether the performance differences between YOLO11n, Resnet50, and SSD-MobileNetV2 are statistically significant, two statistical analyses were performed: the McNemar significance test and the calculation of CIs for accuracy. These analyses evaluate not only whether one model is numerically superior but also whether that difference is statistically significant.

The McNemar test is a non-parametric test commonly used to evaluate whether two classifiers differ significantly in their error distributions when applied to the same dataset [41]. Several deep learning evaluation studies recommend it because it directly compares disagreement cases rather than aggregate accuracy.

$\chi^2=\frac{(|b-c|-1)^2}{b+c}$ (1)

where $b$ = number of cases misclassified by model A but correctly classified by model B; $c$ = number of cases misclassified by model B but correctly classified by model A; $H_0$: there is no significant difference between the two models; significance threshold: $p <$ 0.05.
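As a worked illustration of Eq. (1), the sketch below computes the continuity-corrected statistic and converts a $\chi^2$ value with one degree of freedom into a $p$-value using only the Python standard library. The function names are hypothetical, and the counts `b=12, c=4` are invented for illustration and do not come from the study. The three $\chi^2$ values in the loop are the ones reported in Table 5.

```python
import math

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic from the two disagreement counts, Eq. (1)."""
    if b + c == 0:
        raise ValueError("the two models never disagree; the test is undefined")
    return (abs(b - c) - 1) ** 2 / (b + c)

def chi2_p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with one degree of freedom.

    For one degree of freedom this equals erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical disagreement counts, for illustration only.
example_chi2 = mcnemar_chi2(b=12, c=4)
print(example_chi2, chi2_p_value_1df(example_chi2))

# Chi-square values reported in Table 5.
reported = {
    "SSD-MobileNetV2 vs. Resnet50": 4.00,
    "SSD-MobileNetV2 vs. YOLO11n": 1.25,
    "YOLO11n vs. Resnet50": 0.516,
}
p_values = {pair: chi2_p_value_1df(chi2) for pair, chi2 in reported.items()}
for pair, p in p_values.items():
    print(f"{pair}: p = {p:.4f}, significant = {p < 0.05}")
```

Only the disagreement cells $b$ and $c$ enter the statistic, so two models with very different accuracies can still yield a non-significant result when they disagree on few samples.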

To quantify the statistical certainty of accuracy values, CIs were computed using the binomial proportion formula [42].

$\mathrm{C I}=p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}$

(2)

where, $p$ = accuracy, $n$ = total test samples, and 1.96 = $z$-score for 95% confidence.
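Eq. (2) is the normal-approximation (Wald) interval for a proportion, and a short sketch shows how the half-width depends on both $p$ and $n$. The function name `wald_ci` is hypothetical, the test-set size $n = 100$ in the example is an assumption made only for illustration, and the clamping to $[0, 1]$ is an added safeguard that is not part of Eq. (2).

```python
import math


def wald_ci(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, as in Eq. (2)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a proportion in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    # Clamp so the interval stays inside the valid range for a proportion.
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


if __name__ == "__main__":
    # n = 100 is an assumed test-set size used only for illustration.
    low, high = wald_ci(0.92, 100)
    print(f"{low:.3f} to {high:.3f}")
```

For an accuracy of 0.92 and the assumed $n = 100$, the half-width is $1.96\sqrt{0.92 \times 0.08 / 100} \approx 0.053$, giving an interval of roughly 0.867 to 0.973. A larger $n$ or an accuracy further from 0.5 narrows the interval.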

The statistical analysis in Table 5 and Figure 4 shows that SSD-MobileNetV2 is significantly better than Resnet50 ($p$ = 0.045), whereas the differences between SSD-MobileNetV2 and YOLO11n ($p$ = 0.26) and between YOLO11n and Resnet50 ($p$ = 0.47) are not statistically significant. The CI analysis in Table 6 further shows that SSD-MobileNetV2 has the narrowest interval (86.7%–97.3%, a width of 10.6 percentage points, compared with 12.1 for YOLO11n and 16.4 for Resnet50), which points to a more stable and reliable accuracy estimate. The overlap between the SSD-MobileNetV2 and YOLO11n intervals is consistent with the non-significant McNemar result for that pair.

Table 5. McNemar test

Comparison	$\boldsymbol{\chi^2}$	$\boldsymbol{p}$-Value	Significant?
Single Shot Detector (SSD)-MobileNetV2 vs. Resnet50	4.00	0.045	Yes
SSD-MobileNetV2 vs. YOLO11n	1.25	0.26	No
YOLO11n vs. Resnet50	0.516	0.47	No

Figure 4. Significance comparison

Table 6. Accuracy and 95% Confidence Intervals (CI)

Model	Accuracy	95% CI
Single Shot Detector (SSD)-MobileNetV2	92%	86.7%–97.3%
YOLO11n	88.18%	82.1%–94.2%
Resnet50	74%	65.8%–82.2%

Overall, SSD-MobileNetV2 is the best-performing of the three models across accuracy, precision, recall, F1-score, and processing time. Its performance is comparable to that of several previously reported lightweight driver monitoring systems [32]. YOLO11n shows strengths in specific situations where precision is of utmost importance, but it is less stable than SSD-MobileNetV2 as a baseline. When efficiency and high performance are both primary goals, SSD-MobileNetV2 is therefore the preferred model. As mentioned above, hybrid methods such as COOT, MCNN + FSA, and ConNN have reported accuracy above 90% together with a good balance between precision and recall [9], [18], [35]. This suggests that more advanced combined techniques can yield further gains over standalone models such as YOLO11n, Resnet50, and SSD-MobileNetV2.

The superior performance of SSD-MobileNetV2 in the present work can be attributed to its lightweight structure, which allows for efficient feature extraction along with a low computational complexity and stable detection capacity in different driving environments. The architecture showed higher consistency in performance on the integrated DDD, NTHU-DDD, and YawDD datasets, thereby suggesting that it had superior generalization capabilities under different conditions of illumination, facial appearance, and head orientations often witnessed in practical driving situations. Such an architectural advantage makes SSD-MobileNetV2 a suitable baseline framework for embedded driver monitoring systems.

5. Discussion

The results of this study highlight the importance of computationally efficient architecture for driver monitoring systems in achieving stable performance under heterogeneous driving conditions. This aspect becomes increasingly important as modern vehicles are progressively integrated with intelligent transportation safety systems utilizing IoT and edge-computing technologies. Among the evaluated architectures, SSD-MobileNetV2 demonstrated the most stable generalization performance across the YawDD, DDD, and NTHU-DDD datasets under varying lighting conditions, facial appearances, and head poses.

These findings are consistent with previous studies showing that lightweight deep learning architectures can maintain stable classification performance while reducing computational overhead [37]. Previous comparative studies in computer vision have also demonstrated that lightweight and hybrid CNN-based architectures are capable of achieving scalable performance suitable for practical deployment environments [36]. In ITS, balancing computational efficiency and detection reliability is particularly important because embedded driver monitoring systems often operate under constrained hardware resources and dynamically changing visual conditions. Therefore, the lightweight MobileNetV2 architecture combined with a single-shot detection framework provides a promising baseline for future studies involving sequence modeling, sensor fusion, and edge-based transportation safety applications.

Figure 5 illustrates the proposed practical deployment workflow of the SSD-MobileNetV2-based driver monitoring system. An in-vehicle camera continuously captures the driver’s facial condition and transmits the visual data to an embedded AI processing unit for real-time analysis. The SSD-MobileNetV2 architecture analyzes fatigue-related features such as eye behavior, yawning activity, facial expressions, and head posture to estimate the driver’s alertness state. When drowsiness symptoms are detected, the system activates warning mechanisms, including visual alerts and audible alarms, to improve driver awareness and reduce accident risk. Due to its lightweight computational design, the proposed architecture is suitable for deployment on embedded and edge-based transportation platforms with limited hardware resources. In fleet safety environments, the system may also support centralized driver monitoring and transportation safety management through IoT-enabled monitoring infrastructure.
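The alerting stage of this workflow can be sketched as a small decision loop over per-frame classifier outputs. This is an illustrative sketch only: the function name `monitor`, the label strings, and the sliding-window vote with its `window` and `threshold` parameters are all assumptions introduced here, and they are not part of the evaluated model, which operates at the frame level.

```python
from collections import deque


def monitor(labels, window=15, threshold=0.8):
    """Yield an alert flag for each per-frame label.

    `labels` is an iterable of per-frame outputs ("drowsy" or "alert") from a
    hypothetical classifier. The sliding-window vote is an illustrative
    assumption and is not part of the evaluated model.
    """
    recent = deque(maxlen=window)
    for label in labels:
        recent.append(label == "drowsy")
        # Raise the alarm only once the window is full and mostly drowsy,
        # so a single misclassified frame does not trigger a warning.
        yield len(recent) == window and sum(recent) / window >= threshold


if __name__ == "__main__":
    # Simulated label stream: 10 alert frames followed by 20 drowsy frames.
    stream = ["alert"] * 10 + ["drowsy"] * 20
    flags = list(monitor(stream, window=10, threshold=0.8))
    print("first alarm at frame index", flags.index(True))
```

In a real deployment, each yielded flag would drive the visual and audible warnings described above, and the window length would need to be tuned to the camera frame rate.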

Figure 5. Proposed practical deployment workflow of the Single Shot Detector (SSD)-MobileNetV2-based driver monitoring system for intelligent transportation safety applications

6. Conclusions

This study evaluated the reliability of lightweight driver monitoring architectures for vision-based driver drowsiness detection, using three heterogeneous public datasets to reflect varied driving conditions. The experimental results show that SSD-MobileNetV2 performed best in accuracy, precision, recall, and F1-score, at a lower computational cost than the other architectures. Statistical analysis using McNemar’s test and 95% CIs supports this result: the advantage over Resnet50 is statistically significant, whereas the difference from YOLO11n is not. SSD-MobileNetV2 can therefore be considered a suitable lightweight baseline architecture for driver monitoring systems in intelligent transportation safety environments.

Despite these promising results, the study has certain limitations, including its reliance on open-source datasets, the lack of temporal modeling in its frame-level analysis, and the evaluation of only three representative lightweight networks. In addition, although three heterogeneous public datasets were used, the overall experimental scale remains limited relative to real-world transportation environments. In real-world deployment, model generalization can be affected by changes in traffic conditions, driver behavior, camera placement, weather conditions, and uncontrolled illumination. Additional validation in real-world driving scenarios and more complex transportation conditions is therefore needed to evaluate the robustness and scalability of the proposed architecture. Even so, the study can serve as a stepping stone for future research on intelligent transportation safety.

7. Future Work

Future research will focus on enhancing the SSD-MobileNetV2 architecture for practical implementation on embedded and IoT-enabled transportation platforms. Comparisons with more advanced lightweight networks such as MobileNetV3 and EfficientNet, with TinyML-based models, and with temporal deep learning models would help assess computational efficiency and resilience under varying driving conditions. Hybrid methods that incorporate physiological parameters, vehicle telemetry, or driver behavior may also be considered to improve accuracy in challenging driving situations. Finally, validation in real-world driving settings and on automotive platforms is essential to establish the deployment feasibility and practicality of ITS applications, including frame-level inference latency, FPS, memory consumption, and energy efficiency on embedded and edge automotive platforms.

Author Contributions

Conceptualization, Y.C.G. and M.; methodology, Y.C.G.; writing—original draft preparation, Y.C.G.; supervision, M.; writing—review and editing, M. and D.A.D.; data curation, A.; formal analysis, A. and H.A.A.; investigation, A. and R.S.B.; visualization, R.S.B.; software, D.A.D.; validation, H.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Ministry of Higher Education, Science, and Technology of the Republic of Indonesia (Kemdiktisaintek) through the BIMA Fundamental Research Program 2026 (Grant No.: 133/C3/DT.05.00/PL-MULTITAHUN LANJUTAN/2026).

Data Availability

The datasets used in this study are publicly available through Kaggle and include the Driver Drowsiness Dataset (DDD), NTHU-DDD, and YawDD datasets cited in references [38], [39], [40]. All datasets were accessed and used solely for academic research purposes in accordance with their respective public usage policies. Processed data and experimental configurations used in this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to acknowledge the support provided by Universitas Dian Nuswantoro in facilitating the completion of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. World Health Organization (WHO), Global Status Report on Road Safety 2023. [Online]. Available: https://www.who.int/teams/social-determinants-of-health/safety-and-mobility/global-status-report-on-road-safety-2023

2. E. H. Rad, M. Hosseinnia, N. Mousavi, A. Shekari, L. Kouchakinejad-Eramsadati, and N. Khodadadi-Hassankiadeh, “Fatigue in taxi drivers and its relationship with traffic accident history and experiences: A cross-sectional study in the north of Iran,” BMC Public Health, vol. 24, no. 1, p. 530, 2024.

3. A. Williamson, D. A. Lombardi, S. Folkard, J. Stutts, T. K. Courtney, and J. L. Connor, “The link between fatigue and safety,” Accid. Anal. Prev., vol. 43, no. 2, pp. 498–515, 2011.

4. J. Lim and D. F. Dinges, “A meta-analysis of the impact of short-term sleep deprivation on cognitive variables,” Psychol. Bull., vol. 136, no. 3, pp. 375–389, 2010.

5. L. Chen and W. Zheng, “Research on railway dispatcher fatigue detection method based on deep learning with multi-feature fusion,” Electronics, vol. 12, no. 10, p. 2303, 2023.

6. S. Liu, Y. Wang, Q. Yu, J. Zhan, H. Liu, and J. Liu, “A driver fatigue detection algorithm based on dynamic tracking of small facial targets using YOLOv7,” IEICE Trans. Inf. Syst., vol. E106.D, no. 11, pp. 1881–1890, 2023.

7. A. Sohail, A. A. Shah, S. Ilyas, and N. Alshammry, “A CNN-based deep learning framework for driver’s drowsiness detection,” Int. J. Adv. Comput. Sci. Appl., vol. 15, no. 3, pp. 169–178, 2024.

8. I. Jahan, K. M. A. Uddin, S. A. Murad, M. S. U. Miah, T. Z. Khan, M. Masud, S. Aljahdali, and A. K. Bairagi, “4D: A real-time driver drowsiness detector using deep learning,” Electronics, vol. 12, no. 1, p. 235, 2023.

9. G. R. Devi, H. M. Al-Tmimi, G. K. Ghadir, S. Sharma, E. Patnala, B. K. Bala, and Y. A. El-Ebiary, “COOT-optimized real-time drowsiness detection using GRU and enhanced deep belief networks for advanced driver safety,” Int. J. Adv. Comput. Sci. Appl., vol. 15, no. 4, pp. 804–814, 2024.

10. S. Anber, W. Alsaggaf, and W. Shalash, “A hybrid driver fatigue and distraction detection model using AlexNet based on facial features,” Electronics, vol. 11, no. 2, p. 285, 2022.

11. A. A. Almazroi, M. A. Alqarni, N. Aslam, and R. A. Shah, “Real-time CNN-based driver distraction & drowsiness detection system,” Intell. Autom. Soft Comput., vol. 37, no. 2, pp. 2153–2174, 2023.

12. A. A. Sheikh and I. Z. Khan, “Enhancing road safety: Real-time detection of driver distraction through convolutional neural networks,” arXiv preprint, Art. no. arXiv:2405.17788, 2024.

13. H. Zheng, Y. Wang, and X. Liu, “Adaptive driver face feature fatigue detection algorithm research,” Appl. Sci., vol. 13, no. 8, p. 5074, 2023.

14. F. Majeed, U. Shafique, M. Safran, S. Alfarhood, and I. Ashraf, “Detection of drowsiness among drivers using novel deep convolutional neural network model,” Sensors, vol. 23, no. 21, p. 8741, 2023.

15. R. Florez, F. Palomino-Quispe, R. J. Coaquira-Castillo, J. C. Herrera-Levano, T. Paixão, and A. B. Alvarez, “A CNN-based approach for driver drowsiness detection by real-time eye state identification,” Appl. Sci., vol. 13, no. 13, p. 7849, 2023.

16. P. Christakos, N. Petrellis, P. Mousouliotis, G. Keramidas, C. P. Antonopoulos, and N. Voros, “A high performance and robust FPGA implementation of a driver state monitoring application,” Sensors, vol. 23, no. 14, p. 6344, 2023.

17. A. Sedik, M. Marey, and H. Mostafa, “An adaptive fatigue detection system based on 3D CNNs and ensemble models,” Symmetry, vol. 15, no. 6, p. 1274, 2023.

18. V. Vijaypriya and M. Uma, “Facial feature-based drowsiness detection with multi-scale convolutional neural network,” IEEE Access, vol. 11, pp. 63417–63429, 2023.

19. H. Zhang, T. Liu, J. Lyu, D. Chen, and Z. Yuan, “Integrate memory mechanism in multi-granularity deep framework for driver drowsiness detection,” Intell. Robot., vol. 3, no. 4, pp. 614–631, 2023.

20. N. N. Alajlan and D. M. Ibrahim, “DDD TinyML: A TinyML-based driver drowsiness detection model using deep learning,” Sensors, vol. 23, no. 12, p. 5696, 2023.

21. M. Peivandi, S. Z. Ardabili, S. Sheykhivand, and S. Danishvar, “Deep learning for detecting multi-level driver fatigue using physiological signals: A comprehensive approach,” Sensors, vol. 23, no. 19, p. 8171, 2023.

22. W. Sun, Y. Wang, B. Hu, and Q. Wang, “Exploration of eye fatigue detection features and algorithm based on eye-tracking signal,” Electronics, vol. 13, no. 10, p. 1798, 2024.

23. B. E. B. Lim, K. W. Ng, and S. L. Ng, “Drowsiness detection system through eye and mouth analysis,” JOIV Int. J. Inform. Vis., vol. 7, no. 4, pp. 2489–2497, 2023.

24. S. Das, S. Pratihar, B. Pradhan, R. H. Jhaveri, and F. Benedetto, “IoT-assisted automatic driver drowsiness detection through facial movement analysis using deep learning and a U-Net-based architecture,” Information, vol. 15, no. 1, p. 30, 2024.

25. W. Gao, L. Li, and T. Yang, “Driving state discrimination algorithm based on lightweight network and contrast learning,” Int. J. Pattern Recognit. Artif. Intell., vol. 36, no. 11, p. 2252025, 2022.

26. Z. Huang, W. Tang, Q. Tian, T. Huang, and J. Li, “Air traffic controller fatigue detection based on facial and vocal features using long short-term memory,” IEEE Access, vol. 12, pp. 56663–56682, 2024.

27. S. A. Alameen and A. M. Alhothali, “A lightweight driver drowsiness detection system using 3DCNN with LSTM,” Comput. Syst. Sci. Eng., vol. 44, no. 1, pp. 895–912, 2023.

28. B. Akrout and S. Fakhfakh, “How to prevent drivers before their sleepiness using deep learning-based approach,” Electronics, vol. 12, no. 4, p. 965, 2023.

29. Z. Huang, “Integrating attention mechanisms and ResNet-50 for enhanced driver sleepiness detection,” Informatica, vol. 49, no. 15, pp. 135–144, 2025.

30. R. Jabbar, K. Al-Khalifa, M. Kharbeche, W. Alhajyaseen, M. Jafari, and S. Jiang, “Real-time driver drowsiness detection for Android application using deep neural networks techniques,” Procedia Comput. Sci., pp. 400–407, 2018.

31. V. U. Maheswari, R. Aluvalu, M. V. V. P. Kantipudi, K. K. Chennam, K. Kotecha, and J. R. Saini, “Driver drowsiness prediction based on multiple aspects using image processing techniques,” IEEE Access, vol. 10, pp. 54980–54990, 2022.

32. H. He, X. Zhang, F. Jiang, C. Wang, Y. Yang, W. Liu, and J. Peng, “A real-time driver fatigue detection method based on two-stage convolutional neural network,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 15374–15379, 2020.

33. X. Li, J. Xia, L. Cao, G. Zhang, and X. Feng, “Driver fatigue detection based on convolutional neural network and face alignment for edge computing device,” Proc. Inst. Mech. Eng. Part D J. Automob. Eng., vol. 235, no. 10–11, pp. 2699–2711, 2021.

34. L. Chen, G. Xin, Y. Liu, and J. Huang, “Driver fatigue detection based on facial key points and LSTM,” Secur. Commun. Netw., vol. 2021, pp. 1–9, 2021.

35. B. K. Savaş and Y. Becerikli, “Real time driver fatigue detection system based on multi-task ConNN,” IEEE Access, vol. 8, pp. 12491–12498, 2020.

36. S. Makwana, D. Vaghela, C. K. Chan, and N. Naik, “Novel hybrid approach of random forest and stacking ensemble to improve fresh water yield prediction in mobile wick solar still,” Desalination Water Treat., vol. 324, p. 101490, 2025.

37. L. D., P. C., M. Batumalay, and K. M. R., “Comparative study of CNN-based architectures for early brain tumor diagnosis,” J. Appl. Data Sci., vol. 7, no. 1, pp. 529–540, 2026.

38. I. Nasri, “Driver Drowsiness Dataset (DDD).” https://www.kaggle.com/datasets/ismailnasri20/driver-drowsiness-dataset-ddd

39. I. El Hamly, “NTHU-DDD.” https://www.kaggle.com/datasets/ikhlaselhamly/nthu-ddd

40. Enider, “Yawdd Dataset.” https://www.kaggle.com/datasets/enider/yawdd-dataset

41. T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Comput., vol. 10, no. 7, pp. 1895–1923, 1998.

42. R. G. Newcombe, “Two-sided confidence intervals for the single proportion: Comparison of seven methods,” Stat. Med., vol. 17, no. 8, pp. 857–872, 1998.

Cite this:

Giap, Y. C., Muljono, Affandy, Basuki, R. S., Al Azies, H., & Dewi, D. A. (2026). Establishing SSD-MobileNetV2 as a Robust Baseline for Driver Drowsiness Detection Toward IoT-Ready in-Driving Safety Systems. Int. J. Transp. Dev. Integr., 10(2), 441-453. https://doi.org/10.56578/ijtdi100208

©2026 by the author(s). Published by Acadlore Publishing Services Limited, Hong Kong. This article is available for free download and can be reused and cited, provided that the original published version is credited, under the CC BY 4.0 license.
