Volume 9, Issue 4, 2025

Abstract

This study proposes a novel approach to driver drowsiness detection using the Video Vision Transformer (ViViT) model, which jointly captures spatial and temporal dynamics to analyze eye conditions and head movements. The National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset, comprising 36,000 annotated video clips, was used for both training and evaluation. The ViViT model was compared against traditional Convolutional Neural Network (CNN) and Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) models and demonstrated superior performance, reaching 96.2% accuracy and a 95.9% F1-score while maintaining a 28.9 ms/frame inference time suitable for real-time deployment. An ablation study indicates that integrating spatial and temporal attention yields a notable improvement in accuracy, and that positional encoding is essential for preserving spatial coherence in video-based inputs. The model's resilience was tested across a range of challenging conditions, including low-light settings, partial occlusions, and abrupt head movements, and it consistently maintained reliable performance. With a compact footprint of just 89 MB, the ViViT model was optimized for deployment on embedded platforms such as the Jetson Nano, making it well suited for edge AI applications. These findings highlight ViViT's promise as a practical, high-performing solution for real-time driver drowsiness detection in real-world scenarios.
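The abstract describes ViViT's combined spatial and temporal attention but gives no implementation details. As a rough illustration only, the sketch below shows one common ViViT-style factorization in PyTorch: per-frame spatial attention over patch tokens, followed by temporal attention over frame tokens, with learned positional encodings in both stages. All names and sizes (patch size, embedding dimension, frame count, class count) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FactorizedViViT(nn.Module):
    """Minimal ViViT-style classifier sketch (hypothetical sizes):
    spatial attention within each frame, then temporal attention
    across frames, with learned positional encodings in both stages."""
    def __init__(self, num_frames=16, image_size=64, patch_size=16,
                 dim=128, depth=2, heads=4, num_classes=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.patch_size = patch_size
        self.to_patch = nn.Linear(patch_dim, dim)
        # learned positional encodings preserve patch and frame order
        self.spatial_pos = nn.Parameter(torch.randn(1, num_patches, dim))
        self.temporal_pos = nn.Parameter(torch.randn(1, num_frames, dim))
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                       batch_first=True),
            num_layers=depth)
        self.spatial_encoder = make_encoder()
        self.temporal_encoder = make_encoder()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                       # (B, T, C, H, W)
        B, T, C, H, W = video.shape
        p = self.patch_size
        # split every frame into non-overlapping p x p patches
        x = video.reshape(B * T, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B * T, -1, C * p * p)
        x = self.to_patch(x) + self.spatial_pos
        x = self.spatial_encoder(x).mean(dim=1)     # one token per frame
        x = x.reshape(B, T, -1) + self.temporal_pos
        x = self.temporal_encoder(x).mean(dim=1)    # one token per clip
        return self.head(x)                         # drowsy/alert logits

model = FactorizedViViT()
clip = torch.randn(2, 16, 3, 64, 64)  # batch of two 16-frame clips
print(model(clip).shape)              # torch.Size([2, 2])
```

Factorizing attention this way keeps each attention call small (patches within a frame, then frames within a clip) rather than attending over all spatio-temporal tokens at once, which is consistent with the real-time and embedded-deployment claims in the abstract.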
