A Hybrid Vision Transformer–Driven Feature Extraction and Machine Learning Framework for Automated Skin Burn Detection
Abstract:
Skin burns represent a major clinical concern due to their association with pain, functional impairment, sensory damage, and even life-threatening complications. Early and accurate assessment is critical for first aid, clinical intervention, and the prevention of secondary complications. However, conventional burn diagnosis remains highly dependent on visual inspection and clinical expertise, which can introduce subjectivity and delay timely decision-making. To address these limitations, a hybrid automated skin burn detection framework was proposed, integrating transformer-based feature extraction with classical machine learning classification. In this framework, discriminative visual features were extracted using multiple Vision Transformer (ViT) architectures, including ViT-B/16, ViT-L/16, ViT-B/32, and DINOv2 (a self-supervised Vision Transformer model). The extracted features were subsequently fused. Given the resulting high-dimensional feature space, dimensionality reduction was performed using the Chi-square (Chi$^2$) algorithm, through which 500 features were retained, reducing computational complexity and mitigating the risk of model overfitting. The reduced feature set was then employed for burn classification using six classifiers. Model effectiveness was assessed using accuracy, precision, sensitivity, and F1-score metrics. Experimental results demonstrated that the Support Vector Machine (SVM) classifier achieved the highest classification performance, yielding an accuracy of 82.29%. Comparable yet slightly lower accuracies were observed for the Light Gradient Boosting Machine (LGBM) (80.51%) and Extreme Gradient Boosting (XGBoost) (80.17%) classifiers. Overall, the proposed hybrid model consistently outperformed baseline models, highlighting its superior discriminative capability. 
These findings indicate that the proposed framework holds strong potential for integration into clinical decision support systems, offering a reliable and objective tool for automated skin burn detection.
1. Introduction
Burn injuries are among the most common types of trauma. A burn is tissue damage caused by various agents, including hot objects, flames, chemicals, electric current, radiation, and friction. Burns affect people both physically and psychologically, and if not treated appropriately, they can have fatal consequences [1]. Skin burns not only cause tissue and organ damage but also profoundly impact an individual’s mental health [2]. Because burns can arise from many causes, the severity of the injury varies widely. To plan the treatment process appropriately, it is essential to identify the cause and extent of the burn quickly.
Early and correct intervention in burn treatment alleviates patient suffering and significantly reduces the risk of death, whereas inappropriate or delayed intervention can endanger human life. The integration of computer-aided systems into medical diagnostic processes has increased the possibilities for accurate and timely intervention in healthcare. Several studies have applied artificial intelligence (AI) to burn detection and demonstrated successful results in treatment processes [3], [4], [5]. Skin burns comprise superficial first-degree, partial-thickness second-degree, and full-thickness third-degree burns, and automatic recognition of these grades by computer systems is important for clinical treatment. Pabitha and Vanathi [6] graded skin burns using the Region-Based Convolutional Neural Network (R-CNN) method and applied masking to the detected burns. Şevik et al. [7] achieved automatic classification of skin burn images using texture-based feature extraction; their best configuration, a multilayer feedforward Artificial Neural Network (ANN) trained with backpropagation, obtained an F-score of 74.28%.
In addition, Abazari et al. [8] conducted a classification study to select the appropriate treatment for burn wounds based on diagnosis and wound type. Similarly, Elsarta et al. [9] used deep learning methods to classify skin burns, integrating multi-source data to develop an AI model. Acha et al. [10] developed a segmentation and classification model utilizing color and texture information from burn images, achieving an 82.26% success rate. Kuan et al. [11] compared 20 classification algorithms for burn-depth classification using an image mining approach, evaluating them both with a train/test split and with 10-fold cross-validation; the best algorithm achieved an average accuracy of 68.9% with the train/test split and 73.2% with 10-fold cross-validation. Rahman et al. [12] focused on image-based diagnosis of burn type and depth, applying various data augmentation procedures to a dataset of 29 patients; evaluated with 5-fold cross-validation, the proposed AI model achieved an average success rate of 79% across the three classes. Khan et al. [13] developed a Deep Convolutional Neural Network (DCNN) model for burn detection; of the 450 images in their dataset, 65% were used for training and the remaining 35% for testing, and the proposed model achieved a success rate of 79.4%.
In this study, a hybrid deep learning model was developed to classify skin burns with high accuracy. The model combines the representational power of several transformer-based architectures. In the first stage, features were extracted from four different architectures and then concatenated, so that complementary representations of the same image could be used together. These models were not fine-tuned; they were used only as frozen feature extractors. Feature selection was then performed on the resulting feature map using Chi$^2$, which reduces training time and can improve classification performance. Finally, the optimized feature map was classified using six different classifiers. In the remainder of this study, the dataset, the deep learning models, and the proposed model are detailed in Section 2; experimental results are presented in Section 3; and the conclusion is given in Section 4.
2. Methodology
In this section, the burn dataset used in the study is examined. The applied methods are also explained in detail.
The dataset was obtained from the publicly available Kaggle website [14] and contains a total of 6,099 images. It includes 200 unlabeled images intended solely for blind testing; because their ground-truth labels are unavailable, prediction accuracy on them cannot be measured, and they were therefore excluded from this study. The remaining 5,899 labeled images were used, of which 4,719 served as training data and 1,180 as held-out test data for measuring the model’s accuracy. The classes and sample images of the dataset are shown in Figure 1.
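The split above can be reproduced with scikit-learn; in the sketch below the labels are randomly generated stand-ins, and the stratified split and random seed are assumptions rather than details taken from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels for the 5,899 labeled images (3 burn classes).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=5899)
indices = np.arange(5899)

# Holding out exactly 1,180 images reproduces the 4,719 / 1,180 partition.
train_idx, test_idx = train_test_split(
    indices, test_size=1180, stratify=labels, random_state=42
)
print(len(train_idx), len(test_idx))  # 4719 1180
```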
Burn severities in the dataset are labeled as first-degree burns, second-degree burns, and third-degree burns. First-degree burns are ranked as the mildest, and third-degree burns are ranked as the most severe. Examples of each degree in the dataset are shown in Figure 2.


Figure 2 orders the examples from the mildest (first-degree) to the most severe (third-degree) burns.
Transformer-based deep learning models, which first achieved significant success in Natural Language Processing (NLP), have demonstrated remarkable performance owing to their feature extraction capabilities [15]. Key advantages of transformers include their ability to model long-range dependencies between input sequence elements and their support for parallel processing of sequences, which is superior to recurrent networks. These strengths have led to notable advances in several vision tasks [16]. In this study, the Vision Transformer (ViT) models ViT-B/16, ViT-L/16, and ViT-B/32, together with DINOv2 (a self-supervised Vision Transformer), were employed for feature extraction.
ViT-B/16 is a transformer encoder model pre-trained in a supervised manner on the large ImageNet-21k image collection [17]. A standard classifier can be obtained by placing a linear layer on top of the pre-trained encoder; in this study, however, the encoder was used solely as a frozen feature extractor. ViT-B/32 differs from ViT-B/16 in that images are presented to the model as a sequence of linearly embedded, fixed-size patches of 32 × 32 pixels [18]; this model was utilized to improve efficiency in the feature extraction stage. ViT-L/16 belongs to the same family but has a larger capacity than the base models [19]. Finally, DINOv2 was also used for feature extraction; its features have been shown to be superior at both the image and pixel levels compared to various computer vision models [20], and it has been applied in particular to medical imaging studies [21]. These properties make it a valuable feature extractor for this study.
The features obtained from each ViT were evaluated with six different classifiers. Three of these—Extreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and the Light Gradient Boosting Machine (LGBM)—are gradient boosting methods. Gradient boosting machines sequentially fit new models so as to predict the response variable more accurately; the basic idea is to construct each new learner to be maximally correlated with the negative gradient of the loss function [22]. However, the high computational cost of traditional Gradient Boosting Decision Tree (GBDT) algorithms has limited their use on large datasets, so LGBM, CatBoost, and XGBoost, which offer the advantages of gradient boosting in a faster and more scalable form, were used [23], [24], [25]. In addition to the gradient boosting-based methods, k-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF) classifiers were employed. Despite its simple structure, KNN can produce effective results in classification problems [26]. SVM is a powerful classifier that aims to maximize accuracy while reducing the risk of overfitting [27]. RF builds multiple randomized decision trees and combines their outputs to obtain more balanced predictions [28].
In the proposed hybrid method for skin burn detection, the ViT-B/16, ViT-L/16, ViT-B/32, and DINOv2 models were used for feature extraction. The resulting features were concatenated and classified using XGBoost, LGBM, CatBoost, KNN, SVM, and RF. In the feature selection phase, the Chi$^2$ method was used to identify the 500 most significant features; the Chi$^2$ algorithm ranks features by the statistical significance of their relationship to the class labels [29]. This step reduces training time and can improve classification performance. The feature vectors, initially of size 5899 × 3328, were reduced to 5899 × 500 using the Chi$^2$ algorithm. Figure 3 illustrates the general architecture of the proposed hybrid model.
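The fusion and Chi$^2$ selection steps can be sketched as follows; the per-model feature dimensions (768, 1024, 768, 768) are assumptions chosen only to match the fused size of 3,328 stated above, and the min-max scaling step (Chi$^2$ requires non-negative inputs) is a common choice not specified in the text:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 5899  # labeled images in the dataset

# Random stand-ins for the four ViT feature sets.
f_b16 = rng.normal(size=(n, 768))
f_l16 = rng.normal(size=(n, 1024))
f_b32 = rng.normal(size=(n, 768))
f_dino = rng.normal(size=(n, 768))
y = rng.integers(0, 3, size=n)

# Fuse by concatenation: 768 + 1024 + 768 + 768 = 3328 features.
fused = np.concatenate([f_b16, f_l16, f_b32, f_dino], axis=1)

# Scale to [0, 1] so the Chi^2 statistic is well-defined, then keep the
# 500 features most associated with the class labels.
scaled = MinMaxScaler().fit_transform(fused)
reduced = SelectKBest(chi2, k=500).fit_transform(scaled, y)
print(fused.shape, reduced.shape)  # (5899, 3328) (5899, 500)
```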

The last step of the proposed model was to evaluate the performance with different metrics after skin burns were detected.
3. Results
In this section, the performance of the proposed hybrid model and the other models used for comparison is examined. First, the confusion matrices of the models are presented. The performance of the classification model is measured using the values found in the confusion matrix. Accuracy, recall, precision, and F1-score metrics are used to evaluate the performance of the models [30].
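These metrics can be computed directly from a confusion matrix; the minimal sketch below uses an illustrative 3 × 3 matrix (its values are not the paper's actual results) whose 1,180 samples happen to match the test-set size:

```python
import numpy as np

# Illustrative 3x3 confusion matrix (rows = true class, cols = predicted).
cm = np.array([[350,  40,  10],
               [ 30, 310,  60],
               [ 15,  90, 275]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # TP / (TP + FP), per class
recall = np.diag(cm) / cm.sum(axis=1)      # TP / (TP + FN), i.e. sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.4f}")
for k in range(3):
    print(f"class {k + 1}: precision={precision[k]:.3f} "
          f"recall={recall[k]:.3f} f1={f1[k]:.3f}")
```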
The features extracted by the ViT-B/16 model used in this study were fed to six different classifiers. The resulting confusion matrices for each of these classifiers are presented in Figure 4.






Figure 4 shows that SVM produces the strongest confusion matrix; however, it is particularly noticeable that the model struggles to classify third-degree burns. Similarly, the results reveal that third-degree burns are generally the most challenging class across all classifiers, because some second- and third-degree burns are very similar in appearance. The distribution of success scores across classifiers is shown in Figure 5.

The evaluation of the classifiers was conducted not only in terms of accuracy but also in terms of time and performance metrics, as shown in Table 1.
Figure 5 and Table 1 show that the highest accuracy was achieved with the SVM classifier using features extracted from the ViT-B/16 model. Notably, SVM combined this accuracy with a short training time (4.41 s), although its prediction speed was the lowest among the classifiers.
| Model | Accuracy (%) | Prediction Speed (images/s) | Training Time (s) |
|---|---|---|---|
| SVM | 79.58 | 616.90 | 4.41 |
| CatBoost | 78.98 | 32712.55 | 519.33 |
| XGBoost | 77.37 | 79480.95 | 72.38 |
| LGBM | 77.37 | 33229.57 | 40.49 |
| RF | 71.86 | 35962.58 | 14.87 |
| KNN | 71.69 | 3858.05 | 0.02 |
Feature extraction was also performed using the ViT-L/16 model for burn detection. This model provides a more comprehensive feature representation than ViT-B/16 and, thanks to its larger capacity, can capture more detailed information. The extracted features were fed to the different classifiers, and the results are presented as confusion matrices for each classifier in Figure 6.






As shown in the classification results using features obtained from the ViT-L/16 model, the highest accuracy value is achieved with the XGBoost algorithm. However, this success is quite close to that of the SVM classifier. The comparison of performance metrics of classifiers using ViT-L/16 features is presented in Figure 7.
The impact of the models on system performance and the success metrics obtained accordingly are shown in Table 2.
When training time is compared, SVM (8.46 s) is considerably more efficient than XGBoost (87.24 s), although XGBoost offers a far higher prediction speed. As with the ViT-B/16 model, accurately classifying third-degree burns remains the most challenging task for all classifiers.

| Model | Accuracy (%) | Prediction Speed (images/s) | Training Time (s) |
|---|---|---|---|
| XGBoost | 78.98 | 65080.13 | 87.24 |
| SVM | 78.73 | 456.84 | 8.46 |
| CatBoost | 77.97 | 70115.02 | 639.38 |
| LGBM | 77.03 | 29944.45 | 55.83 |
| KNN | 74.66 | 3376.54 | 0.03 |
| RF | 74.07 | 35745.70 | 46.76 |
The ViT-B/32 model used for burn detection works with a 32 × 32 patch size, allowing investigation of how patch-size differences in feature extraction affect model performance. The resulting features were fed to the six classifiers, and their confusion matrices are presented in Figure 8.






As shown in the results obtained with the ViT-B/32 model, the SVM classifier achieved the highest accuracy. However, as with previous models (ViT-B/16 and ViT-L/16), correctly classifying third-degree burns proved to be the most challenging task for all classifiers. The comparison of performance metrics of classifiers using ViT-B/32 features is presented in Figure 9.
The system performance of the models was monitored and the resulting values are shown in Table 3.

| Model | Accuracy (%) | Prediction Speed (images/s) | Training Time (s) |
|---|---|---|---|
| SVM | 78.64 | 473.44 | 5.09 |
| CatBoost | 77.97 | 91468.68 | 487.57 |
| LGBM | 77.71 | 28088.98 | 37.85 |
| XGBoost | 77.46 | 79301.00 | 65.22 |
| KNN | 74.75 | 3786.16 | 0.02 |
| RF | 73.14 | 35175.86 | 14.26 |
When Table 3 is examined in terms of computational performance, the KNN classifier achieves the fastest training time and the CatBoost classifier the highest prediction speed, whereas SVM, despite achieving the highest accuracy, has the lowest prediction speed.
In addition to ViT-based models, the DINOv2 model was also included in this study to facilitate a comparative evaluation of model performance. The features obtained with DINOv2 were transferred to different classifiers, and the confusion matrices of these classifiers are presented in Figure 10.






The highest accuracy in classifications using features extracted from the DINOv2 network was achieved with the SVM algorithm. However, as with the other feature extractors, all classifiers struggle with third-degree burns. The metrics calculated from these results are presented in Figure 11 for a comparative analysis of the classifiers.

The performance of the models trained with features extracted from the DINOv2 network within the dataset used is given in Table 4.
| Model | Accuracy (%) | Prediction Speed (images/s) | Training Time (s) |
|---|---|---|---|
| SVM | 81.27 | 547.42 | 5.06 |
| XGBoost | 81.10 | 79364.97 | 65.78 |
| LGBM | 80.25 | 30819.92 | 37.49 |
| CatBoost | 79.41 | 41994.64 | 485.02 |
| KNN | 78.22 | 4167.18 | 0.02 |
| RF | 76.78 | 34803.83 | 14.10 |
KNN stands out as the fastest algorithm in training time and also exceeds SVM in prediction speed; nevertheless, when accuracy and speed metrics are evaluated together, SVM emerges as the most balanced and successful classifier for this model.
In the proposed hybrid model for determining the degree of skin burns, features obtained from four different ViT-based architectures were combined and evaluated using the Chi$^2$ method, selecting the 500 most significant features. These features from the burn dataset were fed to six different classifiers—XGBoost, LGBM, CatBoost, KNN, SVM, and RF—and trained. The resulting confusion matrices from the training process are presented in Figure 12.






An examination of the confusion matrices obtained from the proposed hybrid model reveals that most classifiers predict first- and second-degree burns with high accuracy. The SVM and LGBM models, in particular, demonstrated quite successful performance in distinguishing these two classes. However, all classifiers shared a common difficulty in classifying third-degree burns. This is also clinically significant: third-degree burns cause severe disruption of tissue integrity, and their visual similarity to severe second-degree burns makes them harder for algorithms to distinguish.
Furthermore, while third-degree burns were more frequently confused with second-degree burns in the CatBoost and RF models, the SVM classifier performed relatively more evenly in this class. Despite the KNN model’s advantage in lower training time, the misclassification rate for third-degree burns was exceptionally high. The success metrics calculated from these matrices for each classifier are detailed in Figure 13.

| Model | Accuracy (%) | Prediction Speed (images/s) | Training Time (s) |
|---|---|---|---|
| SVM | 82.29 | 835.61 | 3.42 |
| LGBM | 80.51 | 12813.90 | 56.91 |
| XGBoost | 80.17 | 41513.48 | 119.67 |
| CatBoost | 78.56 | 151636.96 | 92.85 |
| KNN | 77.88 | 5681.70 | 0.03 |
| RF | 76.10 | 7501.33 | 42.25 |
The effect of the hybrid model in terms of accuracy and system performance is shown in Table 5.
The results in Figure 12 and Table 5 show that the highest accuracy rate among the classifiers was achieved with the SVM classifier at 82.29%. LGBM and XGBoost achieved accuracies of 80.51% and 80.17%, values close to that of SVM. CatBoost and KNN achieved accuracies of 78.56% and 77.88%. The lowest performance was observed with the RF classifier with 76.10% accuracy. In terms of computational performance, the shortest training time was achieved with the KNN classifier with 0.03 seconds. Despite its slower prediction speed, the highest accuracy value was achieved with the SVM classifier. As shown in the performance metrics of the proposed model, all classifiers can distinguish first- and second-degree burns with high accuracy. However, the classifiers achieved lower performance in classifying third-degree burns.
The proposed model had particular difficulty distinguishing between second- and third-degree burns. Figure 14 shows some images of second- and third-degree burns.
Some images in the dataset shown in Figure 14 were taken at very close range, or were small regions cropped from larger images to focus specifically on the burn area. Combined with blurriness and low resolution, this makes some second-degree and third-degree burns appear very similar. This is considered the primary reason why the model developed in this study confuses second-degree and third-degree burns.

4. Conclusion
A burn is tissue damage caused by exposure of the skin or underlying tissues to external factors such as heat, chemicals, electricity, sunlight, or radiation, and burns are classified according to the depth and extent of the damage. The dataset used in this study categorizes burns into three classes. Feature extraction was performed using different transformer-based architectures, and the extracted feature maps were concatenated to bring together complementary features of the same image. The model was then further optimized by feature selection, which reduces computation time. The proposed model achieved an accuracy of 82.29%. The study nevertheless has limitations, one of which is the use of a single-center public dataset. In future work, a real-time system using more data from different centers will be developed.
Conceptualization, A.Y.D. and M.K.; methodology, A.Y.D. and M.K.; software, A.Y.D. and M.K.; validation, A.Y.D., M.K., and M.Y.; formal analysis, A.Y.D.; investigation, A.Y.D. and M.K.; resources, A.Y.D. and M.K.; data curation, A.Y.D. and M.K.; writing—original draft preparation, A.Y.D., M.K., and M.Y.; writing—review and editing, A.Y.D., M.K., and M.Y.; visualization, A.Y.D., M.K., and M.Y.; supervision, M.K. and M.Y.; project administration, M.K. and M.Y. All authors have read and agreed to the published version of the manuscript.
The data used to support the research findings are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
