Application of Deep Learning Techniques in the Diagnosis and Grading of Knee Osteoarthritis (OA)
Abstract:
Osteoarthritis (OA) affects approximately 240 million individuals globally. Knee osteoarthritis, a debilitating condition marked by joint stiffness, pain, and functional impairment, is the most widespread form of arthritis among the elderly. Its severity has conventionally been assessed from physical symptoms, medical history, and further joint screening examinations including radiography, Magnetic Resonance Imaging (MRI), and Computed Tomography (CT) scans. Because such conventional diagnostic methods can be subjective, early disease is difficult to identify; clinicians therefore rely on the Kellgren and Lawrence (KL) scale to evaluate the severity of knee OA from X-ray or MRI images. Automatic detection and grading of knee OA severity thus calls for a model built on deep learning. Using the KL grading scale, three networks, Xception, ResNet-50, and Inception-ResNet-v2, were trained to determine the degree of knee OA suffered by patients. The experimental results revealed that the Xception network achieved the highest classification accuracy of 67%, surpassing ResNet-50 and Inception-ResNet-v2 and demonstrating its superior ability to automatically grade OA severity from radiographic images.
1. Introduction
The pathophysiology of osteoarthritis (OA), which seriously diminishes patients’ quality of life, ability to work, and finances, has long required investigation. OA goes beyond simple morphological and physiological alterations: micro- and macro-injuries initiate degradation of the extracellular cartilage matrix, i.e., joint degeneration marked by abnormalities in the synovial membrane, gradual loss of cartilage between joints, bone enlargement, and loss of joint function. OA is closely interconnected with aging, and other risk factors include gender, obesity, inactivity, hereditary predisposition, bone density, and trauma.
In contrast to other types of inflammatory arthritis, in which activity and exercise relieve symptoms, stress and excessive activity worsen the pain and stiffness in OA-affected joints. Other possible effects include instability, joint deformation, and loss of joint function [1], [2]. The space between the knee joint surfaces flattens as cartilage wears away, accelerating the development of knee OA [3], [4].
Major alterations denoted by the mnemonic LOSS below show how knee OA develops:
- L – “loss of joint space”, caused by the degradation of cartilage;
- O – “osteophyte formation”, bony protrusions that develop along the joint’s margins;
- S – “subarticular sclerosis”, a rise in bone mass along the joint line; and
- S – “subchondral cysts”, fluid-filled cavities that form in the bone near the joints.
Diagnostic procedures such as X-rays, Computed Tomography (CT) scans, and Magnetic Resonance Imaging (MRI) are commonly employed to determine the biological condition of the knee and to detect structural abnormalities in the joint. At present, conventional treatment of knee OA cannot fully resolve the condition.
Detecting joint deformation at an early stage is vital to prevent permanent damage. The Kellgren and Lawrence (KL) grading system, recognized by the World Health Organization (WHO), is typically used to evaluate the severity of knee osteoarthritis [5], [6]. The KL system is a 5-point semiquantitative, progressive ordinal scale ranging from Grade 0 (lowest severity) to Grade 4 (highest severity). Figure 1 depicts the progression of knee OA and the associated KL grade for each stage.

Widely used computer systems and their concomitant benefits are products of technological advancement, and demand for computing continues to rise. Despite the advancement of digital computers, machine simulation of human functions remains challenging to study. Given the prevalence of OA, new techniques and tools must be developed to assess its progression, grading, and detection [7], [8]. Advances in algorithms, machine learning, and deep learning enable healthcare experts to better assess the state and course of osteoarthritis through automatic or semi-automatic medical image analysis.
To improve diagnosis, a machine learning-based computer-assisted technique is required to meet the diagnostic challenge of automatic X-ray analysis [9], [10]. Recent machine learning-based research introduced two stages for automatic OA diagnosis: (1) Region of Interest (ROI) segmentation, which reduces noise by eliminating background and unimportant information, and (2) machine learning-based categorization of OA severity, which standardizes and streamlines difficult diagnostic criteria. These two stages are shown in Figure 2. However, feature selection in earlier knee recognition techniques requires substantial manual labor [11], [12]. Furthermore, symptoms of osteoarthritis may appear in several bone areas, and machine learning algorithms for diagnosing OA seldom examine the connections between these locations, which limits evaluation accuracy. In this work, the workflow depicted in Figure 2 was replicated on a sizable dataset and an automated deep learning method was adopted to improve OA diagnosis [13], [14], [15].
The following are some of the main contributions of this study:
1. To reduce the need for manual feature engineering, an object detection Convolutional Neural Network (CNN) is improved in order to separate the knee regions from X-ray images;
2. To enhance classification performance, the self-attention mechanism of vision transformers is utilized;
3. To categorize the severity of OA in a sizable dataset with the Kellgren and Lawrence (KL) grading system; and
4. To evaluate the suggested approach; the findings show higher accuracy in osteoarthritis severity classification and improved efficiency of knee segmentation.

Dataset Used
The dataset employed in this study was obtained from the Osteoarthritis Initiative (OAI), a large-scale longitudinal project funded by the National Institutes of Health (NIH). It comprises data from 4,476 participants, providing a diverse cohort for knee osteoarthritis (OA) research. In this study, 4,446 radiographs were utilized, each annotated with Kellgren–Lawrence (KL) grades for both knee joints, resulting in a total of 8,260 knee joint images. The dataset distribution was as follows: Grade 0 – 3,253; Grade 1 – 1,495; Grade 2 – 2,175; Grade 3 – 1,086; and Grade 4 – 251. This distribution closely reflects the overall characteristics of the cohort, so every severity level is represented for model training and evaluation.
This collection of data included both Kellgren and Lawrence (KL) grading and knee joint detection annotations. Furthermore, the images are provided in two versions, resized to 224 × 224 and 299 × 299 pixels, respectively, matching the input sizes of the networks used here. In this work, machine learning technology was used to predict the KL grades of the knees collected in the dataset, based on the following criteria:
Grade 0: No radiographic signs of osteoarthritis.
Grade 1: Possible osteophytic lipping with uncertain joint space narrowing (JSN).
Grade 2: Definite osteophytes with potential JSN.
Grade 3: Multiple osteophytes, evident JSN, sclerosis, and possible bone deformation.
Grade 4: Severe sclerosis, prominent JSN, large osteophytes, and significant bone deformity [16], [17].
The dataset was divided into training, testing, and validation subsets in a 70:20:10 ratio to ensure balanced evaluation across severity levels. Table 1 shows the distribution of samples across these subsets.
Table 1. Distribution of samples across the training, testing, and validation subsets.

| Dataset | Grade 0 | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Total |
|---|---|---|---|---|---|---|
| Training | 2,286 | 1,046 | 1,516 | 757 | 173 | 5,778 |
| Testing | 639 | 296 | 447 | 223 | 51 | 1,656 |
| Validation | 328 | 153 | 212 | 106 | 27 | 826 |
| Total | 3,253 | 1,495 | 2,175 | 1,086 | 251 | 8,260 |
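The 70:20:10 split in Table 1 can be reproduced with a stratified split so that each subset preserves the KL-grade proportions. The sketch below assumes `image_paths` and `kl_grades` (integer labels 0–4) have already been loaded from the OAI radiographs; those variable names are illustrative, not part of the OAI release.

```python
# Sketch of a stratified 70/20/10 split; `image_paths` and `kl_grades`
# are assumed to be already-loaded arrays (loading is not shown here).
from sklearn.model_selection import train_test_split

# First split off 70% for training, stratified so every KL grade
# keeps its relative frequency in each subset.
train_paths, rest_paths, train_labels, rest_labels = train_test_split(
    image_paths, kl_grades, train_size=0.70,
    stratify=kl_grades, random_state=42)

# Split the remaining 30% into testing (20%) and validation (10%),
# i.e. a 2:1 split of the remainder.
test_paths, val_paths, test_labels, val_labels = train_test_split(
    rest_paths, rest_labels, train_size=2/3,
    stratify=rest_labels, random_state=42)
```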
2. Methodology
Xception: Deep Learning with Depthwise Separable Convolutions
The Xception architecture performs somewhat better than Inception V3 on the ImageNet dataset, for which Inception V3 was designed, and much better on a larger image classification dataset with 17,000 classes and 350 million pictures [18], [19].
The original depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution:
- Depthwise convolution carries out a channel-wise n × n spatial convolution. For example, in Figure 3, five distinct n × n spatial convolutions are applied if there are five channels.
- Pointwise convolution, which is simply a 1 × 1 convolution, adjusts the dimensionality [20], [21].
Unlike classical convolution, depthwise separable convolution does not need to convolve over all channels at once. Since there are fewer connections, the model is lighter and more efficient.
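The parameter saving is easy to verify in Keras. The following minimal sketch compares a standard 3 × 3 convolution with its depthwise separable counterpart on an Xception-sized input; the layer widths are illustrative.

```python
# Minimal Keras comparison of a standard convolution and a depthwise
# separable convolution on the same input; only the layer choice differs.
import tensorflow as tf

inputs = tf.keras.Input(shape=(299, 299, 3))  # Xception's native input size

standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(64, 3, padding="same")(inputs)

# Parameter counts: standard = 3*3*3*64 + 64 biases = 1,792 weights; the
# separable version needs 3*3*3 (depthwise) + 3*1*1*64 (pointwise)
# + 64 biases = 283 weights -- fewer connections, a lighter model.
print(tf.keras.Model(inputs, standard).count_params())   # 1792
print(tf.keras.Model(inputs, separable).count_params())  # 283
```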


ResNet
On the ImageNet dataset, residual networks up to 152 layers deep were evaluated; they are eight times deeper than VGG nets while being less complex. An ensemble of these residual networks achieved an error of 3.57% on the ImageNet test set, a result that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. Analyses were also conducted on the Canadian Institute for Advanced Research (CIFAR)-10 dataset with networks of 100 and 1,000 layers [22].
Depth of representation is important in many visual recognition tasks. Owing to these extraordinarily deep representations, a 28% relative gain was obtained on the Common Objects in Context (COCO) object detection dataset [23]. Deep residual nets formed the basis of the entries to the ILSVRC and COCO 2015 contests, which won first place in the ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation tasks [24].
Many computer vision tasks rely on the generalized neural network called ResNet, short for Residual Network, shown in Figure 4 [25]. ResNet-50 is a 50-layer convolutional neural network. ResNet’s primary innovation is that it makes it possible to train extremely deep neural networks with over 150 layers. It was introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 computer vision paper, “Deep residual learning for image recognition”. One of the major shortcomings of deep convolutional neural networks is the vanishing gradient problem [26], [27]: during backpropagation, the gradient value drops drastically, so weights barely change. ResNet circumvents this by means of skip connections, as sketched below.
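As a minimal illustration of the skip connection, the following Keras sketch builds a basic identity residual block. It assumes the input already has `filters` channels; ResNet-50 itself uses bottleneck blocks with 1 × 1 projection convolutions, which are omitted here for brevity.

```python
# A minimal identity residual block in Keras, illustrating the skip
# connection that lets gradients bypass the convolutional path.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x  # the skip connection carries the input forward unchanged
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Adding the shortcut means the block only has to learn a residual
    # F(x); if the optimal mapping is the identity, F(x) can shrink to zero.
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)
```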

Inception-v4
The greatest recent advances in image recognition are largely due to extremely deep convolutional networks. This is the case for the Inception design, which has been shown to provide outstanding performance at a comparatively low computational cost. In the 2015 ILSVRC competition, newly added residual connections combined with a more conventional architecture yielded state-of-the-art performance comparable to the latest Inception-v3 network, shown in Figure 5. It was initially unclear whether combining the Inception design with residual connections would be advantageous, but strong empirical evidence showed that training Inception networks with residual connections speeds up the process considerably [28], [29].
Some research suggested that Inception networks with residual connections outperformed comparably sized Inception networks without them by a small margin. Furthermore, a range of new, more compact Inception network designs was provided for residual and non-residual networks, and these changes greatly improved single-frame recognition performance on the ILSVRC 2012 classification task. Appropriate activation scaling facilitates robust training of very large residual Inception networks, as sketched below. An ensemble of three residual networks and one Inception-v4 model achieved a top-5 error of 3.08% on the ImageNet classification (CLS) test set [30].
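The residual activation scaling works by damping the output of each Inception-style branch before it is added back to the trunk, with scaling factors reported around 0.1 to 0.3. The sketch below is a simplified Keras rendering in which a single convolution stands in for the full Inception branch; it is an illustration of the technique, not the Inception-ResNet-v2 block itself.

```python
# Sketch of residual activation scaling: the branch output is scaled
# down before the residual addition, which stabilizes training of very
# wide residual Inception networks.
import tensorflow as tf
from tensorflow.keras import layers

def scaled_residual(x, scale=0.1):
    # Stand-in for a full Inception branch (reduced to one convolution).
    branch = layers.Conv2D(x.shape[-1], 3, padding="same",
                           activation="relu")(x)
    # Scale the residual (reported factors are roughly 0.1-0.3).
    scaled = layers.Lambda(lambda t: t * scale)(branch)
    return layers.Add()([x, scaled])
```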

All models were implemented using TensorFlow and Keras frameworks. The dataset was divided into 70% training, 20% testing, and 10% validation subsets. To ensure reproducibility and fair comparison, identical hyperparameters were used across all models. The training configuration is summarized in Table 2.
Table 2. Training configuration used for all models.

| Parameter | Configuration |
|---|---|
| Loss Function | Categorical Cross-Entropy |
| Optimizer | Adam (adaptive learning rate optimization) |
| Initial Learning Rate | 1e-4, with scheduled decay |
| Batch Size | 32 |
| Epochs | 50 (with early stopping on validation loss) |
| Metrics Monitored | Accuracy, Balanced Accuracy, Validation Loss |
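For reference, the configuration in Table 2 maps onto Keras roughly as follows. Here `model`, `train_ds`, and `val_ds` are placeholders for one of the three architectures (with a 5-way softmax head) and for data pipelines yielding batches of 32 (image, one-hot grade) pairs; the patience values and decay factor are illustrative, as Table 2 does not specify them.

```python
# Sketch of the Table 2 training setup, under the assumptions above.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

callbacks = [
    # Early stopping on validation loss, as in Table 2.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Scheduled decay of the initial 1e-4 learning rate.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=3),
]

history = model.fit(train_ds, validation_data=val_ds,
                    epochs=50, callbacks=callbacks)
```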
3. Results and Discussion
To predict the onset and development of knee OA, three distinct deep learning models were developed: Xception, ResNet-50, and Inception-ResNet-v2 [31], [32]. Their performance is compared in Table 3. Prior work found that a Multilayer Perceptron (MLP) model outperforms logistic regression models in predicting the start and course of the disease [33], an outcome likely explained by the complex nonlinearity of the data structure: a linear classification method struggles more than its nonlinear counterpart to manage the complexity of the data point distribution in such datasets. Consistent with this, the Xception model provided greater accuracy than the other two models.
Table 3. Performance comparison of the three deep learning models.

| Model | Accuracy | Execution Time |
|---|---|---|
| Xception | 67% | 68 mins |
| ResNet-50 | 65% | 80 mins |
| Inception-ResNet-v2 | 64% | 56 mins |

The comparative analysis showed that the Xception model achieved the highest accuracy (67%), outperforming ResNet-50 (65%) and Inception-ResNet-v2 (64%). Its superior performance can be attributed to its use of depthwise separable convolutions, which efficiently capture localized spatial features while reducing parameter redundancy and overfitting. Conversely, the deeper architecture of ResNet-50 may have increased computational demand and required more training data to reach full generalization. Overall, these findings suggest that Xception provides an optimal balance between model complexity, training efficiency, and predictive accuracy, making it more suitable for real-time diagnostic applications in clinical environments.
A critical limitation of deep learning models in the medical domain is their perceived “black box” nature. To mitigate this, the Gradient-weighted Class Activation Mapping (Grad-CAM) technique was used to make the predictions of the Xception model interpretable. Grad-CAM heatmaps, as shown in Figure 6, were generated for representative images of each KL grade. The goal was to determine whether the model was attending to clinically relevant regions, particularly the joint space, bone surface, and cartilage structure.
Findings from the Grad-CAM visualization include:
- For the Healthy, Doubtful, and Minimal grades, the activation regions concentrated around the joint space, in alignment with early signs of cartilage thinning and joint narrowing.
- For the Moderate and Severe grades, the model’s activations extended toward the joint margins and bone surfaces, consistent with osteophyte formation and sclerosis.
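As a rough illustration of how such heatmaps can be produced, the following is a minimal Grad-CAM sketch for a Keras classifier. The argument `last_conv_name` must name the final convolutional layer of the fine-tuned Xception model (for the stock Keras Xception this is "block14_sepconv2_act"); both the function and its parameters are illustrative, not the exact implementation used in this study.

```python
# Minimal Grad-CAM sketch: gradients of the class score with respect to
# the last convolutional feature map give per-channel importance weights.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index=None):
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis])
        if class_index is None:
            class_index = tf.argmax(preds[0])  # predicted KL grade
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    # Channel weights: global-average-pooled gradients.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]  # keep only features that raise the score
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized map
```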
4. Conclusion
This study proposed a deep learning–based framework for the automated diagnosis and grading of knee osteoarthritis (OA) from X-ray images. Three convolutional neural network architectures, i.e., Xception, ResNet-50, and Inception-ResNet-v2, were trained and evaluated using the OAI dataset. Among these, the Xception network achieved the best performance with 67% accuracy, effectively capturing key radiographic patterns linked to OA severity. These results highlight the potential of deep learning methods to enhance radiological assessment by offering faster, more objective, and reproducible OA grading. Future research could integrate multimodal clinical data and adopt interpretable AI approaches to improve transparency and facilitate clinical implementation.
Author Contributions
Ranganadha Reddy conceptualized the study, designed the methodology, and supervised the experimental work. Y. Varshitha handled data preprocessing, model training, and performance evaluation. B. Parvathi Devi contributed to result interpretation, visualization, and manuscript preparation.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
