Carbone, M., Adusumilli, P. S., Alexander, H. R., Baas, P., Bardelli, F., Bononi, A., Bueno, R., Felley-Bosco, E., Galateau-Salle, F., & Jablons, D. et al. (2019). Mesothelioma: Scientific clues for prevention, diagnosis, and therapy. CA Cancer J. Clin., 69(5), 402–429.
Carbone, M., Pass, H. I., Ak, G., Alexander Jr, H. R., Baas, P., Baumann, F., Blakely, A. M., Bueno, R., Bzura, A., & Cardillo, G. et al. (2022). Medical and surgical care of patients with mesothelioma and their relatives carrying germline BAP1 mutations. J. Thorac. Oncol., 17(7), 873–889.
Chicco, D. & Rovelli, C. (2019). Computational prediction of diagnosis and feature selection on mesothelioma patient health records. PLoS One, 14(1), e0208737.
Collins, G. S., Moons, K. G. M., Dhiman, P., Riley, R. D., Beam, A. L., Van Calster, B., Ghassemi, M., Liu, X., Reitsma, J. B., & van Smeden, M. et al. (2024). TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ, 385, e078378.
Dhiman, P., Ma, J., Andaur Navarro, C. L., Speich, B., Bullock, G., Damen, J. A. A., Hooft, L., Kirtley, S., Riley, R. D., & Van Calster, B. et al. (2022). Methodological conduct of prognostic prediction models developed using machine learning in oncology: A systematic review. BMC Med. Res. Methodol., 22(1), 101.
Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? Adv. Neural Inf. Process. Syst., 35, 507–520.
Jin, W., Li, X., Fatehi, M., & Hamarneh, G. (2023). Guidelines and evaluation of clinical explainable AI in medical image analysis. Med. Image Anal., 84, 102684.
Kapoor, S. & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst., 30.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S. I. (2020). From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell., 2(1), 56–67.
Lundberg, S. M. & Lee, S. I. (2017). A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst., 30.
Naso, J. R., Cheung, S., Ionescu, D. N., & Churg, A. (2021). Utility of SOX6 and DAB2 for the diagnosis of malignant mesothelioma. Am. J. Surg. Pathol., 45(9), 1245–1251.
Peters, S., Scherpereel, A., Cornelissen, R., Oulkhouir, Y., Greillier, L., Kaplan, M. A., Talbot, T., Monnet, I., Hiret, S., & Baas, P. et al. (2022). First-line nivolumab plus ipilimumab versus chemotherapy in patients with unresectable malignant pleural mesothelioma: 3-year outcomes from CheckMate 743. Ann. Oncol., 33(5), 488–499.
Popat, S., Baas, P., Faivre-Finn, C., Girard, N., Nicholson, A. G., Nowak, A. K., Opitz, I., Scherpereel, A., Reck, M., & ESMO Guidelines Committee. (2022). Malignant pleural mesothelioma: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann. Oncol., 33(2), 129–142.
Porcel, J. M. (2018). Biomarkers in the diagnosis of pleural diseases: A 2018 update. Ther. Adv. Respir. Dis., 12, 1753466618808660.
Roberts, M. E., Rahman, N. M., Maskell, N. A., Bibby, A. C., Blyth, K. G., Corcoran, J. P., Edey, A., Evison, M., de Fonseka, D., & Hallifax, R. et al. (2023). British Thoracic Society Guideline for pleural disease. Thorax, 78(Suppl 3), s1–s42.
Scherpereel, A., Opitz, I., Berghmans, T., Psallidas, I., Glatzer, M., Rigau, D., Astoul, P., Bölükbas, S., Boyd, J., & Coolen, J. et al. (2020). ERS/ESTS/EACTS/ESTRO guidelines for the management of malignant pleural mesothelioma. Eur. Respir. J., 55(6), 1900953.
Shwartz-Ziv, R. & Armon, A. (2022). Tabular data: Deep learning is not all you need. Inf. Fusion, 81, 84–90.
Tanrikulu, A. & Er, O. (2012). Mesothelioma’s disease data set. UCI Machine Learning Repository.
Woolhouse, I., Bishop, L., Darlison, L., De Fonseka, D., Edey, A., Edwards, J., Faivre-Finn, C., Fennell, D. A., Holmes, S., & Kerr, K. M. et al. (2018). British Thoracic Society Guideline for the investigation and management of malignant pleural mesothelioma. Thorax, 73(Suppl 1), i1–i30.
Open Access
Research article

Leakage-Aware Explainable Machine Learning for Malignant Mesothelioma Classification Using Clinical and Laboratory Features

Mohammed Mansour¹*, Turker Berk Donmez²
¹ Department of Mechatronics Engineering, Sakarya University of Applied Sciences, 54050 Sakarya, Turkey
² Department of Biomedical Engineering, Sakarya University of Applied Sciences, 54050 Sakarya, Turkey
Healthcraft Frontiers | Volume 3, Issue 3, 2025 | Pages 139–157
Received: 07-27-2025, Revised: 09-08-2025, Accepted: 09-22-2025, Available online: 09-30-2025

Abstract:

Malignant mesothelioma remains a diagnostic challenge due to the phenotypic overlap with benign pleural diseases and the reliance on invasive procedures for definitive confirmation. To address these limitations, a leakage-aware, explainable machine learning framework was developed and applied to a publicly available mesothelioma dataset comprising 324 cases (96 mesothelioma, 228 symptomatic non-mesothelioma). Variables prone to target leakage or unavailable at the point of diagnosis—such as diagnosis method, cytology results, mesothelioma subtype, and survival status—were systematically excluded. The remaining features were stratified into initial clinical presentation, post-imaging, and post-pleural-fluid analysis stages prior to model development. The dataset was partitioned into a development cohort (n = 226) and an independent hold-out cohort (n = 98). Multiple classifiers, including logistic regression, support vector machine, k-nearest neighbors, and light gradient boosting machine, were optimized via grid search and evaluated using repeated stratified 5-fold cross-validation. The diagnosis method was identified as a perfect inverse surrogate of the target variable and consequently removed. The light gradient boosting machine exhibited superior performance, achieving the highest average precision (0.543) and Matthews correlation coefficient (0.306) during cross-validation. On the unseen hold-out cohort, light gradient boosting machine yielded an area under the receiver operating characteristic curve of 0.660, average precision of 0.483, balanced accuracy of 0.615, and Matthews correlation coefficient of 0.233. At the conventional 0.50 threshold, sensitivity was 0.448, specificity 0.783, and negative predictive value 0.771; lowering the threshold to 0.30 increased sensitivity to 0.690 at the expense of specificity reduction to 0.493. 
SHapley Additive exPlanations (SHAP) identified age, platelet count, lung side, white blood cell count, and duration of asbestos exposure as the most influential predictors. This leakage-aware, explainable light gradient boosting machine model delivers clinically interpretable diagnostic predictions while mitigating target leakage, demonstrating moderate discrimination and potential utility in real-world clinical settings. These findings warrant further external validation and prospective evaluation to confirm generalizability and clinical impact.

Keywords: Malignant mesothelioma, Explainable artificial intelligence, Light gradient boosting machine, SHapley Additive exPlanations

1. Introduction

Malignant mesothelioma is an aggressive serosal malignancy that is strongly associated with historical and occupational asbestos exposure, typically presents after a long latency of several decades, and carries a poor prognosis with median survival commonly reported in the range of 9 to 17 months from diagnosis (Carbone et al., 2019; Popat et al., 2022). Its rising or stable incidence in many countries with a long history of industrial asbestos use, combined with the continued global use of chrysotile asbestos, means that mesothelioma remains a clinically relevant public health problem and continues to generate pressure on diagnostic and treatment pathways (Carbone et al., 2019; Popat et al., 2022; Scherpereel et al., 2020). Recent updates in first-line systemic therapy, including the long-term CheckMate 743 data supporting dual immune-checkpoint blockade over platinum-pemetrexed chemotherapy in selected subgroups (Peters et al., 2022), together with evolving pathological and genetic classification frameworks (Carbone et al., 2022), have further sharpened the need for accurate and timely identification of the disease at a pre-definitive stage.

From a clinical standpoint, mesothelioma is diagnostically difficult precisely because it shares many of its early presenting features with non-malignant symptomatic pleural disease. Dyspnea, chest pain, and unilateral pleural effusion are common in both malignant and reactive pleural processes, and initial work-up based on chest imaging, pleural fluid biochemistry, and routine hematology rarely provides a definitive answer (Carbone et al., 2019; Naso et al., 2021; Roberts et al., 2023; Woolhouse et al., 2018). Confirmation, therefore, typically depends on pleural biopsy and immunohistochemistry, sometimes repeated, and this diagnostic delay has direct clinical consequences for staging and therapy (Naso et al., 2021; Popat et al., 2022; Scherpereel et al., 2020; Woolhouse et al., 2018). Contemporary guidance from the British Thoracic Society (Roberts et al., 2023) and the clinical practice guidelines of the European Society for Medical Oncology (Popat et al., 2022) explicitly describe the iterative nature of pleural diagnostic work-up, in which clinical, imaging, and laboratory information are combined progressively before definitive confirmation. A reliable pre-definitive triage signal, even a modest one, would therefore be of clinical interest in principle, provided that it is developed under explicit leakage control.

Machine-learning analyses of mesothelioma have largely relied on the publicly available University of California, Irvine (UCI) mesothelioma dataset, which provides a compact multi-modal benchmark of demographic, symptomatic, hematological, biochemical, pleural fluid, and imaging variables in patients with suspected or confirmed mesothelioma (Tanrikulu & Er, 2012). Several published reproductions of this dataset have reported near-perfect or perfect classification performance, but a critical re-analysis by Chicco & Rovelli (2019) showed that much more modest scores are obtained once leakage-prone variables are removed and that post-diagnostic variables such as diagnosis method can effectively encode the outcome label itself. Cytology, histopathology, biopsy strategy, and immunohistochemical interpretation remain central to final diagnosis, which means that machine learning studies must carefully separate variables available before diagnosis from variables that merely restate the diagnosis itself (Chicco & Rovelli, 2019; Naso et al., 2021; Woolhouse et al., 2018). More broadly, recent evidence suggests that target-adjacent leakage is a systemic source of over-optimism across the machine-learning-for-science literature, to the point where hundreds of published studies in biomedical and other scientific domains have been shown to be affected by a small set of recurring leakage patterns (Kapoor & Narayanan, 2023); and a systematic review of prognostic machine learning models in oncology has documented frequent methodological shortcomings in validation design, reporting, and calibration (Dhiman et al., 2022).

Tree-based gradient boosting methods such as the light gradient boosting machine are well suited to heterogeneous tabular clinical data and remain state of the art for many small-to-moderate clinical datasets (Ke et al., 2017). Independent comparative benchmarks have recently reaffirmed that tree-based ensembles typically match or exceed deep-learning architectures on tabular data of the size considered (Grinsztajn et al., 2022; Shwartz-Ziv & Armon, 2022), which supports the decision to use a gradient-boosted tree model as the primary estimator rather than a neural network. SHapley Additive exPlanations (SHAP) provide a principled game-theoretic framework for decomposing individual model predictions into additive feature contributions, unifying global and local explanations under a single attribution model and supporting both beeswarm-style cohort summaries and waterfall-style per-patient explanations (Lundberg & Lee, 2017; Lundberg et al., 2020). However, SHAP is a faithful description of how a model used the provided inputs; it does not transform leakage-prone variables into clinically meaningful biology, and it cannot compensate for a misspecified prediction target (Jin et al., 2023). Trustworthy application of explainable artificial intelligence in clinical prediction, therefore, requires an explicit leakage audit and alignment with contemporary methodological reporting guidance (Collins et al., 2024; Dhiman et al., 2022; Jin et al., 2023).

Accordingly, this study re-analyzed the UCI mesothelioma dataset as a leakage-aware methodological study. The aims were to audit and remove target-duplicating or post-diagnostic variables (Chicco & Rovelli, 2019; Kapoor & Narayanan, 2023), align the task with pre-definitive classification of mesothelioma versus non-mesothelioma symptomatic cases, compare logistic regression, support vector machine, k-nearest neighbors, and the light gradient boosting machine under identical validation procedures (Collins et al., 2024; Dhiman et al., 2022), and evaluate the selected model with hold-out performance, threshold trade-offs, calibration, and SHAP-based global and local explanations (Jin et al., 2023; Lundberg & Lee, 2017; Lundberg et al., 2020). The intention was not to claim clinical deployability, but to show how leakage control changes both performance estimates and model explanations.

2. Methods

A retrospective secondary analysis was conducted using the UCI mesothelioma dataset, which contains 324 anonymized patient records collected in Turkey and labeled as mesothelioma or non-mesothelioma symptomatic cases (Tanrikulu & Er, 2012). The raw dataset comprised 34 candidate predictors and one binary target variable, class of diagnosis. In the raw data, 228 cases belonged to the non-mesothelioma class and 96 to the mesothelioma class.

The intended prediction moment was defined as the period before definitive diagnosis. Because the UCI release does not provide exact timestamps for each predictor, variables were grouped according to a pragmatic pleural-diagnostic pathway: initial visit or routine work-up variables, variables expected after imaging, variables expected after pleural fluid examination, and post-diagnostic or downstream variables. Demographics, exposure history, symptoms, performance status, and routine blood markers were treated as initial-visit or routine work-up variables; lung side, pleural effusion, and pleural thickness on computed tomography as post-imaging variables; and pleural fluid biochemistry as available only after pleural fluid examination. Diagnosis method, cytology, type of mesothelioma, and dead or not were treated as post-diagnostic or downstream variables and excluded. Accordingly, the negative class is described throughout as non-mesothelioma symptomatic cases, not as healthy controls.

Dataset auditing confirmed that the raw dataset contained no missing values (total missing entries = 0), consistent with the UCI record (Tanrikulu & Er, 2012). The variable diagnosis method was found to duplicate the target exactly: all 96 mesothelioma records were coded 0 for diagnosis method and all 228 non-mesothelioma records were coded 1. This variable was therefore excluded as direct target leakage. Cytology was excluded because it reflects a definitive diagnostic work-up rather than a pre-diagnostic triage variable. The type of mesothelioma and whether dead or not were also excluded because they are unavailable at the intended prediction time or represent downstream information. Table 1 summarizes the clinical-stage availability assumptions used to separate included pre-definitive variables from excluded post-diagnostic variables. For linear and distance-based models, categorical variables were one-hot encoded and numeric variables were standardized within the pipeline. For the light gradient boosting machine, raw categorical codes were passed as declared categorical features rather than treated as ordinal quantities. Therefore, variables such as city and lung side were handled as category-based splits rather than numeric thresholds.
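The target-duplication check described above can be illustrated with a short sketch. It is not the authors' audit script: the function name, the `max_levels` cut-off, and the toy records are assumptions introduced here for illustration only. The idea is to flag any low-cardinality feature in which every observed value occurs in exactly one target class.

```python
from collections import defaultdict


def perfect_target_surrogates(rows, target, max_levels=10):
    """Flag features whose every observed value maps to exactly one target class.

    `max_levels` (an assumed cut-off) skips high-cardinality features, because a
    continuous variable with all-unique values would trivially pass the check.
    """
    flagged = []
    for name in (key for key in rows[0] if key != target):
        classes_by_value = defaultdict(set)
        for row in rows:
            classes_by_value[row[name]].add(row[target])
        if not 1 < len(classes_by_value) <= max_levels:
            continue
        one_class_per_value = all(len(c) == 1 for c in classes_by_value.values())
        classes_seen = set().union(*classes_by_value.values())
        if one_class_per_value and len(classes_seen) > 1:
            flagged.append(name)
    return flagged


# Toy records with hypothetical values; 1 = mesothelioma, 0 = non-mesothelioma.
rows = [
    {"diagnosis_method": 0, "age": 54, "mesothelioma": 1},
    {"diagnosis_method": 0, "age": 61, "mesothelioma": 1},
    {"diagnosis_method": 1, "age": 54, "mesothelioma": 0},
    {"diagnosis_method": 1, "age": 47, "mesothelioma": 0},
]
print(perfect_target_surrogates(rows, "mesothelioma"))
```

A flag from such a check is only a prompt for clinical review: a variable can leak the label without duplicating it exactly, which is why cytology and the downstream variables were excluded on availability grounds rather than on a collinearity test.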

Table 1. Clinical-stage availability of candidate variables and leakage-control decisions

| Clinical Pathway Stage | Variables | Use in Model/Leakage-Control Rationale |
| --- | --- | --- |
| Initial visit/routine work-up | Age, gender, city, asbestos exposure, duration of asbestos exposure, habit of smoking, dyspnea, chest pain, weakness, duration of symptoms, performance status, white blood cell count, hemoglobin, platelet count, sedimentation, serum lactate dehydrogenase, alkaline phosphatase, total protein, albumin, glucose, and C-reactive protein | Included in the primary leakage-aware predictor set because these variables are plausibly available before definitive diagnosis. |
| After imaging | Lung side, pleural effusion, and pleural thickness on computed tomography | Included as pre-definitive variables when imaging has been performed; not assumed to be available at the first clinical encounter. |
| After pleural fluid examination | Pleural fluid white blood cell count, pleural lactate dehydrogenase, pleural protein, pleural albumin, pleural glucose, and pleural fluid pH flag | Included as later pre-definitive variables; ablation analyses tested performance without pleural fluid or imaging variables. |
| Post-diagnosis/downstream | Diagnosis method, cytology, type of mesothelioma, and dead or not | Excluded because these variables represent direct target leakage, definitive diagnostic evidence, subtype information, or downstream outcome status. |
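The staging in Table 1 can be expressed as a small lookup that returns the predictors assumed to be available up to a given point in the pathway. This is a sketch only: the snake_case variable names and the function are hypothetical labels chosen here, not the column names of the UCI file or the authors' code.

```python
# Ordered pre-definitive stages; post-diagnostic variables are never returned.
STAGES = ["initial", "imaging", "pleural_fluid"]

VARIABLES_BY_STAGE = {
    "initial": [
        "age", "gender", "city", "asbestos_exposure",
        "duration_of_asbestos_exposure", "habit_of_smoking", "dyspnea",
        "chest_pain", "weakness", "duration_of_symptoms", "performance_status",
        "white_blood_cell_count", "hemoglobin", "platelet_count",
        "sedimentation", "serum_ldh", "alkaline_phosphatase", "total_protein",
        "albumin", "glucose", "c_reactive_protein",
    ],
    "imaging": ["lung_side", "pleural_effusion", "pleural_thickness_ct"],
    "pleural_fluid": [
        "pleural_wbc", "pleural_ldh", "pleural_protein",
        "pleural_albumin", "pleural_glucose", "pleural_ph_flag",
    ],
    "post_diagnosis": [
        "diagnosis_method", "cytology", "type_of_mesothelioma", "dead_or_not",
    ],
}


def predictors_available_at(stage):
    """Return every variable assumed available up to and including `stage`."""
    if stage not in STAGES:
        raise ValueError(f"unknown or excluded stage: {stage!r}")
    cutoff = STAGES.index(stage) + 1
    return [name for s in STAGES[:cutoff] for name in VARIABLES_BY_STAGE[s]]


full_set = predictors_available_at("pleural_fluid")
print(len(full_set))
```

Keeping the excluded stage in the same mapping, but outside `STAGES`, makes the leakage decision explicit: the 4 post-diagnostic variables remain documented while being unreachable as predictors, leaving 30 of the 34 candidates.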

The leakage-aware predictor set was split by stratified random sampling with seed 42 into a development cohort of 226 cases (159 non-mesothelioma symptomatic cases, 67 mesothelioma cases) and an untouched hold-out cohort of 98 cases (69 and 29, respectively). Hyperparameter tuning was performed only on the development cohort by GridSearchCV. Performance estimation on the development cohort used repeated stratified 5-fold cross-validation with three repeats and should therefore be interpreted as a non-nested internal model-development estimate rather than as an unbiased external performance estimate. Logistic regression, support vector machine, k-nearest neighbors, and the light gradient boosting machine were compared. Model selection prioritized average precision because of class imbalance, followed by Matthews correlation coefficient, area under the receiver operating characteristic curve, and balanced accuracy. After selecting the best-performing model, the tuned estimator was refit on the full development cohort and evaluated once on the untouched hold-out cohort, which served as the primary internal test estimate. The complete candidate grids and the final selected parameters for all four model families are reported in supplementary Tables A1 and A4.
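The class counts reported for the two cohorts follow from proportional allocation, which the sketch below reproduces with the standard library. It is an illustration under assumptions: the function name and the largest-remainder rounding are choices made here, and the shuffling will not reproduce the authors' actual partition, whose split call is not shown in the source.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout_split(labels, n_test, seed=42):
    """Split indices into (train, test) while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Largest-remainder allocation of the hold-out slots across classes.
    n_total = len(labels)
    exact = {c: n_test * len(ix) / n_total for c, ix in by_class.items()}
    quota = {c: int(share) for c, share in exact.items()}
    leftover = n_test - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    train, test = [], []
    for c, ix in by_class.items():
        shuffled = ix[:]
        rng.shuffle(shuffled)
        test.extend(shuffled[:quota[c]])
        train.extend(shuffled[quota[c]:])
    return sorted(train), sorted(test)


# 228 non-mesothelioma symptomatic cases (0) and 96 mesothelioma cases (1).
labels = [0] * 228 + [1] * 96
train_idx, test_idx = stratified_holdout_split(labels, n_test=98)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

With 324 records and 98 hold-out slots, the exact shares are about 68.96 and 29.04, so rounding by largest remainder gives 69 and 29 in the hold-out cohort and 159 and 67 in the development cohort, matching the counts stated above.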

Development-set performance was summarized as means with percentile-based empirical 95% intervals across repeated cross-validation folds. Average precision was used as the imbalance-aware precision-recall summary and is reported as the primary precision-recall metric throughout. Hold-out uncertainty was estimated with 2,000 stratified bootstrap resamples that preserved class counts for the area under the receiver operating characteristic curve, average precision, balanced accuracy, precision, sensitivity, specificity, negative predictive value, F1-score, Matthews correlation coefficient, and Brier score. A default probability threshold of 0.50 defined the primary binary predictions and the hold-out local SHAP case selection because this neutral threshold avoided optimizing a decision threshold on the untouched hold-out set. Given the clinical importance of missed mesothelioma cases, a fixed threshold analysis at 0.30, 0.40, and 0.50 was additionally reported for sensitivity, specificity, negative predictive value, and Matthews correlation coefficient. An ablation analysis evaluated the effect of removing pleural fluid and imaging variables and of restricting the model to laboratory-only or symptoms/exposure-only feature sets. Additional sensitivity analyses tested whether platelet-count outlier handling materially altered performance or SHAP ranking.
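A stratified bootstrap of the kind described for the hold-out cohort can be sketched as follows. The helper names and the linear-interpolation percentile are assumptions made for illustration, and the labels are rebuilt from the hold-out confusion counts at the 0.50 threshold rather than from patient-level predictions, so the interval printed is not expected to match the published one exactly.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 1] on pre-sorted data."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def stratified_bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, seed=42):
    """Percentile 95% interval from resamples that preserve class counts."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    stats = []
    for _ in range(n_resamples):
        # Resample with replacement inside each class separately.
        sample = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([y_true[i] for i in sample],
                            [y_pred[i] for i in sample]))
    stats.sort()
    return percentile(stats, 0.025), percentile(stats, 0.975)


def sensitivity(y_true, y_pred):
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / sum(y_true)


# Hold-out counts at threshold 0.50: 13 TP, 16 FN, 15 FP, 54 TN.
y_true = [1] * 29 + [0] * 69
y_pred = [1] * 13 + [0] * 16 + [1] * 15 + [0] * 54
lower, upper = stratified_bootstrap_ci(y_true, y_pred, sensitivity)
print(round(lower, 3), round(upper, 3))
```

Resampling within each class keeps the 29 positives and 69 negatives fixed in every replicate, so metrics with a class-specific denominator such as sensitivity are always defined.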

Primary global SHAP analysis was performed on untouched hold-out predictions generated by the tuned light gradient boosting machine model fit on the development cohort, while feature-rank stability was assessed across repeated cross-validation folds. SHAP values are reported on the model log-odds scale. Dependence plots were generated for the four highest-ranking hold-out SHAP features and colored by observed class. For local explanation, one true positive, one true negative, one false positive, and one false negative case were selected directly from the untouched hold-out cohort. All analyses were conducted in Python 3.12.12 using scikit-learn 1.8.0, LightGBM 4.6.0, SHAP 0.51.0, NumPy 2.4.2, and pandas 3.0.1.
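Because the attributions are on the log-odds scale, they add up to the model's raw score rather than to a probability. The sketch below shows the arithmetic for a single hypothetical patient; the base value and the three contributions are invented numbers, not results from this study.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values for illustration only.
base_value = -0.9  # expected model output (log-odds) over a reference cohort
contributions = {"age": 0.40, "platelet_count": -0.25, "lung_side": 0.15}

# Additivity: base value plus per-feature contributions gives the raw score.
log_odds = base_value + sum(contributions.values())
probability = sigmoid(log_odds)
print(round(log_odds, 2), round(probability, 3))
```

A contribution of a given size therefore shifts the predicted probability by different amounts depending on where the patient already sits on the sigmoid curve, which is one reason the attributions should not be read as probability points.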

Cohort characteristics were summarized overall and by observed class. Continuous variables were reported as medians with interquartile ranges because several laboratory variables (in particular platelet count, C-reactive protein, and pleural fluid biochemistry) were right-skewed, and categorical variables were reported as counts with percentages. No between-group hypothesis testing was performed on baseline characteristics because the study aim was leakage-aware predictive modeling rather than etiologic inference, and p-value reporting across a high-dimensional baseline table would have inflated the risk of spurious associations given the modest sample size. All predictive analyses, uncertainty estimates, and explainability results are reported with 95% intervals rather than with p-values, in line with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) plus artificial intelligence guidance on reporting machine-learning prediction models (Collins et al., 2024). Analyses were reproducible from a single random seed (42), and the analysis scripts used to generate every table and figure in this manuscript are provided alongside the data description.

3. Results

The raw dataset contained no missing values. More importantly, the diagnosis method showed perfect target duplication: diagnosis method = 0 occurred exclusively in mesothelioma records and diagnosis method = 1 occurred exclusively in non-mesothelioma symptomatic case records. This confirmed that the perfect-performance version of the model was driven by leakage rather than by clinically usable discrimination.

The full cohort (n = 324) had a median age of 55 years [interquartile range 47–63], with 58.6% of patients being male, and 86.4% reported asbestos exposure with a median reported exposure duration of 34 years. Presenting symptoms were frequent in both classes: dyspnea was positive in 81.8%, chest pain in 68.2%, and weakness in 61.1%, with a median symptom duration of 5 days (Chicco & Rovelli, 2019; Ke et al., 2017; Lundberg & Lee, 2017; Roberts et al., 2023; Tanrikulu & Er, 2012). Routine hematology showed a median white blood cell count of 8.8 × 10³/µL and a median platelet count of 345 × 10³/µL, and pleural effusion was documented in 87.0% of cases. Comparison by observed class showed that presenting symptoms, dyspnea, chest pain, and pleural effusion were similarly common in mesothelioma and non-mesothelioma symptomatic cases, with only modest distributional shifts in age and some biochemical variables (Table 2). This overlap is clinically expected for a pre-definitive population and motivates the leakage-aware modeling design used in this study: the target is not easily separable from routine pre-biopsy information, and any machine learning model that reports perfect performance on such a cohort should be audited for leakage before it is interpreted. Table 3 shows the development cohort repeated cross-validation performance after leakage-aware feature exclusion (means with percentile-based empirical 95% intervals).

Table 2. Baseline characteristics of the cohort by observed class

| Characteristic | Overall (n = 324) | Mesothelioma (n = 96) | Symptomatic Non-Mesothelioma (n = 228) |
| --- | --- | --- | --- |
| Age (years), median [interquartile range] | 55 [47–63] | 54 [46–59] | 56 [48–63] |
| Male, n (%) | 190 (58.6) | 45 (46.9) | 145 (63.6) |
| Asbestos exposure, n (%) | 280 (86.4) | 87 (90.6) | 193 (84.6) |
| Duration of asbestos exposure (years), median [interquartile range] | 34 [20–43] | 34 [25–44] | 34 [16–42] |
| Duration of symptoms (days), median [interquartile range] | 5 [3–7] | 6 [3–8] | 4 [2–7] |
| Dyspnea positive, n (%) | 265 (81.8) | 77 (80.2) | 188 (82.5) |
| Chest pain positive, n (%) | 221 (68.2) | 62 (64.6) | 159 (69.7) |
| Weakness positive, n (%) | 198 (61.1) | 63 (65.6) | 135 (59.2) |
| White blood cell count (10³/µL), median [interquartile range] | 8.8 [6.8–12.6] | 8.3 [6.7–12.9] | 8.8 [6.8–12.6] |
| Platelet count (10³/µL), median [interquartile range] | 345 [234–456] | 323 [166–456] | 355 [248–456] |
| C-reactive protein (mg/L), median [interquartile range] | 68.0 [42.0–79.0] | 77.0 [45.0–81.0] | 67.0 [39.0–78.2] |
| Pleural effusion, n (%) | 282 (87.0) | 82 (85.4) | 200 (87.7) |

Note: Continuous variables are reported as median [interquartile range] and categorical variables as count (%). No between-group hypothesis testing was performed; the table is descriptive.
Table 3. Development cohort repeated cross-validation performance after leakage-aware feature exclusion

| Model | Area Under the Receiver Operating Characteristic Curve (Empirical 95% Interval) | Average Precision (Empirical 95% Interval) | Balanced Accuracy (Empirical 95% Interval) | Matthews Correlation Coefficient (Empirical 95% Interval) |
| --- | --- | --- | --- | --- |
| Light gradient boosting machine | 0.628 [0.523, 0.765] | 0.543 [0.431, 0.654] | 0.637 [0.555, 0.720] | 0.306 [0.118, 0.472] |
| k-nearest neighbors | 0.619 [0.523, 0.748] | 0.467 [0.328, 0.620] | 0.553 [0.490, 0.622] | 0.154 [-0.021, 0.378] |
| Support vector machine | 0.626 [0.496, 0.740] | 0.466 [0.339, 0.604] | 0.510 [0.490, 0.552] | 0.041 [-0.062, 0.204] |
| Logistic regression | 0.625 [0.494, 0.721] | 0.456 [0.331, 0.624] | 0.590 [0.449, 0.709] | 0.170 [-0.095, 0.392] |
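The selection rule given in the Methods (average precision first, then Matthews correlation coefficient, area under the receiver operating characteristic curve, and balanced accuracy) can be read as a lexicographic sort over the Table 3 means. That reading is an assumption made here: the source does not state how ties or near-ties between criteria were resolved, and the dictionary keys are hypothetical labels.

```python
# Mean cross-validation scores copied from Table 3.
cv_means = {
    "Light gradient boosting machine":
        {"ap": 0.543, "mcc": 0.306, "auc": 0.628, "bal_acc": 0.637},
    "k-nearest neighbors":
        {"ap": 0.467, "mcc": 0.154, "auc": 0.619, "bal_acc": 0.553},
    "Support vector machine":
        {"ap": 0.466, "mcc": 0.041, "auc": 0.626, "bal_acc": 0.510},
    "Logistic regression":
        {"ap": 0.456, "mcc": 0.170, "auc": 0.625, "bal_acc": 0.590},
}

# Assumed interpretation: compare on the first criterion, break ties on the next.
PRIORITY = ("ap", "mcc", "auc", "bal_acc")

ranking = sorted(
    cv_means,
    key=lambda model: tuple(cv_means[model][metric] for metric in PRIORITY),
    reverse=True,
)
for rank, model in enumerate(ranking, start=1):
    print(rank, model)
```

Under this reading the first criterion alone decides the order, since no two models share the same mean average precision. A sort on point estimates ignores the overlapping intervals, so the ranking is a tie-breaking convention rather than evidence that one model is better.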

After leakage removal, model discrimination decreased substantially relative to the original leaked analysis. The light gradient boosting machine showed the numerically highest average precision and Matthews correlation coefficient, but confidence intervals overlapped with those of the comparator models. Logistic regression achieved a similar area under the receiver operating characteristic curve. Therefore, the rationale for retaining the light gradient boosting machine was pragmatic rather than absolute: it provided the best overall ranking on the prespecified imbalance-aware criteria while supporting tree-based SHAP interpretation. Table 4 shows the light gradient boosting machine ablation analysis across clinically motivated feature subsets (means with percentile-based empirical 95% intervals).

Table 4. Light gradient boosting machine ablation analysis across clinically motivated feature subsets

| Feature Set | Number of Features | Average Precision (Empirical 95% Interval) | Balanced Accuracy (Empirical 95% Interval) | Matthews Correlation Coefficient (Empirical 95% Interval) |
| --- | --- | --- | --- | --- |
| Symptoms and exposure-only model | 12 | 0.566 [0.440, 0.720] | 0.618 [0.531, 0.716] | 0.243 [0.066, 0.430] |
| Without pleural fluid or imaging variables | 22 | 0.553 [0.442, 0.670] | 0.636 [0.557, 0.755] | 0.289 [0.129, 0.514] |
| Leakage-aware full model | 30 | 0.552 [0.431, 0.644] | 0.629 [0.532, 0.722] | 0.288 [0.074, 0.478] |
| Laboratory-only model | 18 | 0.406 [0.323, 0.518] | 0.551 [0.466, 0.629] | 0.113 [-0.075, 0.303] |

The simplified feature sets remained competitive with, and in some cases slightly outperformed, the full leakage-aware model. In particular, the symptoms/exposure-only model yielded the highest average precision, whereas the model without pleural fluid or imaging variables achieved the strongest Matthews correlation coefficient and balanced accuracy. This staged pattern suggests that simpler pre-definitive models may be clinically more attractive than a maximal feature set and that invasive or procedure-adjacent inputs are not required to obtain the strongest internal signal. Table 5 shows the untouched hold-out test performance of the selected light gradient boosting machine model (95% stratified bootstrap confidence interval).

Table 5. Untouched hold-out test performance of the selected light gradient boosting machine model

| Metric | Hold-out Performance (95% Bootstrap Confidence Interval) |
| --- | --- |
| Area under the receiver operating characteristic curve | 0.660 [0.527, 0.776] |
| Average precision | 0.483 [0.361, 0.669] |
| Accuracy | 0.684 [0.592, 0.765] |
| Balanced accuracy | 0.615 [0.507, 0.719] |
| Precision | 0.464 [0.308, 0.625] |
| Sensitivity | 0.448 [0.276, 0.621] |
| Specificity | 0.783 [0.681, 0.870] |
| Negative predictive value | 0.771 [0.708, 0.836] |
| F1-score | 0.456 [0.291, 0.604] |
| Matthews correlation coefficient | 0.233 [0.015, 0.441] |
| Brier score | 0.211 [0.174, 0.255] |

The hold-out cohort contained only 29 mesothelioma cases. Therefore, sensitivity-related intervals remained wide. The point estimates confirmed only moderate internal discrimination. The Brier score was 0.211, which was close to the prevalence-only reference value of 0.208. Therefore, calibration should be interpreted cautiously rather than presented as a strength. At the prespecified 0.50 threshold, the model favored specificity over sensitivity. Lowering the threshold to 0.40 increased sensitivity from 0.448 to 0.621 while reducing specificity from 0.783 to 0.623; lowering it further to 0.30 increased sensitivity to 0.690 but reduced specificity to 0.493. The Matthews correlation coefficient was highest at 0.50 (0.233) and similar at 0.40 (0.224), showing that a lower threshold may be preferable only if the clinical priority is to reduce missed cases at the cost of more false positives. Table 6 shows the hold-out threshold analysis for the selected light gradient boosting machine model. Binary decisions were recalculated at fixed probability thresholds without retuning the model.
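The prevalence-only reference value quoted above can be checked directly: a model that assigns every patient the hold-out prevalence p has a Brier score of p(1 − p). The sketch below uses only the hold-out class counts from the source; the variable names are illustrative.

```python
# Hold-out cohort: 29 mesothelioma and 69 non-mesothelioma symptomatic cases.
n_pos, n_neg = 29, 69
n_total = n_pos + n_neg
prevalence = n_pos / n_total

# Predicting `prevalence` for everyone: each positive contributes (1 - p)^2
# and each negative contributes p^2 to the mean squared error.
brier_reference = (
    n_pos * (1 - prevalence) ** 2 + n_neg * prevalence ** 2
) / n_total

print(round(prevalence, 3), round(brier_reference, 3))
```

The observed score of 0.211 sits marginally above this reference of about 0.208, which is why the probability estimates cannot be said to improve on an uninformative constant prediction in Brier terms.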

Table 6. Hold-out threshold analysis for the selected light gradient boosting machine model

| Threshold | Sensitivity | Specificity | Negative Predictive Value | Matthews Correlation Coefficient | True Positives | False Positives | True Negatives | False Negatives |
|---|---|---|---|---|---|---|---|---|
| 0.30 | 0.690 | 0.493 | 0.791 | 0.168 | 20 | 35 | 34 | 9 |
| 0.40 | 0.621 | 0.623 | 0.796 | 0.224 | 18 | 26 | 43 | 11 |
| 0.50 | 0.448 | 0.783 | 0.771 | 0.233 | 13 | 15 | 54 | 16 |
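The rate columns of Table 6 follow directly from the four confusion-matrix counts in each row, and the prevalence-only Brier reference follows from the class counts alone. The sketch below recomputes both with standard formulas; the function and variable names are illustrative, and the total of 98 hold-out patients is derived here by summing the counts in any row of Table 6.

```python
from math import sqrt

# (threshold, TP, FP, TN, FN), copied from Table 6.
TABLE_6_COUNTS = [
    (0.30, 20, 35, 34, 9),
    (0.40, 18, 26, 43, 11),
    (0.50, 13, 15, 54, 16),
]


def threshold_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, NPV, and MCC from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        # MCC is undefined when any margin is zero; 0.0 is a common convention.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def prevalence_only_brier(n_pos, n_total):
    """Brier score of a constant prediction equal to the prevalence p: p * (1 - p)."""
    p = n_pos / n_total
    return p * (1.0 - p)


if __name__ == "__main__":
    for threshold, tp, fp, tn, fn in TABLE_6_COUNTS:
        metrics = threshold_metrics(tp, fp, tn, fn)
        print(threshold, {name: round(value, 3) for name, value in metrics.items()})
    print("prevalence-only Brier:", round(prevalence_only_brier(29, 98), 3))
```

Rounded to three decimals, these computations agree with the tabulated values, and the prevalence-only reference evaluates to 0.208 for 29 cases among 98 hold-out patients.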

Figure 1 shows the confusion matrix on the untouched hold-out test set using a fixed probability threshold of 0.50. Figure 2 shows the hold-out discrimination and calibration of the tuned light gradient boosting machine model. The dominant global attributions on the untouched hold-out cohort were age, platelet count, lung side, white blood cell count, duration of asbestos exposure, and total protein. These should be interpreted strictly as model-attribution findings rather than causal or pathophysiological proofs. The SHAP beeswarm plot (Figure 3) and mean absolute SHAP bar plot (Figure 4) suggested that disease-related burden, exposure history, pleural biochemistry, and inflammatory or nutritional status jointly influenced model output.

Figure 1. Confusion matrix on the untouched hold-out test set using a fixed probability threshold of 0.50
Note: Non-MM = Symptomatic non-mesothelioma; MM = Mesothelioma.
Figure 2. Hold-out discrimination and calibration of the tuned light gradient boosting machine model: (a) receiver operating characteristic curve, (b) precision-recall curve labeled with average precision, and (c) calibration curve
Note: AUC = Area under the receiver operating characteristic curve.
Figure 3. Global SHapley Additive exPlanations (SHAP) beeswarm plot for the leakage-aware light gradient boosting machine model on the untouched hold-out cohort
Note: Each point = one patient’s feature contribution on the log-odds scale; color = feature value; horizontal position = direction and magnitude of contribution; PLT = platelet count; WBC = white blood cell; LDH = lactate dehydrogenase.
Figure 4. Mean absolute SHapley Additive exPlanations (SHAP) values for the leakage-aware light gradient boosting machine model on the untouched hold-out cohort
Note: Larger bars = features with greater average contribution magnitude to model output; PLT = platelet count; WBC = white blood cell; LDH = lactate dehydrogenase.

Across repeated cross-validation runs, the most stable SHAP-attributed features were platelet count, lung side, duration of symptoms, duration of asbestos exposure, and age, which indicates that the revised model did not depend on a single leakage-driven variable. Pleural protein and related biochemical variables remained secondary attributions, which is directionally compatible with the broader pleural biomarker literature without implying standalone diagnostic sufficiency (Porcel, 2018).
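The text does not state how SHAP stability was quantified across cross-validation runs. One simple possibility, sketched below as an assumption, is to count how often each feature appears among the top-k features by mean absolute SHAP value across runs. The function name, the choice of k, and the input layout (one array of shape samples by features per run) are hypothetical.

```python
from collections import Counter

import numpy as np


def top_k_frequency(shap_runs, feature_names, k=5):
    """Fraction of runs in which each feature ranks in the top k by mean |SHAP|.

    shap_runs: iterable of arrays with shape (n_samples, n_features),
    one array per cross-validation run (hypothetical input layout).
    """
    counts = Counter()
    n_runs = 0
    for values in shap_runs:
        # Global importance for this run: mean absolute contribution per feature.
        importance = np.abs(np.asarray(values)).mean(axis=0)
        top = np.argsort(importance)[::-1][:k]
        counts.update(feature_names[i] for i in top)
        n_runs += 1
    return {name: counts[name] / n_runs for name in feature_names}
```

Under this measure, a feature with a frequency near 1.0 is consistently influential across runs, whereas a feature that enters the top k only occasionally is more likely to reflect a single-run artifact.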

To move beyond aggregate importance, individual SHAP dependence plots were generated for the four highest-ranking global features (age, platelet count, lung side, and white blood cell count). Figures 5–8 place the feature value on the horizontal axis and the per-case SHAP contribution on the vertical axis. The shape of each scatter plot therefore shows how the contribution attributed to that variable changed across its observed range; because the other inputs are not held fixed, vertical spread at a given feature value reflects interaction with the remaining predictors. Markers are colored by the observed class (mesothelioma versus non-mesothelioma symptomatic case) so that the reader can simultaneously judge the direction of the SHAP effect and whether high-effect regions coincided with the true disease label. Because SHAP values are expressed on the log-odds scale, positive values on the vertical axis correspond to model output shifted toward the mesothelioma prediction and negative values correspond to shifts toward the non-mesothelioma prediction.

Figure 5. Target-colored SHapley Additive exPlanations (SHAP) dependence plot for age
Note: Each point = an untouched hold-out patient; vertical axis = age-specific SHAP contribution on the log-odds scale; marker color = observed class.
Figure 6. Target-colored SHapley Additive exPlanations (SHAP) dependence plot for platelet count
Note: Positive SHAP values shift the light gradient boosting machine output toward mesothelioma and negative values shift it toward the non-mesothelioma symptomatic class.
Figure 7. Target-colored SHapley Additive exPlanations (SHAP) dependence plot for lung side
Note: Raw dataset codes: 0 = left, 1 = right, and 2 = bilateral; vertical axis = per-case SHAP contribution on the log-odds scale.
Figure 8. Target-colored SHapley Additive exPlanations (SHAP) dependence plot for white blood cell count
Note: Marker color = observed class; vertical axis = each patient’s feature-specific SHAP contribution; WBC = white blood cell.

The panel-level reading of Figures 5–8 is as follows. For age, SHAP contributions rose with increasing patient age, with older cases receiving increasingly positive SHAP values and clustering more frequently with the mesothelioma label. This is directionally compatible with the older age distribution of mesothelioma in symptomatic cohorts, but the wide vertical spread at every age value shows that age alone did not determine the decision. For platelet count, dependence was distinctly non-linear: markedly elevated platelet values received large positive SHAP contributions, whereas low-to-normal platelet values contributed negatively, consistent with thrombocytosis acting as an ancillary inflammatory marker in malignant pleural disease rather than as a diagnostic cut-off. For lung side, the categorical dependence pattern mapped left- and right-sided disease close to zero SHAP, while the bilateral code (2) received the strongest positive SHAP contribution, suggesting that the model used laterality as a proxy for more diffuse pleural involvement. For white blood cell count, the dependence plot showed a narrower and noisier pattern: moderately elevated white blood cell values received small positive SHAP contributions, but the effect size remained smaller than for age or platelet count, consistent with white blood cell count acting as a supporting rather than dominant feature. Across all four panels, the vertical scatter at each feature value indicates substantial interaction with the remaining predictors, which is why univariate thresholds cannot reproduce the model’s behavior.

Global SHAP describes how the model behaves across the cohort, but clinical trust also depends on how the model reasons about individual patients. To avoid cherry-picking only correctly classified examples, four illustrative cases were selected directly from the untouched hold-out cohort using the prespecified probability threshold of 0.50: one true positive, one true negative, one false positive, and one false negative. For each case, the local SHAP waterfall plots in Figures 9–12 decompose the difference between the model's baseline log-odds output and the case-specific prediction into additive feature contributions, with features ordered by contribution magnitude. Positive bars push the log-odds toward the mesothelioma class and negative bars push toward the non-mesothelioma class.

Figure 9. Local SHapley Additive exPlanations (SHAP) waterfall plot for a true-positive hold-out case
Note: The plot decomposes the difference between the baseline model output and the case-specific predicted log-odds; red bars push toward mesothelioma and blue bars push toward the non-mesothelioma class; PLT = platelet count; LDH = pleural lactate dehydrogenase; WBC = white blood cell; CRP = C-reactive protein; ALP = alkaline phosphatase.
Figure 10. Local SHapley Additive exPlanations (SHAP) waterfall plot for a true-negative hold-out case
Note: The plot shows feature contributions that collectively move the prediction toward the non-mesothelioma symptomatic class; PLT = platelet count; LDH = pleural lactate dehydrogenase; WBC = white blood cell; CRP = C-reactive protein.
Figure 11. Local SHapley Additive exPlanations (SHAP) waterfall plot for a false-positive hold-out case
Note: The positive prediction is driven by patient-level features that resemble the mesothelioma class despite the observed non-mesothelioma label; PLT = platelet count; LDH = pleural lactate dehydrogenase; WBC = white blood cell; CRP = C-reactive protein; ALP = alkaline phosphatase.
Figure 12. Local SHapley Additive exPlanations (SHAP) waterfall plot for a false-negative hold-out case
Note: Negative feature contributions outweighed available positive evidence, illustrating the low-sensitivity failure mode of the leakage-aware model; PLT = platelet count; LDH = pleural lactate dehydrogenase.

True positive case (hold-out index 156, observed mesothelioma, predicted probability 0.91). The model correctly identified this patient as mesothelioma, and the decision was dominated by a small number of high-magnitude contributors that pushed the log-odds toward mesothelioma: platelet count (SHAP = +1.64), age (SHAP = +0.98), pleural protein (SHAP = +0.21), and pleural lactate dehydrogenase (SHAP = +0.19). A subset of features pushed toward the non-mesothelioma class but with much smaller magnitude (sedimentation, SHAP = –0.19; lung side, SHAP = –0.14), which explains the confidently positive predicted probability. Clinically, this case aligns with the pattern a reviewer might expect: an older patient with thrombocytosis and supportive pleural biochemistry receives a strongly positive SHAP signal.

True negative case (hold-out index 66, observed non-mesothelioma symptomatic case, predicted probability 0.07). The model correctly ruled out mesothelioma for this patient. The decision was shaped by a broad set of contributors that pushed the log-odds toward the non-mesothelioma class rather than by a single dominant feature: age (SHAP = –0.39), platelet count (SHAP = –0.25), lung side (SHAP = –0.23), and sedimentation (SHAP = –0.21). Only a few features carried small positive SHAP weights (for example, C-reactive protein, SHAP = +0.11), and these did not outweigh the accumulated negative evidence. This distributed pattern of negative contributions illustrates that rejecting the mesothelioma label was driven by a combination of demographic, hematological, and pleural biochemical features rather than by any leakage-like single variable.

False positive case (hold-out index 163, observed non-mesothelioma symptomatic case, predicted probability 0.93). This case illustrates the model's characteristic failure mode. Despite the observed non-mesothelioma label, the predicted probability was high, driven toward mesothelioma by platelet count (SHAP = +1.32), age (SHAP = +1.01), pleural protein (SHAP = +0.22), and pleural albumin (SHAP = +0.17). The contributors in the opposite direction were small and insufficient to counterbalance this (for example, lung side, SHAP = –0.13). Clinically, this pattern is interpretable: a non-mesothelioma patient who shares the high platelet count and older age phenotype of the positive class is exactly the type of patient for whom a probabilistic screening model will over-predict, and this over-prediction is a direct consequence of the overlap structure of the leakage-aware predictor set rather than a latent bug in the model.

False negative case (hold-out index 151, observed mesothelioma, predicted probability 0.11). Conversely, this patient had mesothelioma but received a low predicted probability. The top contributors toward the non-mesothelioma class were total protein (SHAP = –0.38), pleural protein (SHAP = –0.24), platelet count (SHAP = –0.23), and pleural albumin (SHAP = –0.23), while the available positive evidence was limited to age (SHAP = +0.33) and glucose (SHAP = +0.11). This case shows that when a mesothelioma patient's pleural protein and platelet values look closer to the non-mesothelioma distribution, the leakage-aware model cannot recover the correct class using the remaining features alone. Such cases are the direct reason that sensitivity is the weakest hold-out metric and reinforce that the model should be understood as a probabilistic aid rather than a standalone diagnostic rule.

Taken together, the four local SHAP waterfall plots show that the leakage-aware light gradient boosting machine model reasons about each case through a clinically recognizable combination of demographic, hematological, and pleural-biochemistry features, with platelet count and age acting as the dominant drivers in both correctly and incorrectly classified patients. The model is internally coherent in the sense that the same features drive predictions in the same direction across true and false cases; what differs is only whether the patient's feature profile happens to lie on the expected side of the class-overlap region. This is consistent with the global SHAP and dependence findings and supports interpreting the local explanations as faithful descriptions of model behavior rather than as independent biological evidence.
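The additive decomposition behind the waterfall plots can be illustrated with the numbers reported for the true-positive case: on the log-odds scale, the model output equals a baseline plus the sum of all feature contributions, and the logistic function converts that sum to a probability. The text reports only the leading contributors, so the sketch below backs out the baseline and the unlisted features as a single remainder; that remainder is derived arithmetic, not a value reported by the authors, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Convert log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Convert probability to log-odds."""
    return math.log(p / (1.0 - p))


def predicted_probability(base_value, contributions):
    """Additive SHAP identity on the log-odds scale: f(x) = base + sum of contributions."""
    return sigmoid(base_value + sum(contributions.values()))


# Contributions listed in the text for the true-positive case (hold-out index 156).
tp_listed = {
    "platelet count": 1.64,
    "age": 0.98,
    "pleural protein": 0.21,
    "pleural lactate dehydrogenase": 0.19,
    "sedimentation": -0.19,
    "lung side": -0.14,
}

# Assumption: the baseline and all unlisted features are lumped into one remainder,
# chosen so that the decomposition reproduces the reported probability of 0.91.
remainder = logit(0.91) - sum(tp_listed.values())

if __name__ == "__main__":
    print("listed contributions sum:", round(sum(tp_listed.values()), 2))
    print("reconstructed probability:", round(predicted_probability(remainder, tp_listed), 2))
```

The same bookkeeping applies to the other three cases; what changes between correct and incorrect predictions is the sign and size of the summed contributions, not the form of the decomposition.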

4. Discussion

The central finding of this study is methodological rather than merely predictive: once direct and near-direct leakage variables were removed, the previously perfect model collapsed to a far more modest and credible performance range. This supports the reviewers' concern that the diagnosis method was effectively encoding the target. The empirical audit in this study confirmed that the diagnosis method was a perfect inverse duplicate of the class of diagnosis, making its inclusion incompatible with any clinically meaningful prediction task. The main contribution of this study is not the production of a high-performing diagnostic model, but the demonstration that leakage-aware validation fundamentally changes the interpretation of both machine learning performance and SHAP explanations in a small clinical dataset. In the leaked analysis, SHAP primarily explained a target-duplicating variable; after leakage removal, SHAP identified weaker but more clinically interpretable attributions. This distinction matters for trustworthy medical artificial intelligence, because explainability methods can faithfully explain invalid models if the input feature set is contaminated.
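A perfect duplicate or inverse duplicate of the target, as described for the diagnosis method, can be detected mechanically by testing whether a column's values map one-to-one onto the class labels. The sketch below is one minimal way to run such a check and is not the authors' audit script; the function names and the toy column names in the usage note are hypothetical. It catches only exact duplicates, so near-direct or post-diagnostic variables (such as cytology or vital status) still require clinical judgment.

```python
def is_target_duplicate(feature, target):
    """True if feature values map one-to-one onto target values.

    An inverse coding (0 where the target is 1 and vice versa) also counts,
    because the target can be recovered exactly from the feature.
    """
    pairs = set(zip(feature, target))
    n_feature_values = len({f for f, _ in pairs})
    n_target_values = len({t for _, t in pairs})
    # One-to-one: every feature value pairs with exactly one target value and back.
    return len(pairs) == n_feature_values == n_target_values and n_target_values > 1


def audit_columns(columns, target):
    """Return the names of columns that exactly duplicate the target."""
    return [name for name, values in columns.items() if is_target_duplicate(values, target)]
```

For example, `audit_columns({"diagnosis_method": [0, 1, 0, 1], "age": [61, 45, 70, 52]}, [1, 0, 1, 0])` flags only the first column, because the toy `diagnosis_method` values are an exact inverse of the labels while age is not.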

The study also addresses two practical design problems simultaneously. First, it redefines the task as pre-definitive classification among symptomatic pleural disease cases rather than discrimination between mesothelioma and genuinely healthy controls. Second, it separates model development from untouched hold-out testing while still estimating variability through repeated cross-validation and stratified bootstrap intervals. The resulting performance is internally valid but exploratory and should be interpreted in the context of the dataset's modest size and single-source origin (Chicco & Rovelli, 2019; Collins et al., 2024; Tanrikulu & Er, 2012).

The SHAP findings are more defensible after leakage removal, but they still require caution. The top hold-out SHAP features were age, platelet count, lung side, white blood cell count, duration of asbestos exposure, and total protein, whereas the most stable cross-validation features were platelet count, lung side, duration of symptoms, duration of asbestos exposure, and age. Taken together, these analyses suggest that platelet count, lung side, duration of asbestos exposure, age, white blood cell count, and symptom duration were the most consistently influential variables across the revised workflow. These findings likely represent how the model partitions the observed data rather than direct biological mechanisms. SHAP should therefore be understood as an attribution framework for model behavior, not as proof of causal disease pathways (Lundberg & Lee, 2017; Lundberg et al., 2020). The strong attribution of lung side and city may partly reflect dataset-specific laterality patterns or geographic exposure structure rather than transportable biology.

Several earlier machine-learning studies on the UCI mesothelioma dataset have reported accuracies above 95%, and some have reported perfect classification. The empirical audit in this study identifies the most likely mechanism: the diagnosis method is an inverse duplicate of the class of diagnosis in the released data. Therefore, any model that uses it will appear perfect without learning anything clinically transferable. The earlier critical re-analysis of this dataset by Chicco & Rovelli (2019) emphasized exactly this point and reported far more modest predictive performance once leakage was controlled. The proposed leakage-aware light gradient boosting machine model reaches hold-out area under the receiver operating characteristic curve, average precision, balanced accuracy, and Matthews correlation coefficient values that are closer in spirit to that re-analysis than to the perfect-accuracy reproductions, which supports interpreting earlier high-accuracy reports as upper bounds that are inflated by target-duplicating variables rather than as realistic estimates of pre-definitive model performance. This pattern is consistent with broader evidence that leakage is one of the most common causes of over-optimistic performance in machine-learning-for-science reproductions, affecting hundreds of studies across biomedicine and related fields (Kapoor & Narayanan, 2023), and with systematic-review findings that machine learning prognostic models in oncology frequently omit explicit leakage audits, uncertainty quantification, or calibration reporting (Dhiman et al., 2022).

The leakage-aware SHAP attributions observed after excluding diagnosis method, cytology, type of mesothelioma, and dead or not are compatible with the clinical phenotype of mesothelioma in symptomatic patients described by current specialty guidance, which emphasizes older age, documented asbestos exposure, and reactive or inflammatory hematological and biochemical patterns as supportive rather than diagnostic features (Popat et al., 2022; Roberts et al., 2023; Scherpereel et al., 2020). None of the top leakage-aware features (age, platelet count, lung side, white blood cell count, and duration of asbestos exposure) are considered stand-alone diagnostic tests in either the practice guidelines of the European Society for Medical Oncology (Popat et al., 2022) or the pleural disease guideline of the British Thoracic Society (Roberts et al., 2023), and both documents explicitly require histological or cytological confirmation for a positive diagnosis. The SHAP findings in this study should therefore be read as model-level descriptions of how a leakage-aware classifier partitions the observed data, not as evidence that these variables independently identify mesothelioma in the absence of pathological confirmation.

Four features distinguish this re-analysis from earlier reports. First, an explicit empirical leakage audit was performed before modeling, and the decision to exclude diagnosis method, cytology, type of mesothelioma, and dead or not is documented together with the reasoning, directly addressing the class of near-target leakage emphasized in recent methodological work (Dhiman et al., 2022; Kapoor & Narayanan, 2023). Second, development and final testing were kept separate: hyperparameter search and model selection were restricted to the development cohort, and the hold-out cohort was used only once to estimate final performance, consistent with the TRIPOD plus artificial intelligence reporting structure for clinical prediction models (Collins et al., 2024). Third, uncertainty is reported throughout, including 2,000 stratified bootstrap resamples for hold-out metrics and percentile-based empirical intervals for repeated cross-validation, so that the reader can judge precision as well as point estimates. Fourth, the explainability section is staged from global SHAP to SHAP dependence plots to local waterfall plots so that the same model is interrogated at cohort, feature, and individual-patient levels; repeated cross-validation SHAP stability checks protect against single-run artifacts, and the clinical-explainable artificial intelligence evaluation guidance of Jin et al. (2023) was used as a qualitative reference for structuring the explanatory figures.

Taken at face value, the leakage-aware model's high specificity and negative predictive value suggest that pre-definitive variables carry a reasonable rule-out signal for mesothelioma in this symptomatic cohort, even though the corresponding sensitivity is low and uncertain. In a realistic clinical pathway, however, rule-out is precisely where premature reassurance is dangerous: a falsely negative probability output from a small-cohort internal model should not delay biopsy in a patient whose clinical phenotype otherwise suggests mesothelioma (Popat et al., 2022; Roberts et al., 2023). This study therefore explicitly cautions against any interpretation of this model as a triage or rule-out tool in its current form. The appropriate clinical use of these findings is methodological: they show what kind of discrimination is achievable from pre-definitive information alone in this particular dataset, and they demonstrate that leakage-aware SHAP explanations do not independently justify deployment of a small-sample, single-source model (Dhiman et al., 2022; Jin et al., 2023; Kapoor & Narayanan, 2023). In principle, non-definitive clinical variables could be explored for triage prioritization when biopsy or cytology is delayed; however, the current model's moderate discrimination, limited calibration evidence, and absence of external validation mean that this possibility remains hypothesis-generating rather than clinically actionable. Any clinical translation would also need to be aligned with current first-line management standards, including the evolving role of combination immune-checkpoint blockade (Peters et al., 2022), and with germline risk considerations related to BRCA1-associated protein 1 (BAP1) in susceptible families (Carbone et al., 2022).
The threshold analysis reinforces this caution: a threshold of 0.30 captured more mesothelioma cases than 0.50 but generated many more false positives, whereas the default 0.50 threshold preserved specificity at the cost of missed cases. Threshold choice should, therefore, be treated as a clinical-policy decision for future externally validated work, not as an optimization result from this small internal hold-out set.

Several limitations remain in this study. The dataset is relatively small and historically curated, and the untouched hold-out cohort contained only 29 mesothelioma cases. This small denominator is a major reason for the wide uncertainty intervals, especially for sensitivity. This study nevertheless retained an untouched hold-out cohort instead of relying only on nested cross-validation because a single untouched internal check provides a cleaner guard against inadvertent reuse during model selection, even though it does not substitute for external validation. Pleural fluid and imaging variables were retained in the primary leakage-aware model because they may be available before definitive diagnosis, but their exact real-world timing can vary across clinical pathways. The staged availability table is therefore a pragmatic clinical mapping of the released variables rather than a timestamp-verified reconstruction of each patient's diagnostic pathway.

A further data-quality limitation concerned platelet count, one of the strongest hold-out SHAP features. The raw dataset contained a maximum platelet count of 3335, exceeding the 99th percentile of 790.39. Because the public release does not provide a verified correction source, this value was retained as originally reported rather than manually edited. To test robustness, the analysis was repeated after removing the maximum-platelet-count record, after winsorizing platelet count at the development-set 99th percentile and after log-transforming platelet count. The hold-out area under the receiver operating characteristic curve changed from 0.660 in the primary analysis to 0.640, 0.660, and 0.660, respectively; corresponding average precision values changed from 0.483 to 0.458, 0.483, and 0.483. Platelet count remained a leading SHAP feature across these sensitivity analyses, supporting that the revised conclusions were not driven solely by a single extreme laboratory value.
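The three platelet sensitivity variants described above (dropping the maximum record, winsorizing at the development-set 99th percentile, and log-transforming) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name is hypothetical, the winsorizing is assumed to cap only the upper tail, and the natural logarithm is assumed because the text does not specify the base.

```python
import numpy as np


def platelet_variants(dev_platelets, platelets):
    """Build the three sensitivity-analysis versions of a platelet-count column.

    dev_platelets: platelet counts from the development set only, so the
    winsorizing cap is not informed by hold-out data.
    platelets: the column to transform (development or hold-out).
    """
    platelets = np.asarray(platelets, dtype=float)
    cap = np.percentile(np.asarray(dev_platelets, dtype=float), 99)

    keep = np.ones(platelets.size, dtype=bool)
    keep[np.argmax(platelets)] = False  # row mask that drops the single maximum record

    return {
        "keep_without_max": keep,
        "winsorized": np.minimum(platelets, cap),  # upper-tail cap only (assumption)
        "log": np.log(platelets),  # natural log; requires strictly positive counts
    }
```

Because tree-based learners split on the ordering of feature values, order-preserving transformations such as capping or a logarithm would be expected to change the fitted splits very little, which is at least consistent with the unchanged values of 0.660 and 0.483 reported for the winsorized and log-transformed analyses.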

In terms of future directions, the most informative next step for this line of work is not a further reproduction on the same dataset, but external validation on contemporary single-center or multi-center pleural cohorts in which the predictor timestamps are explicitly documented, in line with the TRIPOD plus artificial intelligence emphasis on prospective and transferable evaluation (Collins et al., 2024). In parallel, prospective registries linked to established pleural diagnostic pathways (Roberts et al., 2023) would allow three methodological extensions that are difficult to perform on the UCI release: (i) time-stratified modeling that separates pre-imaging, post-imaging, and post-pleural-fluid prediction windows so that feature availability is aligned with real clinical decisions (Roberts et al., 2023); (ii) calibration-aware development, ideally with isotonic or Platt rescaling and with decision-curve analysis at clinically meaningful threshold ranges (Collins et al., 2024; Dhiman et al., 2022); and (iii) fairness and subgroup analysis by age, sex, germline status, and exposure profile (Carbone et al., 2022; Popat et al., 2022), which the present cohort is too small to support. For the explainability layer, future work should compare tree SHAP attributions with model-agnostic alternatives (e.g., permutation importance and local interpretable model-agnostic explanations) and should report SHAP stability explicitly across resamples, as done in the present study, while adopting the clinical-explainable artificial intelligence evaluation criteria summarized in recent reviews (Jin et al., 2023).
Methodologically, further benchmarking against modern tabular baselines, including both gradient-boosted ensembles and deep tabular models (Grinsztajn et al., 2022; Shwartz-Ziv & Armon, 2022), would help determine whether more complex architectures offer any advantage over the light gradient boosting machine on leakage-aware inputs of this scale. Finally, a prospective decision-impact evaluation would be needed before any leakage-aware machine learning model of mesothelioma could be considered for use as a triage aid; such a study would assess not only discrimination and calibration, but also the downstream effect of model outputs on biopsy timing, time to treatment initiation (including immune-checkpoint combination regimens where indicated (Peters et al., 2022)), and patient-reported outcomes.

5. Conclusions

This leakage-aware analysis demonstrates that removing post-diagnostic and target-duplicating variables eliminates the previously observed perfect classification performance in the UCI mesothelioma dataset. After controlling for data leakage, the light gradient boosting machine exhibited only moderate discriminatory ability, accompanied by wide uncertainty intervals and limited calibration performance relative to a prevalence-based Brier reference. Across hold-out SHAP analyses and repeated cross-validation stability assessments, the most consistently influential features were platelet count, lung side, duration of asbestos exposure, age, white blood cell count, and symptom duration. These variables are broadly consistent with the clinical characteristics commonly observed before definitive diagnosis of malignant mesothelioma; however, they cannot, individually or collectively, substitute for histological or cytological confirmation.

The primary contribution of this study is methodological. Leakage-aware validation substantially alters both the reported predictive performance and the interpretation of SHAP-based explanations in a small-sample clinical prediction model. In the presence of leakage, explainability methods primarily reflect the influence of target-duplicating variables. Once leakage is removed, the explanations highlight weaker but more clinically meaningful patterns that are potentially relevant to disease presentation. To improve methodological rigor, model uncertainty was quantified through stratified bootstrap resampling and repeated cross-validation, while additional sensitivity analyses confirmed that the principal findings were not driven by a single extreme laboratory measurement. These observations emphasize the importance of systematic leakage detection, robust validation procedures, and transparent uncertainty reporting in the development of clinical machine-learning models.

From a clinical perspective, the leakage-aware light gradient boosting machine should be regarded as a hypothesis-generating tool rather than a model ready for clinical deployment. Although its performance profile, characterized by higher specificity and negative predictive value than sensitivity, may have potential utility in future triage-oriented research, further evaluation is required. Such evaluation should include external validation, temporal assessment, calibration analysis, and investigation of potential clinical impact before any practical implementation is considered. At present, tissue-based diagnostic confirmation remains the reference standard for malignant mesothelioma diagnosis, and any future explainable machine-learning system should be used only as a complementary decision-support tool rather than a replacement for established diagnostic procedures. Overall, these findings support the need for externally validated, clinically informed, and leakage-audited predictive modeling in malignant mesothelioma, while underscoring the broader responsibility of explainable artificial intelligence research to avoid feature contamination, overinterpretation of small datasets, and unwarranted clinical confidence.

Author Contributions

Conceptualization, M.M. and T.B.D.; methodology, M.M. and T.B.D.; validation, M.M. and T.B.D.; formal analysis, M.M. and T.B.D.; investigation, M.M. and T.B.D.; resources, M.M. and T.B.D.; data curation, M.M. and T.B.D.; writing – original draft preparation, M.M. and T.B.D.; writing – review and editing, M.M. and T.B.D.; visualization, M.M. and T.B.D.; clinical interpretation, T.B.D. All authors have read and agreed to the published version of the manuscript.

Data Availability

The analyzed data are publicly available from the UCI Machine Learning Repository as the Mesothelioma dataset (Tanrikulu & Er, 2012).

The leakage-aware analysis scripts used for model development, SHAP analysis, sensitivity analysis, and manuscript assembly are available from the corresponding author and can be deposited in a public repository at submission.

Acknowledgments

The authors thank the original contributors of the public UCI mesothelioma dataset and the institutions that supported the analytical work. The authors also acknowledge Sakarya University of Applied Sciences (https://subu.edu.tr/) for the technical support provided in publishing this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References
Carbone, M., Adusumilli, P. S., Alexander, H. R., Baas, P., Bardelli, F., Bononi, A., Bueno, R., Felley-Bosco, E., Galateau-Salle, F., & Jablons, D. et al. (2019). Mesothelioma: Scientific clues for prevention, diagnosis, and therapy. CA Cancer J. Clin., 69(5), 402–429.
Carbone, M., Pass, H. I., Ak, G., Alexander Jr, H. R., Baas, P., Baumann, F., Blakely, A. M., Bueno, R., Bzura, A., & Cardillo, G. et al. (2022). Medical and surgical care of patients with mesothelioma and their relatives carrying germline BAP1 mutations. J. Thorac. Oncol., 17(7), 873–889.
Chicco, D. & Rovelli, C. (2019). Computational prediction of diagnosis and feature selection on mesothelioma patient health records. PLoS One, 14(1), e0208737.
Collins, G. S., Moons, K. G. M., Dhiman, P., Riley, R. D., Beam, A. L., Van Calster, B., Ghassemi, M., Liu, X., Reitsma, J. B., & van Smeden, M. et al. (2024). TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ, 385, e078378.
Dhiman, P., Ma, J., Andaur Navarro, C. L., Speich, B., Bullock, G., Damen, J. A. A., Hooft, L., Kirtley, S., Riley, R. D., & Van Calster, B. et al. (2022). Methodological conduct of prognostic prediction models developed using machine learning in oncology: A systematic review. BMC Med. Res. Methodol., 22(1), 101.
Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? Adv. Neural Inf. Process. Syst., 35, 507–520.
Jin, W., Li, X., Fatehi, M., & Hamarneh, G. (2023). Guidelines and evaluation of clinical explainable AI in medical image analysis. Med. Image Anal., 84, 102684.
Kapoor, S. & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst., 30.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S. I. (2020). From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell., 2(1), 56–67.
Lundberg, S. M. & Lee, S. I. (2017). A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst., 30.
Naso, J. R., Cheung, S., Ionescu, D. N., & Churg, A. (2021). Utility of SOX6 and DAB2 for the diagnosis of malignant mesothelioma. Am. J. Surg. Pathol., 45(9), 1245–1251.
Peters, S., Scherpereel, A., Cornelissen, R., Oulkhouir, Y., Greillier, L., Kaplan, M. A., Talbot, T., Monnet, I., Hiret, S., & Baas, P. et al. (2022). First-line nivolumab plus ipilimumab versus chemotherapy in patients with unresectable malignant pleural mesothelioma: 3-year outcomes from CheckMate 743. Ann. Oncol., 33(5), 488–499.
Popat, S., Baas, P., Faivre-Finn, C., Girard, N., Nicholson, A. G., Nowak, A. K., Opitz, I., Scherpereel, A., Reck, M., & ESMO Guidelines Committee. (2022). Malignant pleural mesothelioma: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann. Oncol., 33(2), 129–142.
Porcel, J. M. (2018). Biomarkers in the diagnosis of pleural diseases: A 2018 update. Ther. Adv. Respir. Dis., 12, 1753466618808660.
Roberts, M. E., Rahman, N. M., Maskell, N. A., Bibby, A. C., Blyth, K. G., Corcoran, J. P., Edey, A., Evison, M., de Fonseka, D., & Hallifax, R. et al. (2023). British Thoracic Society Guideline for pleural disease. Thorax, 78(Suppl 3), s1–s42.
Scherpereel, A., Opitz, I., Berghmans, T., Psallidas, I., Glatzer, M., Rigau, D., Astoul, P., Bölükbas, S., Boyd, J., & Coolen, J. et al. (2020). ERS/ESTS/EACTS/ESTRO guidelines for the management of malignant pleural mesothelioma. Eur. Respir. J., 55(6), 1900953.
Shwartz-Ziv, R. & Armon, A. (2022). Tabular data: Deep learning is not all you need. Inf. Fusion, 81, 84–90.
Tanrikulu, A. & Er, O. (2012). Mesothelioma’s disease data set. UCI Machine Learning Repository.
Woolhouse, I., Bishop, L., Darlison, L., De Fonseka, D., Edey, A., Edwards, J., Faivre-Finn, C., Fennell, D. A., Holmes, S., & Kerr, K. M. et al. (2018). British Thoracic Society Guideline for the investigation and management of malignant pleural mesothelioma. Thorax, 73(Suppl 1), i1–i30.
Appendix

The supplementary appendix includes several tables summarizing key aspects of the study. Table A1 summarizes the hyperparameter search space used during development-set tuning. Table A2 summarizes the key raw categorical encodings referenced in the SHAP figures and methods. Table A3 documents SHAP rank stability across repeated cross-validation runs. Table A4 lists the final selected hyperparameters for logistic regression, support vector machine, k-nearest neighbors, and the light gradient boosting machine. Table A5 reports platelet outlier sensitivity analyses.

Table A1. Hyperparameter grids used for development-set model tuning

| Model | Parameter | Candidate Values |
| --- | --- | --- |
| Logistic regression | C | 0.1, 1.0, 10.0 |
| Support vector machine | C | 0.1, 1.0, 10.0 |
| Support vector machine | gamma | scale, 0.1 |
| k-nearest neighbors | n_neighbors | 5, 9, 13 |
| k-nearest neighbors | weights | uniform, distance |
| k-nearest neighbors | p | 1, 2 |
| Light gradient boosting machine | n_estimators | 100, 200 |
| Light gradient boosting machine | learning_rate | 0.03, 0.1 |
| Light gradient boosting machine | num_leaves | 15, 31 |
| Light gradient boosting machine | max_depth | -1, 6 |
| Light gradient boosting machine | min_child_samples | 5, 20 |
| Light gradient boosting machine | subsample | 0.8, 1.0 |
| Light gradient boosting machine | colsample_bytree | 0.8 |
| Light gradient boosting machine | reg_lambda | 0.0, 1.0 |
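
As a rough illustration of the size of the search space in Table A1, the grids can be written as plain Python dictionaries and expanded into explicit candidate configurations. This is a minimal sketch, not the authors' script: the dictionary keys assume the scikit-learn and LightGBM spellings `n_neighbors` and `n_estimators`, and the names `GRIDS`, `expand_grid`, and `CANDIDATE_COUNTS` are hypothetical.

```python
from itertools import product

# Candidate values transcribed from Table A1. Key spellings (n_neighbors,
# n_estimators) are assumed to match scikit-learn / LightGBM parameter names.
GRIDS = {
    "logistic_regression": {"C": [0.1, 1.0, 10.0]},
    "svm": {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1]},
    "knn": {
        "n_neighbors": [5, 9, 13],
        "weights": ["uniform", "distance"],
        "p": [1, 2],
    },
    "lightgbm": {
        "n_estimators": [100, 200],
        "learning_rate": [0.03, 0.1],
        "num_leaves": [15, 31],
        "max_depth": [-1, 6],
        "min_child_samples": [5, 20],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8],
        "reg_lambda": [0.0, 1.0],
    },
}


def expand_grid(grid):
    """Return every combination of candidate values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Number of configurations each model contributes to an exhaustive search.
CANDIDATE_COUNTS = {name: len(expand_grid(grid)) for name, grid in GRIDS.items()}
```

Dictionaries of this shape could, for example, be passed as the `param_grid` argument of scikit-learn's `GridSearchCV`; whether the original analysis used that class is not stated in the source.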

Table A2. Key raw categorical encodings used in modeling and SHapley Additive exPlanations (SHAP) visual interpretation

| Variable | Raw Codes |
| --- | --- |
| Gender | 0 = female, 1 = male |
| Asbestos exposure | 0 = no, 1 = yes |
| Lung side | 0 = left, 1 = right, 2 = bilateral |
| Habit of smoking | 0 = none, 1 = rare, 2 = regular, 3 = frequent |
| City | 0–8 = nine raw University of California, Irvine (UCI) city codes (city names not provided in the public release) |

Table A3. SHapley Additive exPlanations (SHAP) feature-rank stability across repeated cross-validation runs

| Feature | Mean Rank | Median Rank | Top 5 Frequency | Top 10 Frequency |
| --- | --- | --- | --- | --- |
| Platelet count | 1.53 | 1.0 | 15 | 15 |
| Lung side | 3.0 | 3.0 | 15 | 15 |
| Duration of symptoms | 5.53 | 6.0 | 7 | 14 |
| Duration of asbestos exposure | 6.8 | 6.0 | 5 | 14 |
| Age | 6.0 | 5.0 | 8 | 13 |
| White blood cell count | 6.8 | 7.0 | 6 | 13 |
| C-reactive protein | 6.8 | 6.0 | 6 | 12 |
| Pleural protein | 8.6 | 8.0 | 3 | 10 |
| Glucose | 9.4 | 8.0 | 3 | 9 |
| Pleural lactate dehydrogenase | 11.87 | 11.0 | 0 | 6 |
| Serum lactate dehydrogenase | 11.87 | 13.0 | 2 | 5 |
| Pleural glucose | 13.67 | 14.0 | 1 | 5 |
| Total protein | 13.47 | 15.0 | 1 | 4 |
| Albumin | 14.27 | 13.0 | 0 | 2 |
| Pleural fluid white blood cell count | 15.07 | 15.0 | 0 | 2 |
| Sedimentation | 15.27 | 16.0 | 0 | 2 |
| Alkaline phosphatase | 16.0 | 16.0 | 0 | 2 |
| Pleural albumin | 16.53 | 18.0 | 1 | 2 |
| Hemoglobin | 18.8 | 19.0 | 0 | 2 |
| City | 19.67 | 22.0 | 2 | 2 |
| Chest pain | 21.27 | 23.0 | 0 | 1 |
| Gender | 16.67 | 17.0 | 0 | 0 |
| Weakness | 23.2 | 23.0 | 0 | 0 |
| Pleural thickness on computed tomography | 23.6 | 24.0 | 0 | 0 |
| Pleural fluid pH flag | 24.2 | 25.0 | 0 | 0 |
| Dyspnea | 24.8 | 26.0 | 0 | 0 |
| Performance status | 25.07 | 25.0 | 0 | 0 |
| Asbestos exposure | 27.47 | 28.0 | 0 | 0 |
| Pleural effusion | 28.27 | 29.0 | 0 | 0 |
| Habit of smoking | 29.53 | 30.0 | 0 | 0 |
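
The columns of Table A3 can be understood through a small sketch of how such rank-stability summaries might be computed. It assumes that each cross-validation run yields one mean absolute SHAP value per feature; the function name `rank_stability`, the toy feature names, and the toy values are hypothetical and are not taken from the study.

```python
from statistics import mean, median


def rank_stability(importance_by_run, top_ks=(5, 10)):
    """Summarize how stable each feature's importance rank is across runs.

    importance_by_run: list of dicts mapping feature name to its mean
    absolute SHAP value in one run (larger means more important).
    """
    ranks = {}
    for run in importance_by_run:
        # Rank 1 is the most important feature in this run.
        ordered = sorted(run, key=run.get, reverse=True)
        for position, feature in enumerate(ordered, start=1):
            ranks.setdefault(feature, []).append(position)

    summary = {}
    for feature, positions in ranks.items():
        row = {"mean_rank": mean(positions), "median_rank": median(positions)}
        for k in top_ks:
            # Number of runs in which the feature ranked within the top k.
            row[f"top{k}_freq"] = sum(1 for r in positions if r <= k)
        summary[feature] = row
    return summary


# Toy example with three runs and three features (illustrative values only).
toy_runs = [
    {"platelet": 0.4, "age": 0.3, "crp": 0.1},
    {"platelet": 0.5, "age": 0.2, "crp": 0.3},
    {"platelet": 0.5, "age": 0.6, "crp": 0.1},
]
toy_summary = rank_stability(toy_runs, top_ks=(1, 2))
```

Under this reading, a feature with a top-5 frequency equal to the number of runs, such as platelet count and lung side in Table A3, ranked among the five most important features in every run.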

Table A4. Final selected hyperparameters from development-set grid search for all tuned models

| Model | Parameter | Selected Value |
| --- | --- | --- |
| Logistic regression | C | 10.0 |
| Logistic regression | class_weight | balanced |
| Support vector machine | C | 10.0 |
| Support vector machine | gamma | 0.1 |
| Support vector machine | class_weight | balanced |
| Support vector machine | probability | true |
| k-nearest neighbors | n_neighbors | 9 |
| k-nearest neighbors | p | 1 |
| k-nearest neighbors | weights | distance |
| Light gradient boosting machine | colsample_bytree | 0.8 |
| Light gradient boosting machine | learning_rate | 0.03 |
| Light gradient boosting machine | max_depth | -1 |
| Light gradient boosting machine | min_child_samples | 5 |
| Light gradient boosting machine | n_estimators | 100 |
| Light gradient boosting machine | num_leaves | 31 |
| Light gradient boosting machine | reg_lambda | 1.0 |
| Light gradient boosting machine | subsample | 0.8 |
| Light gradient boosting machine | class_weight | balanced |

Table A5. Platelet outlier sensitivity analyses on the untouched hold-out cohort

| Analysis | Area Under the Receiver Operating Characteristic Curve | Average Precision | Balanced Accuracy | Matthews Correlation Coefficient | Platelet Rank | Mean Absolute SHapley Additive exPlanations (SHAP) Value of Platelet Count | Top Features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Primary leakage-aware model | 0.660 | 0.483 | 0.615 | 0.233 | 2 | 0.372 | Age, platelet count, lung side, white blood cell count, duration of asbestos exposure, and total protein |
| Sensitivity 1: remove maximum platelet count record | 0.640 | 0.458 | 0.569 | 0.136 | 2 | 0.377 | Age, platelet count, lung side, white blood cell count, duration of asbestos exposure, and pleural protein |
| Sensitivity 2: winsorize platelet count at development-set 99th percentile | 0.660 | 0.483 | 0.615 | 0.233 | 2 | 0.372 | Age, platelet count, lung side, white blood cell count, duration of asbestos exposure, and total protein |
| Sensitivity 3: log-transform platelet count | 0.660 | 0.483 | 0.615 | 0.233 | 2 | 0.372 | Age, platelet count, lung side, white blood cell count, duration of asbestos exposure, and total protein |
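
The winsorization and log-transform steps named in Table A5 can be sketched as follows. The key point is that the cap is estimated on the development set only and then applied unchanged to the hold-out cohort, so no hold-out information enters preprocessing. This is a minimal illustration under assumptions: the helper names are hypothetical, the percentile uses linear interpolation between order statistics (the exact percentile definition used in the study is not stated), and the platelet values are made up.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def winsorize_upper(values, cap):
    """Clip values above the cap; values at or below it are left unchanged."""
    return [min(v, cap) for v in values]


def log_transform(values):
    """Natural log transform; assumes strictly positive measurements."""
    return [math.log(v) for v in values]


# Made-up platelet counts: the cap comes from the development set alone.
dev_platelets = list(range(1, 102))       # 101 illustrative values, 1..101
holdout_platelets = [50, 500]             # 500 plays the role of an extreme record

cap = percentile(dev_platelets, 99.0)
dev_capped = winsorize_upper(dev_platelets, cap)
holdout_capped = winsorize_upper(holdout_platelets, cap)
holdout_logged = log_transform(holdout_platelets)
```

Because a logarithm is strictly increasing, it preserves the ordering of the measurements, and tree-based models split on ordering rather than on scale; this may be one reason a log-transform leaves such a model's results largely unchanged, although the source does not state this explanation.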


Cite this:
Mansour, M. & Donmez, T. B. (2025). Leakage-Aware Explainable Machine Learning for Malignant Mesothelioma Classification Using Clinical and Laboratory Features. Healthcraft. Front., 3(3), 139-157. https://doi.org/10.56578/hf030303
©2025 by the author(s). Published by Acadlore Publishing Services Limited, Hong Kong. This article is available for free download and can be reused and cited, provided that the original published version is credited, under the CC BY 4.0 license.