ACADlore - HF, Volume 3, Issue 3, 2025

Malignant mesothelioma remains a diagnostic challenge because of its phenotypic overlap with benign pleural diseases and the reliance on invasive procedures for definitive confirmation. To address these limitations, a leakage-aware, explainable machine learning framework was developed and applied to a publicly available mesothelioma dataset comprising 324 cases (96 mesothelioma, 228 symptomatic non-mesothelioma). Variables prone to target leakage or unavailable at the point of diagnosis (diagnosis method, cytology results, mesothelioma subtype, and survival status) were systematically excluded; the diagnosis method in particular was identified as a perfect inverse surrogate of the target variable. The remaining features were stratified into three stages prior to model development: initial clinical presentation, post-imaging, and post-pleural-fluid analysis. The dataset was partitioned into a development cohort (n = 226) and an independent hold-out cohort (n = 98). Multiple classifiers, including logistic regression, support vector machine, k-nearest neighbors, and light gradient boosting machine, were optimized via grid search and evaluated using repeated stratified 5-fold cross-validation. The light gradient boosting machine exhibited superior performance, achieving the highest average precision (0.543) and Matthews correlation coefficient (0.306) during cross-validation. On the unseen hold-out cohort, the light gradient boosting machine yielded an area under the receiver operating characteristic curve of 0.660, average precision of 0.483, balanced accuracy of 0.615, and Matthews correlation coefficient of 0.233. At the conventional 0.50 threshold, sensitivity was 0.448, specificity 0.783, and negative predictive value 0.771; lowering the threshold to 0.30 increased sensitivity to 0.690 at the cost of reducing specificity to 0.493.
SHapley Additive exPlanations (SHAP) identified age, platelet count, lung side, white blood cell count, and duration of asbestos exposure as the most influential predictors. This leakage-aware, explainable light gradient boosting machine model delivers clinically interpretable diagnostic predictions while mitigating target leakage, demonstrating moderate discrimination and potential utility in real-world clinical settings. These findings warrant further external validation and prospective evaluation to confirm generalizability and clinical impact.
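The threshold-dependent hold-out metrics reported above can be cross-checked with a short calculation. The sketch below is illustrative only: the confusion-matrix counts are not reported in the abstract and are an assumption, inferred here by supposing that the 98-case hold-out cohort preserves the overall class ratio (roughly 29 mesothelioma and 69 non-mesothelioma cases). The function name `confusion_metrics` is likewise hypothetical. Under these assumptions, the counts shown are consistent with the reported sensitivity, specificity, negative predictive value, balanced accuracy, and Matthews correlation coefficient to three decimal places.

```python
from math import sqrt


def confusion_metrics(tp, fn, tn, fp):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    npv = tn / (tn + fn)                    # negative predictive value
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "npv": npv,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# ASSUMED counts (not reported in the source): 29 positives, 69 negatives,
# chosen so that the derived metrics match the reported values.
at_050 = confusion_metrics(tp=13, fn=16, tn=54, fp=15)  # threshold 0.50
at_030 = confusion_metrics(tp=20, fn=9, tn=34, fp=35)   # threshold 0.30

for label, metrics in (("0.50", at_050), ("0.30", at_030)):
    summary = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items())
    print(f"threshold {label}: {summary}")
```

Read this way, lowering the threshold from 0.50 to 0.30 would correspond to detecting about 7 additional mesothelioma cases in the hold-out cohort while adding about 20 false positives, which makes the sensitivity and specificity trade-off concrete.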
