Transformer-Driven Feature Fusion for Robust Diagnosis of Lung Cancer Brain Metastasis Under Missing-Modality Scenarios
Abstract:
Accurate diagnosis of lung cancer brain metastasis is often hindered by incomplete magnetic resonance imaging (MRI) modalities, resulting in suboptimal utilization of complementary radiological information. To address the challenge of ineffective feature integration in missing-modality scenarios, a Transformer-based multi-modal feature fusion framework, referred to as Missing Modality Transformer (MMT), was introduced. In this study, multi-modal MRI data from 279 individuals diagnosed with lung cancer brain metastasis, including both small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), were acquired and processed through a standardized radiomics pipeline encompassing feature extraction, feature selection, and controlled data augmentation. The proposed MMT framework was trained and evaluated under various single-modality and combined-modality configurations to assess its robustness to modality absence. A maximum diagnostic accuracy of 0.905 was achieved under single-modality missing conditions, exceeding the performance of the full-modality baseline by 0.017. Interpretability was further strengthened through systematic analysis of loss-function hyperparameters and quantitative assessments of modality-specific importance. The experimental findings collectively indicate that the MMT framework provides a reliable and clinically meaningful solution for diagnostic environments in which imaging acquisition is limited by patient conditions, equipment availability, or time constraints. These results highlight the potential of Transformer-based radiomics fusion to advance computational neuro-oncology by improving diagnostic performance, enhancing robustness to real-world imaging variability, and offering transparent interpretability that aligns with clinical decision-support requirements.
1. Introduction
Lung cancer is one of the most prevalent malignancies globally and a leading cause of cancer-related mortality [1], [2]. This malignant tumor originates from cancerous cells in lung tissue. Pathologically and therapeutically, it is classified into two main categories: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC encompasses several histological subtypes, including adenocarcinoma and squamous cell carcinoma, while SCLC constitutes a distinct category [3]. SCLC accounts for approximately 15% of lung cancer cases, whereas NSCLC makes up the remaining 85% [4], [5], [6]. Over half of newly diagnosed lung cancer patients present with advanced or metastatic disease [7], with 10%–26% exhibiting brain metastases at diagnosis and another 30% developing them during disease progression [8], [9], [10]. Because early symptoms are often indistinct, lung cancer is frequently detected at an advanced stage, after metastasis has already occurred, making treatment and management particularly challenging.
With the rapid advancement of artificial intelligence, machine learning methods have become crucial in diagnosing brain metastases from lung cancer. Radiomics is widely employed to extract high-throughput features from medical images, significantly enhancing the interpretability of image-assisted diagnosis [11], [12], [13]. For instance, Xu et al. [14] utilized radiomics to extract features from preoperative chest CT scans to predict anaplastic lymphoma kinase (ALK) gene molecular signatures. However, while traditional machine learning offers strong interpretability, it often struggles with complex nonlinear problems inherent in multi-modal MRI data. Deep learning, a subset of machine learning, excels at processing such intricate relationships [15], [16]. For example, Guo et al. [17] developed a multi-modal MRI decision fusion network for glioma classification, and Li et al. [18] proposed a Hybrid Graph Convolutional Network (HGCN) for cancer survival prediction. Despite these breakthroughs, existing methods often lack robust diagnostic capabilities when specific modalities are missing.
A dataset of lung cancer brain metastasis patients from the Affiliated Hospital of Hebei University was utilized in this study. Using radiomics, high-throughput feature information was extracted from MRI images [19]. Based on the Transformer architecture, a network was designed for missing-modality scenarios. It employs a modal interaction self-attention mechanism to learn inter-modal correlations and generate multi-level feature representations for missing modalities. This approach enables effective reconstruction and maintains diagnostic accuracy even when modalities are incomplete.
2. Methodology
Aiming to address the critical issue of ineffective feature fusion in the diagnosis of brain metastases from lung cancer when MRI image modalities are missing, a Transformer-based feature fusion network, termed Missing Modality Transformer (MMT), was proposed in this study.
Multimodal Fusion Network—Variational Autoencoder (MFN-VAE) is a Variational Autoencoder (VAE) model designed to integrate multi-modal features. It processes multi-modal data by mapping them into a unified latent space. By learning correlations between different modalities, MFN-VAE achieves deep feature integration. The latent representations learned by MFN-VAE serve as robust features for downstream tasks, significantly improving performance. The model consists of three components: an encoder, a decoder, and a predictor. In the lung cancer brain metastasis prediction task, the dataset includes three modalities: T1-weighted imaging (T1WI), fluid-attenuated inversion recovery (FLAIR), and diffusion-weighted imaging (DWI). Each modality is processed through the VAE to learn latent representations, which are then fed into the predictor to forecast brain metastasis outcomes.
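To make the encoder–decoder–predictor structure concrete, the following minimal Python sketch (assuming PyTorch; the layer widths, latent dimension, and mean-based fusion are illustrative choices rather than the configuration used in this study) shows how per-modality radiomics features can be encoded into a shared latent space and passed to a decoder and a predictor.

```python
import torch
import torch.nn as nn

class MultiModalVAE(nn.Module):
    """Illustrative MFN-VAE-style model: per-modality encoders, a shared latent space,
    a decoder for feature reconstruction, and a predictor for classification."""

    def __init__(self, feat_dim=100, latent_dim=32, n_modalities=3, n_classes=2):
        super().__init__()
        # One encoder per modality (T1WI, FLAIR, DWI), each producing a mean and a log-variance.
        self.encoders = nn.ModuleList(
            [nn.Linear(feat_dim, 2 * latent_dim) for _ in range(n_modalities)]
        )
        # Decoder reconstructs the concatenated radiomics features from the fused latent code.
        self.decoder = nn.Linear(latent_dim, feat_dim * n_modalities)
        # Predictor maps the fused latent code to class logits (SCLC vs. NSCLC).
        self.predictor = nn.Linear(latent_dim, n_classes)

    def forward(self, feats):  # feats: list of (batch, feat_dim) tensors, one per modality
        mus, logvars = [], []
        for enc, x in zip(self.encoders, feats):
            mu, logvar = enc(x).chunk(2, dim=-1)
            mus.append(mu)
            logvars.append(logvar)
        # Simple fusion: average the modality-wise distributions into one latent code.
        mu, logvar = torch.stack(mus).mean(0), torch.stack(logvars).mean(0)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), self.predictor(z), mu, logvar
```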
Self-attention mechanisms are a cornerstone of Transformer models and a key factor behind their outstanding performance. These mechanisms determine element importance by comparing the similarity between positions in a sequence. Specifically, through self-attention calculations, each position generates an attention weight vector reflecting its relative importance. This adaptive allocation of attention enables the model to process sequence data efficiently. The theory revolves around three core components: Query (Q), Key (K), and Value (V). In Transformer models, input sequences are encoded and mapped to these three embeddings. A classic implementation is the Scaled Dot-Product Attention (SDPA) mechanism, as illustrated in Figure 1.
Figure 1. The Scaled Dot-Product Attention (SDPA) mechanism
The SDPA mechanism involves performing dot-product operations on $Q$ and $K$ elements, followed by scaling the result. The scaling factor is a constant introduced to prevent the dot-product result from becoming excessively large, ensuring the values remain within a reasonable range. Subsequently, the scaled dot-product result undergoes SoftMax normalization to obtain attention weights. The SoftMax function maps scores to a range between 0 and 1, representing the attention importance of each position relative to others. Finally, the attention weights are used to compute a weighted sum of the $V$ vectors to generate the context representation for each position. This process effectively integrates information from other positions into the current position’s representation. The specific calculation formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
where, $d_k$ denotes the dimension of the $K$ vector.
In practice, matrix multiplication enables the simultaneous calculation of similarities between multiple $Q$ and $K$ pairs, facilitating efficient batch processing. The SDPA mechanism is favored for its computational simplicity and efficiency, allowing it to effectively capture both local and global dependencies within sequences, which explains its widespread adoption in Transformer models.
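The following short Python sketch (assuming PyTorch; tensor shapes are illustrative) implements the batched SDPA computation described above.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Scaled Dot-Product Attention: SoftMax(QK^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # similarity of every query to every key
    weights = torch.softmax(scores, dim=-1)             # attention weights in [0, 1]
    return weights @ V, weights                          # context representations and weights

# Example: a batch of 4 sequences of length 10 with 64-dimensional embeddings.
Q = K = V = torch.randn(4, 10, 64)
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)  # torch.Size([4, 10, 64]) torch.Size([4, 10, 10])
```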
Modal interaction self-attention is the core of MMT’s design, enhancing the interaction between different modalities. This enables MMT to maintain high performance even when certain modalities are missing by leveraging existing modalities to generate feature information related to the missing ones. Taking MMT-2M as an example, the four inputs to the modal interaction self-attention mechanism consist of Modal 1 features, Modal 2 features, the Missing Modality Token, and the Classification Token. These inputs first undergo standardization through a Layer Normalization (LayerNorm) layer. Subsequently, the features of Modal 1, Modal 2, and the Missing Modality Token are mapped to a larger feature space and decomposed into $Q, K$, and $V$ components. The Classification Token is similarly mapped, but only its $Q$ component is retained.
Regarding the strategy for cross-modal interaction, the proposed design principle focuses on enhancing interactions between the tokens of Modal 1, Modal 2, and the missing modality, while the Classification Token only interacts with the missing modality. Technically, $K$ and $V$ are responsible for fusing with other modalities, whereas $Q$ solely handles forward propagation. Specifically, the $K$ vectors from each modality undergo a Hadamard product operation and are summed with their corresponding $Q$ vectors. The Hadamard product is an element-wise multiplication operation that yields a result matrix of matching dimensions. The $V$ vectors of each modality are then multiplied with their corresponding $K$ vectors. Finally, the $Q$ vector associated with the Classification Token is combined with the $K$ and $V$ vectors of the missing modality for propagation.
Through this modality interaction process, the $Q$ vectors of each modality integrate feature information from all modalities. These four $Q$ vectors then pass through a LayerNorm layer and a Feedforward Neural Network (FNN) with shared weights. To prevent excessive interference from modal interactions that could obscure original features, these $Q$ vectors are ultimately combined with the pre-differentiated vectors for output. After passing through two modal interaction attention layers, the four tensors are concatenated and fed into the Multi-Layer Perceptron (MLP). Figure 2 and Figure 3 illustrate the architectures of the FNN and the MLP.
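As an illustration only, the sketch below (assuming PyTorch) encodes one possible reading of the interaction rule described above for MMT-2M: the Hadamard-product mixing of $K$ vectors added to each $Q$, the $K$–$V$ products, the restriction of the Classification Token to a $Q$ component interacting with the missing-modality stream, and the shared FNN with a residual connection to the pre-interaction features. The class name, dimensions, and exact fusion order are interpretations of the text, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ModalInteractionAttention(nn.Module):
    """Speculative sketch of one modal-interaction layer for MMT-2M, following the textual
    description; not the authors' implementation."""

    def __init__(self, in_dim=64, hid_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim)
        # Modal 1, Modal 2, and the Missing Modality Token are mapped to a larger space as Q, K, V.
        self.to_qkv = nn.Linear(in_dim, 3 * hid_dim)
        # The Classification Token is mapped but only its Q component is retained.
        self.to_q_cls = nn.Linear(in_dim, hid_dim)
        # Shared feedforward network applied to all four token streams.
        self.ffn = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.GELU(), nn.Linear(hid_dim, hid_dim))
        self.out_norm = nn.LayerNorm(hid_dim)
        # Residual path back to the pre-interaction representations.
        self.proj_in = nn.Linear(in_dim, hid_dim)

    def forward(self, m1, m2, miss_tok, cls_tok):
        tokens = [self.norm(t) for t in (m1, m2, miss_tok)]
        q, k, v = zip(*(self.to_qkv(t).chunk(3, dim=-1) for t in tokens))
        q_cls = self.to_q_cls(self.norm(cls_tok))
        # Cross-modal interaction (one interpretation): the K vectors are mixed by Hadamard
        # product and added to each Q; each V is gated by its own K; the classification Q
        # interacts only with the missing-modality stream.
        k_mix = k[0] * k[1] * k[2]
        fused = [qi + k_mix + ki * vi for qi, ki, vi in zip(q, k, v)]
        fused_cls = q_cls + k[2] * v[2]
        outs = []
        for pre, f in zip((m1, m2, miss_tok, cls_tok), fused + [fused_cls]):
            outs.append(self.ffn(self.out_norm(f)) + self.proj_in(pre))  # shared FNN + residual
        return outs  # updated [Modal 1, Modal 2, Missing Modality Token, Classification Token]
```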
Figure 2. Architecture of the Feedforward Neural Network (FNN)
Figure 3. Architecture of the Multi-Layer Perceptron (MLP)
An FNN consists of two fully connected layers, each followed by a LayerNorm layer and a Dropout layer. As a widely adopted regularization technique, the Dropout layer randomly discards a portion of neurons during training to mitigate overfitting risks. The MLP consists of three fully connected layers. The first layer is preceded by a LayerNorm layer, followed by another LayerNorm layer and a Gaussian Error Linear Unit (GELU) activation function. The second fully connected layer also includes a LayerNorm layer and a GELU activation function. The third fully connected layer is terminated by a SoftMax layer.
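A minimal sketch of the two sub-networks as described, assuming PyTorch; the hidden widths and dropout rate are placeholders rather than the study's settings.

```python
import torch.nn as nn

def make_fnn(dim, hidden=128, p_drop=0.1):
    """FNN as described: two fully connected layers, each followed by LayerNorm and Dropout."""
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.LayerNorm(hidden), nn.Dropout(p_drop),
        nn.Linear(hidden, dim), nn.LayerNorm(dim), nn.Dropout(p_drop),
    )

def make_mlp(in_dim, hidden=128, n_classes=2):
    """MLP head as described: three fully connected layers with LayerNorm and GELU,
    terminated by a SoftMax layer."""
    return nn.Sequential(
        nn.LayerNorm(in_dim),  # LayerNorm preceding the first fully connected layer
        nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.GELU(),
        nn.Linear(hidden, n_classes), nn.Softmax(dim=-1),
    )
```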
The loss function of MMT consists of two components: reconstruction loss and classification loss. The reconstruction loss employs the Mean Absolute Error (MAE) loss. This function calculates the absolute difference between reconstructed and original features, averaging these differences to determine the final loss value. Compared to Mean Squared Error (MSE) loss, MAE demonstrates greater robustness by being less sensitive to outliers. Minimizing MAE helps the model generate reconstructed features that closely resemble the original data.
The formula is as follows:
$$L_{\mathrm{MAE}} = \frac{1}{m}\sum_{i=1}^{m}\left|x_i - \hat{x}_i\right| \tag{2}$$
where, $x_i$ and $\hat{x}_i$ represent the radiomics feature and reconstructed feature of the $i$-th sample, respectively, with $m$ being the number of samples.
The classification loss employs the cross entropy loss function, which quantifies the discrepancy between predicted probabilities and actual labels. When predictions perfectly match the labels, the loss is zero; otherwise, it increases. Minimizing this loss aligns the model’s predictions with the true labels. The formula is:
$$L_{\mathrm{CE}} = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right] \tag{3}$$
where, $y_i$ and $\hat{y}_i$ represent the true label and predicted probability of the $i$-th sample, respectively.
The reconstruction loss and classification loss are then weighted and summed. Due to structural differences between the MMT-2M and MMT-1M models, their loss functions vary slightly, defined by Eq. (4) and Eq. (5):
$$L_{\text{MMT-2M}} = \alpha L_{\mathrm{CE}} + \beta L_{\mathrm{MAE}} \tag{4}$$
$$L_{\text{MMT-1M}} = \alpha L_{\mathrm{CE}} + \beta L_{\mathrm{MAE}}^{(1)} + \gamma L_{\mathrm{MAE}}^{(2)} \tag{5}$$
where, $\alpha$, $\beta$, and $\gamma$ are the respective weight parameters, with $L_{\mathrm{MAE}}$ denoting the reconstruction loss of the single missing modality in MMT-2M, and $L_{\mathrm{MAE}}^{(1)}$ and $L_{\mathrm{MAE}}^{(2)}$ denoting the reconstruction losses of the two missing modalities in MMT-1M.
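Under the assumption that the weighted sum takes the form of Eq. (4) and Eq. (5), the loss terms can be combined as in the following sketch (assuming PyTorch; the default weight values are arbitrary placeholders within the 0.1–1 tuning range reported in Section 3).

```python
import torch.nn as nn

mae = nn.L1Loss()            # reconstruction loss (Mean Absolute Error)
ce = nn.CrossEntropyLoss()   # classification loss (cross entropy over logits)

def mmt_2m_loss(logits, labels, rec_miss, true_miss, alpha=0.6, beta=0.4):
    """MMT-2M: classification loss plus one reconstruction term for the single missing modality."""
    return alpha * ce(logits, labels) + beta * mae(rec_miss, true_miss)

def mmt_1m_loss(logits, labels, rec_1, true_1, rec_2, true_2, alpha=0.5, beta=0.3, gamma=0.2):
    """MMT-1M: classification loss plus two reconstruction terms, one per missing modality."""
    return (alpha * ce(logits, labels)
            + beta * mae(rec_1, true_1)
            + gamma * mae(rec_2, true_2))
```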
The MMT model is a Transformer-based classification network designed for missing-modality scenarios. As shown in Figure 4, two MMT architectures were developed: MMT-2M for single-modality missing and MMT-1M for dual-modality missing. Its core principle is to enhance the model’s inter-modal interaction capabilities, enabling the reconstruction of missing modalities.
Figure 4. Architectures of the MMT-2M and MMT-1M models
Taking MMT-2M as an example, it features four inputs: Modal 1 features, Modal 2 features, a Missing Modality Token, and a Classification Token. The Classification Token is a trainable neural network weight parameter updated through backpropagation, facilitating feature reconstruction and classification. The features from Modal 1 and Modal 2 are linearly mapped to a new space via a fully connected layer, then fed into the modal interaction self-attention mechanism. This mapping layer enhances the network’s feature learning capacity. Since the tokens are weight parameters initialized from a normal distribution and contain no prior feature information, they are fed directly into the self-attention mechanism.
After two modal interactions, the missing modal features regenerated by the tokens are output. These are then input into the fully connected neural network alongside Modal 1, Modal 2, and the features generated by the Classification Token to output classification results. At its core, MMT is a classification model that utilizes modal interaction self-attention and reconstruction loss to generate missing modal features via inter-modal connections. This capability not only allows MMT to reconstruct missing modalities but, more importantly, enables it to learn deeper-level feature representations. The reconstruction process serves solely as a means to enhance the network’s classification performance rather than focusing on reconstruction quality itself.
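The overall MMT-2M flow described above can be summarized in the following high-level sketch, which reuses the ModalInteractionAttention and make_mlp sketches given earlier (assuming PyTorch; the feature and embedding dimensions are illustrative, not the study's configuration).

```python
import torch
import torch.nn as nn

class MMT2M(nn.Module):
    """High-level sketch of the MMT-2M flow: project two available modalities, add trainable
    tokens, apply two modal-interaction layers, then classify with the MLP head."""

    def __init__(self, feat_dim=100, dim=64, n_classes=2):
        super().__init__()
        self.proj1 = nn.Linear(feat_dim, dim)               # linear mapping of Modal 1 features
        self.proj2 = nn.Linear(feat_dim, dim)               # linear mapping of Modal 2 features
        self.miss_tok = nn.Parameter(torch.randn(1, dim))   # trainable Missing Modality Token
        self.cls_tok = nn.Parameter(torch.randn(1, dim))    # trainable Classification Token
        self.layers = nn.ModuleList([ModalInteractionAttention(dim, dim) for _ in range(2)])
        self.head = make_mlp(4 * dim, 128, n_classes)       # MLP over the four concatenated streams

    def forward(self, x1, x2):
        b = x1.size(0)
        toks = [self.proj1(x1), self.proj2(x2),
                self.miss_tok.expand(b, -1), self.cls_tok.expand(b, -1)]
        for layer in self.layers:                            # two modal-interaction layers
            toks = layer(*toks)
        rec_missing = toks[2]                                # regenerated missing-modality features (embedding space)
        probs = self.head(torch.cat(toks, dim=-1))           # class probabilities from the MLP head
        return probs, rec_missing
```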
3. Experimental Research
This study constructed a multi-modal MRI dataset encompassing both SCLC and NSCLC brain metastases, which was trained and tested using the MMT network model. The MMT experiments were conducted on an Ubuntu workstation equipped with an NVIDIA GeForce RTX 2080 Ti GPU and Intel® Xeon® Gold 6240 CPU. Training parameters were configured as follows: batch size set to 32, loss weight range between 0.1 and 1, and learning rate range between $1 \times 10^{-6}$ and $1 \times 10^{-5}$. The AdamW optimizer was employed during training. As the MMT network is built on the Transformer architecture, it required substantial memory and computational resources due to its large parameter scale. A total of 200 training epochs were completed with hyperparameter tuning strategies to ensure model convergence and prevent underfitting.
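A minimal training-loop sketch mirroring the reported setup (AdamW optimizer, batch size 32, 200 epochs, learning rate within the stated range) is given below; the stand-in model, synthetic batch, and specific learning rate are placeholders rather than the actual experimental pipeline.

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 2)                      # stand-in for the MMT network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # AdamW, lr within 1e-6 to 1e-5
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 100)                # one illustrative batch of 32 radiomics feature vectors
labels = torch.randint(0, 2, (32,))            # SCLC vs. NSCLC labels

for epoch in range(200):                       # 200 training epochs
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```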
The dataset, sourced from the Affiliated Hospital of Hebei University, comprises multi-modal MRI images of 279 lung cancer patients with brain metastases. It includes six subtypes: SCLC (100 cases), adenocarcinoma (153 cases), squamous cell carcinoma (17 cases), adenosquamous carcinoma (2 cases), large cell lung cancer (6 cases), and other brain metastases (1 case). Each patient received T1WI, FLAIR, and DWI scans. In clinical practice, specialists classify these subtypes into SCLC (100 cases) and NSCLC (179 cases), as shown in Table 1. To ensure accurate tumor localization, MRI images were annotated by a senior clinical expert, enabling comprehensive analysis of multi-modal data.
| Disease Type | Number of Patients | Age (Years) | Number of Lesions | Gender |
|---|---|---|---|---|
| SCLC | 100 | 62.4 $\pm$ 9.7 | 439 | 72/28 |
| NSCLC | 179 | 60.6 $\pm$ 9.5 | 679 | 109/180 |
Medical imaging contains not only visible structures but also latent information. This study employed radiomics technology to extract and analyze such hidden data. The Python-based open-source package PyRadiomics [20] was utilized to extract morphological, textural, and statistical features from the images. In MRI image processing, feature extraction requires preprocessing filters. Using filters prior to extraction significantly improves the accuracy and robustness of subsequent analyses. The “original” filter refers to unprocessed MRI images retaining complete original information. Wavelet filters are used for noise removal and edge detection. The Laplacian of Gaussian (LoG) filter combines Gaussian filtering with Laplacian operations to enhance edges. Other filters used include square, square root, logarithmic, exponential, gradient, and Local Binary Pattern (LBP), each serving to highlight specific intensity regions or texture details.
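A hedged example of configuring PyRadiomics with the filter classes listed above is given below; the bin width, LoG sigma values, and file paths are placeholders rather than the study's actual settings.

```python
from radiomics import featureextractor

# Configure the extractor; binWidth and sigma are illustrative values.
extractor = featureextractor.RadiomicsFeatureExtractor(binWidth=25, sigma=[1.0, 3.0, 5.0])

# Enable the original image plus the derived-image filters mentioned in the text.
for image_type in ["Original", "Wavelet", "LoG", "Square", "SquareRoot",
                   "Logarithm", "Exponential", "Gradient", "LBP2D"]:
    extractor.enableImageTypeByName(image_type)

extractor.enableAllFeatures()  # shape, texture, and first-order statistical feature classes

# Extract features for one modality from an image/mask pair (placeholder paths).
features = extractor.execute("t1wi_image.nii.gz", "tumor_mask.nii.gz")
print(len(features), "features extracted")
```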
Data augmentation expands training datasets by generating new samples through transformations of existing data. To address data imbalance, the Borderline Synthetic Minority Oversampling Technique (Borderline SMOTE) method was employed for oversampling [21]. As an enhanced version of traditional SMOTE, Borderline SMOTE focuses on creating synthetic samples near category boundaries while avoiding the interpolation of noisy points. This approach effectively prevents the generation of inaccurate synthetic samples, thereby minimizing adverse impacts on model performance. Figure 5 shows the distribution of data before and after using Borderline SMOTE for data augmentation.
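The oversampling step can be illustrated with the imbalanced-learn implementation of Borderline SMOTE; in the sketch below the feature matrix is synthetic and only the class sizes mirror the SCLC/NSCLC imbalance.

```python
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(279, 50))                 # 279 samples, 50 selected radiomics features (synthetic)
y = np.array([0] * 100 + [1] * 179)            # 0 = SCLC (100 cases), 1 = NSCLC (179 cases)

sampler = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_res, y_res = sampler.fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # minority class oversampled near the class boundary
```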
Figure 5. Data distribution before and after Borderline SMOTE augmentation
4. Results
Experimental results were evaluated using five metrics: accuracy, Area Under the Curve (AUC), precision, sensitivity, and specificity. Accuracy measures the proportion of correct classifications. AUC quantifies the model’s classification performance across different thresholds. Precision indicates the accuracy of positive predictions. Sensitivity reflects the model’s ability to identify true positives, while specificity measures the capacity to distinguish true negatives. The formulas are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$
where, $TP$ (True Positives) refers to positive samples correctly predicted as positive, $TN$ (True Negatives) refers to negative samples correctly predicted as negative, $FP$ (False Positives) refers to negative samples incorrectly predicted as positive, and $FN$ (False Negatives) refers to positive samples incorrectly predicted as negative.
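For reference, the four confusion-matrix-based metrics can be computed directly from the counts defined above, as in this small sketch (the counts in the example are hypothetical).

```python
def classification_metrics(tp, tn, fp, fn):
    """Confusion-matrix-based metrics defined above (AUC requires the full score distribution)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Example with hypothetical counts from a binary SCLC/NSCLC test split.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```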
To validate MMT’s superior performance under missing modality conditions, a comparative experiment with MFN-VAE was conducted. The results are presented in Table 2 and Figure 6, where a √ under T1WI, FLAIR, or DWI indicates an available input modality.
| Model | T1WI | FLAIR | DWI | Accuracy | AUC | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|
| MMT | √ | | | 0.877 ± 0.033 | 0.894 ± 0.026 | 0.860 ± 0.046 | 0.897 ± 0.047 | 0.850 ± 0.068 |
| MMT | | √ | | 0.863 ± 0.047 | 0.867 ± 0.077 | 0.858 ± 0.075 | 0.883 ± 0.056 | 0.837 ± 0.133 |
| MMT | | | √ | 0.863 ± 0.037 | 0.846 ± 0.070 | 0.885 ± 0.038 | 0.836 ± 0.044 | 0.892 ± 0.033 |
| MMT | √ | √ | | 0.905 ± 0.037 | 0.899 ± 0.016 | 0.926 ± 0.043 | 0.883 ± 0.026 | 0.923 ± 0.057 |
| MMT | √ | | √ | 0.897 ± 0.032 | 0.867 ± 0.048 | 0.924 ± 0.029 | 0.867 ± 0.022 | 0.924 ± 0.048 |
| MMT | | √ | √ | 0.897 ± 0.029 | 0.874 ± 0.089 | 0.940 ± 0.054 | 0.847 ± 0.016 | 0.942 ± 0.055 |
| MFN-VAE | √ | | | 0.849 ± 0.044 | 0.866 ± 0.075 | 0.819 ± 0.076 | 0.897 ± 0.038 | 0.806 ± 0.064 |
| MFN-VAE | | √ | | 0.852 ± 0.040 | 0.869 ± 0.063 | 0.833 ± 0.060 | 0.888 ± 0.046 | 0.816 ± 0.085 |
| MFN-VAE | | | √ | 0.838 ± 0.040 | 0.855 ± 0.064 | 0.816 ± 0.076 | 0.877 ± 0.063 | 0.806 ± 0.068 |
| MFN-VAE | √ | √ | | 0.877 ± 0.031 | 0.902 ± 0.049 | 0.879 ± 0.066 | 0.883 ± 0.038 | 0.878 ± 0.069 |
| MFN-VAE | √ | | √ | 0.880 ± 0.032 | 0.879 ± 0.051 | 0.865 ± 0.051 | 0.900 ± 0.022 | 0.859 ± 0.050 |
| MFN-VAE | | √ | √ | 0.849 ± 0.022 | 0.872 ± 0.037 | 0.825 ± 0.049 | 0.882 ± 0.031 | 0.815 ± 0.031 |
| MFN-VAE | √ | √ | √ | 0.888 ± 0.026 | 0.920 ± 0.026 | 0.884 ± 0.038 | 0.924 ± 0.022 | 0.850 ± 0.044 |
Figure 6. Comparison of MMT and MFN-VAE results under different modality combinations
Table 2 results indicate that MMT achieves an accuracy of 0.863–0.877 in single-modality scenarios, surpassing MFN-VAE’s accuracy of 0.838–0.852 in the same setting. This performance level is comparable to MFN-VAE’s results across dual-modal conditions (0.849–0.880). When operating in dual-modal environments (missing one modality), MMT demonstrates an accuracy range of 0.897 to 0.905, outperforming MFN-VAE’s accuracy of 0.888 in full-modal scenarios. These findings demonstrate MMT’s capability to effectively leverage multi-modal feature information. By integrating existing modalities and learning inter-modal relationships through its self-attention mechanism, MMT generates contextual features for missing modalities. This approach enables MMT to achieve performance parity with full-modal models even when data is incomplete.
The Receiver Operating Characteristic (ROC) performance of the best-performing MFN-VAE and MMT models is illustrated in Figure 7.
The loss function is a crucial component of MMT optimization. As shown in Figure 8, while the overall MMT loss gradually decreases during epoch iterations, the validation loss consistently remains higher than the training loss, indicating a certain degree of overfitting. The model demonstrates good convergence of reconstruction loss during training, but shows significant fluctuations in validation loss with almost no convergence trend.
In contrast, the classification loss performs well with only slight overfitting. This discrepancy may stem from MMT directly utilizing the output of the modal interaction self-attention mechanism as the reconstructed feature. This architectural choice prioritizes classification performance over reconstruction fidelity, which explains why the reconstruction loss on the validation set fails to converge.
Figure 7. ROC curves of the best-performing MFN-VAE and MMT models
Figure 8. Training and validation loss curves of MMT
Based on the training weights of the models under the various modality combinations, the combination of T1WI and FLAIR was identified as optimal, as shown in Table 3 (a √ indicates an available input modality). The MMT loss function is a weighted sum of the classification loss and the reconstruction loss. In this study, the hyperparameter importance assessment method of Hutter et al. [22] was used to evaluate the loss-function weights. Overall, the highest importance score was assigned to the classification loss weight, followed by the reconstruction loss weights for T1WI and DWI. The reconstruction loss weight for the FLAIR modality consistently scored around 0.1, significantly lower than the other parameters. The only deviation from this pattern occurred in the single-modality FLAIR configuration, whose classification weight scored just 0.117 while the reconstruction scores for both T1WI and DWI exceeded it. Overall, the classification loss, T1WI reconstruction loss, and DWI reconstruction loss carry relatively greater weight in MMT, while the FLAIR reconstruction loss plays a secondary role. Accordingly, optimization of these three loss components was prioritized during model tuning.
| Model | T1WI | FLAIR | DWI | Classification Loss | Reconstruction Loss (T1WI) | Reconstruction Loss (FLAIR) | Reconstruction Loss (DWI) |
|---|---|---|---|---|---|---|---|
| MMT-1M | √ | | | 0.351 | - | 0.119 | 0.248 |
| MMT-1M | | √ | | 0.117 | 0.216 | - | 0.160 |
| MMT-1M | | | √ | 0.298 | 0.298 | 0.126 | - |
| MMT-2M | √ | √ | | 0.262 | - | - | 0.378 |
| MMT-2M | √ | | √ | 0.341 | - | 0.138 | - |
| MMT-2M | | √ | √ | 0.292 | 0.303 | - | - |
| Average | | | | 0.277 | 0.272 | 0.126 | 0.262 |
5. Conclusions
This study proposed an MMT network to address the common issue of missing modalities in medical diagnosis. Through comparative validation with the full-modality MFN-VAE network, MMT demonstrates superior performance. MMT’s mechanisms were analyzed via loss function visualization and importance assessment, enhancing interpretability. MMT achieves a maximum accuracy of 0.905 with a single missing modality, surpassing the full-modality MFN-VAE by 0.017. This provides a reliable solution for clinical diagnosis in scenarios where data modalities are incomplete.
Conceptualization, Y.D., Y.Q.M., and S.L.; methodology, Y.D. and Y.Q.M.; software, Y.D. and Y.Q.M.; validation, K.J., Z.S.S., F.Y.G., and Z.W.C.; formal analysis, Y.D. and Y.Q.M.; investigation, Y.D., Y.Q.M., K.J., and Z.S.S.; resources, S.L. and L.Y.X.; data curation, Y.D., Y.Q.M., F.Y.G., and Z.W.C.; writing—original draft preparation, Y.D. and Y.Q.M.; writing—review and editing, S.L., L.Y.X., and Y.D.; visualization, Y.D. and Y.Q.M.; supervision, S.L. and L.Y.X.; project administration, S.L. and L.Y.X.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.
The data (multi-modal MRI images and extracted radiomics features) supporting our research results are under privacy or ethical restrictions. The data used to support the research findings were sourced from the Affiliated Hospital of Hebei University and are available from the corresponding author upon request for researchers who meet the criteria for accessing confidential medical data.
The authors declare that they have no conflicts of interest.
