An Intelligent PSO-Optimized Informer Framework for Predictive Maintenance of Marine Shafting Systems Based on Bearing Vibration Analysis
Abstract:
Rolling bearings are critical components of marine shafting power transmission systems, and accurate prediction of their vibration signal trends is essential for predictive maintenance. To address the limited adaptability of conventional time-series forecasting models under varying operating conditions and their insufficient ability to capture strong noise and abrupt changes, this study proposes a vibration signal prediction method that integrates particle swarm optimization (PSO) with an improved Informer model. PSO is used to adaptively optimize key Informer hyperparameters for different operating conditions, while a rolling time-window mechanism is introduced to enhance the capture of abrupt signal variations. In addition, a mixture of sparse attention (MoSA) encoder with a collaborative dense-head/sparse-head structure is designed to balance global temporal dependency modeling and local fault feature extraction. Experimental results on the Case Western Reserve University (CWRU) bearing fault dataset show that the proposed model outperforms LSTM, Transformer, Informer, iTransformer, and Flowformer in terms of MSE, MAE, and RMSE. The model achieves an MSE of 0.2015, which is 25.5% lower than that of the second-best iTransformer model. It also demonstrates robust performance under four different bearing operating states, confirming its adaptability to complex operating conditions. The proposed method provides a promising technical route for the predictive maintenance of rolling bearings in marine shafting systems.
1. Introduction
Rolling bearings are key components in the power transmission system of marine shafting, and their operating condition is directly related to the safe and stable operation of ships. As a typical form of industrial time-series data, vibration signals of rolling bearings contain rich information such as vibration, noise, and temperature, which can objectively reflect the temporal evolution of equipment operating conditions as well as the dynamic characteristics of complex mechanical motion. In-depth analysis and pattern extraction from vibration signals not only enable accurate prediction of future signal trends but also facilitate the timely identification of potential fault risks, thereby providing data support for the predictive maintenance of rolling bearings in marine shafting systems and ultimately ensuring the safe and stable operation of equipment [1].
In recent years, with the rapid development of deep learning techniques, time-series forecasting models based on attention mechanisms have attracted widespread attention in industrial applications. The Transformer model demonstrates strong capability in capturing long-range dependencies through its self-attention mechanism [2]. However, the standard Transformer suffers from high computational complexity when handling ultra-long sequences, as the computational cost of its attention mechanism grows quadratically with the input sequence length. This limitation constrains its efficiency in industrial long time-series applications. To address this issue, Zhou et al. [3] proposed the Informer model, which significantly improves the efficiency of long-sequence time-series forecasting through the ProbSparse self-attention mechanism. Meanwhile, Informer incorporates designs such as self-attention distillation and an improved encoder–decoder architecture, enabling efficient feature extraction and sequence generation while maintaining prediction accuracy. These advantages have led to successful applications of Informer in fields such as power load forecasting and traffic flow prediction [4], [5].
Despite these advances, directly applying Informer to vibration signal prediction of rolling bearings in marine shafting systems still faces several challenges. First, operating conditions during ship navigation are complex and highly variable. Vibration signals exhibit significant dynamic variations under different speeds and load conditions, making it difficult for models with fixed hyperparameters to adapt to varying data distributions, often resulting in suboptimal solutions [6]. Conventional hyperparameter tuning methods, such as grid search and random search, are not only inefficient but also fail to guarantee optimal parameter combinations in practical engineering scenarios. Second, bearing vibration signals are often characterized by strong background noise and abrupt fault features. In the early stage of faults, impact signals are typically weak and short-lived, making them easily masked by noise [7]. Traditional models struggle to simultaneously capture global temporal dependencies and focus on local abrupt features, leading to insufficient sensitivity to early fault indications. Furthermore, bearing vibration signals exhibit both low-frequency periodicity and high-frequency impact components. A single attention mechanism is insufficient to address feature extraction across these different temporal scales: global attention helps capture periodic patterns but may smooth out local abrupt changes, whereas sparse attention can focus on key points but may lose global contextual information [8].
To address these issues, researchers have explored combining optimization algorithms with deep learning models. The particle swarm optimization (PSO) algorithm proposed by Kennedy and Eberhart is a classical swarm intelligence optimization method with advantages such as strong global search capability, fast convergence, and independence from gradient information [9]. It has been successfully applied to hyperparameter optimization in neural networks [10]. At the same time, improving attention mechanisms remains an important direction for enhancing model performance. The mixture of experts (MoE) architecture proposed by Shazeer et al. [11] provides a new perspective for balancing global and local feature extraction, enabling adaptive feature learning through the collaborative operation of multiple expert networks.
Based on the above analysis, a PSO-Informer-based method for vibration signal prediction of rolling bearings is proposed in this study. The method integrates PSO with an improved Informer model, leveraging the global optimization capability of PSO to adaptively tune key Informer hyperparameters, thereby addressing the randomness and poor adaptability associated with manual parameter tuning in practical applications. To improve the detection of abrupt features under strong noise conditions, a mixture of sparse attention (MoSA) encoder is designed, adopting a collaborative dense-head/sparse-head architecture to balance global temporal dependency modeling and local fault feature extraction. In addition, a rolling time-window mechanism is introduced to dynamically segment time-series data and update input sequences in a sliding manner, enhancing the model’s ability to capture abrupt signal variations and perform continuous prediction. Experimental results on the Case Western Reserve University (CWRU) bearing fault dataset demonstrate the superiority of the proposed method in terms of prediction accuracy and adaptability to varying operating conditions.
The remainder of this paper is organized as follows. Section 2 introduces the fundamental principles of the model, including the core mechanisms of Informer and the PSO algorithm. Section 3 presents the overall framework of the proposed PSO-Informer prediction model, including the hyperparameter optimization module, the rolling time-window mechanism, and the design of the MoSA encoder. Section 4 describes the experimental setup and results analysis, including data sources, implementation details, evaluation metrics, data processing, model training, comparative experiments, analysis of parameter optimization effects, module effectiveness, adaptability to operating conditions, and study limitations. Section 5 concludes the paper and outlines directions for future work.
2. Fundamental Principles of the Model
PSO-Informer is developed as an improved version of Informer and can effectively adapt to the dynamic characteristics of marine shafting systems under different operating conditions. Informer is a time-series forecasting algorithm derived from the Transformer architecture. By introducing a sparse attention mechanism, it improves computational efficiency in long-sequence modeling and forecasting tasks.
Informer adopts an improved encoder–decoder architecture. The encoder is designed to perform efficient feature extraction and dimensionality reduction for long-sequence inputs. After the input sequence is transformed into a high-dimensional vector representation through an embedding layer, key information is selected by means of the sparse attention mechanism and self-attention distillation, thereby reducing redundant computation. The decoder is structurally aligned with the encoder. While retaining the encoder–decoder attention mechanism, it introduces a masking mechanism to prevent information leakage and further improves prediction efficiency through a probabilistic sampling strategy, ultimately producing the required long-sequence forecasting results. The main modules of Informer and their characteristics are as follows.
The core role of the ProbSparse self-attention mechanism is to substantially reduce the computational complexity of self-attention while preserving modeling performance. This mechanism evaluates the importance of each query vector, samples a small number of representative queries to construct a sparse query matrix, and then performs attention computation with the key and value vectors. In this way, it avoids the redundant full matching between all queries and keys in the traditional Transformer, reducing the computational complexity of attention from $O(L^2)$ to $O(L \log L)$.
In Eq. (1), $Q_s$ denotes the probabilistic sparse matrix obtained by sparsifying $Q$. Combined with the multi-head attention mechanism, which enables parallel attention over multiple representation subspaces, it can effectively prevent overfitting; $d_k$ denotes the dimensionality of the key vector and is used to scale the attention score so as to avoid gradient vanishing.
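For reference, the ProbSparse attention of Zhou et al. [3] is commonly written in the form below, using the symbols defined above. This is a sketch of the standard formulation and may differ in notation from the authors' own typeset equation. The sparsity measure $M(q_i, K)$, which ranks queries so that only the highest-scoring ones are kept in $Q_s$, follows the original Informer paper, and $L_K$ (the number of keys) is a symbol introduced here for illustration.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q_s K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
M(q_i, K) = \ln \sum_{j=1}^{L_K} \exp\left(\frac{q_i k_j^{\top}}{\sqrt{d_k}}\right) - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d_k}}
$$

A query whose score distribution over the keys is close to uniform yields a small $M(q_i, K)$ and contributes little beyond an average of $V$, which is why it can be dropped from $Q_s$ at low cost.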
Self-attention distillation is the core compression module in the Informer encoder. Its function is to compress the dimensionality of long-sequence feature maps without losing critical feature information, thereby further reducing the computational burden of subsequent operations. The main idea is to stack convolution and pooling operations to downsample the feature maps produced by the attention mechanism, preserving key features with strong representational ability while filtering out redundant information.
The main effects of self-attention distillation include feature dimensionality reduction, where pooling operations shrink the size of feature maps at a fixed ratio and substantially reduce the input scale of subsequent layers; key information preservation, where a hierarchical distillation structure is adopted and each distillation layer is based on the effective features from the previous layer, ensuring that core dependency relationships are retained; improved training stability, where the compressed feature vectors have lower dimensionality, allowing smoother gradient propagation and alleviating the gradient vanishing problem in deep networks; and optimized computational efficiency, where, in combination with the ProbSparse self-attention mechanism, a “dual-compression” effect is formed, further enhancing the model’s ability to process ultra-long sequences.
Informer inherits and adapts residual connections and layer normalization to ensure stable training of deep networks. Residual connections directly add the input of each sublayer, including the sparse attention layer, the distillation layer, and the feedforward network layer, to its output, thereby constructing shortcut paths that effectively alleviate gradient vanishing in deep networks while accelerating model convergence. Unlike the Transformer, Informer further introduces a feature alignment operation after the distillation layer to ensure dimensional consistency between the compressed features and the original input features. Layer normalization normalizes the output of each sublayer by adjusting the mean and variance of feature vectors to a fixed range, thereby reducing internal covariate shift and stabilizing the training process. To account for the distributional characteristics of long-sequence features, Informer introduces a local window-weighting strategy in layer normalization, which improves adaptability to local feature distributions in the sequence.
Since Informer follows the Transformer architecture and contains neither recurrent nor convolutional structures, it cannot inherently capture positional information in the sequence. Therefore, positional encoding is introduced to embed positional information into the input. Informer employs sinusoidal positional encoding, which is expressed in Eqs. (2) and (3):
In these equations, $pos$ denotes the sequence position index, $i$ denotes the feature dimension index, and $d_{model}$ denotes the dimensionality of the embedding vector. This positional encoding provides continuous and distinguishable positional information, ensuring that the model can accurately capture temporal dependencies within the sequence.
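As an illustration, the sinusoidal encoding can be sketched as follows, assuming the standard Transformer form $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The function name and the example sizes (768 positions, 64 dimensions) are illustrative choices, not part of the original implementation.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal positional encodings."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # even dimensions use sine, odd dimensions use cosine
            table[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return table


# Illustrative sizes only: 768 positions, 64-dimensional embedding
pe = sinusoidal_positional_encoding(seq_len=768, d_model=64)
```

Each pair of dimensions oscillates at a different wavelength, so every position receives a distinct vector while nearby positions remain similar.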
On the basis of retaining the encoder–decoder attention mechanism of the Transformer, the Informer decoder introduces a dual-masking mechanism and a probabilistic sampling strategy. The dual-masking mechanism consists of a future-information mask and a padding mask. The future-information mask prevents the decoder from accessing information from future positions when predicting the current position, thereby avoiding information leakage. The padding mask filters out padding elements in the input sequence, ensuring that attention computation is performed only on valid data.
The probabilistic sampling strategy is applied when the decoder generates the target sequence. Candidate outputs are selected by sampling from a probability distribution instead of using conventional greedy search. This not only improves the diversity of predictions but also reduces the computational cost of long-sequence generation, thereby meeting the efficiency requirements of vibration signal prediction for rolling bearings in marine shafting systems.
The PSO algorithm is a classical swarm-intelligence metaheuristic optimization algorithm and can be regarded as a simulation of the social foraging behavior of biological groups such as flocks of birds and schools of fish. In this algorithm, each potential solution to the optimization problem is treated as a particle in the search space. Through the sharing and interaction of individual and collective experience, the entire swarm is guided to move cooperatively toward the region containing the optimal solution. The core mechanisms of the algorithm are described as follows.
In the PSO algorithm, each particle represents a candidate solution to the optimization problem, and its state is described by a position vector $x_{i}=\left(x_{i1}, x_{i2}, \ldots, x_{iD}\right)$ and a velocity vector $v_{i}=\left(v_{i1}, v_{i2}, \ldots, v_{iD}\right)$, where $D$ is the dimensionality of the optimization problem. During initialization, the position and velocity of each particle in the population are randomly generated within the predefined search range, thereby laying the foundation for subsequent iterative search. In the present context, this means that the PSO algorithm can iteratively search the selected hyperparameter space to identify better hyperparameter combinations.
During the flight process, each particle records and updates its own historically best position $pbest_i$. Meanwhile, the entire swarm shares and tracks the position $gbest$ with the highest fitness among all particles. The individual best position and the global best position jointly drive the swarm to cluster toward the optimal region.
In each generation of swarm iteration, every particle updates its velocity and position according to its current velocity as well as its own historical experience and the collective experience of the swarm. The update equations are given in Eqs. (4) and (5):
In these equations, $\omega$ is the inertia weight, which controls the influence of the particle's previous velocity so as to balance global exploration and local exploitation; $c_1$ and $c_2$ are the cognitive and social acceleration coefficients, respectively, which determine the tendency of particles to move toward the individual best and global best positions; $r_1$ and $r_2$ are random numbers uniformly distributed in [0,1] that introduce stochasticity into the search process.
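The update rule described above can be sketched as follows, assuming the standard PSO form $v_i \leftarrow \omega v_i + c_1 r_1 (pbest_i - x_i) + c_2 r_2 (gbest - x_i)$ followed by $x_i \leftarrow x_i + v_i$. The function name `pso_step` is hypothetical, the default coefficients are the initial values listed in Table 3, and drawing fresh random numbers per dimension is one common convention rather than a detail confirmed by the text.

```python
import random


def pso_step(x, v, pbest, gbest, omega=0.9, c1=1.8, c2=1.2, rng=random):
    """One velocity and position update for a single particle (standard PSO)."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # assumed: fresh draws per dimension
        vd = (omega * v[d]
              + c1 * r1 * (pbest[d] - x[d])   # pull toward the particle's own best
              + c2 * r2 * (gbest[d] - x[d]))  # pull toward the swarm's best
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v


# A particle already sitting at both best positions keeps only its inertia term
x, v = pso_step([0.0, 0.0], [0.1, -0.1], pbest=[0.0, 0.0], gbest=[0.0, 0.0])
```

In the usage line both attraction terms vanish, so the new velocity is simply $\omega$ times the old one, which makes the role of the inertia weight easy to see.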
3. Particle Swarm Optimization-Informer Prediction Model
The PSO-Informer-based rolling bearing vibration signal prediction framework is intended to improve prediction accuracy and adaptability to varying operating conditions by using the PSO algorithm to tune the model to the vibration signals of rolling bearings in marine shafting systems under different working conditions. Its basic framework is illustrated in Figure 1.

First, the PSO algorithm is used to automatically search for the optimal hyperparameter configuration of the model. Then, to simulate the demand for continuous online prediction during condition monitoring of marine shafting systems, a rolling time-window mechanism is adopted to organize sequential data and drive the Informer model for efficient multi-step forecasting. Without repeated manual trial and error, this framework enables dynamic adaptation to specific operating conditions of marine shafting systems and achieves high-accuracy rolling prediction.
In vibration signal prediction for marine shafting systems, model performance depends heavily on hyperparameter settings. To avoid the trial-and-error cost and randomness of manual hyperparameter tuning in practical engineering applications, this module introduces an improved PSO algorithm with adaptive inertia weight and an elite learning strategy, so as to search for the optimal hyperparameter combination in an automated and global manner.
Step 1: Hyperparameter space encoding and population initialization
The set of hyperparameters to be optimized in the Informer model is defined as $H$, which includes the number of attention heads, the numbers of encoder and decoder layers, the learning rate, and other parameters. A typical hyperparameter set is listed in Table 1.
Hyperparameter | Description | Search Range |
d_model | Feature embedding dimension | {64,128,256} |
n_heads | Number of attention heads | {4,8,16} |
e_layers | Number of encoder stacking layers | {1,2,3} |
d_layers | Number of decoder stacking layers | {1,2,3} |
d_ff | Hidden dimension of the feedforward network | {256,512,1024} |
dropout | Dropout rate | {0.1,0.3} |
factor | Sparsity factor | {3} |
For each particle $i$, the position vector $x_i$ encodes a hyperparameter combination as its initial position, while the velocity vector $v_i$ defines the search direction and step size of the particle. The population is initialized within the preset range to search for the optimal hyperparameter combination.
Step 2: Adaptive iterative optimization
To achieve iterative optimization of hyperparameter combinations, the individual historical best position $pbest_i$ and the global best position $gbest$ are used to update the particle state. Considering the dynamic nature of vibration signals from rolling bearings in marine shafting systems, a nonlinearly decreasing adaptive inertia weight $\omega$ is introduced to balance global exploration in early iterations against local exploitation in later ones, allowing the algorithm to select appropriate hyperparameter combinations according to the iteration status under different operating conditions. Meanwhile, an elite learning strategy is incorporated so that some particles move toward the historically best global particle with a certain probability, thereby reducing the likelihood of becoming trapped in local optima [12]. The update equations are given in Eqs. (6)–(8):
In Eqs. (6)–(8), $c_1$, $c_2$, and $c_3$ denote the acceleration coefficients; $r_1$, $r_2$, and $r_3$ are random variables uniformly distributed in [0,1]; $t$ represents the current iteration index; and $\gamma$ is the decay rate.
$E_i^t$ denotes the elite learning term, and $\delta_i$ is an indicator function, where $\delta_i$ = 1 if the particle is selected as an elite particle, and $\delta_i$ = 0 otherwise.
$\varepsilon \sim \mathcal{N}\left(0, \sigma^2 I\right)$ represents the Gaussian perturbation term, where the perturbation intensity $\sigma$ decreases with the iteration number.
Step 3: Fitness evaluation and output of the optimal solution
The particle fitness $\mathrm{Fitness}(x_i)$ is obtained by training the prediction model once with the hyperparameter combination encoded by $x_i$ and evaluating it on the validation set, with the mean squared error (MSE) of the prediction results used as the evaluation metric. After convergence, the globally optimal hyperparameter combination $H^*$ is output.
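Because the search ranges in Table 1 are discrete sets while PSO positions are continuous, some decoding step is needed before a particle can be evaluated. The sketch below shows one possible scheme, assuming each coordinate lies in [0,1] and is mapped to an index in the corresponding candidate list. The names `SEARCH_SPACE`, `decode`, `fitness`, and the callback `train_and_validate` are hypothetical, and the text does not state which encoding the authors actually used.

```python
# Candidate values copied from Table 1
SEARCH_SPACE = {
    "d_model": [64, 128, 256],
    "n_heads": [4, 8, 16],
    "e_layers": [1, 2, 3],
    "d_layers": [1, 2, 3],
    "d_ff": [256, 512, 1024],
    "dropout": [0.1, 0.3],
    "factor": [3],
}


def decode(position):
    """Map a continuous position in [0, 1]^D to one discrete value per hyperparameter."""
    config = {}
    for coord, (name, choices) in zip(position, SEARCH_SPACE.items()):
        coord = min(max(coord, 0.0), 1.0)  # clip to the search bounds
        index = min(int(coord * len(choices)), len(choices) - 1)
        config[name] = choices[index]
    return config


def fitness(position, train_and_validate):
    """Fitness is the validation MSE returned by the (hypothetical) training callback."""
    return train_and_validate(decode(position))


example = decode([0.5] * len(SEARCH_SPACE))
```

Here `train_and_validate` stands in for a full training run of the model with the decoded configuration. Since each fitness call is that expensive, the population size and iteration count directly bound the total tuning cost.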
Considering that vibration signals of rolling bearings in marine shafting systems are continuous time series during ship operation, a fixed rolling time window is used to locally sample long sequences for continuous multi-step prediction and online monitoring, thereby providing the model with training and prediction units containing temporal context.
Let the time series be $\{z_1, z_2, \ldots, z_T\}$, the input window length be $L$, and the prediction length be $S$. At time $t$ ($L \leq t \leq T-S$), an input–output pair is constructed, where $\{z_{t-L+1}, \ldots, z_t\}$ serves as the historical context input and $\{z_{t+1}, \ldots, z_{t+S}\}$ is the future target sequence.
Meanwhile, starting from the initial segment of the time series, a series of overlapping yet continuous sample pairs is dynamically generated with a fixed step size. This ensures that the model can accurately learn the pattern of predicting future sequences based on historical information and effectively alleviates the problem of error accumulation in long-sequence forecasting.
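The window construction described above can be sketched as follows. The function name `rolling_windows` and the `step` argument are illustrative, and the toy series of ten points is used only to make the indexing visible (the experiments use $L = 768$ and $S = 256$, see Table 2).

```python
def rolling_windows(series, input_len, pred_len, step=1):
    """Return (history, target) pairs: history has input_len points, target the next pred_len."""
    pairs = []
    last_start = len(series) - input_len - pred_len
    for start in range(0, last_start + 1, step):
        history = series[start:start + input_len]
        target = series[start + input_len:start + input_len + pred_len]
        pairs.append((history, target))
    return pairs


# Toy example: T = 10, L = 4, S = 2, step size 2
pairs = rolling_windows(list(range(10)), input_len=4, pred_len=2, step=2)
```

With a step smaller than the input length, consecutive windows overlap, so each sample is seen in several temporal contexts. With a step of 1, the number of pairs is $T - L - S + 1$, matching the range $L \leq t \leq T-S$.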
In the conventional Informer model, all attention heads adopt the same sparse pattern, and different heads lack specialized functional roles. In addition, global sparsity may excessively compress feature variations within local windows, eventually leading to the loss of fine-grained feature details. Given that vibration signals of marine shafting systems exhibit low-frequency periodicity, strong noise interference, and dynamic switching of operating conditions, this module introduces a MoSA encoder into the Informer architecture to identify locally abrupt fault features in marine shafting vibration signals. In recent years, mixed attention mechanisms have shown notable advantages in time-series forecasting tasks by effectively balancing global dependency modeling and local feature extraction through multi-expert collaboration [13].
The MoSA-based mixed sparse attention layer is used to replace the multi-head attention layer in the Informer architecture. This module adopts a collaborative “dense-head/sparse-head” structure, as illustrated in Figure 2.

Fixed dense attention heads: These heads ensure stable capture of global temporal dependencies and model convergence during training, making them suitable for modeling the low-frequency periodic patterns of shafting signals.
Learnable sparse attention heads: Each sparse head is regarded as an independent expert and is equipped with a lightweight two-layer MLP router. Taking the feature representation $X \in \mathbb{R}^{T \times d}$ from the previous layer as input, this router adaptively learns the token selection strategy $s \in[ 0,1]^T$ through end-to-end training and outputs a selection score for each token. This design, which dynamically selects key tokens through a learnable router, is consistent with recent attention-based feature enhancement mechanisms in multiscale time-series networks [14].
In Eq. (9), $\sigma$ denotes the Sigmoid function. Each sparse head independently selects the top-$k$ tokens according to its selection scores $s$ to form a subset $S$.
During attention computation, the sparse heads perform query ($Q$), key ($K$), and value ($V$) projection only on the selected tokens and calculate the internal attention weights accordingly. The corresponding expression is given in Eq. (10):
The resulting features are weighted according to the router output scores, which not only enhances the ability to capture local abrupt fault features but also suppresses strong noise interference through the token selection mechanism. At the same time, the dynamic selection strategy of sparse heads can flexibly adapt to changes in signal distribution caused by switching operating conditions, enabling adaptive feature extraction under different speeds and load conditions. This idea of multiscale feature extraction and dynamic adaptation is consistent with the design philosophy of recently proposed rich-spatial multiscale Transformer architectures for complex time-series forecasting problems [15].
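A single sparse head of this kind can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not confirm: the router uses a ReLU hidden layer, tokens that are not selected receive a zero output from this head, and no causal mask is applied. The function name `sparse_head` and all weight matrices are hypothetical and randomly initialized.

```python
import numpy as np


def sparse_head(X, W1, W2, Wq, Wk, Wv, k):
    """One router-gated sparse attention head over the top-k scored tokens."""
    T = X.shape[0]
    hidden = np.maximum(X @ W1, 0.0)                  # router layer 1 (assumed ReLU)
    scores = 1.0 / (1.0 + np.exp(-(hidden @ W2)))     # sigmoid gives s in [0, 1]^T
    scores = scores.reshape(T)
    idx = np.sort(np.argsort(scores)[-k:])            # top-k tokens, kept in time order
    Xs = X[idx]
    Q, K, V = Xs @ Wq, Xs @ Wk, Xs @ Wv               # projections on selected tokens only
    logits = Q @ K.T / np.sqrt(K.shape[1])
    logits -= logits.max(axis=1, keepdims=True)       # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    out = np.zeros((T, V.shape[1]))
    out[idx] = scores[idx, None] * (attn @ V)         # weight by router score, scatter back
    return out, idx, scores


rng = np.random.default_rng(0)
T, d, h, dk, k = 16, 8, 4, 8, 4                       # toy sizes for illustration
X = rng.normal(size=(T, d))
out, idx, scores = sparse_head(
    X,
    rng.normal(size=(d, h)), rng.normal(size=(h, 1)),
    rng.normal(size=(d, dk)), rng.normal(size=(d, dk)), rng.normal(size=(d, dk)),
    k,
)
```

Because attention is computed over only $k$ tokens, the cost of this head scales with $k^2$ rather than $T^2$. Multiplying the output by the router score is also what lets gradients reach the router during end-to-end training.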
This mixed sparse attention module is seamlessly integrated into the Informer encoder–decoder architecture. Through residual connections and layer normalization, feature flow and training stability are ensured, significantly improving both the time-series prediction accuracy and computational efficiency of the model for complex vibration signals in marine shafting systems. Recent studies have shown that methods integrating multiscale frequency feature extraction with attention mechanisms can effectively enhance the capture of information at different granularities in time-series data, thereby offering new directions for the prediction of complex industrial signals [16].
4. Experimental Results and Analysis
The experiments in this section were conducted using the bearing fault dataset from CWRU. The experimental platform is shown in Figure 3 and consists of a gearbox, a motor, test bearings, shaft sensors, and various sensing devices.

The experimental platform covers four typical bearing operating conditions: normal (NO), inner race fault (IF), ball fault (BF), and outer race fault (OF). In this study, the sampling frequency was set to 48 kHz. This sampling frequency is high enough to prevent aliasing of high-frequency bearing fault impact signals, thereby preserving signal integrity. Under the 48 kHz sampling condition, the dataset includes four load conditions (0 HP, 1 HP, 2 HP, and 3 HP), corresponding to motor speeds of 1797 rpm, 1772 rpm, 1750 rpm, and 1730 rpm, respectively. For each combination of fault type, fault diameter, and load, one independent vibration signal sequence is provided, and the vibration responses at both the drive end and the fan end were synchronously recorded during data acquisition.
Time-frequency analysis was performed on the bearing vibration signals under the four operating conditions to identify the characteristic frequency ranges of different fault types. The corresponding time-frequency features are shown in Figure 4.

In the experiments, the parameters were divided into fixed parameters and interval-based optimization parameters. The latter were iteratively optimized by the PSO module in the proposed model. With properly configured inertia weight, cognitive coefficient, and social coefficient, the PSO algorithm can converge toward a near-optimal solution within a limited number of iterations through iterative updates of individual and population extrema, and it typically searches more efficiently than conventional tuning methods such as grid search and random search. The experimental settings are listed in Table 2.
Parameter | Setting |
Input sequence length | 768 |
Prediction sequence length | 256 |
Number of dense attention heads | 4 |
Number of mixed sparse attention heads | {8,10,12,14,16} |
Dimension of the fully connected layer | {256,512,1024} |
Moving-average window size | {8,16,24,32} |
Loss function | MSE |
Number of selected tokens for sparse heads | {32,64,96,128} |
Dropout rate | {0.1,0.2,0.3} |
Activation function | ReLU |
Number of encoder stacking layers | {1,2,3} |
Number of decoder stacking layers | {1,2,3} |
Feature embedding dimension | {64,128,256} |
To account for the dynamic characteristics and high-frequency impact features of vibration signals from rolling bearings in marine shafting systems, the core control parameters of the PSO module were configured as shown in Table 3.
Particle Swarm Optimization Parameter | Initial Value |
Inertia weight | 0.9 |
Cognitive coefficient | 1.8 |
Social coefficient | 1.2 |
Initial population size | 30 |
Number of iterations | 50 |
The adaptive inertia weight was set with an initial value of 0.9 and a final value of 0.4, and the nonlinear decay coefficient was set to 0.02, enabling a transition from global exploration to local exploitation. The cognitive coefficient followed a linearly decreasing strategy from 1.8 to 1.2, whereas the social coefficient followed a linearly increasing strategy from 1.2 to 1.8, so as to avoid local optima caused by signal noise.
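The coefficient schedules described above can be sketched as follows. The linear schedules for the cognitive and social coefficients follow the stated endpoints directly. The inertia schedule is an assumption: an exponential decay $\omega(t) = \omega_{end} + (\omega_{start} - \omega_{end}) e^{-\gamma t}$ is used here as one plausible nonlinear form with decay coefficient 0.02, and the authors' exact expression may differ. The function name `pso_schedules` is hypothetical.

```python
import math


def pso_schedules(t, t_max=50, w_start=0.9, w_end=0.4, gamma=0.02,
                  c_hi=1.8, c_lo=1.2):
    """Return (omega, c1, c2) at iteration t under the assumed schedules."""
    # Assumed exponential form for the nonlinear inertia decay
    omega = w_end + (w_start - w_end) * math.exp(-gamma * t)
    frac = t / t_max
    c1 = c_hi - (c_hi - c_lo) * frac  # cognitive coefficient: 1.8 down to 1.2
    c2 = c_lo + (c_hi - c_lo) * frac  # social coefficient: 1.2 up to 1.8
    return omega, c1, c2


start = pso_schedules(0)
end = pso_schedules(50)
```

Under this assumed exponential form, the inertia weight only approaches 0.4 asymptotically and is still about 0.58 at iteration 50. A schedule that reaches 0.4 exactly at the final iteration would need a different functional form.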
Considering the core requirements of time-series forecasting, the following three metrics were selected to evaluate the effectiveness of the proposed model:
In Eqs. (11)–(13), $y_i$ and $\hat{y}_i$ denote the true and predicted values at the $i$-th sampling point, respectively, while $n$ denotes the length of the prediction sequence.
MSE is used to reflect the deviation between predicted and true values, RMSE is used to highlight the prediction error in abrupt bearing fault segments, and MAE is used to assess the overall stability of model prediction. In addition to these basic metrics, recent studies have emphasized the use of multidimensional evaluation methods for comprehensive assessment of forecasting models, including computational efficiency, parameter scale, and training time, so as to more fully evaluate their practical deployment value [16].
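The three metrics can be computed as in the sketch below, assuming their standard definitions: $\mathrm{MSE} = \frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$, $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$, and $\mathrm{MAE} = \frac{1}{n}\sum_{i}|y_i - \hat{y}_i|$. The toy values are illustrative and are not taken from the experiments.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy sequence with a single large error at the last point
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 5.0]
scores = (mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

In this toy case the single error of 2 gives an RMSE of 1.0 but an MAE of only 0.5, which illustrates why RMSE is the more sensitive of the two to isolated large deviations such as those in abrupt fault segments.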
During data acquisition, vibration signals of rolling bearings in marine shafting systems are susceptible to operational noise, environmental interference, and sensor errors. Therefore, the raw vibration signals were first denoised and normalized to ensure the validity of the input data.
Considering the nonstationary nature of shafting vibration signals and the presence of abrupt impact components, the db4 wavelet basis was selected because of its good time-domain localization capability, which enables accurate capture of fault impact signals. The decomposition level was set to 3.
For a single vibration signal sequence $x(t)$, three-level wavelet decomposition was performed to obtain one low-frequency approximation component $c_a^3$ and three high-frequency detail components $\left(c_d^1, c_d^2, c_d^3\right)$. The decomposition is expressed in Eq. (14):
In Eq. (14), $c_a^3$ is the third-level low-frequency approximation component associated with the wavelet scaling function, and $c_d^k$ ($k=1,2,3$) denotes the $k$-th level high-frequency detail component associated with the wavelet function.
A fixed threshold was then applied to perform soft-threshold shrinkage on the high-frequency components, suppressing low-amplitude coefficients associated with noise while preserving large-amplitude coefficients corresponding to fault impacts. The soft-threshold shrinkage formula is given in Eq. (15):
In Eq. (15), $w$ denotes the original high-frequency coefficient, $\lambda$ is the threshold, and $w'$ denotes the high-frequency coefficient after thresholding.
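Soft-threshold shrinkage can be sketched as follows, assuming the standard rule $w' = \mathrm{sign}(w)\max(|w| - \lambda, 0)$. The function name `soft_threshold`, the threshold value, and the coefficients are illustrative; how the fixed threshold $\lambda$ was chosen is not specified in the text.

```python
def soft_threshold(coeffs, lam):
    """Zero out coefficients with |w| <= lam and shrink the rest toward zero by lam."""
    shrunk = []
    for w in coeffs:
        if abs(w) <= lam:
            shrunk.append(0.0)  # treated as noise
        else:
            shrunk.append(w - lam if w > 0 else w + lam)  # keep sign, reduce magnitude
    return shrunk


# Illustrative detail coefficients: two small (noise-like), two large (impact-like)
detail = [0.05, -0.2, 1.5, -3.0]
denoised = soft_threshold(detail, lam=0.5)
```

The small coefficients are removed, while the large ones survive with their magnitude reduced by $\lambda$. This keeps the mapping continuous, unlike hard thresholding. In PyWavelets, the same rule is available as `pywt.threshold` with `mode='soft'`.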
The low-frequency approximation component $c_a^3$ and the thresholded high-frequency detail components $\left(c_d^1, c_d^2, c_d^3\right)$ were then reconstructed by inverse wavelet transform to obtain the denoised signal $x'(t)$. The reconstruction formula is given in Eq. (16):
A comparison before and after signal processing is shown in Figure 5. The time-domain waveform indicates that wavelet denoising effectively suppresses random noise interference while preserving the impact characteristics of the signal. The spectral comparison shows that high-frequency noise components are significantly attenuated, whereas characteristic fault frequency components are retained. The energy distribution of wavelet coefficients reveals differences in the concentration of frequency-domain energy among different fault types, with fault signals exhibiting a higher proportion of energy in the high-frequency detail coefficient layers, thus providing a basis for subsequent fault feature extraction.

The effect of wavelet denoising on the signals of the four bearing operating states was systematically evaluated using multiple quantitative indicators, and the results are shown in Figure 6. The decrease in peak values under all conditions indicates that random impact noise was effectively suppressed. The narrower range of RMS values indicates that the signal energy distribution became more concentrated. Except for the normal state, the signal-to-noise ratios of the other three states improved substantially, with gains ranging from 20 dB to 32 dB, among which the outer race fault showed the most significant improvement.

A single signal amplitude can only reflect the basic operating condition and cannot comprehensively characterize the low-frequency periodicity, operating-condition fluctuation, and abrupt fault behavior of shafting vibration. It is therefore insufficient to fully meet the sparse attention feature-learning requirements of the MoSA module. Accordingly, multidimensional complementary features were extracted through time-series feature engineering to strengthen the representation of critical information and provide a basis for accurate model prediction.
The following five types of features were selected as the target features for the MoSA module, as shown in Table 4.
Feature Name | Description | Physical Meaning |
Raw vibration value | Denoised vibration acceleration signal | Most direct vibration state of the bearing |
Sliding mean | Mean value within the sliding window | Captures low-frequency periodic components |
Sliding variance | Variance within the sliding window | Reflects the degree of fluctuation (operating-condition variation) |
Peak value | Positions exceeding twice the standard deviation | Identifies abnormal impacts or abrupt changes |
Slope | Difference between adjacent sampling points | Captures trend-related faults |
The five types of features were aligned along the temporal dimension and concatenated to form the time-series feature vector, as expressed in Eq. (17):
In Eq. (17), $x^{\prime}(t)$ denotes the denoised signal, $m(t)$ denotes the sliding mean, $v(t)$ denotes the sliding variance, $p(t)$ denotes the peak feature, and $s(t)$ denotes the slope feature.
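A minimal sketch of this feature construction is given below. Several details are assumptions made for illustration and are not specified in the text: the window length of 4, the use of a trailing window that is shorter at the start of the sequence, the encoding of the peak feature as a 0/1 flag based on the global standard deviation, and a slope of zero at the first sample. The function name `build_features` is hypothetical.

```python
import statistics

def build_features(x, window=4):
    """Return one [x', m, v, p, s] row per time step of the denoised signal x."""
    sigma = statistics.pstdev(x)  # global standard deviation (assumed peak reference)
    rows = []
    for t, value in enumerate(x):
        seg = x[max(0, t - window + 1): t + 1]       # trailing sliding window
        m = statistics.fmean(seg)                    # sliding mean m(t)
        v = statistics.pvariance(seg)                # sliding variance v(t)
        p = 1.0 if abs(value) > 2 * sigma else 0.0   # peak flag p(t)
        s = value - x[t - 1] if t > 0 else 0.0       # slope s(t): adjacent difference
        rows.append([value, m, v, p, s])
    return rows

# Illustrative signal with one impact at index 3.
signal = [0.1, -0.1, 0.2, 3.0, -0.2, 0.1, 0.0, -0.1]
features = build_features(signal)
for row in features:
    print(row)
```

Each row is the five-dimensional feature vector for one time step, so the output has the same length as the input signal.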
Because the amplitudes of vibration signals differ significantly under different operating conditions, directly feeding them into the model may lead to gradient explosion or slow convergence, thereby reducing training efficiency and prediction accuracy. Therefore, Z-score normalization (Standard Scaler) was applied to the concatenated time-series feature vector to map all features to a unified distribution with mean 0 and variance 1, thereby eliminating the influence of scale differences.
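The normalization step can be sketched as follows. The sketch uses the population standard deviation, which matches the convention of a standard scaler, and it fits the statistics on one set of rows before applying them. Fitting on the training portion only is a common practice assumed here rather than a detail stated in the text, and the helper names `zscore_fit` and `zscore_apply` are hypothetical.

```python
import statistics

def zscore_fit(rows):
    """Per-column mean and population standard deviation of a feature matrix."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return means, stds

def zscore_apply(rows, means, stds):
    """Map every feature column to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two illustrative feature columns with very different scales.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = zscore_fit(train)
scaled = zscore_apply(train, means, stds)
print(scaled)
```

After scaling, both columns take identical values even though their raw amplitudes differ by a factor of ten, which is the scale-removal effect described above.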
Since shafting vibration signals are continuous long sequences, directly inputting them into the model would lead to excessive computational cost and would not meet the engineering requirement of online rolling prediction. Therefore, a rolling time-window mechanism was used to construct time-series sample pairs.
The input sequence length was fixed at 768 and the prediction sequence length at 256. During training, a sliding stride of 1 was adopted to increase sample diversity, whereas during testing the rolling step was set to 64 to balance prediction continuity and computational efficiency. The normalized time-series feature sequence was segmented to generate historical-input/future-output sample pairs, as expressed in Eqs. (18) and (19):
In these equations, $X_i$ denotes the $i$-th input sample and $Y_i$ denotes the corresponding output sample. Continuous sliding ensures temporal continuity between samples.
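The window construction can be sketched as below, using the stated input length of 768, prediction length of 256, and strides of 1 and 64. The function name `make_windows` and the integer toy series are illustrative assumptions, and the sketch slices a generic sequence without assuming which feature columns form the output.

```python
def make_windows(series, input_len=768, pred_len=256, stride=1):
    """Slice a sequence into (history, future) sample pairs with a sliding window."""
    pairs = []
    last_start = len(series) - input_len - pred_len
    for i in range(0, last_start + 1, stride):
        x = series[i: i + input_len]                          # historical input
        y = series[i + input_len: i + input_len + pred_len]   # future output
        pairs.append((x, y))
    return pairs

# Toy series of 1100 points; real inputs would be normalized feature rows.
series = list(range(1100))
train_pairs = make_windows(series, stride=1)
test_pairs = make_windows(series, stride=64)
print(len(train_pairs), len(test_pairs))
```

With a stride of 1, consecutive samples overlap in all but one point, which multiplies the number of training samples, whereas a stride of 64 yields far fewer windows for testing.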
After sample generation, the data were divided into training, test, and validation sets in a ratio of 7.5:1.5:1.5 for subsequent model training.
Based on the preprocessed dataset, the training and prediction procedures of the PSO-Informer model were carried out.
During the hyperparameter optimization stage, the hyperparameter set to be optimized in the MoSA-Informer model was first taken as the particle position vector, and the initial core parameters of the PSO algorithm were set as described in Section 4.2. The model was then trained for five epochs with each hyperparameter combination, and the prediction MSE on the validation set was used as the particle fitness. Finally, particle velocities and positions were updated with the elite learning strategy to search for the global optimum, and the best hyperparameter combination found by the swarm was retained.
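The search loop can be illustrated with a basic PSO sketch. It is not the improved PSO of this study: a linearly decreasing inertia weight stands in for the adaptive inertia weight, the elite learning strategy is omitted, and a toy quadratic function replaces the five-epoch training run that returns the validation MSE. The two hyperparameters (`lr_exp`, `dropout`), their bounds, and all numeric settings are hypothetical.

```python
import random

def pso(fitness, bounds, n_particles=10, n_iters=30, c1=1.5, c2=1.5, seed=0):
    """Basic PSO minimizer; each particle position is one hyperparameter vector."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for it in range(n_iters):
        w = 0.9 - 0.5 * it / max(1, n_iters - 1)   # inertia decays from 0.9 to 0.4
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))  # clip to bounds
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

def toy_fitness(p):
    """Stand-in for 'train five epochs, return validation MSE' (hypothetical)."""
    lr_exp, dropout = p
    return (lr_exp + 3.0) ** 2 + (dropout - 0.1) ** 2

best, best_fit = pso(toy_fitness, bounds=[(-5.0, -1.0), (0.0, 0.5)])
print(best, best_fit)
```

In an actual run, `toy_fitness` would be replaced by a function that builds the model from the particle position, trains it briefly, and returns the validation MSE, which makes each fitness evaluation the dominant cost of the search.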
After the training samples were fed into the model, they were first transformed into high-dimensional vectors through the embedding layer and then sent into the MoSA encoder. In this encoder, the dense attention heads are responsible for capturing global temporal dependencies, while the learnable sparse attention heads use an MLP router to select key tokens. Feature compression and enhancement are then completed through self-attention distillation, residual connections, and layer normalization. The decoder generates the prediction sequence based on the feature vectors output by the encoder, using the masking mechanism and probabilistic sampling strategy. Model parameters are iteratively updated by computing the MSE loss between predicted and true values and performing backpropagation.
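The token selection performed by a sparse head can be sketched as follows. This is a simplified illustration under several assumptions rather than the implementation used in this study: a single linear layer with a sigmoid replaces the MLP router, the query, key, and value projections are taken as identity maps, the output is gated by the router score, and positions that are not selected receive a zero output from this head. The function name `sparse_head` and the toy tokens are hypothetical.

```python
import math

def sparse_head(tokens, router_w, k):
    """One sparse head: score tokens, keep the top k, attend among them only."""
    # Router score per token (linear layer + sigmoid as a stand-in for the MLP router).
    scores = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(router_w, tok))))
              for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])                         # selected positions, in time order
    sel = [tokens[i] for i in keep]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = [[0.0] * dim for _ in tokens]               # unselected positions stay zero
    for qi, q in zip(keep, sel):
        logits = [scale * sum(a * b for a, b in zip(q, key)) for key in sel]
        mx = max(logits)
        exps = [math.exp(v - mx) for v in logits]     # numerically stable softmax
        total = sum(exps)
        attn = [e / total for e in exps]
        ctx = [sum(a * v[d] for a, v in zip(attn, sel)) for d in range(dim)]
        out[qi] = [scores[qi] * c for c in ctx]       # gate by router score
    return keep, out

# Six toy tokens; positions 2 and 5 mimic impact segments with large amplitude.
tokens = [[0.1, 0.0], [0.0, 0.1], [2.0, 1.5], [0.1, 0.1], [-0.1, 0.0], [1.8, 2.2]]
keep, out = sparse_head(tokens, router_w=[1.0, 1.0], k=2)
print(keep)
```

Because attention is computed only among the k selected tokens, the cost of a sparse head grows with k squared rather than with the full sequence length squared, while the dense heads continue to cover all positions.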
After model training was completed, the input samples of the test set were fed into the model in chronological order during the testing stage to produce the result for the first prediction window. Then, the rolling mechanism was activated: the predicted result was appended to the end of the original input sequence, and the first 64 data points were removed to form a new input sequence. This process was repeated until continuous rolling prediction over the long sequence in the test set was completed, yielding the full predicted sequence.
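The rolling procedure can be sketched as below. One detail is an assumption: to keep the input length fixed at 768 after the oldest 64 points are removed, only the first 64 points of each predicted window are appended. The placeholder `naive_model`, which repeats the last observed value, stands in for the trained network and is purely hypothetical.

```python
def rolling_forecast(model, history, total_steps, input_len=768, step=64):
    """Predict, keep the first `step` points, slide the window, and repeat."""
    window = list(history[-input_len:])
    forecast = []
    while len(forecast) < total_steps:
        pred = model(window)                  # one prediction window
        new_points = pred[:step]              # assumed: only `step` points are fed back
        forecast.extend(new_points)
        window = window[step:] + new_points   # drop oldest `step`, append newest `step`
    return forecast[:total_steps]

def naive_model(window, pred_len=256):
    """Placeholder predictor (hypothetical): repeats the last observed value."""
    return [window[-1]] * pred_len

history = [float(i) for i in range(768)]
out = rolling_forecast(naive_model, history, total_steps=200)
print(len(out), out[0], out[-1])
```

Because each new input contains earlier predictions, errors can accumulate over successive rolls, which is one reason a moderate rolling step is preferred over a very small one.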
Table 5 presents the comparison between baseline models and the final proposed model, while Table 6 reports the performance of baseline models after incorporating PSO-based hyperparameter optimization. The proposed model remains unchanged across both tables as it already includes the PSO and MoSA modules.
To verify the performance of the improved PSO-Informer model, the outer race fault data from the CWRU dataset were used for effectiveness analysis. The improved PSO-Informer model was compared with LSTM, Transformer, Informer, iTransformer, and Flowformer. In addition, to illustrate the contribution of the MoSA module to model prediction performance, the PSO module was also applied to the above models to determine their optimal training parameters.
As shown in Table 7, the comparative experiments can be divided into three categories according to different validation groups, each intended to verify the effectiveness of different modules in the proposed method.
Prediction Model | MSE | MAE | RMSE |
LSTM | 0.6126 | 0.5852 | 0.9423 |
Transformer | 0.5781 | 0.4624 | 0.7150 |
Informer | 0.4435 | 0.4371 | 0.6861 |
iTransformer | 0.4341 | 0.4210 | 0.6512 |
Flowformer | 0.5219 | 0.4603 | 0.5527 |
Proposed model | 0.2015 | 0.2538 | 0.3490 |
Prediction Model | MSE | MAE | RMSE |
LSTM | 0.4512 | 0.3835 | 0.7465 |
Transformer | 0.3324 | 0.3148 | 0.5614 |
Informer | 0.3035 | 0.2997 | 0.3781 |
iTransformer | 0.2703 | 0.2478 | 0.3513 |
Flowformer | 0.3201 | 0.2587 | 0.3375 |
Proposed model | 0.2015 | 0.2538 | 0.3490 |
Category No. | Category Name | Comparative Model | Comparison Objective |
Category 1 | Time-series models without attention mechanism | LSTM | Verify the effectiveness of attention mechanisms in feature capture |
Category 2 | Attention-based model paradigms | Transformer, Informer | Verify the effectiveness of the improved mixed sparse attention in bearing signal prediction |
Category 3 | Variants of attention mechanisms | iTransformer, Flowformer | Verify the effectiveness of the proposed method against advanced forecasting models |
The comparative results are shown in Table 5. Compared with LSTM, which relies on gated recurrence for sequence modeling, Transformer and Informer equipped with attention mechanisms achieve significantly better prediction performance. Among models with attention mechanisms, the proposed model with the mixed sparse attention mechanism achieves lower values for all evaluation metrics than the comparison models, effectively demonstrating that the mixed sparse attention mechanism can identify abrupt fault features in bearing vibration signals.
As shown in Table 5, models based on attention mechanisms, such as Transformer, Informer, and iTransformer, generally outperform the traditional LSTM model, which confirms the advantage of attention mechanisms in capturing long-range temporal dependencies. Among them, Informer reduces computational complexity while maintaining prediction accuracy through the ProbSparse self-attention mechanism, whereas iTransformer and Flowformer, as improved variants of attention mechanisms, further enhance prediction performance. By integrating PSO-based hyperparameter optimization with the MoSA mixed sparse attention mechanism, the proposed PSO-Informer model outperforms all comparison models in terms of MSE, MAE, and RMSE. In particular, the MSE value of 0.2015 is 53.6% lower than that of the second-best iTransformer model (0.4341), verifying the effectiveness of the proposed method.
Figure 7 compares the predicted waveforms for the outer race fault in the CWRU dataset. As can be seen from the figure, the proposed model performs well in handling both abrupt bearing fault features and dynamic changes. The fitted signal curve is highly consistent with the true bearing signal in both trend and waveform, and it can effectively capture the severe impacts and periodic characteristics in the bearing fault signal.

By contrast, because LSTM lacks an attention mechanism, it exhibits an overly smooth trend in regions with impact peaks and fails to effectively capture local impact features over long sequences. Although Transformer and Informer can generally follow the global dynamic trend, they show clear amplitude deviations at the peak positions of impact signals, indicating limited accuracy in describing local abrupt signal features. The fitting performance of iTransformer and Flowformer is improved relative to the aforementioned models, but discrepancies remain in continuous impact-sequence segments. In some impact peak regions, slight lag is observed and extreme values are not restored accurately enough. The mixed sparse attention module adopts a collaborative dense-head/sparse-head architecture, in which the fixed dense attention heads capture global temporal dependencies and ensure stable modeling of the low-frequency periodic patterns of bearing signals, while the learnable sparse attention heads dynamically select key tokens through the MLP router and focus on local abrupt fault features. This design enables the model to effectively extract fault impact features under strong noise interference while suppressing noise. The waveform comparison in Figure 7 visually demonstrates this advantage: the fitted curve of the proposed model closely matches the true signal at impact peaks, whereas the comparison models exhibit varying degrees of smoothing or deviation.
To verify the engineering value of the improved PSO optimization module, the PSO module was introduced into each model to determine its optimal parameter combination. The results are presented in Table 6.
It can be seen from Table 6 that prediction performance is significantly improved for all models after the introduction of the improved PSO module. Taking Transformer as an example, the MSE decreases from 0.5781 to 0.3324, corresponding to a reduction of 42.5%. This indicates that, in practical engineering applications, adaptive optimization algorithms for hyperparameter determination can effectively reduce the randomness and trial-and-error cost of manual tuning, enabling models to better adapt to the dynamic characteristics of bearings under different operating conditions.
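The reported percentages follow from the relative reduction $(a - b)/a$, where $a$ is the reference MSE and $b$ is the improved MSE. The short check below recomputes them from the values in Tables 5 and 6. It also shows why two different margins over iTransformer appear in this paper: 53.6% is measured against the untuned baseline in Table 5, and 25.5% against the PSO-tuned baseline in Table 6. The variable names are illustrative.

```python
def relative_reduction(before, after):
    """Percentage decrease of `after` relative to `before`."""
    return 100.0 * (before - after) / before

# Transformer MSE without PSO (Table 5) and with PSO (Table 6).
transformer_gain = relative_reduction(0.5781, 0.3324)

# Proposed model (0.2015) against iTransformer, untuned and PSO-tuned.
vs_itransformer_untuned = relative_reduction(0.4341, 0.2015)
vs_itransformer_tuned = relative_reduction(0.2703, 0.2015)

print(round(transformer_gain, 1),
      round(vs_itransformer_untuned, 1),
      round(vs_itransformer_tuned, 1))
```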
To verify the adaptability of the proposed model to different bearing operating states, the pretrained model was applied to data corresponding to the four operating states in the CWRU bearing dataset. The prediction results are shown in Figure 8.

As shown in Figure 8, the proposed model can follow the dynamic trends of bearing vibration signals under different operating states while accurately capturing local abrupt features. This indicates that the dynamic selection strategy of the sparse heads in the MoSA module can flexibly adapt to changes in signal distribution caused by switching operating conditions, thereby achieving adaptive feature extraction under different rotational speeds and load conditions. These results support the effectiveness of the mixed sparse attention mechanism, the rolling time-window output strategy, and the optimal parameter search, and indicate that the model adapts well to complex operating conditions.
This study still has several limitations, and it should therefore be regarded as a methodological validation based on a benchmark dataset rather than a full engineering validation for real marine shafting systems. First, the experiments were validated only on the public CWRU dataset. Although this dataset is widely used, it still differs from the actual operating environment of marine shafting systems, and further validation on real marine shafting data is therefore needed. To address the large distribution differences between operating conditions, transfer learning methods have been widely applied in recent years to cross-condition fault diagnosis tasks [17], [18]. Through source-domain selection and feature alignment strategies, these methods can effectively improve the generalization ability of models in the target domain, and future work may explore integrating transfer learning with the proposed model to further enhance its adaptability under complex and varying operating conditions. Second, the search efficiency of the PSO algorithm is strongly affected by the initial parameter settings. Although adaptive inertia weight and an elite learning strategy are adopted in this study, its convergence in ultra-large parameter spaces still requires further investigation. Third, the number of sparse heads and the token selection ratio in the MoSA module still need to be manually specified, and future studies may explore automatic optimization methods for these parameters.
5. Conclusions
This study proposes an improved PSO-Informer prediction model for vibration signal forecasting of rolling bearings in marine shafting systems, and systematically investigates the effects of PSO-based hyperparameter optimization, the rolling time-window mechanism, and the MoSA encoder on prediction performance. The main conclusions are as follows.
(1) By exploiting the global optimization capability of the PSO algorithm, adaptive optimization of the key hyperparameters of Informer is achieved, thereby addressing the randomness of manual parameter setting and the issue of adaptability under varying operating conditions in practical applications. Experimental results show that, after introducing the PSO module, the prediction performance of all comparison models is significantly improved. For example, the MSE of Transformer decreases from 0.5781 to 0.3324, representing a reduction of 42.5%.
(2) The rolling time-window mechanism is introduced to dynamically segment time-series data, which enhances the model’s ability to capture abrupt signal variations and perform continuous prediction, thereby providing a foundation for online rolling forecasting.
(3) MoSA is designed with a collaborative dense-head/sparse-head architecture, which balances global temporal dependency modeling and local fault feature extraction and improves the model’s adaptability to signals with strong noise and varying operating conditions. Experimental results show that the proposed model significantly outperforms LSTM, Transformer, Informer, iTransformer, and Flowformer in terms of MSE, MAE, and RMSE. In particular, the MSE reaches 0.2015, which is 25.5% lower than that of the second-best iTransformer model (0.2703).
(4) The prediction results under four different bearing operating states verify the adaptability of the model to complex operating conditions. The predicted waveform comparison shows that the proposed model can effectively follow the dynamic trend of bearing vibration signals and accurately capture local abrupt features at impact peaks, whereas the comparison models exhibit varying degrees of smoothing or deviation.
This study provides an efficient and accurate time-series prediction method for rolling bearings in marine shafting systems, addressing the challenges of adaptability and prediction accuracy under complex operating conditions and establishing a modeling foundation for subsequent predictive maintenance. Future work will focus on validation using real marine shafting datasets, exploration of automatic optimization methods for key hyperparameters in the MoSA module, and incorporation of richer feature engineering strategies to further improve the model’s ability to extract degradation information.
Author Contributions:
Conceptualization, D.C. and X.M.; methodology, D.C., X.M. and H.S.; software, H.S.; validation, X.M., H.S. and X.Z.; formal analysis, D.C., X.Z. and X.M.; investigation, X.M. and H.S.; data curation, H.S.; writing—original draft preparation, X.M. and X.Z.; writing—review and editing, D.C., X.Z. and X.M.; visualization, X.Z. and X.M.; supervision, D.C.; project administration, D.C.; funding acquisition, D.C. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement:
The data supporting the findings of this study are publicly available from the Case Western Reserve University (CWRU) Bearing Data Center. The dataset can be accessed at: https://engineering.case.edu/bearingdatacenter. All data used in this study are included in the preprocessing and experimental procedures described in the article.
Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
