Classification of Cyclin Proteins Using Amino Acid Composition and an SVM Approach: An In-Depth Analysis
Abstract:
Cyclins are a pivotal family of proteins that modulate cellular growth by activating cell-cycle mediators and are therefore essential for the cell cycle. Because cyclin sequences are markedly dissimilar, distinguishing cyclins from other proteins remains challenging. In this study, a methodology is proposed in which amino acid composition informs an SVM-based classification approach. SVMs are supervised machine learning algorithms typically employed for classification and regression tasks. From the data analyzed, eighteen (18) feature labels were extracted, yielding a set of 13,151 features. Jackknife cross-validation showed that this SVM-based approach identifies cyclins with an accuracy of 91.9%, a notable improvement over prior studies. These results underscore the potential for more accurate and efficient differentiation of cyclins in future work.
1. Introduction
Cyclins represent a crucial family of proteins implicated in the orchestration of cellular growth. These proteins function by activating cell-cycle mediators, a family of serine/threonine kinases fundamental to the cell cycle [1]. The presence and concentration of various cyclins fluctuate throughout the cell cycle, as depicted in Figure 1.
Two primary mechanisms are recognized for governing changes in cyclin concentrations. Firstly, these shifts are often attributed to variations in cyclin gene expression [2]. Secondly, the ubiquitin-mediated degradation pathway mediates alterations in cyclin concentrations [2]. Cyclins form complexes with cyclin-dependent kinases (CDKs). Following phosphorylation, that is, the addition of a phosphate group, the active site of the CDK becomes operational [2]. Subsequently, these activated complexes play pivotal roles in cell cycle progression [2]. When a cyclin binds to a CDK, the maturation-promoting factor (MPF) is produced. This factor is responsible for the phosphorylation of various proteins, thereby facilitating distinct cell-cycle processes, notably microtubule and chromosomal reorganization [3]. Cyclins do not possess an enzymatic active site; however, they present a surface-binding site and can localise CDKs within specific sub-cellular compartments [2].
The post-genomic era has witnessed a remarkable proliferation of biological data, particularly sequence data [4], [5], [6], [7]. Traditional methodologies for processing and understanding this information tend to be time-intensive and expensive, and they also yield relatively low success rates. Thus, swift techniques for sequence identification have become increasingly sought after [8], [9], [10], [11], [12]. Prevalent computational methods, such as BLAST and FASTA, facilitate nucleotide and peptide sequence database searches, yet they exhibit limitations in cyclin differentiation due to sequence dissimilarities. Hence, machine learning-based classification in this domain has garnered increased attention [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. In a prior approach by Mohabatkar [25], pseudo amino acid composition achieved an accuracy of 83.53% for cyclin detection. In the present study, eighteen distinct feature labels were extracted and the feature dimensionality was subsequently reduced. With the refined feature set, an accuracy of 94.3% was achieved, with a jackknife cross-validation accuracy of 91.9%.
Emerging research has highlighted a possible role of cyclin D1 in enhancing DNA repair, which may safeguard transformed cells from excessive genomic instability and help shield breast cancer cells from DNA-damaging treatments. Intriguingly, cyclin D1 has also recently been identified as a promoter of whole-genome chromosomal instability [26]. Traditionally, the mitotic protein mechanism has been acknowledged as a vital component of cell membrane transition and regulation. This mechanism, being integral to a plethora of standard and malignant intracellular signaling processes, collaborates with CDKs to regulate the quantity of various cyclin subunits and CDK inhibitors [27]. Peptide autophagy offers a mechanism for the chronological execution and harmonization of transcription transitions by working in tandem with CDKs to modulate the number of different cyclin subunits and CDK antagonists [28]. D-type cyclins, produced by three primary enzymes, have emerged as significant receptors of exogenous signal transduction, transmitting pro-inflammatory signals to the intrinsic circadian tissue generator [29]. Through interactions with CDK4 and CDK6, and by retaining the CDK inhibitors p21 and p27, D-type cyclins are posited as primary promoters of G1 phase progression in relation to non-inflammatory stimuli. Among these, cyclin D1, one of the most extensively studied D-type cyclins, has been consistently linked to malignancy, with its abundance positively associated with cell transformation [30]. Elevated cyclin D1 levels in tumour cells, largely resulting from aberrant protein ubiquity and stability, have been identified as markers of cancer phenotype and disease progression [31], [32].
2. Methodology
As illustrated in Figure 2, an approach was established whereby data were first collected, followed by the application of various feature selection strategies for feature extraction. Subsequently, a range of classifiers were employed, culminating in the validation of the adopted technique.
The datasets employed in this investigation were sourced from UniProt, a widely used protein database. Two distinct datasets were utilized: one representing cyclin proteins and the other representing acyclic proteins. The search term "cyclin proteins" was used to identify cyclin proteins within the UniProt database, while "acyclic proteins" served to identify the latter dataset. Initial collection yielded 297 cyclin proteins and 313 acyclic proteins. To reduce redundancy and potential bias, sequences exhibiting over 70% similarity were eliminated using the CD-HIT Suite. After homology reduction, the datasets comprised 146 cyclin proteins and 13 acyclic proteins. These sequences were then used in the proposed model for both training and testing. Although a benchmark dataset with a lower sequence identity threshold, such as 25%, might improve the reliability of the results, it was determined that this would substantially reduce the overall sample size, jeopardizing statistical validity. Consequently, such a stringent threshold was not adopted in this study.
The efficacy of machine learning-based protein classification largely hinges on the robustness of the features selected. By judiciously selecting optimal features, both the classifier's performance and the model-building process can be significantly enhanced. In this section, the features employed in this study will be elucidated.
A) Frequency Vector (FV)
Frequency analysis furnishes details about the benchmark datasets in relation to each amino acid within a protein. For each amino acid, its number of occurrences in the sequence is computed and then encapsulated within a vector, known as the FV. The FV preserves information about the length and composition of protein samples. The FV is derived as:
$$FV = [r_1, r_2, \ldots, r_{20}]$$
In this equation, each $r_i$ signifies the frequency of the $i$-th amino acid in the sequence, with the 20 amino acids arranged in alphabetical order.
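As an illustrative sketch of the frequency vector described above (not the implementation used in the study), the counts can be obtained as follows. The function name `frequency_vector` and the choice to order amino acids by their one-letter codes are assumptions made for this example; residues outside the 20 standard amino acids are ignored.

```python
from collections import Counter

# Assumption: "alphabetical order" refers to the one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(sequence):
    """Return [r_1, ..., r_20], the occurrence count of each standard amino acid."""
    counts = Counter(sequence.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


# Toy sequence, chosen only for illustration.
fv = frequency_vector("MKKLAC")
```

Because raw counts are kept rather than proportions, the entries of the vector sum to the sequence length, which is how the vector retains length as well as composition.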
B) Computation of Position Relative Incidence Matrix (PRIM)
The primary sequence is instrumental in deciphering concealed attributes of a protein. The underlying mathematical model rests on the relative positioning of amino acids within each protein of the benchmark dataset. A 20$\times$20 matrix, labelled as $H_{PRIM}$, is utilized to capture the relative positioning data of amino acids. Its derivation is:
$$H_{PRIM} = \begin{bmatrix} H_{1 \rightarrow 1} & H_{1 \rightarrow 2} & \cdots & H_{1 \rightarrow 20} \\ H_{2 \rightarrow 1} & H_{2 \rightarrow 2} & \cdots & H_{2 \rightarrow 20} \\ \vdots & \vdots & \ddots & \vdots \\ H_{20 \rightarrow 1} & H_{20 \rightarrow 2} & \cdots & H_{20 \rightarrow 20} \end{bmatrix}$$
In the $H_{PRIM}$ matrix, every element $H_{i \rightarrow j}$ holds an aggregate value: the relative positioning of the $j$-th residue with respect to the first occurrence of the $i$-th residue. This process yields 400 coefficients. To streamline these coefficients, a set of computations is conducted, eventually resulting in 30 coefficients.
C) Computation of Reverse Position Relative Incidence Matrix (RPRIM)
To unveil nuanced features in proteins with certain ambiguities, $H_{RPRIM}$ is determined, drawing on information from the reverse-sequenced protein sample. $H_{RPRIM}$ is a 20$\times$20 matrix constructed in the same manner as $H_{PRIM}$, but computed on the protein sequence read in reverse.
$H_{RPRIM}$ furnishes 400 coefficients akin to $H_{PRIM}$, undergoes an identical coefficient reduction process, and eventually generates 30 coefficients.
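The construction of the two matrices can be sketched as below. This is one possible reading of the description, not the authors' code: each entry is assumed to accumulate the offsets of residue $j$ measured from the first occurrence of residue $i$, counting only positions at or after that first occurrence. The names `prim` and `rprim` are hypothetical, and the subsequent reduction from 400 to 30 coefficients is not shown because the text does not specify it.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def prim(sequence):
    """Build a 20x20 matrix of accumulated relative positions.

    Assumption: entry [i][j] sums (pos - first_i) over every position `pos`
    holding residue j that lies at or after the first occurrence of residue i.
    """
    seq = [aa for aa in sequence.upper() if aa in INDEX]
    matrix = [[0] * 20 for _ in range(20)]

    # Record the first position at which each residue type appears.
    first_seen = {}
    for pos, aa in enumerate(seq):
        first_seen.setdefault(aa, pos)

    for aa_i, first in first_seen.items():
        row = matrix[INDEX[aa_i]]
        for pos in range(first, len(seq)):
            row[INDEX[seq[pos]]] += pos - first
    return matrix


def rprim(sequence):
    """Same construction applied to the reversed sequence."""
    return prim(sequence[::-1])


m = prim("ACA")
```

Rows for amino acids that never occur in the sequence stay at zero, so the matrix is typically sparse for short proteins.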
D) Computation of Accumulative Absolute Position Incidence Vector (AAPIV)
The frequency vector captures the composition of a protein sequence but omits the positions at which its residues occur. To bridge this gap, AAPIV is employed, calculated for the 20 native amino acids:
$$AAPIV = [u_1, u_2, \ldots, u_{20}]$$
Herein, each $u_i$ represents an AAPIV component, its formulation being:
$$u_i = \sum_{k=1}^{n} p_k$$
where $p_k$ is the position of the $k$-th occurrence of the $i$-th amino acid and $n$ is its number of occurrences in the sequence.
E) Computation of Reverse Accumulative Absolute Position Incidence Vector (RAAPIV)
To expose salient positional patterns, RAAPIV is derived from the reversed sequence of each protein sample, following the same computation as AAPIV.
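A minimal sketch of both vectors is given below, assuming that each component is the sum of the 1-based positions at which the corresponding amino acid occurs. The function names `aapiv` and `raapiv` are hypothetical and the code is illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aapiv(sequence):
    """Return [u_1, ..., u_20]; u_i sums the 1-based positions of amino acid i."""
    totals = dict.fromkeys(AMINO_ACIDS, 0)
    for position, aa in enumerate(sequence.upper(), start=1):
        if aa in totals:  # skip non-standard residue codes
            totals[aa] += position
    return [totals[aa] for aa in AMINO_ACIDS]


def raapiv(sequence):
    """AAPIV computed on the reversed sequence."""
    return aapiv(sequence[::-1])


forward = aapiv("ACA")
backward = raapiv("AAC")
```

Two sequences with identical composition but different residue order yield the same frequency vector yet generally different AAPIV and RAAPIV values, which is the positional information these features are meant to add.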
After feature extraction, redundant and noisy features were excluded, given their potential to significantly influence model construction. While in theory every combination of features could be evaluated, processing becomes cumbersome as the feature dimensionality increases. For instance, with a feature dimensionality of 100, $2^{100}$ candidate feature subsets, and hence models, would need to be considered. Thus, an effective method for isolating optimal feature combinations becomes paramount. Notable feature screening methodologies have been previously discussed in the literature, encompassing single-factor analysis [33], Maximum-Relevance-Maximum-Distance (MRMD) [34], and Minimum Redundancy Maximum Relevance (mRMR) [35]. In this study, two screening methodologies were employed.
A) Incremental Feature Selection (IFS)
Incremental feature selection was utilized, with features ordered in descending manner based on their Analysis of Variance (ANOVA) values. This approach led to a rapid reduction in feature dimensionality, facilitating efficient computations. ANOVA [36] was chosen for feature ranking owing to its computational efficiency. The F-score for each feature can be computed as:
$$F = \frac{\sum_{i=1}^{K} n_i \left(\bar{x}_i - \bar{x}\right)^2 / (K - 1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 / (N - K)}$$
In this equation, $K$ is the number of classes, $N$ is the total number of observations, $n_i$ is the number of observations in the $i$-th class, $\bar{x}_i$ represents the average value of those observations, $x_{ij}$ is the value of the $j$-th observation in the $i$-th class, and $\bar{x}$ is the average of all observations. Features can then be ranked based on their respective F-scores, with the feature having the highest F-score occupying the first position. The SVM parameters are tuned via cross-validation, and the accuracy of the feature set consisting of the top-ranked feature was assessed first. Thereafter, by incorporating the next highest-ranked feature, a new feature set was formed and its accuracy was also evaluated using the SVM. With a 100-dimensional feature set, this procedure yields merely 100 models, a substantial reduction in the computational time needed for feature selection. Upon computing the accuracy for each feature set, the feature set with the highest accuracy was selected as the optimal one.
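The ranking and incremental steps can be sketched as follows. The sketch assumes the standard one-way ANOVA F statistic and takes the model evaluation as a caller-supplied function `evaluate` (hypothetical; in the study this role is played by SVM accuracy under cross-validation). All function names are chosen for illustration.

```python
def anova_f_score(groups):
    """One-way ANOVA F statistic for one feature.

    `groups` is a list with one list of feature values per class.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if within == 0:
        # Degenerate case: no spread inside any class.
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n_total - k))


def rank_features(X, y):
    """Return feature indices ordered by descending F-score."""
    classes = sorted(set(y))
    scores = []
    for f in range(len(X[0])):
        groups = [[row[f] for row, label in zip(X, y) if label == c] for c in classes]
        scores.append(anova_f_score(groups))
    return sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)


def incremental_feature_selection(ranking, evaluate):
    """Evaluate the top-1, top-2, ... feature sets and keep the best one."""
    best_subset, best_score = [], float("-inf")
    for size in range(1, len(ranking) + 1):
        subset = ranking[:size]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5.0], [2, 5.1], [3, 4.9], [4, 5.0], [5, 5.2], [6, 4.8]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

With $d$ ranked features, `incremental_feature_selection` calls `evaluate` exactly $d$ times, which is the source of the saving over the $2^{d}$ subsets of an exhaustive search.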
B) Greedy Algorithm
To further refine the feature set, the greedy algorithm [37] was applied. The model was first trained on each feature individually; for a 100-dimensional feature set, this amounts to 100 training runs, after which the single feature yielding the highest accuracy was retained. Subsequently, each of the remaining 99 features was added in turn to the retained feature, and the classifier was trained on the resulting two-dimensional feature sets. After these 99 comparisons, the best two-feature set was identified. This procedure was repeated until adding a further feature yielded an accuracy inferior to that of the preceding feature set.
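A compact sketch of this greedy forward selection is shown below. The scoring function `evaluate` is a hypothetical stand-in for training the classifier on a candidate feature set and returning its accuracy. As an assumption of this sketch, the search also stops on ties, so a feature is only kept if it strictly improves the score.

```python
def greedy_forward_selection(candidates, evaluate):
    """Add one feature per round, keeping the one that scores best.

    Stops when no remaining feature improves on the current best score.
    """
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # Try every remaining feature on top of the current selection.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        round_score, round_feature = max(scored, key=lambda pair: pair[0])
        if round_score <= best_score:
            break
        selected.append(round_feature)
        remaining.remove(round_feature)
        best_score = round_score
    return selected, best_score


# Toy scoring function: features 2 and 5 are informative, and every
# added feature carries a small penalty.
_USEFUL = {2: 0.5, 5: 0.3}


def toy_score(subset):
    return sum(_USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


chosen, score = greedy_forward_selection(range(6), toy_score)
```

For 100 candidate features the first round trains 100 models and the second 99, matching the counts given in the text; the total cost grows with the number of rounds actually completed rather than with $2^{100}$.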
C) SVM
The SVM [38] is a supervised learning model that has been extensively adopted in bioinformatics investigations, especially for regression and classification problems [39], [40], [41], [42], [43], [44], [45], [46], [47]. For the classification tasks in this study, LIBSVM [48] was used. A linear function was selected as the kernel function. Both the kernel parameter $\gamma$ and the regularization parameter $C$ were tuned via grid search.
For a model to be trusted, rigorous testing is paramount. Owing to its inherent independence, cross-validation has been identified as one of the pre-eminent methods for this purpose [49], [50], [51], [52], [53], [54], [55]. Within the realm of model validation, three predominant approaches emerge: jackknife cross-validation, n-fold cross-validation, and independent validation. Of these, jackknife cross-validation has been highlighted as especially well suited to small datasets, and it always yields a unique result for a given dataset [56].
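The jackknife procedure itself is independent of the classifier: each sample is held out once, the model is trained on the remainder, and the held-out sample is predicted. The sketch below illustrates this loop. To stay self-contained it uses a simple nearest-centroid classifier as a stand-in for the SVM; that substitution, and all function names, are assumptions of this example.

```python
def jackknife_predictions(X, y, fit, predict):
    """Leave-one-out loop: predict each sample from a model trained on the rest."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        predictions.append(predict(model, X[i]))
    return predictions


def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row):
    """Assign the label of the closest class centroid (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy one-dimensional data with two well-separated classes.
X = [[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]]
y = [0, 0, 0, 1, 1, 1]
predicted = jackknife_predictions(X, y, fit_centroids, predict_centroid)
```

Since every sample is left out exactly once and no random split is involved, repeating the procedure on the same dataset gives the same predictions.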
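Eq. (8) delineates the calculations for sensitivity (Sn), specificity (Sp), and overall accuracy (Acc) within the context of jackknife cross-validation.
$$Sn = \frac{TP}{TP + FN}, \quad Sp = \frac{TN}{TN + FP}, \quad Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$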
In this equation, the terms True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) represent the number of positive instances correctly identified as positive, negative instances correctly identified as negative, negative instances incorrectly identified as positive, and positive instances incorrectly identified as negative, respectively.
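For concreteness, the three metrics can be computed from the four counts as in the sketch below, which uses their standard definitions. The function names and the example counts are illustrative and are not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary labelling."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def sn_sp_acc(tp, tn, fp, fn):
    """Sensitivity, specificity and overall accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Made-up counts for illustration only.
sn, sp, acc = sn_sp_acc(tp=9, tn=8, fp=2, fn=1)
```

Sensitivity and specificity are reported separately because overall accuracy alone can look favourable on an imbalanced dataset even when one class is poorly recognised.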
The receiver operating characteristic curve (ROC) alongside the area under the curve (AUC) were utilized to elucidate the performance metrics of the model.
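The AUC has a direct interpretation that is easy to compute: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name `roc_auc` and the toy scores are assumptions for illustration.

```python
def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy decision scores, for illustration only.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.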
3. Results
In the course of this investigation, 18 distinct categories of features were obtained, yielding a 13,151-dimensional feature set. These features were ranked based on their respective F-scores, determined through a one-way ANOVA.
Table 1. Comparison of the current study with the model of Mohabatkar [25]
Metric | Current study | Mohabatkar [25] |
Accuracy (Acc) | 91.90% | 83.53% |
Specificity (Sp) | 92.80% | - |
Sensitivity (Sn) | 91.00% | 87.44% |
Area under curve (AUC) | 0.9159 | 0.8944 |
Feature dimensions | 8 | 21 |
Figure 3 shows the performance of the models trained on each feature set after the initial screening phase. The highest accuracy under 5-fold cross-validation, 95.2%, is obtained when the feature set comprises 130 features (Figure 3). Despite this high accuracy, the dimensionality remains substantial. Consequently, a greedy approach was adopted to further reduce the dimensionality of the feature set. After this secondary screening, the feature set was reduced to eight dimensions. With this refined set, an accuracy of 91.9% was achieved under jackknife validation, with an area under the curve of 0.9159. Although the accuracy of this compact subset is marginally lower than that of the 130-feature set from the primary phase, the reduction from 130 to 8 features could improve the model's robustness and reduce its susceptibility to overfitting. Thus, the final model was constructed from these eight features. In contrast, in the study by Mohabatkar [25], the feature set underwent no filtering, and the model was built on all 21 features.
Table 1 shows that the methodology described here outperforms the previously published model on every metric reported for both.
4. Discussion
The identification of cyclins through amino acid composition has been a topic of significant interest given its relevance to understanding cell cycle regulation and its potential applications in numerous biological fields. This study's approach of combining an SVM with feature extraction and selection techniques offers a promising advance towards accurate cyclin identification.
One of the main contributions of this study is the effective reduction of features. From a pool of eighteen categories and over thirteen thousand feature dimensions, the optimal feature space was distilled down to just eight. This simplifies the model, making it computationally more efficient, and potentially increases its generalizability by reducing the risk of overfitting. The use of ANOVA and the greedy algorithm in this feature selection process is noteworthy. Both techniques have been utilized in various applications across disciplines, and their effectiveness in this particular context underscores their potential for broader applications in bioinformatics.
Comparatively, when set against findings from Mohabatkar’s work [25], the model proposed herein showcased enhanced accuracy and performance while operating on a reduced feature set. The reduction in feature dimensions from 21 to 8, coupled with improved performance metrics, indicates the efficacy of the methodological innovations introduced.
However, as with all studies, there are considerations to be made. Although the eight-dimensional features led to improved accuracy in the SVM model, one must question the biological relevance of each feature. Future studies might delve deeper into the understanding of why these specific features were critical in the identification process. Moreover, the external validity of the model should be tested across different datasets to ascertain its broad applicability.
Jackknife cross-validation, a significant aspect of the validation process in this research, further corroborates the model’s reliability. However, exploring other validation techniques in conjunction could provide a more comprehensive view of the model’s robustness.
In conclusion, the findings of this study provide a foundational step for future research in this domain. With continued advancements and refinements, the path towards precise and efficient identification of Cyclins seems clearer than ever.
5. Conclusions
In the ever-evolving realm of bioinformatics, the identification of cyclins based on amino acid composition has been a notable challenge. This study addressed it by combining an SVM with feature extraction and selection techniques. From a pool of eighteen distinct categories of features, the method reduced the dimensionality to just eight, with only a marginal loss of accuracy relative to the larger 130-feature set. This reduction, achieved through ANOVA ranking and the greedy algorithm, simplified the model and provided a safeguard against overfitting.
In direct comparison to previous models, particularly the one outlined by Mohabatkar [25], the approach delineated in this research was demonstrably superior. Beyond mere numerical supremacy, the findings have broader implications. They underscore the invaluable role of meticulous feature extraction in the bioinformatics domain, potentially paving the way for more streamlined and efficient models in the future.
Moreover, the research fills an important gap in the existing literature, serving as an exemplar of how judicious computational methods can be harmoniously combined with biological data to yield robust results. As the world stands on the cusp of more advanced computational and biological integrations, the methodologies and results from this study provide a solid foundation for future endeavours aiming to solve complex biological puzzles.
The data used to support the findings of this study are available from the corresponding author upon request.
The author declares that they have no conflicts of interest.