A holistic approach towards a generalizable machine learning predictor of cell penetrating peptides.
- Source :
- Australian Journal of Chemistry. 2023, Vol. 76 Issue 6-8, p493-506. 14p.
- Publication Year :
- 2023
Abstract
- The development of machine learning (ML) predictors does not necessarily require expansive classifiers and complex feature encoding schemes to achieve the highest accuracy scores. Rather, it requires data pre-processing, feature optimization, and robust evaluation to ensure consistent results and generalizability. Herein, we describe a multi-stage process to develop a reliable ML predictor of cell penetrating peptides (CPPs). We emphasize the challenges of: (i) generating representative datasets with all required pre-processing procedures; (ii) comprehensive and exclusive encoding of peptides using their amino acid composition; (iii) obtaining an optimized feature set using a simple classifier (support vector machine, SVM); (iv) ensuring consistent results; and (v) verifying generalizability at the highest achievable accuracy scores. Two peptide sub-spaces were used to generate the negative examples, which are required, along with the known CPPs, to train the classifier. These were: (i) randomly generated peptides with all amino acid types equally represented and (ii) peptides extracted from receptor proteins. Results indicated that the randomly generated dataset performed perfectly well within its own peptide sub-space but generalized poorly to the other sub-space. Conversely, the dataset extracted from receptor proteins, while achieving lower accuracies, showed perfect generalizability to the other peptide sub-space. We combined the qualities of these two datasets by using the average of their predictions within our ultimate framework. This functional ML predictor, WLVCPP, and associated software and datasets can be downloaded from https://github.com/BahaaIsmail/WLVCPP.
Testing machine learning predictors of cell penetrating peptides against new peptide candidates is often disappointing, despite the use of sophisticated classifiers and complex feature encoding schemes. Unrepresentative datasets, ill-conditioned data, redundant feature sets, and deficient evaluation procedures commonly lead to inconsistent results and ungeneralizable models. Accordingly, we propose a holistic strategy to negate sources of error and obtain a stable, consistent, accurate, and generalizable model. [ABSTRACT FROM AUTHOR]
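Three of the steps named in the abstract can be illustrated with a minimal Python sketch: amino acid composition encoding, random negative-example generation, and averaging the predictions of two models. This is not the WLVCPP implementation. The function names are hypothetical, uniform per-position sampling is an assumed reading of "equally represented", and the two models are stand-in callables rather than trained SVMs.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(peptide):
    """Return the fraction of each standard residue (a 20-value vector).

    Fractions sum to 1 only if the peptide contains standard residues only.
    """
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide.upper())
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def random_peptide(length, rng):
    """Draw a peptide with every residue type equally likely at each position
    (assumption: one plausible reading of 'equally represented')."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def averaged_score(features, model_a, model_b):
    """Average the scores of two models, e.g. one trained against random
    negatives and one against receptor-derived negatives."""
    return (model_a(features) + model_b(features)) / 2.0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    negative = random_peptide(12, rng)
    features = aa_composition(negative)
    # Placeholder scorers standing in for two trained classifiers.
    score = averaged_score(features, lambda f: 0.25, lambda f: 0.75)
    print(negative, score)
```

In practice each placeholder scorer would be replaced by a classifier trained on one of the two negative datasets, with both returning scores on the same scale so that the average is meaningful.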
- Subjects :
- *MACHINE learning
*PEPTIDES
*AMINO acids
*SUPPORT vector machines
*PROTEIN receptors
Details
- Language :
- English
- ISSN :
- 00049425
- Volume :
- 76
- Issue :
- 6-8
- Database :
- Academic Search Index
- Journal :
- Australian Journal of Chemistry
- Publication Type :
- Academic Journal
- Accession number :
- 169916526
- Full Text :
- https://doi.org/10.1071/CH22247