ProtPlat: an efficient pre-training platform for protein classification based on FastText

Authors :
Yuan Jin
Yang Yang
Source :
BMC Bioinformatics, Vol 23, Iss 1, Pp 1-17 (2022)
Publication Year :
2022
Publisher :
BMC, 2022.

Abstract

Background: Over the past decades, the rapid growth of protein sequence data in public databases has driven the development of many machine learning methods that predict physicochemical properties or functions of proteins from amino acid sequence features. However, prediction performance often suffers from a lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample problem in computer vision and natural language processing, but few pre-training techniques have been designed specifically for protein sequences.

Results: In this paper, we propose ProtPlat, a pre-training platform for representing protein sequences. ProtPlat uses the Pfam database to train a three-layer neural network and then fine-tunes the model on the training data of a specific downstream task. ProtPlat learns good representations for amino acids while also achieving efficient classification. We conduct experiments on three protein classification tasks: the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training enhances model performance effectively and that ProtPlat is competitive with state-of-the-art predictors, especially on small datasets. We implement the ProtPlat platform as a publicly accessible web service (https://compbio.sjtu.edu.cn/protplat).

Conclusions: To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for pre-training on protein sequences. It features large-scale supervised training on the Pfam database and an efficient learning model, FastText. The experimental results on three downstream classification tasks demonstrate the efficacy of ProtPlat.
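As a rough illustration of the supervised pre-training setup described in the abstract, the sketch below converts labeled protein sequences into the plain-text format used by fastText's supervised mode, where each line starts with a `__label__` prefix followed by space-separated tokens. The abstract does not state how ProtPlat tokenizes sequences; treating each amino acid as one token is an assumption here, and the function name `to_fasttext_line`, the family labels, and the toy sequences are hypothetical.

```python
def to_fasttext_line(label: str, sequence: str) -> str:
    """Format one labeled sequence as a fastText supervised-training line.

    Assumption: each residue is a separate token, so the sequence is
    upper-cased and split into single letters joined by spaces.
    """
    tokens = " ".join(sequence.strip().upper())
    return f"__label__{label} {tokens}"


# Toy, made-up examples: (family label, amino acid sequence).
examples = [
    ("FAMILY_A", "MKTAYIAK"),
    ("FAMILY_B", "GSHMLEDP"),
]

# One training line per sequence; these could be written to a text file.
lines = [to_fasttext_line(label, seq) for label, seq in examples]
for line in lines:
    print(line)
```

A file of such lines, with protein family identifiers as labels, would correspond to the pre-training stage. The same formatting with task-specific labels (for example, effector versus non-effector) would correspond to the fine-tuning stage.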

Details

Language :
English
ISSN :
14712105
Volume :
23
Issue :
1
Database :
Directory of Open Access Journals
Journal :
BMC Bioinformatics
Publication Type :
Academic Journal
Accession number :
edsdoj.31df63fc5f84405f941f21c3d9e6a87e
Document Type :
article
Full Text :
https://doi.org/10.1186/s12859-022-04604-2