Back to Search Start Over

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.

Authors :
Zhang, Xuechun
Hu, Xiaoxuan
Zhang, Tongtong
Yang, Ling
Liu, Chunhong
Xu, Ning
Wang, Haoyi
Sun, Wen
Source :
Briefings in Bioinformatics. Sep2024, Vol. 25 Issue 5, p1-10. 10p.
Publication Year :
2024

Abstract

Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, the adoption of high-throughput experimental screening coupled with tailored bioinformatic prediction has witnessed a rapidly growing trend for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural database like the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previous reported models, achieving a notable 6.4% increase in accuracy, 9.0% increase in F1_score, and 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation utilizing our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both independent test set and experimental set, thereby making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
14675463
Volume :
25
Issue :
5
Database :
Academic Search Index
Journal :
Briefings in Bioinformatics
Publication Type :
Academic Journal
Accession number :
179874108
Full Text :
https://doi.org/10.1093/bib/bbae404