
Improvements in viral gene annotation using large language models and soft alignments

Authors :
William L. Harrigan
Barbra D. Ferrell
K. Eric Wommack
Shawn W. Polson
Zachary D. Schreiber
Mahdi Belcaid
Source :
BMC Bioinformatics, Vol 25, Iss 1, Pp 1-19 (2024)
Publication Year :
2024
Publisher :
BMC, 2024.

Abstract

Background: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this challenge by annotating protein sequences based on embeddings.

Results: Central to our contribution is the soft alignment algorithm, which draws from traditional protein alignment but leverages embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method surpasses pooled embedding-based models not only in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.

Conclusion: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
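The abstract describes soft alignment as a traditional alignment in which per-residue embedding similarity takes the place of a substitution matrix. The following is a minimal sketch of that general idea, not the authors' implementation: it runs a Smith-Waterman-style local alignment over cosine similarities between per-residue vectors. The function names (`cosine`, `soft_align`), the `gap` penalty, and the `offset` subtracted from each similarity are all assumptions chosen for illustration, and the toy two-dimensional vectors stand in for real LLM embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def soft_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Local alignment scored by embedding similarity instead of a scoring matrix.

    emb_a, emb_b: lists of per-residue embedding vectors (one vector per amino acid).
    gap, offset:  hypothetical parameters; offset shifts cosine similarity so that
                  dissimilar residue pairs receive a negative score.
    Returns (best_score, aligned_pairs), where aligned_pairs holds 0-based
    (index_in_a, index_in_b) tuples for the matched residues.
    """
    n, m = len(emb_a), len(emb_b)
    # Precompute the residue-by-residue score matrix once.
    score = [[cosine(a, b) - offset for b in emb_b] for a in emb_a]

    # Dynamic-programming table with a zero border (local alignment).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + score[i - 1][j - 1],  # align residue i with j
                H[i - 1][j] - gap,                      # gap in sequence b
                H[i][j - 1] - gap,                      # gap in sequence a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell to recover which residues were paired.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if math.isclose(H[i][j], H[i - 1][j - 1] + score[i - 1][j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy example: three "residues" against two, using made-up 2-D embeddings.
toy_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_b = [[0.0, 1.0], [1.0, 1.0]]
toy_score, toy_pairs = soft_align(toy_a, toy_b)
```

Because the traceback returns explicit residue pairs, the result can be rendered residue by residue in the way a BLAST-style alignment view would be, which is the interpretability property the abstract contrasts with pooled (sequence-level) embeddings.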

Details

Language :
English
ISSN :
1471-2105
Volume :
25
Issue :
1
Database :
Directory of Open Access Journals
Journal :
BMC Bioinformatics
Publication Type :
Academic Journal
Accession number :
edsdoj.11da29dbb5db42a19a182a6b11c4be02
Document Type :
article
Full Text :
https://doi.org/10.1186/s12859-024-05779-6