Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets

Authors :
Seonwoo Min
James T. Morton
Sungroh Yoon
Christian Dallago
Amy X. Lu
Konstantin Schütze
Maria Littmann
Burkhard Rost
Michael Heinzinger
Kevin K. Yang
Tobias Olenyi
Source :
Current Protocols. 1
Publication Year :
2021
Publisher :
Wiley, 2021.

Abstract

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time required by previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be used as input features for machine learning libraries to develop methods that predict particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remains for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline
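
The idea behind Basic Protocol 4, using fixed-length per-protein embeddings as input features for a classifier, can be illustrated with a minimal sketch. It does not use the bio_embeddings pipeline or any protein LM: the vectors are synthetic stand-ins, the labels ("soluble", "membrane-bound") and all function names are hypothetical, and a simple nearest-centroid rule is substituted for whatever classifier the protocol itself trains.

```python
import random
from statistics import fmean


def make_toy_embeddings(n_per_class, dim, seed=0):
    """Return (vector, label) pairs drawn from two Gaussian clouds.

    ASSUMPTION: these synthetic vectors only stand in for real
    per-protein embeddings; no protein language model is involved.
    """
    rng = random.Random(seed)
    centers = {"soluble": 1.0, "membrane-bound": -1.0}  # hypothetical labels
    data = []
    for label, center in centers.items():
        for _ in range(n_per_class):
            vec = [rng.gauss(center, 0.5) for _ in range(dim)]
            data.append((vec, label))
    rng.shuffle(data)
    return data


def fit_centroids(samples):
    """Average the training vectors of each label, dimension by dimension."""
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    return {
        label: [fmean(column) for column in zip(*vecs)]
        for label, vecs in by_label.items()
    }


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(centroids[label], vec))


# Split the toy data, fit on the training part, score on the held-out part.
data = make_toy_embeddings(n_per_class=50, dim=16, seed=0)
train, test = data[:80], data[80:]
centroids = fit_centroids(train)
accuracy = sum(predict(centroids, vec) == label for vec, label in test) / len(test)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

With real data, the toy vectors would be replaced by embeddings computed from sequences and the labels by experimental annotations; the train/held-out split and the evaluation step would keep the same shape.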

Details

ISSN :
2691-1299
Volume :
1
Database :
OpenAIRE
Journal :
Current Protocols
Accession number :
edsair.doi.dedup.....935acf08f90deb57e18b2a291f42ca39