The language of proteins: NLP, machine learning & protein sequences
- Source :
- Computational and Structural Biotechnology Journal, Vol 19, Pp 1750-1758 (2021)
- Publication Year :
- 2021
- Publisher :
- Elsevier, 2021.
Abstract
- Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep learning and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges at the intersection of NLP and protein research.
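As a rough illustration of the classic encodings the abstract names, the sketch below treats a protein as a string of amino-acid letters, splits it into overlapping k-mers, counts them as a bag-of-words, and hides chosen residues the way a masked language modeling input might be prepared. It is not taken from the article: the function names (`kmers`, `bag_of_kmers`, `mask_residues`), the `#` mask symbol, and the example sequence are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Return all overlapping k-mers (n-grams) of an amino-acid string."""
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence: str, k: int = 3) -> Counter:
    """Count k-mer occurrences, ignoring their order (bag-of-words)."""
    return Counter(kmers(sequence, k))


def mask_residues(sequence: str, positions: list[int],
                  mask: str = "#") -> tuple[str, dict[int, str]]:
    """Replace residues at the given positions with a mask symbol.

    Returns the masked string and a position -> original residue map,
    i.e. the targets a masked language model would be asked to predict.
    The '#' symbol is an arbitrary choice for this sketch.
    """
    residues = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]
        residues[pos] = mask
    return "".join(residues), targets


# Hypothetical example sequence, not a real protein.
seq = "MKTAYIAKQR"
tokens = kmers(seq, 3)
bag = bag_of_kmers(seq, 3)
masked, targets = mask_residues(seq, [1, 4])
```

Overlapping k-mers keep local order that a single-letter bag-of-words discards, at the cost of a vocabulary that grows as 20^k for the standard amino-acid alphabet.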
- Subjects :
- Word embedding
Bioinformatics
Computer science
Language models
Biophysics
Review Article
Machine learning
Contextualized embedding
Biochemistry
Tokenization
03 medical and health sciences
0302 clinical medicine
Structural Biology
Genetics
Word2vec
030304 developmental biology
Transformer
0303 health sciences
Artificial neural networks
Artificial neural network
Deep learning
Natural language processing
Lexical analysis
Full text search
Computer Science Applications
ComputingMethodologies_PATTERNRECOGNITION
Bag-of-words model
030220 oncology & carcinogenesis
Language model
Artificial intelligence
Bag of words
TP248.13-248.65
BERT
Biotechnology
Details
- Language :
- English
- ISSN :
- 2001-0370
- Volume :
- 19
- Database :
- OpenAIRE
- Journal :
- Computational and Structural Biotechnology Journal
- Accession number :
- edsair.doi.dedup.....cebe163abd3b5a87395f15a1d5bfa89b