Large language models generate functional protein sequences across diverse families
- Authors
Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser, and Nikhil Naik
- Subjects
Biochemistry and Cell Biology, Biological Sciences, Machine Learning and Artificial Intelligence, Generic health relevance, Estrogens, Conjugated (USP), Amino Acid Sequence, Proteins, Chorismate Mutase, Language
- Abstract
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve the controllable generation of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
- Published
2023
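The abstract's key mechanism is conditioning an autoregressive language model on control tags that specify protein properties, such as a protein family. The sketch below is a loose, hypothetical illustration of that idea, not the authors' model or code: it prepends a family tag to the context and samples residues token by token. The tag names and the toy scoring function are assumptions made only so the example runs.

```python
import math
import random

# Hypothetical sketch of control-tag-conditioned generation in the spirit
# of the ProGen abstract; NOT the authors' implementation. Tag names and
# the toy scoring function are stand-in assumptions.

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues
CONTROL_TAGS = ["<pf:lysozyme>", "<pf:chorismate_mutase>"]  # illustrative tags
END = "<eos>"
VOCAB = CONTROL_TAGS + AMINO_ACIDS + [END]

def toy_scores(context):
    """Stand-in for a trained language model. A real model would score the
    next token given the context; random scores keep this sketch runnable."""
    return {tok: random.random() for tok in VOCAB}

def sample_sequence(tag, max_len=40, temperature=1.0):
    """Autoregressively sample residues conditioned on a prepended control tag."""
    context = [tag]  # conditioning: the control tag is the first token
    residues = []
    for _ in range(max_len):
        scores = toy_scores(context)
        candidates = AMINO_ACIDS + [END]  # only residues or end-of-sequence
        weights = [math.exp(scores[t] / temperature) for t in candidates]
        tok = random.choices(candidates, weights=weights, k=1)[0]
        if tok == END:
            break
        residues.append(tok)
        context.append(tok)
    return "".join(residues)

print(sample_sequence("<pf:lysozyme>"))
```

In the system the abstract describes, the scoring function is a large language model trained on 280 million sequences from >19,000 families, and fine-tuning on a curated family's sequences and tags shifts the conditional distribution toward that family; the random scores above merely mark where that model would sit.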