1. MISATO - Machine learning dataset of protein-ligand complexes for structure-based drug discovery
- Author
-
Till Siebenmorgen, Filipe Menezes, Sabrina Benassou, Erinc Merdivan, Stefan Kesselheim, Marie Piraud, Fabian J. Theis, Michael Sattler, and Grzegorz M. Popowicz
- Abstract
Large language models (LLMs) have greatly enhanced our ability to understand biology and chemistry. Yet, relatively few robust methods have been reported for structure-based drug discovery. Highly precise biomolecule-ligand interaction datasets are urgently needed in particular for LLMs, that require extensive training data. We present MISATO, the first dataset that combines quantum mechanics properties of small molecules and associated molecular dynamics simulations of about 20000 experimental protein-ligand complexes. Starting from the PDBbind dataset, semi-empirical quantum mechanics was used to systematically refine these structures. The largest collection to date of molecular dynamics traces of protein-ligand complexes in explicit water are included, accumulating to 170 μs. We give ML baseline models and simple Python data loaders, and aim to foster a thriving community around MISATO (https://github.com/t7morgen/misato-dataset). An easy entry point for ML experts is provided without the need of deep domain expertise to enable the next generation of drug discovery AI models.
- Published
- 2023
- Full Text
- View/download PDF