1. Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set
- Author
- Robert Körner, Gilles Marcou, Huanxiang Liu, Dragos Horvath, Roberto Todeschini, Phuong Dao, Xiaojun Yao, Douglas M. Young, Paola Gramatica, A. Varnek, A. Artemenko, Todd M. Martin, Anil Kumar Pandey, Farhad Hormozdiari, Eugene N. Muratov, Alexander Tropsha, Christophe Muller, Artem Cherkasov, Tomas Öberg, Katja Hansen, Lili Xi, Timon Schroeter, Pavel G. Polishchuk, Sergii Novotarskyi, Jiazhong Li, Volodymyr V. Prokopenko, Denis Fourches, Victor E. Kuz’min, Cenk Sahinalp, Igor I. Baskin, Klaus-Robert Müller, Igor V. Tetko, Iurii Sushko; Chimie de la matière complexe (CMC), Université de Strasbourg (UNISTRA)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Quantitative structure–activity relationship, General Chemical Engineering, Library and Information Sciences, 01 natural sciences, Standard deviation, Set (abstract data type), 03 medical and health sciences, CHIM/01 - CHIMICA ANALITICA, Similarity (network science), 030304 developmental biology, Mathematics, 0303 health sciences, Principal Component Analysis, QSAR, Mutagenicity Tests, mutagenicity, General Chemistry, Classification, 0104 chemical sciences, Computer Science Applications, Ames test, Data set, 010404 medicinal & biomolecular chemistry, Benchmarking, Test set, Metric (mathematics), Data mining, Algorithm, [CHIM.CHEM]Chemical Sciences/Cheminformatics, Applicability domain
- Abstract
The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1.
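The abstract does not give formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: the DM of a compound is taken as the standard deviation of the predictions of an ensemble of models, compounds are ranked from lowest to highest DM, and the reported coverage is the largest fraction of top-ranked compounds whose cumulative accuracy still reaches a target such as 90%. The function names `ensemble_std_dm` and `coverage_at_accuracy`, the ranking procedure, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def ensemble_std_dm(ensemble_probs):
    """Distance to model as the spread of an ensemble's predictions.

    ensemble_probs: array of shape (n_models, n_compounds) holding each
    model's predicted probability of the positive (mutagenic) class.
    Returns one population standard deviation per compound.
    """
    return np.asarray(ensemble_probs, dtype=float).std(axis=0)


def coverage_at_accuracy(dm, y_true, y_pred, target=0.90):
    """Largest fraction of compounds, ranked by increasing DM, whose
    cumulative prediction accuracy is still at least `target`.

    Assumption: low DM means the compound is closer to the model.
    """
    order = np.argsort(np.asarray(dm), kind="stable")
    correct = np.asarray(y_true)[order] == np.asarray(y_pred)[order]
    # Accuracy over the k lowest-DM compounds, for k = 1..n
    cum_acc = np.cumsum(correct) / np.arange(1, correct.size + 1)
    ok = np.nonzero(cum_acc >= target)[0]
    return 0.0 if ok.size == 0 else (ok[-1] + 1) / correct.size


# Toy example (made-up numbers, for illustration only)
toy_dm = ensemble_std_dm([[0.0, 0.5], [1.0, 0.5]])
toy_coverage = coverage_at_accuracy(
    dm=[0.05, 0.10, 0.30, 0.40],
    y_true=[1, 0, 0, 0],
    y_pred=[1, 0, 1, 0],
)
```

In this toy case the two lowest-DM compounds are predicted correctly and the third is not, so only half of the set can be covered at the 90% target.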
- Published
- 2010