Harvesting classification trees for drug discovery
- Source :
- Journal of chemical information and modeling. 52(12)
- Publication Year :
- 2012
Abstract
- Millions of compounds are available as potential drug candidates. High-throughput screening (HTS) is widely used in drug discovery to assay compounds for a particular biological activity. A common approach is to build a classification model using a smaller sample of assay data to predict the activity of unscreened compounds and hence select further compounds for assay. This improves the efficiency of the search by increasing the proportion of hits found among the assayed compounds. In many assays, the biological activity is dichotomized into a binary indicator variable; the explanatory variables are chemical descriptors capturing compound structure. A tree model is interpretable, which is key, since it is of interest to identify diverse chemical classes among the active compounds to serve as leads for drug optimization. Interpretability of a tree is often reduced, however, by the sheer size of the tree model and the number of variables and rules of the terminal nodes. We develop a "tree harvesting" algorithm to filter out redundant "junk" rules from the tree while retaining its predictive accuracy. This simplification can facilitate the process of uncovering key relations between molecular structure and activity and may clarify rules defining multiple activity mechanisms. Using data from the National Cancer Institute, we illustrate that many of the rules used to build a classification tree may be redundant. Unlike tree pruning, tree harvesting allows variables with junk rules to be removed near the top of the tree. The reduction in complexity of the terminal nodes improves the interpretability of the model. The algorithm also aims to reorganize the tree nodes associated with the interesting "active" class into larger, more coherent groups, thus facilitating identification of the mechanisms for activity.
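The idea of filtering redundant rules out of a terminal node can be illustrated with a small sketch. This is not the authors' tree harvesting algorithm, whose details are not given in this record; it is a simplified greedy filter written under stated assumptions. The descriptor names (`mol_wt`, `n_rings`, `logp`), the toy data, and the helper functions are all hypothetical. A rule is treated as "junk" here if dropping it from the node's conjunction does not lower the hit rate among the compounds the node covers.

```python
from typing import Dict, List, Tuple

# A rule is (descriptor name, operator, threshold). All names are hypothetical.
Rule = Tuple[str, str, float]
Compound = Dict[str, float]


def satisfies(compound: Compound, rule: Rule) -> bool:
    """Check one split rule of the form 'descriptor <= t' or 'descriptor > t'."""
    name, op, threshold = rule
    value = compound[name]
    return value <= threshold if op == "<=" else value > threshold


def covered(compounds: List[Compound], rules: List[Rule]) -> List[int]:
    """Indices of compounds that satisfy every rule in the conjunction."""
    return [i for i, c in enumerate(compounds)
            if all(satisfies(c, r) for r in rules)]


def hit_rate(labels: List[int], idx: List[int]) -> float:
    """Proportion of active compounds (label 1) among the given indices."""
    return sum(labels[i] for i in idx) / len(idx) if idx else 0.0


def drop_redundant_rules(compounds: List[Compound], labels: List[int],
                         rules: List[Rule]) -> List[Rule]:
    """Greedily drop rules whose removal does not lower the node's hit rate.

    Assumption: 'redundant' means the hit rate stays at or above the
    hit rate of the original, full conjunction.
    """
    kept = list(rules)
    baseline = hit_rate(labels, covered(compounds, kept))
    for rule in rules:
        trial = [r for r in kept if r != rule]
        if hit_rate(labels, covered(compounds, trial)) >= baseline:
            kept = trial
    return kept


# Toy assay data (invented for illustration): 1 = active, 0 = inactive.
compounds = [
    {"mol_wt": 300.0, "n_rings": 3.0, "logp": 2.0},
    {"mol_wt": 450.0, "n_rings": 4.0, "logp": 4.0},
    {"mol_wt": 600.0, "n_rings": 3.0, "logp": 6.0},
    {"mol_wt": 320.0, "n_rings": 1.0, "logp": 1.0},
    {"mol_wt": 550.0, "n_rings": 2.0, "logp": 3.0},
    {"mol_wt": 280.0, "n_rings": 0.0, "logp": 5.5},
]
labels = [1, 1, 1, 0, 0, 0]

# Rules along the path from the root to one "active" terminal node.
path = [("mol_wt", "<=", 500.0), ("n_rings", ">", 2.0), ("logp", "<=", 5.0)]

kept = drop_redundant_rules(compounds, labels, path)
print(kept)
print(covered(compounds, path), covered(compounds, kept))
```

In this toy case the first rule on the path (the one nearest the root) is among those dropped, and the simplified node covers more active compounds than the original one. A real implementation would also need to check predictive accuracy on held-out data, which this sketch does not do.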
- Subjects :
- Chemical descriptors
Anti-HIV Agents
Databases, Pharmaceutical
General Chemical Engineering
High-throughput screening
Library and Information Sciences
Biology
Machine learning
Drug Discovery
Humans
Compound structure
Interpretability
Drug discovery
Biological activity
General Chemistry
National Cancer Institute (U.S.)
United States
Computer Science Applications
High-Throughput Screening Assays
Tree (data structure)
Pharmaceutical Preparations
HIV-1
Data mining
Artificial intelligence
Decision tree model
Algorithms
Details
- ISSN :
- 1549-960X
- Volume :
- 52
- Issue :
- 12
- Database :
- OpenAIRE
- Journal :
- Journal of chemical information and modeling
- Accession number :
- edsair.doi.dedup.....726493b111da7ecbdfd0c6129525283d