Critiquing Protein Family Classification Models Using Sufficient Input Subsets
- Publication Year :
- 2020
- Publisher :
- Mary Ann Liebert Inc, 2020.
Abstract
- In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. In response, we propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the sufficient input subsets technique, which we use to identify subsets of features (SIS) in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools expose that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across choices of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
- Subjects :
- model selection
Initialization
Machine learning
computer.software_genre
Models, Biological
Task (project management)
Set (abstract data type)
Machine Learning
03 medical and health sciences
0302 clinical medicine
Deep Learning
Genetics
protein classification
Humans
Molecular Biology
030304 developmental biology
Interpretability
0303 health sciences
Network architecture
Artificial neural network
business.industry
Model selection
Deep learning
Scale (chemistry)
protein domain
Computational Biology
Proteins
neural networks
Computational Mathematics
Computational Theory and Mathematics
030220 oncology & carcinogenesis
Modeling and Simulation
Multigene Family
Artificial intelligence
Neural Networks, Computer
business
interpretability
computer
Details
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....de98b8bd73be5f5458c3ca5b94baa670
- Full Text :
- https://doi.org/10.17863/cam.58812