1. An approach to functionally relevant clustering of the protein universe: Active site profile-based clustering of protein structures and sequences.
- Author
Knutson, Stacy T; Westwood, Brian M; Leuthaeuser, Janelle B; Turner, Brandon E; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D; Harper, Angela F; Brown, Shoshana D; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C; Fetrow, Jacquelyn S
- Subjects
Phosphopyruvate Hydratase; Glutathione Transferase; Sequence Analysis, Protein; Databases, Protein; active site profile; function annotation; functional site profile; functionally relevant clustering; isofunctional clusters; mechanistic determinants; misannotation; Biophysics; Biochemistry and Cell Biology; Computation Theory and Mathematics; Other Information and Computing Sciences
- Abstract
Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow identification of mechanistic determinants: the amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two-Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure-Function Linkage Database, SFLD) self-identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self-identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well-curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP-identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F-measure and performance analysis on the enolase search results and comparison to GEMMA and SCI-PHY demonstrate that TuLIP avoids the over-division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results.
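The divisive loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `divisive_clustering`, `split`, and `self_identifies` are hypothetical: `split` stands in for whatever step proposes candidate clusters from a parent cluster, and `self_identifies` stands in for the DASP2 self-identification test. The sketch assumes `split` partitions its input into non-overlapping parts and does not model the two-level aspect of TuLIP.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Cluster = FrozenSet[str]


def divisive_clustering(
    structures: Iterable[str],
    split: Callable[[Cluster], Iterable[Iterable[str]]],
    self_identifies: Callable[[Cluster], bool],
) -> Tuple[List[Cluster], List[Cluster]]:
    """Divide clusters until each structure is in an accepted group or a singlet.

    `split` and `self_identifies` are hypothetical placeholders supplied by
    the caller; `split` is assumed to partition its input.
    """
    groups: List[Cluster] = []    # clusters that passed the self-identification test
    singlets: List[Cluster] = []  # structures left on their own
    to_divide: List[Cluster] = [frozenset(structures)]

    while to_divide:
        parent = to_divide.pop()
        candidates = [frozenset(c) for c in split(parent)]
        candidates = [c for c in candidates if c]  # drop empty parts
        if len(candidates) < 2:
            # The split made no progress; fall back to singlets so the loop ends.
            candidates = [frozenset([s]) for s in parent]
        for candidate in candidates:
            if len(candidate) == 1:
                singlets.append(candidate)
            elif self_identifies(candidate):
                groups.append(candidate)
            else:
                to_divide.append(candidate)  # divide again in a later iteration

    return groups, singlets
```

Passing the test as a callable keeps the control flow separate from the scoring, so a sketch like this can be exercised with a toy predicate before any real profile search is wired in.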
- Published
- 2017