From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis

Authors :
Chris Ding
Lin Yan
Hua Wang
Heng Huang
Source :
IEEE/ACM Transactions on Computational Biology and Bioinformatics. 14:503-513
Publication Year :
2017
Publisher :
Institute of Electrical and Electronics Engineers (IEEE), 2017.

Abstract

Sequence describes the primary structure of a protein, which contains important structural, characteristic, and genetic information and thereby motivates many sequence-based computational approaches to infer protein function. Among them, feature-based approaches attract increased attention because they make predictions from a set of transformed and more biologically meaningful sequence features. However, the original features extracted from a sequence are usually of high dimensionality and often compromised by irrelevant patterns; dimension reduction is therefore necessary prior to classification for efficient and effective protein function prediction. A protein usually performs several different functions within an organism, which makes protein function prediction a multi-label classification problem. In machine learning, multi-label classification deals with problems where each object may belong to more than one class. As a well-known feature reduction method, linear discriminant analysis (LDA) has been successfully applied in many practical applications. By nature, however, it is designed for single-label classification, in which each object belongs to exactly one class. Because directly applying LDA in multi-label classification causes ambiguity when computing scatter matrices, we apply a new Multi-label Linear Discriminant Analysis (MLDA) approach to address this problem while preserving the powerful classification capability inherited from classical LDA. We further extend MLDA by $\ell_1$-normalization to overcome the problem of over-counting data points with multiple labels. In addition, we incorporate biological network data into our method using Laplacian embedding, and assess the reliability of predicted putative functions. Extensive empirical evaluations demonstrate promising results of our methods.
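
The abstract does not state the scatter-matrix formulas, so the following is only a minimal sketch of one plausible reading: each data point contributes to every class it is labeled with, weighted by a label matrix, and $\ell_1$-normalizing each row of that matrix makes every point carry a total weight of 1 regardless of how many labels it has. The function names `l1_normalize` and `classwise_scatter`, the toy data, and the exact weighting scheme are assumptions for illustration, not the authors' implementation.

```python
def l1_normalize(Y):
    """Scale each row of a 0/1 label matrix so its weights sum to 1."""
    normalized = []
    for row in Y:
        total = sum(abs(v) for v in row)
        # A point with no labels keeps an all-zero row.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def classwise_scatter(X, Y):
    """Return (S_w, S_b), treating Y[i][k] as point i's weight in class k.

    Assumed formulation: scatter matrices are accumulated class by class,
    so a multi-label point appears in each of its classes.
    """
    n, d, K = len(X), len(X[0]), len(Y[0])
    mass = sum(sum(row) for row in Y)  # total label weight
    # Global mean, with each point weighted by its total label weight.
    mean = [sum(sum(Y[i]) * X[i][j] for i in range(n)) / mass
            for j in range(d)]
    S_w = [[0.0] * d for _ in range(d)]
    S_b = [[0.0] * d for _ in range(d)]
    for k in range(K):
        n_k = sum(Y[i][k] for i in range(n))  # weight of class k
        if n_k == 0:
            continue
        m_k = [sum(Y[i][k] * X[i][j] for i in range(n)) / n_k
               for j in range(d)]
        for a in range(d):
            for b in range(d):
                # Between-class term: class mean versus global mean.
                S_b[a][b] += n_k * (m_k[a] - mean[a]) * (m_k[b] - mean[b])
                # Within-class term: points versus their class mean.
                for i in range(n):
                    S_w[a][b] += (Y[i][k] * (X[i][a] - m_k[a])
                                  * (X[i][b] - m_k[b]))
    return S_w, S_b


# Toy data: three points, two classes; the middle point has both labels.
X = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
Y_raw = [[1, 0], [1, 1], [0, 1]]
Y_norm = l1_normalize(Y_raw)

raw_mass = sum(sum(row) for row in Y_raw)    # double-labeled point counted twice
norm_mass = sum(sum(row) for row in Y_norm)  # every point counted once
S_w, S_b = classwise_scatter(X, Y_norm)
```

With the raw 0/1 labels the total weight exceeds the number of points because the double-labeled point is counted once per class; after row-wise $\ell_1$-normalization the total weight equals the number of points. A projection would then be obtained as in classical LDA from these two matrices, which this sketch leaves out.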

Details

ISSN :
2374-0043 and 1545-5963
Volume :
14
Database :
OpenAIRE
Journal :
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Accession number :
edsair.doi.dedup.....d0c08fc803b428f8f6e03a9b7776115f