1. Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments
- Author
- Marc Delarue, Henri Orland, Patrice Koehl
- Affiliations: University of California [Davis] (UC Davis), University of California (UC); Institut de Physique Théorique - UMR CNRS 3681 (IPHT), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS); Dynamique structurale des Macromolécules / Structural Dynamics of Macromolecules, Institut Pasteur [Paris] (IP)-Centre National de la Recherche Scientifique (CNRS)
- Funding: This research received no external funding.
- Acknowledgments: The work discussed here originated from a visit by P.K. at the Institut de Physique Théorique, CEA Saclay, France, during the fall of 2018. He thanks them for their hospitality and financial support. We thank D. Jones and his coworkers and W.F. Vranken for making their databases of test multiple sequence alignments available.
- Subjects
0301 basic medicine, Normal Distribution, Pharmaceutical Science, Multivariate normal distribution, Article, Analytical Chemistry, lcsh:QD241-441, 03 medical and health sciences, Medicinal and Biomolecular Chemistry, lcsh:Organic chemistry, Dimension (vector space), Models, Theoretical and Computational Chemistry, Drug Discovery, Covariate, Amino Acid Sequence, Physical and Theoretical Chemistry, Amino Acids, contact predictions, Mathematics, Quantitative Biology::Biomolecules, Sequence, Models, Statistical, Multiple sequence alignment, covariation, Substitution (logic), Organic Chemistry, Proteins, Statistical, 030104 developmental biology, Chemistry (miscellaneous), Principal component analysis, Mutation (genetic algorithm), Molecular Medicine, multiple sequence alignment, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], Biological system, Sequence Alignment, Algorithms
- Abstract
Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding in which each amino acid is represented by a vector with a one along its own dimension and zeros elsewhere. For representations based on substitution matrices, the optimum is reached by using 10 to 12 principal components, but the corresponding performance remains below that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
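As an illustration of the 20-dimensional binary encoding described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function names, the handling of gaps (left as all-zero vectors), the identity-based prior with its `weight` parameter, and the Frobenius-norm coupling score are all assumptions chosen for simplicity; the abstract does not specify these details.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(msa):
    """Encode an MSA (list of equal-length strings) as an (N, 20 * L) array.

    Each residue becomes a 20-dimensional vector with a one along its own
    dimension and zeros elsewhere. Gaps and unknown symbols are left as
    all-zero vectors (an assumption made for this sketch).
    """
    n_seq, length = len(msa), len(msa[0])
    x = np.zeros((n_seq, length, 20))
    for s, seq in enumerate(msa):
        for p, aa in enumerate(seq):
            idx = AA_INDEX.get(aa)
            if idx is not None:
                x[s, p, idx] = 1.0
    return x.reshape(n_seq, length * 20)


def regularized_covariance(x, weight=0.5):
    """Blend the empirical covariance with a simple diagonal prior.

    The prior (identity / 20) and the blending weight are placeholders
    standing in for whatever prior a real model would use.
    """
    emp = np.cov(x, rowvar=False, bias=True)
    prior = np.eye(x.shape[1]) / 20.0
    return (1.0 - weight) * emp + weight * prior


def coupling_scores(cov, length):
    """Score position pairs by the Frobenius norm of each 20 x 20 block
    of the precision (inverse covariance) matrix."""
    prec = np.linalg.inv(cov)
    blocks = prec.reshape(length, 20, length, 20)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))


# Toy alignment, invented purely for illustration.
toy_msa = ["ACD", "ACE", "GCD", "GCE"]
encoded = one_hot_encode(toy_msa)
cov = regularized_covariance(encoded)
scores = coupling_scores(cov, length=3)
```

The prior matters here for a practical reason: with few sequences the empirical covariance is singular, so some regularization is needed before it can be inverted.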
- Published
- 2018