13 results on '"Chiara Sabatti"'
Search Results
2. Markov models for inferring Copy Number Variations from genotype data on Illumina platforms
- Author
-
Hui Wang, Jan Veldink, Roel Ophoff, and Chiara Sabatti
- Abstract
We develop an algorithm to analyze data from Illumina genotyping arrays for the detection of copy number variations in a single individual or in a random sample of individuals. We use a Hidden Markov Model framework, appropriately extended to take into account linkage disequilibrium between nearby loci. We describe a multisample approach to estimate the frequency of copy number variants in the population. With an appropriate dataset, our methodology simultaneously analyzes the data for copy number variation and tests for association between this variation and a disease trait of interest.
- Published
- 2011
3. Inferring genomic loss and location of tumor suppressor genes from high density genotypes
- Author
-
Hui Wang, Yohan Lee, Stanley Nelson, and Chiara Sabatti
- Abstract
Novel technologies, such as the 10k Affymetrix genotyping array, allow scoring of genetic polymorphisms at a very high density across the genome. This allows researchers to conduct traditional inquiries at an unprecedented resolution, while simultaneously motivating novel types of analysis aimed at exploiting the increased information contained in these datasets. We consider how genotypes of cancer cell lines can be used to reconstruct genomic loss events and map putative tumor suppressor genes (TSG). Using a hidden Markov model framework, we adapt a previously described model for genomic instability in cancers to the current data structure. Simulations indicate that our procedure can be powerful and accurate, and initial application to real data leads to encouraging results.
- Published
- 2011
4. Reconstructing Ancestral Haplotypes with a Dictionary Model
- Author
-
Kristin L. Ayers, Chiara Sabatti, and Kenneth Lange
- Abstract
We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given data set. Application of the model to simulated data gives encouraging results. In a real data set, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
- Published
- 2011
5. Bayesian Gaussian Mixture Models for High Density Genotyping Arrays
- Author
-
Chiara Sabatti and Kenneth Lange
- Abstract
Affymetrix's SNP (single nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of a SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. The present paper describes our attempt to improve genotype calling by constructing Gaussian penetrance models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, Gaussian penetrance models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the BIC criterion, we select the best combination of symmetry assumptions. Compared to the genotype calling results obtained from Affymetrix's software, we are able to reduce the number of no-calls substantially and quantify the level of confidence in all calls. Once pedigree analysis software can accommodate soft penetrances, we can expect to see more reliable association and linkage studies with less wasted genotyping data.
- Published
- 2011
6. Empirical Bayes Estimation of a Sparse Vector of Gene Expression Changes
- Author
-
Stephen Erickson and Chiara Sabatti
- Abstract
Gene microarray technology is often used to compare the expression of thousands of genes in two different cell lines. Typically, one does not expect measurable changes in transcription amounts for a large number of genes; furthermore, the noise level of array experiments is rather high in relation to the available number of replicates. For the purpose of statistical analysis, inference on the “population” difference in expression for genes across the two cell lines is often cast in the framework of hypothesis testing, with the null hypothesis being no change in expression. Given that thousands of genes are investigated at the same time, this requires some multiple comparison correction procedure to be in place. We argue that hypothesis testing, with its emphasis on type I error and family analogues, may not address the exploratory nature of most microarray experiments. We instead propose viewing the problem as one of estimation of a vector known to have a large number of zero components. In a Bayesian framework, we describe the prior knowledge on expression changes using mixture priors that incorporate a mass at zero and we choose a loss function that favors the selection of sparse solutions. We consider two different models applicable to the microarray problem, depending on the nature of replicates available, and show how to explore the posterior distributions of the parameters using MCMC. Simulations show an interesting connection between this Bayesian estimation framework and both false discovery rate (FDR) control and misclassification-minimizing procedures. Finally, two empirical examples illustrate the practical advantages of this Bayesian estimation paradigm.
- Published
- 2011
7. Volume Measures for Linkage Disequilibrium
- Author
-
Yuguo Chen, Chia-Ho Lin, and Chiara Sabatti
- Abstract
We discuss the value of volume measures for linkage disequilibrium, showing how they are robust to small sample variation and easily generalized to multi-allelic markers. In particular we introduce Dvol, a volume analogue to D' and show that it performs substantially better when the sample size is small to moderate. Mvol is proposed as a generalization of this measure to multi-allelic markers. Finally a measure based on homozygosity Hvol is suggested as a generalization of R^2. To evaluate these measures, we introduce a sequential importance sampling algorithm. We illustrate their performance on simulated and real data.
- Published
- 2011
8. Bayesian Sparse Hidden Components Analysis for Transcription Regulation Networks
- Author
-
Chiara Sabatti and Gareth James
- Abstract
We describe a framework where DNA sequence information and expression array data are used in concert to analyze the effects of a collection of regulatory proteins on genomic expression levels. The search for potential binding sites in sequence data leads to the identification of potential target genes for each transcription factor. The analysis of array data with a Bayesian hidden component model allows us to identify which of the potential binding sites are actually used by the regulatory proteins in the studied cell conditions, the strength of their control, and their activation profile in a series of experiments. We apply our methodology to 35 expression studies in E. coli.
- Published
- 2011
9. A Vocabulon Study of E.Coli Regulatory Sites with Feedback to Expression Array Analysis
- Author
-
Chiara Sabatti, Lars Rohlin, Kenneth Lange, and James Liao
- Abstract
The identification of binding sites for regulatory proteins in the upstream region of genes is an important step towards understanding transcription regulation. In recent years, novel experimental techniques, such as gene expression arrays, and the availability of entire genome sequences have opened the possibility of more detailed investigations in this domain. Traditionally, the reconstruction of the profile of a binding site and the localization of all its occurrences in a sequence are treated as separate problems. The first is tackled using a small group of sequences, known or suspected to contain the binding site, but with neither position nor pattern known. One successful approach to this reconstruction problem is based on a probabilistic model of the sequence, represented as a concatenation of background and motif stochastic words. Maximum likelihood or maximum a posteriori estimates are obtained with EM or Gibbs-sampler algorithms [13, 14]. The second problem is approached by considering one or multiple sequences of variable length; the pattern characterizing the motif is assumed known. Possible locations are identified on the basis of scoring functions that highlight the similarity of the motif with portions of the sequence. Cutoff values for such similarity scores are hard to determine: ad hoc solutions or estimates from a training set are often adopted [17, 18]. Typically these techniques are used to scan one sequence of interest against a database of known binding sites. While there are historical and practical reasons to consider these two problems as separate, the current post-genomic era, in which we are confronted with a large abundance of sequence data, calls for a different approach. Consider the problem, tackled in [18], of identifying all the binding sites of the known regulatory proteins in the genome of E. coli. While formally similar to blasting a small sequence of interest against a database of known regulatory proteins, this genome-wide search differs substantially.
- Published
- 2011
10. Co-expression networks reveal the tissue-specific regulation of transcription and splicing
- Author
-
Morgan Diegel, Laure Fresard, Lindsay F. Rizzardi, Yuan He, Monkol Lek, Daniel C. Rohrer, Boxiang Liu, Maximilian Haeussler, Heather M. Traino, Concepcion R. Nierras, Joseph Wheeler, Serghei Mangul, Fan Wu, Hualin S. Xi, Andrew D. Skol, Steven Hunter, Yaping Liu, Casandra A. Trowbridge, Brandon L. Pierce, Daniel Bates, Peter Hickey, Susan E. Koester, Bryan Gillard, Eric R. Gamazon, Jennifer A. Doherty, Jared L. Nedzel, Eric Haugen, Lori E. Brigham, Gao Wang, Dana R. Valley, Zachary Zappala, Emmanouil T. Dermitzakis, Seva Kashin, Ira M. Hall, John Vivian, Philip A. Branton, Barbara E. Stranger, Magali Ruffier, Melina Claussnitzer, Nancy Roche, Michael Washington, Halit Ongen, Brian Jo, Rachna Kumar, Jean Monlong, Yi-Hui Zhou, Kristen Lee, Stephane E. Castel, Mark Miklos, Alisa McDonald, Diego Garrido-Martín, Jimmie B. Vaught, Hae Kyung Im, Leslie H. Sobin, John T. Lonsdale, Audra K. Johnson, Rui Zhang, Nancy J. Cox, Christopher D. Brown, Paul Flicek, Ferran Reverter, Roderic Guigó, Tuuli Lappalainen, Sarah E. Gould, Deborah C. Mash, Michael T. Moser, Andrew B. Nobel, Takunda Matose, Jingchun Zhu, Joe R. Davis, Andrey A. Shabalin, Jie Quan, Pedro G. Ferreira, Taru Tukiainen, Ellen Gelfand, Cédric Howald, Buhm Han, Emily K. Tsang, Andrew P. Feinberg, Caroline Linke, Kane Hadley, Richard Sandstrom, Mark D. Johnson, Joshua M. Akey, Ian C. McDowell, Daniel R. Zerbino, Alexis Battle, Brian Roe, Daniel G. MacArthur, Ellen Karasik, Marcus Hunter, Anjené M. Addington, Thomas Juettemann, Konrad J. Karczewski, Duyen T. Nguyen, Lei Hou, Stephen B. Montgomery, YoSon Park, Nicole C. Lockart, Lin Chen, Rajinder Kaul, Ruiqi Jian, Robert G. Montroy, Xiao Li, Michael Snyder, Beryl B. Cummings, Kimberly M. Valentino, Ariel D. H. Gewirtz, François Aguet, Jeffrey McLean, Gary Walters, Farhad Hormozdiari, William F. Leinweber, Gad Getz, Jeffery P. Struewing, Anne Ndungu, Dan L. Nicolae, Benoit Molinie, Lihua Jiang, Michael Sammeth, W. James Kent, John Palowitch, Brian Craft, Donald F. 
Conrad, Kathryn Demanelis, Jason Bridge, Jin Billy Li, A. Roger Little, Nicholas Van Wittenberghe, Stephen J. Trevanion, Pejman Mohammadi, Michael S. Noble, Kate R. Rosenbloom, Marian S. Fernando, Benjamin J. Strober, Ping Guan, Brunilda Balliu, Yungil Kim, Kevin Myer, Christine B. Peterson, Pushpa Hariharan, Jae Hoon Sul, Abhi Rao, Michael F. Salvatore, Qin Li, Eun Yong Kang, Matthew T. Maurano, Ayellet V. Segrè, Dan Sheppard, Fred A. Wright, Matthew Stephens, Kasper D. Hansen, Chiara Sabatti, Kevin S. Smith, Xin Li, Ruth Barshir, Muhammad G. Kibriya, Farhan N. Damani, Manolis Kellis, Olivier Delaneau, Shin Lin, Richard Hasz, Michael J. Gloudemans, Anita H. Undale, Mary Goldman, Fidencio J. Neri, Katherine H. Huang, David E. Tabor, Manuel Muñoz-Aguirre, Maghboeba Mosavel, Simona Volpi, Latarsha J. Carithers, Anna M. Smith, Genna Gliner, Eleazar Eskin, Nikolaos I Panousis, Benedict Paten, Andrew A. Brown, Jessica Lin, Kieron Taylor, Robert E. Handsaker, Laura Barker, Casey Martin, Meng Wang, Farzana Jasmine, Scott D. Jewell, Nathan S. Abell, Kristin G. Ardlie, Shilpi Singh, Mary Barcus, Anthony Payne, Christopher Lee, Xiaoquan Wen, Nicola J. Rinaldi, Hua Tang, Yongjin Park, Christopher Johns, Saboor Shad, Judith B. Zaugg, Reza Sodaei, Maria M. Tomaszewski, David A. Davis, Joanne Chan, Laura A. Siminoff, Mark I. McCarthy, Ki Sung Um, Karna Robinson, Esti Yeger-Lotem, Martijn van de Bunt, Meritxell Oliva, Jemma Nelson, Negin Vatanian, Colby Chiang, Jeffrey A. Thomas, Alexandra J. Scott, Omer Basha, Jessica Halow, Panagiotis Papasaikas, Barbara A. Foster, Barbara E. Engelhardt, Sarah Kim-Hellmuth, Li Wang, Gireesh K. Bogu, Sandra Linder, Sarah Urbut, Ashis Saha, Gen Li, Bernadette Mestichelli, Chuan Gao, John A. Stamatoyannopoulos, Liqun Qi, Princy Parsana, Helen M. Moore, Gene Kopen, and GTEx, Consortium
- Subjects
Gene isoform; 0301 basic medicine; Genotyping Techniques; Bioinformatics; RNA Splicing; 1.1 Normal biological development and functioning; Gene regulatory network; Method; Genomics; Computational biology; Biology; Medical and Health Sciences; GTEx Consortium; Transcriptome; 03 medical and health sciences; Databases; 0302 clinical medicine; Genetic; Transcription (biology); Underpinning research; Genetic variation; Gene expression; Genetics; Humans; ddc:576.5; Gene Regulatory Networks; Polymorphism; Gene; Genetics (clinical); 030304 developmental biology; Regulation of gene expression; 0303 health sciences; Gene Expression Profiling; Human Genome; Bayes Theorem; Single Nucleotide; Biological Sciences; Gene expression profiling; 030104 developmental biology; Gene Expression Regulation; Organ Specificity; RNA splicing; RNA; Generic health relevance; Sequence Analysis; 030217 neurology & neurosurgery; Biotechnology
- Abstract
Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of regulatory genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single or small sets of tissues. Here, we have reconstructed networks that capture a much more complete set of regulatory relationships, specifically including regulation of relative isoform abundance and splicing, and tissue-specific connections unique to each of a diverse set of tissues. Using the Genotype-Tissue Expression (GTEx) project v6 RNA-sequencing data across 44 tissues in 449 individuals, we evaluated shared and tissue-specific network relationships. First, we developed a framework called Transcriptome Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the complex interplay between the regulation of splicing and transcription. We built TWNs for sixteen tissues, and found that hubs with isoform node neighbors in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome, and providing a set of candidate shared and tissue-specific regulatory hub genes. Next, we used a Bayesian biclustering model that identifies network edges between genes with co-expression in a single tissue to reconstruct tissue-specific networks (TSNs) for 27 distinct GTEx tissues and for four subsets of related tissues. Using both TWNs and TSNs, we characterized gene co-expression patterns shared across tissues. Finally, we found genetic variants associated with multiple neighboring nodes in our networks, supporting the estimated network structures and identifying 33 genetic variants with distant regulatory impact on transcription and splicing. 
Our networks provide an improved understanding of the complex relationships between genes in the human transcriptome, including tissue-specificity of gene co-expression, regulation of splicing, and the coordinated impact of genetic variation on transcription.
- Published
- 2017
11. Brain structure-function associations in multi-generational families genetically enriched for bipolar disorder
- Author
-
Scott C. Fears, Remmelt Schür, Rachel Sjouwerman, Susan K. Service, Carmen Araya, Xinia Araya, Julio Bejarano, Emma Knowles, Juliana Gomez-Makhinson, Maria C. Lopez, Ileana Aldana, Terri M. Teshiba, Zvart Abaryan, Noor B. Al-Sharif, Linda Navarro, Todd A. Tishler, Lori Altshuler, George Bartzokis, Javier I. Escobar, David C. Glahn, Paul M. Thompson, Carlos Lopez-Jaramillo, Gabriel Macaya, Julio Molina, Victor I. Reus, Chiara Sabatti, Rita M. Cantor, Nelson B. Freimer, and Carrie E. Bearden
- Subjects
Male; Bipolar Disorder; neurocognition; 616.895 Psicosis maníacodepresiva (Trastornos bipolares); Medical and Health Sciences; Developmental psychology; Lingual gyrus; Computer-Assisted; 80 and over; Verbal fluency test; 2.1 Biological and endogenous factors; Cognitive decline; Aetiology; Aged, 80 and over; Brain; Cognition; temperament; Middle Aged; Magnetic Resonance Imaging; Genealogía y Heráldica; Phenotype; Mental Health; Brain size; Neurological; Female; social and economic factors; Psychology; Clinical psychology; Adult; Adolescent; 1.1 Normal biological development and functioning; Basic Behavioral and Social Science; Young Adult; pedigrees; 2.3 Psychological; Underpinning research; Image Interpretation, Computer-Assisted; Behavioral and Social Science; medicine; Humans; Genetic Predisposition to Disease; Bipolar disorder; Fenotipos; Image Interpretation; structural MRI; Trastorno Bipolar; Aged; Psychiatric Status Rating Scales; Neurology & Neurosurgery; Working memory; Temperamento; Psychology and Cognitive Sciences; Neurosciences; Original Articles; component phenotype; medicine.disease; Brain Disorders; Neurology (clinical); Neurocognitive; Imagen por Resonancia Magnética; Genealogy and Heraldry
- Abstract
Recent theories regarding the pathophysiology of bipolar disorder suggest contributions of both neurodevelopmental and neurodegenerative processes. While structural neuroimaging studies indicate disease-associated neuroanatomical alterations, the behavioural correlates of these alterations have not been well characterized. Here, we investigated multi-generational families genetically enriched for bipolar disorder to: (i) characterize neurobehavioural correlates of neuroanatomical measures implicated in the pathophysiology of bipolar disorder; (ii) identify brain–behaviour associations that differ between diagnostic groups; (iii) identify neurocognitive traits that show evidence of accelerated ageing specifically in subjects with bipolar disorder; and (iv) identify brain–behaviour correlations that differ across the age span. Structural neuroimages and multi-dimensional assessments of temperament and neurocognition were acquired from 527 (153 bipolar disorder and 374 non-bipolar disorder) adults aged 18–87 years in 26 families with heavy genetic loading for bipolar disorder. We used linear regression models to identify significant brain–behaviour associations and test whether brain–behaviour relationships differed: (i) between diagnostic groups; and (ii) as a function of age. We found that total cortical and ventricular volume had the greatest number of significant behavioural associations, and included correlations with measures from multiple cognitive domains, particularly declarative and working memory and executive function. Cortical thickness measures, in contrast, showed more specific associations with declarative memory, letter fluency and processing speed tasks. 
While the majority of brain–behaviour relationships were similar across diagnostic groups, increased cortical thickness in ventrolateral prefrontal and parietal cortical regions was associated with better declarative memory only in bipolar disorder subjects, and not in non-bipolar disorder family members. Additionally, while age had a relatively strong impact on all neurocognitive traits, the effects of age on cognition did not differ between diagnostic groups. Most brain–behaviour associations were also similar across the age range, with the exception of cortical and ventricular volume and lingual gyrus thickness, which showed weak correlations with verbal fluency and inhibitory control at younger ages that increased in magnitude in older subjects, regardless of diagnosis. Findings indicate that neuroanatomical traits potentially impacted by bipolar disorder are significantly associated with multiple neurobehavioural domains. Structure–function relationships are generally preserved across diagnostic groups, with the notable exception of ventrolateral prefrontal and parietal association cortex, volumetric increases in which may be associated with cognitive resilience specifically in individuals with bipolar disorder. Although age impacted all neurobehavioural traits, we did not find any evidence of accelerated cognitive decline specific to bipolar disorder subjects. Regardless of diagnosis, greater global brain volume may represent a protective factor for the effects of ageing on executive functioning. 
Funding: National Institute of Health (NIH), United States, grants R01MH075007, R01MH095454, P30NS062691, K23MH074644-01, R01HG006695, and K08MH086786. Repository unit: UCR, Vicerrectoría de Investigación, Unidades de Investigación, Ciencias Básicas, Centro de Investigación en Biología Celular y Molecular (CIBCM).
- Published
- 2015
12. Empirical Bayes Estimation of a Sparse Vector of Gene Expression Changes
- Author
-
Stephen W. Erickson and Chiara Sabatti
- Subjects
Statistics and Probability; False discovery rate; Bayes estimator; MCMC; Microarrays; business.industry; thresholding; Pattern recognition; Bayes factor; FDR; Bayesian statistics; Computational Mathematics; Bayes' theorem; Multiple comparisons problem; Genetics; Physical Sciences and Mathematics; Artificial intelligence; business; Molecular Biology; Type I and type II errors; Statistical hypothesis testing; Mathematics
- Abstract
Gene microarray technology is often used to compare the expression of thousands of genes in two different cell lines. Typically, one does not expect measurable changes in transcription amounts for a large number of genes; furthermore, the noise level of array experiments is rather high in relation to the available number of replicates. For the purpose of statistical analysis, inference on the “population” difference in expression for genes across the two cell lines is often cast in the framework of hypothesis testing, with the null hypothesis being no change in expression. Given that thousands of genes are investigated at the same time, this requires some multiple comparison correction procedure to be in place. We argue that hypothesis testing, with its emphasis on type I error and family analogues, may not address the exploratory nature of most microarray experiments. We instead propose viewing the problem as one of estimation of a vector known to have a large number of zero components. In a Bayesian framework, we describe the prior knowledge on expression changes using mixture priors that incorporate a mass at zero and we choose a loss function that favors the selection of sparse solutions. We consider two different models applicable to the microarray problem, depending on the nature of replicates available, and show how to explore the posterior distributions of the parameters using MCMC. Simulations show an interesting connection between this Bayesian estimation framework and both false discovery rate (FDR) control and misclassification-minimizing procedures. Finally, two empirical examples illustrate the practical advantages of this Bayesian estimation paradigm.
- Published
- 2011
13. Reconstructing Ancestral Haplotypes with a Dictionary Model
- Author
-
Chiara Sabatti, Kristin L. Ayers, and Kenneth Lange
- Subjects
forward and backwards algorithms; haplotype blocks; Computer science; MathematicsofComputing_NUMERICALANALYSIS; Linkage Disequilibrium; TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY; Expectation–maximization algorithm; Statistics; Prior probability; Genetics; Physical Sciences and Mathematics; EM algorithm; Hidden Markov model; Minimum description length; Molecular Biology; Models, Genetic; Estimation theory; Haplotype; Missing data; minimum description length; Computational Mathematics; ComputingMethodologies_PATTERNRECOGNITION; Computational Theory and Mathematics; Haplotypes; Modeling and Simulation; Mutation (genetic algorithm); Algorithm; linkage disequilibrium; Algorithms; MathematicsofComputing_DISCRETEMATHEMATICS
- Abstract
We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given data set. Application of the model to simulated data gives encouraging results. In a real data set, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
- Published
- 2011