42 results on '"Statistical genomics"'
Search Results
2. OpenMendel: a cooperative programming project for statistical genetics
- Author
-
Zhou, Hua, Sinsheimer, Janet S, Bates, Douglas M, Chu, Benjamin B, German, Christopher A, Ji, Sarah S, Keys, Kevin L, Kim, Juhyun, Ko, Seyoon, Mosher, Gordon D, Papp, Jeanette C, Sobel, Eric M, Zhai, Jing, Zhou, Jin J, and Lange, Kenneth
- Subjects
Biological Sciences ,Genetics ,Human Genome ,Networking and Information Technology R&D (NITRD) ,Algorithms ,Computational Biology ,Genome ,Human ,Genome-Wide Association Study ,Humans ,Models ,Statistical ,Polymorphism ,Single Nucleotide ,Programming Languages ,Software ,Statistical genomics ,GWAS ,Computational statistics ,Open source ,Collaborative programming ,stat.AP ,q-bio.GN ,Complementary and Alternative Medicine ,Paediatrics and Reproductive Medicine ,Genetics & Heredity ,Reproductive medicine
- Abstract
Statistical methods for genome-wide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.
- Published
- 2020
3. Leveraging genomic and molecular variations to understand the regulatory landscape in human cancers and differentiating stem cells
- Author
-
Urban, Lara Hanne and Stegle, Oliver
- Subjects
616.99 ,Statistical Genomics ,Computational Genomics ,Human Cancer Genomics ,Transcriptomics ,Mutational Signatures ,Single-cell Genomics ,DNA methylation ,Alternative Splicing
- Abstract
Genetic and molecular variations are closely intertwined; while genetic factors drive phenotypic differences ranging from gene expression to organismal traits, phenotypic variations are the target of evolutionary selection, what eventually results in genetic changes. As technological advances have resulted in high-throughput assays for different molecular dimensions, it has become challenging to turn these large-scale data into meaningful insights and to delineate biological cause and consequence. In this thesis, I use computational modelling to detect and understand biologically meaningful associations between genetic variation and gene expression alterations. First, we use data across 27 human cancer types to probe associations between different genetic factors and gene expression levels. We describe the tumours' regulatory landscape that is highly heterogeneous across cancer types, and quantify the relationship between gene expression and various genetic features that characterise local and global mutational burden as well as distinct mutational processes. Next, we study the relationship between genetic and epigenetic variation and alternative splicing. This analysis extends studies of splicing events in bulk data to variability in splicing between single cells from the same tissue: We analyse DNA methylation and alternative splicing across single cells derived from one human donor to characterise splicing variation and its determinants across genes. Thus, we identify relevant genetic determinants of splicing in induced pluripotent stem cells as well as during their differentiation, and a significant contribution of DNA methylation to splicing variation across cells. Finally, we show how gene expression-mutagenesis screens can be applied to understand complex mutational signatures, using the cancer hallmark of DNA repair deficiency as an example. The molecular cause and consequence of homologous recombination repair deficiency are not yet fully understood. 
We explore genome-wide molecular aberrations caused by this repair deficiency beyond the few previously known genes. Our preliminary results point towards a genetically dominant effect of BRCA1 mutagenesis. Taken together, this thesis highlights novel dimensions of genotype-phenotype associations in highly heterogeneous molecular datasets. We describe the complex regulatory landscape across human cancer types, as well as molecular alterations and relevant epigenetic effects in differentiating pluripotent stem cells.
- Published
- 2019
- Full Text
- View/download PDF
4. Editorial: Integration of computational genomics into clinical pharmacogenomic tests: how bioinformatics may help primary care in precision medicine area.
- Author
-
Tafazoli, Alireza, Abbaszadegan, Mohammad Reza, and Patrinos, George P.
- Subjects
INDIVIDUALIZED medicine ,PHARMACOGENOMICS ,GENOMICS ,PRIMARY care ,BIOINFORMATICS
- Published
- 2023
- Full Text
- View/download PDF
5. Statistical genetics and polygenic risk score for precision medicine
- Author
-
Takahiro Konuma and Yukinori Okada
- Subjects
Statistical genomics ,Genome-wide association study ,Polygenic risk score ,Precision medicine ,Pathology ,RB1-214
- Abstract
The prediction of disease risks is an essential part of personalized medicine, which includes early disease detection, prevention, and intervention. The polygenic risk score (PRS) has become the standard for quantifying genetic liability in predicting disease risks. PRS utilizes single-nucleotide polymorphisms (SNPs) with genetic risks elucidated by genome-wide association studies (GWASs) and is calculated as weighted sum scores of these SNPs with genetic risks using their effect sizes from GWASs as their weights. The utilities of PRS have been explored in many common diseases, such as cancer, coronary artery disease, obesity, and diabetes, and in various non-disease traits, such as clinical biomarkers. These applications demonstrated that PRS could identify a high-risk subgroup of these diseases as a predictive biomarker and provide information on modifiable risk factors driving health outcomes. On the other hand, there are several limitations to implementing PRSs in clinical practice, such as biased sensitivity for the ethnic background of PRS calculation and geographical differences even in the same population groups. Also, it remains unclear which method is the most suitable for the prediction with high accuracy among numerous PRS methods developed so far. Although further improvements of its comprehensiveness and generalizability will be needed for its clinical implementation in the future, PRS will be a powerful tool for therapeutic interventions and lifestyle recommendations in a wide range of diseases. Thus, it may ultimately improve the health of an entire population in the future.
- Published
- 2021
- Full Text
- View/download PDF
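The abstract of record 5 describes the PRS as a weighted sum of SNP risk-allele counts, with GWAS effect sizes as weights. A minimal sketch of that arithmetic follows; the function name, the 0/1/2 dosage coding, and the example numbers are illustrative assumptions, not values or methods taken from the article.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages.

    dosages:      per-SNP risk-allele counts for one individual (assumed 0, 1 or 2)
    effect_sizes: per-SNP GWAS effect sizes (e.g. log odds ratios), same order
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * w for d, w in zip(dosages, effect_sizes))


# Hypothetical individual genotyped at three SNPs (made-up numbers).
example_dosages = [2, 1, 0]
example_effects = [0.5, -0.25, 1.0]
score = polygenic_risk_score(example_dosages, example_effects)
print(score)
```

Real PRS methods differ mainly in how SNPs are selected and how the weights are adjusted (for example for linkage disequilibrium); the weighted sum itself stays the same.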
6. Beware the Jaccard: the choice of similarity measure is important and non-trivial in genomic colocalisation analysis.
- Author
-
Salvatore, Stefania, Rand, Knut Dagestad, Grytten, Ivar, Ferkingstad, Egil, Domanska, Diana, Holden, Lars, Gheorghe, Marius, Mathelier, Anthony, Glad, Ingrid, and Sandve, Geir Kjetil
- Subjects
*GENETIC regulation , *EPIGENOMICS , *ACQUISITION of data
- Abstract
The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
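The abstract of record 6 frames genomic co-occurrence as the similarity of two genome-wide binary vectors and names the Jaccard index and the Forbes coefficient (fold change). The sketch below uses the common textbook definitions, which are assumed here rather than taken from the article: Jaccard as intersection over union, and Forbes as observed overlap divided by the overlap expected under independence. The two example tracks are made up.

```python
def overlap_counts(a, b):
    """Return (positions covered by both, positions covered by either)."""
    if len(a) != len(b):
        raise ValueError("binary vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both, either


def jaccard_index(a, b):
    """Intersection over union of two binary vectors."""
    both, either = overlap_counts(a, b)
    return both / either if either else 0.0


def forbes_coefficient(a, b):
    """Observed overlap divided by the overlap expected if a and b were independent."""
    both, _ = overlap_counts(a, b)
    expected = sum(a) * sum(b) / len(a)
    return both / expected if expected else 0.0


# Two hypothetical tracks over a genome of 8 bins (1 = bin covered).
track_a = [1, 1, 0, 0, 1, 0, 0, 0]
track_b = [1, 0, 0, 0, 1, 1, 0, 0]
jaccard = jaccard_index(track_a, track_b)
forbes = forbes_coefficient(track_a, track_b)
print(jaccard, forbes)
```

The two measures normalise the same overlap count differently, which is one way to see why the choice of measure can change conclusions.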
7. Editorial: Statistical and Computational Methods for Microbiome Multi-Omics Data.
- Author
-
Mallick, Himel, Bucci, Vanni, and An, Lingling
- Subjects
MEDIATION (Statistics) ,BIOLOGICAL specimens ,COMPUTATIONAL biology ,BIOLOGICAL systems ,FALSE positive error ,SYSTEMS biology ,GUT microbiome ,METAGENOMICS
- Published
- 2020
- Full Text
- View/download PDF
8. Statistical genetics and polygenic risk score for precision medicine
- Author
-
Yukinori Okada and Takahiro Konuma
- Subjects
0301 basic medicine ,medicine.medical_specialty ,Genome-wide association study ,Immunology ,Population ,Psychological intervention ,Disease ,Review ,03 medical and health sciences ,0302 clinical medicine ,Polygenic risk score ,medicine ,Pathology ,Immunology and Allergy ,Statistical genomics ,RB1-214 ,Generalizability theory ,Intensive care medicine ,education ,education.field_of_study ,business.industry ,Precision medicine ,030104 developmental biology ,Statistical genetics ,Personalized medicine ,business ,030217 neurology & neurosurgery
- Abstract
The prediction of disease risks is an essential part of personalized medicine, which includes early disease detection, prevention, and intervention. The polygenic risk score (PRS) has become the standard for quantifying genetic liability in predicting disease risks. PRS utilizes single-nucleotide polymorphisms (SNPs) with genetic risks elucidated by genome-wide association studies (GWASs) and is calculated as weighted sum scores of these SNPs with genetic risks using their effect sizes from GWASs as their weights. The utilities of PRS have been explored in many common diseases, such as cancer, coronary artery disease, obesity, and diabetes, and in various non-disease traits, such as clinical biomarkers. These applications demonstrated that PRS could identify a high-risk subgroup of these diseases as a predictive biomarker and provide information on modifiable risk factors driving health outcomes. On the other hand, there are several limitations to implementing PRSs in clinical practice, such as biased sensitivity for the ethnic background of PRS calculation and geographical differences even in the same population groups. Also, it remains unclear which method is the most suitable for the prediction with high accuracy among numerous PRS methods developed so far. Although further improvements of its comprehensiveness and generalizability will be needed for its clinical implementation in the future, PRS will be a powerful tool for therapeutic interventions and lifestyle recommendations in a wide range of diseases. Thus, it may ultimately improve the health of an entire population in the future.
- Published
- 2021
9. Statistical genetics and polygenic risk score for precision medicine
- Author
-
Konuma, Takahiro and Okada, Yukinori
- Published
- 2021
- Full Text
- View/download PDF
10. Méthodes pour l'inférence post-clustering appliquées à l'expression génique
- Author
-
Hivert, Benjamin, Agniel, Denis, Thiébaut, Rodolphe, Hejblum, Boris P., Bordeaux population health (BPH), Université de Bordeaux (UB)-Institut de Santé Publique, d'Épidémiologie et de Développement (ISPED)-Institut National de la Santé et de la Recherche Médicale (INSERM), Statistics In System biology and Translational Medicine (SISTM), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)- Bordeaux population health (BPH), Université de Bordeaux (UB)-Institut de Santé Publique, d'Épidémiologie et de Développement (ISPED)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Université de Bordeaux (UB)-Institut de Santé Publique, d'Épidémiologie et de Développement (ISPED)-Institut National de la Santé et de la Recherche Médicale (INSERM), Vaccine Research Institute (VRI), Université Paris-Est Créteil Val-de-Marne - Paris 12 (UPEC UP12), Rand Corporation, CHU Bordeaux [Bordeaux], and Hivert, Benjamin
- Subjects
inférence sélective ,analyse circulaire ,génomique statistique ,selective inference ,statistical genomics ,double-dipping ,données de grande dimension ,high-dimensional data ,[MATH] Mathematics [math] ,Classification non supervisée ,Clustering
- Abstract
The analysis of RNA-seq gene expression data is often organised around two successive steps: i) clustering using all of the genes to group the observation units (patients, cells, etc.) into separate and homogeneous subgroups; then ii) differential analysis of individual genes using hypothesis testing to identify which genes, i.e. which variables, are differentially expressed between the subgroups. However, several subgroups constructed in i) can actually contain only units coming from the same homogeneous population: clustering will then artificially create differences between those spurious subgroups, leading to false positives in ii). We propose two inference methods to take into account the initial clustering step for differential analysis and thus guarantee an effective control of the type I error. The first method is based on the concept of selective inference while the second one uses unimodality and multimodality to describe the separation between clusters. We evaluate the performance of both approaches in extensive numerical simulations as well as in an application to a real, low dimensional dataset. Both proposed methods lead to valid p-values under their null hypothesis of no difference between subgroups in expression at a selected gene independently of the clustering, while maintaining good statistical power. In high dimension, this type I error inflation can be overcome by the dilution of the clustering information, provided that the variables are independent. Yet, in the presence of correlation (as for gene expression), spurious clusters appear, even though they are not separable.
An adaptation of the above methods to this high dimensional context is thus necessary.
The analysis of gene expression data is often organised around two successive steps: i) unsupervised clustering using all of the genes to group the observation units (patients, samples or cells) into distinct and homogeneous subgroups; then ii) differential analysis using hypothesis tests to identify which genes, i.e. which variables, are differentially expressed between these subgroups. However, because this approach uses the same data in both steps, it does not guarantee good control of the type I error in step ii). We propose two inference methods that account for the initial unsupervised clustering step during differential analysis and thus guarantee effective control of the type I error. The first method is based on the concept of selective inference, while the second relies on a definition of cluster separation that involves only the concepts of unimodality and multimodality. We evaluated the performance of both methods in several numerical simulations, as well as in an application to a real, low-dimensional dataset. The proposed methods yield valid p-values under the null hypothesis of no difference between subgroups in the expression of a selected gene, independently of the clustering, while retaining good statistical power. In high dimension, this type I error inflation can be offset by the dilution of the signal used for clustering, provided that the variables are independent. In the presence of correlation, however (as is the case in practice for gene expression), artificial clusters appear even though they are not separable. An adaptation of the methods to this high-dimensional context is therefore necessary.
- Published
- 2022
11. Genome-wide estimation of heritability and its functional components for flowering, defense, ionomics, and developmental traits in a geographically diverse population of Arabidopsis thaliana.
- Author
-
Yang, Rong-Cai and Bell, J.
- Subjects
*HERITABILITY , *PLANT breeding , *ARABIDOPSIS thaliana , *FLOWERING time , *PLANT genetics
- Abstract
Narrow-sense heritability (portion of the total phenotypic variation attributable to additive genetic effect, h2) is a critical parameter in plant breeding and genetics, but its estimation is difficult for populations with unknown pedigree information. This study applied a marker-based linear mixed model (LMM) analysis to estimate narrow-sense heritability and its seven functional components corresponding to SNPs in coding and noncoding regions for each of 107 flowering, defense, ionomics, and developmental traits in an Arabidopsis ( Arabidopsis thaliana) population of 199 inbred lines with unknown genetic relatedness. Genetic relationship matrix (GRM) based on 214 051 SNPs and component GRMs based on seven subsets of SNPs were computed for LMM estimation of h2 and functional components contributing to h2, respectively. The h2 estimates for flowering traits were higher than those for defense, ionomics, and developmental traits, supporting a general view that the fitness-related traits have lower heritabilities than other traits. The function component owing to SNPs in coding (exon) regions was the least contributor to h2. Our LMM analysis provides an opportunity to gain a comprehensive view on heritability and its functional components for populations with unknown structure but with genome-wide DNA markers. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
12. GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome.
- Author
-
Simovski, Boris, Vodák, Daniel, Gundersen, Sveinung, Domanska, Diana, Azab, Abdulrahman, Holden, Lars, Holden, Marit, Grytten, Ivar, Rand, Knut, Drabløs, Finn, Johansen, Morten, Mora, Antonio, Lund-Andersen, Christin, Fromm, Bastian, Eskeland, Ragnhild, Gabrielsen, Odd Stokke, Ferkingstad, Egil, Nakken, Sigve, Bengtsen, Mads, and Nederbragt, Alexander Johan
- Subjects
*GENOMES , *ACQUISITION of data , *OPEN source software
- Abstract
Background: Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. Findings: We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Conclusions: Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
13. Linkage analysis between dominant and co-dominant makers in full-sib families of out-breeding species
- Author
-
Alexandre Alonso Alves, Leonardo Lopes Bhering, Cosme Damião Cruz, and Acelino Couto Alfenas
- Subjects
statistical genomics ,exogamic populations ,recombination frequency and maximum likelihood ,Genetics ,QH426-470
- Abstract
As high-throughput genomic tools, such as the DNA microarray platform, have led to the development of novel genotyping procedures, such as Diversity Arrays Technology (DArT) and Single Nucleotide Polymorphisms (SNPs), it is likely that, in the future, high density linkage maps will be constructed from both dominant and co-dominant markers. Recently, a strictly genetic approach was described for estimating recombination frequency (r) between co-dominant markers in full-sib families. The complete set of maximum likelihood estimators for r in full-sib families was almost obtained, but unfortunately, one particular configuration involving dominant markers, segregating in a 3:1 ratio and co-dominant markers, was not considered. Here we add nine further estimators to the previously published set, thereby making it possible to cover all combinations of molecular markers with two to four alleles (without epistasis) in a full-sib family. This includes segregation in one or both parents, dominance and all linkage phase configurations.
- Published
- 2010
- Full Text
- View/download PDF
14. Efficiency of the multilocus analysis for the construction of genetic maps
- Author
-
Leonardo Lopes Bhering, Cosme Damião Cruz, Edmar Soares de Vasconcelos, Márcio Fernando Ribeiro de Resende Junior, Willian Silva Barros, and Tatiana Barbosa Rosado
- Subjects
Statistical genomics ,simulation ,mapping ,Plant culture ,SB1-1110 ,Biotechnology ,TP248.13-248.65
- Abstract
The use of genetic maps is a useful tool in genetic research. The association between map distance and recombination frequency is expressed by a genetic mapping function. However, several of these functions do not presuppose the joint recombination percentage. In other words, they are not multilocus probabilities. This work aimed to compare, through simulations, the efficiency in the use of different mapping functions with and without multilocus analysis as a tool in the construction of genetic maps. A genome constituted of three linkage groups (50, 100 and 200 cM) was simulated for a comparative study. Four mapping populations were simulated, F2, with 50, 100, 200 and 400 individuals, with 10 replicas each. It was verified, after the analyses, that the multilocus analysis was not efficient to rescue the size of the connection groups, concluding that the non use of the multilocus analysis would be viable.
- Published
- 2009
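The abstract of record 14 notes that a genetic mapping function relates map distance to recombination frequency, but it does not name the functions it compares. As an illustration only, the sketch below assumes two standard textbook choices, Haldane (no crossover interference) and Kosambi (partial interference); the function names are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Haldane mapping function: r = 0.5 * (1 - exp(-2d)), with d in Morgans."""
    d = distance_cm / 100.0  # centiMorgans to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def kosambi_recombination(distance_cm):
    """Kosambi mapping function: r = 0.5 * tanh(2d), with d in Morgans."""
    d = distance_cm / 100.0
    return 0.5 * math.tanh(2.0 * d)


# Recombination frequency implied by each function at a few map distances.
for cm in (10, 50, 100, 200):
    print(cm, round(haldane_recombination(cm), 4), round(kosambi_recombination(cm), 4))
```

Both functions start at r = 0 for zero distance and approach r = 0.5 for loosely linked loci; they differ in how quickly they get there.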
15. Comparing the Statistical Fate of Paralogous and Orthologous Sequences.
- Author
-
Massip, Florian, Sheinman, Michael, Schbath, Sophie, and Arndt, Peter F.
- Subjects
*SEQUENCE alignment , *NUCLEOTIDE sequence , *BIOINFORMATICS , *PROBABILITY theory , *BIOLOGICAL evolution , *GENETIC code , *COMPARATIVE studies , *QUANTITATIVE research
- Abstract
For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to get insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model we can estimate the probability to find two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent α = −5. Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
16. A guide to genome-wide association analysis and post-analytic interrogation.
- Author
-
Reed, Eric, Nunez, Sara, Kulp, David, Qian, Jing, Reilly, Muredach P., and Foulkes, Andrea S.
- Subjects
*COMPUTER software , *DATABASES , *BIOINFORMATICS , *SEQUENCE analysis
- Abstract
This tutorial is a learning resource that outlines the basic process and provides specific software tools for implementing a complete genome-wide association analysis. Approaches to post-analytic visualization and interrogation of potentially novel findings are also presented. Applications are illustrated using the free and open-source R statistical computing and graphics software environment, Bioconductor software for bioinformatics and the UCSC Genome Browser. Complete genome-wide association data on 1401 individuals across 861,473 typed single nucleotide polymorphisms from the PennCATH study of coronary artery disease are used for illustration. All data and code, as well as additional instructional resources, are publicly available through the Open Resources in Statistical Genomics project: http://www.stat-gen.org. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
17. Pharmacogenomic and Statistical Analysis.
- Author
-
Bai H, Zhang X, and Bush WS
- Subjects
- Pharmacogenomic Testing, Phenotype, Pharmacogenetics methods, Research Design
- Abstract
Genetic variants can alter response to drugs and other therapeutic interventions. The study of this phenomenon, called pharmacogenomics, is similar in many ways to other types of genetic studies but has distinct methodological and statistical considerations. Genetic variants involved in the processing of exogenous compounds exhibit great diversity and complexity, and the phenotypes studied in pharmacogenomics are also more complex than typical genetic studies. In this chapter, we review basic concepts in pharmacogenomic study designs, data generation techniques, statistical analysis approaches, and commonly used methods and briefly discuss the ultimate translation of findings to clinical care.
- Published
- 2023
- Full Text
- View/download PDF
18. Constructing a polygenic risk score for childhood obesity using functional data analysis.
- Author
-
Craig SJC, Kenney AM, Lin J, Paul IM, Birch LL, Savage JS, Marini ME, Chiaromonte F, Reimherr ML, and Makova KD
- Abstract
Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples with tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultra-high dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain-a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS. This application is complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response. 
Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes, and it is shown how FDA approaches can be used as an alternative to the traditional GWAS.
- Published
- 2023
- Full Text
- View/download PDF
19. Computational cancer biology: education is a natural key to many locks.
- Author
-
Emmert-Streib, Frank, Shu-Dong Zhang, and Hamilton, Peter
- Subjects
*ONCOLOGY , *COMPUTATIONAL biology , *GENOMICS , *STATISTICS , *CANCER education
- Abstract
Background: Oncology is a field that profits tremendously from the genomic data generated by high-throughput technologies, including next-generation sequencing. However, in order to exploit, integrate, visualize and interpret such high-dimensional data efficiently, non-trivial computational and statistical analysis methods are required that need to be developed in a problem-directed manner. Discussion: For this reason, computational cancer biology aims to fill this gap. Unfortunately, computational cancer biology is not yet fully recognized as a coequal field in oncology, leading to a delay in its maturation and, as an immediate consequence, an under-exploration of high-throughput data for translational research. Summary: Here we argue that this imbalance, favoring 'wet lab-based activities', will be naturally rectified over time, if the next generation of scientists receives an academic education that provides a fair and competent introduction to computational biology and its manifold capabilities. Furthermore, we discuss a number of local educational provisions that can be implemented on university level to help in facilitating the process of harmonization. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
20. Statistical and Computational Methods for Microbiome Multi-Omics Data
- Author
-
Vanni Bucci, Lingling An, and Himel Mallick
- Subjects
metagenomics ,statistical genomics ,Computer science ,microbiome ,biostatistics ,Computational biology ,multi-omics ,metabolomics ,Editorial ,computational biology ,Metagenomics ,Genetics ,Multi omics ,data science ,Microbiome - Published
- 2020
21. Grand Challenges in Statistical Genetics and Methodology
- Author
-
Hemant K Tiwari and Nicholas J Schork
- Subjects
Epigenomics ,Proteomics ,bioinformatics ,statistical genomics ,Functional Genomics ,next generation sequencing ,Genetics ,QH426-470 - Published
- 2011
22. Statistical Genomics and Bioinformatics
- Author
-
Prem Narain
- Subjects
Statistical Genomics ,Bioinformatics ,Fruit Crops ,eQTL ,Annotated Sequence Databases ,Sequence Similarity Search ,Plant culture ,SB1-1110 - Abstract
Some important and interesting topics in the newly emerging disciplines of statistical genomics and bioinformatics have been discussed briefly in relation to plants with possible references to fruit crops. This paper is therefore divided into two parts relating to the two disciplines, respectively. In the first part, mapping of quantitative trait loci (QTL), association mapping, mapping of gene expression transcripts (eQTL), marker-assisted selection, and a systems approach to quantitative genetics have been dealt with. In the second part, generation of databases, annotation, annotated sequence databases, and sequence similarity search have been described.
- Published
- 2010
23. Genomic structure predicts metabolite dynamics in microbial communities.
- Author
-
Gowda, Karna, Ping, Derek, Mani, Madhav, and Kuehn, Seppe
- Subjects
- *
BACTERIAL communities , *BIOGEOCHEMICAL cycles , *BIOPHYSICS , *MICROBIAL communities , *GENE expression , *GENE mapping - Abstract
The metabolic activities of microbial communities play a defining role in the evolution and persistence of life on Earth, driving redox reactions that give rise to global biogeochemical cycles. Community metabolism emerges from a hierarchy of processes, including gene expression, ecological interactions, and environmental factors. In wild communities, gene content is correlated with environmental context, but predicting metabolite dynamics from genomes remains elusive. Here, we show, for the process of denitrification, that metabolite dynamics of a community are predictable from the genes each member of the community possesses. A simple linear regression reveals a sparse and generalizable mapping from gene content to metabolite dynamics for genomically diverse bacteria. A consumer-resource model correctly predicts community metabolite dynamics from single-strain phenotypes. Our results demonstrate that the conserved impacts of metabolic genes can predict community metabolite dynamics, enabling the prediction of metabolite dynamics from metagenomes, designing denitrifying communities, and discovering how genome evolution impacts metabolism.
• Metabolite fluxes in microbial communities are predictable from individual genotypes
• A diverse collection of 79 bacterial isolates was sequenced and phenotyped
• Gene presence and absence predict metabolic phenotypes of isolates via regression
• A consumer-resource model predicts community metabolite fluxes from phenotypes
The presence or absence of specific genes within communities of wild bacterial isolates is sufficient to predict community-level metabolite dynamics without detailed knowledge of pathway regulation or complex ecological processes. [ABSTRACT FROM AUTHOR]
- Published
- 2022
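The regression step named in the abstract above, predicting a phenotype from gene presence and absence, can be sketched with ordinary least squares. This is only a toy under stated assumptions: the gene count, the coefficients, and the noiseless phenotype are all made up, and plain least squares stands in for whatever sparse regression the authors actually used. Only the number of isolates (79) is taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 79 isolates x 5 genes, entries 1 (present) or 0 (absent).
n_isolates, n_genes = 79, 5
X = rng.integers(0, 2, size=(n_isolates, n_genes)).astype(float)

# Assumed ground truth for the toy example: only two genes affect the phenotype.
true_coef = np.array([1.5, 0.0, 0.0, -0.8, 0.0])
true_intercept = 0.2
y = true_intercept + X @ true_coef  # noiseless phenotype, for illustration only

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(n_isolates), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept_hat, coef_hat = beta[0], beta[1:]
```

Because the toy phenotype has no noise, the fitted coefficients match the assumed ones; with real measurements a penalized fit would be a more natural way to obtain a sparse mapping.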
24. Virulence factor prediction in Streptococcus pyogenes using classification and clustering based on microarray data.
- Author
-
López-Kleine, Liliana, Torres-Avilés, Francisco, Tejedor, Fabio, and Gordillo, Luz
- Subjects
- *
STREPTOCOCCUS pyogenes , *MICROARRAY technology , *SUPPORT vector machines , *MICROBIAL virulence genetics , *GENE expression , *DATA analysis - Abstract
Interesting biological information, such as gene expression data (microarrays), can be extracted from publicly available genomic data. As a starting point for narrowing down the many possible wet lab experiments, global high-throughput data and available knowledge should be used to infer biological knowledge and generate biological hypotheses. Here, based on microarray data, we propose the use of clustering and classification methods that have become very popular and are implemented in freely available software to predict the participation in virulence mechanisms of different proteins encoded by genes of the pathogen Streptococcus pyogenes. Confidence in predictions is based on classification errors for known genes and on repeated prediction by more than three methods. Special emphasis is placed on the nonlinear kernel classification methods used. We propose a list of interesting candidates that could be virulence factors or that participate in the virulence process of S. pyogenes. Biological validation should start with this list of candidates, as they show behavior similar to that of known virulence factors. [ABSTRACT FROM AUTHOR]
- Published
- 2012
25. Efficiency of the multilocus analysis for the construction of genetic maps.
- Author
-
Bhering, Leonardo Lopes, Cruz, Cosme Damião, de Vasconcelos, Edmar Soares, de Resende Junior, Márcio Fernando Ribeiro, Barros, Willian Silva, and Rosado, Tatiana Barbosa
- Subjects
- *
GENE mapping , *GENETIC research , *GENOMES , *MEIOSIS , *SIMULATION methods & models , *QUANTITATIVE research , *BIOTECHNOLOGY , *RESEARCH - Abstract
Genetic maps are a useful tool in genetic research. The association between map distance and recombination frequency is expressed by a genetic mapping function. However, several of these functions do not presuppose the joint recombination percentage. In other words, they are not multilocus probabilities. This work aimed to compare, through simulations, the efficiency of different mapping functions with and without multilocus analysis as a tool in the construction of genetic maps. A genome consisting of three linkage groups (50, 100 and 200 cM) was simulated for a comparative study. Four F2 mapping populations were simulated, with 50, 100, 200 and 400 individuals, with 10 replicates each. The analyses showed that multilocus analysis was not efficient at recovering the size of the linkage groups, leading to the conclusion that omitting multilocus analysis would be viable. [ABSTRACT FROM AUTHOR]
- Published
- 2009
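The abstract above refers to genetic mapping functions, which convert a recombination frequency r into a map distance. It does not say which functions were compared, so the sketch below uses the two textbook examples, Haldane and Kosambi, purely as an assumption; the helper name `_check` is hypothetical.

```python
import math

def _check(r):
    """Recombination frequency must lie in [0, 0.5) for the formulas below."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination frequency must satisfy 0 <= r < 0.5")

def haldane_cm(r):
    """Haldane map distance in centimorgans (assumes no crossover interference)."""
    _check(r)
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cm(r):
    """Kosambi map distance in centimorgans (allows for some interference)."""
    _check(r)
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

# Distances for a few recombination frequencies.
table = [(r, haldane_cm(r), kosambi_cm(r)) for r in (0.05, 0.1, 0.2, 0.3)]
```

For the same r > 0, the Kosambi distance is shorter than the Haldane distance, which is one reason the choice of function changes the estimated length of a linkage group.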
26. Quantitative Trait Nucleotide Analysis Using Bayesian Model Selection.
- Author
-
Blangero, John, Göring, Harald H. H., Kent Jr., Jack W., Williams, Jeff T., Peterson, Charles P., Almasy, Laura, and Dyer, Thomas D.
- Subjects
- *
NUCLEOTIDE analysis , *BAYESIAN analysis , *GENETIC polymorphisms , *HUMAN genetic variation , *GENETIC markers - Abstract
Although much attention has been given to statistical genetic methods for the initial localization and fine mapping of quantitative trait loci (QTLs), little methodological work has been done to date on the problem of statistically identifying the most likely functional polymorphisms using sequence data. In this paper we provide a general statistical genetic framework, called Bayesian quantitative trait nucleotide (BQTN) analysis, for assessing the likely functional status of genetic variants. The approach requires the initial enumeration of all genetic variants in a set of resequenced individuals. These polymorphisms are then typed in a large number of individuals (potentially in families), and marker variation is related to quantitative phenotypic variation using Bayesian model selection and averaging. For each sequence variant a posterior probability of effect is obtained and can be used to prioritize additional molecular functional experiments. An example of this quantitative nucleotide analysis is provided using the GAW12 simulated data. The results show that the BQTN method may be useful for choosing the most likely functional variants within a gene (or set of genes). We also include instructions on how to use our computer program, SOLAR, for association analysis and BQTN analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2009
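The abstract above describes obtaining a posterior probability of effect for each variant through Bayesian model selection and averaging. A minimal sketch of that idea follows, assuming unrelated individuals, equal prior probabilities for all models, and a BIC approximation to the model evidence. These are illustrative assumptions, not a description of how SOLAR implements BQTN; the function names and simulated data are hypothetical.

```python
import itertools
import numpy as np

def bic_linear(y, X):
    """BIC of an ordinary least-squares fit of y on X (X includes the intercept)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def posterior_inclusion(y, G):
    """Approximate posterior probability of effect for each variant.

    Enumerates every subset of variants, weights each model by exp(-BIC/2)
    (equal prior model probabilities assumed), and sums the weights of the
    models that contain each variant.
    """
    n, m = G.shape
    ones = np.ones((n, 1))
    models, bics = [], []
    for size in range(m + 1):
        for subset in itertools.combinations(range(m), size):
            X = np.hstack([ones, G[:, list(subset)]])
            models.append(subset)
            bics.append(bic_linear(y, X))
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))  # subtract the minimum for stability
    weights /= weights.sum()
    pip = np.zeros(m)
    for subset, w in zip(models, weights):
        for j in subset:
            pip[j] += w
    return pip

# Simulated toy data: 500 individuals, 4 variants coded 0/1/2.
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(500, 4)).astype(float)
y = 1.0 * G[:, 0] + rng.normal(size=500)  # only variant 0 has an effect
pip = posterior_inclusion(y, G)
```

Exhaustive enumeration is feasible only for a handful of variants, since the number of models doubles with each variant added.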
27. Quantitative Trait Nucleotide Analysis Using Bayesian Model Selection.
- Author
-
Blangero, John, Göring, Harald H. H., Kent Jr., Jack W., Williams, Jeff T., Peterson, Charles P., Almasy, Laura, and Dyer, Thomas D.
- Subjects
- *
HUMAN biology , *GENETIC polymorphisms , *HEREDITY , *BAYESIAN analysis , *NUCLEOTIDES , *COMPUTER software - Abstract
Although much attention has been given to statistical genetic methods for the initial localization and fine mapping of quantitative trait loci (QTLs), little methodological work has been done to date on the problem of statistically identifying the most likely functional polymorphisms using sequence data. In this paper we provide a general statistical genetic framework, called Bayesian quantitative trait nucleotide (BQTN) analysis, for assessing the likely functional status of genetic variants. The approach requires the initial enumeration of all genetic variants in a set of resequenced individuals. These polymorphisms are then typed in a large number of individuals (potentially in families), and marker variation is related to quantitative phenotypic variation using Bayesian model selection and averaging. For each sequence variant a posterior probability of effect is obtained and can be used to prioritize additional molecular functional experiments. An example of this quantitative nucleotide analysis is provided using the GAW12 simulated data. The results show that the BQTN method may be useful for choosing the most likely functional variants within a gene (or set of genes). We also include instructions on how to use our computer program, SOLAR, for association analysis and BQTN analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2005
28. OPENMENDEL: a cooperative programming project for statistical genetics
- Author
-
Jing Zhai, Benjamin B. Chu, Janet S. Sinsheimer, Gordon D. Mosher, Christopher A. German, Kenneth Lange, Jin Zhou, Juhyun Kim, Eric M. Sobel, Douglas M. Bates, Hua Zhou, Sarah S. Ji, Jeanette C. Papp, Kevin L. Keys, and Seyoon Ko
- Subjects
FOS: Computer and information sciences ,Computer science ,Big data ,Cloud computing ,Genome-wide association study ,Software ,Models ,GWAS ,Genetics (clinical) ,Genetics & Heredity ,0303 health sciences ,Genome ,Data manipulation language ,030305 genetics & heredity ,Single Nucleotide ,Statistical ,Open source ,Variety (cybernetics) ,Networking and Information Technology R&D ,Networking and Information Technology R&D (NITRD) ,Statistical genetics ,q-bio.GN ,Algorithms ,Human ,Collaborative programming ,Context (language use) ,Statistics - Applications ,Polymorphism, Single Nucleotide ,Article ,Paediatrics and Reproductive Medicine ,03 medical and health sciences ,Complementary and Alternative Medicine ,Genetics ,Humans ,Statistical genomics ,Applications (stat.AP) ,Quantitative Biology - Genomics ,Polymorphism ,stat.AP ,030304 developmental biology ,Genomics (q-bio.GN) ,Models, Statistical ,business.industry ,Genome, Human ,Human Genome ,Computational Biology ,Data science ,Genetic epidemiology ,FOS: Biological sciences ,Computational statistics ,Programming Languages ,business ,Genome-Wide Association Study - Abstract
Statistical methods for genome-wide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project. 16 pages, 2 figures, 2 tables
- Published
- 2019
29. A guide to genome‐wide association analysis and post‐analytic interrogation
- Author
-
Muredach P. Reilly, David Kulp, Sara Nunez, Jing Qian, Eric R. Reed, and Andrea S. Foulkes
- Subjects
relatedness ,Statistics and Probability ,SNP filtering ,statistical genomics ,parallel processing ,heatmap ,regional association plot ,Epidemiology ,Computer science ,IBD ,Bioconductor ,lambda statistic ,imputation ,Genomics ,Genome browser ,computer.software_genre ,Manhattan plot ,tutorial ,Tutorial in Biostatistics ,Software ,Databases, Genetic ,call rate ,heterozygosity ,Humans ,principal component analysis (PCA) ,minor allele frequency (MAF) ,substructure ,business.industry ,ancestry ,Q–Q plot ,Computational Biology ,Hardy–Weinberg equilibrium (HWE) ,Data science ,sample filtering ,3. Good health ,Visualization ,R code ,Graphics software ,genome‐wide association (GWA) study ,UCSC Genome Browser ,business ,computer ,Imputation (genetics) ,Genome-Wide Association Study - Abstract
This tutorial is a learning resource that outlines the basic process and provides specific software tools for implementing a complete genome‐wide association analysis. Approaches to post‐analytic visualization and interrogation of potentially novel findings are also presented. Applications are illustrated using the free and open‐source R statistical computing and graphics software environment, Bioconductor software for bioinformatics and the UCSC Genome Browser. Complete genome‐wide association data on 1401 individuals across 861,473 typed single nucleotide polymorphisms from the PennCATH study of coronary artery disease are used for illustration. All data and code, as well as additional instructional resources, are publicly available through the Open Resources in Statistical Genomics project: http://www.stat-gen.org. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
- Published
- 2015
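Two of the filters listed among this record's subjects, call rate and minor allele frequency (MAF), can be sketched as follows. The tutorial itself works in R; Python is used here only for illustration, and the function name `snp_qc`, the thresholds, and the toy genotype matrix are hypothetical rather than values taken from the tutorial.

```python
import numpy as np

def snp_qc(genotypes, min_call_rate=0.95, min_maf=0.01):
    """Return per-SNP call rate, MAF, and a boolean keep-mask.

    genotypes: (n_samples, n_snps) float array of allele counts 0/1/2,
               with np.nan marking a missing call.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(genotypes)
    n_called = called.sum(axis=0)
    call_rate = n_called / genotypes.shape[0]
    # Frequency of the counted allele among the called genotypes.
    allele_sum = np.nansum(genotypes, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_sum / (2.0 * n_called)
        maf = np.minimum(freq, 1.0 - freq)
        keep = (call_rate >= min_call_rate) & (maf >= min_maf)
    return call_rate, maf, keep

# Toy data: 4 samples, 3 SNPs; the third SNP has one missing call and no variation.
G = np.array([[0, 1, np.nan],
              [1, 1, 0],
              [2, 0, 0],
              [0, 1, 0]])
call_rate, maf, keep = snp_qc(G, min_call_rate=0.9, min_maf=0.05)
```

Here the first two SNPs pass both filters, while the third fails on call rate (3 of 4 samples called) and on MAF (no minor alleles observed).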
30. GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome
- Author
-
Eivind Hovig, Alexander J. Nederbragt, Diana Domanska, Ivar Grytten, Marit Holden, Geir Kjetil Sandve, Abdulrahman Azab, Sigve Nakken, Sveinung Gundersen, Ingrid K. Glad, Christin Lund-Andersen, Antonio M. Mora, Johannes Andreas Akse, Daniel Vodak, Odd S. Gabrielsen, Bastian Fromm, Hildur Sif Thorarensen, Morten Johansen, Lars Holden, Mads Bengtsen, Ragnhild Eskeland, Knut Dagestad Rand, Boris Simovski, Finn Drabløs, Egil Ferkingstad, Raunvísindastofnun (HÍ), Science Institute (UI), Verkfræði- og náttúruvísindasvið (HÍ), School of Engineering and Natural Sciences (UI), Háskóli Íslands, and University of Iceland
- Subjects
0301 basic medicine ,Epigenomics ,statistical genomics ,Computer science ,Gagnasöfn ,Interface (computing) ,Datasets as Topic ,Health Informatics ,Genomics ,ENCODE ,computer.software_genre ,Epigenesis, Genetic ,03 medical and health sciences ,Genamengi ,Technical Note ,genomics ,Humans ,Statistical genomics ,Software system ,data integration ,genome analysis ,Tölfræði ,Whole Genome Sequencing ,Genome, Human ,genomic track ,Genomic track ,Epigenome ,Genome analysis ,Data science ,Computer Science Applications ,Galaxy ,030104 developmental biology ,Disparate system ,Data integration ,computer ,Software ,Reference genome - Abstract
Background: Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. Findings: We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Conclusions: Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. 
The software is available at: https://hyperbrowser.uio.no. This work was supported by the Research Council of Norway (under grant agreements 221580, 218241, and 231217/F20), by the Norwegian Cancer Society (under grant agreements 71220-PR-2006-0433 and 3485238-2013), and by the South-Eastern Norway Regional Health Authority (under grant agreement 2014041).
- Published
- 2017
31. Comparing the Statistical Fate of Paralogous and Orthologous Sequences
- Author
-
Michael Sheinman, Florian Massip, Peter F. Arndt, Sophie Schbath, Statistique en grande dimension pour la génomique, Département PEGASE [LBBE] (PEGASE), Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), and Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
0301 basic medicine ,Genome evolution ,statistical genomics ,Sequence analysis ,[SDV]Life Sciences [q-bio] ,Sequence Homology ,Genomics ,Sequence alignment ,Computational biology ,comparative genomics ,Biology ,Investigations ,genome evolution ,01 natural sciences ,Genome ,Homologous Sequences ,Homology (biology) ,Evolution, Molecular ,03 medical and health sciences ,Segmental Duplications, Genomic ,Genetics ,0101 mathematics ,DNA duplications ,Probability ,030304 developmental biology ,Mathematics ,Segmental duplication ,Comparative genomics ,0303 health sciences ,Exact sequence ,Models, Genetic ,010102 general mathematics ,Computational Biology ,030104 developmental biology ,Exponent ,Sequence Alignment ,Orthologous Gene - Abstract
For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to get insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model we can estimate the probability to find two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent α=−5. Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution.
- Published
- 2016
32. Statistical Genomics and Bioinformatics
- Author
-
Narain, Prem
- Subjects
statistical genomics ,Focus ,annotated sequence databases ,sequence similarity search ,food and beverages ,Plant culture ,bioinformatics ,fruit crops ,eqtl ,SB1-1110 - Abstract
Some important and interesting topics in the newly emerging disciplines of statistical genomics and bioinformatics have been discussed briefly in relation to plants with possible references to fruit crops. This paper is therefore divided into two parts relating to the two disciplines, respectively. In the first part, mapping of quantitative trait loci (QTL), association mapping, mapping of gene expression transcripts (eQTL), marker-assisted selection, and a systems approach to quantitative genetics have been dealt with. In the second part, generation of databases, annotation, annotated sequence databases, and sequence similarity search have been described.
- Published
- 2010
33. Computational cancer biology: education is a natural key to many locks
- Author
-
Emmert-Streib, Frank, Zhang, Shu-Dong, and Hamilton, Peter
- Subjects
Computational biology ,Computational genomics ,Universities ,Debate ,Data Interpretation, Statistical ,Neoplasms ,Systems medicine ,Humans ,Statistical genomics ,Genomics data ,Computational oncology ,Cancer - Abstract
Background: Oncology is a field that profits tremendously from the genomic data generated by high-throughput technologies, including next-generation sequencing. However, in order to exploit, integrate, visualize and interpret such high-dimensional data efficiently, non-trivial computational and statistical analysis methods are required that need to be developed in a problem-directed manner. Discussion: For this reason, computational cancer biology aims to fill this gap. Unfortunately, computational cancer biology is not yet fully recognized as a coequal field in oncology, leading to a delay in its maturation and, as an immediate consequence, an under-exploration of high-throughput data for translational research. Summary: Here we argue that this imbalance, favoring 'wet lab-based activities', will be naturally rectified over time if the next generation of scientists receives an academic education that provides a fair and competent introduction to computational biology and its manifold capabilities. Furthermore, we discuss a number of local educational provisions that can be implemented at the university level to help facilitate the process of harmonization.
- Published
- 2014
34. Low-frequency variant detection in viral populations using massively parallel sequencing data
- Author
-
Verbist, Bie, Thas, Olivier, and Clement, Lieven
- Subjects
Mathematics and Statistics ,Viral Populations ,Statistical genomics ,Massively Parallel Sequencing ,Variant Calling - Published
- 2014
35. Integrating biological knowledge in gene expression data analysis
- Author
-
Causeur, David, Lê, Sébastien, Institut de Recherche Mathématique de Rennes (IRMAR), AGROCAMPUS OUEST, Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Université de Rennes 2 (UR2), Université de Rennes (UNIV-RENNES)-École normale supérieure - Rennes (ENS Rennes)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA), Institut de Recherche Mathématique de Rennes ( IRMAR ), Université de Rennes 1 ( UR1 ), Université de Rennes ( UNIV-RENNES ) -Université de Rennes ( UNIV-RENNES ) -AGROCAMPUS OUEST-École normale supérieure - Rennes ( ENS Rennes ) -Institut National de Recherche en Informatique et en Automatique ( Inria ) -Institut National des Sciences Appliquées ( INSA ) -Université de Rennes 2 ( UR2 ), Université de Rennes ( UNIV-RENNES ) -Centre National de la Recherche Scientifique ( CNRS ), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-École normale supérieure - Rennes (ENS Rennes)-Université de Rennes 2 (UR2)-Centre National de la Recherche Scientifique (CNRS)-INSTITUT AGRO Agrocampus Ouest, and Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)
- Subjects
[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Statistical genomics ,[ MATH.MATH-ST ] Mathematics [math]/Statistics [math.ST] ,COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS ,[STAT.TH]Statistics [stat]/Statistics Theory [stat.TH] ,[ STAT.TH ] Statistics [stat]/Statistics Theory [stat.TH] - Abstract
Integrating biological knowledge in gene expression data analysis
- Published
- 2011
36. Análise de ligação e mapeamento de QTLs em populações simuladas [Linkage analysis and QTL mapping in simulated populations]
- Author
-
Alves, Alexandre Alonso, Cruz, Cosme Damião, Resende, Marcos Deon Vilela de, Alfenas, Acelino Couto, Bhering, Leonardo Lopes, and Guimarães, Lúcio Mauro da Silva
- Subjects
QTL mapping ,Statistical genomics ,Mapeamento genético ,Mapeamento de QTLs ,Genetic mapping ,CIENCIAS BIOLOGICAS::GENETICA::GENETICA QUANTITATIVA [CNPQ] ,Estatística genômica - Abstract
Conselho Nacional de Desenvolvimento Científico e Tecnológico. As recent advances in technology have led to the development of new genotyping technologies, it is likely that, in the future, high-density linkage maps will be constructed from both dominant and co-dominant markers. Recently, a strictly genetic approach was proposed for estimating the recombination frequency (r) between co-dominant markers in full-sib families. The complete set of estimators was almost obtained, but unfortunately one configuration, involving the estimation of the distance between dominant markers segregating in a 3:1 ratio and co-dominant markers, was not considered. Here nine new estimators are added to the previously published set, making it possible to cover all combinations of molecular markers with two to four alleles (without epistasis) in a full-sib family. This includes segregation in one or both parents, dominance, and all linkage phase configurations. Because backcross (BC) populations are often used as mapping populations in both self-pollinating and outcrossing species, a simulation study was conducted to test the implications of population size, trait heritability, QTL properties (r², a and position) and marker density for the detection power and precision of QTL mapping. To that end, populations of different sizes and with different characteristics (h², number of QTLs and position) were simulated, and the data were analyzed with two commonly used QTL mapping methods: simple interval mapping (SIM) and composite interval mapping (CIM).
Sample size was found to have a major effect on detection power and, consequently, on the estimation of the magnitude of the variation explained by the QTL and of the genetic effect, because small populations do not allow the mapping of small-effect QTLs, especially when these are involved in the genetic control of low-heritability traits. It was also found that QTL positioning based on CIM is more accurate than that based on SIM and that, on average, the mapped QTLs were close to their simulated positions. An interesting result is that CIM tends to underestimate the magnitude (r²) values, especially in large populations/low-heritability traits, and to overestimate them in small populations, which may reflect the small coefficient of variation of the error used, or the fact that when the markers are not located at the exact position of the QTL, this parameter is indeed expected to be underestimated. It is also worth noting that when markers are widely distributed across the genome (~10 cM), and therefore cover the QTL region, if one of the markers is already close to the QTL, a larger number of markers (~1 cM) does not improve the precision of QTL mapping in sufficiently large populations. Based on these results, the use of populations of adequate size, ≥500, is recommended if the intention is to map QTLs in BC populations, because in this situation even medium-density maps can be used to map QTLs of large or small effect with high reliability.
Finally, because the procedures for linkage mapping and QTL mapping in full-sib families of outcrossing species are quite diverse, a study was conducted comparing the modified pseudo-testcross (PST) mapping method (using microsatellites) with the mapping method based on the full-sib family, in terms of marker ordering, distance between markers, total map length, variance of the distance estimates, and stress. The detection power and precision of interval-based QTL mapping methods based on the PST maps or on the full-sib family map were also investigated. In general, the two strategies were found to generate highly correlated maps with proportional linkage group lengths. It was also found that, regardless of the QTL mapping approach used, detection power is reduced in small populations, especially where the trait heritability or the QTL magnitude is small. It was also found that, although the two methods are apparently equivalent in terms of QTL positioning for high-heritability traits/large-effect QTLs, CIM based on the pseudo-testcross maps provides more accurate data for low-heritability traits/small-effect QTLs. With regard to the magnitude of the QTLs, both methods appeared to be equivalent, with values overestimated for high-heritability traits and underestimated for low-heritability traits, regardless of sample size. Thus, for outcrossing species with a moderate level of genomic resources, it is proposed that both the PST approach and the full-sib family approach, along with the related QTL mapping methods, can be used to generate genetic maps and map QTLs with high reliability. It is important to stress, however, that further studies using different scenarios, i.e.
different coefficients of variation of the error, different numbers of QTLs, and different marker distributions, which collectively may make the simulation somewhat more realistic, are needed to verify that the results of this work hold in all situations. As high-throughput genomic tools have led to the development of novel genotyping procedures, it is likely that, in the future, high density linkage maps will be constructed from both dominant and co-dominant markers. Recently, a strictly genetic approach was described for estimating the recombination frequency (r) between co-dominant markers in full-sib families. The complete set of maximum likelihood estimators for r in full-sib families was almost obtained, but unfortunately, one particular configuration involving dominant markers, segregating in a 3:1 ratio and co-dominant markers, was not considered. Here we add nine further estimators to the previously published set, thereby making it possible to cover all combinations of molecular markers with two to four alleles (without epistasis) in a full-sib family. This includes segregation in one or both parents, dominance and all linkage phase configurations. As backcross (BC) populations are often used as mapping populations both in self-pollinating species and in out-breeding species, we also undertook a simulation study to test the implications of population size, trait heritability, QTL properties (r², a and position) and marker density for the power and precision of QTL mapping. To that end, we simulated populations of different sizes and with different characteristics (h², QTL number and location) and analyzed the data with two QTL mapping methods (simple interval mapping (SIM) and composite interval mapping (CIM)).
We found that sample size has a major effect on detection power and, as a consequence, on the estimation of the magnitude and additive genetic effect, as small populations do not allow mapping of low-effect QTLs, especially if these QTLs are involved in the genetic control of traits with low heritability. We also found that the positioning of QTLs based on CIM is more accurate than that based on SIM, and that on average the mapped QTLs are close to their simulated positions. The results showed that CIM tends to underestimate the magnitude (r2) values, especially in large populations and for low-heritability traits, and to overestimate them in smaller populations, which may reflect the low coefficient of variation of the error used, or the fact that, when markers are not at the exact position of the QTL, this parameter is indeed expected to be underestimated. We also highlight that when markers are evenly distributed across the genome (~10 cM), and therefore cover the QTL region, and one of the markers is already close to the QTL, a larger number of markers (~1 cM) does not improve the precision of QTL mapping in sufficiently large mapping populations. Based on our results, we recommend the use of an adequate sample size, say ≥500, if the intention is to map QTLs in backcross populations, because in this situation even mid-density genetic maps can be used to map QTLs of large or small effect with high confidence. Finally, as the procedures for linkage and QTL mapping in full-sib families of outbreeding species are quite diverse, we also undertook a simulation study comparing the modified pseudo-testcross (using SSR markers) and the full-sib mapping designs in terms of marker ordering, distance between markers, total map size, distance variance and stress. We also investigated the power and precision of interval mapping procedures based on the full-sib and on the modified pseudo-testcross maps.
We show that, in general, the modified pseudo-testcross and the full-sib mapping designs generate highly correlated maps with proportional linkage-group lengths. We also show that, regardless of the QTL mapping approach used, detection power is reduced in small populations, especially in situations where trait heritability or QTL magnitude is low. We also found that, although both methods appear to be equivalent in terms of QTL positioning for high-heritability traits and major-effect QTLs, CIM based on modified pseudo-testcross maps provides more accurate data for low-heritability traits and minor-effect QTLs in larger populations. With regard to QTL magnitude, we show that both methods appear to be equivalent, and that the magnitude values tended to be overestimated for the high-heritability trait and underestimated for the low-heritability trait, regardless of sample size. Thus, for outbreeding species with a mid-level of genomic resources, we propose that either the modified pseudo-testcross or the single full-sib mapping design, and the related QTL mapping strategies, can be used to generate genetic maps and map QTLs with high confidence. It is important to highlight, however, that other studies using different scenarios (i.e., different coefficients of variation of the error, different numbers of QTLs, and different marker distributions), which collectively may make the simulation a bit more realistic, are needed in order to see whether the results of our work hold true in every situation.
- Published
- 2010
37. Linkage analysis between dominant and co-dominant makers in full-sib families of out-breeding species
- Author
-
Cosme Damião Cruz, Alexandre Alonso Alves, Acelino C. Alfenas, and Leonardo Lopes Bhering
- Subjects
Genetics ,statistical genomics ,lcsh:QH426-470 ,Diversity Arrays Technology ,Recombination frequency ,Single-nucleotide polymorphism ,Biology ,recombination frequency and maximum likelihood ,Plant Genetics ,lcsh:Genetics ,exogamic populations ,Genetic linkage ,Epistasis ,Allele ,DNA microarray ,Molecular Biology ,Genotyping ,Research Article ,Maximum likelihood ,Dominance (genetics) - Abstract
As high-throughput genomic tools, such as the DNA microarray platform, have led to the development of novel genotyping procedures, such as Diversity Arrays Technology (DArT) and Single Nucleotide Polymorphisms (SNPs), it is likely that, in the future, high-density linkage maps will be constructed from both dominant and co-dominant markers. Recently, a strictly genetic approach was described for estimating recombination frequency (r) between co-dominant markers in full-sib families. The complete set of maximum likelihood estimators for r in full-sib families was almost obtained, but unfortunately, one particular configuration, involving dominant markers segregating in a 3:1 ratio and co-dominant markers, was not considered. Here we add nine further estimators to the previously published set, thereby making it possible to cover all combinations of molecular markers with two to four alleles (without epistasis) in a full-sib family. This includes segregation in one or both parents, dominance and all linkage phase configurations.
- Published
- 2010
38. Multiple testing procedures under confounding
- Author
-
Debashis Ghosh
- Subjects
FOS: Computer and information sciences ,False discovery rate ,statistical genomics ,Mathematics - Statistics Theory ,Statistics Theory (math.ST) ,Machine learning ,computer.software_genre ,Methodology (stat.ME) ,62P10 (Primary) 92D10 (Secondary) ,FOS: Mathematics ,Econometrics ,Statistics - Methodology ,Analysis method ,Mathematics ,multiple comparisons ,business.industry ,Confounding ,Mathematical statistics ,92D10 ,Estimator ,empirical null hypothesis ,Mixture model ,association studies ,62P10 ,Multiple comparisons problem ,Artificial intelligence ,business ,computer - Abstract
While multiple testing procedures have been the focus of much statistical research, an important facet of the problem is how to deal with possible confounding. Procedures have been developed by authors in genetics and statistics. In this chapter, we relate these proposals. We propose two new multiple testing approaches within this framework. The first combines sensitivity analysis methods with false discovery rate estimation procedures. The second involves construction of shrinkage estimators that utilize the mixture model for multiple testing. The procedures are illustrated with applications to a gene expression profiling experiment in prostate cancer. Comment: Published at http://dx.doi.org/10.1214/193940307000000176 in the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2008
39. Center for Systems and Bioinformatics — National Taiwan University.
- Author
-
Oyang, Yen-Jen
- Subjects
GENOMICS ,SYSTEMS biology ,BIOINFORMATICS - Abstract
This article describes the strengths and resources of the National Taiwan University. [ABSTRACT FROM AUTHOR]
- Published
- 2006
40. Multiset Statistics for Gene Set Analysis.
- Author
-
Newton MA and Wang Z
- Abstract
An important data analysis task in statistical genomics involves the integration of genome-wide gene-level measurements with preexisting data on the same genes. A wide variety of statistical methodologies and computational tools have been developed for this general task. We emphasize one particular distinction among methodologies, namely whether they process gene sets one at a time (uniset) or simultaneously via some multiset technique. Owing to the complexity of collections of gene sets, the multiset approach offers some advantages, as it naturally accommodates set-size variations and among-set overlaps. However, this approach presents both computational and inferential challenges. After reviewing some statistical issues that arise in uniset analysis, we examine two model-based multiset methods for gene list data.
- Published
- 2015
41. Linkage analysis between dominant and co-dominant makers in full-sib families of out-breeding species.
- Author
-
Alves AA, Bhering LL, Cruz CD, and Alfenas AC
- Abstract
As high-throughput genomic tools, such as the DNA microarray platform, have led to the development of novel genotyping procedures, such as Diversity Arrays Technology (DArT) and Single Nucleotide Polymorphisms (SNPs), it is likely that, in the future, high-density linkage maps will be constructed from both dominant and co-dominant markers. Recently, a strictly genetic approach was described for estimating recombination frequency (r) between co-dominant markers in full-sib families. The complete set of maximum likelihood estimators for r in full-sib families was almost obtained, but unfortunately, one particular configuration, involving dominant markers segregating in a 3:1 ratio and co-dominant markers, was not considered. Here we add nine further estimators to the previously published set, thereby making it possible to cover all combinations of molecular markers with two to four alleles (without epistasis) in a full-sib family. This includes segregation in one or both parents, dominance and all linkage phase configurations.
- Published
- 2010
42. An Overview of Recent Developments in Genomics and Associated Statistical Methods
- Author
-
Bickel, Peter J., Brown, James B., Huang, Haiyan, and Li, Qunhua
- Published
- 2009