401 results
Search Results
2. Ten Simple Rules for Writing a Reply Paper.
- Author
- Simmons MP
- Subjects
- Algorithms, Information Dissemination methods, Medical Writing, Peer Review, Research methods, Periodicals as Topic, Publishing organization & administration
- Published
- 2015
- Full Text
- View/download PDF
3. Ten Simple Rules for Writing Research Papers.
- Author
- Zhang, Weixiong
- Subjects
- REPORT writing, PRIMARY audience, SCHOLARLY periodicals, HYPOTHESIS, DOUBLE helix structure, NOBEL Prizes, SCHOLARLY peer review
- Abstract
The author discusses the factors that researchers should consider when writing research papers. Topics discussed include analyzing the logical flow of experiments, choosing a target audience and the right journal for delivering the main messages of the research, and various aspects of the hypothesis pursued in the research. Also mentioned are the Nobel-Prize-winning paper on the DNA double helix structure and objective reading of reviews of one's papers.
- Published
- 2014
- Full Text
- View/download PDF
4. Calibrating dimension reduction hyperparameters in the presence of noise.
- Author
- Lin J and Fukuyama J
- Subjects
- Calibration, Humans, Signal-To-Noise Ratio, Computer Simulation, Computational Biology methods, Algorithms
- Abstract
The goal of dimension reduction tools is to construct a low-dimensional representation of high-dimensional data. These tools are employed for a variety of reasons, such as noise reduction, visualization, and lowering computational costs. However, there is a fundamental issue, discussed in other modeling problems, that is often overlooked in dimension reduction: overfitting. In the context of other modeling problems, techniques such as feature selection, cross-validation, and regularization are employed to combat overfitting, but rarely are such precautions taken when applying dimension reduction. Prior applications of the two most popular non-linear dimension reduction methods, t-SNE and UMAP, fail to acknowledge data as a combination of signal and noise when assessing performance. These methods are typically calibrated to capture the entirety of the data, not just the signal. In this paper, we demonstrate the importance of acknowledging noise when calibrating hyperparameters and present a framework that enables users to do so. We use this framework to explore the role hyperparameter calibration plays in overfitting the data when applying t-SNE and UMAP. More specifically, we show that previously recommended values for perplexity and n_neighbors are too small and overfit the noise. We also provide a workflow others may use to calibrate hyperparameters in the presence of noise., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2024 Lin, Fukuyama. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2024
- Full Text
- View/download PDF
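As an illustration of the calibration problem this entry raises, the minimal sketch below sweeps t-SNE's perplexity and scores each embedding against the known signal (the true cluster labels) rather than against the noisy data itself. It assumes scikit-learn; the data, noise scale, and silhouette criterion are invented stand-ins, not the authors' framework.

```python
# Toy perplexity sweep for t-SNE on data with known signal plus added noise.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_signal, labels = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)
X = X_signal + rng.normal(scale=2.0, size=X_signal.shape)  # signal + noise

for perplexity in (5, 30, 100):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    # Score how well the embedding preserves the signal (true cluster labels),
    # not how faithfully it reproduces every noisy pairwise distance.
    print(perplexity, round(silhouette_score(emb, labels), 3))
```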
5. A proactive/reactive mass screening approach with uncertain symptomatic cases.
- Author
- Lin J, Aprahamian H, and Golovko G
- Subjects
- Humans, Pandemics, COVID-19 Testing methods, Uncertainty, Computational Biology methods, COVID-19 diagnosis, COVID-19 epidemiology, SARS-CoV-2, Mass Screening methods, Algorithms
- Abstract
We study the problem of mass screening of heterogeneous populations under a limited testing budget. Mass screening is an essential tool that arises in various settings, e.g., the COVID-19 pandemic. The objective of mass screening is to classify the entire population as positive or negative for a disease as efficiently and accurately as possible. Under a limited budget, testing facilities need to allocate a portion of the budget to target sub-populations (i.e., proactive screening) while reserving the remaining budget to screen for symptomatic cases (i.e., reactive screening). This paper addresses this decision problem by taking advantage of accessible population-level risk information to identify the optimal set of sub-populations for proactive/reactive screening. The framework also incorporates two widely used testing schemes: Individual and Dorfman group testing. By leveraging the special structure of the resulting bilinear optimization problem, we identify key structural properties, which in turn enable us to develop efficient solution schemes. Furthermore, we extend the model to accommodate customized testing schemes across different sub-populations and introduce a highly efficient heuristic solution algorithm for the generalized model. We conduct a comprehensive case study on COVID-19 in the US, utilizing geographically based data. Numerical results demonstrate a significant improvement of up to 52% in total misclassifications compared to conventional screening strategies. In addition, our case study offers valuable managerial insights regarding the allocation of proactive/reactive measures and budget across diverse geographic regions., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2024 Lin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2024
- Full Text
- View/download PDF
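The Dorfman group testing scheme referenced in this entry can be summarized in a few lines: one pooled test is shared by g people, plus g individual retests whenever the pool is positive. The pure-Python sketch below, with invented prevalences, finds the cost-minimizing pool size; the paper's budget-allocation model itself is not reproduced.

```python
# Expected tests per person under Dorfman pooling at prevalence p with pool size g.
def dorfman_tests_per_person(p: float, g: int) -> float:
    return 1 / g + 1 - (1 - p) ** g

for p in (0.001, 0.01, 0.05, 0.15):
    best_g = min(range(2, 64), key=lambda g: dorfman_tests_per_person(p, g))
    print(f"p={p}: best pool size {best_g}, "
          f"{dorfman_tests_per_person(p, best_g):.3f} tests/person")
```

As prevalence rises, the optimal pool shrinks and the savings over individual testing (1 test/person) vanish, which is why allocating pooled versus individual testing across sub-populations becomes a real decision problem.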
6. Knotted artifacts in predicted 3D RNA structures.
- Author
- Gren BA, Antczak M, Zok T, Sulkowska JI, and Szachniuk M
- Subjects
- Machine Learning, Databases, Protein, Nucleic Acid Conformation, RNA chemistry, Computational Biology methods, Algorithms, Models, Molecular, Artifacts
- Abstract
Unlike proteins, RNAs deposited in the Protein Data Bank do not contain topological knots. Admittedly, the first trefoil knot and some lasso-type conformations have recently been found in experimental RNA structures, but these are still exceptional cases. Meanwhile, algorithms that predict 3D RNA models turn out to form knotted structures not so rarely. Interestingly, machine learning-based predictors seem to be more prone to generate knotted RNA folds than traditional methods. A similar situation is observed for the entanglements of structural elements. In this paper, we analyze all models submitted to the CASP15 competition in the 3D RNA structure prediction category. We show what types of topological knots and structure element entanglements appear in the submitted models and highlight what methods are behind the generation of such conformations. We also study the structural aspect of susceptibility to entanglement. We suggest that predictors carefully evaluate RNA models to avoid publishing structures with artifacts, such as unusual entanglements, that result from hallucinations of predictive algorithms., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2024 Gren et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2024
- Full Text
- View/download PDF
7. Threshold-awareness in adaptive cancer therapy.
- Author
- Wang M, Scott JG, and Vladimirsky A
- Subjects
- Humans, Models, Biological, Antineoplastic Agents therapeutic use, Computer Simulation, Neoplasms therapy, Algorithms, Stochastic Processes, Computational Biology methods
- Abstract
Although adaptive cancer therapy shows promise in integrating evolutionary dynamics into treatment scheduling, the stochastic nature of cancer evolution has seldom been taken into account. Various sources of random perturbations can impact the evolution of heterogeneous tumors, making performance metrics of any treatment policy random as well. In this paper, we propose an efficient method for selecting optimal adaptive treatment policies under randomly evolving tumor dynamics. The goal is to improve the cumulative "cost" of treatment, a combination of the total amount of drugs used and the total treatment time. As this cost also becomes random in any stochastic setting, we maximize the probability of reaching the treatment goals (tumor stabilization or eradication) without exceeding a pre-specified cost threshold (or a "budget"). We use a novel Stochastic Optimal Control formulation and Dynamic Programming to find such "threshold-aware" optimal treatment policies. Our approach enables an efficient algorithm to compute these policies for a range of threshold values simultaneously. Compared to treatment plans shown to be optimal in a deterministic setting, the new "threshold-aware" policies significantly improve the chances of the therapy succeeding under the budget, which is correlated with a lower general drug usage. We illustrate this method using two specific examples, but our approach is far more general and provides a new tool for optimizing adaptive therapies based on a broad range of stochastic cancer models., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2024 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2024
- Full Text
- View/download PDF
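To make the "threshold-aware" idea concrete, here is a toy dynamic program, not the authors' formulation, that maximizes the probability of driving an invented tumor-burden chain to zero before a cost budget runs out. All states, costs, and transition probabilities are assumptions for illustration.

```python
# Threshold-aware DP: V[s, b] = best probability of reaching burden 0
# from burden s with integer budget b remaining. Toy numbers throughout.
import numpy as np

N, B = 5, 30                      # burden states 0..N, budget units
# action -> (cost per step, prob of moving down, prob of moving up)
ACTIONS = {"drug": (3, 0.7, 0.3), "rest": (1, 0.3, 0.7)}

V = np.zeros((N + 1, B + 1))
V[0, :] = 1.0                     # burden 0 means the treatment goal is reached
for b in range(1, B + 1):
    for s in range(1, N + 1):
        best = 0.0
        for cost, p_down, p_up in ACTIONS.values():
            if cost > b:
                continue          # action not affordable under remaining budget
            up = min(s + 1, N)    # burden saturates at N in this toy chain
            best = max(best, p_down * V[s - 1, b - cost] + p_up * V[up, b - cost])
        V[s, b] = best

print("P(success) from mid burden with full budget:", round(V[N // 2, B], 3))
```

Sweeping the second index of `V` reproduces, in miniature, the paper's point that one dynamic-programming pass can yield policies for a whole range of budget thresholds at once.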
8. Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data.
- Author
- Zhang L, Lin L, and Li J
- Subjects
- Cluster Analysis, Consensus, Algorithms, Technology
- Abstract
Multi-view data can be generated from diverse sources, by different technologies, and in multiple modalities. In various fields, integrating information from multi-view data has pushed the frontier of discovery. In this paper, we develop a new approach for multi-view clustering, which overcomes the limitations of existing methods such as the need to pool data across views, restrictions on the clustering algorithms allowed within each view, and the disregard for complementary information between views. Our new method, called CPS-merge analysis, merges clusters formed by the Cartesian product of single-view cluster labels, guided by the principle of maximizing clustering stability as evaluated by CPS analysis. In addition, we introduce measures to quantify the contribution of each view to the formation of any cluster. CPS-merge analysis can be easily incorporated into an existing clustering pipeline because it only requires single-view cluster labels instead of the original data. We can thus readily apply advanced single-view clustering algorithms. Importantly, our approach accounts for both consensus and complementary effects between different views, whereas existing ensemble methods focus on finding a consensus for multiple clustering results, implying that results from different views are variations of one clustering structure. Through experiments on single-cell datasets, we demonstrate that our approach frequently outperforms other state-of-the-art methods., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2023 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2023
- Full Text
- View/download PDF
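The Cartesian-product starting point of CPS-merge analysis is easy to sketch: cluster each view on its own, then give every cell a composite label. The toy below assumes scikit-learn; the two "views" and cluster counts are fabricated, and the stability-guided merging, which is the paper's actual contribution, is only indicated by a comment.

```python
# Composite (view-1, view-2) labels: the product clusters CPS-merge starts from.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_rna, _ = make_blobs(n_samples=500, centers=4, n_features=50, random_state=1)
X_adt, _ = make_blobs(n_samples=500, centers=3, n_features=10, random_state=2)

lab1 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_rna)
lab2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_adt)

# Up to 4 x 3 product clusters, many of which may be tiny.
product_labels = lab1 * 3 + lab2
print("occupied product clusters:", len(np.unique(product_labels)))
# CPS-merge would now repeatedly merge the pair of product clusters whose union
# yields the most stable clustering under perturbation (CPS analysis).
```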
9. RCFGL: Rapid Condition adaptive Fused Graphical Lasso and application to modeling brain region co-expression networks.
- Author
- Seal S, Li Q, Basner EB, Saba LM, and Kechris K
- Subjects
- Animals, Rats, Computer Simulation, Gene Regulatory Networks genetics, Brain, Algorithms
- Abstract
Inferring gene co-expression networks is a useful process for understanding gene regulation and pathway activity. The networks are usually undirected graphs where genes are represented as nodes and an edge represents a significant co-expression relationship. When expression data of multiple (p) genes in multiple (K) conditions (e.g., treatments, tissues, strains) are available, joint estimation of networks harnessing shared information across them can significantly increase the power of analysis. In addition, examining condition-specific patterns of co-expression can provide insights into the underlying cellular processes activated in a particular condition. Condition adaptive fused graphical lasso (CFGL) is an existing method that incorporates condition specificity in a fused graphical lasso (FGL) model for estimating multiple co-expression networks. However, with a computational complexity of O(p²K log K), the current implementation of CFGL is prohibitively slow even for a moderate number of genes and can only be used for a maximum of three conditions. In this paper, we propose a faster alternative of CFGL named rapid condition adaptive fused graphical lasso (RCFGL). In RCFGL, we incorporate the condition specificity into another popular model for joint network estimation, known as fused multiple graphical lasso (FMGL). We use a more efficient algorithm in the iterative steps compared to CFGL, enabling faster computation with complexity of O(p²K) and making it easily generalizable for more than three conditions. We also present a novel screening rule to determine if the full network estimation problem can be broken down into estimation of smaller disjoint sub-networks, thereby reducing the complexity further. We demonstrate the computational advantage and superior performance of our method compared to two non-condition adaptive methods, FGL and FMGL, and one condition adaptive method, CFGL, in both a simulation study and a real data analysis. We used RCFGL to jointly estimate the gene co-expression networks in different brain regions (conditions) using a cohort of heterogeneous stock rats. We also provide an accompanying C- and Python-based package that implements RCFGL., Competing Interests: No competing interests declared., (Copyright: © 2023 Seal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2023
- Full Text
- View/download PDF
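For context, the non-fused baseline that RCFGL improves on can be sketched with scikit-learn's ordinary graphical lasso, one sparse precision matrix per condition. The fused ADMM machinery of the paper is not reproduced here, and all data are random placeholders.

```python
# Per-condition graphical lasso baseline (no fusion across conditions).
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
conditions = {c: rng.normal(size=(200, 15)) for c in ("brain_region_A", "brain_region_B")}

precisions = {}
for name, X in conditions.items():
    gl = GraphicalLasso(alpha=0.2).fit(X)
    precisions[name] = gl.precision_
    n_edges = (int((np.abs(gl.precision_) > 1e-8).sum()) - 15) // 2
    print(name, "nonzero off-diagonal edges:", n_edges)

# A fused penalty would additionally shrink the differences between
# precisions["brain_region_A"] and precisions["brain_region_B"], so edges
# shared across brain regions are estimated jointly.
```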
10. Systematic comparison of modeling fidelity levels and parameter inference settings applied to negative feedback gene regulation.
- Author
- Coulier A, Singh P, Sturrock M, and Hellander A
- Subjects
- Feedback, Bayes Theorem, Gene Regulatory Networks genetics, Algorithms
- Abstract
Quantitative stochastic models of gene regulatory networks are important tools for studying cellular regulation. Such models can be formulated at many different levels of fidelity. A practical challenge is to determine what model fidelity to use in order to get accurate and representative results. The choice is important, because models of successively higher fidelity come at a rapidly increasing computational cost. In some situations, the level of detail is clearly motivated by the question under study. In many situations, however, several model options could qualitatively agree with available data, depending on the amount of data and the nature of the observations. Here, an important distinction is whether we are interested in inferring the true (but unknown) physical parameters of the model or if it is sufficient to be able to capture and explain available data. The situation becomes complicated from a computational perspective because inference needs to be approximate. Most often it is based on likelihood-free Approximate Bayesian Computation (ABC), and here determining which summary statistics to use and how much data is needed to reach the desired level of accuracy are difficult tasks. Ultimately, all of these aspects (the model fidelity, the available data, and the numerical choices for inference) interplay in a complex manner. In this paper we develop a computational pipeline designed to systematically evaluate inference accuracy for a wide range of true known parameters. We then use it to explore inference settings for negative feedback gene regulation. In particular, we compare a detailed spatial stochastic model, a coarse-grained compartment-based multiscale model, and the standard well-mixed model, across several data-scenarios and for multiple numerical options for parameter inference. Practically speaking, this pipeline can be used as a preliminary step to guide modelers prior to gathering experimental data. By training Gaussian processes to approximate the distance function values, we are able to substantially reduce the computational cost of running the pipeline., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2022 Coulier et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2022
- Full Text
- View/download PDF
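The likelihood-free ABC mentioned above reduces, in its simplest rejection form, to a few lines: simulate from the prior and keep parameters whose summary statistic lands close to the observed one. The model, prior, summary statistic, and tolerance below are toy assumptions.

```python
# Minimal ABC rejection sampler.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
observed = rng.poisson(theta_true, size=50)          # stand-in observed counts
obs_summary = observed.mean()                        # the summary-statistic choice matters

def simulate(theta, rng):
    return rng.poisson(theta, size=50).mean()

accepted = []
for _ in range(20000):
    theta = rng.uniform(0.0, 10.0)                   # prior draw
    if abs(simulate(theta, rng) - obs_summary) < 0.1:  # tolerance epsilon
        accepted.append(theta)

print(f"posterior mean ~ {np.mean(accepted):.2f} from {len(accepted)} accepted draws")
```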
11. HELIOS: High-speed sequence alignment in optics.
- Author
- Maleki E, Akbari Rokn Abadi S, and Koohi S
- Subjects
- Sequence Alignment, Amino Acid Sequence, Computer Simulation, Algorithms, INDEL Mutation
- Abstract
In response to the imperfections of current sequence alignment methods, which originate from the inherent serialism of their underlying electrical systems, a few optical approaches for biological data comparison have been proposed recently. However, due to their low performance, which arises from an inefficient coding scheme, this paper presents a novel all-optical high-throughput method for aligning DNA, RNA, and protein sequences, named HELIOS. The HELIOS method employs highly sophisticated operations to locate character matches, single or multiple mutations, and single or multiple indels within various biological sequences. On the other hand, the HELIOS optical architecture exploits high-speed processing and operational parallelism in optics, by adopting wavelength and polarization of optical beams. For evaluation, the functionality and accuracy of the HELIOS method are validated through behavioral and optical simulation studies, while its complexity and performance are estimated through analytical computation. The accuracy evaluations indicate that the HELIOS method achieves a precise pairwise alignment of two sequences, highly similar to those of Smith-Waterman, Needleman-Wunsch, BLAST, MUSCLE, ClustalW, ClustalΩ, T-Coffee, Kalign, and MAFFT. According to our performance evaluations, the HELIOS optical architecture outperforms all alternative electrical and optical algorithms in terms of processing time and memory requirement, relying on its highly sophisticated method and optical architecture. Moreover, the employed compact coding scheme greatly increases the number of input characters, and hence offers reduced time and space complexities compared to the electrical and optical alternatives. This makes the HELIOS method and optical architecture highly applicable for biomedical applications., Competing Interests: The authors have declared that no competing interests exist., (Copyright: © 2022 Maleki et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
- Published
- 2022
- Full Text
- View/download PDF
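For reference alongside the aligners HELIOS is compared against, here is the classic Needleman-Wunsch dynamic program in compact form; the scoring values are arbitrary choices, not those used in the paper's evaluation.

```python
# Needleman-Wunsch global alignment score via dynamic programming.
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # aligning a prefix of a against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```

The O(nm) table fill is exactly the serial bottleneck the entry attributes to electrical systems; HELIOS's claim is to parallelize this comparison optically.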
12. Efficient Bayesian inference for stochastic agent-based models.
- Author
- Jørgensen ACS, Ghosh A, Sturrock M, and Shahrezaei V
- Subjects
- Bayes Theorem, Humans, Algorithms, Biochemical Phenomena
- Abstract
The modelling of many real-world problems relies on computationally heavy simulations of randomly interacting individuals or agents. However, the values of the parameters that underlie the interactions between agents are typically poorly known, and hence they need to be inferred from macroscopic observations of the system. Since statistical inference rests on repeated simulations to sample the parameter space, the high computational expense of these simulations can become a stumbling block. In this paper, we compare two ways to mitigate this issue in a Bayesian setting through the use of machine learning methods: One approach is to construct lightweight surrogate models to substitute the simulations used in inference. Alternatively, one might altogether circumvent the need for Bayesian sampling schemes and directly estimate the posterior distribution. We focus on stochastic simulations that track autonomous agents and present two case studies: tumour growths and the spread of infectious diseases. We demonstrate that good accuracy in inference can be achieved with a relatively small number of simulations, making our machine learning approaches orders of magnitude faster than classical simulation-based methods that rely on sampling the parameter space. However, we find that while some methods generally produce more robust results than others, no algorithm offers a one-size-fits-all solution when attempting to infer model parameters from observations. Instead, one must choose the inference technique with the specific real-world application in mind. The stochastic nature of the considered real-world phenomena poses an additional challenge that can become insurmountable for some approaches. Overall, we find machine learning approaches that create direct inference machines to be promising for real-world applications. We present our findings as general guidelines for modelling practitioners., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2022
- Full Text
- View/download PDF
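The surrogate-modelling strategy compared in this entry can be sketched in miniature: fit a cheap regressor on a handful of expensive simulations, then run rejection inference against the surrogate. Everything below (the stand-in "simulator", kernel, noise level, tolerance) is an assumption for illustration, not the authors' pipeline.

```python
# Surrogate-assisted inference: GP regression replaces the slow simulator.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def expensive_simulator(theta):
    # Pretend this is a slow stochastic agent-based model returning a summary stat.
    return 3.0 * theta + rng.normal(scale=0.5)

# 1) Small training set of (parameter, summary) pairs from real simulations.
thetas = rng.uniform(0, 5, size=40)
summaries = np.array([expensive_simulator(t) for t in thetas])
surrogate = GaussianProcessRegressor(alpha=0.25).fit(  # alpha ~ simulator noise variance
    thetas.reshape(-1, 1), summaries)

# 2) ABC-style inference now queries the cheap surrogate instead of the simulator.
observed = 9.0
candidates = rng.uniform(0, 5, size=100_000)
predicted = surrogate.predict(candidates.reshape(-1, 1))
posterior = candidates[np.abs(predicted - observed) < 0.2]
print(f"posterior mean ~ {posterior.mean():.2f} (true ~ {observed / 3.0:.2f})")
```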
13. Fast and interpretable consensus clustering via minipatch learning.
- Author
- Gan L and Allen GI
- Subjects
- Cluster Analysis, Consensus, Reproducibility of Results, Algorithms, Computational Biology methods
- Abstract
Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as to discover cell types from single-cell sequencing data, for example, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes of features, which lead to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but it also substantially improves computational efficiency compared to standard consensus clustering approaches., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2022
- Full Text
- View/download PDF
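The minipatch idea itself fits in a few lines: cluster many tiny random subsets of both observations and features, accumulate a co-clustering matrix, and cluster the consensus. The sketch below omits the paper's adaptive sampling schemes and uses invented data; scikit-learn is assumed.

```python
# Minipatch consensus clustering (non-adaptive toy version).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, centers=3, n_features=40, random_state=0)
n, p = X.shape

co, counts = np.zeros((n, n)), np.zeros((n, n))
for _ in range(100):
    obs = rng.choice(n, size=n // 4, replace=False)    # minipatch rows
    feats = rng.choice(p, size=p // 4, replace=False)  # minipatch columns
    labels = KMeans(n_clusters=3, n_init=5, random_state=0).fit_predict(X[np.ix_(obs, feats)])
    same = labels[:, None] == labels[None, :]          # co-clustered on this patch?
    co[np.ix_(obs, obs)] += same
    counts[np.ix_(obs, obs)] += 1

consensus = np.divide(co, counts, out=np.zeros_like(co), where=counts > 0)
final = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                linkage="average").fit_predict(1 - consensus)
print(np.bincount(final))
```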
14. Generation of Binary Tree-Child phylogenetic networks.
- Author
- Cardona, Gabriel, Pons, Joan Carles, and Scornavacca, Celine
- Subjects
- BOTANY, PHYSICAL sciences, BINARY number system, LIFE sciences, PLANT anatomy, GRAPH theory
- Abstract
Phylogenetic networks generalize phylogenetic trees by allowing the modelling of reticulate evolution events. Among the different kinds of phylogenetic networks that have been proposed in the literature, the subclass of binary tree-child networks is one of the most studied ones. However, very little is known about the combinatorial structure of these networks. In this paper we address the problem of generating all possible binary tree-child (BTC) networks with a given number of leaves in an efficient way via reduction/augmentation operations that extend and generalize analogous operations for phylogenetic trees, and are biologically relevant. Since our solution is recursive, this also provides us with a recurrence relation giving an upper bound on the number of such networks. We also show how the operations introduced in this paper can be employed to extend the evolutionary history of a set of sequences, represented by a BTC network, to include a new sequence. An implementation in Python of the algorithms described in this paper, along with some computational experiments, can be downloaded from . [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
15. Rapid Prediction of Bacterial Heterotrophic Fluxomics Using Machine Learning and Constraint Programming.
- Author
- Wu, Stephen Gang, Wang, Yuxuan, Jiang, Wu, Oyetunde, Tolutola, Yao, Ruilian, Zhang, Xuehong, Shimizu, Kazuyuki, Tang, Yinjie J., and Bao, Forrest Sheng
- Subjects
- METABOLIC flux analysis, SUPPORT vector machines, CELL metabolism, MACHINE learning, STOICHIOMETRY
- Abstract
13C metabolic flux analysis (13C-MFA) has been widely used to measure in vivo enzyme reaction rates (i.e., metabolic flux) in microorganisms. Mining the relationship between environmental and genetic factors and metabolic fluxes hidden in existing fluxomic data will lead to predictive models that can significantly accelerate flux quantification. In this paper, we present a web-based platform MFlux () that predicts the bacterial central metabolism via machine learning, leveraging data from approximately 100 13C-MFA papers on heterotrophic bacterial metabolisms. Three machine learning methods, namely Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Decision Tree, were employed to study the sophisticated relationship between influential factors and metabolic fluxes. We performed a grid search of the best parameter set for each algorithm and verified their performance through 10-fold cross validations. SVM yields the highest accuracy among all three algorithms. Further, we employed quadratic programming to adjust flux profiles to satisfy stoichiometric constraints. Multiple case studies have shown that MFlux can reasonably predict fluxomes as a function of bacterial species, substrate types, growth rate, oxygen conditions, and cultivation methods. Due to the interest in studying model organisms under particular carbon sources, fluxome bias in the dataset may limit the applicability of machine learning models. This problem can be resolved after more papers on 13C-MFA are published for non-model species. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
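The model-selection recipe described in this entry, a grid search over SVM hyperparameters with 10-fold cross-validation, looks roughly like this in scikit-learn. The features and labels below are random placeholders, not 13C-MFA data, and the parameter grid is an invented example.

```python
# Grid-searched SVM with 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                 # stand-ins for encoded culture features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # placeholder flux-class target

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
    cv=10,                                    # 10-fold cross-validation as described
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```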
16. Efficient ReML inference in variance component mixed models using a Min-Max algorithm.
- Author
- Laporte F, Charcosset A, and Mary-Huard T
- Subjects
- Algorithms, Computational Biology methods, Models, Genetic, Models, Statistical
- Abstract
Since their introduction in the 1950s, variance component mixed models have been widely used in many application fields. In this context, ReML estimation is by far the most popular procedure to infer the variance components of the model. Although many implementations of the ReML procedure are readily available, there is still a need for computational improvements due to the ever-increasing size of the datasets to be handled, and to the complexity of the models to be adjusted. In this paper, we present a Min-Max (MM) algorithm for ReML inference and combine it with several speed-up procedures. The ReML MM algorithm we present is compared to 5 state-of-the-art publicly available algorithms used in statistical genetics. The computational performance of the different algorithms is evaluated on several datasets representing different plant breeding experimental designs. The MM algorithm ranks among the top 2 methods in almost all settings and is more versatile than many of its competitors. The MM algorithm is a promising alternative to the classical AI-ReML algorithm in the context of variance component mixed models. It is available in the MM4LMM R-package., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2022
- Full Text
- View/download PDF
17. Inferring tumor progression in large datasets.
- Author
- Mohaghegh Neyshabouri, Mohammadreza, Jun, Seong-Hwan, and Lagergren, Jens
- Subjects
- CANCER invasiveness, GENETIC mutation, SOMATIC mutation, ALGORITHMS, COLON cancer, GENE clusters
- Abstract
Identification of mutations of the genes that give cancer a selective advantage is an important step towards research and clinical objectives. As such, there has been a growing interest in developing methods for identification of driver genes and their temporal order within a single patient (intra-tumor) as well as across a cohort of patients (inter-tumor). In this paper, we develop a probabilistic model for tumor progression, in which the driver genes are clustered into several ordered driver pathways. We develop an efficient inference algorithm that exhibits favorable scalability to the number of genes and samples compared to a previously introduced ILP-based method. Adopting a probabilistic approach also allows principled approaches to model selection and uncertainty quantification. Using a large set of experiments on synthetic datasets, we demonstrate our superior performance compared to the ILP-based method. We also analyze two biological datasets of colorectal and glioblastoma cancers. We emphasize that while the ILP-based method puts many seemingly passenger genes in the driver pathways, our algorithm keeps focused on truly driver genes and outputs more accurate models for cancer progression. Author summary: Cancer is a disease caused by the accumulation of somatic mutations in the genome. This process is mainly driven by mutations in certain genes that give the harboring cells some selective advantage. The rather few driver genes are usually masked amongst an abundance of so-called passenger mutations. Identification of the driver genes and the temporal order in which the mutations occur is of great importance towards research and clinical objectives. In this paper, we introduce a probabilistic model for cancer progression and devise an efficient inference algorithm to train the model. We show that our method scales favorably to large datasets and provides superior performance compared to an ILP-based counterpart on a wide set of synthetic data simulations. Our Bayesian approach also allows for systematic model selection and confidence quantification procedures in contrast to the previous non-probabilistic progression models. We also study two large datasets on colorectal and glioblastoma cancers and validate our inferred model in comparison to the ILP-based method. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
18. Vessel network extraction and analysis of mouse pulmonary vasculature via X-ray micro-computed tomographic imaging.
- Author
- Chadwick EA, Suzuki T, George MG, Romero DA, Amon C, Waddell TK, Karoubi G, and Bazylak A
- Subjects
- Animals, Mice, Inbred C57BL, Pulmonary Artery cytology, Pulmonary Veins cytology, Mice, Algorithms, Image Processing, Computer-Assisted methods, Lung blood supply, Tomography, X-Ray Computed methods
- Abstract
In this work, non-invasive high-spatial resolution three-dimensional (3D) X-ray micro-computed tomography (μCT) of healthy mouse lung vasculature is performed. Methodologies are presented for filtering, segmenting, and skeletonizing the collected 3D images. Novel methods for the removal of spurious branch artefacts from the skeletonized 3D image are introduced; they combine distance transform gradients, diameter-length ratios, and the fast marching method (FMM). These new techniques consistently remove spurious branches without compromising the connectivity of the pulmonary circuit. Analysis of the filtered, skeletonized, and segmented 3D images is performed using a newly developed Vessel Network Extraction algorithm to fully characterize the morphology of the mouse pulmonary circuit. The removal of spurious branches from the skeletonized image results in an accurate representation of the pulmonary circuit with significantly less variability in vessel diameter and vessel length in each generation. The branching morphology of a full pulmonary circuit is characterized by the mean diameter per generation and number of vessels per generation. The methods presented in this paper lead to a significant improvement in the characterization of 3D vasculature imaging, allow for automatic separation of arteries and veins, and for the characterization of generations containing capillaries and intrapulmonary arteriovenous anastomoses (IPAVA)., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2021
- Full Text
- View/download PDF
19. A gap-filling algorithm for prediction of metabolic interactions in microbial communities.
- Author
- Giannari, Dafni, Ho, Cleo Hanchen, and Mahadevan, Radhakrishnan
- Subjects
- MICROBIAL communities, ALGORITHMS, MICROBIAL metabolism, METABOLIC models, HUMAN microbiota, GUT microbiome
- Abstract
The study of microbial communities and their interactions has attracted the interest of the scientific community, because of their potential for applications in biotechnology, ecology and medicine. The complexity of interspecies interactions, which are key for the macroscopic behavior of microbial communities, cannot be studied easily experimentally. For this reason, the modeling of microbial communities has begun to leverage the knowledge of established constraint-based methods, which have long been used for studying and analyzing the microbial metabolism of individual species based on genome-scale metabolic reconstructions of microorganisms. A main problem of genome-scale metabolic reconstructions is that they usually contain metabolic gaps due to genome misannotations and unknown enzyme functions. This problem is traditionally solved by using gap-filling algorithms that add biochemical reactions from external databases to the metabolic reconstruction, in order to restore model growth. However, gap-filling algorithms could evolve by taking into account metabolic interactions among species that coexist in microbial communities. In this work, a gap-filling method that resolves metabolic gaps at the community level was developed. The efficacy of the algorithm was tested by analyzing its ability to resolve metabolic gaps on a synthetic community of auxotrophic Escherichia coli strains. Subsequently, the algorithm was applied to resolve metabolic gaps and predict metabolic interactions in a community of Bifidobacterium adolescentis and Faecalibacterium prausnitzii, two species present in the human gut microbiota, and in an experimentally studied community of Dehalobacter and Bacteroidales species of the ACT-3 community. The community gap-filling method can facilitate the improvement of metabolic models and the identification of metabolic interactions that are difficult to identify experimentally in microbial communities. Author summary: Microbes are the most abundant form of life on earth and they are almost never found in isolation as they live in close association with one another and with other organisms. The metabolic capacity of individual microbial species dictates their ways of interacting with other species as well as with their environment. The metabolic interactions among microorganisms has been recognised as the driving force for the properties of microbial communities. For this reason, understanding the effect of microbial metabolism on interspecies metabolic interactions is essential for the study of microbial communities. This study can benefit from metabolic modeling and the insights offered by constraint-based methods which have been developed for interrogating metabolic models. In this paper, we present an algorithm that predicts cooperative and competitive metabolic interactions between species while it resolves metabolic gaps in their metabolic models in a computationally efficient way. We use our community gap-filling algorithm to study microbial communities with interesting environmental and health-related applications. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
20. Reconstructing tumor evolutionary histories and clone trees in polynomial-time with SubMARine.
- Author
- Sundermann LK, Wintersinger J, Rätsch G, Stoye J, and Morris Q
- Subjects
- Computer Simulation, Evolution, Molecular, High-Throughput Nucleotide Sequencing, Humans, Whole Genome Sequencing, Algorithms, Computational Biology methods, Mutation genetics, Neoplasms classification, Neoplasms genetics
- Abstract
Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data and current methods do not efficiently capture this uncertainty; nor can these methods scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially-defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones. We show through extensive noise-free simulation, a large lung cancer dataset and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than one Gb of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely-available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2021
- Full Text
- View/download PDF
21. Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data.
- Author
- Tjärnberg A, Mahmood O, Jackson CA, Saldi GA, Cho K, Christiaen LA, and Bonneau RA
- Subjects
- Animals, Cell Line, Databases, Genetic, Humans, Mice, RNA-Seq, Saccharomyces cerevisiae, Algorithms, Genomics methods, Single-Cell Analysis methods, Supervised Machine Learning
- Abstract
The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. Code and example data for DEWÄKSS is available at https://gitlab.com/Xparx/dewakss/-/tree/Tjarnberg2020branch., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2021
- Full Text
- View/download PDF
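The baseline operation that DEWÄKSS tunes, averaging each cell's expression over its k nearest neighbors, is sketched below with scikit-learn. Here k is fixed by hand, whereas the entry's point is precisely that the graph and weighting should be chosen self-supervisedly; the data are toy counts.

```python
# kNN-graph denoising by neighbor averaging.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 100)).astype(float)   # toy cells x genes counts

k = 15
nn = NearestNeighbors(n_neighbors=k).fit(X)
_, idx = nn.kneighbors(X)           # idx[i] = the k nearest cells to cell i (incl. itself)
X_denoised = X[idx].mean(axis=1)    # average expression over each cell's neighborhood

print(X.var(), X_denoised.var())    # denoising shrinks variance; too large a k oversmooths
```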
22. Extended graphical lasso for multiple interaction networks for high dimensional omics data.
- Author
- Xu, Yang, Jiang, Hongmei, and Jiang, Wenxin
- Subjects
- PROBLEM solving, BIOLOGICAL networks, NETWORK hubs, ALGORITHMS, MEDICAL research
- Abstract
There has been a spate of interest in association networks in biological and medical research, for example, genetic interaction networks. In this paper, we propose a novel method, the extended joint hub graphical lasso (EDOHA), to estimate multiple related interaction networks for high dimensional omics data across multiple distinct classes. To be specific, we construct a convex penalized log likelihood optimization problem and solve it with an alternating direction method of multipliers (ADMM) algorithm. The proposed method can also be adapted to estimate interaction networks for high dimensional compositional data such as microbial interaction networks. The performance of the proposed method in the simulated studies shows that EDOHA has remarkable advantages over existing comparable methods in recognizing class-specific hubs. We also present three applications to real datasets. Biological interpretations of our results confirm those of previous studies and offer a more comprehensive understanding of the underlying mechanism in disease. Author summary: Reconstruction of multiple association networks from high dimensional omics data is an important topic, especially in biology. Previous studies focused on estimating different networks and detecting common hubs among all classes. Integration of information over different classes of data while allowing for differences in the hub nodes is also biologically plausible. Therefore, we propose a method, EDOHA, to jointly construct multiple interaction networks with the capacity to find different hub networks for each class of data. Simulation studies show better performance over conventional methods. The method has been demonstrated on three real-world datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
23. Enzyme sequestration by the substrate: An analysis in the deterministic and stochastic domains.
- Author
- Petrides, Andreas and Vinnicombe, Glenn
- Subjects
- PHOSPHORYLATION, PHOSPHATASES, KINASES, ENZYMES, SEQUESTRATION (Chemistry)
- Abstract
This paper is concerned with the potential multistability of protein concentrations in the cell. That is, situations where one, or a family of, proteins may sit at one of two or more different steady state concentrations in otherwise identical cells, and in spite of them being in the same environment. For models of multisite protein phosphorylation for example, in the presence of excess substrate, it has been shown that the achievable number of stable steady states can increase linearly with the number of phosphosites available. In this paper, we analyse the consequences of adding enzyme docking to these and similar models, with the resultant sequestration of phosphatase and kinase by the fully unphosphorylated and by the fully phosphorylated substrates respectively. In the large molecule numbers limit, where deterministic analysis is applicable, we prove that there are always values for these rates of sequestration which, when exceeded, limit the extent of multistability. For the models considered here, these numbers are much smaller than the affinity of the enzymes to the substrate when it is in a modifiable state. As substrate enzyme-sequestration is increased, we further prove that the number of steady states will inevitably be reduced to one. For smaller molecule numbers a stochastic analysis is more appropriate, where multistability in the large molecule numbers limit can manifest itself as multimodality of the probability distribution; the system spending periods of time in the vicinity of one mode before jumping to another. Here, we find that substrate enzyme sequestration can induce bimodality even in systems where only a single steady state can exist at large numbers. To facilitate this analysis, we develop a weakly chained diagonally dominant M-matrix formulation of the Chemical Master Equation, allowing greater insights in the way particular mechanisms, like enzyme sequestration, can shape probability distributions and therefore exhibit different behaviour across different regimes. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
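The Chemical Master Equation objects discussed in this entry can be made concrete with a toy birth-death process: build the generator matrix (whose transpose is the M-matrix-like operator the authors analyze) and solve for the stationary distribution. All rates below are invented, and this single-species model exhibits none of the paper's multimodality.

```python
# Stationary distribution of a truncated CME via the generator's null space.
import numpy as np

kprod, kdeg, nmax = 5.0, 1.0, 40          # birth rate, per-molecule death rate, truncation
Q = np.zeros((nmax + 1, nmax + 1))        # Q[i, j] = rate of jumping from i to j molecules
for n in range(nmax + 1):
    if n < nmax:
        Q[n, n + 1] = kprod               # production: n -> n + 1
    if n > 0:
        Q[n, n - 1] = kdeg * n            # degradation: n -> n - 1
    Q[n, n] = -Q[n].sum()                 # diagonal makes every row sum to zero

# Solve pi Q = 0 with sum(pi) = 1 via the eigenvector of Q^T at eigenvalue 0.
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmin(np.abs(w))])
pi /= pi.sum()
print("mean copy number:", round(float(pi @ np.arange(nmax + 1)), 2))  # ~ kprod/kdeg = 5
```

For this model the stationary law is the Poisson(kprod/kdeg) distribution; bimodality of the kind the paper studies would show up as two peaks in `pi`.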
24. Drug2ways: Reasoning over causal paths in biological networks for drug discovery.
- Author
- Rivas-Barragan, Daniel, Mubeen, Sarah, Guim Bernat, Francesc, Hofmann-Apitius, Martin, and Domingo-Fernández, Daniel
- Subjects
- BIOLOGICAL networks, PYTHON programming language, SYMPTOMS, CELL physiology, DRUG design, ALGORITHMS, SCIENTIFIC community
- Abstract
Elucidating the causal mechanisms responsible for disease can reveal potential therapeutic targets for pharmacological intervention and, accordingly, guide drug repositioning and discovery. In essence, the topology of a network can reveal the impact a drug candidate may have on a given biological state, leading the way for enhanced disease characterization and the design of advanced therapies. Network-based approaches, in particular, are highly suited for these purposes as they hold the capacity to identify the molecular mechanisms underlying disease. Here, we present drug2ways, a novel methodology that leverages multimodal causal networks for predicting drug candidates. Drug2ways implements an efficient algorithm which reasons over causal paths in large-scale biological networks to propose drug candidates for a given disease. We validate our approach using clinical trial information and demonstrate how drug2ways can be used for multiple applications to identify: i) single-target drug candidates, ii) candidates with polypharmacological properties that can optimize multiple targets, and iii) candidates for combination therapy. Finally, we make drug2ways available to the scientific community as a Python package that enables conducting these applications on multiple standard network formats. Author summary: At any given time, a large set of biomolecules are interacting in ways that give rise to the normal functioning of a cell. By representing biological interactions as networks, we can reconstruct the complex molecular mechanisms that govern the physiology of a cell. These networks can then be analyzed to understand where the system fails and how that can give rise to disease. Similarly, using computational methods, we can also enrich these networks with drugs, diseases and disease phenotypes to estimate how a drug, or a combination of drugs, would behave in a system and whether it can be used to treat or alleviate the symptoms of a disease. In this paper, we present drug2ways, a novel methodology designed for drug discovery applications, that exploits the information contained in a biological network comprising causal relations between drugs, proteins, and diseases. Employing these networks and an efficient algorithm, drug2ways traverses over the ensemble of paths between a drug and a disease to propose the drugs that are most likely to cure the disease based on the information contained in the network. We hypothesize that this ensemble of paths could be used to simulate the mechanism of action of a drug and the directionality inferred through these paths could be used as a proxy to identify drug candidates. Through several experiments, we demonstrate how drug2ways can be used to find novel ways of using existing drugs, identify drug candidates, optimize treatments by targeting multiple disease phenotypes, and propose combination therapies. Owing to the generalizability of the algorithm and the accompanying software, we envision that drug2ways could be applied to a variety of biological networks to generate new hypotheses for drug discovery and a better understanding of their mechanisms of action. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
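The path-reasoning idea behind drug2ways can be illustrated on a tiny signed graph with networkx: the effect of a drug on a disease along one causal path is the product of edge signs, and candidates are ranked by aggregating over paths. The graph below is invented, and this sketch is not the package's API.

```python
# Sign propagation over causal paths on a toy drug-protein-disease graph.
import networkx as nx

G = nx.DiGraph()
G.add_edge("drugX", "kinaseA", sign=-1)     # drugX inhibits kinaseA
G.add_edge("kinaseA", "TF_B", sign=+1)      # kinaseA activates TF_B
G.add_edge("TF_B", "disease", sign=+1)
G.add_edge("drugX", "geneC", sign=+1)
G.add_edge("geneC", "disease", sign=-1)

total = 0
for path in nx.all_simple_paths(G, "drugX", "disease", cutoff=6):
    effect = 1
    for u, v in zip(path, path[1:]):
        effect *= G[u][v]["sign"]           # path effect = product of edge signs
    total += effect

# A negative aggregate effect on "disease" marks drugX as a candidate.
print("aggregate path effect:", total)      # here: (-1) + (-1) = -2
```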
25. Deconvolution of heterogeneous tumor samples using partial reference signals.
- Author
- Qin, Yufang, Zhang, Weiwei, Sun, Xiaoqiang, Nan, Siwei, Wei, Nana, Wu, Hua-Jun, and Zheng, Xiaoqi
- Subjects
- DECONVOLUTION (Mathematics), NOMOGRAPHY (Mathematics), MATRIX decomposition, NONNEGATIVE matrices, CANCER cells, CELL communication, ALGORITHMS
- Abstract
Deconvolution of heterogeneous bulk tumor samples into distinct cellular populations is an important yet challenging problem, particularly when only partial references are available. A common approach to dealing with this problem is to deconvolve the mixed signals using available references and leverage the remaining signal as a new cell component. However, as indicated in our simulation, such an approach tends to over-estimate the proportions of known cell types and fails to detect novel cell types. Here, we propose PREDE, a partial reference-based deconvolution method using an iterative non-negative matrix factorization algorithm. Our method is verified to be effective in estimating cell proportions and expression profiles of unknown cell types based on simulated datasets at a variety of parameter settings. Applying our method to TCGA tumor samples, we found that proportions of pure cancer cells better indicate different subtypes of tumor samples. We also detected several cell types for each cancer type whose proportions successfully predicted patient survival. Our method makes a significant contribution to deconvolution of heterogeneous tumor samples and could be widely applied to a variety of high-throughput bulk data. PREDE is implemented in R and is freely available from GitHub (https://xiaoqizheng.github.io/PREDE). Author summary: Tumor tissues are mixtures of different cell types. Identification and quantification of constitutional cell types within tumor tissues are important tasks in cancer research. The problem can be readily solved using regression-based methods if reference signals are available. But in most clinical applications, only partial references are available, which significantly reduces the deconvolution accuracy of the existing regression-based methods. In this paper, we propose a partial-reference based deconvolution model, PREDE, integrating the non-negative matrix factorization framework with an iterative optimization strategy. We conducted comprehensive evaluations for PREDE using both simulation and real data analyses, demonstrating better performance of our method than other existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
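The regression half of partial-reference deconvolution is easy to sketch: with reference profiles for the known cell types, non-negative least squares estimates their proportions, and, as the abstract notes, ignoring the unknown component inflates those estimates. Synthetic numbers throughout; this is not PREDE itself.

```python
# NNLS deconvolution with a missing (unknown) cell type.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
genes, known_types = 200, 3
ref = rng.gamma(2.0, size=(genes, known_types))       # known reference signatures
unknown = rng.gamma(2.0, size=genes)                  # an unknown cell type's profile

true_props = np.array([0.5, 0.2, 0.1])                # known types sum to 0.8
bulk = ref @ true_props + 0.2 * unknown               # mixed "tumor" sample

est, _ = nnls(ref, bulk)
print("estimated proportions of known types:", est.round(2))
# Without modeling the unknown component, these estimates absorb part of its
# signal: the over-estimation problem the abstract describes.
residual = bulk - ref @ est                           # what an NMF step would refine
```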
26. LOTUS: A single- and multitask machine learning algorithm for the prediction of cancer driver genes.
- Author
- Collier O, Stoven V, and Vert JP
- Subjects
- Humans, Models, Statistical, Algorithms, Computational Biology methods, Machine Learning, Neoplasms genetics, Oncogenes genetics, Software
- Abstract
Cancer driver genes, i.e., oncogenes and tumor suppressor genes, are involved in the acquisition of important functions in tumors, providing a selective growth advantage, allowing uncontrolled proliferation and avoiding apoptosis. It is therefore important to identify these driver genes, both for the fundamental understanding of cancer and to help find new therapeutic targets or biomarkers. Although the most frequently mutated driver genes have been identified, it is believed that many more remain to be discovered, particularly for driver genes specific to some cancer types. In this paper, we propose a new computational method called LOTUS to predict new driver genes. LOTUS is a machine-learning-based approach that can integrate various types of data in a versatile manner, including information about gene mutations and protein-protein interactions. In addition, LOTUS can predict cancer driver genes in a pan-cancer setting as well as for specific cancer types, using a multitask learning strategy to share information across cancer types. We empirically show that LOTUS outperforms five other state-of-the-art driver gene prediction methods, both in terms of intrinsic consistency and prediction accuracy, and provide predictions of new cancer genes across many cancer types., Competing Interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: JPV received salary and stocks from Google LLC.
- Published
- 2019
- Full Text
- View/download PDF
27. Likelihood-free nested sampling for parameter inference of biochemical reaction networks.
- Author
- Mikelson, Jan and Khammash, Mustafa
- Subjects
- STOCHASTIC systems, SYSTEMS biology, BIOCHEMICAL models, WEBSITES, ALGORITHMS
- Abstract
The development of mechanistic models of biological systems is a central part of Systems Biology. One major challenge in developing these models is the accurate inference of model parameters. In recent years, nested sampling methods have gained increased attention in the Systems Biology community because they are parallelizable and provide error estimates with no additional computations. One drawback that severely limits the usability of these methods, however, is that they require the likelihood function to be available, and thus cannot be applied to systems with intractable likelihoods, such as stochastic models. Here we present a likelihood-free nested sampling method for parameter inference which overcomes these drawbacks. This method gives an unbiased estimator of the Bayesian evidence as well as samples from the posterior. We derive a lower bound on the estimator's variance which we use to formulate a novel termination criterion for nested sampling. The presented method enables not only the reliable inference of the posterior of parameters for stochastic systems of a size and complexity that is challenging for traditional methods, but it also provides an estimate of the obtained variance. We illustrate our approach by applying it to several realistically sized models with simulated data as well as recently published biological data. We also compare our developed method with the two other most popular likelihood-free approaches: pMCMC and ABC-SMC. The C++ code of the proposed methods, together with test data, is available at the github web page https://github.com/Mijan/LFNS_paper. Author summary: The behaviour of mathematical models of biochemical reactions is governed by model parameters encoding for various reaction rates, molecule concentrations and other biochemical quantities. As the general purpose of these models is to reproduce and predict the true biological response to different stimuli, the inference of these parameters, given experimental observations, is a crucial part of Systems Biology. While plenty of methods have been published for the inference of model parameters, most of them require the availability of the likelihood function and thus cannot be applied to models that do not allow for the computation of the likelihood. Further, most established methods do not provide an estimate of the variance of the obtained estimator. In this paper, we present a novel inference method that accurately approximates the posterior distribution of parameters and does not require the evaluation of the likelihood function. Our method is based on the nested sampling algorithm and approximates the likelihood with a particle filter. We show that the resulting posterior estimates are unbiased and provide a way to estimate not just the posterior distribution, but also an error estimate of the final estimator. We illustrate our method on several stochastic models with simulated data as well as one model of transcription with real biological data. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
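To make the likelihood-free nested sampling idea in entry 27 concrete, here is a minimal, self-contained Python sketch. It replaces the paper's particle-filter likelihood with a crude Monte Carlo estimate and uses plain rejection sampling for the constrained-prior draws; the model, prior, and all settings are illustrative assumptions, not the authors' implementation.

```python
# Toy likelihood-free nested sampling (illustrative only).
# The likelihood of each parameter value is *estimated* by forward
# simulation, standing in for the particle filter used in the paper.
import numpy as np

rng = np.random.default_rng(0)
y_obs = 7  # a single observed count (toy data)

def likelihood_estimate(theta, n_sim=200):
    # Fraction of forward simulations reproducing the observation.
    return max(np.mean(rng.poisson(theta, size=n_sim) == y_obs), 1e-12)

def sample_prior():
    return rng.uniform(0.1, 20.0)  # uniform prior on the Poisson rate

N, n_iter = 50, 200  # live points, nested sampling iterations
live = [(t, likelihood_estimate(t)) for t in (sample_prior() for _ in range(N))]

Z, X_prev = 0.0, 1.0
for i in range(1, n_iter + 1):
    live.sort(key=lambda pt: pt[1])
    _, L_min = live[0]            # discard the worst live point
    X_i = np.exp(-i / N)          # expected prior-volume shrinkage
    Z += L_min * (X_prev - X_i)   # accumulate evidence
    X_prev = X_i
    while True:                   # rejection-sample the constrained prior
        t = sample_prior()
        L = likelihood_estimate(t)
        if L > L_min:
            live[0] = (t, L)
            break

Z += np.mean([L for _, L in live]) * X_prev  # remaining live-point mass
print("evidence estimate:", Z)
```

In the paper's setting, the unbiasedness of the evidence estimator and the variance lower bound justify a principled termination criterion; the fixed iteration count here is purely for brevity.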
28. Shape-to-graph mapping method for efficient characterization and classification of complex geometries in biological images.
- Author
-
Pilcher, William, Yang, Xingyu, Zhurikhina, Anastasia, Chernaya, Olga, Xu, Yinghan, Qiu, Peng, and Tsygankov, Denis
- Subjects
COLLECTIVE behavior ,ALGORITHMS ,DATA mining ,CELL culture ,SMALL molecules ,POLYNOMIAL chaos - Abstract
With the ever-increasing quality and quantity of imaging data in biomedical research comes the demand for computational methodologies that enable efficient and reliable automated extraction of the quantitative information contained within these images. One of the challenges in providing such methodology is the need for tailoring algorithms to the specifics of the data, limiting their areas of application. Here we present a broadly applicable approach to quantification and classification of complex shapes and patterns in biological or other multi-component formations. This approach integrates the mapping of all shape boundaries within an image onto a global information-rich graph and machine learning on the multidimensional measures of the graph. We demonstrated the power of this method by (1) extracting subtle structural differences from visually indistinguishable images in our phenotype rescue experiments using the endothelial tube formations assay, (2) training the algorithm to identify biophysical parameters underlying the formation of different multicellular networks in our simulation model of collective cell behavior, and (3) analyzing the response of U2OS cell cultures to a broad array of small molecule perturbations. Author summary: In this paper, we present a methodology that is based on mapping an arbitrary set of outlines onto a complete, strictly defined structure, in which every point representing the shape becomes a terminal point of a global graph. Because this mapping preserves the whole complexity of the shape, it allows for extracting the full scope of geometric features of any scale. Importantly, an extensive set of graph-based metrics in each image makes integration with machine learning routines highly efficient even for small datasets and provides an opportunity to backtrack the subtle morphological features responsible for the automated distinction into image classes. The resulting tool provides efficient, versatile, and robust quantification of complex shapes and patterns in experimental images. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
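Entry 28's core step, mapping shape boundaries onto a graph whose multidimensional measures feed machine learning, can be caricatured in a few lines of Python. The 8-neighbour pixel graph and the four features below are stand-ins for the paper's far richer construction.

```python
# Minimal sketch: map shape boundaries in a binary image to a graph and
# extract graph-level features for downstream machine learning.
import numpy as np
import networkx as nx

def boundary_graph(mask):
    """Nodes = boundary pixels; edges connect 8-neighbouring boundary pixels."""
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = mask & ~interior
    G = nx.Graph()
    pts = list(zip(*np.nonzero(boundary)))
    G.add_nodes_from(pts)
    for (r, c) in pts:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr or dc) and (r + dr, c + dc) in G:
                    G.add_edge((r, c), (r + dr, c + dc))
    return G

def graph_features(G):
    degs = [d for _, d in G.degree()]
    return np.array([G.number_of_nodes(), G.number_of_edges(),
                     np.mean(degs), nx.number_connected_components(G)])

mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True  # a toy square "cell"
print(graph_features(boundary_graph(mask)))
```

In the paper, the graph spans all outlines in the image at once and the feature set is extensive, which is what makes the machine-learning step effective on small datasets.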
29. Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression.
- Author
-
Wiedenhoeft, John, Brugel, Eric, and Schliep, Alexander
- Subjects
MARKOV processes ,WAVELETS (Mathematics) ,FORWARD-backward algorithm ,CHROMOSOME fragments ,BAYESIAN analysis - Abstract
By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at (DOI: ). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
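The speed of entry 29's method comes from compressing runs of observations that likely share a copy number into blocks. Below is a hedged sketch of that compression step; the recursion is a simplification of a proper Haar transform, and the HMM machinery is omitted entirely.

```python
# Sketch: split a signal into blocks whose Haar detail coefficients fall
# below a threshold; each block is summarized by (mean, length).
import numpy as np

def haar_blocks(y, threshold):
    def split(lo, hi):
        if hi - lo == 1:
            return [(lo, hi)]
        mid = (lo + hi) // 2
        left, right = y[lo:mid], y[mid:hi]
        # Haar detail coefficient of this span (up to normalization)
        detail = abs(left.mean() - right.mean()) * np.sqrt((hi - lo) / 4)
        if detail < threshold:
            return [(lo, hi)]  # treat the span as one constant block
        return split(lo, mid) + split(mid, hi)
    return [(y[lo:hi].mean(), hi - lo) for lo, hi in split(0, len(y))]

rng = np.random.default_rng(1)
signal = np.r_[np.zeros(64), 2 * np.ones(64)] + 0.3 * rng.standard_normal(128)
blocks = haar_blocks(signal, threshold=3.0)
print(len(blocks), "blocks instead of", len(signal), "observations")
```

Each (mean, length) block would then enter the forward-backward recursion as a single observation whose emission probability is raised to the block length; recomputing blocks adaptively as sampling proceeds is where the paper's dynamic speed-up comes from.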
30. Hyperdimensional computing: A fast, robust, and interpretable paradigm for biological data.
- Author
-
Stock, Michiel, Van Criekinge, Wim, Boeckaerts, Dimitri, Taelman, Steff, Van Haeverbeke, Maxime, Dewulf, Pieter, and De Baets, Bernard
- Subjects
DEEP learning ,BIOINFORMATICS ,ALGORITHMS ,PHYLOGENY - Abstract
Advances in bioinformatics are primarily due to new algorithms for processing diverse biological data sources. While sophisticated alignment algorithms have been pivotal in analyzing biological sequences, deep learning has substantially transformed bioinformatics, addressing sequence, structure, and functional analyses. However, these methods are incredibly data-hungry, compute-intensive, and hard to interpret. Hyperdimensional computing (HDC) has recently emerged as an exciting alternative. The key idea is that random vectors of high dimensionality can represent concepts such as sequence identity or phylogeny. These vectors can then be combined using simple operators for learning, reasoning, or querying by exploiting the peculiar properties of high-dimensional spaces. Our work reviews and explores HDC's potential for bioinformatics, emphasizing its efficiency, interpretability, and adeptness in handling multimodal and structured data. HDC holds great potential for various omics data searching, biosignal analysis, and health applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
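A toy illustration of entry 30's central idea: high-dimensional random vectors plus two cheap operators (elementwise multiplication for binding, addition for bundling) suffice to encode and compare sequences. The dimensionality, k-mer size, and shift-based position encoding are arbitrary choices for this sketch.

```python
# Hyperdimensional computing sketch: encode DNA sequences as bundled,
# position-bound k-mer hypervectors and compare them by cosine similarity.
import numpy as np

D, K = 10_000, 3
rng = np.random.default_rng(0)
base = {c: rng.choice([-1, 1], size=D) for c in "ACGT"}  # random bipolar codes

def encode(seq):
    hv = np.zeros(D)
    for i in range(len(seq) - K + 1):
        kmer = np.ones(D, dtype=int)
        for j, c in enumerate(seq[i:i + K]):
            kmer *= np.roll(base[c], j)  # bind each letter with its position
        hv += kmer                       # bundle k-mers into one vector
    return hv

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = encode("ACGTACGTAC"), encode("ACGTACGTAA")
c = encode("TTTTGGGGCC")
print(cos(a, b), cos(a, c))  # similar sequences score markedly higher
```

The peculiar property being exploited is that random vectors in very high dimension are nearly orthogonal, so sums of bound k-mers remain separable and queryable.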
31. Accelerating joint species distribution modelling with Hmsc-HPC by GPU porting.
- Author
-
Rahman, Anis Ur, Tikhonov, Gleb, Oksanen, Jari, Rossi, Tuomas, and Ovaskainen, Otso
- Subjects
STATISTICAL learning ,SPECIES distribution ,USER interfaces ,MACHINE learning ,ALGORITHMS - Abstract
Joint species distribution modelling (JSDM) is a widely used statistical method that analyzes combined patterns of all species in a community, linking empirical data to ecological theory and enhancing community-wide prediction tasks. However, fitting JSDMs to large datasets is often computationally demanding and time-consuming. Recent studies have introduced new statistical and machine learning techniques to provide more scalable fitting algorithms, but extending these to complex JSDM structures that account for spatial dependencies or multi-level sampling designs remains challenging. In this study, we aim to enhance JSDM scalability by leveraging high-performance computing (HPC) resources for an existing fitting method. Our work focuses on the Hmsc R-package, a widely used JSDM framework that supports the integration of various dataset types into a single comprehensive model. We developed a GPU-compatible implementation of its model-fitting algorithm using Python and the TensorFlow library. Despite these changes, our enhanced framework retains the original user interface of the Hmsc R-package. We evaluated the performance of the proposed implementation across various model configurations and dataset sizes. Our results show a significant increase in model fitting speed for most models compared to the baseline Hmsc R-package. For the largest datasets, we achieved speed-ups of over 1000 times, demonstrating the substantial potential of GPU porting for previously CPU-bound JSDM software. This advancement opens promising opportunities for better utilizing the rapidly accumulating new biodiversity data resources for inference and prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Optimizing spatial allocation of seasonal influenza vaccine under temporal constraints.
- Author
-
Venkatramanan, Srinivasan, Chen, Jiangzhuo, Fadikar, Arindam, Gupta, Sandeep, Higdon, Dave, Lewis, Bryan, Marathe, Madhav, Mortveit, Henning, and Vullikanti, Anil
- Subjects
SEASONAL influenza ,INFLUENZA vaccines ,FLU vaccine efficacy ,HEALTH policy - Abstract
Prophylactic interventions such as vaccine allocation are some of the most effective public health policy planning tools. The supply of vaccines, however, is limited and an important challenge is to optimally allocate the vaccines to minimize epidemic impact. This resource allocation question (which we refer to as VID) has multiple dimensions: when, where, to whom, etc. Most of the existing literature in this topic deals with the latter (to whom), proposing policies that prioritize individuals by age and disease risk. However, since seasonal influenza spread has a typical spatial trend, and due to the temporal constraints enforced by the availability schedule, the when and where problems become equally, if not more, relevant. In this paper, we study the VID problem in the context of seasonal influenza spread in the United States. We develop a national scale metapopulation model for influenza that integrates both short and long distance human mobility, along with realistic data on vaccine uptake. We also design GA, a greedy algorithm for allocating the vaccine supply at the state level under temporal constraints, and show that such a strategy improves over the current baseline of pro-rata allocation, and the improvement is more pronounced for higher vaccine efficacy and moderate flu season intensity. Further, the resulting strategy resembles ring vaccination applied spatially across the US. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
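Entry 32's greedy algorithm GA can be caricatured as follows: at each release of supply, hand the next batch of doses to the state with the best estimated marginal benefit. The diminishing-returns function below is a hypothetical stand-in for querying the authors' metapopulation model, and the state list and supply schedule are toy values.

```python
# Illustrative greedy allocation under a weekly supply schedule.
import numpy as np

rng = np.random.default_rng(0)
states = [f"state_{i}" for i in range(5)]
pop = rng.integers(1, 40, size=5) * 1e6
alloc = {s: 0.0 for s in states}

def marginal_benefit(state_idx, doses, current):
    """Toy diminishing-returns curve; a real implementation would query
    the epidemic simulation for averted infections."""
    covered = current / pop[state_idx]
    return doses * (1 - covered) ** 2

weekly_supply = [2e6, 2e6, 1e6, 1e6]  # the temporal constraint
for supply in weekly_supply:
    batch = supply / 4
    for _ in range(4):  # allocate each week's supply in quarter-batches
        gains = [marginal_benefit(i, batch, alloc[s])
                 for i, s in enumerate(states)]
        alloc[states[int(np.argmax(gains))]] += batch
print(alloc)
```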
33. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fMRI.
- Author
-
Koppe, Georgia, Toutounji, Hazem, Kirsch, Peter, Lis, Stefanie, and Durstewitz, Daniel
- Subjects
RECURRENT neural networks ,NONLINEAR dynamical systems ,LINEAR dynamical systems ,FUNCTIONAL magnetic resonance imaging ,DYNAMICAL systems - Abstract
A major tenet in theoretical neuroscience is that cognitive and behavioral processes are ultimately implemented in terms of the neural system dynamics. Accordingly, a major aim for the analysis of neurophysiological measurements should lie in the identification of the computational dynamics underlying task processing. Here we advance a state space model (SSM) based on generative piecewise-linear recurrent neural networks (PLRNN) to assess dynamics from neuroimaging data. In contrast to many other nonlinear time series models which have been proposed for reconstructing latent dynamics, our model is easily interpretable in neural terms, amenable to systematic dynamical systems analysis of the resulting set of equations, and can straightforwardly be transformed into an equivalent continuous-time dynamical system. The major contributions of this paper are the introduction of a new observation model suitable for functional magnetic resonance imaging (fMRI) coupled to the latent PLRNN, an efficient stepwise training procedure that forces the latent model to capture the ‘true’ underlying dynamics rather than just fitting (or predicting) the observations, and an empirical measure based on the Kullback-Leibler divergence to evaluate from empirical time series how well this goal of approximating the underlying dynamics has been achieved. We validate and illustrate the power of our approach on simulated ‘ground-truth’ dynamical systems as well as on experimental fMRI time series, and demonstrate that the learnt dynamics harbors task-related nonlinear structure that a linear dynamical model fails to capture. Given that fMRI is one of the most common techniques for measuring brain activity non-invasively in human subjects, this approach may provide a novel step toward analyzing aberrant (nonlinear) dynamics for clinical assessment or neuroscientific research. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
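The generative backbone of entry 33 is easy to write down: a latent piecewise-linear recurrence driving an observation model. The sketch below uses a plain linear-Gaussian observation; the paper's actual contribution includes an fMRI-specific observation model (hemodynamic convolution) and a stepwise training procedure, both omitted here, and all dimensions and scales are arbitrary.

```python
# Forward simulation of a piecewise-linear RNN (PLRNN) latent process.
import numpy as np

rng = np.random.default_rng(0)
dz, dx, T = 3, 5, 200
A = 0.9 * np.eye(dz)                     # diagonal linear dynamics
W = 0.1 * rng.standard_normal((dz, dz))  # piecewise-linear coupling
np.fill_diagonal(W, 0.0)
h = 0.1 * rng.standard_normal(dz)        # bias term
B = rng.standard_normal((dx, dz))        # observation matrix

z = np.zeros((T, dz))
x = np.zeros((T, dx))
for t in range(1, T):
    relu = np.maximum(z[t - 1], 0.0)     # max(0, z): the piecewise linearity
    z[t] = A @ z[t - 1] + W @ relu + h + 0.01 * rng.standard_normal(dz)
    x[t] = B @ z[t] + 0.1 * rng.standard_normal(dx)
print("simulated observations:", x.shape)
```

The appeal of this form is that the fitted (A, W, h) define a dynamical system whose fixed points and stability can be analysed directly, which is what makes the model interpretable in neural terms.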
34. Transient crosslinking kinetics optimize gene cluster interactions.
- Author
-
Walker, Benjamin, Taylor, Dane, Lawrimore, Josh, Hult, Caitlin, Adalsteinsson, David, Bloom, Kerry, and Forest, M. Gregory
- Subjects
GENE clusters ,CHROMOSOME structure ,COMPUTATIONAL biology ,RIBOSOMAL DNA - Abstract
Our understanding of how chromosomes structurally organize and dynamically interact has been revolutionized through the lens of long-chain polymer physics. Major protein contributors to chromosome structure and dynamics are condensin and cohesin that stochastically generate loops within and between chains, and entrap proximal strands of sister chromatids. In this paper, we explore the ability of transient, protein-mediated, gene-gene crosslinks to induce clusters of genes, and thereby dynamic architecture, within the highly repeated ribosomal DNA that comprises the nucleolus of budding yeast. We implement three approaches: live cell microscopy; computational modeling of the full genome during G1 in budding yeast, exploring four decades of timescales for transient crosslinks between 5kbp domains (genes) in the nucleolus on Chromosome XII; and temporal network models with automated community (cluster) detection algorithms applied to the full range of 4D modeling datasets. The data analysis tools detect and track gene clusters, their size, number, persistence time, and their plasticity (deformation). Of biological significance, our analysis reveals an optimal mean crosslink lifetime that promotes pairwise and cluster gene interactions through “flexible” clustering. In this state, large gene clusters self-assemble yet frequently interact (merge and separate), marked by gene exchanges between clusters, which in turn maximizes global gene interactions in the nucleolus. This regime stands between two limiting cases each with far less global gene interactions: with shorter crosslink lifetimes, “rigid” clustering emerges with clusters that interact infrequently; with longer crosslink lifetimes, there is a dissolution of clusters. These observations are compared with imaging experiments on a normal yeast strain and two condensin-modified mutant cell strains. We apply the same image analysis pipeline to the experimental and simulated datasets, providing support for the modeling predictions. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
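Entry 34's analysis pipeline, detecting and tracking gene clusters in time-resolved contact networks, can be sketched with off-the-shelf community detection. Greedy modularity on independent snapshots stands in for the paper's temporal community detection, and the contact model below is an invented toy.

```python
# Per-window community detection on toy gene-gene contact networks.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(0)
n_genes, n_windows = 40, 5
for w in range(n_windows):
    G = nx.Graph()
    G.add_nodes_from(range(n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            # transient crosslinks: nearby genes contact more often (toy model)
            if rng.random() < (0.3 if abs(i - j) < 5 else 0.02):
                G.add_edge(i, j)
    communities = greedy_modularity_communities(G)
    sizes = sorted((len(c) for c in communities), reverse=True)
    print(f"window {w}: {len(communities)} clusters, sizes {sizes[:5]}")
```

Tracking cluster identity across windows (to measure persistence, merging, and gene exchange) is the genuinely temporal part of the paper's method and is not attempted here.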
35. Chemical features mining provides new descriptive structure-odor relationships.
- Author
-
Licon, Carmen C., Bosc, Guillaume, Sabri, Mohammed, Mantel, Marylou, Fournel, Arnaud, Bushdid, Caroline, Golebiowski, Jerome, Robardet, Celine, Plantevit, Marc, Kaytoue, Mehdi, and Bensafi, Moustafa
- Subjects
ODORS ,COLOR vision ,PREDICTION models ,BIOLOGY ,ALGORITHMS - Abstract
An important goal in researching the biology of olfaction is to link the perception of smells to the chemistry of odorants. In other words, why do some odorants smell like fruits and others like flowers? While the so-called stimulus-percept issue was resolved in the field of color vision some time ago, the relationship between the chemistry and psycho-biology of odors remains unclear up to the present day. Although a series of investigations have demonstrated that this relationship exists, the descriptive and explicative aspects of the proposed models that are currently in use require greater sophistication. One reason for this is that the algorithms of current models do not consistently consider the possibility that multiple chemical rules can describe a single quality, even though this is the case in reality: two very different molecules can evoke a similar odor. Moreover, the available datasets are often large and heterogeneous, thus rendering the generation of multiple rules without the use of a computational approach overly complex. We considered these two issues in the present paper. First, we built a new database containing 1689 odorants characterized by physicochemical properties and olfactory qualities. Second, we developed a computational method based on a subgroup discovery algorithm that discriminated perceptual qualities of smells on the basis of physicochemical properties. Third, we ran a series of experiments on 74 distinct olfactory qualities and showed that the generation and validation of rules linking chemistry to odor perception was possible. Taken together, our findings provide significant new insights into the relationship between stimulus and percept in olfaction. In addition, by automatically extracting new knowledge linking chemistry of odorants and psychology of smells, our results provide a new computational framework of analysis enabling scientists in the field to test original hypotheses using descriptive or predictive modeling. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
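The subgroup discovery step in entry 35 can be illustrated with the simplest possible rule search: score single-feature threshold rules for one odor quality by weighted relative accuracy (WRAcc). Real systems search conjunctions of such conditions with beam search; the data and labels here are synthetic.

```python
# Exhaustive single-condition subgroup discovery scored by WRAcc.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))      # toy physicochemical descriptors
y = (X[:, 0] > 0.5).astype(int)        # toy labels for one quality ("fruity")
base_rate = y.mean()

best = None
for f in range(X.shape[1]):
    for thr in np.quantile(X[:, f], np.linspace(0.1, 0.9, 17)):
        cover = X[:, f] > thr          # molecules covered by the rule
        if cover.sum() == 0:
            continue
        # WRAcc = coverage * (precision gain over the base rate)
        wracc = cover.mean() * (y[cover].mean() - base_rate)
        if best is None or wracc > best[0]:
            best = (wracc, f, thr)
print("best rule: feature %d > %.2f (WRAcc %.3f)" % (best[1], best[2], best[0]))
```

Because WRAcc trades off coverage against precision gain, keeping the top-k rules rather than a single best one naturally yields the multiple descriptive rules per quality that the paper argues for.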
36. Bayesian adaptive dual control of deep brain stimulation in a computational model of Parkinson’s disease.
- Author
-
Grado, Logan L., Johnson, Matthew D., and Netoff, Theoden I.
- Subjects
BAYESIAN analysis ,PROBABILITY theory ,BRAIN stimulation ,KINDLING (Neurology) ,TRANSCRANIAL magnetic stimulation - Abstract
In this paper, we present a novel Bayesian adaptive dual controller (ADC) for autonomously programming deep brain stimulation devices. We evaluated the Bayesian ADC’s performance in the context of reducing beta power in a computational model of Parkinson’s disease, in which it was tasked with finding the set of stimulation parameters which optimally reduced beta power as fast as possible. Here, the Bayesian ADC has dual goals: (a) to minimize beta power by exploiting the best parameters found so far, and (b) to explore the space to find better parameters, thus allowing for better control in the future. The Bayesian ADC is composed of two parts: an inner parameterized feedback stimulator and an outer parameter adjustment loop. The inner loop operates on a short time scale, delivering stimulus based upon the phase and power of the beta oscillation. The outer loop operates on a long time scale, observing the effects of the stimulation parameters and using Bayesian optimization to intelligently select new parameters to minimize the beta power. We show that the Bayesian ADC can efficiently optimize stimulation parameters, and is superior to other optimization algorithms. The Bayesian ADC provides a robust and general framework for tuning stimulation parameters, can be adapted to use any feedback signal, and is applicable across diseases and stimulator designs. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
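Entry 36's outer parameter-adjustment loop is a textbook Bayesian optimization scheme, sketched below with a Gaussian-process surrogate and expected improvement. The function beta_power() is a hypothetical stand-in for the measured beta-band power, and the 1-D parameter grid replaces the real multidimensional stimulation space.

```python
# Bayesian optimization of a stimulation parameter to minimize beta power.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def beta_power(param):
    """Stand-in for the beta power measured under a stimulation setting."""
    return (param - 0.3) ** 2 + 0.05 * np.random.default_rng().standard_normal()

X, y = [[0.0], [1.0]], [beta_power(0.0), beta_power(1.0)]
grid = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(15):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    imp = min(y) - mu                            # improvement (minimization)
    z = imp / np.maximum(sd, 1e-9)
    ei = imp * norm.cdf(z) + sd * norm.pdf(z)    # expected improvement
    x_next = float(grid[np.argmax(ei)])          # explore/exploit trade-off
    X.append([x_next])
    y.append(beta_power(x_next))

print("best parameter found:", X[int(np.argmin(y))])
```

The expected-improvement acquisition is what gives the controller its "dual" character: it exploits the best known setting while still allocating trials to uncertain regions of parameter space.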
37. Efficient pedigree recording for fast population genetics simulation.
- Author
-
Kelleher, Jerome, Thornton, Kevin R., Ashander, Jaime, and Ralph, Peter L.
- Subjects
POPULATION genetics ,EUKARYOTES ,PHYLOGENY ,GENOTYPES ,ALGORITHMS - Abstract
In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly ‘simplify’ a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of N individuals with M polymorphic sites can be stored in O(N log N + M) space, making it feasible to store a simulation’s entire final generation as well as its history. (3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history. The software for recording and processing tree sequences is named tskit. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
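Entry 37's record-then-simplify loop can be tried directly with the tskit package the paper names. The haploid toy below (one whole-genome parent per birth, no mutation) is an assumption for brevity; the TableCollection calls are intended to match the real API, but treat the sketch as illustrative rather than the authors' pipeline.

```python
# Record births as nodes/edges, then periodically simplify the tables
# with respect to the current generation to discard irrelevant history.
import random
import tskit

N, GENS, SIMPLIFY_EVERY = 20, 50, 10
tables = tskit.TableCollection(sequence_length=1.0)
current = [tables.nodes.add_row(time=GENS) for _ in range(N)]  # founders

for g in range(1, GENS + 1):
    nxt = []
    for _ in range(N):
        child = tables.nodes.add_row(time=GENS - g)  # tskit time = time ago
        parent = random.choice(current)
        tables.edges.add_row(left=0, right=1.0, parent=parent, child=child)
        nxt.append(child)
    current = nxt
    if g % SIMPLIFY_EVERY == 0 or g == GENS:
        tables.sort()
        id_map = tables.simplify(samples=current)
        current = [id_map[u] for u in current]       # node ids are remapped

ts = tables.tree_sequence()
print(ts.num_nodes, "nodes retained for", N, "samples")
```

Periodic simplification is the key trick: raw edge storage grows linearly with time, but the history relevant to the extant generation stays small, which is where the one-to-two orders of magnitude speed-ups come from.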
38. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.
- Author
-
Gabasova E, Reid J, and Wernisch L
- Subjects
- Breast Neoplasms genetics, Breast Neoplasms metabolism, Gene Expression Profiling, Humans, Survival Analysis, Algorithms, Cluster Analysis, Computational Biology methods
- Abstract
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation, etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and that most of the data samples follow this structure. However, in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
- Published
- 2017
- Full Text
- View/download PDF
39. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
- Author
-
Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, and Sun Y
- Subjects
- Computational Biology, Databases, Genetic, Humans, Microbiota genetics, RNA, Ribosomal, 16S genetics, Algorithms, Cluster Analysis, Sequence Alignment methods, Sequence Analysis, RNA methods
- Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
- Published
- 2017
- Full Text
- View/download PDF
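Two ingredients of entry 39, fast nearest-neighbour search through a spatial tree and merging many mutually-nearest pairs per round, can be sketched on k-mer count vectors. The vectors and KD-tree are stand-ins for the paper's pseudo-metric partitioning tree over sequences, and the centroid update is a simplification of its merging criterion.

```python
# Round-based agglomerative clustering via mutual nearest neighbours.
import numpy as np
from itertools import product
from scipy.spatial import cKDTree

def kmer_vector(seq, k=2):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return np.array([seq.count(m) for m in kmers], dtype=float)

seqs = ["ACGTACGT", "ACGTACGA", "TTGGTTGG", "TTGGTTGC", "CCCCAAAA"]
clusters = [[i] for i in range(len(seqs))]
vecs = np.array([kmer_vector(s) for s in seqs])

while len(clusters) > 2:
    tree = cKDTree(vecs)                      # sub-quadratic NN search
    _, nn = tree.query(vecs, k=2)
    nn = nn[:, 1]                             # nearest *other* point
    merged, keep_v, keep_c = set(), [], []
    for i in range(len(vecs)):
        j = nn[i]
        if nn[j] == i and i < j and not {i, j} & merged:  # mutual NN pair
            keep_v.append((vecs[i] + vecs[j]) / 2)        # merged centroid
            keep_c.append(clusters[i] + clusters[j])
            merged |= {i, j}
    for i in range(len(vecs)):
        if i not in merged:
            keep_v.append(vecs[i])
            keep_c.append(clusters[i])
    vecs, clusters = np.array(keep_v), keep_c
print(clusters)
```

Because every mutual nearest-neighbour pair can be merged independently within a round, the rounds parallelize naturally across threads, which is the source of the method's scalability.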
40. Dynamic compensation, parameter identifiability, and equivariances.
- Author
-
Sontag ED
- Subjects
- Algorithms, Models, Biological, Systems Biology
- Abstract
A recent paper by Karin et al. introduced a mathematical notion called dynamical compensation (DC) of biological circuits. DC was shown to play an important role in glucose homeostasis as well as other key physiological regulatory mechanisms. Karin et al. went on to provide a sufficient condition to test whether a given system has the DC property. Here, we show how DC can be formulated in terms of a well-known concept in systems biology, statistics, and control theory: that of parameter structural non-identifiability. Viewing DC as a parameter identification problem enables one to take advantage of powerful theoretical and computational tools to test a system for DC. We obtain as a special case the sufficient criterion discussed by Karin et al. We also draw connections to system equivalence and to the fold-change detection property.
- Published
- 2017
- Full Text
- View/download PDF
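Entry 40's central observation can be stated in one formula. In the hedged notation below (assumed for the sketch, not the paper's own), dynamical compensation with respect to a parameter is exactly the statement that the parameter is structurally non-identifiable from input-output data.

```latex
% System with state x, input u, output y, and parameter p:
\[
  \dot{x} = f(x, u, p), \qquad y = h(x, p).
\]
% Dynamical compensation with respect to $p$ requires that, for every
% admissible input $u$ and all parameter values $p_1, p_2$ (with suitably
% matched initial conditions),
\[
  y(t; p_1, u) = y(t; p_2, u) \qquad \forall\, t \ge 0,
\]
% which is precisely the statement that $p$ cannot be determined from
% input--output experiments, i.e.\ $p$ is structurally non-identifiable.
```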
41. A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa.
- Author
-
Zhang, Huanan, Lee, Catherine A. A., Li, Zhuliu, Garbe, John R., Eide, Cindy R., Petegrosso, Raphael, Kuang, Rui, and Tolar, Jakub
- Subjects
EPIDERMOLYSIS bullosa ,RNA sequencing ,FLOW cytometry ,BIOMARKERS ,GENE expression - Abstract
Single-cell RNA sequencing (scRNA-seq) has been widely applied to discover new cell types by detecting sub-populations in a heterogeneous group of cells. Since scRNA-seq experiments have lower read coverage/tag counts and introduce more technical biases compared to bulk RNA-seq experiments, the limited number of sampled cells combined with the experimental biases and other dataset specific variations presents a challenge to cross-dataset analysis and discovery of relevant biological variations across multiple cell populations. In this paper, we introduce a method of variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) that utilizes multiple single-cell populations from biological replicates or different samples. scVDMC clusters single cells in multiple scRNA-seq experiments of similar cell types and markers but varying expression patterns such that the scRNA-seq data are better integrated than typical pooled analyses which only increase the sample size. By controlling the variance among the cell clusters within each dataset and across all the datasets, scVDMC detects cell sub-populations in each individual experiment with shared cell-type markers but varying cluster centers among all the experiments. Applied to two real scRNA-seq datasets with several replicates and one large-scale Drop-seq dataset on three patient samples, scVDMC more accurately detected cell populations and known cell markers than pooled clustering and other recently proposed scRNA-seq clustering methods. In the case study applied to in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq data, scVDMC revealed several new cell types and unknown markers validated by flow cytometry. MATLAB/Octave code available at . [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
42. SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data.
- Author
-
Dotu, Ivan, Adamson, Scott I., Coleman, Benjamin, Fournier, Cyril, Ricart-Altimiras, Emma, Eyras, Eduardo, and Chuang, Jeffrey H.
- Subjects
IMMUNOPRECIPITATION ,RNA-binding proteins ,PROTEIN-protein interactions ,NUCLEOTIDE sequence ,RNA splicing - Abstract
RNA-protein binding is critical to gene regulation, controlling fundamental processes including splicing, translation, localization and stability, and aberrant RNA-protein interactions are known to play a role in a wide variety of diseases. However, molecular understanding of RNA-protein interactions remains limited; in particular, identification of RNA motifs that bind proteins has long been challenging, especially when such motifs depend on both sequence and structure. Moreover, although RNA binding proteins (RBPs) often contain more than one binding domain, algorithms capable of identifying more than one binding motif simultaneously have not been developed. In this paper we present a novel pipeline to determine binding peaks in crosslinking immunoprecipitation (CLIP) data, to discover multiple possible RNA sequence/structure motifs among them, and to experimentally validate such motifs. At the core is a new semi-automatic algorithm SARNAclust, the first unsupervised method to identify and deconvolve multiple sequence/structure motifs simultaneously. SARNAclust computes similarity between sequence/structure objects using a graph kernel, providing the ability to isolate the impact of specific features through the bulge graph formalism. Application of SARNAclust to synthetic data shows its capability of clustering 5 motifs at once with a V-measure value of over 0.95, while GraphClust achieves only a V-measure of 0.083 and RNAcontext cannot detect any of the motifs. When applied to existing eCLIP sets, SARNAclust finds known motifs for SLBP and HNRNPC and novel motifs for several other RBPs such as AGGF1, AKAP8L and ILF3. We demonstrate an experimental validation protocol, a targeted Bind-n-Seq-like high-throughput sequencing approach that relies on RNA inverse folding for oligo pool design, that can validate the components within the SLBP motif. Finally, we use this protocol to experimentally interrogate the SARNAclust motif predictions for protein ILF3. Our results support a newly identified partially double-stranded UUUUUGAGA motif similar to that known for the splicing factor HNRNPC. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
43. Genetic programming based models in plant tissue culture: An addendum to traditional statistical approach.
- Author
-
Mridula, Meenu R., Nair, Ashalatha S., and Kumar, K. Satheesh
- Subjects
NAPHTHALENE ,ACETIC acid ,CHARCOAL ,CALLUS (Botany) ,PLANT roots - Abstract
In this paper, we compared the efficacy of an observation-based modelling approach using a genetic algorithm with regular statistical analysis as an alternative methodology in plant research. Preliminary experimental data on in vitro rooting were used for this study, with the aim of understanding the effect of charcoal and naphthalene acetic acid (NAA) on successful rooting and of optimizing the two variables for maximum effect. Both observation-based modelling and the traditional approach identified NAA as a critical factor in rooting of the plantlets under the experimental conditions employed. Symbolic regression analysis using the software deployed here optimised the treatments studied and was successful in identifying the complex non-linear interaction among the variables, with minimal preliminary data. The presence of charcoal in the culture medium has a significant impact on root generation by reducing basal callus mass formation. Such an approach is advantageous for establishing in vitro culture protocols, as these models have significant potential for saving time and expenditure in plant tissue culture laboratories, and it further reduces the need for a specialised background. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
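Entry 43's symbolic regression step can be reproduced in spirit with gplearn, one common genetic-programming toolkit (not necessarily the authors' software). The two inputs below stand in for NAA and charcoal concentrations, and the response surface is invented for the sketch.

```python
# Genetic-programming symbolic regression on a toy two-factor response.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(60, 2))  # columns: NAA, charcoal (toy units)
# hypothetical non-linear interaction between the two factors
y = 3 * X[:, 0] * (1 - 0.4 * X[:, 1]) + 0.1 * rng.standard_normal(60)

est = SymbolicRegressor(population_size=500, generations=20,
                        function_set=('add', 'sub', 'mul', 'div'),
                        parsimony_coefficient=0.01, random_state=0)
est.fit(X, y)
print(est._program)  # the evolved symbolic expression
```

The parsimony coefficient penalizes bloated expressions, which is what lets the method surface compact interaction terms from small preliminary datasets.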
44. Bayesian inference of phylogenetic networks from bi-allelic genetic markers.
- Author
-
Zhu, Jiafan, Wen, Dingqiao, Yu, Yun, Meudt, Heidi M., and Nakhleh, Luay
- Subjects
BAYESIAN analysis ,PHYLOGENY ,INFERENTIAL statistics ,GENETIC markers in plants ,PLANTAGINACEAE - Abstract
Phylogenetic networks are rooted, directed, acyclic graphs that model reticulate evolutionary histories. Recently, statistical methods were devised for inferring such networks from either gene tree estimates or the sequence alignments of multiple unlinked loci. Bi-allelic markers, most notably single nucleotide polymorphisms (SNPs) and amplified fragment length polymorphisms (AFLPs), provide a powerful source of genome-wide data. In a recent paper, a method called SNAPP was introduced for statistical inference of species trees from unlinked bi-allelic markers. The generative process assumed by the method combined both a model of evolution for the bi-allelic markers, as well as the multispecies coalescent. A novel component of the method was a polynomial-time algorithm for exact computation of the likelihood of a fixed species tree via integration over all possible gene trees for a given marker. Here we report on a method for Bayesian inference of phylogenetic networks from bi-allelic markers. Our method significantly extends the algorithm for exact computation of phylogenetic network likelihood via integration over all possible gene trees. Unlike the case of species trees, the algorithm is no longer polynomial-time on all instances of phylogenetic networks. Furthermore, the method utilizes a reversible-jump MCMC technique to sample the posterior of phylogenetic networks given bi-allelic marker data. Our method performs very well in terms of accuracy and robustness, as we demonstrate on simulated data, as well as a data set of multiple New Zealand species of the plant genus Ourisia (Plantaginaceae). We implemented the method in the publicly available, open-source PhyloNet software package. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
45. Generative Embedding for Model-Based Classification of fMRI Data.
- Author
-
Brodersen, Kay H., Schofield, Thomas M., Leff, Alexander P., Cheng Soon Ong, Lomakina, Ekaterina I., Buhmann, Joachim M., and Stephan, Klaas E.
- Subjects
ALGORITHMS ,BRAIN physiology ,MAGNETIC resonance imaging ,SUPPORT vector machines ,DIAGNOSTIC imaging ,DISEASE research - Abstract
Decoding models, such as those underlying multivariate classification algorithms, have been increasingly used to infer cognitive or clinical brain states from measures of brain activity obtained by functional magnetic resonance imaging (fMRI). The practicality of current classifiers, however, is restricted by two major challenges. First, due to the high data dimensionality and low sample size, algorithms struggle to separate informative from uninformative features, resulting in poor generalization performance. Second, popular discriminative methods such as support vector machines (SVMs) rarely afford mechanistic interpretability. In this paper, we address these issues by proposing a novel generative-embedding approach that incorporates neurobiologically interpretable generative models into discriminative classifiers. Our approach extends previous work on trial-by-trial classification for electrophysiological recordings to subject-by-subject classification for fMRI and offers two key advantages over conventional methods: it may provide more accurate predictions by exploiting discriminative information encoded in 'hidden' physiological quantities such as synaptic connection strengths; and it affords mechanistic interpretability of clinical classifications. Here, we introduce generative embedding for fMRI using a combination of dynamic causal models (DCMs) and SVMs. We propose a general procedure of DCM-based generative embedding for subject-wise classification, provide a concrete implementation, and suggest good-practice guidelines for unbiased application of generative embedding in the context of fMRI. We illustrate the utility of our approach by a clinical example in which we classify moderately aphasic patients and healthy controls using a DCM of thalamo-temporal regions during speech processing. Generative embedding achieves a near-perfect balanced classification accuracy of 98% and significantly outperforms conventional activation-based and correlation-based methods. This example demonstrates how disease states can be detected with very high accuracy and, at the same time, be interpreted mechanistically in terms of abnormalities in connectivity. We envisage that future applications of generative embedding may provide crucial advances in dissecting spectrum disorders into physiologically more well-defined subgroups. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
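The generative-embedding recipe of entry 45 is simple to express: invert a generative model per subject, then classify subjects on the fitted parameters. In the sketch below, fit_dcm() is a placeholder for the per-subject DCM inversion (not a real API), and the synthetic "subjects" are invented so the pipeline runs end to end.

```python
# Generative embedding: model parameters as features for an SVM.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fit_dcm(bold_timeseries):
    """Placeholder for DCM inversion: would return estimated connection
    strengths for one subject; here, a crude summary statistic."""
    return bold_timeseries.mean(axis=0)

# 40 toy subjects (100 time points x 6 regions); group shifts the signal
subjects = [rng.standard_normal((100, 6)) + (i % 2) * 0.4 for i in range(40)]
labels = np.array([i % 2 for i in range(40)])  # patients vs. controls (toy)

X = np.array([fit_dcm(s) for s in subjects])   # the generative embedding
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print(cross_val_score(clf, X, labels, cv=5).mean())
```

The point of the design is that the feature space is small and mechanistically meaningful (connection strengths), so the classifier both generalizes better and stays interpretable.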
46. Structural Identifiability of Dynamic Systems Biology Models.
- Author
-
Villaverde AF, Barreiro A, and Papachristodoulou A
- Subjects
- Animals, Computer Simulation, Humans, Algorithms, Models, Biological, Nonlinear Dynamics, Programming Languages, Software, Systems Biology methods
- Abstract
A powerful way of gaining insight into biological systems is by creating a nonlinear differential equation model, which usually contains many unknown parameters. Such a model is called structurally identifiable if it is possible to determine the values of its parameters from measurements of the model outputs. Structural identifiability is a prerequisite for parameter estimation, and should be assessed before exploiting a model. However, this analysis is seldom performed due to the high computational cost involved in the necessary symbolic calculations, which quickly becomes prohibitive as the problem size increases. In this paper we show how to analyse the structural identifiability of a very general class of nonlinear models by extending methods originally developed for studying observability. We present results about models whose identifiability had not been previously determined, report unidentifiabilities that had not been found before, and show how to modify those unidentifiable models to make them identifiable. This method helps prevent problems caused by lack of identifiability analysis, which can compromise the success of tasks such as experiment design, parameter estimation, and model-based optimization. The procedure is called STRIKE-GOLDD (STRuctural Identifiability taKen as Extended-Generalized Observability with Lie Derivatives and Decomposition), and it is implemented in a MATLAB toolbox which is available as open source software. The broad applicability of this approach facilitates the analysis of the increasingly complex models used in systems biology and other areas., Competing Interests: The authors have declared that no competing interests exist.
- Published
- 2016
- Full Text
- View/download PDF
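Entry 46's rank test can be illustrated with sympy in the spirit of STRIKE-GOLDD (which itself is a MATLAB toolbox): augment the state with the parameters, stack Lie derivatives of the output, and check the rank of the resulting Jacobian. The one-state example and constant-input treatment are simplifying assumptions.

```python
# Observability-based structural identifiability test (toy example).
import sympy as sp

x, p, u = sp.symbols("x p u")
states = sp.Matrix([x, p])          # parameter treated as a constant state
f = sp.Matrix([-p * x + u, 0])      # dynamics: x' = -p*x + u, p' = 0
h = sp.Matrix([x])                  # measured output

rows, Lh = [], h
for _ in range(len(states)):
    rows.append(Lh.jacobian(states))
    Lh = Lh.jacobian(states) * f    # next Lie derivative along f
O = sp.Matrix.vstack(*rows)         # observability-identifiability matrix
print(O.rank() == len(states))      # full rank => x and p identifiable
```

Here O = [[1, 0], [-p, -x]] has generic rank 2, so the decay rate p is structurally identifiable from measurements of x; the symbolic Jacobians are also what make the computation expensive as models grow, motivating the paper's decomposition strategy.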
47. Point process analysis of noise in early invertebrate vision.
- Author
-
Parag, Kris V. and Vinnicombe, Glenn
- Subjects
NOISE ,INVERTEBRATES ,PHOTONS ,LIGHT intensity ,G proteins - Abstract
Noise is a prevalent and sometimes even dominant aspect of many biological processes. While many natural systems have adapted to attenuate or even usefully integrate noise, the variability it introduces often still delimits the achievable precision across biological functions. This is particularly so for visual phototransduction, the process responsible for converting photons of light into usable electrical signals (quantum bumps). Here, randomness of both the photon inputs (regarded as extrinsic noise) and the conversion process (intrinsic noise) are seen as two distinct, independent and significant limitations on visual reliability. Past research has attempted to quantify the relative effects of these noise sources by using approximate methods that do not fully account for the discrete, point process and time ordered nature of the problem. As a result the conclusions drawn from these different approaches have led to inconsistent expositions of phototransduction noise performance. This paper provides a fresh and complete analysis of the relative impact of intrinsic and extrinsic noise in invertebrate phototransduction using minimum mean squared error reconstruction techniques based on Bayesian point process (Snyder) filters. An integrate-fire based algorithm is developed to reliably estimate photon times from quantum bumps and Snyder filters are then used to causally estimate random light intensities both at the front and back end of the phototransduction cascade. Comparison of these estimates reveals that the dominant noise source transitions from extrinsic to intrinsic as light intensity increases. By extending the filtering techniques to account for delays, it is further found that among the intrinsic noise components, which include bump latency (mean delay and jitter) and shape (amplitude and width) variance, it is the mean delay that is critical to noise performance. As the timeliness of visual information is important for real-time action, this delay could potentially limit the speed at which invertebrates can respond to stimuli. Consequently, if one wants to increase visual fidelity, reducing the photoconversion lag is much more important than improving the regularity of the electrical signal. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
48. PCSF: An R-package for network-based interpretation of high-throughput data.
- Author
-
Akhmedov, Murodzhon, Kedaigle, Amanda, Chong, Renan Escalante, Montemanni, Roberto, Bertoni, Francesco, Fraenkel, Ernest, and Kwee, Ivo
- Subjects
BIOINFORMATICS software ,DATA analysis software ,MATHEMATICAL optimization ,COMPUTATIONAL biology ,PROTEIN-protein interactions - Abstract
With recent technological developments, a vast amount of high-throughput data has been profiled to understand the mechanism of complex diseases. The current bioinformatics challenge is to interpret the data and underlying biology, where efficient algorithms for analyzing heterogeneous high-throughput data using biological networks are becoming increasingly valuable. In this paper, we propose a software package based on the Prize-collecting Steiner Forest graph optimization approach. The PCSF package performs fast and user-friendly network analysis of high-throughput data by mapping the data onto biological networks such as protein-protein interaction, gene-gene interaction, or other correlation- or coexpression-based networks. Using the interaction networks as a template, it determines high-confidence subnetworks relevant to the data, which potentially leads to predictions of functional units. It also interactively visualizes the resulting subnetwork with functional enrichment analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
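The optimization at the heart of entry 48 can be approximated with the plain Steiner-tree routine shipped in networkx (the PCSF package itself solves the richer prize-collecting variant in R). In the sketch, high-prize hits are simply treated as terminals, and the toy interactome and its edge costs are invented.

```python
# Connect high-scoring "hit" proteins through a toy interactome.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
edges = [("A", "B", 0.2), ("B", "C", 0.3), ("C", "D", 0.1),
         ("A", "E", 0.9), ("E", "D", 0.9), ("B", "D", 0.8)]
G.add_weighted_edges_from(edges)  # cost = 1 - interaction confidence (toy)

hits = ["A", "D"]                 # proteins with high data-derived prizes
subnet = steiner_tree(G, hits, weight="weight")
print(sorted(subnet.edges()))     # the low-cost connecting subnetwork
```

The prize-collecting formulation differs in that it may drop low-prize terminals when connecting them is too expensive, which is what lets PCSF trade off data relevance against network confidence.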
49. A Generative Statistical Algorithm for Automatic Detection of Complex Postures.
- Author
-
Nagy S, Goessling M, Amit Y, and Biron D
- Subjects
- Animals, Caenorhabditis elegans anatomy & histology, Computer Simulation, Data Interpretation, Statistical, Models, Statistical, Pattern Recognition, Automated methods, Reproducibility of Results, Sensitivity and Specificity, Algorithms, Caenorhabditis elegans physiology, Image Interpretation, Computer-Assisted methods, Locomotion physiology, Posture physiology, Whole Body Imaging methods
- Abstract
This paper presents a method for automated detection of complex (non-self-avoiding) postures of the nematode Caenorhabditis elegans and its application to analyses of locomotion defects. Our approach is based on progressively detailed statistical models that enable detection of the head and the body even in cases of severe coilers, where data from traditional trackers is limited. We restrict the input available to the algorithm to a single digitized frame, such that manual initialization is not required and the detection problem becomes embarrassingly parallel. Consequently, the proposed algorithm does not propagate detection errors and naturally integrates in a "big data" workflow used for large-scale analyses. Using this framework, we analyzed the dynamics of postures and locomotion of wild-type animals and mutants that exhibit severe coiling phenotypes. Our approach can readily be extended to additional automated tracking tasks such as tracking pairs of animals (e.g., for mating assays) or different species.
- Published
- 2015
- Full Text
- View/download PDF
50. Mapping the Conformation Space of Wildtype and Mutant H-Ras with a Memetic, Cellular, and Multiscale Evolutionary Algorithm.
- Author
-
Clausen R, Ma B, Nussinov R, and Shehu A
- Subjects
- Crystallography, Humans, Mutation, Oncogene Protein p21(ras) metabolism, Principal Component Analysis, Protein Conformation, Thermodynamics, Algorithms, Computational Biology methods, Models, Molecular, Oncogene Protein p21(ras) chemistry, Oncogene Protein p21(ras) genetics
- Abstract
An important goal in molecular biology is to understand functional changes upon single-point mutations in proteins. Doing so through a detailed characterization of structure spaces and underlying energy landscapes is desirable but continues to challenge methods based on Molecular Dynamics. In this paper we propose a novel algorithm, SIfTER, which is based instead on stochastic optimization to circumvent the computational challenge of exploring the breadth of a protein's structure space. SIfTER is a data-driven evolutionary algorithm, leveraging experimentally-available structures of wildtype and variant sequences of a protein to define a reduced search space from where to efficiently draw samples corresponding to novel structures not directly observed in the wet laboratory. The main advantage of SIfTER is its ability to rapidly generate conformational ensembles, thus allowing mapping and juxtaposing landscapes of variant sequences and relating observed differences to functional changes. We apply SIfTER to variant sequences of the H-Ras catalytic domain, due to the prominent role of the Ras protein in signaling pathways that control cell proliferation, its well-studied conformational switching, and abundance of documented mutations in several human tumors. Many Ras mutations are oncogenic, but detailed energy landscapes have not been reported until now. Analysis of SIfTER-computed energy landscapes for the wildtype and two oncogenic variants, G12V and Q61L, suggests that these mutations cause constitutive activation through two different mechanisms. G12V directly affects binding specificity while leaving the energy landscape largely unchanged, whereas Q61L has pronounced, starker effects on the landscape. An implementation of SIfTER is made available at http://www.cs.gmu.edu/~ashehu/?q=OurTools. We believe SIfTER is useful to the community to answer the question of how sequence mutations affect the function of a protein, when there is an abundance of experimental structures that can be exploited to reconstruct an energy landscape that would be computationally impractical to do via Molecular Dynamics.
- Published
- 2015
- Full Text
- View/download PDF
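Entry 50's key move, restricting the evolutionary search to a data-driven reduced space, is easy to emulate: fit PCA to known structures, evolve candidates in the principal-component coordinates, and score reconstructions. The structure matrix, the (mu + lambda) loop, and energy() are all toy stand-ins for SIfTER's components.

```python
# Evolutionary sampling in a PCA-reduced conformation space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 30 known structures, 166 CA atoms x 3 coordinates each (toy data)
known = rng.standard_normal((30, 3 * 166))
pca = PCA(n_components=5).fit(known)   # the reduced search space
coords = pca.transform(known)

def energy(conf):
    """Placeholder for a molecular energy function (e.g., a Rosetta score)."""
    return float(np.sum(conf ** 2))

population = coords[rng.integers(0, 30, size=20)]
for _ in range(50):                    # simple (mu + lambda) evolutionary loop
    children = population + 0.1 * rng.standard_normal(population.shape)
    pool = np.vstack([population, children])
    scores = [energy(pca.inverse_transform(c[None, :])) for c in pool]
    population = pool[np.argsort(scores)[:20]]   # keep the best candidates
print("best energy:", min(scores))
```

Sampling in the reduced space is what makes breadth-first exploration of the landscape affordable; the memetic and multiscale refinements in the paper operate on top of this basic loop.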