133 results for "Pirola, Y"
Search Results
2. MALVIRUS: an integrated application for viral variant analysis
- Author
-
Ciccolella, S., Denti, L., Bonizzoni, P., Della Vedova, G., Pirola, Y., and Previtali, M.
- Abstract
Background: Being able to efficiently call variants from the increasing amount of sequencing data produced daily from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of viral strains across the globe. Results: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as SARS-CoV-2. MALVIRUS allows users to: (1) construct a variant catalog consisting of a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. Conclusions: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often has better recall.
- Published
- 2022
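Step (1) of the MALVIRUS workflow described above, building a variant catalog from population sequences, can be illustrated with a minimal Python sketch. Everything here (the function name and the assumption that all sequences are pre-aligned to the reference with no indels) is a hypothetical simplification for intuition; the actual tool builds its catalog with dedicated variant-calling components.

```python
def snp_catalog(reference, sequences):
    """Collect SNPs as (position, ref base, alt base) triples observed in
    a set of sequences pre-aligned to the reference (equal length, no
    indels -- an illustrative simplification)."""
    catalog = set()
    for seq in sequences:
        assert len(seq) == len(reference)
        for pos, (r, a) in enumerate(zip(reference, seq)):
            if a != r and a != 'N':  # skip ambiguous bases
                catalog.add((pos, r, a))
    return sorted(catalog)

ref = "ACGTACGT"
sample = ["ACGTACGT", "ACGAACGT", "TCGTACGT"]
print(snp_catalog(ref, sample))  # [(0, 'A', 'T'), (3, 'T', 'A')]
```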
3. Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches
- Author
-
Bonizzoni, P., Costantini, M., De Felice, C., Petescia, A., Pirola, Y., Previtali, M., Rizzi, R., Stoye, J., Zaccagnino, R., and Zizza, R.
- Abstract
Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors to be used in bioinformatics investigations such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings [1]. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in capturing sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the k-mers extracted from them, called k-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads in order to assign them to the most likely gene from which they originated as fragments of transcripts of that gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
- Published
- 2022
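The fingerprint notion described above (the sequence of lengths of consecutive Lyndon factors) can be sketched with Duval's classic linear-time algorithm for the Chen-Fox-Lyndon factorization. This is an illustrative sketch on the plain factorization; lyn2vec works with variants of it, and the function names here are invented.

```python
def lyndon_factorization(s):
    """Duval's algorithm: the unique factorization of s into a
    non-increasing sequence of Lyndon words, in O(len(s)) time."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:                      # emit repeats of the found factor
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def fingerprint(read):
    """Fingerprint = sequence of factor lengths (here on the plain CFL
    factorization rather than the paper's variants)."""
    return [len(f) for f in lyndon_factorization(read)]

print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
print(fingerprint("banana"))           # [1, 2, 2, 1]
```

Two reads sharing a long substring tend to share factor lengths in the middle of their fingerprints, which is what makes the representation useful for similarity detection.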
4. Identification of Chimeric RNAs: a novel machine learning perspective
- Author
-
Bonizzoni, P., De Felice, C., Pirola, Y., Rizzi, R., Zaccagnino, R., and Zizza, R.
- Abstract
Chimeric RNAs are transcripts generated by gene fusion and intergenic splicing events, thus comprising nucleotide sequences from different genes. Recent studies have shown that some chimeric RNAs can play a role in cancer development, and so can be used as diagnostic biomarkers when specifically expressed in cancerous cells and tissues. Most gene fusion prediction tools rely on an initial alignment step. However, alignments might be biased, especially for chimeric reads, creating many false positives. Therefore, developing alignment-free prediction methods for fusion genes would be helpful and may provide new insights into the genomic breakage phenomenon in the cell. In this direction, machine learning could pave the way for new solutions, due to its success in predicting genomic regulatory elements and alternative junction events from the genomic context. To date, however, these techniques have had a marginal supporting role, and, furthermore, manually curated datasets, which are crucial for model training, are often expensive, unreliable or simply unavailable. Here we propose a novel ML-based method that learns to recognize the hidden patterns that allow the identification of chimeric RNAs deriving from oncogenic gene fusions. A preliminary comparison with another state-of-the-art method shows promising results.
- Published
- 2023
5. Multiallelic Maximal Perfect Haplotype Blocks with Wildcards via PBWT
- Author
-
Rojas, I., Valenzuela, O., Rojas Ruiz, F., Herrera, L. J., Ortuño, F., Bonizzoni, P., Della Vedova, G., Pirola, Y., Rizzi, R., and Sgrò, M.
- Abstract
Computing maximal perfect blocks of a given panel of haplotypes is a crucial task for efficiently solving problems such as polyploid haplotype reconstruction and finding identical-by-descent segments shared among individuals of a population. Unfortunately, the presence of missing data in the haplotype panel limits the usefulness of the notion of perfect blocks. We propose a novel algorithm for computing maximal blocks in a panel with missing data (represented as wildcards). The algorithm is based on the Positional Burrows-Wheeler Transform (PBWT) and has been implemented in the tool Wild-pBWT, available at https://github.com/AlgoLab/Wild-pBWT/. Experimental comparison showed that Wild-pBWT is 10–15 times faster than another state-of-the-art approach, while using a negligible amount of memory.
- Published
- 2023
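The PBWT machinery underlying Wild-pBWT can be illustrated, for a binary panel without wildcards, by the core positional-prefix-array pass of Durbin's PBWT: at each site, haplotype indices are stably partitioned by allele, so that after site k they are sorted by their reversed prefixes. A minimal Python sketch (names illustrative; the actual tool handles multiallelic sites and wildcards, and also maintains divergence arrays to report maximal blocks):

```python
def pbwt_prefix_arrays(panel):
    """Core PBWT pass over a binary haplotype panel (list of equal-length
    strings of '0'/'1'). Returns, for each site k, the positional prefix
    array: haplotype indices sorted by their reversed prefixes panel[i][:k].
    Runs in O(m * n) time for m haplotypes over n sites."""
    m, n = len(panel), len(panel[0])
    ppa = list(range(m))              # order before site 0
    arrays = [ppa[:]]
    for k in range(n):
        zeros, ones = [], []
        for i in ppa:                 # stable partition by allele at site k
            (zeros if panel[i][k] == '0' else ones).append(i)
        ppa = zeros + ones
        arrays.append(ppa[:])
    return arrays

panel = ["010", "001", "011"]
print(pbwt_prefix_arrays(panel))  # [[0, 1, 2], [0, 1, 2], [1, 0, 2], [0, 1, 2]]
```

Because haplotypes sharing a long match up to site k end up adjacent in the prefix array, maximal perfect blocks can be read off by scanning consecutive entries.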
6. MALVIRUS: an integrated application for viral variant analysis
- Author
-
Ciccolella, S., Denti, L., Bonizzoni, P., Della Vedova, G., Pirola, Y., and Previtali, M.
- Abstract
Background: Being able to efficiently call variants from the increasing amount of sequencing data produced daily from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of viral strains across the globe. Results: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as SARS-CoV-2. MALVIRUS allows users to: (1) construct a variant catalog consisting of a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. Conclusions: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often has better recall.
- Published
- 2021
7. Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?
- Author
-
Bonizzoni, Paola, De Felice, Clelia, Pirola, Yuri, Rizzi, Raffaella, Zaccagnino, Rocco, and Zizza, Rosalba
- Abstract
Graph pangenomics is a new emerging field in computational biology that is changing the traditional view of a reference genome from a linear sequence to a new paradigm: a sequence graph (pangenome graph, or simply pangenome) that represents the main similarities and differences in multiple evolutionarily related genomes. The speed at which large amounts of genome data are produced, driven by advances in sequencing technologies, far outpaces the progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field are still based on notions rooted in established and quite old literature on combinatorics on words, formal languages and space-efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion, related to variants of the Lyndon factorization, makes it possible to represent sequence similarities for a sample of reads, while the second, the notion of a sample-specific string, is a tool to detect differences in a sample of reads. New perspectives opened by these two notions are discussed.
- Published
- 2022
8. Computational graph pangenomics: a tutorial on data structures and their applications
- Author
-
Baaijens, Jasmijn A., Bonizzoni, Paola, Boucher, Christina, Della Vedova, Gianluca, Pirola, Yuri, Rizzi, Raffaella, and Sirén, Jouni
- Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations—thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
- Published
- 2022
9. KFinger: Capturing Overlaps Between Long Reads by Using Lyndon Fingerprints
- Author
-
Bonizzoni, P., Petescia, A., Pirola, Y., Rizzi, R., Zaccagnino, R., and Zizza, R.
- Abstract
Detecting common regions and overlaps between DNA sequences is crucial in many bioinformatics tasks. One of them is genome assembly based on the overlap graph, which is constructed by detecting the overlaps between genomic reads. When dealing with long reads, this task is further complicated by the length of the reads and the high sequencing error rate. This paper proposes a novel alignment-free method for detecting the overlaps in a set of long reads, which exploits a signature (called a fingerprint) of each read built from a factorization of the read based on the notion of Lyndon words. The method has been implemented in the tool KFinger and tested on a simulated and a real PacBio HiFi dataset of genomic reads; its results have been compared with those of the well-known aligner Minimap2. KFinger is available at https://github.com/AlgoLab/kfinger.
- Published
- 2022
10. Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era
- Author
-
Rizzi, R., Beretta, S., Patterson, M., Pirola, Y., Previtali, M., Della Vedova, G., and Bonizzoni, P.
- Abstract
Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs. Results: In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computational techniques, such as the Bloom filter, to compactly store graphs while keeping all functionalities intact. Conclusions: We complete our analysis with a discussion on the algorithmic issues of assembling from long reads (e.g., PacBio and Oxford Nanopore). Finally, we present some of the most relevant open problems in this field. [Figure not available: see fulltext.]
- Published
- 2019
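As a concrete instance of the compact-storage techniques mentioned in the review, a minimal Bloom filter for k-mer membership might look as follows. This is an illustrative sketch (class name, sizes, and hashing scheme are all assumptions), not the representation used by any specific assembler; it guarantees no false negatives at the cost of a tunable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, h hash functions derived from
    seeded SHA-256. Membership queries never miss an inserted item."""
    def __init__(self, m=1 << 16, h=4):
        self.m, self.h = m, h
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for seed in range(self.h):
            d = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

# store all k-mers of a read; any inserted k-mer is always found
bf = BloomFilter()
read, k = "ACGTACGTT", 4
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])
assert "ACGT" in bf
```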
11. Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning
- Author
-
Martín-Vide, C., Vega-Rodríguez, M. A., Wheeler, T., Bonizzoni, Paola, De Felice, Clelia, Petescia, Alessia, Pirola, Yuri, Rizzi, Raffaella, Stoye, Jens, Zaccagnino, Rocco, and Zizza, Rosalba
- Abstract
Representations of biological sequences facilitating sequence comparison are crucial in several bioinformatics tasks. Recently, the Lyndon factorization has been proved to preserve common factors in overlapping reads, thus leading to the idea of using factorizations of sequences to define measures of similarity between reads. In this paper we propose as a signature of sequencing reads the notion of fingerprint, i.e., the sequence of lengths of consecutive factors in Lyndon-based factorizations of the reads. Surprisingly, fingerprints of reads are effective in preserving sequence similarities while providing a compact representation of the read, and so, k-mers extracted from a fingerprint, called k-fingers, can be used to capture sequence similarity between reads. We first provide a probabilistic framework to estimate the behaviour of fingerprints. Then we experimentally evaluate the effectiveness of this representation for machine learning algorithms for classifying biological sequences. In particular, we considered the problem of assigning RNA-Seq reads to the most likely gene from which they were generated. Our results show that fingerprints can provide an effective machine learning interpretable representation, successfully preserving sequence similarity.
- Published
- 2021
12. Shark: fishing relevant reads in an RNA-Seq sample
- Author
-
Denti, Luca, Pirola, Yuri, Previtali, Marco, Ceccato, Tamara, Della Vedova, Gianluca, Rizzi, Raffaella, and Bonizzoni, Paola
- Abstract
Motivation: Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. Results: We introduce a novel computational problem, called gene assignment, and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists of extracting from the sample the reads that were most probably sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. Availability: The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark.
- Published
- 2021
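The gene-assignment idea can be illustrated by a naive k-mer voting scheme: index the panel's k-mers, then keep a read only if it shares enough k-mers with some gene. This is a hypothetical simplification for intuition (all names and parameters below are illustrative); Shark's actual alignment-free method is more refined.

```python
def build_gene_index(gene_seqs, k):
    """Map each k-mer of the gene panel to the set of genes containing it."""
    index = {}
    for gene, seq in gene_seqs.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(gene)
    return index

def assign_read(read, index, k, min_hits=2):
    """Vote for the gene sharing the most k-mers with the read;
    return None if no gene reaches min_hits (read deemed irrelevant)."""
    votes = {}
    for i in range(len(read) - k + 1):
        for gene in index.get(read[i:i + k], ()):
            votes[gene] = votes.get(gene, 0) + 1
    if not votes:
        return None
    best = max(votes, key=votes.get)
    return best if votes[best] >= min_hits else None

genes = {"G1": "ACGTACGTAC", "G2": "TTTTGGGGCC"}
idx = build_gene_index(genes, 4)
print(assign_read("ACGTACG", idx, 4))   # G1
print(assign_read("AAAAAAAA", idx, 4))  # None (irrelevant read is filtered)
```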
13. Computing the multi-string BWT and LCP array in external memory
- Author
-
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., and Rizzi, R.
- Abstract
Indexing very large collections of strings, such as those produced by the widespread next-generation sequencing technologies, heavily relies on the multi-string generalization of the Burrows-Wheeler Transform (BWT): the large memory requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method that simultaneously builds the BWT and the LCP array on a collection of m strings of different lengths. Over a set of strings of constant length k, the algorithm runs in O(mkl) time and I/O volume, using O(k+m) main memory, where l is the maximum value in the LCP array.
- Published
- 2021
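For intuition on the object being computed, here is a naive in-memory construction of the multi-string BWT (quadratic, with an explicit distinct sentinel per string); the paper's contribution is computing this, together with the LCP array, in external memory. The sketch and its conventions (e.g. printing every sentinel as '$', sentinels of earlier strings sorting first) are illustrative assumptions.

```python
def multi_string_bwt(strings):
    """Naive multi-string BWT: model each character as a (1, char) tuple
    and append a per-string sentinel (0, index), so sentinels compare
    smaller than every character and earlier strings' sentinels sort
    first. Sort all suffixes of all strings and emit, for each suffix,
    the character cyclically preceding it (sentinels printed as '$').
    Quadratic -- for intuition only."""
    suffixes = []
    for idx, s in enumerate(strings):
        t = [(1, c) for c in s] + [(0, idx)]    # sentinel sorts lowest
        for j in range(len(t)):
            suffixes.append((t[j:], t[j - 1]))  # t[-1] wraps to the sentinel
    suffixes.sort(key=lambda x: x[0])
    return ''.join('$' if prev[0] == 0 else prev[1]
                   for _, prev in suffixes)

print(multi_string_bwt(["AB"]))      # B$A  (matches the BWT of "AB$")
print(multi_string_bwt(["A", "B"]))  # AB$$
```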
14. Bioconda: Sustainable and comprehensive software distribution for the life sciences
- Author
-
Dale R., Gruning B., Sjodin A., Rowe J., Chapman B. A., Tomkins-Tinch C. H., Valieris R., Batut B., Caprez A., Cokelaer T., Yusuf D., Beauchamp K. A., Brinda K., Wollmann T., Corguille G. L., Ryan D., Bretaudeau A., Hoogstrate Y., Pedersen B. S., Heeringen S., Raden M., Luna-Valero S., Soranzo N., Smet M. D., Kuster G. V., Kirchner R., Pantano L., Charlop-Powers Z., Thornton K., Martin M., Beek M. D., Maticzka D., Miladi M., Will S., Gravouil K., Unneberg P., Brueffer C., Blank C., Piro V. C., Wolff J., Antao T., Gladman S., Shlyakhter I., Hollander M., Mabon P., Shen W., Boekel J., Holtgrewe M., Bouvier D., de Ruiter J. R., Cabral J., Choudhary S., Harding N., Kleinkauf R., Enns E., Eggenhofer F., Brown J., Cock P. J. A., Timm H., Thomas C., Zhang X.-O., Chambers M., Turaga N., Seiler E., Brislawn C., Pruesse E., Fallmann J., Kelleher J., Nguyen H., Parsons L., Fang Z., Stovner E. B., Stoler N., Ye S., Wohlers I., Farouni R., Freeberg M., Johnson J. E., Bargull M., Kensche P. R., Webster T. H., Eppley J. M., Stahl C., Rose A. S., Reynolds A., Wang L.-B., Garnier X., Dirmeier S., Knudsen M., Taylor J., Srivastava A., Rai V., Agren R., Junge A., Guimera R. V., Khan A., Schmeier S., He G., Pinello L., Hagglund E., Mikheyev A. S., Preussner J., Waters N. R., Li W., Capellades J., Chande A. T., Pirola Y., Hiltemann S., Bendall M. L., Singh S., Dunn W. A., Drouin A., Domenico T. D., Bruijn I., Larson D. E., Chicco D., Grassi E., Gonnella G., B J., Wang L., Giacomoni F., Clarke E., Blankenberg D., Tran C., Patro R., Laurent S., Gopez M., Sennblad B., Baaijens J. A., Ewels P., Wright P. R., Enache O. M., Roger P., Dampier W., Koppstein D., Devisetty U. K., Rausch T., Cornwell M., Salatino A. E., Seiler J., Jung M., Kornobis E., Cumbo F., Stocker B. K., Moskalenko O., Bogema D. R., Workentine M. L., Newhouse S. J., Leprevost F. D. V., Arvai K., and Koster J.
- Abstract
Bioinformatics software comes in a variety of programming languages and requires diverse installation methods. This heterogeneity makes management of a software stack complicated, error-prone, and inordinately time-consuming. Whereas software deployment has traditionally been handled by administrators, ensuring the reproducibility of data analyses [1–3] requires that the researcher be able to maintain full control of the software environment, rapidly modify it without administrative privileges, and reproduce the same software stack on different machines.
- Published
- 2018
15. Bioconda: sustainable and comprehensive software distribution for the life sciences
- Author
-
Dale R., Gruning B., Sjodin A., Rowe J., Chapman B. A., Tomkins-Tinch C. H., Valieris R., Batut B., Caprez A., Cokelaer T., Yusuf D., Beauchamp K. A., Brinda K., Wollmann T., Corguille G. L., Ryan D., Bretaudeau A., Hoogstrate Y., Pedersen B. S., Heeringen S., Raden M., Luna-Valero S., Soranzo N., Smet M. D., Kuster G. V., Kirchner R., Pantano L., Charlop-Powers Z., Thornton K., Martin M., Beek M. D., Maticzka D., Miladi M., Will S., Gravouil K., Unneberg P., Brueffer C., Blank C., Piro V. C., Wolff J., Antao T., Gladman S., Shlyakhter I., Hollander M., Mabon P., Shen W., Boekel J., Holtgrewe M., Bouvier D., de Ruiter J. R., Cabral J., Choudhary S., Harding N., Kleinkauf R., Enns E., Eggenhofer F., Brown J., Cock P. J. A., Timm H., Thomas C., Zhang X.-O., Chambers M., Turaga N., Seiler E., Brislawn C., Pruesse E., Fallmann J., Kelleher J., Nguyen H., Parsons L., Fang Z., Stovner E. B., Stoler N., Ye S., Wohlers I., Farouni R., Freeberg M., Johnson J. E., Bargull M., Kensche P. R., Webster T. H., Eppley J. M., Stahl C., Rose A. S., Reynolds A., Wang L.-B., Garnier X., Dirmeier S., Knudsen M., Taylor J., Srivastava A., Rai V., Agren R., Junge A., Guimera R. V., Khan A., Schmeier S., He G., Pinello L., Hagglund E., Mikheyev A. S., Preussner J., Waters N. R., Li W., Capellades J., Chande A. T., Pirola Y., Hiltemann S., Bendall M. L., Singh S., Dunn W. A., Drouin A., Domenico T. D., Bruijn I., Larson D. E., Chicco D., Grassi E., Gonnella G., B J., Wang L., Giacomoni F., Clarke E., Blankenberg D., Tran C., Patro R., Laurent S., Gopez M., Sennblad B., Baaijens J. A., Ewels P., Wright P. R., Enache O. M., Roger P., Dampier W., Koppstein D., Devisetty U. K., Rausch T., Cornwell M., Salatino A. E., Seiler J., Jung M., Kornobis E., Cumbo F., Stocker B. K., Moskalenko O., Bogema D. R., Workentine M. L., Newhouse S. J., Leprevost F. D. V., Arvai K., and Koster J.
- Subjects
0301 basic medicine ,Computer science ,[SDV]Life Sciences [q-bio] ,Medizin ,computer.software_genre ,Biochemistry ,User-Computer Interface ,03 medical and health sciences ,0302 clinical medicine ,Software system ,Molecular Biology ,ComputingMilieux_MISCELLANEOUS ,Social software engineering ,Database ,business.industry ,Software development ,INF/01 - INFORMATICA ,Computational Biology ,Cell Biology ,Software distribution ,030104 developmental biology ,Software construction ,[SDE]Environmental Sciences ,Package development process ,Backporting ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,business ,computer ,Software ,030217 neurology & neurosurgery ,Biotechnology - Abstract
International audience; We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 contributors. Bioconda improves analysis reproducibility by allowing users to define isolated environments with defined software versions, all of which are easily installed and managed without administrative privileges.
- Published
- 2018
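The Bioconda abstract above centers on isolated, version-pinned Conda environments. As a hedged usage sketch (the channel setup follows the Bioconda documentation; the environment name `align` and the pinned tool versions are illustrative choices, not prescribed by the record):

```shell
# One-time channel configuration, added lowest to highest priority
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# An isolated environment with pinned software versions,
# created and managed without administrative privileges
conda create -y -n align samtools=1.9 bwa=0.7.17
conda activate align
```

Because each environment records exact package versions, the same analysis environment can be recreated on another machine, which is the reproducibility benefit the abstract describes.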
16. Multithread multistring Burrows-Wheeler transform and longest common prefix array
- Author
-
Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, Rizzi, R, Bonizzoni, Paola, Della Vedova, Gianluca, Pirola, Yuri, Previtali, Marco, Rizzi, Raffaella, Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, Rizzi, R, Bonizzoni, Paola, Della Vedova, Gianluca, Pirola, Yuri, Previtali, Marco, and Rizzi, Raffaella
- Abstract
Indexing huge collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multistring generalizations of the Burrows–Wheeler transform (BWT) and the longest common prefix (LCP) array, since solving efficiently both problems are essential ingredients of several algorithms on a collection of strings, such as those for genome assembly. In this article, we explore a multithread computational strategy for building the BWT and LCP array. Our algorithm applies a divide and conquer approach that leads to parallel computation of multistring BWT and LCP array.
- Published
- 2019
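The two structures named in the abstract are easy to state even though the paper's contribution is their parallel construction. As an illustrative baseline only (naive suffix sorting in Python, not the divide-and-conquer algorithm of the paper), the multistring BWT and LCP array can be computed as:

```python
def multistring_bwt_lcp(strings):
    """Naive multistring BWT and LCP construction by sorting all suffixes.

    Illustrative sketch: each string gets a terminator '\0' (which sorts
    before letters), suffixes of all strings are sorted together, the BWT
    collects the character cyclically preceding each suffix, and the LCP
    array holds the longest common prefix of adjacent sorted suffixes.
    """
    suffixes = []  # (suffix, string index, position); index breaks ties
    for i, s in enumerate(strings):
        t = s + "\0"
        for p in range(len(t)):
            suffixes.append((t[p:], i, p))
    suffixes.sort()
    bwt, lcp, prev = [], [], None
    for suf, i, p in suffixes:
        t = strings[i] + "\0"
        bwt.append(t[p - 1])  # preceding character (cyclic at p == 0)
        if prev is None:
            lcp.append(0)
        else:
            l = 0
            while l < min(len(prev), len(suf)) and prev[l] == suf[l]:
                l += 1
            lcp.append(l)
        prev = suf
    return "".join(bwt), lcp
```

On the single string "banana" this reproduces the textbook BWT "annb\0aa"; the paper's algorithm computes the same output for large read collections in parallel.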
17. Use of SNP genotypes to identify carriers of harmful recessive mutations in cattle populations
- Author
-
Biscarini, F, Schwarzenbacher, H, Pausch, H, Nicolazzi, E, Pirola, Y, Biffani, S, Biscarini F., Schwarzenbacher H., Pausch H., Nicolazzi E. L., Pirola Y., Biffani S., Biscarini, F, Schwarzenbacher, H, Pausch, H, Nicolazzi, E, Pirola, Y, Biffani, S, Biscarini F., Schwarzenbacher H., Pausch H., Nicolazzi E. L., Pirola Y., and Biffani S.
- Abstract
Background: SNP (single nucleotide polymorphism) genotype data are increasingly available in cattle populations and, among other things, can be used to predict carriers of specific mutations. It is therefore convenient to have a practical statistical method for the accurate classification of individuals into carriers and non-carriers. In this paper, we compared, through cross-validation, five classification models (Lasso-penalized logistic regression (Lasso), Support Vector Machines with either linear or radial kernel (SVML and SVMR), k-nearest neighbors (KNN), and multi-allelic gene prediction (MAG)) for the identification of carriers of the TUBD1 recessive mutation on BTA19 (Bos taurus autosome 19), known to be associated with high calf mortality. A population of 3116 Fleckvieh and 392 Brown Swiss animals genotyped with the 54K SNP-chip was available for the analysis. Results: In general, the use of SNP genotypes proved to be very effective for the identification of mutation carriers. The best predictive models were Lasso, SVML and MAG, with an average error rate, respectively, of 0.2 %, 0.4 % and 0.6 % in Fleckvieh, and 1.2 %, 0.9 % and 1.7 % in Brown Swiss. For the three models, the false positive rate was, respectively, 0.1 %, 0.1 % and 0.2 % in Fleckvieh, and 3.0 %, 2.4 % and 1.6 % in Brown Swiss; the false negative rate was 4.4 %, 7.6 % and 1.0 % in Fleckvieh, and 0.0 %, 0.1 % and 0.8 % in Brown Swiss. MAG appeared to be more robust to sample size reduction: with 25 % of the data, the average error rate was 0.7 % and 2.2 % in Fleckvieh and Brown Swiss, compared to 2.1 % and 5.5 % with Lasso, and 2.6 % and 12.0 % with SVML. Conclusions: The use of SNP genotypes is a very effective and efficient technique for the identification of mutation carriers in cattle populations. Very few misclassifications were observed, overall and both in the carriers and non-carriers classes. This indicates that this is a very reliable approach for potential applications in cattle breeding.
- Published
- 2016
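To make the classification setting concrete, here is a toy sketch of one of the five compared models, k-nearest neighbors, in plain Python. The genotype vectors and labels below are invented for illustration and are not drawn from the Fleckvieh or Brown Swiss data:

```python
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """k-nearest-neighbour classification of SNP genotype vectors.

    Genotypes are coded 0/1/2 (copies of the alternate allele at each SNP);
    the label of the query is the majority label among its k nearest
    training animals under squared Euclidean distance.
    """
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Made-up genotypes at 4 SNPs flanking the mutation:
# carriers share a haplotype, so their genotype vectors cluster together.
train = [(0, 1, 1, 0), (0, 1, 2, 0), (1, 1, 1, 0),   # carriers
         (2, 0, 0, 2), (2, 0, 0, 1), (1, 0, 0, 2)]   # non-carriers
labels = ["carrier"] * 3 + ["non-carrier"] * 3
print(knn_classify(train, labels, (0, 1, 1, 1)))  # prints: carrier
```

The study's cross-validation would repeatedly hold out animals, classify them this way (or with Lasso, SVM, MAG), and tally false positive and false negative rates.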
18. Divide and conquer computation of the multi-string BWT and LCP array
- Author
-
Bonizzoni, P, Della Vedova, G, Nicosia, S, Pirola, Y, Previtali, M, Rizzi, R, Bonizzoni, P, Della Vedova, G, Nicosia, S, Pirola, Y, Previtali, M, and Rizzi, R
- Abstract
Indexing huge collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT) and the Longest Common Prefix (LCP) array, since solving efficiently both problems are essential ingredients of several algorithms on a collection of strings. In this paper we explore lightweight and parallel computational strategies for building the BWT and LCP array. We design a novel algorithm based on a divide and conquer approach that leads to a simultaneous and parallel computation of multi-string BWT and LCP array.
- Published
- 2018
19. FSG: Fast String Graph Construction for de Novo Assembly
- Author
-
Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, Rizzi, R, Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, and Rizzi, R
- Abstract
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.
- Published
- 2017
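For intuition about what FSG computes, the arcs of a string graph encode suffix-prefix overlaps between reads. A brute-force Python sketch (quadratic scanning over read pairs, whereas FSG derives the same information from the FM-index without accessing the reads) is:

```python
def overlap_edges(reads, min_olen=3):
    """Arcs of the (unreduced) overlap graph of a read collection.

    An arc (i, j, olen) means a suffix of read i of length olen equals a
    prefix of read j, with olen >= min_olen; only the longest overlap per
    ordered pair is kept. Illustrative brute force, not FSG's algorithm.
    """
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            # Try proper overlaps from longest to shortest.
            for olen in range(min(len(a), len(b)) - 1, min_olen - 1, -1):
                if a.endswith(b[:olen]):
                    edges.append((i, j, olen))
                    break
    return edges
```

For example, with reads ["ATCGGA", "GGATTC", "TTCAAA"] the arcs chain the three reads into one path, which is exactly the layout an overlap-layout-consensus assembler would follow.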
20. An External-Memory Algorithm for String Graph Construction
- Author
-
Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, Rizzi, R, Bonizzoni, P, Della Vedova, G, Pirola, Y, Previtali, M, and Rizzi, R
- Abstract
Some recent results (Bauer et al. in Algorithms in bioinformatics, Springer, Berlin, pp 326–337, 2012; Cox et al. in Algorithms in bioinformatics, Springer, Berlin, pp. 214–224, 2012; Rosone and Sciortino in The nature of computation. Logic, algorithms, applications, Springer, Berlin, pp 353–364, 2013) have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows–Wheeler transform of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches that are currently used to tackle this problem are memory-intensive. This fact does not bode well with the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs represent two overlapping samples. In this paper we address an open problem of Simpson and Durbin (Bioinformatics 26(12):i367–i373, 2010): to design an external-memory algorithm to compute the string graph.
- Published
- 2017
21. Protein Kinase A Activation Promotes Cancer Cell Resistance to Glucose Starvation and Anoikis
- Author
-
Palorini, R, Votta, G, Pirola, Y, De Vitto, H, De Palma, S, Airoldi, C, Vasso, M, Ricciardiello, F, Lombardi, P, Cirulli, C, Rizzi, R, Nicotra, F, Hiller, K, Gelfi, C, Alberghina, L, Chiaradonna, F, PALORINI, ROBERTA, VOTTA, GIUSEPPINA, PIROLA, YURI, De Vitto, Humberto, AIROLDI, CRISTINA, RICCIARDIELLO, FRANCESCA, RIZZI, RAFFAELLA, NICOTRA, FRANCESCO, ALBERGHINA, LILIA, CHIARADONNA, FERDINANDO, Palorini, R, Votta, G, Pirola, Y, De Vitto, H, De Palma, S, Airoldi, C, Vasso, M, Ricciardiello, F, Lombardi, P, Cirulli, C, Rizzi, R, Nicotra, F, Hiller, K, Gelfi, C, Alberghina, L, Chiaradonna, F, PALORINI, ROBERTA, VOTTA, GIUSEPPINA, PIROLA, YURI, De Vitto, Humberto, AIROLDI, CRISTINA, RICCIARDIELLO, FRANCESCA, RIZZI, RAFFAELLA, NICOTRA, FRANCESCO, ALBERGHINA, LILIA, and CHIARADONNA, FERDINANDO
- Abstract
Cancer cells often rely on glycolysis to obtain energy and support anabolic growth. Several studies showed that glycolytic cells are susceptible to cell death when subjected to low glucose availability or to lack of glucose. However, some cancer cells, including glycolytic ones, can efficiently acquire higher tolerance to glucose depletion, leading to their survival and aggressiveness. Although increased resistance to glucose starvation has been shown to be a consequence of signaling pathways and compensatory metabolic routes activation, the full repertoire of the underlying molecular alterations remains elusive. Using omics and computational analyses, we found that cyclic adenosine monophosphate-Protein Kinase A (cAMP-PKA) axis activation is fundamental for cancer cell resistance to glucose starvation and anoikis. Notably, here we show that such a PKA-dependent survival is mediated by parallel activation of autophagy and glutamine utilization that in concert concur to attenuate the endoplasmic reticulum (ER) stress and to sustain cell anabolism. Indeed, the inhibition of PKA-mediated autophagy or glutamine metabolism increased the level of cell death, suggesting that the induction of autophagy and metabolic rewiring by PKA is important for cancer cell survival under glucose starvation. Importantly, both processes actively participate in cancer cell survival mediated by suspension-activated PKA as well. In addition, we also identify a PKA/Src mechanism capable of protecting cancer cells from anoikis. Our results reveal for the first time the role of the versatile PKA in cancer cell survival under chronic glucose starvation and anoikis, and suggest that PKA may be a novel potential target for cancer treatment.
- Published
- 2016
22. On the Minimum Error Correction Problem for haplotype assembly in diploid and polyploid genomes
- Author
-
Bonizzoni, P. (Paola), Dondi, R. (Riccardo), Klau, G.W. (Gunnar), Pirola, Y. (Yuri), Pisanti, N. (Nadia), Zaccaria, S. (Simone), Bonizzoni, P. (Paola), Dondi, R. (Riccardo), Klau, G.W. (Gunnar), Pirola, Y. (Yuri), Pisanti, N. (Nadia), and Zaccaria, S. (Simone)
- Abstract
In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.
- Published
- 2016
- Full Text
- View/download PDF
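The MEC objective described in the abstract above can be made concrete with a brute-force solver. It is usable only on toy instances, which is consistent with the hardness results the paper proves; the fragment encoding below is a common textbook convention, not necessarily the paper's notation:

```python
from itertools import product

def mec(fragments):
    """Exact Minimum Error Correction by exhausting fragment bipartitions.

    fragments: equal-length rows over {'0','1','-'}, where '-' marks a
    position the fragment does not cover. For each bipartition of the
    fragments into the two haplotypes, the cheapest way to make every
    column consistent is to flip the minority alleles in that column.
    """
    n, m = len(fragments), len(fragments[0])
    best = float("inf")
    for assign in product((0, 1), repeat=n):
        cost = 0
        for side in (0, 1):
            rows = [fragments[i] for i in range(n) if assign[i] == side]
            for col in range(m):
                vals = [r[col] for r in rows if r[col] != "-"]
                cost += min(vals.count("0"), vals.count("1"))
        best = min(best, cost)
    return best
```

For instance, fragments ["0011", "0011", "1100", "1101"] need a single base correction (the last allele of the fourth fragment), so the MEC score is 1.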
23. FSG: Fast string graph construction for de novo assembly of reads data
- Author
-
Bourgeois, A, Skums, P, Wan, X, Zelikovsky, A, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Previtali, M, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, PREVITALI, MARCO, RIZZI, RAFFAELLA, Bourgeois, A, Skums, P, Wan, X, Zelikovsky, A, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Previtali, M, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, PREVITALI, MARCO, and RIZZI, RAFFAELLA
- Abstract
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform (BWT). We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.
- Published
- 2016
24. Beyond Evolutionary Trees
- Author
-
Kao Ming-Yang, Dondi, R, Pirola, Y, Dondi, Riccardo, Pirola, Yuri, Kao Ming-Yang, Dondi, R, Pirola, Y, Dondi, Riccardo, and Pirola, Yuri
- Published
- 2016
25. On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes
- Author
-
Bonizzoni, P, Dondi, R, Klau, G, Pirola, Y, Pisanti, N, Zaccaria, S, BONIZZONI, PAOLA, PIROLA, YURI, ZACCARIA, SIMONE, Bonizzoni, P, Dondi, R, Klau, G, Pirola, Y, Pisanti, N, Zaccaria, S, BONIZZONI, PAOLA, PIROLA, YURI, and ZACCARIA, SIMONE
- Abstract
In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.
- Published
- 2016
26. HapCol: Accurate and memory-efficient haplotype assembly from long reads
- Author
-
Pirola, Y, Zaccaria, S, Dondi, R, Klau, G, Pisanti, N, Bonizzoni, P, Pirola, Y, Zaccaria, S, Dondi, R, Klau, G, Pisanti, N, and Bonizzoni, P
- Abstract
Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of 'future-generation' sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. Results: By exploiting a feature of future-generation technologies - the uniform distribution of sequencing errors - we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2016
27. LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly
- Author
-
Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Previtali, M, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, PREVITALI, MARCO, RIZZI, RAFFAELLA, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Previtali, M, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, PREVITALI, MARCO, and RIZZI, RAFFAELLA
- Abstract
The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT). We have developed a disk-based algorithm for computing string graphs in external memory: the light string graph (LSG). LSG relies on a new representation of the FM-index that keeps the main memory requirement independent of the size of the dataset. Moreover, we have developed a pipeline for genome assembly from NGS data that integrates LSG with the assembly step of SGA (Simpson and Durbin, 2012), a state-of-the-art string graph-based assembler, and uses BEETL for indexing the input data. LSG is open source software and is available online. We have analyzed our implementation on an 875-million-read whole-genome dataset, on which LSG has built the string graph using only 1GB of main memory (reducing the memory occupation by a factor of 50 with respect to SGA), while requiring slightly more than twice the time of SGA. The analysis of the entire pipeline shows an important decrease in memory usage, while managing to have only a moderate increase in the running time.
- Published
- 2016
28. A quantitative study of neutrality in GP boolean landscapes
- Author
-
Leonardo Vanneschi, Tomassini, M., Pirola, Y., Verel, S., Collard, P., Mauri, G., Vanneschi, L, Pirola, Y, Collard, P, Tomassini, M, Verel, S, Mauri, G, Laboratoire d'Informatique, Signaux, et Systèmes de Sophia Antipolis (I3S), Université Nice Sophia Antipolis (... - 2019) (UNS), COMUE Université Côte d'Azur (2015-2019) (COMUE UCA)-COMUE Université Côte d'Azur (2015-2019) (COMUE UCA)-Centre National de la Recherche Scientifique (CNRS)-Université Côte d'Azur (UCA), Dipartimento di Informatica Sistemistica e Comunicazione (DISCo), Università degli Studi di Milano-Bicocca [Milano] (UNIMIB), Institut des systèmes d'information (ISI), Université de Lausanne (UNIL), and M. Keijzer et al.
- Subjects
Neutral network ,Mathematical optimization ,Fitness function ,Fitness landscape ,fitness landscapes ,neutrality ,INF/01 - INFORMATICA ,Genetic programming ,0102 computer and information sciences ,02 engineering and technology ,01 natural sciences ,Evolutionary computation ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,automatic programming ,Tree structure ,evolutionary computation ,010201 computation theory & mathematics ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Neutrality ,Automatic programming ,performance measure ,Mathematics - Abstract
International audience; Neutrality of some boolean parity fitness landscapes is investigated in this paper. Compared with some well known contributions on the same issue, we define some new measures that help characterizing neutral landscapes, we use a new sampling methodology, which captures some features that are disregarded by uniform random sampling, and we introduce new genetic operators to define the neighborhood of tree structures. We compare the fitness landscape induced by two different sets of functional operators (SNand and SXorNot). The different characteristics of the neutral networks seem to justify the different difficulties of these landscapes for genetic programming.
- Published
- 2006
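As a simplified illustration of the kind of measure involved (the paper works on genetic-programming tree structures with dedicated operators and parity fitness landscapes; this sketch uses plain bitstrings and an invented plateau-rich fitness), the neutrality degree of a genotype can be computed as the fraction of its neighbours with equal fitness:

```python
def neutrality_degree(fitness, genotype):
    """Fraction of one-bit-flip neighbours that keep the same fitness.

    A neighbourhood-based neutrality measure on a bitstring landscape;
    genotypes on the same plateau are linked into neutral networks.
    """
    f = fitness(genotype)
    neighbours = [genotype[:i] + ('1' if genotype[i] == '0' else '0') + genotype[i+1:]
                  for i in range(len(genotype))]
    return sum(fitness(n) == f for n in neighbours) / len(neighbours)

# Made-up plateau-rich fitness: pairs of 1s count, so many one-bit
# flips leave the fitness value unchanged (a neutral move).
fit = lambda g: g.count('1') // 2
print(neutrality_degree(fit, "0110"))  # prints: 0.5
```

Averaging this quantity over sampled genotypes gives one landscape-level statistic; the paper's point is that such measures, with suitable sampling and operators, differ between the SNand and SXorNot landscapes and correlate with problem difficulty.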
29. On the fixed parameter tractability and approximability of the minimum error correction problem
- Author
-
Cicalese, F., Porat, E., Vaccaro, U., Bonizzoni, P. (Paola), Dondi, R. (Riccardo), Klau, G.W. (Gunnar), Pirola, Y. (Yuri), Pisanti, N. (Nadia), Zaccaria, S. (Simone), Cicalese, F., Porat, E., Vaccaro, U., Bonizzoni, P. (Paola), Dondi, R. (Riccardo), Klau, G.W. (Gunnar), Pirola, Y. (Yuri), Pisanti, N. (Nadia), and Zaccaria, S. (Simone)
- Abstract
Haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum Error Correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. By using novel combinatorial properties of MEC instances, we are able to provide new results on the fixed-parameter tractability and approximability of MEC. In particular, we show that MEC is in FPT when parameterized by the number of corrections, and, on “gapless” instances, it is in FPT also when parameterized by the length of the fragments, whereas the result known in literature forces the reconstruction of complementary haplotypes. Then, we show that MEC cannot be approximated within any constant factor while it is approximable within factor O(log nm) where nm is the size of the input. Finally, we provide a practical 2-approximation algorithm for the Binary MEC, a variant of MEC that has been applied in the framework of clustering binary data.
- Published
- 2015
- Full Text
- View/download PDF
30. HapCol: Accurate and Memory-efficient Haplotype Assembly from Long Reads
- Author
-
Pirola, Y. (Yuri), Zaccaria, S. (Simone), Dondi, R. (Riccardo), Klau, G.W. (Gunnar), Pisanti, N. (Nadia), Bonizzoni, P. (Paola), Pirola, Y. (Yuri), Zaccaria, S. (Simone), Dondi, R. (Riccardo), Klau, G.W. (Gunnar), Pisanti, N. (Nadia), and Bonizzoni, P. (Paola)
- Abstract
Motivation: Haplotype assembly is the computational problem of reconstructing haplotypes in diploid organisms and is of fundamental importance for characterizing the effects of single-nucleotide polymorphisms on the expression of phenotypic traits. Haplotype assembly highly benefits from the advent of ‘future-generation’ sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performances degrade as read length and sequencing coverage increase or because they are based on restrictive assumptions. Results: By exploiting a feature of future-generation technologies—the uniform distribution of sequencing errors—we designed an exact algorithm, called HapCol, that is exponential in the maximum number of corrections for each single-nucleotide polymorphism position and that minimizes the overall error-correction score. We performed an experimental analysis, comparing HapCol with the current state-of-the-art combinatorial methods both on real and simulated data. On a standard benchmark of real data, we show that HapCol is competitive with state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing to phase datasets with higher coverage and without the traditional all-heterozygous assumption. Availability and implementation: Our source code is available under the terms of the GNU General Public License at http://hapcol.algolab.eu/. Contact: bonizzoni@disco.unimib.it Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2015
- Full Text
- View/download PDF
31. On the fixed parameter tractability and approximability of the minimum error correction problem
- Author
-
Bonizzoni, P, Dondi, R, Klau, G, Pirola, Y, Pisanti, N, Zaccaria, S, BONIZZONI, PAOLA, PIROLA, YURI, ZACCARIA, SIMONE, Bonizzoni, P, Dondi, R, Klau, G, Pirola, Y, Pisanti, N, Zaccaria, S, BONIZZONI, PAOLA, PIROLA, YURI, and ZACCARIA, SIMONE
- Abstract
Haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum Error Correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. By using novel combinatorial properties of MEC instances, we are able to provide new results on the fixed-parameter tractability and approximability of MEC. In particular, we show that MEC is in FPT when parameterized by the number of corrections, and, on “gapless” instances, it is in FPT also when parameterized by the length of the fragments, whereas the result known in literature forces the reconstruction of complementary haplotypes. Then, we show that MEC cannot be approximated within any constant factor while it is approximable within factor O(log nm) where nm is the size of the input. Finally, we provide a practical 2-approximation algorithm for the Binary MEC, a variant of MEC that has been applied in the framework of clustering binary data.
- Published
- 2015
32. Transcriptome assembly and alternative splicing analysis
- Author
-
Picardi, E, Bonizzoni, P, DELLA VEDOVA, G, Pesole, G, Pirola, Y, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, RIZZI, RAFFAELLA, Picardi, E, Bonizzoni, P, DELLA VEDOVA, G, Pesole, G, Pirola, Y, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, and RIZZI, RAFFAELLA
- Abstract
Alternative Splicing (AS) is the molecular phenomenon whereby multiple transcripts are produced from the same gene locus. As a consequence, it is responsible for the expansion of eukaryotic transcriptomes. Aberrant AS is involved in the onset and progression of several human diseases. Therefore, the characterization of exon-intron structure of a gene and the detection of corresponding transcript isoforms is an extremely relevant biological task. Nonetheless, the computational prediction of AS events and the repertoire of alternative transcripts is yet a challenging issue. Hereafter we introduce PIntron, a software package to predict the exon-intron structure and the full-length isoforms of a gene given a genomic region and a set of transcripts (ESTs and/or mRNAs). The software is open source and available at http://pintron.algolab.eu. PIntron has been designed for (and extensively tested on) a standard workstation without requiring dedicated expensive hardware. It easily manages large genomic regions and more than 20,000 ESTs, achieving good accuracy as shown in an experimental evaluation performed on 112 well-annotated genes selected from the ENCODE human regions used as training set in the EGASP competition.
- Published
- 2015
33. Covering Pairs in Directed Acyclic Graphs
- Author
-
Beerenwinkel, N, Beretta, S, Bonizzoni, P, Dondi, R, Pirola, Y, BERETTA, STEFANO, BONIZZONI, PAOLA, PIROLA, YURI, Beerenwinkel, N, Beretta, S, Bonizzoni, P, Dondi, R, Pirola, Y, BERETTA, STEFANO, BONIZZONI, PAOLA, and PIROLA, YURI
- Abstract
The Minimum Path Cover (MinPC) problem on directed acyclic graphs (DAGs) is a classical problem in graph theory that provides a clear and simple mathematical formulation for several applications in computational biology. In this paper, we study the computational complexity of three constrained variants of MinPC motivated by the recent introduction of Next-Generation Sequencing technologies. The first variant (MinRPC), given a DAG and a set of pairs of vertices, asks for a minimum-cardinality set of (not necessarily disjoint) paths such that both vertices of each pair belong to the same path. For this problem, we establish a sharp tractability borderline depending on the 'overlapping degree' of the instance, a natural parameter in some applications of the problem. The second variant we consider (MinPCRP), given a DAG and a set of pairs of vertices, asks for a minimum-cardinality set of (not necessarily disjoint) paths 'covering' all the vertices of the graph and such that both vertices of each pair belong to the same path. For this problem, we show that, while it is NP-hard to compute if there exists a solution consisting of at most three paths, it is possible to decide in polynomial time whether a solution consisting of at most two paths exists. The third variant (MaxRPSP), given a DAG and a set of pairs of vertices, asks for a single path containing the maximum number of the given pairs of vertices. We show that MaxRPSP is W[1]-hard when parameterized by the number of covered pairs and we give a fixed-parameter algorithm when the parameter is the maximum overlapping degree.
- Published
- 2015
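The constrained variants studied in the paper build on the classical, unconstrained Minimum Path Cover, which is solvable in polynomial time. One standard formulation, sketched below, covers all vertices with the minimum number of vertex-disjoint paths (each following the DAG's edges): the answer equals n minus a maximum bipartite matching between out-copies and in-copies of the vertices, found here with Kuhn's augmenting-path algorithm.

```python
def min_path_cover(n, edges):
    """Minimum number of vertex-disjoint paths covering all n vertices
    of a DAG: n minus a maximum matching between out- and in-copies."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    match_to = [-1] * n  # match_to[v] = u means edge (u, v) is matched

    def try_augment(u, seen):
        # Kuhn's algorithm: look for an augmenting path starting at u
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if match_to[v] == -1 or try_augment(match_to[v], seen):
                    match_to[v] = u
                    return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching
```

For a simple chain 0→1→2→3 one path suffices, while a fork 0→1, 0→2 needs two.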
34. A Clustering Algorithm for Planning the Integration Process of a Large Number of Conceptual Schemas
- Author
-
Batini, C, Bonizzoni, P, Comerio, M, Dondi, R, Pirola, Y, Salandra, F, Batini, C, Bonizzoni, P, Comerio, M, Dondi, R, Pirola, Y, and Salandra, F
- Abstract
When tens and even hundreds of schemas are involved in the integration process, criteria are needed for choosing clusters of schemas to be integrated, so as to deal with the integration problem through an efficient iterative process. Schemas in clusters should be chosen according to cohesion and coupling criteria that are based on similarities and dissimilarities among schemas. In this paper, we propose an algorithm for a novel variant of the correlation clustering approach that addresses the problem of assisting a designer in integrating a large number of conceptual schemas. The novel variant introduces upper and lower bounds on the number of schemas in each cluster, in order to avoid integration contexts that are too complex or too simple, respectively. Since the underlying combinatorial problem is NP-hard, we give a heuristic for solving it. An experimental activity demonstrates an appreciable increase in the effectiveness of the schema integration process when clusters are computed by the proposed algorithm rather than manually defined by an expert.
- Published
- 2015
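The flavor of a size-bounded clustering heuristic can be conveyed with a short greedy sketch. This is purely illustrative, not the paper's algorithm: it seeds a cluster, absorbs the most similar remaining schemas up to the upper bound, and merges an undersized trailing cluster into the previous one (which, on pathological inputs, may exceed the upper bound).

```python
def bounded_clusters(similarity, n, lower, upper):
    """Greedy size-bounded clustering of schemas 0..n-1, given a
    symmetric similarity matrix. Illustrative heuristic only."""
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        unassigned.discard(seed)
        cluster = [seed]
        while unassigned and len(cluster) < upper:
            # absorb the schema most similar to the current cluster
            best = max(unassigned,
                       key=lambda j: sum(similarity[i][j] for i in cluster))
            cluster.append(best)
            unassigned.discard(best)
        clusters.append(cluster)
    # respect the lower bound by merging an undersized last cluster
    if len(clusters) > 1 and len(clusters[-1]) < lower:
        clusters[-2].extend(clusters.pop())
    return clusters
```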
35. PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text
- Author
-
Pirola Y, Rizzi R, Picardi E, Pesole G, Della Vedova G, and Bonizzoni P
- Published
- 2012
36. Modeling Alternative Splicing Variants from RNA-Seq Data with Isoform Graphs
- Author
-
Beretta, S, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Rizzi, R, BERETTA, STEFANO, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, RIZZI, RAFFAELLA, Beretta, S, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Rizzi, R, BERETTA, STEFANO, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, and RIZZI, RAFFAELLA
- Abstract
Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main developed tools for NGS data focus on de novo transcriptome assembly (Grabherr et al., 2011; Schulz et al., 2012). While these tools are extensively applied in biological investigations, and the obtained results often reveal their intrinsic shortcomings, a theoretical investigation of the inherent computational limits of transcriptome analysis from NGS data, when a reference genome is unknown or highly unreliable, is still missing. On the other hand, we still lack methods for computing the gene structures due to AS events under the above assumptions, a problem that we start to tackle with this article. More precisely, based on the notion of isoform graph (Lacroix et al., 2008), we define a compact representation of gene structures, called a splicing graph, and investigate the computational problem of building a splicing graph that is (i) compatible with NGS data and (ii) isomorphic to the isoform graph. We characterize when there is only one representative splicing graph compatible with the input data, and we propose an efficient algorithmic approach to compute this graph.
- Published
- 2014
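The splicing-graph representation can be illustrated with a toy construction. Here we assume transcripts are already given as sequences of exon labels (the paper's harder problem is inferring the graph from RNA-Seq reads without such annotation): vertices are exons, and a directed edge (a, b) is added whenever exon b immediately follows exon a in some transcript.

```python
def splicing_graph(transcripts):
    """Build a toy splicing graph from exon-label transcripts.
    Returns (vertices, edges): vertices are exon labels, and each
    edge records a consecutive exon pair in some transcript."""
    vertices, edges = set(), set()
    for t in transcripts:
        vertices.update(t)
        edges.update(zip(t, t[1:]))  # consecutive exon pairs
    return vertices, edges
```

For two isoforms sharing their first and last exons, the graph exhibits the classical "diamond" of an exon-skipping event.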
37. Constructing String Graphs in External Memory
- Author
-
Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Previtali, M, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, PREVITALI, MARCO, RIZZI, RAFFAELLA, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Previtali, M, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, PREVITALI, MARCO, and RIZZI, RAFFAELLA
- Abstract
In this paper we present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly. Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest. We have implemented our algorithm and compared its efficiency against SGA, the most advanced assembly string graph construction program.
- Published
- 2014
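A string graph's edges encode suffix-prefix overlaps between reads. The paper computes them in external memory via the BWT and LCP arrays; the naive quadratic in-memory scan below only illustrates what is being computed (overlap detection alone, with transitive reduction omitted).

```python
def overlaps(reads, min_olap):
    """Report all proper suffix-prefix overlaps of length >= min_olap.
    Each edge (i, j, L) means: the last L characters of reads[i] equal
    the first L characters of reads[j]. Naive O(n^2 * m) illustration."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            # try the longest proper overlap first
            for L in range(min(len(a), len(b)) - 1, min_olap - 1, -1):
                if a.endswith(b[:L]):
                    edges.append((i, j, L))
                    break
    return edges
```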
38. Covering Pairs in Directed Acyclic Graphs
- Author
-
Dediu, A-H, Martín-Vide, C, Sierra-Rodríguez, J-L, Truthe, B, Beerenwinkel, N, Beretta, S, Bonizzoni, P, Dondi, R, Pirola, Y, BERETTA, STEFANO, BONIZZONI, PAOLA, PIROLA, YURI, Dediu, A-H, Martín-Vide, C, Sierra-Rodríguez, J-L, Truthe, B, Beerenwinkel, N, Beretta, S, Bonizzoni, P, Dondi, R, Pirola, Y, BERETTA, STEFANO, BONIZZONI, PAOLA, and PIROLA, YURI
- Abstract
The Minimum Path Cover problem on directed acyclic graphs (DAGs) is a classical problem that provides a clear and simple mathematical formulation for several applications in different areas and that has an efficient algorithmic solution. In this paper, we study the computational complexity of two constrained variants of Minimum Path Cover motivated by the recent introduction of next-generation sequencing technologies in bioinformatics. The first problem (MinPCRP), given a DAG and a set of pairs of vertices, asks for a minimum cardinality set of paths “covering” all the vertices such that both vertices of each pair belong to the same path. For this problem, we show that, while it is NP-hard to compute if there exists a solution consisting of at most three paths, it is possible to decide in polynomial time whether a solution consisting of at most two paths exists. The second problem (MaxRPSP), given a DAG and a set of pairs of vertices, asks for a single path containing the maximum number of the given pairs of vertices. We show its NP-hardness and also its W[1]-hardness when parameterized by the number of covered pairs. On the positive side, we give a fixed-parameter algorithm when the parameter is the maximum overlapping degree, a natural parameter in the bioinformatics applications of the problem.
- Published
- 2014
39. Haplotype-based prediction of gene alleles using pedigrees and SNP genotypes
- Author
-
Pirola, Y, DELLA VEDOVA, G, Bonizzoni, P, Stella, A, Biscarini, F, PIROLA, YURI, DELLA VEDOVA, GIANLUCA, BONIZZONI, PAOLA, Biscarini, F., Pirola, Y, DELLA VEDOVA, G, Bonizzoni, P, Stella, A, Biscarini, F, PIROLA, YURI, DELLA VEDOVA, GIANLUCA, BONIZZONI, PAOLA, and Biscarini, F.
- Abstract
Computational methods for gene allele prediction have been proposed to substitute dedicated and expensive assays with cheaper in-silico analyses that operate on routinely collected data, such as SNP genotypes. Most of these methods are tailored to the needs and characteristics of human genetic studies, where they achieve good prediction accuracy. However, genomic analyses are becoming increasingly important in livestock species too. For livestock species, the underlying pedigree, which is usually quite large and complex, is generally known and available; this information is not fully exploited by current allele prediction methods. In this paper, we propose a new gene allele prediction method based on a simple, but robust, combinatorial formulation for the problem of discovering haplotype-allele associations. The inherent uncertainty of the haplotype inference process is reduced by taking into account the inheritance of gene alleles across the population pedigree while genotypes are phased. The accuracy of the method has been extensively evaluated on a representative real-world livestock dataset under several scenarios and choices of parameters. The median error rate ranged from 0.0537 to 0.0896, with an average of 0.0678; this is 21% better than another state-of-the-art prediction algorithm that does not use the pedigree information. The experimental results support the validity of the proposed approach and, in particular, of the use of pedigree information in gene allele predictions.
- Published
- 2013
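A bare-bones stand-in for the haplotype-allele association step can be sketched as a majority-vote mapping from observed haplotypes to gene alleles. This ignores the pedigree-aware phasing that is the paper's actual contribution; all names and data below are illustrative.

```python
from collections import Counter, defaultdict

def learn_associations(training):
    """Learn a haplotype -> gene-allele map by majority vote over
    (haplotype, allele) training pairs. Illustrative sketch only."""
    votes = defaultdict(Counter)
    for hap, allele in training:
        votes[hap][allele] += 1
    return {h: c.most_common(1)[0][0] for h, c in votes.items()}

def predict(model, haplotype, default=None):
    """Predict the allele carried by a haplotype, if it was seen."""
    return model.get(haplotype, default)
```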
40. Parameterized complexity of k-anonymity: hardness and tractability
- Author
-
Bonizzoni, P, DELLA VEDOVA, G, Dondi, R, Pirola, Y, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, Bonizzoni, P, DELLA VEDOVA, G, Dondi, R, Pirola, Y, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, and PIROLA, YURI
- Abstract
The problem of publishing personal data without giving up privacy is becoming increasingly important. A precise formalization that has been recently proposed is k-anonymity, where the rows of a table are partitioned into clusters of size at least k and all rows in a cluster become the same tuple after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is hard even when the stored values are over a binary alphabet or the table consists of a bounded number of columns. In this paper we study how the complexity of the problem is influenced by different parameters. First we show that the problem is W[1]-hard when parameterized by the value of the solution (and k). Then we exhibit a fixed-parameter algorithm when the problem is parameterized by the number of columns and the number of different values in any column. Finally, we prove that k-anonymity is still APX-hard even when restricting to instances with 3 columns and k=3.
- Published
- 2013
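The suppression cost that the optimization problem minimizes can be made concrete: once rows are partitioned into clusters of size at least k, every column that is not constant within a cluster must be suppressed in all of that cluster's rows. The sketch below computes the cost of a given partition; finding the optimal partition is the hard part and is not attempted here.

```python
def suppression_cost(table, clusters, k):
    """Count the entries that must be suppressed so that all rows in
    each cluster become identical. `clusters` partitions row indices;
    clusters smaller than k violate k-anonymity and are rejected."""
    if any(len(c) < k for c in clusters):
        raise ValueError("every cluster must contain at least k rows")
    cost = 0
    for cluster in clusters:
        rows = [table[i] for i in cluster]
        for col in range(len(rows[0])):
            if len({r[col] for r in rows}) > 1:
                cost += len(rows)  # suppress this column in every row
    return cost
```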
41. PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text
- Author
-
Pirola, Y, Rizzi, R, Picardi, E, Pesole, G, DELLA VEDOVA, G, Bonizzoni, P, PIROLA, YURI, RIZZI, RAFFAELLA, DELLA VEDOVA, GIANLUCA, BONIZZONI, PAOLA, Pirola, Y, Rizzi, R, Picardi, E, Pesole, G, DELLA VEDOVA, G, Bonizzoni, P, PIROLA, YURI, RIZZI, RAFFAELLA, DELLA VEDOVA, GIANLUCA, and BONIZZONI, PAOLA
- Abstract
Background: A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools not specifically designed for this task. Results: We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proven small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow inferring splice site junctions largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. The method was implemented in the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. Conclusions: PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under the GNU AGPL. PIntron has been shown to outperform state-of-the-art methods and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked against ENCODE annotations.
- Published
- 2012
42. A fast and practical approach to genotype phasing and imputation on a pedigree with erroneous and incomplete information
- Author
-
Istrail, S, Mandoiu, II, Pop, M, Rajasekaran, S, Spouge, JL, Pirola, Y, DELLA VEDOVA, G, Biffani, S, Stella, A, Bonizzoni, P, PIROLA, YURI, DELLA VEDOVA, GIANLUCA, BONIZZONI, PAOLA, Istrail, S, Mandoiu, II, Pop, M, Rajasekaran, S, Spouge, JL, Pirola, Y, DELLA VEDOVA, G, Biffani, S, Stella, A, Bonizzoni, P, PIROLA, YURI, DELLA VEDOVA, GIANLUCA, and BONIZZONI, PAOLA
- Abstract
In this work, we propose the MIN-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MIN-RECOMBINANT HAPLOTYPE CONFIGURATION formulation by incorporating two common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the soundness of our model and the effectiveness of the algorithm under several scenarios. The analysis on real data and the comparison with state-of-the-art programs reveal that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors into the model.
- Published
- 2012
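The error model can be illustrated on a single biallelic site of a parent-parent-child trio. The hypothetical sketch below computes the minimum number of genotype edits that restore Mendelian consistency at one site; the paper's SAT-based algorithm instead handles whole pedigrees, haplotype phase, recombination, and missing data jointly.

```python
from itertools import product

def site_error_cost(father, mother, child):
    """Minimum number of genotype edits (genotypes are 0/1/2 counts of
    the alternate allele) making one biallelic trio site consistent."""
    def alleles(g):
        # which single alleles a genotype can transmit
        return {0: (0,), 1: (0, 1), 2: (1,)}[g]

    def consistent(f, m, c):
        # the child receives exactly one allele from each parent
        return any(pf + pm == c for pf in alleles(f) for pm in alleles(m))

    best = 3
    for f, m, c in product((0, 1, 2), repeat=3):
        edits = (f != father) + (m != mother) + (c != child)
        if consistent(f, m, c) and edits < best:
            best = edits
    return best
```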
43. A Fast and Practical Approach to Genotype Phasing and Imputation on a Pedigree with Erroneous and Incomplete Information
- Author
-
Pirola, Y, DELLA VEDOVA, G, Biffani, S, Stella, A, Bonizzoni, P, PIROLA, YURI, DELLA VEDOVA, GIANLUCA, BONIZZONI, PAOLA, Pirola, Y, DELLA VEDOVA, G, Biffani, S, Stella, A, Bonizzoni, P, PIROLA, YURI, DELLA VEDOVA, GIANLUCA, and BONIZZONI, PAOLA
- Abstract
The MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances that have improved the efficiency, its applicability to real data sets has been limited since it does not take into account some important phenomena such as mutations, genotyping errors, and missing data. In this work, we propose the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the well-known Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the biological soundness of the phasing model and the effectiveness (on both accuracy and performance) of the algorithm under several scenarios. The analysis on real data and the comparison with state-of-the-art programs reveals that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors into the model.
- Published
- 2012
44. A study of the neutrality of Boolean function landscapes in genetic programming
- Author
-
Vanneschi, L, Pirola, Y, Mauri, G, Tomassini, M, Collard, P, Verel, S, VANNESCHI, LEONARDO, PIROLA, YURI, MAURI, GIANCARLO, Verel, S., Vanneschi, L, Pirola, Y, Mauri, G, Tomassini, M, Collard, P, Verel, S, VANNESCHI, LEONARDO, PIROLA, YURI, MAURI, GIANCARLO, and Verel, S.
- Abstract
The neutrality of genetic programming Boolean function landscapes is investigated in this paper. Compared with some well-known contributions on the same issue, (i) we first define new measures which help in characterizing neutral landscapes; (ii) we use a new sampling methodology, which captures features that are disregarded by uniform random sampling; (iii) we introduce new genetic operators to define the neighborhood of tree structures; and (iv) we compare the fitness landscape induced by different sets of functional operators. This study indicates the existence of a relationship between our neutrality measures and the performance of genetic programming for the problems studied.
- Published
- 2012
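A neutrality measure of the kind studied above can be made concrete on a tiny Boolean-function landscape. In this hypothetical sketch (the paper's measures, operators, and sampling differ), genotypes are formula trees, a one-step mutation swaps a single and/or operator, fitness is agreement with a target truth table, and the neutrality degree of a tree is the fraction of its mutants with unchanged fitness.

```python
def eval_tree(node, env):
    """Evaluate a formula tree: a variable name, ('not', t),
    ('and', l, r), or ('or', l, r)."""
    if isinstance(node, str):
        return env[node]
    if node[0] == "not":
        return 1 - eval_tree(node[1], env)
    a, b = eval_tree(node[1], env), eval_tree(node[2], env)
    return a & b if node[0] == "and" else a | b

def fitness(tree, variables, target):
    """Number of truth-table rows on which `tree` agrees with `target`."""
    score = 0
    for bits in __import__("itertools").product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, bits))
        score += eval_tree(tree, env) == target[bits]
    return score

def mutants(tree):
    """One-step neighbors: swap a single and/or operator somewhere."""
    if isinstance(tree, str):
        return
    if tree[0] in ("and", "or"):
        yield ("or" if tree[0] == "and" else "and",) + tree[1:]
    for i, child in enumerate(tree[1:], start=1):
        for m in mutants(child):
            yield tree[:i] + (m,) + tree[i + 1:]

def neutrality_degree(tree, variables, target):
    """Fraction of one-step mutants whose fitness equals the tree's."""
    base = fitness(tree, variables, target)
    neigh = list(mutants(tree))
    return sum(fitness(m, variables, target) == base for m in neigh) / len(neigh)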
45. An Efficient Algorithm for Haplotype Inference on Pedigrees with Recombinations and Mutations
- Author
-
Pirola, Y, Bonizzoni, P, Jiang, T, PIROLA, YURI, BONIZZONI, PAOLA, Jiang, T., Pirola, Y, Bonizzoni, P, Jiang, T, PIROLA, YURI, BONIZZONI, PAOLA, and Jiang, T.
- Abstract
Haplotype Inference (HI) is a computational challenge of crucial importance in a range of genetic studies. Pedigrees allow haplotypes to be inferred from genotypes more accurately than from population data, since Mendelian inheritance restricts the set of possible solutions. In this work, we define a new HI problem on pedigrees, called MINIMUM-CHANGE HAPLOTYPE CONFIGURATION (MCHC) problem, that allows two types of genetic variation events: recombinations and mutations. Our new formulation extends the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION (MRHC) problem, that has been proposed in the literature to overcome the limitations of classic statistical haplotyping methods. Our contribution is twofold. First, we prove that the MCHC problem is APX-hard under several restrictions. Second, we propose an efficient and accurate heuristic algorithm for MCHC based on an L-reduction to a well-known coding problem. Our heuristic can also be used to solve the original MRHC problem and can take advantage of additional knowledge about the input genotypes. Moreover, the L-reduction proves for the first time that MCHC and MRHC are O(nm/(log nm))-approximable on general pedigrees, where n is the pedigree size and m is the genotype length. Finally, we present an extensive experimental evaluation and comparison of our heuristic algorithm with several other state-of-the-art methods for HI on pedigrees.
- Published
- 2012
46. Parameterized Complexity of k-Anonymity: Hardness and Tractability
- Author
-
Bonizzoni, P, DELLA VEDOVA, G, Dondi, R, Pirola, Y, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, Bonizzoni, P, DELLA VEDOVA, G, Dondi, R, Pirola, Y, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, and PIROLA, YURI
- Abstract
The problem of publishing personal data without giving up privacy is becoming increasingly important. A precise formalization that has been recently proposed is k-anonymity, where the rows of a table are partitioned into clusters of size at least k and all rows in a cluster become the same tuple after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is hard even when the stored values are over a binary alphabet or the table consists of a bounded number of columns. In this paper we study how the complexity of the problem is influenced by different parameters. First we show that the problem is W[1]-hard when parameterized by the value of the solution (and k). Then we exhibit a fixed-parameter algorithm when the problem is parameterized by the number of columns and the number of different values in any column.
- Published
- 2011
47. PIntron: A fast method for gene structure prediction via maximal pairings of a pattern and a text
- Author
-
Mandoiu, I, Miyano, S, Przytycka, T, Rajasekaaran, S, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, RIZZI, RAFFAELLA, Mandoiu, I, Miyano, S, Przytycka, T, Rajasekaaran, S, Bonizzoni, P, DELLA VEDOVA, G, Pirola, Y, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, and RIZZI, RAFFAELLA
- Abstract
A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences is guaranteeing both accuracy and efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools not specifically designed for this task. We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proven small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow inferring splice site junctions largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron and it is able to process in a few seconds some critical genes that are not manageable by other gene structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data.
- Published
- 2011
48. Combinatorial problems in studies of genetic variations: haplotyping and transcript analysis
- Author
-
Pirola, Y, BONIZZONI, PAOLA, PIROLA, YURI, Pirola, Y, BONIZZONI, PAOLA, and PIROLA, YURI
- Abstract
After the first draft of the human genome was published, the study of genetic variations has attracted increasing attention as a medium to map observable characteristics (phenotypic traits) of individuals to their underlying genes and biological processes. Unfortunately, observing and obtaining directly the genetic data of interest is a long and expensive operation, especially for large populations. Less informative sources of data are instead available at a fraction of the cost. Computational methods are then called upon to recover the original data of interest from these less informative observations. The design of efficient combinatorial algorithms to infer relevant data for genetic variation studies starting from less informative observations is the main aim of the work of this thesis. In particular, we focused on two different kinds of data of crucial importance for genetic variation studies: haplotypes and transcripts. In the first part of the thesis, we studied the Haplotype Inference (HI) problem that requires to infer, for each individual of a population, the genetic sequences inherited from each parent (haplotypes) starting from their conflation (genotype). We formulated and analyzed two different computational problems of haplotype inference: the pure parsimony xor-haplotyping problem (PPXH) and the minimum event haplotype configuration problem (MinEHC). In the first problem, the haplotype inference process is guided by the pure parsimony criterion and starts from a recently proposed representation of genotypes, called xor-genotypes. For this problem, we presented exact and approximate solutions by providing (i) polynomial time exact algorithms for some restricted cases, (ii) a fixed-parameter algorithm for the general case, and (iii) a polynomial time k-approximation algorithm, where k is the maximum number of heterozygous individuals at a given marker locus. These results are based on some interesting combinatorial properties of a graph representation of
- Published
- 2010
49. Haplotype Inference on Pedigrees with Recombinations and Mutations
- Author
-
Pirola, Y, Bonizzoni, P, Jiang, T, PIROLA, YURI, BONIZZONI, PAOLA, Jiang, T., Pirola, Y, Bonizzoni, P, Jiang, T, PIROLA, YURI, BONIZZONI, PAOLA, and Jiang, T.
- Abstract
Haplotype Inference (HI) is a computational challenge of crucial importance in a range of genetic studies, such as functional genomics, pharmacogenetics and population genetics. Pedigrees have been shown to be valuable data that allow us to infer haplotypes from genotypes more accurately than population data, since Mendelian inheritance restricts the set of possible solutions. In order to overcome the limitations of classic statistical haplotyping methods, a combinatorial formulation of the HI problem on pedigrees has been proposed in the literature, called the Minimum-Recombinant Haplotype Configuration (MRHC) problem, that allows a single type of genetic variation event, namely recombinations. In this work, we define a new problem, called Minimum-Change Haplotype Configuration (MCHC), which extends the MRHC formulation by allowing also a second type of natural variation event: mutations. We propose an efficient and accurate heuristic algorithm for MCHC based on an L-reduction to a well-known coding problem. Our heuristic can also be used to solve the original MRHC problem and it can take advantage of additional knowledge about the input genotypes, such as the presence of recombination hotspots and different rates of recombinations and mutations. Finally, we present an extensive experimental evaluation and comparison of our heuristic algorithm with several other state-of-the-art methods for HI on pedigrees under several simulated scenarios. © 2010 Springer-Verlag.
- Published
- 2010
50. Pure parsimony xor haplotyping
- Author
-
Bonizzoni, P, DELLA VEDOVA, G, Dondi, R, Pirola, Y, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, Rizzi, R., Bonizzoni, P, DELLA VEDOVA, G, Dondi, R, Pirola, Y, Rizzi, R, BONIZZONI, PAOLA, DELLA VEDOVA, GIANLUCA, PIROLA, YURI, and Rizzi, R.
- Abstract
The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper, we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given single nucleotide polymorphism (SNP). Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.
- Published
- 2010
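The model can be sketched concretely: a set H of haplotypes explains a xor-genotype x (the bitmask of heterozygous sites) when some h in H has its partner h XOR x also in H, and pure parsimony asks for a minimum such H. The exhaustive solver below, encoding haplotypes as integers, is only a brute-force illustration for tiny instances; the paper's algorithms are far more efficient.

```python
from itertools import combinations

def explains(haplotypes, xor_genotypes):
    """H explains xor-genotype x iff some h in H has h XOR x in H too.
    Haplotypes are bit-encoded integers; x masks heterozygous sites."""
    return all(any(h ^ x in haplotypes for h in haplotypes)
               for x in xor_genotypes)

def pure_parsimony_xor(xor_genotypes, m):
    """Brute-force pure parsimony over m sites: try haplotype sets of
    increasing size and return the first one explaining every input."""
    for size in range(1, 2 ** m + 1):
        for subset in combinations(range(2 ** m), size):
            if explains(set(subset), xor_genotypes):
                return set(subset)
    return None
```

Two xor-genotypes with different nonzero masks cannot be explained by fewer than three haplotypes, which the brute force confirms.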