13 results on '"Pirola, Y"'
Search Results
2. MALVIRUS: an integrated application for viral variant analysis
- Author
-
Ciccolella, S, Denti, L, Bonizzoni, P, Della Vedova, G, Pirola, Y, and Previtali, M
- Abstract
Background: Being able to efficiently call variants from the increasing amount of sequencing data produced daily from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe. Results: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as SARS-CoV-2. MALVIRUS allows users to: (1) construct a variant catalog consisting of a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. Conclusions: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often has better recall.
- Published
- 2022
3. Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches
- Author
-
Bonizzoni, P, Costantini, M, De Felice, C, Petescia, A, Pirola, Y, Previtali, M, Rizzi, R, Stoye, J, Zaccagnino, R, and Zizza, R
- Abstract
Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings [1]. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in capturing sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the k-mers extracted from them, called k-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads in order to assign them to the most likely gene from which they originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
- Published
- 2022
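The fingerprint and k-finger notions described in the abstract above can be illustrated with a minimal Python sketch. It assumes the standard Lyndon factorization computed with Duval's algorithm; the abstract refers to variants of the factorization, which are not reproduced here, and this is not the lyn2vec implementation. The function names `lyndon_factorization`, `fingerprint`, and `k_fingers` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def lyndon_factorization(s: str) -> List[str]:
    """Standard Lyndon factorization via Duval's algorithm (linear time).

    Assumption: plain lexicographic order on characters; the variants
    mentioned in the abstract are NOT implemented here.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a
        # power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i      # strictly larger character: restart comparison
            else:
                k += 1     # equal character: continue the periodic match
            j += 1
        # Emit the repeated Lyndon factor of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read: str) -> List[int]:
    """Sequence of lengths of consecutive Lyndon factors of the read."""
    return [len(f) for f in lyndon_factorization(read)]


def k_fingers(fp: List[int], k: int) -> List[Tuple[int, ...]]:
    """All windows of k consecutive fingerprint values (the 'k-mers' of fp)."""
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]


read = "GATTACA"  # toy read, for illustration only
print(lyndon_factorization(read))      # ['G', 'ATT', 'AC', 'A']
print(fingerprint(read))               # [1, 3, 2, 1]
print(k_fingers(fingerprint(read), 2)) # [(1, 3), (3, 2), (2, 1)]
```

Each k-finger is a short tuple of integers, so it can be hashed or counted like an ordinary k-mer when building a numeric feature vector for a read.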
4. Identification of Chimeric RNAs: a novel machine learning perspective
- Author
-
Bonizzoni, P, De Felice, C, Pirola, Y, Rizzi, R, Zaccagnino, R, and Zizza, R
- Abstract
Chimeric RNAs are transcripts generated by gene fusion and intergenic splicing events, thus comprising nucleotide sequences from different genes. Recent studies have shown that some chimeric RNAs can play a role in cancer development, and so can be used as diagnostic biomarkers when specifically expressed in cancerous cells and tissues. Most gene fusion prediction tools rely on an initial alignment step. However, alignments might be biased, especially for chimeric reads, creating many false positives. Therefore, developing alignment-free prediction methods for fusion genes would be helpful and may provide new insights into the genomic breakage phenomenon in the cell. In this direction, machine learning could pave the way for new solutions, given its success in predicting genomic regulatory elements and alternative junction events from the genomic context. To date, however, these techniques have had a marginal supporting role, and, furthermore, manually curated data sets, which are crucial for model training, are often expensive, unreliable or simply unavailable. Here we propose a novel ML-based method that learns to recognize the hidden patterns that identify chimeric RNAs deriving from oncogenic gene fusions. Preliminary comparison with another state-of-the-art method shows promising results.
- Published
- 2023
5. Multiallelic Maximal Perfect Haplotype Blocks with Wildcards via PBWT
- Author
-
Rojas, I, Valenzuela, O, Rojas Ruiz, F, Herrera, LJ, Ortuño, F, Bonizzoni, P, Della Vedova, G, Pirola, Y, Rizzi, R, and Sgrò, M
- Abstract
Computing maximal perfect blocks of a given panel of haplotypes is a crucial task for efficiently solving problems such as polyploid haplotype reconstruction and finding identical-by-descent segments shared among individuals of a population. Unfortunately, the presence of missing data in the haplotype panel limits the usefulness of the notion of perfect blocks. We propose a novel algorithm for computing maximal blocks in a panel with missing data (represented as wildcards). The algorithm is based on the Positional Burrows-Wheeler Transform (PBWT) and has been implemented in the tool Wild-pBWT, available at https://github.com/AlgoLab/Wild-pBWT/. Experimental comparison showed that Wild-pBWT is 10–15 times faster than another state-of-the-art approach, while using a negligible amount of memory.
- Published
- 2023
6. Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?
- Author
-
Bonizzoni, Paola, De Felice, Clelia, Pirola, Yuri, Rizzi, Raffaella, Zaccagnino, Rocco, and Zizza, Rosalba
- Abstract
Graph pangenomics is an emerging field in computational biology that is changing the traditional view of a reference genome from a linear sequence to a new paradigm: a sequence graph (pangenome graph or simply pangenome) that represents the main similarities and differences in multiple evolutionarily related genomes. The speed in producing large amounts of genome data, driven by advances in sequencing technologies, far outpaces the slow progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field are still based on notions rooted in established and quite old literature on combinatorics on words, formal languages and space-efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion is related to variants of the Lyndon factorization and allows sequence similarities to be represented for a sample of reads, while the second one is that of sample specific string as a tool to detect differences in a sample of reads. New perspectives opened by these two notions are discussed.
- Published
- 2022
7. Computational graph pangenomics: a tutorial on data structures and their applications
- Author
-
Baaijens, Jasmijn A., Bonizzoni, Paola, Boucher, Christina, Della Vedova, Gianluca, Pirola, Yuri, Rizzi, Raffaella, and Sirén, Jouni
- Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations—thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
- Published
- 2022
8. KFinger: Capturing Overlaps Between Long Reads by Using Lyndon Fingerprints
- Author
-
Bonizzoni, P, Petescia, A, Pirola, Y, Rizzi, R, Zaccagnino, R, and Zizza, R
- Abstract
Detecting common regions and overlaps between DNA sequences is crucial in many Bioinformatics tasks. One of them is genome assembly based on the use of the overlap graph which is constructed by detecting the overlap between genomic reads. When dealing with long reads this task is further complicated by the length of the reads and the high sequencing error rate. This paper proposes a novel alignment-free method for detecting the overlaps in a set of long reads which exploits a signature (called fingerprint) of reads built from a factorization of the read based on the notion of Lyndon words. The method has been implemented in the tool KFinger and tested over a simulated and a real PacBio HiFi dataset of genomic reads; its results have been compared with the well-known aligner Minimap2. KFinger is available at https://github.com/AlgoLab/kfinger.
- Published
- 2022
9. KFinger: Capturing Overlaps Between Long Reads by Using Lyndon Fingerprints
- Author
-
Rocco Zaccagnino, Paola Bonizzoni, Rosalba Zizza, Alessia Petescia, Raffaella Rizzi, and Yuri Pirola
- Subjects
- Fingerprint, INF/01 - INFORMATICA, Overlap graph, Factorization, Lyndon word, Long read
- Abstract
Detecting common regions and overlaps between DNA sequences is crucial in many Bioinformatics tasks. One of them is genome assembly based on the use of the overlap graph which is constructed by detecting the overlap between genomic reads. When dealing with long reads this task is further complicated by the length of the reads and the high sequencing error rate. This paper proposes a novel alignment-free method for detecting the overlaps in a set of long reads which exploits a signature (called fingerprint) of reads built from a factorization of the read based on the notion of Lyndon words. The method has been implemented in the tool KFinger and tested over a simulated and a real PacBio HiFi dataset of genomic reads; its results have been compared with the well-known aligner Minimap2. KFinger is available at https://github.com/AlgoLab/kfinger.
- Published
- 2022
10. Computational graph pangenomics: A tutorial on data structures and their applications
- Author
-
Baaijens, Jasmijn A., Bonizzoni, Paola, Boucher, Christina, Della Vedova, Gianluca, Pirola, Yuri, Rizzi, Raffaella, and Sirén, Jouni
- Subjects
- algorithm, INF/01 - INFORMATICA, pangenomic, Computer Science Applications
- Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations—thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
- Published
- 2022
11. RecGraph: recombination-aware alignment of sequences to variation graphs.
- Author
-
Avila Cartes J, Bonizzoni P, Ciccolella S, Della Vedova G, Denti L, Didelot X, Monti DC, and Pirola Y
- Subjects
- Humans, Software, Sequence Analysis, DNA methods, Genomics methods, Recombination, Genetic, Sequence Alignment methods, Algorithms, Genome, Bacterial
- Abstract
Motivation: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. Results: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination; we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. Availability and Implementation: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph. © The Author(s) 2024. Published by Oxford University Press.
- Published
- 2024
12. MALVIRUS: an integrated application for viral variant analysis.
- Author
-
Ciccolella S, Denti L, Bonizzoni P, Della Vedova G, Pirola Y, and Previtali M
- Subjects
- Genome, Viral, High-Throughput Nucleotide Sequencing, Humans, Mutation, Pandemics, Phylogeny, SARS-CoV-2 genetics, COVID-19
- Abstract
Background: Being able to efficiently call variants from the increasing amount of sequencing data produced daily from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe. Results: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as SARS-CoV-2. MALVIRUS allows users to: (1) construct a variant catalog consisting of a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations. Conclusions: Tests on Illumina and Nanopore samples proved the efficiency and the effectiveness of MALVIRUS in analyzing SARS-CoV-2 strain samples with respect to publicly available data provided by NCBI and the more complete dataset provided by GISAID. A comparison with state-of-the-art tools showed that MALVIRUS is always more precise and often has better recall. © 2022. The Author(s).
- Published
- 2022
13. Computational graph pangenomics: a tutorial on data structures and their applications.
- Author
-
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, and Sirén J
- Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
- Published
- 2022
Discovery Service for Jio Institute Digital Library