Descriptor: "Alignment" / Journal: bmc bioinformatics - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Alignment"' showing total 143 results

Start Over Descriptor "Alignment" Journal bmc bioinformatics

143 results on '"Alignment"'

1. A mapping-free natural language processing-based technique for sequence search in nanopore long-reads.

Author: Strzoda, Tomasz, Cruz-Garcia, Lourdes, Najim, Mustafa, Badie, Christophe, and Polanska, Joanna
Abstract: Background: In unforeseen situations, such as nuclear power plant's or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. The new generation in-situ mapper, fast and of low energy consumption, working at the level of single nanopore output, is in demand. We aim to create a sequence identification tool that utilizes natural language processing techniques and ensures a high level of negative predictive value (NPV) compared to the classical approach. Results: The training dataset consisted of RNA sequencing data from 6 samples. Multiple natural language processing models were examined, differing in the type of dictionary components (word length, step, context) as well as the encoding length and number of sequences required for algorithm training. The best configuration analyses the entire sequence and uses a word length of 3 base pairs with one-word neighbor on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and NPV was 99.25%, compared to minimap2's performance in a cross-validation scenario. The next stage focused on exploring the dictionary components and attempting to optimize it, employing statistical techniques as well as those relying on the explainability of the decisions made. Reducing the dictionary from 1024 to 145 changed BACC to 96.49% and the NPV to 98.15%. Obtained model, validated on an external independent genome sequencing dataset, gave NPV of 99.64% for complete and 95.87% for reduced dictionary. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one. Conclusions: We conclude that for long Oxford nanopore reads, a natural language processing-based approach can reliably replace classical mapping when there is a need for fast, reliable and energy and computationally efficient targeted mapping of a pre-defined subset of transcripts. The developed model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. Our results of the study clearly demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

2. A mapping-free natural language processing-based technique for sequence search in nanopore long-reads

Author: Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, and Joanna Polanska
Subjects: Natural language processing, Machine learning, Sequencing, Alignment, Transcriptomics, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background In unforeseen situations, such as nuclear power plant’s or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. The new generation in-situ mapper, fast and of low energy consumption, working at the level of single nanopore output, is in demand. We aim to create a sequence identification tool that utilizes natural language processing techniques and ensures a high level of negative predictive value (NPV) compared to the classical approach. Results The training dataset consisted of RNA sequencing data from 6 samples. Multiple natural language processing models were examined, differing in the type of dictionary components (word length, step, context) as well as the encoding length and number of sequences required for algorithm training. The best configuration analyses the entire sequence and uses a word length of 3 base pairs with one-word neighbor on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and NPV was 99.25%, compared to minimap2’s performance in a cross-validation scenario. The next stage focused on exploring the dictionary components and attempting to optimize it, employing statistical techniques as well as those relying on the explainability of the decisions made. Reducing the dictionary from 1024 to 145 changed BACC to 96.49% and the NPV to 98.15%. Obtained model, validated on an external independent genome sequencing dataset, gave NPV of 99.64% for complete and 95.87% for reduced dictionary. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one. Conclusions We conclude that for long Oxford nanopore reads, a natural language processing-based approach can reliably replace classical mapping when there is a need for fast, reliable and energy and computationally efficient targeted mapping of a pre-defined subset of transcripts. The developed model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. Our results of the study clearly demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences.
Published: 2024
Full Text: View/download PDF

3. A phylogenetic approach for weighting genetic sequences

Author: De Maio, Nicola, Alekseyenko, Alexander V, Coleman-Smith, William J, Pardi, Fabio, Suchard, Marc A, Tamuri, Asif U, Truszkowski, Jakub, and Goldman, Nick
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Evolutionary Biology, Genetics, Generic health relevance, Algorithms, Computational Biology, Phylogeny, Sequence Alignment, Phylogenetics, Sequence weights, Alignment, Protein profile, Conservation scores, Mathematical Sciences, Information and Computing Sciences, Bioinformatics, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: BackgroundMany important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented.ResultsWe formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column-important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes.ConclusionsOur phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.
Published: 2021

4. ElasticBLAST: accelerating sequence search via cloud computing

Author: Christiam Camacho, Grzegorz M. Boratyn, Victor Joukov, Roberto Vera Alvarez, and Thomas L. Madden
Subjects: BLAST, Cloud computing, Alignment, Kubernetes, AWS Batch, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Biomedical researchers use alignments produced by BLAST (Basic Local Alignment Search Tool) to categorize their query sequences. Producing such alignments is an essential bioinformatics task that is well suited for the cloud. The cloud can perform many calculations quickly as well as store and access large volumes of data. Bioinformaticians can also use it to collaborate with other researchers, sharing their results, datasets and even their pipelines on a common platform. Results We present ElasticBLAST, a cloud native application to perform BLAST alignments in the cloud. ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs (if desired), deleting resources when it is done. It uses cloud native tools for orchestration and can request discounted instances, lowering cloud costs for users. It is supported on Amazon Web Services and Google Cloud Platform. It can search BLAST databases that are user provided or from the National Center for Biotechnology Information. Conclusion We show that ElasticBLAST is a useful application that can efficiently perform BLAST searches for the user in the cloud, demonstrating that with two examples. At the same time, it hides much of the complexity of working in the cloud, lowering the threshold to move work to the cloud.
Published: 2023
Full Text: View/download PDF

5. nPoRe: n-polymer realigner for improved pileup-based variant calling

Author: Tim Dunn, David Blaauw, Reetuparna Das, and Satish Narayanasamy
Subjects: Germline variant calling, Alignment, N-polymer, Homopolymer, Short tandem repeat, Copy number, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Despite recent improvements in nanopore basecalling accuracy, germline variant calling of small insertions and deletions (INDELs) remains poor. Although precision and recall for single nucleotide polymorphisms (SNPs) now exceeds 99.5%, INDEL recall remains below 80% for standard R9.4.1 flow cells. We show that read phasing and realignment can recover a significant portion of false negative INDELs. In particular, we extend Needleman-Wunsch affine gap alignment by introducing new gap penalties for more accurately aligning repeated n-polymer sequences such as homopolymers ( $$n=1$$ n = 1 ) and tandem repeats ( $$2 \le n \le 6$$ 2 ≤ n ≤ 6 ). At the same precision, haplotype phasing improves INDEL recall from 63.76 to $$70.66\%$$ 70.66 % and nPoRe realignment improves it further to $$73.04\%$$ 73.04 % .
Published: 2023
Full Text: View/download PDF

6. ADEPT: a domain independent sequence alignment strategy for gpu architectures

Author: Awan, Muaaz G, Deslippe, Jack, Buluc, Aydin, Selvitopi, Oguz, Hofmeyr, Steven, Oliker, Leonid, and Yelick, Katherine
Subjects: Biological Sciences, Bioinformatics and Computational Biology, Networking and Information Technology R&D (NITRD), Human Genome, Genetics, Bioengineering, Biotechnology, Generic health relevance, Algorithms, Computational Biology, Humans, Sequence Alignment, Bioinformatics, GPU, Alignment, Protein, DNA, Mathematical Sciences, Information and Computing Sciences, Biological sciences, Information and computing sciences, Mathematical sciences
Abstract: BackgroundBioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic programming based method. With the advent of modern sequencing technologies and increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator based architectures, a need for an efficient GPU accelerated strategy has emerged. Existing GPU based strategies have either been optimized for a specific type of characters (Nucleotides or Amino Acids) or for only a handful of application use-cases.ResultsIn this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU specific optimizations that do not rely on the nature of sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT's driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large scale computational systems. We have shown that the ADEPT based Smith-Waterman algorithm demonstrates a peak performance of 360 GCUPS and 497 GCUPs for protein based and DNA based datasets respectively on a single GPU node (8 GPUs) of the Cori Supercomputer. Overall ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation.ConclusionsADEPT demonstrates a performance that is either comparable or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bionformatics software pipelines by integrating ADEPT in MetaHipMer a high-performance denovo metagenome assembler and PASTIS a high-performance protein similarity graph construction pipeline. Our results show 10% and 30% boost of performance in MetaHipMer and PASTIS respectively.
Published: 2020

7. ElasticBLAST: accelerating sequence search via cloud computing.

Author: Camacho, Christiam, Boratyn, Grzegorz M., Joukov, Victor, Vera Alvarez, Roberto, and Madden, Thomas L.
Subjects: *WEB services, *CLOUD computing, *DATABASE searching
Abstract: Background: Biomedical researchers use alignments produced by BLAST (Basic Local Alignment Search Tool) to categorize their query sequences. Producing such alignments is an essential bioinformatics task that is well suited for the cloud. The cloud can perform many calculations quickly as well as store and access large volumes of data. Bioinformaticians can also use it to collaborate with other researchers, sharing their results, datasets and even their pipelines on a common platform. Results: We present ElasticBLAST, a cloud native application to perform BLAST alignments in the cloud. ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs (if desired), deleting resources when it is done. It uses cloud native tools for orchestration and can request discounted instances, lowering cloud costs for users. It is supported on Amazon Web Services and Google Cloud Platform. It can search BLAST databases that are user provided or from the National Center for Biotechnology Information. Conclusion: We show that ElasticBLAST is a useful application that can efficiently perform BLAST searches for the user in the cloud, demonstrating that with two examples. At the same time, it hides much of the complexity of working in the cloud, lowering the threshold to move work to the cloud. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

8. nPoRe: n-polymer realigner for improved pileup-based variant calling.

Author: Dunn, Tim, Blaauw, David, Das, Reetuparna, and Narayanasamy, Satish
Subjects: *SINGLE nucleotide polymorphisms, *TANDEM repeats, *HAPLOTYPES, *MICROSATELLITE repeats
Abstract: Despite recent improvements in nanopore basecalling accuracy, germline variant calling of small insertions and deletions (INDELs) remains poor. Although precision and recall for single nucleotide polymorphisms (SNPs) now exceeds 99.5%, INDEL recall remains below 80% for standard R9.4.1 flow cells. We show that read phasing and realignment can recover a significant portion of false negative INDELs. In particular, we extend Needleman-Wunsch affine gap alignment by introducing new gap penalties for more accurately aligning repeated n-polymer sequences such as homopolymers ( n = 1 ) and tandem repeats ( 2 ≤ n ≤ 6 ). At the same precision, haplotype phasing improves INDEL recall from 63.76 to 70.66 % and nPoRe realignment improves it further to 73.04 % . [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

9. Global, highly specific and fast filtering of alignment seeds

Author: Matthis Ebel, Giovanna Migliorelli, and Mario Stanke
Subjects: Spaced seeds, Alignment, Alignment anchor, Genome alignment, Geometric hashing, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. Results We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity. Conclusions An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
Published: 2022
Full Text: View/download PDF

10. MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

Author: Andrea Hita, Gilles Brocart, Ana Fernandez, Marc Rehmsmeier, Anna Alemany, and Sol Schvartzman
Subjects: RNA-seq, Non-coding, Small RNA, Alignment, NGS, Quantification, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet, computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fullfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may cause ambiguous genomic alignments, demanding for flexible quantification approaches. Results Here we present Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position. Next, MGcount aggregates RNA products with similar sequences where reads systematically multi-map using a graph-based approach. MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, with both bulk and single-cell resolution, and the graphs that model repeated transcript structures for different biotypes. The software can be used as a python module or as a single-file executable program. Conclusions MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms. Both source code and compiled software are available at https://github.com/hitaandrea/MGcount .
Published: 2022
Full Text: View/download PDF

11. Solving the master equation for Indels

Author: Holmes, Ian H
Subjects: Algorithms, Computer Simulation, INDEL Mutation, Markov Chains, Models, Theoretical, Monte Carlo Method, Phylogeny, Sequence Alignment, Phylogenetics, Alignment, Indels, Mathematical Sciences, Biological Sciences, Information and Computing Sciences, Bioinformatics
Abstract: BackgroundDespite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events.Main textThis paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees.ConclusionsWhile approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
Published: 2017

12. Global, highly specific and fast filtering of alignment seeds.

Author: Ebel, Matthis, Migliorelli, Giovanna, and Stanke, Mario
Subjects: *SEEDS, *HUMAN genome, *FILTERS & filtration
Abstract: Background: An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. Results: We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity. Conclusions: An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

13. MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts.

Author: Hita, Andrea, Brocart, Gilles, Fernandez, Ana, Rehmsmeier, Marc, Alemany, Anna, and Schvartzman, Sol
Subjects: *RNA sequencing, *SOURCE code, *AMBIGUITY, *PYTHON programming language, *NON-coding RNA, *GENE libraries, *TRANSCRIPTOMES, *SEQUENCE alignment
Abstract: Background: Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet, computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fullfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may cause ambiguous genomic alignments, demanding for flexible quantification approaches. Results: Here we present Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position. Next, MGcount aggregates RNA products with similar sequences where reads systematically multi-map using a graph-based approach. MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, with both bulk and single-cell resolution, and the graphs that model repeated transcript structures for different biotypes. The software can be used as a python module or as a single-file executable program. Conclusions: MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms. Both source code and compiled software are available at https://github.com/hitaandrea/MGcount. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

14. Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation

Author: Alejandro A. Schäffer, Richard McVeigh, Barbara Robbertse, Conrad L. Schoch, Anjanette Johnston, Beverly A. Underwood, Ilene Karsch-Mizrachi, and Eric P. Nawrocki
Subjects: Ribosomal RNA, Annotation, Alignment, ncRNA, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene, and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron. Results To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used for checking incoming GenBank submissions and used by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8350 taxa. Conclusion Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank.
Published: 2021
Full Text: View/download PDF

15. Fast and SNP-aware short read alignment with SALT

Author: Wei Quan, Bo Liu, and Yadong Wang
Subjects: NGS, Alignment, SNP-aware, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variations in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome and a large number of genetic variants. However, compared to linear reference aligners, an aligner that can store and index all genetic variants has a high cost in memory (RAM) space and leads to extremely long run time. Aligning reads to a graph-model-based index that includes all types of variants is ultimately an NP-hard problem in theory. By contrast, considering only single nucleotide polymorphism (SNP) information will reduce the complexity of the index and improve the speed of sequence alignment. Results The SNP-aware alignment tool (SALT) is a fast, memory-efficient, and SNP-aware short read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) and incorporates 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has a similar speed but higher accuracy. Conclusions Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome that incorporates an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and can reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT .
Published: 2021
Full Text: View/download PDF

16. A phylogenetic approach for weighting genetic sequences

Author: Nicola De Maio, Alexander V. Alekseyenko, William J. Coleman-Smith, Fabio Pardi, Marc A. Suchard, Asif U. Tamuri, Jakub Truszkowski, and Nick Goldman
Subjects: Phylogenetics, Sequence weights, Alignment, Protein profile, Conservation scores, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are ‘novel’ compared to the others in the same dataset, and low weights to sequences that are over-represented. Results We formalise this principle by rigorously defining the evolutionary ‘novelty’ of a sequence within an alignment. This results in new sequence weights that we call ‘phylogenetic novelty scores’. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column—important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes. Conclusions Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.
Published: 2021
Full Text: View/download PDF

17. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels

Author: Maria Zanti, Kyriaki Michailidou, Maria A. Loizidou, Christina Machattou, Panagiota Pirpa, Kyproula Christodoulou, George M. Spyrou, Kyriacos Kyriacou, and Andreas Hadjisavvas
Subjects: Next-generation sequencing (NGS), Germline NGS data analysis, Variant calling, Alignment, Interval padding, Pipeline comparison, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis. Results We highlight that interval padding is required for the accurate detection of intronic variants including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm, failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms; GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls. Conclusions These findings have important implications towards the identification of clinically actionable variants through panel testing in a clinical laboratory setting, when dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or at the same time developing new pipelines to generate more reliable and more consistent data.
Published: 2021
Full Text: View/download PDF

18. ADEPT: a domain independent sequence alignment strategy for gpu architectures

Author: Muaaz G. Awan, Jack Deslippe, Aydin Buluc, Oguz Selvitopi, Steven Hofmeyr, Leonid Oliker, and Katherine Yelick
Subjects: Bioinformatics, GPU, Alignment, Protein, DNA, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic programming based method. With the advent of modern sequencing technologies and increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator based architectures, a need for an efficient GPU accelerated strategy has emerged. Existing GPU based strategies have either been optimized for a specific type of characters (Nucleotides or Amino Acids) or for only a handful of application use-cases. Results In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU specific optimizations that do not rely on the nature of sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT’s driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large scale computational systems. We have shown that the ADEPT based Smith-Waterman algorithm demonstrates a peak performance of 360 GCUPS and 497 GCUPs for protein based and DNA based datasets respectively on a single GPU node (8 GPUs) of the Cori Supercomputer. Overall ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation. Conclusions ADEPT demonstrates a performance that is either comparable or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bionformatics software pipelines by integrating ADEPT in MetaHipMer a high-performance denovo metagenome assembler and PASTIS a high-performance protein similarity graph construction pipeline. Our results show 10% and 30% boost of performance in MetaHipMer and PASTIS respectively.
Published: 2020
Full Text: View/download PDF

19. VADR: validation and annotation of virus sequence submissions to GenBank

Author: Alejandro A. Schäffer, Eneida L. Hatcher, Linda Yankie, Lara Shonkwiler, J. Rodney Brister, Ilene Karsch-Mizrachi, and Eric P. Nawrocki
Subjects: Annotation, Virus, Alignment, ncRNA, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. Results We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of “alerts” that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank’s submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available ( https://github.com/nawrockie/vadr ) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. Conclusion VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.
Published: 2020
Full Text: View/download PDF

20. Fast and SNP-aware short read alignment with SALT.

Author: Quan, Wei, Liu, Bo, and Wang, Yadong
Subjects: *SEQUENCE alignment, *GENETIC variation, *NUCLEOTIDE sequencing, *DNA sequencing, *LINEAR operators, *HUMAN genome
Abstract: Background: DNA sequence alignment is a common first step in most applications of high-throughput sequencing technologies. The accuracy of sequence alignments directly affects the accuracy of downstream analyses, such as variant calling and quantitative analysis of transcriptome; therefore, rapidly and accurately mapping reads to a reference genome is a significant topic in bioinformatics. Conventional DNA read aligners map reads to a linear reference genome (such as the GRCh38 primary assembly). However, such a linear reference genome represents the genome of only one or a few individuals and thus lacks information on variations in the population. This limitation can introduce bias and impact the sensitivity and accuracy of mapping. Recently, a number of aligners have begun to map reads to populations of genomes, which can be represented by a reference genome and a large number of genetic variants. However, compared to linear reference aligners, an aligner that can store and index all genetic variants has a high cost in memory (RAM) space and leads to extremely long run time. Aligning reads to a graph-model-based index that includes all types of variants is ultimately an NP-hard problem in theory. By contrast, considering only single nucleotide polymorphism (SNP) information will reduce the complexity of the index and improve the speed of sequence alignment. Results: The SNP-aware alignment tool (SALT) is a fast, memory-efficient, and SNP-aware short read alignment tool. SALT uses 5.8 GB of RAM to index a human reference genome (GRCh38) and incorporates 12.8M UCSC common SNPs. Compared with a state-of-the-art aligner, SALT has a similar speed but higher accuracy. Conclusions: Herein, we present an SNP-aware alignment tool (SALT) that aligns reads to a reference genome that incorporates an SNP database. We benchmarked SALT using simulated and real datasets. The results demonstrate that SALT can efficiently map reads to the reference genome with significantly improved accuracy. Incorporating SNP information can improve the accuracy of read alignment and can reveal novel variants. The source code is freely available at https://github.com/weiquan/SALT. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

21. Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation.

Author: Schäffer, Alejandro A., McVeigh, Richard, Robbertse, Barbara, Schoch, Conrad L., Johnston, Anjanette, Underwood, Beverly A., Karsch-Mizrachi, Ilene, and Nawrocki, Eric P.
Subjects: *RIBOSOMAL DNA, *RIBOSOMAL RNA, *RNA sequencing, *DNA sequencing, *MARKOV processes, *SEQUENCE analysis, *GUT microbiome
Abstract: Background: The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene, and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron. Results: To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used for checking incoming GenBank submissions and used by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8350 taxa. Conclusion: Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

22. Magic-BLAST, an accurate RNA-seq aligner for long and short reads

Author: Grzegorz M. Boratyn, Jean Thierry-Mieg, Danielle Thierry-Mieg, Ben Busby, and Thomas L. Madden
Subjects: RNA-seq, BLAST, Alignment, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. Results Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. Conclusions We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.
Published: 2019
Full Text: View/download PDF

23. An integrated package for bisulfite DNA methylation data analysis with Indel-sensitive mapping

Author: Qiangwei Zhou, Jing-Quan Lim, Wing-Kin Sung, and Guoliang Li
Subjects: DNA methylation, Bisulfite sequencing, Alignment, Indel, Pipeline, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background DNA methylation plays crucial roles in most eukaryotic organisms. Bisulfite sequencing (BS-Seq) is a sequencing approach that provides quantitative cytosine methylation levels in genome-wide scope and single-base resolution. However, genomic variations such as insertions and deletions (indels) affect methylation calling, and the alignment of reads near/across indels becomes inaccurate in the presence of polymorphisms. Hence, the simultaneous detection of DNA methylation and indels is important for exploring the mechanisms of functional regulation in organisms. Results These problems motivated us to develop the algorithm BatMeth2, which can align BS reads with high accuracy while allowing for variable-length indels with respect to the reference genome. The results from simulated and real bisulfite DNA methylation data demonstrated that our proposed method increases alignment accuracy. Additionally, BatMeth2 can calculate the methylation levels of individual loci, genomic regions or functional regions such as genes/transposable elements. Additional programs were also developed to provide methylation data annotation, visualization, and differentially methylated cytosine/region (DMC/DMR) detection. The whole package provides new tools and will benefit bisulfite data analysis. Conclusion BatMeth2 improves DNA methylation calling, particularly for regions close to indels. It is an autorun package and easy to use. In addition, a DNA methylation visualization program and a differential analysis program are provided in BatMeth2. We believe that BatMeth2 will facilitate the study of the mechanisms of DNA methylation in development and disease. BatMeth2 is an open source software program and is available on GitHub (https://github.com/GuoliangLi-HZAU/BatMeth2/).
Published: 2019
Full Text: View/download PDF

24. BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data

Author: Seokjun Soe, Yoonjae Park, and Heejoon Chae
Subjects: DNA methylation, Bisulfite sequencing, Alignment, Apache Spark, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Bisulfite sequencing is one of the major high-resolution DNA methylation measurement method. Due to the selective nucleotide conversion on unmethylated cytosines after treatment with sodium bisulfite, processing bisulfite-treated sequencing reads requires additional steps which need high computational demands. However, a dearth of efficient aligner that is designed for bisulfite-treated sequencing becomes a bottleneck of large-scale DNA methylome analyses. Results In this study, we present a highly scalable, efficient, and load-balanced bisulfite aligner, BiSpark, which is designed for processing large volumes of bisulfite sequencing data. We implemented the BiSpark algorithm over the Apache Spark, a memory optimized distributed data processing framework, to achieve the maximum data parallel efficiency. The BiSpark algorithm is designed to support redistribution of imbalanced data to minimize delays on large-scale distributed environment. Conclusions Experimental results on methylome datasets show that BiSpark significantly outperforms other state-of-the-art bisulfite sequencing aligners in terms of alignment speed and scalability with respect to dataset size and a number of computing nodes while providing highly consistent and comparable mapping results. Availability The implementation of BiSpark software package and source code is available at https://github.com/bhi-kimlab/BiSpark/.
Published: 2018
Full Text: View/download PDF

25. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels.

Author: Zanti, Maria, Michailidou, Kyriaki, Loizidou, Maria A., Machattou, Christina, Pirpa, Panagiota, Christodoulou, Kyproula, Spyrou, George M., Kyriacou, Kyriacos, and Hadjisavvas, Andreas
Subjects: *GERM cells, *MEDICAL genetics, *LABORATORY test panels, *SEQUENCE alignment, *DATA analysis
Abstract: Background: Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis. Results: We highlight that interval padding is required for the accurate detection of intronic variants including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm, failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms; GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls. Conclusions: These findings have important implications towards the identification of clinically actionable variants through panel testing in a clinical laboratory setting, when dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or at the same time developing new pipelines to generate more reliable and more consistent data. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

26. miRge 2.0 for comprehensive analysis of microRNA sequencing data

Author: Yin Lu, Alexander S. Baras, and Marc K. Halushka
Subjects: miRNA, Small RNA-seq, Alignment, isomiR, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background miRNAs play important roles in the regulation of gene expression. The rapidly developing field of microRNA sequencing (miRNA-seq; small RNA-seq) needs comprehensive, robust, user-friendly and standardized bioinformatics tools to analyze these large datasets. We present miRge 2.0, in which multiple enhancements were made towards these goals. Results miRge 2.0 has become more comprehensive with increased functionality including a novel miRNA detection method, A-to-I editing analysis, integrated standardized GFF3 isomiR reporting, and improved alignment to miRNAs. The novel miRNA detection method uniquely uses both miRNA hairpin sequence structure and composition of isomiRs resulting in higher specificity for potential miRNA identification. Using known miRNA data, our support vector machine (SVM) model predicted miRNAs with an average Matthews correlation coefficient (MCC) of 0.939 over 32 human cell datasets and outperformed miRDeep2 and miRAnalyzer regarding phylogenetic conservation. The A-to-I editing detection strongly correlated with a reference dataset with adjusted R2 = 0.96. miRge 2.0 is the most up-to-date aligner with custom libraries to both miRBase v22 and MirGeneDB v2.0 for 6 species: human, mouse, rat, fruit fly, nematode and zebrafish; and has a tool to create custom libraries. For user-friendliness, miRge 2.0 is incorporated into bcbio-nextgen and implementable through Bioconda. Conclusions miRge 2.0 is a redesigned, leading miRNA RNA-seq aligner with several improvements and novel utilities. miRge 2.0 is freely available at: https://github.com/mhalushka/miRge.
Published: 2018
Full Text: View/download PDF

27. Introducing difference recurrence relations for faster semi-global alignment of long sequences

Author: Hajime Suzuki and Masahiro Kasahara
Subjects: Sequence analysis, Alignment, Long read, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background The read length of single-molecule DNA sequencers is reaching 1 Mb. Popular alignment software tools widely used for analyzing such long reads often take advantage of single-instruction multiple-data (SIMD) operations to accelerate calculation of dynamic programming (DP) matrices in the Smith–Waterman–Gotoh (SWG) algorithm with a fixed alignment start position at the origin. Nonetheless, 16-bit or 32-bit integers are necessary for storing the values in a DP matrix when sequences to be aligned are long; this situation hampers the use of the full SIMD width of modern processors. Results We proposed a faster semi-global alignment algorithm, “difference recurrence relations,” that runs more rapidly than the state-of-the-art algorithm by a factor of 2.1. Instead of calculating and storing all the values in a DP matrix directly, our algorithm computes and stores mainly the differences between the values of adjacent cells in the matrix. Although the SWG algorithm and our algorithm can output exactly the same result, our algorithm mainly involves 8-bit integer operations, enabling us to exploit the full width of SIMD operations (e.g., 32) on modern processors. We also developed a library, libgaba, so that developers can easily integrate our algorithm into alignment programs. Conclusions Our novel algorithm and optimized library implementation will facilitate accelerating nucleotide long-read analysis algorithms that use pairwise alignment stages. The library is implemented in the C programming language and available at https://github.com/ocxtal/libgaba.
Published: 2018
Full Text: View/download PDF

28. VADR: validation and annotation of virus sequence submissions to GenBank.

Author: Schäffer, Alejandro A., Hatcher, Eneida L., Yankie, Linda, Shonkwiler, Lara, Brister, J. Rodney, Karsch-Mizrachi, Ilene, and Nawrocki, Eric P.
Subjects: *RNA sequencing, *SOFTWARE compatibility, *ANNOTATIONS, *AMINO acid sequence, *INFLUENZA viruses, *NOROVIRUS diseases, *NOROVIRUSES, *INFLUENZA A virus
Abstract: Background: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. Results: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. Conclusion: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

29. Solving the master equation for Indels

Author: Ian H. Holmes
Subjects: Phylogenetics, Alignment, Indels, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Abstract Background Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. Main text This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. Conclusions While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances.
Published: 2017
Full Text: View/download PDF

30. An integrated package for bisulfite DNA methylation data analysis with Indel-sensitive mapping.

Author: Zhou, Qiangwei, Lim, Jing-Quan, Sung, Wing-Kin, and Li, Guoliang
Subjects: *DNA methylation, *DATA analysis, *CYTOSINE, *GENOMES, *GENOMICS, *DISEASES, *OPEN source software
Abstract: Background: DNA methylation plays crucial roles in most eukaryotic organisms. Bisulfite sequencing (BS-Seq) is a sequencing approach that provides quantitative cytosine methylation levels in genome-wide scope and single-base resolution. However, genomic variations such as insertions and deletions (indels) affect methylation calling, and the alignment of reads near/across indels becomes inaccurate in the presence of polymorphisms. Hence, the simultaneous detection of DNA methylation and indels is important for exploring the mechanisms of functional regulation in organisms. Results: These problems motivated us to develop the algorithm BatMeth2, which can align BS reads with high accuracy while allowing for variable-length indels with respect to the reference genome. The results from simulated and real bisulfite DNA methylation data demonstrated that our proposed method increases alignment accuracy. Additionally, BatMeth2 can calculate the methylation levels of individual loci, genomic regions or functional regions such as genes/transposable elements. Additional programs were also developed to provide methylation data annotation, visualization, and differentially methylated cytosine/region (DMC/DMR) detection. The whole package provides new tools and will benefit bisulfite data analysis. Conclusion: BatMeth2 improves DNA methylation calling, particularly for regions close to indels. It is an autorun package and easy to use. In addition, a DNA methylation visualization program and a differential analysis program are provided in BatMeth2. We believe that BatMeth2 will facilitate the study of the mechanisms of DNA methylation in development and disease. BatMeth2 is an open source software program and is available on GitHub (https://github.com/GuoliangLi-HZAU/BatMeth2/). [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

31. BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data.

Author: Soe, Seokjun, Park, Yoonjae, and Chae, Heejoon
Subjects: *DNA methylation, *SODIUM bisulfite, *SEQUENCE alignment, *ELECTRONIC data processing, *ALGORITHMS, *COMPUTER software
Abstract: Background: Bisulfite sequencing is one of the major high-resolution DNA methylation measurement method. Due to the selective nucleotide conversion on unmethylated cytosines after treatment with sodium bisulfite, processing bisulfite-treated sequencing reads requires additional steps which need high computational demands. However, a dearth of efficient aligner that is designed for bisulfite-treated sequencing becomes a bottleneck of large-scale DNA methylome analyses. Results: In this study, we present a highly scalable, efficient, and load-balanced bisulfite aligner, BiSpark, which is designed for processing large volumes of bisulfite sequencing data. We implemented the BiSpark algorithm over the Apache Spark, a memory optimized distributed data processing framework, to achieve the maximum data parallel efficiency. The BiSpark algorithm is designed to support redistribution of imbalanced data to minimize delays on large-scale distributed environment. Conclusions: Experimental results on methylome datasets show that BiSpark significantly outperforms other state-of-the-art bisulfite sequencing aligners in terms of alignment speed and scalability with respect to dataset size and a number of computing nodes while providing highly consistent and comparable mapping results. Availability: The implementation of BiSpark software package and source code is available at https://github.com/bhi-kimlab/BiSpark/. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

32. miRge 2.0 for comprehensive analysis of microRNA sequencing data.

Author: Lu, Yin, Baras, Alexander S., and Halushka, Marc K.
Subjects: *MICRORNA, *RNA sequencing, *GENE expression, *BIOINFORMATICS, *HAIRPIN (Genetics)
Abstract: Background: miRNAs play important roles in the regulation of gene expression. The rapidly developing field of microRNA sequencing (miRNA-seq; small RNA-seq) needs comprehensive, robust, user-friendly and standardized bioinformatics tools to analyze these large datasets. We present miRge 2.0, in which multiple enhancements were made towards these goals. Results: miRge 2.0 has become more comprehensive with increased functionality including a novel miRNA detection method, A-to-I editing analysis, integrated standardized GFF3 isomiR reporting, and improved alignment to miRNAs. The novel miRNA detection method uniquely uses both miRNA hairpin sequence structure and composition of isomiRs resulting in higher specificity for potential miRNA identification. Using known miRNA data, our support vector machine (SVM) model predicted miRNAs with an average Matthews correlation coefficient (MCC) of 0.939 over 32 human cell datasets and outperformed miRDeep2 and miRAnalyzer regarding phylogenetic conservation. The A-to-I editing detection strongly correlated with a reference dataset with adjusted R2 = 0.96. miRge 2.0 is the most up-to-date aligner with custom libraries to both miRBase v22 and MirGeneDB v2.0 for 6 species: human, mouse, rat, fruit fly, nematode and zebrafish; and has a tool to create custom libraries. For user-friendliness, miRge 2.0 is incorporated into bcbio-nextgen and implementable through Bioconda. Conclusions: miRge 2.0 is a redesigned, leading miRNA RNA-seq aligner with several improvements and novel utilities. miRge 2.0 is freely available at: https://github.com/mhalushka/miRge. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

33. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels

Author: Kyproula Christodoulou, Andreas Hadjisavvas, Maria A. Loizidou, Maria Zanti, Kyriacos Kyriacou, Kyriaki Michailidou, Christina Machattou, Panagiota Pirpa, and George M. Spyrou
Subjects: Pipeline comparison, Computer science, QH301-705.5, Computer applications to medicine. Medical informatics, R858-859.7, Interval (mathematics), Computational biology, Biochemistry, Padding, Polymorphism, Single Nucleotide, Germline, 03 medical and health sciences, Germline NGS data analysis, Structural Biology, Variant calling, Humans, Biology (General), Molecular Biology, 030304 developmental biology, Alignment, 0303 health sciences, Next-generation sequencing (NGS), Research, Applied Mathematics, 030305 genetics & heredity, Computational Biology, High-Throughput Nucleotide Sequencing, Interval padding, Pipeline (software), Computer Science Applications, Pipeline transport, Identification (information), Germ Cells, Null (SQL), DNA microarray, Software
Abstract: Background Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis. Results We highlight that interval padding is required for the accurate detection of intronic variants including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm, failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms; GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls. Conclusions These findings have important implications towards the identification of clinically actionable variants through panel testing in a clinical laboratory setting, when dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or at the same time developing new pipelines to generate more reliable and more consistent data.
Published: 2021

34. BATVI: Fast, sensitive and accurate detection of virus integrations.

Author: Tennakoon, Chandana and Wing Kin Sung
Subjects: *SEQUENCE alignment, *VIRUS identification, *HUMAN genome, *ROUS sarcoma, *CHIMERIC proteins, *GENE expression
Abstract: Background: The study of virus integrations in human genome is important since virus integrations were shown to be associated with diseases. In the literature, few methods have been proposed that predict virus integrations using next generation sequencing datasets. Although they work, they are slow and are not very sensitive. Results and discussion: This paper introduces a new method BatVI to predict viral integrations. Our method uses a fast screening method to filter out chimeric reads containing possible viral integrations. Next, sensitive alignments of these candidate chimeric reads are called by BLAST. Chimeric reads that are co-localized in the human genome are clustered. Finally, by assembling the chimeric reads in each cluster, high confident virus integration sites are extracted. Conclusion: We compared the performance of BatVI with existing methods VirusFinder and VirusSeq using both simulated and real-life datasets of liver cancer patients. BatVI ran an order of magnitude faster and was able to predict almost twice the number of true positives compared to other methods while maintaining a false positive rate less than 1%. For the liver cancer datasets, BatVI uncovered novel integrations to two important genes TERT and MLL4, which were missed by previous studies. Through gene expression data, we verified the correctness of these additional integrations. BatVI can be downloaded from http://biogpu.ddns.comp.nus.edu.sg/∼ksung/batvi/index.html. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

35. Meta-aligner: long-read alignment based on genome statistics.

Author: Nashta-ali, Damoon, Aliyari, Ali, Moghadam, Ahmad Ahmadian, Edrisi, Mohammad Amin, Motahari, Seyed Abolfaz, and Khalaj, Babak Hossein
Subjects: *NUCLEOTIDE sequencing, *STATISTICAL methods in genetics, *BIOTECHNOLOGY, *PERFORMANCE evaluation, *PRECISION (Information retrieval)
Abstract: Background: Current development of sequencing technologies is towards generating longer and noisier reads. Evidently, accurate alignment of these reads play an important role in any downstream analysis. Similarly, reducing the overall cost of sequencing is related to the time consumption of the aligner. The tradeoff between accuracy and speed is the main challenge in designing long read aligners. Results: We propose Meta-aligner which aligns long and very long reads to the reference genome very efficiently and accurately. Meta-aligner incorporates available short/long aligners as subcomponents and uses statistics from the reference genome to increase the performance. Meta-aligner estimates statistics from reads and the reference genome automatically. Meta-aligner is implemented in C++ and runs in popular POSIX-like operating systems such as Linux. Conclusions: Meta-aligner achieves high recall rates and precisions especially for long reads and high error rates. Also, it improves performance of alignment in the case of PacBio long-reads in comparison with traditional schemes. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

36. Optical map guided genome assembly

Author: Leena Salmela, Miika Leinonen, Department of Computer Science, Algorithmic Bioinformatics, Helsinki Institute for Information Technology, and Bioinformatics
Subjects: Optical mapping, Optical Map, Computer science, Contiguity, Sequence assembly, Computational biology, lcsh:Computer applications to medicine. Medical informatics, Biochemistry, Genome, 03 medical and health sciences, 0302 clinical medicine, Structural Biology, Genetic linkage, Humans, TOOL, Molecular Biology, lcsh:QH301-705.5, 030304 developmental biology, 0303 health sciences, Genome assembly, Contig, Applied Mathematics, Methodology Article, High-Throughput Nucleotide Sequencing, food and beverages, 113 Computer and information sciences, Voltage-Sensitive Dye Imaging, Computer Science Applications, ALIGNMENT, lcsh:Biology (General), lcsh:R858-859.7, DNA microarray, 030217 neurology & neurosurgery
Abstract: Background The long reads produced by third generation sequencing technologies have significantly boosted the results of genome assembly but still, genome-wide assemblies solely based on read data cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies but it has mostly been applied in a post-processing stage after contig assembly. Results We propose OpticalKermit which directly integrates genome wide optical maps into contig assembly. We show how genome wide optical maps can be used to localize reads on the genome and then we adapt the Kermit method, which originally incorporated genetic linkage maps to the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome wide optical maps to the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OpticalKermit produces an assembly with almost three times higher NGA50 with a lower number of misassemblies on real A. thaliana reads. Conclusions OpticalKermit successfully incorporates optical mapping data directly to contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies.
Published: 2020

37. Systematic evaluation of the impact of ChIP-seq read designs on genome coverage, peak identification, and allele-specific binding detection.

Author: Qi Zhang, Xin Zeng, Sam Younkin, Trupti Kawli, Snyder, Michael P., and Keleş, Sündüz
Subjects: *CHROMATIN, *IMMUNOPRECIPITATION, *GENOMES, *TRANSCRIPTION factors, *DATA analysis
Abstract: Background: Chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments revolutionized genome-wide profiling of transcription factors and histone modifications. Although maturing sequencing technologies allow these experiments to be carried out with short (36-50 bps), long (75-100 bps), single-end, or paired-end reads, the impact of these read parameters on the downstream data analysis are not well understood. In this paper, we evaluate the effects of different read parameters on genome sequence alignment, coverage of different classes of genomic features, peak identification, and allele-specific binding detection. Results: We generated 101 bps paired-end ChIP-seq data for many transcription factors from human GM12878 and MCF7 cell lines. Systematic evaluations using in silico variations of these data as well as fully simulated data, revealed complex interplay between the sequencing parameters and analysis tools, and indicated clear advantages of paired-end designs in several aspects such as alignment accuracy, peak resolution, and most notably, allele-specific binding detection. Conclusions: Our work elucidates the effect of design on the downstream analysis and provides insights to investigators in deciding sequencing parameters in ChIP-seq experiments. We present the first systematic evaluation of the impact of ChIP-seq designs on allele-specific binding detection and highlights the power of pair-end designs in such studies. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

38. Alignment of time course gene expression data and the classification of developmentally driven genes with hidden Markov models.

Author: Robinson, Sean, Glonek, Garique, Koch, Inge, Thomas, Mark, and Davies, Christopher
Subjects: *MICROARRAY technology, *GENE expression, *HIDDEN Markov models, *MARKOV processes, *BIOINFORMATICS
Abstract: Background: We consider data from a time course microarray experiment that was conducted on grapevines over the development cycle of the grape berries at two different vineyards in South Australia. Although the underlying biological process of berry development is the same at both vineyards, there are differences in the timing of the development due to local conditions. We aim to align the data from the two vineyards to enable an integrated analysis of the gene expression and use the alignment of the expression profiles to classify likely developmental function. Results: We present a novel alignment method based on hidden Markov models (HMMs) and use the method to align the motivating grapevine data. We show that our alignment method is robust against subsets of profiles that are not suitable for alignment, investigate alignment diagnostics under the model and demonstrate the classification of developmentally driven genes. Conclusions: The classification of developmentally driven genes both validates that the alignment we obtain is meaningful and also gives new evidence that can be used to identify the role of genes with unknown function. Using our alignment methodology, we find at least 1279 grapevine probe sets with no current annotated function that are likely to be controlled in a developmental manner. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

39. MDAT- Aligning multiple domain arrangements.

Author: Kemena, Carsten, Feildel, Tristan Bitard, and Bornberg-Bauer, Erich
Subjects: *AMINO acids, *AMINO compounds, *SYMMETRIC domains, *PROTEIN-protein interactions, *METHYLENEDIOXYPHENYL compounds
Abstract: Background Proteins are composed of domains, protein segments that fold independently from the rest of the protein and have a specific function. During evolution the arrangement of domains can change: domains are gained, lost or their order is rearranged. To facilitate the analysis of these changes we propose the use of multiple domain alignments. Results We developed a multiple domain alignment program, called MDAT, which aligns multiple domain arrangements. MDAT extends earlier programs which perform pairwise alignments of domain arrangements. MDAT uses a domain similarity matrix to score domain pairs and aligns the domain arrangements using a consistency supported progressive alignment method. Conclusion MDAT will be useful for analysing changes in domain arrangements within and between protein families and will thus provide valuable insights into the evolution of proteins and their domains. MDAT is coded in C++, and the source code is freely available for download at http://www.bornberglab.org/pages/mdat. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

40. Impact of T-RFLP data analysis choices on assessments of microbial community structure and dynamics.

Author: Fredriksson, Nils Johan, Hermansson, Malte, and Wilén, Britt-Marie
Abstract: Background: Terminal restriction fragment length polymorphism (T-RFLP) analysis is a common DNA-fingerprinting technique used for comparisons of complex microbial communities. Although the technique is well established there is no consensus on how to treat T-RFLP data to achieve the highest possible accuracy and reproducibility. This study focused on two critical steps in the T-RFLP data treatment: the alignment of the terminal restriction fragments (T-RFs), which enables comparisons of samples, and the normalization of T-RF profiles, which adjusts for differences in signal strength, total fluorescence, between samples. Results: Variations in the estimation of T-RF sizes were observed and these variations were found to affect the alignment of the T-RFs. A novel method was developed which improved the alignment by adjusting for systematic shifts in the T-RF size estimations between the T-RF profiles. Differences in total fluorescence were shown to be caused by differences in sample concentration and by the gel loading. Five normalization methods were evaluated and the total fluorescence normalization procedure based on peak height data was found to increase the similarity between replicate profiles the most. A high peak detection threshold, alignment correction, normalization and the use of consensus profiles instead of single profiles increased the similarity of replicate T-RF profiles, i.e. lead to an increased reproducibility. The impact of different treatment methods on the outcome of subsequent analyses of T-RFLP data was evaluated using a dataset from a longitudinal study of the bacterial community in an activated sludge wastewater treatment plant. Whether the alignment was corrected or not and if and how the T-RF profiles were normalized had a substantial impact on ordination analyses, assessments of bacterial dynamics and analyses of correlations with environmental parameters. Conclusions: A novel method for the evaluation and correction of the alignment of T-RF profiles was shown to reduce the uncertainty and ambiguity in alignments of T-RF profiles. Large differences in the outcome of assessments of bacterial community structure and dynamics were observed between different alignment and normalization methods. The results of this study can therefore be of value when considering what methods to use in the analysis of T-RFLP data. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

41. Bison: bisulfite alignment on nodes of a cluster.

Author: Ryan, Devon Patrick and Ehninger, Dan
Abstract: Background: DNA methylation changes are associated with a wide array of biological processes. Bisulfite conversion of DNA followed by high-throughput sequencing is increasingly being used to assess genome-wide methylation at single-base resolution. The relative slowness of most commonly used aligners for processing such data introduces an unnecessarily long delay between receipt of raw data and statistical analysis. While this process can be sped-up by using computer clusters, current tools are not designed with them in mind and end-users must create such implementations themselves. Results: Here, we present a novel BS-seq aligner, Bison, which exploits multiple nodes of a computer cluster to speed up this process and also has increased accuracy. Bison is accompanied by a variety of helper programs and scripts to ease, as much as possible, the process of quality control and preparing results for statistical analysis by a variety of popular R packages. Bison is also accompanied by bison_herd, a variant of Bison with the same output but that can scale to a semi-arbitrary number of nodes, with concomitant increased demands on the underlying message passing interface implementation. Conclusions: Bison is a new bisulfite-converted short-read aligner providing end users easier scalability for performance gains, more accurate alignments, and a convenient pathway for quality controlling alignments and converting methylation calls into a form appropriate for statistical analysis. Bison and the more scalable bison_herd are natively able to utilize multiple nodes of a computer cluster simultaneously and serve to simplify to the process of creating analysis pipelines. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

42. The number of reduced alignments between two DNA sequences.

Author: Andrade, Helena, Area, Iván, Nieto, Juan J., and Torres, Ángela
Abstract: Background: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Results for explicit representations of some alignments have been already obtained. Results: We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences and a new formula for a class of reduced alignments. Conclusions: A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

43. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment.

Author: Shuying Sun, Aaron Noviski, and Xiaoqing Yu
Subjects: *DNA methylation, *BIOINFORMATICS, *CYTOSINE, *EPIGENESIS, *GENE expression, *SULFITES, *METHYLATION
Abstract: Background: DNA methylation is an epigenetic event that adds a methyl-group to the 5' cytosine. This epigenetic modification can significantly affect gene expression in both normal and diseased cells. Hence, it is important to study methylation signals at the single cytosine site level, which is now possible utilizing bisulfite conversion technique (i.e., converting unmethylated Cs to Us and then to Ts after PCR amplification) and next generation sequencing (NGS) technologies. Despite the advances of NGS technologies, certain quality issues remain. Some of the more prevalent quality issues involve low per-base sequencing quality at the 3' end, PCR amplification bias, and bisulfite conversion rates. Therefore, it is important to conduct quality assessment before downstream analysis. To the best of our knowledge, no existing software packages can generally assess the quality of methylation sequencing data generated based on different bisulfite-treated protocols. Results: To conduct the quality assessment of bisulfite methylation sequencing data, we have developed a pipeline named MethyQA. MethyQA combines currently available open-source software packages with our own custom programs written in Perl and R. The pipeline can provide quality assessment results for tens of millions of reads in under an hour. The novelty of our pipeline lies in its examination of bisulfite conversion rates and of the DNA sequence structure of regions that have different conversion rates or coverage. Conclusions: MethyQA is a new software package that provides users with a unique insight into the methylation sequencing data they are researching. It allows the users to determine the quality of their data and better prepares them to address the research questions that lie ahead. Due to the speed and efficiency at which MethyQA operates, it will become an important tool for studies dealing with bisulfite methylation sequencing data. [ABSTRACT FROM AUTHOR]
Published: 2013
Full Text: View/download PDF

44. MultiAlign: a multiple LC-MS analysis tool for targeted omics analysis.

Author: LaMarche, Brian L., Crowell, Kevin L., Jaitly, Navdeep, Petyuk, Vladislav A., Shah, Anuj R., Polpitiya, Ashoka D., Sandoval, John D., Kiebel, Gary R., Monroe, Matthew E., Callister, Stephen J., Metz, Thomas O., Anderson, Gordon A., and Smith, Richard D.
Subjects: *PROTEOMICS, *MOLECULAR structure, *PEPTIDES, *FATTY acid oxidation, *COMPARATIVE studies, *LIQUID chromatography-mass spectrometry
Abstract: Background: MultiAlign is a free software tool that aligns multiple liquid chromatography-mass spectrometry datasets to one another by clustering mass and chromatographic elution features across datasets. Applicable to both label-free proteomics and metabolomics comparative analyses, the software can be operated in several modes. For example, clustered features can be matched to a reference database to identify analytes, used to generate abundance profiles, linked to tandem mass spectra based on parent precursor masses, and culled for targeted liquid chromatography-tandem mass spectrometric analysis. MultiAlign is also capable of tandem mass spectral clustering to describe proteome structure and find similarity in subsequent sample runs. Results: MultiAlign was applied to two large proteomics datasets obtained from liquid chromatography-mass spectrometry analyses of environmental samples. Peptides in the datasets for a microbial community that had a known metagenome were identified by matching mass and elution time features to those in an established reference peptide database. Results compared favorably with those obtained using existing tools such as VIPER, but with the added benefit of being able to trace clusters of peptides across conditions to existing tandem mass spectra. MultiAlign was further applied to detect clusters across experimental samples derived from a reactor biomass community for which no metagenome was available. Several clusters were culled for further analysis to explore changes in the community structure. Lastly, MultiAlign was applied to liquid chromatography-mass spectrometry-based datasets obtained from a previously published study of wild type and mitochondrial fatty acid oxidation enzyme knockdown mutants of human hepatocarcinoma to demonstrate its utility for analyzing metabolomics datasets. Conclusion: MultiAlign is an efficient software package for finding similar analytes across multiple liquid chromatography-mass spectrometry feature maps, as demonstrated here for both proteomics and metabolomics experiments. The software is particularly useful for proteomic studies where little or no genomic context is known, such as with environmental proteomics. [ABSTRACT FROM AUTHOR]
Published: 2013
Full Text: View/download PDF

45. Magic-BLAST, an accurate RNA-seq aligner for long and short reads

Author: Boratyn, Grzegorz M., Thierry-Mieg, Jean, Thierry-Mieg, Danielle, Busby, Ben, and Madden, Thomas L.
Published: 2019
Full Text: View/download PDF

46. Evaluation of the impact of Illumina error correction tools on de novo genome assembly

Author: Yves Van de Peer, Piet Demeester, Giles Miclotte, Mahdi Heydari, and Jan Fostier
Subjects: 0301 basic medicine, Computer science, Sequence assembly, computer.software_genre, Biochemistry, Genome, chemistry.chemical_compound, 0302 clinical medicine, MEMORY-EFFICIENT, Structural Biology, Error correction, lcsh:QH301-705.5, Contig, Applied Mathematics, High-Throughput Nucleotide Sequencing, Computer Science Applications, ACCURATE CORRECTION, ALIGNMENT, lcsh:R858-859.7, Drosophila, Data mining, DNA microarray, Algorithms, Research Article, READS, Sequence alignment, lcsh:Computer applications to medicine. Medical informatics, DNA sequencing, 03 medical and health sciences, Illumina, Animals, Humans, Caenorhabditis elegans, Molecular Biology, Gene, Genome assembly, Bacteria, Biology and Life Sciences, DNA, Sequence Analysis, DNA, 030104 developmental biology, chemistry, lcsh:Biology (General), BRUIJN GRAPHS, Next-generation sequencing, SEQUENCING ERRORS, Error detection and correction, computer, Sequence Alignment, 030217 neurology & neurosurgery
Abstract: Background Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. Results For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. Conclusions We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1784-8) contains supplementary material, which is available to authorized users.
Published: 2017

47. Magic-BLAST, an accurate RNA-seq aligner for long and short reads

Author: Jean Thierry-Mieg, Danielle Thierry-Mieg, Thomas L. Madden, Grzegorz M. Boratyn, and Ben Busby
Subjects: FASTQ format, Time Factors, Computer science, ComputerApplications_COMPUTERSINOTHERSYSTEMS, RNA-Seq, Computational biology, lcsh:Computer applications to medicine. Medical informatics, Biochemistry, Genome, 03 medical and health sciences, 0302 clinical medicine, Structural Biology, RefSeq, Humans, BLAST, lcsh:QH301-705.5, Molecular Biology, Alignment, 030304 developmental biology, 0303 health sciences, Messenger RNA, Base Sequence, Sequence Analysis, RNA, Applied Mathematics, Magic (programming), Intron, food and beverages, RNA, Pipeline (software), Introns, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, ROC Curve, lcsh:Biology (General), 030220 oncology & carcinogenesis, lcsh:R858-859.7, RNA-seq, DNA microarray, Databases, Nucleic Acid, Sequence Alignment, Algorithms, Software
Abstract: Background Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. Results Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. Conclusions We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI. Electronic supplementary material The online version of this article (10.1186/s12859-019-2996-x) contains supplementary material, which is available to authorized users.
Published: 2019

48. BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data

Author: Yoonjae Park, Heejoon Chae, and Seokjun Soe
Subjects: 0301 basic medicine, Source code, Computer science, media_common.quotation_subject, Bisulfite sequencing, 02 engineering and technology, Parallel computing, lcsh:Computer applications to medicine. Medical informatics, Biochemistry, 03 medical and health sciences, chemistry.chemical_compound, Structural Biology, 020204 information systems, Spark (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Humans, Sulfites, Nucleotide, lcsh:QH301-705.5, Molecular Biology, Alignment, media_common, chemistry.chemical_classification, Distributed Computing Environment, DNA methylation, Apache Spark, Applied Mathematics, Sequence Analysis, DNA, Computer Science Applications, Bisulfite, 030104 developmental biology, lcsh:Biology (General), chemistry, Sodium bisulfite, Scalability, lcsh:R858-859.7, DNA microarray, Sequence Alignment, Algorithms, Software, DNA
Abstract: Bisulfite sequencing is one of the major high-resolution DNA methylation measurement method. Due to the selective nucleotide conversion on unmethylated cytosines after treatment with sodium bisulfite, processing bisulfite-treated sequencing reads requires additional steps which need high computational demands. However, a dearth of efficient aligner that is designed for bisulfite-treated sequencing becomes a bottleneck of large-scale DNA methylome analyses. In this study, we present a highly scalable, efficient, and load-balanced bisulfite aligner, BiSpark, which is designed for processing large volumes of bisulfite sequencing data. We implemented the BiSpark algorithm over the Apache Spark, a memory optimized distributed data processing framework, to achieve the maximum data parallel efficiency. The BiSpark algorithm is designed to support redistribution of imbalanced data to minimize delays on large-scale distributed environment. Experimental results on methylome datasets show that BiSpark significantly outperforms other state-of-the-art bisulfite sequencing aligners in terms of alignment speed and scalability with respect to dataset size and a number of computing nodes while providing highly consistent and comparable mapping results. The implementation of BiSpark software package and source code is available at https://github.com/bhi-kimlab/BiSpark/ .
Published: 2018

49. TOPAZ: asymmetric suffix array neighbourhood search for massive protein databases

Author: Liisa Holm, Alan Medlar, Institute of Biotechnology, Computational genomics, Organismal and Evolutionary Biology Research Programme, Genetics, and Bioinformatics
Subjects: 0301 basic medicine, Computer science, engineering.material, lcsh:Computer applications to medicine. Medical informatics, Neighbourhood search, Biochemistry, law.invention, 03 medical and health sciences, Bacterial Proteins, Enterobacteriaceae, Protein Annotation, Suffix arrays, Structural Biology, law, Amino Acid Sequence, Databases, Protein, BLAST, lcsh:QH301-705.5, Molecular Biology, Applied Mathematics, Suffix array, Protein database, 113 Computer and information sciences, Computer Science Applications, Topaz, ALIGNMENT, ComputingMethodologies_PATTERNRECOGNITION, 030104 developmental biology, lcsh:Biology (General), engineering, lcsh:R858-859.7, 1182 Biochemistry, cell and molecular biology, HOMOLOGY SEARCH, UniProt, Sequence Alignment, Algorithm, Software, Algorithms
Abstract: Background Protein homology search is an important, yet time-consuming, step in everything from protein annotation to metagenomics. Its application, however, has become increasingly challenging, due to the exponential growth of protein databases. In order to perform homology search at the required scale, many methods have been proposed as alternatives to BLAST that make an explicit trade-off between sensitivity and speed. One such method, SANSparallel, uses a parallel implementation of the suffix array neighbourhood search (SANS) technique to achieve high speed and provides several modes to allow for greater sensitivity at the expense of performance. Results We present a new approach called asymmetric SANS together with scored seeds and an alternative suffix array ordering scheme called optimal substitution ordering. These techniques dramatically improve both the sensitivity and speed of the SANS approach. Our implementation, TOPAZ, is one of the top performing methods in terms of speed, sensitivity and scalability. In our benchmark, searching UniProtKB for homologous proteins to the Dickeya solani proteome, TOPAZ took less than 3 minutes to achieve a sensitivity of 0.84 compared to BLAST. Conclusions Despite the trade-off homology search methods have to make between sensitivity and speed, TOPAZ stands out as one of the most sensitive and highest performance methods currently available. Electronic supplementary material The online version of this article (10.1186/s12859-018-2290-3) contains supplementary material, which is available to authorized users.
Published: 2018

50. An integrated package for bisulfite DNA methylation data analysis with Indel-sensitive mapping

Author: Qiangwei Zhou, Jing Quan Lim, Wing-Kin Sung, and Guoliang Li
Subjects: Transposable element, Data Analysis, Bisulfite sequencing, Computational biology, Biology, lcsh:Computer applications to medicine. Medical informatics, Biochemistry, 03 medical and health sciences, 0302 clinical medicine, Structural Biology, Pipeline, Humans, Sulfites, Indel, lcsh:QH301-705.5, Molecular Biology, 030304 developmental biology, Alignment, 0303 health sciences, DNA methylation, Applied Mathematics, food and beverages, Methylation, Sequence Analysis, DNA, Computer Science Applications, Bisulfite, lcsh:Biology (General), 030220 oncology & carcinogenesis, lcsh:R858-859.7, DNA microarray, Algorithms, Software, Reference genome
Abstract: Background DNA methylation plays crucial roles in most eukaryotic organisms. Bisulfite sequencing (BS-Seq) is a sequencing approach that provides quantitative cytosine methylation levels in genome-wide scope and single-base resolution. However, genomic variations such as insertions and deletions (indels) affect methylation calling, and the alignment of reads near/across indels becomes inaccurate in the presence of polymorphisms. Hence, the simultaneous detection of DNA methylation and indels is important for exploring the mechanisms of functional regulation in organisms. Results These problems motivated us to develop the algorithm BatMeth2, which can align BS reads with high accuracy while allowing for variable-length indels with respect to the reference genome. The results from simulated and real bisulfite DNA methylation data demonstrated that our proposed method increases alignment accuracy. Additionally, BatMeth2 can calculate the methylation levels of individual loci, genomic regions or functional regions such as genes/transposable elements. Additional programs were also developed to provide methylation data annotation, visualization, and differentially methylated cytosine/region (DMC/DMR) detection. The whole package provides new tools and will benefit bisulfite data analysis. Conclusion BatMeth2 improves DNA methylation calling, particularly for regions close to indels. It is an autorun package and easy to use. In addition, a DNA methylation visualization program and a differential analysis program are provided in BatMeth2. We believe that BatMeth2 will facilitate the study of the mechanisms of DNA methylation in development and disease. BatMeth2 is an open source software program and is available on GitHub (https://github.com/GuoliangLi-HZAU/BatMeth2/). Electronic supplementary material The online version of this article (10.1186/s12859-018-2593-4) contains supplementary material, which is available to authorized users.
Published: 2018

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Database

Publisher

143 results on '"Alignment"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources