Author: "Frith MC" / Journal: bioinformatics oxford england - Searchworks@Jio Institute Digital Library Search Results

1. How to optimally sample a sequence for rapid analysis.

Author: Frith MC, Shaw J, and Spouge JL
Subjects: Sequence Analysis, DNA methods, Algorithms, Software
Abstract: Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. Results: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. Availability and Implementation: Source code is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author(s) 2023. Published by Oxford University Press.)
Published: 2023

2. Minimally overlapping words for sequence similarity search.

Author: Frith MC, Noé L, and Kucherov G
Abstract: Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Results: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Availability and Implementation: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author(s) 2020. Published by Oxford University Press.)
Published: 2021
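
The sparse-seeding idea in this abstract (consider seeds only at positions where certain words occur) can be illustrated with a minimal Python sketch. The function name `sample_positions` and the test sequence are hypothetical choices for illustration; this is not the authors' noverlap software, only a toy instance of word-based position sampling using the example word set quoted in the abstract.

```python
def sample_positions(seq, words):
    """Return the start positions in `seq` where any word in `words` occurs."""
    seq = seq.lower()
    lengths = {len(w) for w in words}
    # A slice that runs past the end of the sequence is shorter than the
    # word length, so it can never match and needs no special handling.
    return [i for i in range(len(seq))
            if any(seq[i:i + k] in words for k in lengths)]


# Example word set from the abstract: ac, at, gc, gt.
WORDS = {"ac", "at", "gc", "gt"}
positions = sample_positions("ggactatgca", WORDS)
print(positions)  # [2, 5, 7]
```

In this word set every word starts with a or g and ends with c or t, so no word can begin where another one ends. As a result, two adjacent positions are never both sampled, which is one simple way to see the "anti-clumped" behaviour the abstract refers to.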

3. How sequence alignment scores correspond to probability models.

Author: Frith MC
Subjects: Markov Chains, Probability, Reproducibility of Results, Sequence Alignment, Algorithms, Models, Statistical
Abstract: Motivation: Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. Results: This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author(s) 2019. Published by Oxford University Press.)
Published: 2020
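
As a rough illustration of converting scores to probabilities with a partition function and a 'temperature', the sketch below applies Boltzmann weighting to a small, explicitly listed set of candidate alignment scores. The function name `alignment_probabilities` and the toy scores are assumptions made for this example; in practice the partition function sums over all possible alignments (typically by dynamic programming), which this sketch does not attempt, and it is not the paper's derivation.

```python
import math


def alignment_probabilities(scores, temperature):
    """Turn alignment scores into probabilities: p_i = exp(s_i / T) / Z."""
    # Subtracting the maximum score avoids overflow and leaves the
    # resulting probabilities unchanged.
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    z = sum(weights)  # partition function (up to the constant factor removed above)
    return [w / z for w in weights]


# Toy scores for three hypothetical candidate alignments.
probs = alignment_probabilities([10, 8, 8], temperature=2.0)
```

A lower temperature concentrates probability on the top-scoring alignment, while a higher temperature spreads it more evenly across the candidates.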

4. Training alignment parameters for arbitrary sequencers with LAST-TRAIN.

Author: Hamada M, Ono Y, Asai K, and Frith MC
Subjects: Humans, Genome, Human, Polymorphism, Genetic, Sequence Analysis, DNA methods, Software
Abstract: Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. Availability and Implementation: The source code is freely available at http://last.cbrc.jp/. Contact: mhamada@waseda.jp or mcfrith@edu.k.u-tokyo.ac.jp. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author 2016. Published by Oxford University Press.)
Published: 2017

5. ALP & FALP: C++ libraries for pairwise local alignment E-values.

Author: Sheetlin S, Park Y, Frith MC, and Spouge JL
Subjects: DNA metabolism, Databases, Factual, Humans, Proteins metabolism, Sequence Alignment, Computational Biology methods, DNA chemistry, Proteins chemistry, Sequence Analysis, DNA methods, Sequence Analysis, Protein methods, Software
Abstract: Motivation: Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. Availability and Implementation: To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. Contact: spouge@nih.gov. Supplementary Information: Supplementary data are available at Bioinformatics online. (Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.)
Published: 2016
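
For readers unfamiliar with local alignment E-values, the sketch below evaluates the classic Karlin-Altschul form E = K · m · n · exp(−λS). It does not use the ALP or FALP libraries or reproduce their method: the function name `e_value` and the numeric values of λ and K are placeholders assumed purely for illustration, whereas the libraries described above estimate such statistical parameters for a given score matrix, gap costs and letter abundances.

```python
import math


def e_value(score, m, n, lam, k):
    """Expected number of chance local alignments with score >= `score`
    between random sequences of lengths m and n (Karlin-Altschul form)."""
    return k * m * n * math.exp(-lam * score)


# Placeholder parameters for illustration only (not computed by ALP).
LAM, K = 0.27, 0.04
e = e_value(score=60, m=300, n=1_000_000, lam=LAM, k=K)
```

The E-value grows in proportion to the search space m · n and shrinks exponentially as the alignment score increases.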

6. Frameshift alignment: statistics and post-genomic applications.

Author: Sheetlin SL, Park Y, Frith MC, and Spouge JL
Subjects: Algorithms, Data Interpretation, Statistical, Genome, Human, Genomics, Humans, Metagenomics, Pseudogenes, Sequence Analysis, DNA, Sequence Analysis, Protein, Sequence Analysis, RNA, Software, Frameshift Mutation, Sequence Alignment methods
Abstract: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. (Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.)
Published: 2014

7. An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome.

Author: Shrestha AM and Frith MC
Subjects: Animals, Bayes Theorem, Chromosome Mapping, Chromosomes, Human, X, Genome, Human, Humans, Macaca mulatta, Models, Statistical, Software, High-Throughput Nucleotide Sequencing methods, Sequence Alignment methods, Sequence Analysis, DNA methods
Abstract: Summary: Many high-throughput sequencing experiments produce paired DNA reads. Paired-end DNA reads provide extra positional information that is useful in reliable mapping of short reads to a reference genome, as well as in downstream analyses of structural variations. Given the importance of paired-end alignments, it is surprising that there have been no previous publications focusing on this topic. In this article, we present a new probabilistic framework to predict the alignment of paired-end reads to a reference genome. Using both simulated and real data, we compare the performance of our method with six other read-mapping tools that provide a paired-end option. We show that our method provides a good combination of accuracy, error rate and computation time, especially in more challenging and practical cases, such as when the reference genome is incomplete or unavailable for the sample, or when there are large variations between the reference genome and the source of the reads. An open-source implementation of our method is available as part of Last, a multi-purpose alignment program freely available at http://last.cbrc.jp. Contact: martin@cbrc.jp. Supplementary Information: Supplementary data are available at Bioinformatics online.
Published: 2013

8. Adding unaligned sequences into an existing alignment using MAFFT and LAST.

Author: Katoh K and Frith MC
Subjects: Algorithms, Base Sequence, Computational Biology methods, Phylogeny, Sequence Alignment methods, Software
Abstract: Two methods to add unaligned sequences into an existing multiple sequence alignment have been implemented as the '--add' and '--addfragments' options in the MAFFT package. The former option is a basic one and applicable only to full-length sequences, whereas the latter option is applicable even when the unaligned sequences are short and fragmentary. These methods internally infer the phylogenetic relationship among the sequences in the existing alignment and the phylogenetic positions of unaligned sequences. Benchmarks based on two independent simulations consistently suggest that the '--addfragments' option outperforms recent methods, PaPaRa and PAGAN, in accuracy for difficult problems and that these three methods appropriately handle easy problems. Availability: http://mafft.cbrc.jp/alignment/software/. Contact: katoh@ifrec.osaka-u.ac.jp. Supplementary Information: Supplementary data are available at Bioinformatics online.
Published: 2012

9. Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection.

Author: Hamada M, Wijaya E, Frith MC, and Asai K
Subjects: Algorithms, Humans, INDEL Mutation, Models, Statistical, Polymorphism, Single Nucleotide, Sequence Alignment methods, Sequence Analysis, DNA
Abstract: Motivation: Recent studies have revealed the importance of considering quality scores of reads generated by next-generation sequence (NGS) platforms in various downstream analyses. It is also known that probabilistic alignments based on marginal probabilities (e.g. aligned-column and/or gap probabilities) provide more accurate alignment than conventional maximum score-based alignment. There exists, however, no study about probabilistic alignment that considers quality scores explicitly, although the method is expected to be useful in SNP/indel callers and bisulfite mapping, because accurate estimation of aligned columns or gaps is important in those analyses. Results: In this study, we propose methods of probabilistic alignment that consider quality scores of (one of) the sequences as well as a usual score matrix. The method is based on posterior decoding techniques in which various marginal probabilities are computed from a probabilistic model of alignments with quality scores, and can arbitrarily trade-off sensitivity and positive predictive value (PPV) of prediction (aligned columns and gaps). The method is directly applicable to read mapping (alignment) toward accurate detection of SNPs and indels. Several computational experiments indicated that probabilistic alignments can estimate aligned columns and gaps accurately, compared with other mapping algorithms e.g. SHRiMP2, Stampy, BWA and Novoalign. The study also suggested that our approach yields favorable precision for SNP/indel calling.
Published: 2011
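
The quality scores mentioned in this abstract are commonly Phred-scaled, where a score Q corresponds to a base-call error probability of 10^(−Q/10). The sketch below shows only that conversion, assuming the widespread FASTQ encoding with an ASCII offset of 33; the function names are hypothetical, and the paper's probabilistic alignment model itself is not reproduced here.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def decode_fastq_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (offset 33 assumed)."""
    return [ord(c) - offset for c in quality_string]


scores = decode_fastq_qualities("I5!")  # [40, 20, 0]
error_probs = [phred_to_error_prob(q) for q in scores]
```

An alignment model that uses quality scores can then treat a mismatch at a low-quality base as more likely to be a sequencing error than a true difference.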

10. Site2genome: locating short DNA sequences in whole genomes.

Author: Frith MC, Halees AS, Hansen U, and Weng Z
Subjects: DNA analysis, DNA chemistry, DNA genetics, Genome, Internet, Oligonucleotides analysis, Algorithms, Chromosome Mapping methods, Oligonucleotides chemistry, Oligonucleotides genetics, Sequence Alignment methods, Sequence Analysis, DNA methods, Software
Abstract: Summary: Many biological papers describe short, functional DNA sites without specifying their exact positions in the genome. We have developed a Web server that automates the tedious task of locating such sites in eukaryotic genomes, thus giving access to the context of rich annotations that are increasingly available for genome sequences. Availability: http://zlab.bu.edu/site2genome/
Published: 2004

11. Detection of cis-element clusters in higher eukaryotic DNA.

Author: Frith MC, Hansen U, and Weng Z
Subjects: Animals, Binding Sites genetics, Computational Biology, DNA metabolism, Genes, Regulator, Genome, Human, Humans, Markov Chains, Muscles metabolism, Promoter Regions, Genetic, Sensitivity and Specificity, Sequence Analysis, DNA statistics & numerical data, Software, Transcription Factors metabolism, Algorithms, Cluster Analysis, DNA genetics
Abstract: Motivation: Computational prediction and analysis of transcription regulatory regions in DNA sequences has the potential to accelerate greatly our understanding of how cellular processes are controlled. We present a hidden Markov model based method for detecting regulatory regions in DNA sequences, by searching for clusters of cis-elements. Results: When applied to regulatory targets of the transcription factor LSF, this method achieves a sensitivity of 67%, while making one prediction per 33 kb of non-repetitive human genomic sequence. When applied to muscle specific regulatory regions, we obtain a sensitivity and prediction rate that compare favorably with one of the best alternative approaches. Our method, which we call Cister, can be used to predict different varieties of regulatory region by searching for clusters of cis-elements of any type chosen by the user. Cister is simple to use and is available on the web. Availability: http://sullivan.bu.edu/~mfrith/cister.shtml. Contact: mfrith@bu.edu; zhiping@bu.edu.
Published: 2001
