96 results on '"Frith, Mc"'
Search Results
2. Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network.
- Author
-
Grapotte, M, Saraswat, M, Bessière, C, Menichelli, C, Ramilowski, JA, Severin, J, Hayashizaki, Y, Itoh, M, Tagami, M, Murata, M, Kojima-Ishiyama, M, Noma, S, Noguchi, S, Kasukawa, T, Hasegawa, A, Suzuki, H, Nishiyori-Sueki, H, Frith, MC, FANTOM consortium, Chatelain, C, Carninci, P, de Hoon, MJL, Wasserman, WW, Bréhélin, L, and Lecellier, C-H
- Abstract
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.
- Published
- 2021
3. Differential roles of epigenetic changes and Foxp3 expression in regulatory T cell-specific transcriptional regulation
- Author
-
Morikawa, H, Ohkura, N, Vandenbon, A, Itoh, M, Nagao Sato, S, Kawaji, H, Lassmann, T, Carninci, P, Hayashizaki, Y, Forrest, Ar, Standley, Dm, Date, H, Sakaguchi, S, FANTOM Consortium (Forrest AR, Rehli, M, Baillie, Jk, de Hoon MJ, Haberle, V, Kulakovskiy, Iv, Lizio, M, Andersson, R, Mungall, Cj, Meehan, Tf, Schmeier, S, Bertin, N, Jørgensen, M, Dimont, E, Arner, E, Schmidl, C, Schaefer, U, Medvedeva, Ya, Plessy, C, Vitezic, M, Severin, J, Semple, Ca, Ishizu, Y, Francescatto, M, Alam, I, Albanese, D, Altschuler, Gm, Archer, Ja, Arner, P, Babina, M, Baker, S, Balwierz, Pj, Beckhouse, Ag, Pradhan Bhatt, S, Blake, Ja, Blumenthal, A, Bodega, B, Bonetti, A, Briggs, J, Brombacher, F, Burroughs, Am, Califano, A, Cannistraci, Cv, Carbajo, D, Chen, Y, Chierici, M, Ciani, Y, Clevers, Hc, Dalla, E, Davis, Ca, Deplancke, B, Detmar, M, Diehl, Ad, Dohi, T, Drabløs, F, Edge, As, Edinger, M, Ekwall, K, Endoh, M, Enomoto, H, Fagiolini, M, Fairbairn, L, Fang, H, Farach Carson MC, Faulkner, Gj, Favorov, Av, Fisher, Me, Frith, Mc, Fujita, R, Fukuda, S, Furlanello, C, Furuno, M, Furusawa, J, Geijtenbeek, Tb, Gibson, A, Gingeras, T, Goldowitz, D, Gough, J, Guhl, S, Guler, R, Gustincich, Stefano, Ha, Tj, Hamaguchi, M, Hara, M, Harbers, M, Harshbarger, J, Hasegawa, A, Hasegawa, Y, Hashimoto, T, Herlyn, M, Hitchens, Kj, Ho Sui SJ, Hofmann, Om, Hoof, I, Hori, F, Huminiecki, L, Iida, K, Ikawa, T, Jankovic, Br, Jia, H, Joshi, A, Jurman, G, Kaczkowski, B, Kai, C, Kaida, K, Kaiho, A, Kajiyama, K, Kanamori Katayama, M, Kasianov, As, Kasukawa, T, Katayama, S, Kato, S, Kawaguchi, S, Kawamoto, H, Kawamura, Yi, Kawashima, T, Kempfle, Js, Kenna, Tj, Kere, J, Khachigian, Lm, Kitamura, T, Klinken, Sp, Knox, Aj, Kojima, M, Kojima, S, Kondo, N, Koseki, H, Koyasu, S, Krampitz, S, Kubosaki, A, Kwon, At, Laros, Jf, Lee, W, Lennartsson, A, Li, K, Lilje, B, Lipovich, L, Mackay Sim, A, Manabe, R, Mar, Jc, Marchand, B, Mathelier, A, Mejhert, N, Meynert, A, Mizuno, Y, Morais, Da, Morimoto, M, Moro, K, Motakis, E, 
Motohashi, H, Mummery, Cl, Murata, M, Nakachi, Y, Nakahara, F, Nakamura, T, Nakamura, Y, Nakazato, K, van Nimwegen, E, Ninomiya, N, Nishiyori, H, Noma, S, Nozaki, T, Ogishima, S, Ohmiya, H, Ohno, H, Ohshima, M, Okada Hatakeyama, M, Okazaki, Y, Orlando, V, Ovchinnikov, Da, Pain, A, Passier, R, Patrikakis, M, Persson, H, Piazza, S, Prendergast, Jg, Rackham, Oj, Ramilowski, Ja, Rashid, M, Ravasi, T, Rizzu, P, Roncador, M, Roy, S, Rye, Mb, Saijyo, E, Sajantila, A, Saka, A, Sakai, M, Sato, H, Satoh, H, Savvi, S, Saxena, A, Schneider, C, Schultes, Ea, Schulze Tanzil GG, Schwegmann, A, Sengstag, T, Sheng, G, Shimoji, H, Shimoni, Y, Shin, Jw, Simon, C, Sugiyama, D, Sugiyama, T, Suzuki, M, Swoboda, Rk, 't Hoen PA, Tagami, M, Takahashi, N, Takai, J, Tanaka, H, Tatsukawa, H, Tatum, Z, Thompson, M, Toyoda, H, Toyoda, T, Valen, E, van de Wetering, M, van den Berg LM, Verardo, R, Vijayan, D, Vorontsov, Ie, Wasserman, Ww, Watanabe, S, Wells, Ca, Winteringham, Ln, Wolvetang, E, Wood, Ej, Yamaguchi, Y, Yamamoto, M, Yoneda, M, Yonekura, Y, Yoshida, S, Zabierowski, Se, Zhang, Pg, Zhao, X, Zucchelli, S, Summers, Km, Suzuki, H, Daub, Co, Kawai, J, Heutink, P, Hide, W, Freeman, Tc, Lenhard, B, Bajic, Vb, Taylor, Ms, Makeev, Vj, Sandelin, A, Hume, Da, Hayashizaki, Y., AII - Amsterdam institute for Infection and Immunity, Infectious diseases, Experimental Immunology, and Hubrecht Institute for Developmental Biology and Stem Cell Research
- Subjects
Transcription, Genetic ,Regulatory T cell ,T-Lymphocytes ,Down-Regulation ,chemical and pharmacologic phenomena ,Biology ,Inbred C57BL ,T-Lymphocytes, Regulatory ,Epigenesis, Genetic ,Mice ,Genetic ,Settore BIO/13 - Biologia Applicata ,medicine ,Transcriptional regulation ,Animals ,Epigenetics ,Gene ,Inbred BALB C ,Genetics ,Regulation of gene expression ,Mice, Inbred BALB C ,Multidisciplinary ,Binding Sites ,FOXP3 ,hemic and immune systems ,Forkhead Transcription Factors ,DNA Methylation ,Biological Sciences ,Regulatory ,Cap analysis gene expression ,Mice, Inbred C57BL ,medicine.anatomical_structure ,Gene Expression Regulation ,DNA methylation ,Transcription ,Epigenesis
- Abstract
Naturally occurring regulatory T (Treg) cells, which specifically express the transcription factor forkhead box P3 (Foxp3), are engaged in the maintenance of immunological self-tolerance and homeostasis. By transcriptional start site cluster analysis, we assessed here how genome-wide patterns of DNA methylation or Foxp3 binding sites were associated with Treg-specific gene expression. We found that Treg-specific DNA hypomethylated regions were closely associated with Treg up-regulated transcriptional start site clusters, whereas Foxp3 binding regions had no significant correlation with either up- or down-regulated clusters in nonactivated Treg cells. However, in activated Treg cells, Foxp3 binding regions showed a strong correlation with down-regulated clusters. In accordance with these findings, the above two features of activation-dependent gene regulation in Treg cells tend to occur at different locations in the genome. The results collectively indicate that Treg-specific DNA hypomethylation is instrumental in gene up-regulation in steady state Treg cells, whereas Foxp3 down-regulates the expression of its target genes in activated Treg cells. Thus, the two events seem to play distinct but complementary roles in Treg-specific gene expression.
- Published
- 2014
- Full Text
- View/download PDF
4. Clusters of internally primed transcripts reveal novel long noncoding RNAs
- Author
-
Blake, J, Hancock, J, Pavan, B, Stubbs, L, Furuno, M, Pang, KC, Ninomiya, N, Fukuda, S, Frith, MC, Bult, C, Kai, C, Kawai, J, Carninci, P, Hayashizaki, Y, Mattick, JS, and Suzuki, H
- Abstract
Non-protein-coding RNAs (ncRNAs) are increasingly being recognized as having important regulatory roles. Although much recent attention has focused on tiny 22- to 25-nucleotide microRNAs, several functional ncRNAs are orders of magnitude larger in size. Examples of such macro ncRNAs include Xist and Air, which in mouse are 18 and 108 kilobases (Kb), respectively. We surveyed the 102,801 FANTOM3 mouse cDNA clones and found that Air and Xist were present not as single, full-length transcripts but as a cluster of multiple, shorter cDNAs, which were unspliced, had little coding potential, and were most likely primed from internal adenine-rich regions within longer parental transcripts. We therefore conducted a genome-wide search for regional clusters of such cDNAs to find novel macro ncRNA candidates. Sixty-six regions were identified, each of which mapped outside known protein-coding loci and which had a mean length of 92 Kb. We detected several known long ncRNAs within these regions, supporting the basic rationale of our approach. In silico analysis showed that many regions had evidence of imprinting and/or antisense transcription. These regions were significantly associated with microRNAs and transcripts from the central nervous system. We selected eight novel regions for experimental validation by northern blot and RT-PCR and found that the majority represent previously unrecognized noncoding transcripts that are at least 10 Kb in size and predominantly localized in the nucleus. Taken together, the data not only identify multiple new ncRNAs but also suggest the existence of many more macro ncRNAs like Xist and Air.
- Published
- 2006
5. The abundance of short proteins in the mammalian proteome
- Author
-
Blake, J, Hancock, J, Pavan, B, Stubbs, L, Frith, MC, Forrest, AR, Nourbakhsh, E, Pang, KC, Kai, C, Kawai, J, Carninci, P, Hayashizaki, Y, Bailey, TL, and Grimmond, SM
- Abstract
Short proteins play key roles in cell signalling and other processes, but their abundance in the mammalian proteome is unknown. Current catalogues of mammalian proteins exhibit an artefactual discontinuity at a length of 100 aa, so that protein abundance peaks just above this length and falls off sharply below it. To clarify the abundance of short proteins, we identify proteins in the FANTOM collection of mouse cDNAs by analysing synonymous and non-synonymous substitutions with the computer program CRITICA. This analysis confirms that there is no real discontinuity at length 100. Roughly 10% of mouse proteins are shorter than 100 aa, although the majority of these are variants of proteins longer than 100 aa. We identify many novel short proteins, including a "dark matter" subset containing ones that lack detectable homology to other known proteins. Translation assays confirm that some of these novel proteins can be translated and localised to the secretory pathway.
- Published
- 2006
6. Targeted nanopore sequencing using the Flongle device to identify mitochondrial DNA variants.
- Author
-
Akamatsu S, Mitsuhashi S, Soga K, Mizukami H, Shiraishi M, Frith MC, and Yamano Y
- Subjects
- Humans, Parkinson Disease genetics, Parkinson Disease diagnosis, Genetic Variation, Sequence Analysis, DNA methods, High-Throughput Nucleotide Sequencing methods, Male, Heteroplasmy genetics, Female, DNA, Mitochondrial genetics, Nanopore Sequencing methods, Mitochondrial Diseases genetics, Mitochondrial Diseases diagnosis
- Abstract
Variants in mitochondrial genomes (mtDNA) can cause various neurological and mitochondrial diseases such as mitochondrial myopathy, encephalopathy, lactic acidosis, stroke-like episodes (MELAS). Given the 16 kb length of mtDNA, continuous sequencing is feasible using long-read sequencing (LRS). Herein, we aimed to show a simple and accessible method for comprehensive mtDNA sequencing with potential diagnostic applications for mitochondrial diseases using the compact and affordable LRS flow cell "Flongle." Whole mtDNA amplification (WMA) was performed using genomic DNA samples derived from four patients with mitochondrial diseases, followed by LRS using Flongle. We compared these results to those obtained using Cas9 enrichment. Additionally, the accuracy of heteroplasmy rates was assessed by incorporating mtDNA variants at equimolar levels. Finally, mtDNA from 19 patients with Parkinson's disease (PD) was sequenced using Flongle to identify disease risk-associated variants. mtDNA variants were detected in all four patients with mitochondrial diseases, with results comparable to those obtained from Cas9 enrichment. Heteroplasmy levels were accurately detected (r² > 0.99) via WMA using Flongle. A reported variant was identified in three patients with PD. In conclusion, Flongle can simplify the traditionally cumbersome and expensive mtDNA sequencing process, offering a streamlined and accessible approach to diagnosing mitochondrial diseases., (© 2024. The Author(s).)
- Published
- 2024
- Full Text
- View/download PDF
7. A simple method for finding related sequences by adding probabilities of alternative alignments.
- Author
-
Frith MC
- Subjects
- Software, Algorithms, Probability, Humans, Sequence Analysis, DNA methods, Computational Biology methods, Base Sequence, Sequence Alignment methods
- Abstract
The main way of analyzing genetic sequences is by finding sequence regions that are related to each other. There are many methods to do that, usually based on this idea: Find an alignment of two sequence regions, which would be unlikely to exist between unrelated sequences. Unfortunately, it is hard to tell if an alignment is likely to exist by chance. Also, the precise alignment of related regions is uncertain. One alignment does not hold all evidence that they are related. We should consider alternative alignments too. This is rarely done, because we lack a simple and fast method that fits easily into practical sequence-search software. Described here is the simplest-conceivable change to standard sequence alignment, which sums probabilities of alternative alignments and makes it easier to tell if a similarity is likely to occur by chance. This approach is better than standard alignment at finding distant relationships, at least in a few tests. It can be used in practical sequence-search software, with minimal increase in implementation difficulty or run time. It generalizes to different kinds of alignment, for example, DNA-versus-protein with frameshifts. Thus, it can widely contribute to finding subtle relationships between sequences., (© 2024 Frith; Published by Cold Spring Harbor Laboratory Press.)
- Published
- 2024
- Full Text
- View/download PDF
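The contrast drawn in the abstract of record 7, between scoring one best alignment and summing over alternative alignments, can be sketched roughly as follows. This is a minimal illustration and not the paper's method: it uses global alignment with a linear gap cost, treats scores as natural-log likelihood ratios, and the function name and the match/mismatch/gap values are all assumptions chosen for the example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def global_alignment_scores(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Return (best, total) for globally aligning strings a and b.

    best  : score of the single highest-scoring alignment (max in the recursion)
    total : log of the summed exp(score) over all alignment paths (sum in the recursion)
    Scores are hypothetical log likelihood ratios, assumed for illustration only.
    """
    n, m = len(a), len(b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    total = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Borders: only one path exists (all gaps), so max and sum agree.
    for i in range(1, n + 1):
        best[i][0] = total[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = total[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Standard alignment keeps only the best predecessor.
            best[i][j] = max(best[i - 1][j - 1] + s,
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
            # The summed version adds the contributions of all predecessors.
            total[i][j] = logsumexp([total[i - 1][j - 1] + s,
                                     total[i - 1][j] + gap,
                                     total[i][j - 1] + gap])
    return best[n][m], total[n][m]
```

Because the sum includes the best path, `total` is never smaller than `best`; the two differ most when many near-optimal alternative alignments exist.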
8. Evolution and subfamilies of HERVL human endogenous retrovirus.
- Author
-
Zhang H and Frith MC
- Abstract
Background: Endogenous retroviruses (ERVs), which blur the boundary between virus and transposable element, are genetic material derived from retroviruses and have important implications for evolution. This study examines the diversity and evolution of human endogenous retroviruses (HERVs) of the HERVL family, which has long terminal repeats (LTRs) named MLT2., Results: By probability-based sequence comparison, we uncover systematic annotation errors that conceal the true complexity and diversity of transposable elements (TEs) in the human genome. Our analysis identifies new subfamilies within the MLT2 group, proposes a refined classification scheme, and constructs new consensus sequences. We present an evolutionary analysis including phylogenetic trees that elucidate the relationships between these subfamilies and their contributions to human evolution. The results underscore the significance of accurate TE annotation in understanding genome evolution, highlighting the potential for misclassified TEs to impact interpretations of genomic studies., Availability and Implementation: Not applicable., Competing Interests: None declared., (© The Author(s) 2024. Published by Oxford University Press.)
- Published
- 2024
- Full Text
- View/download PDF
9. Cost-Effective Cas9-Mediated Targeted Sequencing of Spinocerebellar Ataxia Repeat Expansions.
- Author
-
Tachikawa K, Shimizu T, Imai T, Ko R, Kawai Y, Omae Y, Tokunaga K, Frith MC, Yamano Y, and Mitsuhashi S
- Subjects
- Humans, Cost-Benefit Analysis, Microsatellite Repeats genetics, Whole Genome Sequencing, High-Throughput Nucleotide Sequencing, CRISPR-Cas Systems, Spinocerebellar Ataxias diagnosis, Spinocerebellar Ataxias genetics
- Abstract
Hereditary repeat diseases are caused by an abnormal expansion of short tandem repeats in the genome. Among them, spinocerebellar ataxia (SCA) is a heterogeneous disease, and currently, 16 responsible repeats are known. Genetic diagnosis is obtained by analyzing the number of repeats through separate testing of each repeat. Although simultaneous detection of candidate repeats using current massively parallel sequencing technologies has been developed to avoid complicated multiple experiments, these methods are generally expensive. This study developed a cost-effective SCA repeat panel [Flongle SCA repeat panel sequencing (FLO-SCAp)] using Cas9-mediated targeted long-read sequencing and the smallest long-read sequencing apparatus, Flongle. This panel enabled the detection of repeat copy number changes, internal repeat sequences, and DNA methylation in seven patients with different repeat expansion diseases. The median (interquartile range) values of coverage and on-target rate were 39.5 (12 to 72) and 11.6% (7.5% to 16.5%), respectively. This approach was validated by comparing repeat copy number changes measured by FLO-SCAp and short-read whole-genome sequencing. A high correlation was observed between FLO-SCAp and short-read whole-genome sequencing when the repeat length was ≤250 bp (r = 0.98; P < 0.001). Thus, FLO-SCAp represents the most cost-effective method for conducting multiplex testing of repeats and can serve as the first-line diagnostic tool for SCA., (Copyright © 2024 Association for Molecular Pathology and American Society for Investigative Pathology. Published by Elsevier Inc. All rights reserved.)
- Published
- 2024
- Full Text
- View/download PDF
10. DNA Conserved in Diverse Animals Since the Precambrian Controls Genes for Embryonic Development.
- Author
-
Frith MC and Ni S
- Subjects
- Animals, DNA, Transcription Factors genetics, Embryonic Development genetics, Conserved Sequence genetics, Genes, Homeobox, Anthozoa genetics
- Abstract
DNA that controls gene expression (e.g. enhancers, promoters) has seemed almost never to be conserved between distantly related animals, like vertebrates and arthropods. This is mysterious, because development of such animals is partly organized by homologous genes with similar complex expression patterns, termed "deep homology." Here, we report 25 regulatory DNA segments conserved across bilaterian animals, of which 7 are also conserved in cnidaria (coral and sea anemone). They control developmental genes (e.g. Nr2f, Ptch, Rfx1/3, Sall, Smad6, Sp5, Tbx2/3), including six homeobox genes: Gsx, Hmx, Meis, Msx, Six1/2, and Zfhx3/4. The segments contain perfectly or near-perfectly conserved CCAAT boxes, E-boxes, and other sequences recognized by regulatory proteins. More such DNA conservation will surely be found soon, as more genomes are published and sequence comparison is optimized. This reveals a control system for animal development conserved since the Precambrian., (© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.)
- Published
- 2023
- Full Text
- View/download PDF
11. Biallelic structural variations within FGF12 detected by long-read sequencing in epilepsy.
- Author
-
Ohori S, Miyauchi A, Osaka H, Lourenco CM, Arakaki N, Sengoku T, Ogata K, Honjo RS, Kim CA, Mitsuhashi S, Frith MC, Seyama R, Tsuchida N, Uchiyama Y, Koshimizu E, Hamanaka K, Misawa K, Miyatake S, Mizuguchi T, Saito K, Fujita A, and Matsumoto N
- Subjects
- Humans, Fibroblast Growth Factors, Mutation, Missense, Epilepsy genetics
- Abstract
We discovered biallelic intragenic structural variations (SVs) in FGF12 by applying long-read whole genome sequencing to an exome-negative patient with developmental and epileptic encephalopathy (DEE). We also found another DEE patient carrying a biallelic (homozygous) single-nucleotide variant (SNV) in FGF12 that was detected by exome sequencing. FGF12 heterozygous recurrent missense variants with gain-of-function or heterozygous entire duplication of FGF12 are known causes of epilepsy, but biallelic SNVs/SVs have never been described. FGF12 encodes intracellular proteins interacting with the C-terminal domain of the alpha subunit of voltage-gated sodium channels 1.2, 1.5, and 1.6, promoting excitability by delaying fast inactivation of the channels. To validate the molecular pathomechanisms of these biallelic FGF12 SVs/SNV, highly sensitive gene expression analyses using lymphoblastoid cells from the patient with biallelic SVs, structural considerations, and Drosophila in vivo functional analysis of the SNV were performed, confirming loss-of-function. Our study highlights the importance of small SVs in Mendelian disorders, which may be overlooked by exome sequencing but can be detected efficiently by long-read whole genome sequencing, providing new insights into the pathomechanisms of human diseases., (© 2023 Ohori et al.)
- Published
- 2023
- Full Text
- View/download PDF
12. Improved DNA-Versus-Protein Homology Search for Protein Fossils.
- Author
-
Yao Y and Frith MC
- Abstract
Protein fossils, i.e., noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and faster. Of the ∼7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally. This is an extended version of a conference paper (Yao & Frith, 2021).
- Published
- 2023
- Full Text
- View/download PDF
13. How to optimally sample a sequence for rapid analysis.
- Author
-
Frith MC, Shaw J, and Spouge JL
- Subjects
- Sequence Analysis, DNA methods, Algorithms, Software
- Abstract
Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal., Results: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible., Availability and Implementation: Source code is freely available at https://gitlab.com/mcfrith/noverlap., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2023. Published by Oxford University Press.)
- Published
- 2023
- Full Text
- View/download PDF
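For orientation, the minimizer scheme that the abstract of record 13 names as a previous sequence-sampling method can be sketched as below. This illustrates only that earlier heuristic, not the optimized approach the paper presents; the function name, the lexicographic ordering of k-mers, and the leftmost tie-break are assumptions made for the example.

```python
def minimizer_positions(seq, k, w):
    """Sample k-mer start positions from seq using a simple minimizer scheme.

    In every window of w consecutive k-mers, the lexicographically smallest
    k-mer is selected (leftmost on ties). Returns the sorted set of selected
    start positions. Ordering and tie-break are illustrative assumptions.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer in this window; break ties by position.
        chosen.add(min(window, key=lambda i: (kmers[i], i)))
    return sorted(chosen)


# Example: 2-mers of "ACGTACGT", windows of 3 consecutive 2-mers.
print(minimizer_positions("ACGTACGT", k=2, w=3))  # [0, 1, 4]
```

Every window contributes one selected position, so no stretch of w consecutive k-mers goes unsampled, while adjacent windows often share a minimizer and keep the sample sparse.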
14. An immune-suppressing protein in human endogenous retroviruses.
- Author
-
Zhang H, Ni S, and Frith MC
- Abstract
Motivation: Retroviruses are important contributors to disease and evolution in vertebrates. Sometimes, retrovirus DNA is heritably inserted in a vertebrate genome: an endogenous retrovirus (ERV). Vertebrate genomes have many such virus-derived fragments, usually with mutations disabling their original functions., Results: Some primate ERVs appear to encode an overlooked protein. This protein is homologous to protein MC132 from Molluscum contagiosum virus, which is a human poxvirus, not a retrovirus. MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now. The ERV homologs of MC132 in the human genome are mostly disrupted by mutations, but there is an intact copy on chromosome 4. We found homologs of MC132 in ERVs of apes, monkeys and bushbaby, but not tarsiers, lemurs or non-primates. This suggests that some primate retroviruses had, or have, an extra immune-suppressing protein, which underwent horizontal genetic transfer between unrelated viruses., Contact: mcfrith@edu.k.u-tokyo.ac.jp., (© The Author(s) 2023. Published by Oxford University Press.)
- Published
- 2023
- Full Text
- View/download PDF
15. Finding Rearrangements in Nanopore DNA Reads with LAST and dnarrange.
- Author
-
Frith MC and Mitsuhashi S
- Subjects
- Humans, DNA, Sequence Analysis, DNA methods, Genome, Gene Rearrangement, High-Throughput Nucleotide Sequencing methods, Nanopores
- Abstract
Long-read DNA sequencing techniques such as nanopore are especially useful for characterizing complex sequence rearrangements, which occur in some genetic diseases and also during evolution. Analyzing the sequence data to understand such rearrangements is not trivial, due to sequencing error, rearrangement intricacy, and abundance of repeated similar sequences in genomes. The LAST and dnarrange software packages can resolve complex relationships between DNA sequences and characterize changes such as gene conversion, processed pseudogene insertion, and chromosome shattering. They can filter out numerous rearrangements shared by controls, e.g., healthy humans versus a patient, to focus on rearrangements unique to the patient. One useful ingredient is last-train, which learns the rates (probabilities) of deletions, insertions, and each kind of base match and mismatch. These probabilities are then used to find the most likely sequence relationships/alignments, which is especially useful for DNA with unusual rates, such as DNA from Plasmodium falciparum (malaria) with ∼80% a+t. This is also useful for less-studied species that lack reference genomes, so the DNA reads are compared to a different species' genome. We also point out that a reference genome with ancestral alleles would be ideal., (© 2023. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.)
- Published
- 2023
- Full Text
- View/download PDF
16. Analysis of Tandem Repeat Expansions Using Long DNA Reads.
- Author
-
Mitsuhashi S and Frith MC
- Subjects
- Humans, Tandem Repeat Sequences genetics, Sequence Analysis, DNA, DNA genetics, High-Throughput Nucleotide Sequencing, Nanopores
- Abstract
Abnormal expansion or shortening of tandem repeats can cause a variety of genetic diseases. The use of long DNA reads has facilitated the analysis of disease-causing repeats in the human genome. Long read sequencers enable us to directly analyze repeat length and sequence content by covering whole repeats; they are therefore considered suitable for the analysis of long tandem repeats. Here, we describe an expanded repeat analysis using target sequencing data produced by the Oxford Nanopore Technologies (hereafter referred to as ONT) nanopore sequencer., (© 2023. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.)
- Published
- 2023
- Full Text
- View/download PDF
17. Paleozoic Protein Fossils Illuminate the Evolution of Vertebrate Genomes and Transposable Elements.
- Author
-
Frith MC
- Subjects
- Animals, Evolution, Molecular, Humans, Regulatory Sequences, Nucleic Acid, Retroelements, Vertebrates genetics, DNA Transposable Elements genetics, Fossils
- Abstract
Genomes hold a treasure trove of protein fossils: Fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (eight from TEs and two from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics," where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently conserved DNA segments. This paves the way to further studies of ancient protein fossils., (© The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.)
- Published
- 2022
- Full Text
- View/download PDF
18. Author Correction: Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network.
- Author
-
Grapotte M, Saraswat M, Bessière C, Menichelli C, Ramilowski JA, Severin J, Hayashizaki Y, Itoh M, Tagami M, Murata M, Kojima-Ishiyama M, Noma S, Noguchi S, Kasukawa T, Hasegawa A, Suzuki H, Nishiyori-Sueki H, Frith MC, Chatelain C, Carninci P, de Hoon MJL, Wasserman WW, Bréhélin L, and Lecellier CH
- Published
- 2022
- Full Text
- View/download PDF
19. Long-read whole-genome sequencing identified a partial MBD5 deletion in an exome-negative patient with neurodevelopmental disorder.
- Author
-
Ohori S, Tsuburaya RS, Kinoshita M, Miyagi E, Mizuguchi T, Mitsuhashi S, Frith MC, and Matsumoto N
- Subjects
- Adult, Exome genetics, Haploinsufficiency genetics, Humans, Male, Mutagenesis, Insertional genetics, Neurodevelopmental Disorders pathology, Retroelements genetics, Whole Genome Sequencing, DNA-Binding Proteins genetics, Genetic Predisposition to Disease, Neurodevelopmental Disorders genetics
- Abstract
Whole-exome sequencing (WES) can detect not only single-nucleotide variants in causal genes, but also pathogenic copy-number variations using several methods. However, there may be overlooked pathogenic variations in the out of target genome regions of WES analysis (e.g., promoters), leaving many patients undiagnosed. Whole-genome sequencing (WGS) can potentially analyze such regions. We applied long-read nanopore WGS and our recently developed analysis pipeline "dnarrange" to a patient who was undiagnosed by trio-based WES analysis, and identified a heterozygous 97-kb deletion partially involving 5'-untranslated exons of MBD5, which was outside the WES target regions. The phenotype of the patient, a 32-year-old male, was consistent with haploinsufficiency of MBD5. The transcript level of MBD5 in the patient's lymphoblastoid cells was reduced. We therefore concluded that the partial MBD5 deletion is the culprit for this patient. Furthermore, we found other rare structural variations (SVs) in this patient, i.e., a large inversion and a retrotransposon insertion, which were not seen in 33 controls. Although we considered that they are benign SVs, this finding suggests that our pipeline using long-read WGS is useful for investigating various types of potentially pathogenic SVs. In conclusion, we identified a 97-kb deletion, which causes haploinsufficiency of MBD5 in a patient with neurodevelopmental disorder, demonstrating that long-read WGS is a powerful technique to discover pathogenic SVs.
- Published
- 2021
- Full Text
- View/download PDF
20. Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network.
- Author
-
Grapotte M, Saraswat M, Bessière C, Menichelli C, Ramilowski JA, Severin J, Hayashizaki Y, Itoh M, Tagami M, Murata M, Kojima-Ishiyama M, Noma S, Noguchi S, Kasukawa T, Hasegawa A, Suzuki H, Nishiyori-Sueki H, Frith MC, Chatelain C, Carninci P, de Hoon MJL, Wasserman WW, Bréhélin L, and Lecellier CH
- Subjects
- A549 Cells, Animals, Base Sequence, Computational Biology methods, Deep Learning, Enhancer Elements, Genetic, Genome, Human, High-Throughput Nucleotide Sequencing, Humans, Mice, Neurodegenerative Diseases diagnosis, Neurodegenerative Diseases metabolism, Polymorphism, Genetic, Promoter Regions, Genetic, Microsatellite Repeats, Neural Networks, Computer, Neurodegenerative Diseases genetics, Transcription Initiation Site, Transcription Initiation, Genetic
- Abstract
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.
- Published
- 2021
- Full Text
- View/download PDF
21. Nanopore direct RNA sequencing detects DUX4-activated repeats and isoforms in human muscle cells.
- Author
-
Mitsuhashi S, Nakagawa S, Sasaki-Honda M, Sakurai H, Frith MC, and Mitsuhashi H
- Subjects
- Cell Line, Tumor, Gene Expression Profiling, Gene Expression Regulation, Humans, Muscle Cells metabolism, Muscular Dystrophy, Facioscapulohumeral pathology, Protein Isoforms genetics, RNA Isoforms genetics, Reverse Transcriptase Polymerase Chain Reaction, Sequence Analysis, RNA statistics & numerical data, Homeodomain Proteins genetics, Muscle, Skeletal metabolism, Muscular Dystrophy, Facioscapulohumeral genetics, Nanopores, Repetitive Sequences, Nucleic Acid genetics, Sequence Analysis, RNA methods
- Abstract
Facioscapulohumeral muscular dystrophy (FSHD) is an inherited muscle disease caused by misexpression of the DUX4 gene in skeletal muscle. DUX4 is a transcription factor, which is normally expressed in the cleavage-stage embryo and regulates gene expression involved in early embryonic development. Recent studies revealed that DUX4 also activates the transcription of repetitive elements such as endogenous retroviruses (ERVs), mammalian apparent long terminal repeat (LTR)-retrotransposons and pericentromeric satellite repeats (Human Satellite II). DUX4-bound ERV sequences also create alternative promoters for genes or long non-coding RNAs, producing fusion transcripts. To further understand transcriptional regulation by DUX4, we performed nanopore long-read direct RNA sequencing (dRNA-seq) of human muscle cells induced by DUX4, because long reads show whole isoforms with greater confidence. We successfully detected differential expression of known DUX4-induced genes and discovered 61 differentially expressed repeat loci, which are near DUX4-ChIP peaks. We also identified 247 gene-ERV fusion transcripts, of which 216 were not reported previously. In addition, long-read dRNA-seq clearly shows that RNA splicing is a common event in DUX4-activated ERV transcripts. Long-read analysis showed non-LTR transposons including Alu elements are also transcribed from LTRs. Our findings revealed further complexity of DUX4-induced ERV transcripts. This catalogue of DUX4-activated repetitive elements may provide useful information to elucidate the pathology of FSHD. Also, our results indicate that nanopore dRNA-seq has complementary strengths to conventional short-read complementary DNA sequencing., (© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2021
- Full Text
- View/download PDF
22. Significant non-existence of sequences in genomes and proteomes.
- Author
-
Koulouras G and Frith MC
- Subjects
- Animals, Genome, Humans, Markov Chains, Mutation, Peptides chemistry, Proteome, Software, Viruses genetics, Databases, Genetic, Genomics methods, Proteomics methods
- Abstract
Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes., (© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.)
- Published
- 2021
- Full Text
- View/download PDF
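The definition of a minimal absent word in entry 22 (a word that is absent although all of its shorter parts are present) can be illustrated with a brute-force sketch. This is only an illustration of the definition, not the authors' software: the function name, the default alphabet and the `max_len` cutoff are assumptions, and the statistical step (Markov models with multiple-testing correction) is not attempted here.

```python
from itertools import product

def minimal_absent_words(seq, max_len, alphabet="acgt"):
    """Return words of length <= max_len that are absent from seq
    although every proper substring of them occurs in seq."""
    # All substrings of seq up to max_len letters long.
    present = {seq[i:j]
               for i in range(len(seq))
               for j in range(i + 1, min(len(seq), i + max_len) + 1)}
    maws = []
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if word in present:
                continue
            # It suffices to check the longest proper prefix and suffix:
            # if both occur, every shorter substring occurs too.
            if k == 1 or (word[:-1] in present and word[1:] in present):
                maws.append(word)
    return maws

example = minimal_absent_words("aaca", 3, alphabet="ac")
```

Enumerating all words is exponential in `max_len`, so this is only practical for very short words and toy sequences.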
23. Minimally overlapping words for sequence similarity search.
- Author
-
Frith MC, Noé L, and Kucherov G
- Abstract
Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence., Results: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it., Availability and Implementation: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2020. Published by Oxford University Press.)
- Published
- 2021
- Full Text
- View/download PDF
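The sparse-seeding idea in entry 23 (placing seeds only where certain words such as ac, at, gc or gt occur) can be sketched as follows. This is a minimal illustration, not the noverlap software: the function name and the example sequence are assumptions.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any of the given
    equal-length words occurs; seeds would be placed only there."""
    lengths = {len(w) for w in words}
    assert len(lengths) == 1, "this sketch assumes words of equal length"
    k = lengths.pop()
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in words]

# The example word set from the abstract: each word starts with a or g
# and ends with c or t, so no word can start inside another occurrence.
words = {"ac", "at", "gc", "gt"}
positions = word_positions("gattacagtc", words)
gaps = [b - a for a, b in zip(positions, positions[1:])]
```

Because no word in this set can overlap another, consecutive seed positions are always at least two bases apart, which is the anti-clumping property the abstract refers to.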
24. Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population.
- Author
-
Mitsuhashi S, Frith MC, and Matsumoto N
- Subjects
- Humans, Genome, Human, Genome-Wide Association Study, Polymorphism, Single Nucleotide, Polymorphism, Genetic, Genetic Predisposition to Disease, Tandem Repeat Sequences genetics, Nanopore Sequencing
- Abstract
Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats., Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci., Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5'UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with nearby GWAS SNP genotypes., Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long read sequencing data. This method, especially if used in genome-wide association studies, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.
- Published
- 2021
- Full Text
- View/download PDF
25. lamassemble: Multiple Alignment and Consensus Sequence of Long Reads.
- Author
-
Frith MC, Mitsuhashi S, and Katoh K
- Subjects
- Animals, Genetic Techniques, High-Throughput Nucleotide Sequencing, Humans, Nanopores, Consensus Sequence, Genomics methods, Sequence Alignment methods, Sequence Analysis, DNA methods, Software
- Abstract
Long DNA and RNA reads from nanopore and PacBio technologies have many applications, but the raw reads have a substantial error rate. More accurate sequences can be obtained by merging multiple reads from overlapping parts of the same sequence. lamassemble aligns up to ∼1000 reads to each other, and makes a consensus sequence, which is often much more accurate than the raw reads. It is useful for studying a region of interest such as an expanded tandem repeat or other disease-causing mutation.
- Published
- 2021
- Full Text
- View/download PDF
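The idea in entry 25 of merging several error-prone reads into a more accurate consensus can be illustrated with a simple column-wise majority vote. This is a deliberately simplified sketch and not how lamassemble works internally: it assumes the reads have already been aligned to each other and padded with "-" to equal length, and the function name is hypothetical.

```python
from collections import Counter

def column_consensus(aligned_reads):
    """Majority-vote consensus of pre-aligned, gap-padded reads."""
    assert len({len(r) for r in aligned_reads}) == 1, "reads must be equal length"
    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this alignment column.
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":  # a majority gap means the column is dropped
            consensus.append(symbol)
    return "".join(consensus)

reads = ["acgt-a",
         "acgtta",
         "aggt-a"]
merged = column_consensus(reads)
```

Each read here contains at most one error relative to the others, and the vote removes both the substitution in the third read and the insertion in the second.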
26. Long-read DNA sequencing fully characterized chromothripsis in a patient with Langer-Giedion syndrome and Cornelia de Lange syndrome-4.
- Author
-
Lei M, Liang D, Yang Y, Mitsuhashi S, Katoh K, Miyake N, Frith MC, Wu L, and Matsumoto N
- Subjects
- Child, Chromosome Deletion, De Lange Syndrome diagnosis, De Lange Syndrome physiopathology, Genome, Humans, Langer-Giedion Syndrome diagnosis, Langer-Giedion Syndrome physiopathology, Male, Nanopore Sequencing, Phenotype, Sequence Analysis, DNA, Translocation, Genetic, Exostosin 1, Cell Cycle Proteins genetics, Chromothripsis, DNA-Binding Proteins genetics, De Lange Syndrome genetics, Langer-Giedion Syndrome genetics, N-Acetylglucosaminyltransferases genetics
- Abstract
Chromothripsis is a type of chaotic complex genomic rearrangement caused by a single event of chromosomal shattering and repair processes. Chromothripsis is known to cause rare congenital diseases when it occurs in germline cells, however, current genome analysis technologies have difficulty in detecting and deciphering chromothripsis. It is possible that this type of complex rearrangement may be overlooked in rare-disease patients whose genetic diagnosis is unsolved. We applied long read nanopore sequencing and our recently developed analysis pipeline dnarrange to a patient who has a reciprocal chromosomal translocation t(8;18)(q22;q21) as a result of chromothripsis between the two chromosomes, and fully characterize the complex rearrangements at the translocation site. The patient genome was evidently shattered into 19 fragments, and rejoined into derivative chromosomes in a random order and orientation. The reconstructed patient genome indicates loss of five genomic regions, which all overlap with microarray-detected copy number losses. We found that two disease-related genes RAD21 and EXT1 were lost by chromothripsis. These two genes could fully explain the disease phenotype with facial dysmorphisms and bone abnormality, which is likely a contiguous gene syndrome, Cornelia de Lange syndrome type IV (CdLs-4) and atypical Langer-Giedion syndrome (LGS), also known as trichorhinophalangeal syndrome type II (TRPSII). This provides evidence that our approach based on long read sequencing can fully characterize chromothripsis in a patient's genome, which is important for understanding the phenotype of disease caused by complex genomic rearrangement.
- Published
- 2020
- Full Text
- View/download PDF
27. A pipeline for complete characterization of complex germline rearrangements from long DNA reads.
- Author
-
Mitsuhashi S, Ohori S, Katoh K, Frith MC, and Matsumoto N
- Subjects
- Chromosome Aberrations, Chromosome Breakpoints, Genetic Association Studies methods, Genetic Predisposition to Disease, Genome, Human, Genomics methods, High-Throughput Nucleotide Sequencing, Humans, Translocation, Genetic, Whole Genome Sequencing, Gene Rearrangement, Genome-Wide Association Study, Germ-Line Mutation
- Abstract
Background: Many genetic/genomic disorders are caused by genomic rearrangements. Standard methods can often characterize these variations only partly, e.g., copy number changes or breakpoints. It is important to fully understand the order and orientation of rearranged fragments, with precise breakpoints, to know the pathogenicity of the rearrangements., Methods: We performed whole-genome-coverage nanopore sequencing of long DNA reads from four patients with chromosomal translocations. We identified rearrangements relative to a reference human genome, subtracted rearrangements shared by any of 33 control individuals, and determined the order and orientation of rearranged fragments, with our newly developed analysis pipeline., Results: We describe the full characterization of complex chromosomal rearrangements, by filtering out genomic rearrangements seen in controls without the same disease, reducing the number of loci per patient from a few thousand to a few dozen. Breakpoint detection was very accurate; we usually see ~0 ± 1 base difference from Sanger sequencing-confirmed breakpoints. For one patient with two reciprocal chromosomal translocations, we find that the translocation points have complex rearrangements of multiple DNA fragments involving 5 chromosomes, which we could order and orient by an automatic algorithm, thereby fully reconstructing the rearrangement. A rearrangement is more than the sum of its parts: some properties, such as sequence loss, can be inferred only after reconstructing the whole rearrangement. In this patient, the rearrangements were evidently caused by shattering of the chromosomes into multiple fragments, which rejoined in a different order and orientation with loss of some fragments., Conclusions: We developed an effective analytic pipeline to find chromosomal aberration in congenital diseases by filtering benign changes, only from long read sequencing. Our algorithm for reconstruction of complex rearrangements is useful to interpret rearrangements with many breakpoints, e.g., chromothripsis. Our approach promises to fully characterize many congenital germline rearrangements, provided they do not involve poorly understood loci such as centromeric repeats.
- Published
- 2020
- Full Text
- View/download PDF
28. Long-read sequencing identifies the pathogenic nucleotide repeat expansion in RFC1 in a Japanese case of CANVAS.
- Author
-
Nakamura H, Doi H, Mitsuhashi S, Miyatake S, Katoh K, Frith MC, Asano T, Kudo Y, Ikeda T, Kubota S, Kunii M, Kitazawa Y, Tada M, Okamoto M, Joki H, Takeuchi H, Matsumoto N, and Tanaka F
- Subjects
- Aged, 80 and over, Asian People, Bilateral Vestibulopathy diagnosis, Cerebellar Ataxia diagnosis, Female, Humans, Japan, Nedd4 Ubiquitin Protein Ligases genetics, Bilateral Vestibulopathy genetics, Cerebellar Ataxia genetics, DNA Repeat Expansion, Replication Protein C genetics, Sequence Analysis, DNA
- Abstract
Recently, a recessively inherited intronic repeat expansion in replication factor C1 (RFC1) was identified in cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS). Here, we describe a Japanese case of genetically confirmed CANVAS with autonomic failure and auditory hallucination. The case showed impaired uptake of iodine-123-metaiodobenzylguanidine and 123I-ioflupane in the cardiac sympathetic nerve and dopaminergic neurons, respectively, by single-photon emission computed tomography. Long-read sequencing identified biallelic pathogenic (AAGGG)n nucleotide repeat expansion in RFC1 and heterozygous benign (TAAAA)n and (TAGAA)n expansions in brain expressed, associated with NEDD4 (BEAN1). Enrichment of the repeat regions in RFC1 and BEAN1 using a Cas9-mediated system clearly distinguished between pathogenic and benign repeat expansions. The haplotype around RFC1 indicated that the (AAGGG)n expansion in our case was on the same ancestral allele as that of European cases. Thus, long-read sequencing facilitates precise genetic diagnosis of diseases with complex repeat structures and various expansions.
- Published
- 2020
- Full Text
- View/download PDF
29. How sequence alignment scores correspond to probability models.
- Author
-
Frith MC
- Subjects
- Markov Chains, Probability, Reproducibility of Results, Sequence Alignment, Algorithms, Models, Statistical
- Abstract
Motivation: Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments., Results: This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2019. Published by Oxford University Press.)
- Published
- 2020
- Full Text
- View/download PDF
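As a small illustration of the theme of entry 29, that alignment scores correspond to probabilities, the sketch below handles only the classic ungapped case: given substitution scores and background letter frequencies, it finds the scale factor lam for which freq(x) * freq(y) * exp(lam * score) sums to 1 over all letter pairs, so that these terms can be read as pair probabilities. This is not the pair-HMM construction from the paper. The +1/-1 scoring scheme, the uniform base frequencies and all names are assumptions, and 1/lam is only loosely analogous to the 'temperature' mentioned in the abstract.

```python
import math

def solve_scale(scores, freqs):
    """Find lam > 0 such that sum over pairs of
    freqs[x] * freqs[y] * exp(lam * score) equals 1.
    Assumes a negative expected score and at least one positive score."""
    def excess(lam):
        return sum(freqs[x] * freqs[y] * math.exp(lam * s)
                   for (x, y), s in scores.items()) - 1.0
    lo, hi = 1e-6, 1.0
    while excess(hi) <= 0:      # expand until the root is bracketed
        hi *= 2
    for _ in range(200):        # plain bisection
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

bases = "acgt"
freqs = {b: 0.25 for b in bases}
scores = {(x, y): (1 if x == y else -1) for x in bases for y in bases}

lam = solve_scale(scores, freqs)
pair_probs = {(x, y): freqs[x] * freqs[y] * math.exp(lam * s)
              for (x, y), s in scores.items()}
identity = sum(p for (x, y), p in pair_probs.items() if x == y)
```

For this toy scheme the scale factor works out to ln 3, and the implied probability that two aligned bases are identical is 0.75, so a +1/-1 scoring scheme implicitly targets alignments with about 75% identity.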
30. Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease.
- Author
-
Sone J, Mitsuhashi S, Fujita A, Mizuguchi T, Hamanaka K, Mori K, Koike H, Hashiguchi A, Takashima H, Sugiyama H, Kohno Y, Takiyama Y, Maeda K, Doi H, Koyano S, Takeuchi H, Kawamoto M, Kohara N, Ando T, Ieda T, Kita Y, Kokubun N, Tsuboi Y, Katoh K, Kino Y, Katsuno M, Iwasaki Y, Yoshida M, Tanaka F, Suzuki IK, Frith MC, Matsumoto N, and Sobue G
- Subjects
- Adolescent, Adult, Aged, Brain metabolism, Case-Control Studies, Female, Genetic Markers genetics, Humans, Intranuclear Inclusion Bodies genetics, Intranuclear Inclusion Bodies pathology, Male, Middle Aged, Pedigree, Receptors, Notch metabolism, Young Adult, Brain pathology, High-Throughput Nucleotide Sequencing methods, Linkage Disequilibrium, Neurodegenerative Diseases genetics, Neurodegenerative Diseases pathology, Receptors, Notch genetics, Trinucleotide Repeat Expansion genetics
- Abstract
Neuronal intranuclear inclusion disease (NIID) is a progressive neurodegenerative disease that is characterized by eosinophilic hyaline intranuclear inclusions in neuronal and somatic cells. The wide range of clinical manifestations in NIID makes ante-mortem diagnosis difficult [1-8], but skin biopsy enables its ante-mortem diagnosis [9-12]. The average onset age is 59.7 years among approximately 140 NIID cases consisting of mostly sporadic and several familial cases. By linkage mapping of a large NIID family with several affected members (Family 1), we identified a 58.1 Mb linked region at 1p22.1-q21.3 with a maximum logarithm of the odds score of 4.21. By long-read sequencing, we identified a GGC repeat expansion in the 5' region of NOTCH2NLC (Notch 2 N-terminal like C) in all affected family members. Furthermore, we found similar expansions in 8 unrelated families with NIID and 40 sporadic NIID cases. We observed abnormal anti-sense transcripts in fibroblasts specifically from patients but not unaffected individuals. This work shows that repeat expansion in human-specific NOTCH2NLC, a gene that evolved by segmental duplication, causes a human disease.
- Published
- 2019
- Full Text
- View/download PDF
31. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads.
- Author
-
Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, and Matsumoto N
- Subjects
- Adult, Algorithms, Computational Biology methods, Genetic Predisposition to Disease, Humans, Epilepsies, Myoclonic genetics, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software, Tandem Repeat Sequences, Whole Genome Sequencing methods
- Abstract
Tandemly repeated DNA is highly mutable and causes at least 31 diseases, but it is hard to detect pathogenic repeat expansions genome-wide. Here, we report robust detection of human repeat expansions from careful alignments of long but error-prone (PacBio and nanopore) reads to a reference genome. Our method is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we prioritize pathogenic expansions within the top 10 out of 700,000 tandem repeats in whole genome sequencing data. This may help to elucidate the many genetic diseases whose causes remain unknown.
- Published
- 2019
- Full Text
- View/download PDF
32. NanoPipe-a web server for nanopore MinION sequencing data analysis.
- Author
-
Shabardina V, Kischka T, Manske F, Grundmann N, Frith MC, Suzuki Y, and Makałowski W
- Subjects
- Algorithms, Animals, Eukaryota genetics, Genomics methods, Humans, Genome, High-Throughput Nucleotide Sequencing methods, Sequence Analysis, DNA methods, Software
- Abstract
Background: The fast-moving progress of the third-generation long-read sequencing technologies will soon bring the biological and medical sciences to a new era of research. Altogether, the technique and experimental procedures are becoming more straightforward and available to biologists from diverse fields, even without any profound experience in DNA sequencing. Thus, the introduction of the MinION device by Oxford Nanopore Technologies promises to "bring sequencing technology to the masses" and also allows quick and operative analysis in field studies. However, the convenience of this sequencing technology dramatically contrasts with the available analysis tools, which may significantly reduce enthusiasm of a "regular" user. To really bring the sequencing technology to every biologist, we need a set of user-friendly tools that can perform a powerful analysis in an automatic manner., Findings: NanoPipe was developed in consideration of the specifics of the MinION sequencing technologies, providing accordingly adjusted alignment parameters. The range of the target species/sequences for the alignment is not limited, and the descriptive usage page of NanoPipe helps a user to succeed with NanoPipe analysis. The results contain alignment statistics, consensus sequence, polymorphisms data, and visualization of the alignment. Several test cases are used to demonstrate the efficiency of the tool., Conclusions: Freely available NanoPipe software allows effortless and reliable analysis of MinION sequencing data for experienced bioinformaticians, as well as for wet-lab biologists with minimum bioinformatics knowledge. Moreover, for the latter group, we describe the basic algorithm necessary for MinION sequencing analysis from the first to last step., (© The Author(s) 2019. Published by Oxford University Press.)
- Published
- 2019
- Full Text
- View/download PDF
33. Evaluation and application of RNA-Seq by MinION.
- Author
-
Seki M, Katsumata E, Suzuki A, Sereewattanawoot S, Sakamoto Y, Mizushima-Sugano J, Sugano S, Kohno T, Frith MC, Tsuchihara K, and Suzuki Y
- Subjects
- DNA, Complementary, High-Throughput Nucleotide Sequencing methods, Humans, Polymorphism, Single Nucleotide, Alleles, Gene Expression Profiling methods, RNA Splicing, Sequence Analysis, RNA methods
- Abstract
The current RNA-Seq method analyses fragments of mRNAs, from which it is occasionally difficult to reconstruct the entire transcript structure. Here, we performed and evaluated the recent procedure for full-length cDNA sequencing using the Nanopore sequencer MinION. We applied MinION RNA-Seq for various applications, which would not always be easy using the usual RNA-Seq by Illumina. First, we examined and found that even though the sequencing accuracy was still limited to 92.3%, practically useful RNA-Seq analysis is possible. Particularly, taking advantage of the long-read nature of MinION, we demonstrate the identification of splicing patterns and their combinations as a form of full-length cDNAs without losing precise information concerning their expression levels. Transcripts of fusion genes in cancer cells can also be identified and characterized. Furthermore, the full-length cDNA information can be used for phasing of the SNPs detected by WES on the transcripts, providing essential information to identify allele-specific transcriptional events. We constructed a catalogue of full-length cDNAs in seven major organs for two particular individuals and identified allele-specific transcription and splicing. Finally, we demonstrate that single-cell sequencing is also possible. RNA-Seq on the MinION platform should provide a novel approach that is complementary to the current RNA-Seq., (© The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.)
- Published
- 2019
- Full Text
- View/download PDF
34. Biallelic COLGALT1 variants are associated with cerebral small vessel disease.
- Author
-
Miyatake S, Schneeberger S, Koyama N, Yokochi K, Ohmura K, Shiina M, Mori H, Koshimizu E, Imagawa E, Uchiyama Y, Mitsuhashi S, Frith MC, Fujita A, Satoh M, Taguri M, Tomono Y, Takahashi K, Doi H, Takeuchi H, Nakashima M, Mizuguchi T, Takata A, Miyake N, Saitsu H, Tanaka F, Ogata K, Hennet T, and Matsumoto N
- Subjects
- Cell Line, Transformed, Cerebral Small Vessel Diseases diagnostic imaging, Child, DNA Mutational Analysis, Glucosyltransferases metabolism, Humans, Magnetic Resonance Imaging, Male, Models, Molecular, Mutagenesis, RNA, Messenger metabolism, Transfection, Cerebral Small Vessel Diseases genetics, Collagen Type IV genetics, Genetic Predisposition to Disease genetics, Mutation genetics
- Abstract
Objective: Approximately 5% of cerebral small vessel diseases are hereditary, which include COL4A1/COL4A2-related disorders. COL4A1/COL4A2 encode type IV collagen α1/2 chains in the basement membranes of cerebral vessels. COL4A1/COL4A2 mutations impair the secretion of collagen to the extracellular matrix, thereby resulting in vessel fragility. The diagnostic yield for COL4A1/COL4A2 variants is around 20 to 30%, suggesting other mutated genes might be associated with this disease. This study aimed to identify novel genes that cause COL4A1/COL4A2-related disorders., Methods: Whole exome sequencing was performed in 2 families with suspected COL4A1/COL4A2-related disorders. We validated the role of COLGALT1 variants by constructing a 3-dimensional structural model, evaluating collagen β (1-O) galactosyltransferase 1 (ColGalT1) protein expression and ColGalT activity by Western blotting and collagen galactosyltransferase assays, and performing in vitro RNA interference and rescue experiments., Results: Exome sequencing demonstrated biallelic variants in COLGALT1 encoding ColGalT1, which was involved in the post-translational modification of type IV collagen in 2 unrelated patients: c.452T > G (p.Leu151Arg) and c.1096delG (p.Glu366Argfs*15) in Patient 1, and c.460G > C (p.Ala154Pro) and c.1129G > C (p.Gly377Arg) in Patient 2. Three-dimensional model analysis suggested that p.Leu151Arg and p.Ala154Pro destabilized protein folding, which impaired enzymatic activity. ColGalT1 protein expression and ColGalT activity in Patient 1 were undetectable. RNA interference studies demonstrated that reduced ColGalT1 altered COL4A1 secretion, and rescue experiments showed that mutant COLGALT1 insufficiently restored COL4A1 production in cells compared with wild type., Interpretation: Biallelic COLGALT1 variants cause cerebral small vessel abnormalities through a common molecular pathogenesis with COL4A1/COL4A2-related disorders. Ann Neurol 2018;84:843-853., (© 2018 American Neurological Association.)
- Published
- 2018
- Full Text
- View/download PDF
35. A Simplified Description of Child Tables for Sequence Similarity Search.
- Author
-
Frith MC and Shrestha AMS
- Subjects
- Algorithms, Models, Statistical, Computational Biology methods, Sequence Alignment methods, Sequence Analysis methods, Software
- Abstract
Finding related nucleotide or protein sequences is a fundamental, diverse, and incompletely-solved problem in bioinformatics. It is often tackled by seed-and-extend methods, which first find "seed" matches of diverse types, such as spaced seeds, subset seeds, or minimizers. Seeds are usually found using an index of the reference sequence(s), which stores seed positions in a suffix array or related data structure. A child table is a fundamental way to achieve fast lookup in an index, but previous descriptions have been overly complex. This paper aims to provide a more accessible description of child tables, and demonstrate their generality: they apply equally to all the above-mentioned seed types and more. We also show that child tables can be used without LCP (longest common prefix) tables, reducing the memory requirement.
- Published
- 2018
- Full Text
- View/download PDF
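The index lookup that entry 35 builds on (seed positions stored in a suffix array) can be sketched with a plain binary search. This is a minimal illustration only: it does not implement child tables, which the paper describes as a way to make such lookups faster, and the naive suffix-array construction and function names are assumptions suited to short toy sequences.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (fine for toy input)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text via binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

reference = "gattacagatta"
sa = build_suffix_array(reference)
hits = find_occurrences(reference, sa, "atta")
```

Each lookup costs a number of string comparisons logarithmic in the reference length, which is the baseline that faster index structures aim to improve on.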
36. Nanopore sequencing of drug-resistance-associated genes in malaria parasites, Plasmodium falciparum.
- Author
-
Runtuwene LR, Tuda JSB, Mongan AE, Makalowski W, Frith MC, Imwong M, Srisutham S, Nguyen Thi LA, Tuan NN, Eshita Y, Maeda R, Yamagishi J, and Suzuki Y
- Subjects
- Animals, Antimalarials pharmacology, Genotype, Humans, Malaria, Falciparum parasitology, Mutation drug effects, Nanopores, Parasites genetics, Plasmodium falciparum drug effects, Sequence Analysis, DNA instrumentation, Thailand, Vietnam, Drug Resistance genetics, Plasmodium falciparum genetics, Sequence Analysis, DNA methods
- Abstract
Here, we report the application of a portable sequencer, MinION, for genotyping the malaria parasite Plasmodium falciparum. In the present study, an amplicon mixture of nine representative genes causing resistance to anti-malaria drugs is diagnosed. First, we developed the procedure for four laboratory strains (3D7, Dd2, 7G8, and K1), and then applied the developed procedure to ten clinical samples. We sequenced and re-sequenced the samples using the obsolete flow cell R7.3 and the most recent flow cell R9.4. Although the average base-call accuracy of the MinION sequencer was 74.3%, performing >50 reads at a given position improves the accuracy of the SNP call, yielding a precision and recall rate of 0.92 and 0.8, respectively, with flow cell R7.3. These numbers increased significantly with flow cell R9.4, in which the precision and recall are 1 and 0.97, respectively. Based on the SNP information, the drug resistance status in ten clinical samples was inferred. We also analyzed K13 gene mutations from 54 additional clinical samples as a proof of concept. We found that a novel amino-acid changing variation is dominant in this area. In addition, we performed a small population-based analysis using 3 and 5 cases (K13) and 10 and 5 cases (PfCRT) from Thailand and Vietnam, respectively. We identified distinct genotypes from the respective regions. This approach will change the standard methodology for the sequencing diagnosis of malaria parasites, especially in developing countries.
- Published
- 2018
37. EAGLE: Explicit Alternative Genome Likelihood Evaluator.
- Author
-
Kuo T, Frith MC, Sese J, and Horton P
- Subjects
- Haplotypes, Humans, Likelihood Functions, Probability, Software, Genomics methods
- Abstract
Background: Reliable detection of genome variations, especially insertions and deletions (indels), from single sample DNA sequencing data remains challenging, partially due to the inherent uncertainty involved in aligning sequencing reads to the reference genome. In practice a variety of ad hoc quality filtering methods are employed to produce more reliable lists of putative variants, but the resulting lists typically still include numerous false positives. Thus it would be desirable to be able to rigorously evaluate the degree to which each putative variant is supported by the data. Unfortunately, users who wish to do this, e.g. for the purpose of prioritizing validation experiments, have been faced with limited options. Results: Here we present EAGLE, a method for evaluating the degree to which sequencing data supports a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual's genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis. In comparison with methods which rely heavily on a particular alignment of the reads to the reference genome, EAGLE readily accounts for uncertainties that may arise from multi-mapping or local misalignment and uses the entire length of each read. We compared the scores assigned by several well-known variant callers to EAGLE for the task of ranking true putative variants on both simulated data and real genome sequencing based benchmarks. For indels, EAGLE obtained marked improvement on simulated data and a whole genome sequencing benchmark, and modest but statistically significant improvement on an exome sequencing benchmark. Conclusions: EAGLE ranked true variants higher than the scores reported by the callers and can be used to improve specificity in variant calling. EAGLE is freely available at https://github.com/tony-kuo/eagle .
- Published
- 2018
38. A survey of localized sequence rearrangements in human DNA.
- Author
-
Frith MC and Khan S
- Subjects
- Cell Line, Gene Conversion, Humans, Inverted Repeat Sequences, Sequence Alignment, Sequence Analysis, DNA, Sequence Deletion, Sequence Inversion, DNA chemistry, Mutation
- Abstract
Genomes mutate and evolve in ways simple (substitution or deletion of bases) and complex (e.g. chromosome shattering). We do not fully understand what types of complex mutation occur, and we cannot routinely characterize arbitrarily-complex mutations in a high-throughput, genome-wide manner. Long-read DNA sequencing methods (e.g. PacBio, nanopore) are promising for this task, because one read may encompass a whole complex mutation. We describe an analysis pipeline to characterize arbitrarily-complex 'local' mutations, i.e. intrachromosomal mutations encompassed by one DNA read. We apply it to nanopore and PacBio reads from one human cell line (NA12878), and survey sequence rearrangements, both real and artifactual. Almost all the real rearrangements belong to recurring patterns or motifs: the most common is tandem multiplication (e.g. heptuplication), but there are also complex patterns such as localized shattering, which resembles DNA damage by radiation. Gene conversions are identified, including one between hemoglobin gamma genes. This study demonstrates a way to find intricate rearrangements with any number of duplications, deletions, and repositionings. It demonstrates a probability-based method to resolve ambiguous rearrangements involving highly similar sequences, as occurs in gene conversion. We present a catalog of local rearrangements in one human cell line, and show which rearrangement patterns occur.
- Published
- 2018
39. Jointly aligning a group of DNA reads improves accuracy of identifying large deletions.
- Author
-
Shrestha AMS, Frith MC, Asai K, and Richard H
- Subjects
- Cell Line, Datasets as Topic, High-Throughput Nucleotide Sequencing, Humans, Internet, Male, Middle Aged, Ploidies, Primary Cell Culture, Sequence Alignment, Sequence Analysis, DNA, Software, Algorithms, Base Sequence, DNA genetics, Genome, Human, Sequence Deletion
- Abstract
Performing sequence alignment to identify structural variants, such as large deletions, from genome sequencing data is a fundamental task, but current methods are far from perfect. The current practice is to independently align each DNA read to a reference genome. We show that the propensity of genomic rearrangements to accumulate in repeat-rich regions imposes severe ambiguities in these alignments, and consequently on the variant calls; with current read lengths, this affects more than one third of known large deletions in the C. Venter genome. We present a method to jointly align reads to a genome, whereby alignment ambiguity of one read can be disambiguated by other reads. We show this leads to a significant improvement in the accuracy of identifying large deletions (≥20 bases), while imposing minimal computational overhead and maintaining an overall running time that is at par with current tools. A software implementation is available as an open-source Python program called JRA at https://bitbucket.org/jointreadalignment/jra-src.
- Published
- 2018
40. Sequencing and phasing cancer mutations in lung cancers using a long-read portable sequencer.
- Author
-
Suzuki A, Suzuki M, Mizushima-Sugano J, Frith MC, Makalowski W, Kohno T, Sugano S, Tsuchihara K, and Suzuki Y
- Subjects
- Biomarkers, Tumor genetics, Humans, Adenocarcinoma genetics, ErbB Receptors genetics, High-Throughput Nucleotide Sequencing instrumentation, High-Throughput Nucleotide Sequencing methods, Lung Neoplasms genetics, Mutation, Sequence Analysis, DNA methods
- Abstract
Here, we employed cDNA amplicon sequencing using a long-read portable sequencer, MinION, to characterize various types of mutations in cancer-related genes, namely, EGFR, KRAS, NRAS and NF1. For homozygous SNVs, the precision and recall rates were 87.5% and 91.3%, respectively. For previously reported hotspot mutations, the precision and recall rates reached 100%. The precise junctions of EML4-ALK, CCDC6-RET and five other gene fusions were also detected. Taking advantage of long-read sequencing, we conducted phasing of EGFR mutations and elucidated the mutational allelic backgrounds of anti-tumor drug-sensitive and resistant mutations, which could provide useful information for selecting therapeutic approaches. In the H1975 cells, 72% of the reads harbored both L858R and T790M mutations, and 22% of the reads harbored neither mutation. To ensure that the clinical requirements can be met in potentially low cancer cell populations, we further conducted a serial dilution analysis of the template for EGFR mutations. Several percent of the mutant alleles could be detected depending on the yield and quality of the sequencing data. Finally, we characterized the mutation genotypes in eight clinical samples. This method could be a convenient long-read sequencing-based analytical approach and thus may change the current approaches used for cancer genome sequencing. (© The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.)
- Published
- 2017
41. Nanopore-based single molecule sequencing of the D4Z4 array responsible for facioscapulohumeral muscular dystrophy.
- Author
-
Mitsuhashi S, Nakagawa S, Takahashi Ueda M, Imanishi T, Frith MC, and Mitsuhashi H
- Subjects
- Humans, Chromosomes, Human, Pair 4 genetics, Homeodomain Proteins genetics, Muscular Dystrophy, Facioscapulohumeral genetics, Nanopores, Repetitive Sequences, Amino Acid, Sequence Analysis, DNA methods
- Abstract
Subtelomeric macrosatellite repeats are difficult to sequence using conventional sequencing methods owing to the high similarity among repeat units and high GC content. Sequencing these repetitive regions is challenging, even with recent improvements in sequencing technologies. Among these repeats, a haplotype carrying a particular sequence and shortening of the D4Z4 array on human chromosome 4q35 causes one of the most prevalent forms of muscular dystrophy with autosomal-dominant inheritance, facioscapulohumeral muscular dystrophy (FSHD). Here, we applied a nanopore-based ultra-long read sequencer to sequence a BAC clone containing 13 D4Z4 repeats and flanking regions. We successfully obtained the whole D4Z4 repeat sequence, including the pathogenic gene DUX4 in the last D4Z4 repeat. The estimated sequence accuracy of the total repeat region was 99.8% based on a comparison with the reference sequence. Errors were typically observed between purine or between pyrimidine bases. Further, we analyzed the D4Z4 sequence from publicly available ultra-long whole human genome sequencing data obtained by nanopore sequencing. This technology may be a new tool for studying D4Z4 repeats and pathomechanism of FSHD in the future and has the potential to widen our understanding of subtelomeric regions.
- Published
- 2017
42. Training alignment parameters for arbitrary sequencers with LAST-TRAIN.
- Author
-
Hamada M, Ono Y, Asai K, and Frith MC
- Subjects
- Humans, Genome, Human, Polymorphism, Genetic, Sequence Analysis, DNA methods, Software
- Abstract
Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. Availability and Implementation: The source code is freely available at http://last.cbrc.jp/. Contact: mhamada@waseda.jp or mcfrith@edu.k.u-tokyo.ac.jp. Supplementary Information: Supplementary data are available at Bioinformatics online. (© The Author 2016. Published by Oxford University Press.)
- Published
- 2017
43. Protein sequence-similarity search acceleration using a heuristic algorithm with a sensitive matrix.
- Author
-
Lim K, Yamada KD, Frith MC, and Tomii K
- Subjects
- Algorithms, Computational Biology, Models, Molecular, Proteins chemistry, Sequence Alignment, Computer Heuristics, Databases, Protein, Sequence Analysis, Protein
- Abstract
Protein database search for public databases is a fundamental step in the target selection of proteins in structural and functional genomics and also for inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs. The choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a heuristic and fast database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection and alignment quality performance of LAST across diverse m parameters. Against a protein database consisting of approximately 15 million sequences, LAST with m = 10^5 achieves better homology detection performance than BLASTP, and completes the search 20 times faster. Compared to the most sensitive existing methods being used today, CS-BLAST and SSEARCH, LAST with MIQS and m = 10^6 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. Results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.
- Published
- 2016
44. ALP & FALP: C++ libraries for pairwise local alignment E-values.
- Author
-
Sheetlin S, Park Y, Frith MC, and Spouge JL
- Subjects
- DNA metabolism, Databases, Factual, Humans, Proteins metabolism, Sequence Alignment, Computational Biology methods, DNA chemistry, Proteins chemistry, Sequence Analysis, DNA methods, Sequence Analysis, Protein methods, Software
- Abstract
Motivation: Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. Availability and Implementation: To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. Contact: spouge@nih.gov. Supplementary Information: Supplementary data are available at Bioinformatics online. (Published by Oxford University Press 2015. This work is written by US Government employees and is in the public domain in the US.)
- Published
- 2016
45. Split-alignment of genomes finds orthologies more accurately.
- Author
-
Frith MC and Kawaguchi R
- Subjects
- Algorithms, Animals, Base Sequence, Dogs, Drosophila classification, Drosophila genetics, Humans, Mice, Models, Genetic, Models, Statistical, Molecular Sequence Data, Synteny, Genome, Sequence Alignment
- Abstract
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
- Published
- 2015
46. Frameshift alignment: statistics and post-genomic applications.
- Author
-
Sheetlin SL, Park Y, Frith MC, and Spouge JL
- Subjects
- Algorithms, Data Interpretation, Statistical, Genome, Human, Genomics, Humans, Metagenomics, Pseudogenes, Sequence Analysis, DNA, Sequence Analysis, Protein, Sequence Analysis, RNA, Software, Frameshift Mutation, Sequence Alignment methods
- Abstract
Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. (Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.)
- Published
- 2014
47. RECLU: a pipeline to discover reproducible transcriptional start sites and their alternative regulation using capped analysis of gene expression (CAGE).
- Author
-
Ohmiya H, Vitezic M, Frith MC, Itoh M, Carninci P, Forrest AR, Hayashizaki Y, and Lassmann T
- Subjects
- Cluster Analysis, Computational Biology methods, Gene Expression Profiling, HeLa Cells, Humans, Internet, Nucleotide Motifs, Position-Specific Scoring Matrices, Reproducibility of Results, Gene Expression Regulation, RNA Caps, Software, Transcription Initiation Site
- Abstract
Background: Next generation sequencing based technologies are being extensively used to study transcriptomes. Among these, cap analysis of gene expression (CAGE) is specialized in detecting the most 5' ends of RNA molecules. After mapping the sequenced reads back to a reference genome, CAGE data highlights the transcriptional start sites (TSSs) and their usage at a single nucleotide resolution. Results: We propose a pipeline to group the single nucleotide TSS into larger reproducible peaks and compare their usage across biological states. Importantly, our pipeline discovers broad peaks as well as the fine structure of individual transcriptional start sites embedded within them. We assess the performance of our approach on a large CAGE dataset including 156 primary cell types and two cell lines with biological replicas. We demonstrate that genes have complicated structures of transcription initiation events. In particular, we discover that narrow peaks embedded in broader regions of transcriptional activity can be differentially used even if the larger region is not. Conclusions: By examining the reproducible fine scaled organization of TSS we can detect many differentially regulated peaks undetected by previous approaches.
- Published
- 2014
48. Explaining the correlations among properties of mammalian promoters.
- Author
-
Frith MC
- Subjects
- Animals, CpG Islands, Data Interpretation, Statistical, Humans, Mice, TATA Box, Transcription Initiation Site, Transcription, Genetic, Promoter Regions, Genetic
- Abstract
Proximal promoters are fundamental genomic elements for gene expression. They vary in terms of GC percentage, CpG abundance, presence of TATA signal, evolutionary conservation, chromosomal spread of transcription start sites and breadth of expression across cell types. These properties are correlated, and it has been suggested that there are two classes of promoters: one class with high CpG, widely spread transcription start sites and broad expression, and another with TATA signals, narrow spread and restricted expression. However, it has been unclear why these properties are correlated in this way. We reexamined these features using the deep FANTOM5 CAGE data from hundreds of cell types. First, we point out subtle but important biases in previous definitions of promoters and of expression breadth. Second, we show that most promoters are rather nonspecifically expressed across many cell types. Third, promoters' expression breadth is independent of maximum expression level, and therefore correlates with average expression level. Fourth, the data show a more complex picture than two classes, with a network of direct and indirect correlations among promoter properties. By tentatively distinguishing the direct from the indirect correlations, we reveal simple explanations for them.
- Published
- 2014
49. Improved search heuristics find 20,000 new alignments between human and mouse genomes.
- Author
-
Frith MC and Noé L
- Subjects
- Animals, Dogs, Genome, Humans, Mice, Genome, Human, Genomics methods, Sequence Alignment methods, Sequence Analysis, DNA methods
- Abstract
Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human-dog and melanogaster-pseudoobscura comparisons, but not for human-mouse, which suggests that we still miss many human-mouse alignments. Our optimized heuristics find ∼20,000 new human-mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.
- Published
- 2014
50. A promoter-level mammalian expression atlas.
- Author
-
Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, Itoh M, Andersson R, Mungall CJ, Meehan TF, Schmeier S, Bertin N, Jørgensen M, Dimont E, Arner E, Schmidl C, Schaefer U, Medvedeva YA, Plessy C, Vitezic M, Severin J, Semple C, Ishizu Y, Young RS, Francescatto M, Alam I, Albanese D, Altschuler GM, Arakawa T, Archer JA, Arner P, Babina M, Rennie S, Balwierz PJ, Beckhouse AG, Pradhan-Bhatt S, Blake JA, Blumenthal A, Bodega B, Bonetti A, Briggs J, Brombacher F, Burroughs AM, Califano A, Cannistraci CV, Carbajo D, Chen Y, Chierici M, Ciani Y, Clevers HC, Dalla E, Davis CA, Detmar M, Diehl AD, Dohi T, Drabløs F, Edge AS, Edinger M, Ekwall K, Endoh M, Enomoto H, Fagiolini M, Fairbairn L, Fang H, Farach-Carson MC, Faulkner GJ, Favorov AV, Fisher ME, Frith MC, Fujita R, Fukuda S, Furlanello C, Furino M, Furusawa J, Geijtenbeek TB, Gibson AP, Gingeras T, Goldowitz D, Gough J, Guhl S, Guler R, Gustincich S, Ha TJ, Hamaguchi M, Hara M, Harbers M, Harshbarger J, Hasegawa A, Hasegawa Y, Hashimoto T, Herlyn M, Hitchens KJ, Ho Sui SJ, Hofmann OM, Hoof I, Hori F, Huminiecki L, Iida K, Ikawa T, Jankovic BR, Jia H, Joshi A, Jurman G, Kaczkowski B, Kai C, Kaida K, Kaiho A, Kajiyama K, Kanamori-Katayama M, Kasianov AS, Kasukawa T, Katayama S, Kato S, Kawaguchi S, Kawamoto H, Kawamura YI, Kawashima T, Kempfle JS, Kenna TJ, Kere J, Khachigian LM, Kitamura T, Klinken SP, Knox AJ, Kojima M, Kojima S, Kondo N, Koseki H, Koyasu S, Krampitz S, Kubosaki A, Kwon AT, Laros JF, Lee W, Lennartsson A, Li K, Lilje B, Lipovich L, Mackay-Sim A, Manabe R, Mar JC, Marchand B, Mathelier A, Mejhert N, Meynert A, Mizuno Y, de Lima Morais DA, Morikawa H, Morimoto M, Moro K, Motakis E, Motohashi H, Mummery CL, Murata M, Nagao-Sato S, Nakachi Y, Nakahara F, Nakamura T, Nakamura Y, Nakazato K, van Nimwegen E, Ninomiya N, Nishiyori H, Noma S, Noma S, Noazaki T, Ogishima S, Ohkura N, Ohimiya H, Ohno H, Ohshima M, Okada-Hatakeyama M, Okazaki Y, Orlando V, Ovchinnikov DA, Pain A, Passier R, Patrikakis M, Persson H, Piazza S, Prendergast JG, Rackham OJ, Ramilowski JA, Rashid M, Ravasi T, Rizzu P, Roncador M, Roy S, Rye MB, Saijyo E, Sajantila A, Saka A, Sakaguchi S, Sakai M, Sato H, Savvi S, Saxena A, Schneider C, Schultes EA, Schulze-Tanzil GG, Schwegmann A, Sengstag T, Sheng G, Shimoji H, Shimoni Y, Shin JW, Simon C, Sugiyama D, Sugiyama T, Suzuki M, Suzuki N, Swoboda RK, 't Hoen PA, Tagami M, Takahashi N, Takai J, Tanaka H, Tatsukawa H, Tatum Z, Thompson M, Toyodo H, Toyoda T, Valen E, van de Wetering M, van den Berg LM, Verado R, Vijayan D, Vorontsov IE, Wasserman WW, Watanabe S, Wells CA, Winteringham LN, Wolvetang E, Wood EJ, Yamaguchi Y, Yamamoto M, Yoneda M, Yonekura Y, Yoshida S, Zabierowski SE, Zhang PG, Zhao X, Zucchelli S, Summers KM, Suzuki H, Daub CO, Kawai J, Heutink P, Hide W, Freeman TC, Lenhard B, Bajic VB, Taylor MS, Makeev VJ, Sandelin A, Hume DA, Carninci P, and Hayashizaki Y
- Subjects
- Animals, Cell Line, Cells, Cultured, Cluster Analysis, Conserved Sequence genetics, Gene Expression Regulation genetics, Gene Regulatory Networks genetics, Genes, Essential genetics, Genome genetics, Humans, Mice, Open Reading Frames genetics, Organ Specificity, RNA, Messenger analysis, RNA, Messenger genetics, Transcription Factors metabolism, Transcription Initiation Site, Transcription, Genetic genetics, Atlases as Topic, Molecular Sequence Annotation, Promoter Regions, Genetic genetics, Transcriptome genetics
- Abstract
Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly 'housekeeping', whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.
- Published
- 2014