1. TALC: Transcript-level Aware Long-read Correction
- Author
Dany Severac, Andrew J. Oldfield, Aubin Thomas, William Ritchie, Lucile Broseus, Emeric Dubois
Affiliations: Institut de génétique humaine (IGH); Institut de Génomique Fonctionnelle (IGF); Institut de Génomique Fonctionnelle - Montpellier GenomiX (IGF MGX); BioCampus (BCM); Université de Montpellier (UM); Université Montpellier 1 (UM1); Université Montpellier 2 - Sciences et Techniques (UM2); Centre National de la Recherche Scientifique (CNRS); Institut National de la Santé et de la Recherche Médicale (INSERM)
- Subjects
Statistics and Probability; Gene isoform; Computer science; Transcript level; Biochemistry; Transcriptome; Molecular Biology; RNA; RNA expression; High-Throughput Nucleotide Sequencing; Genomics; Sequence Analysis, DNA; Transcriptome Sequencing; Computer Science Applications; Computational Mathematics; Computational Theory and Mathematics; Data mining; Algorithms; Software; [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM]; [SDV.BBM.GTP] Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN]
- Abstract
Motivation: Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous ‘hybrid correction’ algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data.
Results: We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC) which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology.
Availability and implementation: TALC is implemented in C++ and available at https://github.com/lbroseus/TALC.
Supplementary information: Supplementary data are available at Bioinformatics online.
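For readers unfamiliar with the data structure named in the abstract, the following is a minimal, generic sketch of a weighted De Bruijn graph built from short reads, plus a greedy walk along the heaviest edges between two k-mers. It is not TALC's algorithm or code: the function names `build_weighted_dbg` and `heaviest_path`, the greedy strategy, and the toy reads are all assumptions made for illustration, and the sketch does not model expression changes or isoform representation as TALC is described as doing.

```python
from collections import Counter, defaultdict


def build_weighted_dbg(reads, k):
    """Hypothetical helper: return {kmer: {next_kmer: weight}}, where weight
    counts how often the two k-mers occur consecutively in the short reads."""
    edge_counts = Counter()
    for read in reads:
        for i in range(len(read) - k):
            edge_counts[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    graph = defaultdict(dict)
    for (src, dst), weight in edge_counts.items():
        graph[src][dst] = weight
    return graph


def heaviest_path(graph, source, target, max_steps):
    """Hypothetical helper: greedy walk from source toward target that always
    follows the heaviest outgoing edge. Returns the spelled sequence, or None
    if target is not reached within max_steps."""
    path, node = source, source
    for _ in range(max_steps):
        if node == target:
            return path
        successors = graph.get(node)
        if not successors:
            return None
        node = max(successors, key=successors.get)
        path += node[-1]
    return path if node == target else None


# Toy data (invented): one variant seen three times, another seen once.
short_reads = ["ACGTGCA"] * 3 + ["ACGAGCA"]
dbg = build_weighted_dbg(short_reads, k=3)
corrected = heaviest_path(dbg, "ACG", "GCA", max_steps=10)
```

In this toy setting the walk follows the better-supported branch between the two anchor k-mers. A purely greedy, coverage-driven choice like this would tend to favour highly expressed isoforms, which is the kind of limitation the abstract attributes to genomic hybrid-correction methods when applied to transcriptome data.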
- Published
- 2020