In-silico read normalization using set multi-cover optimization

Authors :
Marcel H. Schulz
Dilip Ariyur Durai
Publication Year :
2017
Publisher :
Cold Spring Harbor Laboratory, 2017.

Abstract

De Bruijn graphs are a common assembly data structure for large sequencing datasets. With advances in sequencing technologies, however, assembling high-coverage datasets has become a computational challenge. Read normalization, which removes redundancy in large datasets, is widely applied to reduce resource requirements. Current normalization algorithms, though efficient, provide no guarantee of preserving important k-mers that form connections between regions in the graph. Here, normalization is phrased as a set multi-cover problem on reads, and a heuristic algorithm, ORNA, is proposed. ORNA normalizes to the minimum number of reads required to retain all k-mers and their relative k-mer abundances from the original dataset. Hence, all connections and coverage information from the original graph are preserved. ORNA was tested on various RNA-seq datasets with different coverage values. It was compared to current normalization algorithms and was found to perform better. It is shown that combining read error correction and normalization allows more accurate and resource-efficient RNA assemblies compared to the original dataset. Further, an application is proposed in which multiple datasets are combined and normalized to predict novel transcripts that would have been missed otherwise. Finally, ORNA is a general-purpose normalization algorithm that is fast and significantly reduces datasets with little loss of assembly quality. ORNA can be found at https://github.com/SchulzLab/ORNA
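
The set multi-cover framing in the abstract can be illustrated with a minimal greedy sketch. This is not ORNA's implementation: the function names (`kmers`, `normalize_reads`), the single-pass read order, and the per-k-mer target of ceil(log2(abundance)) with a floor of 1 are all assumptions chosen for illustration. The idea shown is only that a read is kept while at least one of its k-mers is still covered fewer times than its target, so every k-mer of the input survives and more abundant k-mers keep higher coverage.

```python
import math
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k):
    """Greedy multi-cover sketch (hypothetical, not ORNA's actual code).

    Each k-mer gets a target coverage that grows with the logarithm of its
    abundance in the full dataset (assumed rule). A read is kept if any of
    its k-mers has not yet reached its target in the kept set.
    """
    # Abundance of every k-mer in the original dataset.
    abundance = Counter(km for read in reads for km in kmers(read, k))

    # Assumed target: ceil(log2(abundance)), but at least 1 so that
    # every k-mer is retained at least once.
    target = {km: max(1, math.ceil(math.log2(c))) for km, c in abundance.items()}

    covered = Counter()  # coverage of each k-mer within the kept reads
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if any(covered[km] < target[km] for km in read_kmers):
            kept.append(read)
            covered.update(read_kmers)
    return kept


if __name__ == "__main__":
    example = ["ACGTAC"] * 8 + ["TTTTGG"]
    print(normalize_reads(example, k=4))
```

In this toy input the eight identical reads are reduced while the single rare read is always kept, because its k-mers would otherwise be lost. A real tool would additionally need streaming k-mer counting and handling of reverse complements, which this sketch omits.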

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....be8f37a3b67a60f922c15225c9e80b95
Full Text :
https://doi.org/10.1101/133579