1. Making de Bruijn Graphs Eulerian
- Author
-
Bernardini, Giulia, Chen, Huiping, Loukides, Grigorios, Pissis, Solon P., Stougie, Leen, Sweering, Michelle, Bannai, Hideo, Holub, Jan, Bernardini, Giulia, Chen, Huiping, Loukides, Grigorio, Pissis, Solon P., Stougie, Leen, Sweering, Michelle, Università degli studi di Trieste = University of Trieste, Centrum Wiskunde & Informatica (CWI), King‘s College London, Vrije Universiteit Amsterdam [Amsterdam] (VU), Equipe de recherche européenne en algorithmique et biologie formelle et expérimentale (ERABLE), Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Inria Lyon, Institut National de Recherche en Informatique et en Automatique (Inria), The work in this paper is supported in part by: the Netherlands Organisation for Scientific Research (NWO) through project OCENW.GROOT.2019.015 'Optimization for and with Machine Learning (OPTIMAL)' and Gravitation-grant NETWORKS-024.002.003, a CSC scholarship, the Leverhulme Trust RPG-2019-399 project, European Project: 872539,H2020-EU.1.3. - EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions, H2020-EU.1.3.3. - Stimulating innovation by means of cross-fertilisation of knowledge,H2020-MSCA-RISE-2019,PANGAIA(2020), European Project: 956229,H2020-EU.1.3. - EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions,ALPACA(2021), Bannai, Hideo, Holub, Jan, Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands, Bioinformatics, AIMMS, Bio Informatics (IBIVU), Operations Analytics, Tinbergen Institute, and Amsterdam Business Research Institute
- Subjects
graph algorithms ,2012 ACM Subject Classification Theory of computation → Pattern matching phrases string algorithms ,[SDV]Life Sciences [q-bio] ,graph algorithm ,[INFO]Computer Science [cs] ,Theory of computation → Pattern matching ,string algorithm ,string algorithms ,Eulerian graph ,de Bruijn graph - Abstract
A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler’s theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph G_{S,k}(V,E), where V is the set of length-(k-1) substrings of the strings in S, and G_{S,k} contains an edge (u,v) with multiplicity m_{u,v}, if and only if the string u[0]⋅ v is equal to the string u⋅ v[k-2] and this string occurs exactly m_{u,v} times in total in strings in S. Let G_{Σ,k}(V_{Σ,k},E_{Σ,k}) be the complete dBG of Σ^k. The Eulerian Extension (EE) problem on G_{S,k} asks to extend G_{S,k} with a set ℬ of nodes from V_{Σ,k} and a smallest multiset 𝒜 of edges from E_{Σ,k} to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of G_{S,k} but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on G_{S,k} is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1) When G_{S,k} is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying G_{Σ,k} and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in 𝒪(|V|klog d+|E|) time, which is nearly optimal, since the size of G_{S,k} is Θ(|V|k+|E|). 2) When G_{S,k} is not balanced, we are asked to extend G_{S,k} to H_{S,k}(V∪ℬ,E∪𝒜) such that every node of H_{S,k} is balanced and the total number |𝒜| of added edges is minimized. We show that this problem can be solved in the optimal 𝒪(k|V| + |E|+ |𝒜|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches., LIPIcs, Vol. 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), pages 12:1-12:18
- Published
- 2022
- Full Text
- View/download PDF