51. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods
- Author
Ahrenfeldt, Johanne; Skaarup, Carina; Hasman, Henrik; Pedersen, Anders Gorm; Aarestrup, Frank Møller; and Lund, Ole
- Abstract
Background: Whole genome sequencing (WGS) is increasingly used in diagnostics and surveillance of infectious diseases. A major application of WGS is identifying outbreak clusters, and there is therefore a need for methods that can accurately and efficiently infer phylogenies from sequencing reads. In the present study we describe a new dataset that we have created for benchmarking such WGS-based methods on epidemiological data, and we present an analysis in which we use the data to compare the performance of some current methods. Results: Our aim was to create a benchmark dataset that mimics sequencing data of the sort that might be collected during an outbreak of an infectious disease. This was achieved by letting an E. coli hypermutator strain grow in the lab for 8 consecutive days, each day splitting the culture in two while also collecting samples for sequencing. The result is a dataset consisting of 101 whole genome sequences with known phylogenetic relationships. Among the sequenced samples, 51 correspond to internal nodes in the phylogeny because they are ancestral, while the remaining 50 correspond to leaves. We also used the newly created dataset to compare three methods, all available online, that infer phylogenies from whole-genome sequencing reads: NDtree, CSI Phylogeny and REALPHY. One complication when comparing the output of these methods with the known phylogeny is that phylogenetic methods typically build trees in which all observed sequences are placed as leaves, even though some of them are in fact ancestral. We therefore devised a method for post-processing the inferred trees by collapsing short branches (thus relocating some leaves to internal nodes), and we also present two new measures of tree similarity that take into account the identity of both internal and leaf nodes. Conclusions: Based on this analysis we find that, among the investigated methods, CSI Phylogeny had the best performance, co
- Published
- 2017
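
The post-processing step mentioned in the abstract (collapsing short branches so that some leaves are relocated to internal nodes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the `collapse_short_leaves` function, the threshold value, and the rule of relocating a leaf only when its parent is still unlabeled are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not taken from the paper."""
    name: Optional[str] = None   # sample label; None for an unlabeled internal node
    length: float = 0.0          # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def collapse_short_leaves(node: Node, threshold: float) -> Node:
    """Relocate leaves on branches no longer than `threshold` onto their parent.

    Assumption: a leaf is moved only if its parent has no label yet, so each
    internal node ends up carrying at most one sample.
    """
    kept = []
    for child in node.children:
        # Process the subtree first, so collapsing proceeds bottom-up.
        collapse_short_leaves(child, threshold)
        is_leaf = not child.children
        if is_leaf and child.length <= threshold and node.name is None:
            node.name = child.name   # the sample now labels the internal node
        else:
            kept.append(child)
    node.children = kept
    return node


# Toy tree: samples A and B sit on zero-length branches, C on a longer one.
tree = Node(children=[
    Node("A", 0.0),
    Node(None, 0.3, [Node("B", 0.0), Node("C", 0.2)]),
])
collapse_short_leaves(tree, threshold=0.05)
```

In this toy tree, samples A and B are moved onto their parent nodes while C remains a leaf. A real pipeline would also need a principled way to choose the threshold and to handle several short-branch leaves under the same parent.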