Authors: Stephen D. Bentley, Maiju Pesonen, James M. Musser, Marcin J. Skwark, Paul Turner, Santeri Puranen, Claire Chewapreecha, Julian Parkhill, Nicholas J. Croucher, Stephen B. Beres, Erik Aurell, Yingying Xu, Simon R. Harris, and Jukka Corander
ORCID iDs: Skwark, Marcin J [0000-0002-2022-6766]; Croucher, Nicholas J [0000-0001-6303-8768]; Puranen, Santeri [0000-0001-6388-7110]; Chewapreecha, Claire [0000-0002-1313-4011]; Xu, Ying Ying [0000-0002-9096-0552]; Turner, Paul [0000-0002-1013-7815]; Harris, Simon R [0000-0003-1512-6194]; Beres, Stephen B [0000-0003-3041-0185]; Parkhill, Julian [0000-0002-7069-5958]
Affiliations, funders and repositories (as listed): Medical Research Council (MRC), Apollo - University of Cambridge Repository, Vanderbilt University, Imperial College London, Department of Computer Science, University of Cambridge, University of Oxford, Wellcome Trust Sanger Institute, Houston Methodist Hospital, Cornell University, Department of Applied Physics, Aalto University, Department of Mathematics and Statistics, Jukka Corander / Principal Investigator, and Biostatistics Helsinki
Recent advances in the scale and diversity of population genomic datasets for bacteria now provide the potential for genome-wide patterns of co-evolution to be studied at the resolution of individual bases. Here we describe a new statistical method, genomeDCA, which uses recent advances in computational structural biology to identify the polymorphic loci under the strongest co-evolutionary pressures. We apply genomeDCA to two large population data sets representing the major human pathogens Streptococcus pneumoniae (pneumococcus) and Streptococcus pyogenes (group A Streptococcus). For pneumococcus we identified 5,199 putative epistatic interactions between 1,936 sites. Over three-quarters of the links were between sites within the pbp2x, pbp1a and pbp2b genes, the sequences of which are critical in determining non-susceptibility to beta-lactam antibiotics. A network-based analysis found these genes were also coupled to that encoding dihydrofolate reductase, changes to which underlie trimethoprim resistance. Distinct from these antibiotic resistance genes, a large network component of 384 protein coding sequences encompassed many genes critical in basic cellular functions, while another distinct component included genes associated with virulence. The group A Streptococcus (GAS) data set represents a clonal population with relatively little genetic variation and a high level of linkage disequilibrium across the genome. Despite this, we were able to pinpoint two RNA pseudouridine synthases, which were each strongly linked to a separate set of loci across the chromosome, representing biologically plausible targets of co-selection. The population genomic analysis method applied here identifies statistically significantly co-evolving locus pairs, potentially arising from fitness selection interdependence reflecting underlying protein-protein interactions, or genes whose product activities contribute to the same phenotype.
This discovery approach greatly enhances the future potential of epistasis analysis for systems biology, and can complement genome-wide association studies as a means of formulating hypotheses for targeted experimental work.

Author summary

Epistatic interactions between polymorphisms in DNA are recognized as important drivers of evolution in numerous organisms. Study of epistasis in bacteria has been hampered by the lack of densely sampled population genomic data, suitable statistical models and inference algorithms sufficiently powered for extremely high-dimensional parameter spaces. We introduce the first model-based method for genome-wide epistasis analysis and use two of the largest available bacterial population genome data sets on Streptococcus pneumoniae (the pneumococcus) and Streptococcus pyogenes (group A Streptococcus) to demonstrate its potential for biological discovery. Our approach reveals interacting networks of resistance, virulence and core machinery genes in the pneumococcus, which highlights putative candidates for novel drug targets. We also discover a number of plausible targets of co-selection in S. pyogenes linked to RNA pseudouridine synthases. Our method significantly enhances the future potential of epistasis analysis for systems biology, and can complement genome-wide association studies as a means of formulating hypotheses for targeted experimental work.
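
The abstract describes grouping coupled loci into network components, such as the resistance-gene cluster and the separate component of core cellular genes. The following is a minimal sketch of that general idea, assuming the couplings have already been reduced to a list of gene pairs. It is not the genomeDCA implementation. The function name `network_components`, the label `dhfr` for the dihydrofolate reductase gene, the names `geneA` and `geneB`, and the toy links are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def network_components(links):
    """Group genes into connected components from a list of coupled gene pairs.

    Hypothetical helper for illustration; uses a simple union-find structure.
    """
    parent = {}

    def find(x):
        # Return the representative of x's component, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)

    # Largest component first.
    return sorted(groups.values(), key=len, reverse=True)


# Toy links invented for illustration; not taken from the study's results.
toy_links = [
    ("pbp2x", "pbp1a"),
    ("pbp1a", "pbp2b"),
    ("pbp2b", "dhfr"),   # "dhfr" is a placeholder label, an assumption
    ("geneA", "geneB"),  # an unrelated, hypothetical pair
]

components = network_components(toy_links)
for component in components:
    print(sorted(component))
```

In this toy input the three pbp genes and the placeholder `dhfr` fall into one component while the unrelated pair forms another, mirroring in miniature how distinct components can separate resistance-related couplings from other gene groups.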