Matthew Haynes, Dana Willner, Florent E. Angly, Robert Schmieder, Rebecca Vega-Thurber, Egbert Mundt, Lixin Zhang, Forest Rohwer, David L. Kirchman, Tracey McDole, Elizabeth A. Dinsdale, Yongfei Hu, John Douglas Mcpherson, Matthew T. Cottrell, Dionysios A. Antonopoulos, R. Michael Miller, Alejandra Prieto-Davó, Beltran Rodriguez-Mueller, Mike Furlan, Folker Meyer, Baoli Zhu, Christelle Desnues, Linda Wegley, Robert Edwards, Matthew R. Henn, Katie L. Barott, Rick Stevens, Robert K. Naviaux, San Diego State University (SDSU), Argonne National Laboratory [Lemont] (ANL), School of Marine Science and Policy, College of Earth, Ocean, and Environment [Newark] (CEOE), University of Delaware [Newark]-University of Delaware [Newark], Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE), Institut de Recherche pour le Développement (IRD)-Aix Marseille Université (AMU)-Institut National de la Santé et de la Recherche Médicale (INSERM)-IFR48, Institut des sciences biologiques (INSB-CNRS)-Institut des sciences biologiques (INSB-CNRS)-Centre National de la Recherche Scientifique (CNRS), Massachusetts Institute of Technology (MIT), Chinese Academy of Sciences Key Laboratory of Pathogenic Microbiology and Immunology, Chinese Academy of Sciences [Beijing] (CAS), Ontario Institute for Cancer Research [Canada] (OICR), Ontario Institute for Cancer Research, Mathematics and Computer Science Division [ANL] (MCS), University of Georgia [USA], University of California [San Francisco] (UC San Francisco), University of California (UC), INSB-INSB-Centre National de la Recherche Scientifique (CNRS), University of California [San Francisco] (UCSF), University of California, Stormo, Gary D, and Desnues, Christelle
Metagenomic studies characterize both the composition and diversity of uncultured viral and microbial communities. BLAST-based comparisons have typically been used for such analyses; however, sampling biases, high percentages of unknown sequences, and the use of arbitrary thresholds to find significant similarities can decrease the accuracy and validity of estimates. Here, we present Genome relative Abundance and Average Size (GAAS), a complete software package that provides improved estimates of community composition and average genome length for metagenomes in both textual and graphical formats. GAAS implements a novel methodology to control for sampling bias via length normalization, to adjust for multiple BLAST similarities by similarity weighting, and to select significant similarities using relative alignment lengths. In benchmark tests, the GAAS method was robust to both high percentages of unknown sequences and to variations in metagenomic sequence read lengths. Re-analysis of the Sargasso Sea virome using GAAS indicated that standard methodologies for metagenomic analysis may dramatically underestimate the abundance and importance of organisms with small genomes in environmental systems. Using GAAS, we conducted a meta-analysis of microbial and viral average genome lengths in over 150 metagenomes from four biomes to determine whether genome lengths vary consistently between and within biomes, and between microbial and viral communities from the same environment. Significant differences between biomes and within aquatic sub-biomes (oceans, hypersaline systems, freshwater, and microbialites) suggested that average genome length is a fundamental property of environments driven by factors at the sub-biome level. 
The behavior of paired viral and microbial metagenomes from the same environment indicated that microbial and viral average genome sizes are independent of each other, but indicative of community responses to stressors and environmental conditions.

Author Summary

Metagenomics uses DNA or RNA sequences isolated directly from the environment to determine what viruses or microorganisms exist in natural communities and what metabolic activities they encode. Typically, metagenomic sequences are compared to annotated sequences in public databases using the BLAST search tool. Our methods, implemented in the Genome relative Abundance and Average Size (GAAS) software, improve the way BLAST searches are processed to estimate the taxonomic composition of communities and their average genome length. GAAS provides a more accurate picture of community composition by correcting for a systematic sampling bias towards larger genomes, and is useful in situations where organisms with small genomes are abundant, such as disease outbreaks caused by small RNA viruses. Microbial average genome length relates to environmental complexity and the distribution of genome lengths describes community diversity. A study of the average genome length of viruses and microorganisms in four different biomes using GAAS on 169 metagenomes showed significantly different average genome sizes between biomes, and large variability within biomes as well. This also revealed that microbial and viral average genome sizes in the same environment are independent of each other, which reflects the different ways that microorganisms and viruses respond to stress and environmental conditions.
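
The three corrections named in the abstract (selecting similarities by relative alignment length, weighting multiple similarities, and normalizing by genome length) can be illustrated with a minimal Python sketch. This is not the GAAS implementation: the function names, the input tuple layout, the 0.6 coverage cutoff, and the use of bit scores as similarity weights are all assumptions made for illustration, and the formulas GAAS actually uses may differ.

```python
from collections import defaultdict


def estimate_composition(hits, genome_lengths, min_rel_aln=0.6):
    """Illustrative relative-abundance estimate from BLAST-like hits.

    hits: iterable of (read_id, genome_id, bit_score, aln_len, read_len).
    genome_lengths: dict mapping genome_id to genome length in bp.
    min_rel_aln: hypothetical cutoff on alignment length / read length.
    """
    # Step 1: keep only hits whose alignment covers enough of the read.
    per_read = defaultdict(list)
    for read_id, genome_id, score, aln_len, read_len in hits:
        if read_len > 0 and aln_len / read_len >= min_rel_aln:
            per_read[read_id].append((genome_id, score))

    # Step 2: similarity weighting. Each read contributes one unit in total,
    # split among its hits in proportion to score (an assumed weighting).
    weighted = defaultdict(float)
    for matches in per_read.values():
        total = sum(score for _, score in matches)
        if total <= 0:
            continue
        for genome_id, score in matches:
            weighted[genome_id] += score / total

    # Step 3: length normalization. Longer genomes yield more reads per
    # genome copy, so divide each weighted count by the genome length.
    normalized = {g: w / genome_lengths[g] for g, w in weighted.items()}
    norm_total = sum(normalized.values())
    if norm_total == 0:
        return {}
    return {g: v / norm_total for g, v in normalized.items()}


def average_genome_length(abundance, genome_lengths):
    """Abundance-weighted mean genome length, in bp."""
    return sum(frac * genome_lengths[g] for g, frac in abundance.items())
```

In this sketch, a 10,000 bp genome that attracts ten reads and a 1,000 bp genome that attracts one read come out equally abundant, whereas raw hit counts would favor the larger genome ten to one. That is the sampling bias towards larger genomes described in the Author Summary.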