1. MeStanG—Resource for High-Throughput Sequencing Standard Data Sets Generation for Bioinformatic Methods Evaluation and Validation.
- Author
- Ramos Lopez, Daniel; Flores, Francisco J.; Espindola, Andres S.
- Subjects
- *BIOCHEMISTRY, *NUCLEOTIDE sequencing, *MICROBIAL diversity, *ENVIRONMENTAL sampling, *BACTERIAL communities, *METAGENOMICS
- Abstract
Simple Summary: Metagenomics analysis measures microbiome diversity in samples without prior enrichment. Advances in High-Throughput Sequencing (HTS) have expanded its use from identifying known organisms to diagnosing diseases. Reliable results need strong validation with standard samples and databases from real and synthetic controls. We introduce the Metagenomic Standards Generator (MeStanG), a tool for creating HTS Nanopore data sets to test bioinformatics pipelines. MeStanG allows users to design and generate samples with specific numbers of reads for each organism from reference sequences and error profiles. Its accuracy was tested by simulating metagenomic samples with known diversities and abundances expressed as number of reads. The analysis results matched the expected organism composition of the samples. MeStanG is a valuable tool for scientists to create mock metagenomic samples for diagnostic assay validation studies and to assess bioinformatics pipeline performance using simulated samples.
Metagenomics analysis has enabled the measurement of microbiome diversity in environmental samples without prior targeted enrichment. Functional and phylogenetic studies based on microbial diversity retrieved using HTS platforms have advanced from detecting known organisms and discovering unknown species to applications in disease diagnostics. Robust validation processes are essential for test reliability, requiring standard samples and databases derived from real samples and from artificial controls generated in silico. We propose MeStanG as a resource for generating HTS Nanopore data sets to evaluate present and emerging bioinformatics pipelines. MeStanG allows samples to be designed with user-defined organism abundances expressed as number of reads, reference sequences, and predetermined or custom sequencing error profiles. The simulator pipeline was evaluated by analyzing its output mock metagenomic samples, which contained known read abundances, using read mapping, genome assembly, and taxonomic classification in three scenarios: a bacterial community composed of nine different organisms, samples resembling pathogen-infected wheat plants, and a serial dilution of a viral pathogen. The evaluation consistently reported the same organisms and read abundances as specified in the mock metagenomic sample design. Based on this performance and its novel capacity to generate an exact number of reads, MeStanG can be used by scientists to develop mock metagenomic samples (artificial HTS data sets) to assess the diagnostic performance metrics of bioinformatic pipelines, allowing the user to choose predetermined or customized models for research and training. [ABSTRACT FROM AUTHOR]
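The idea of a mock sample with an exact number of reads per organism can be illustrated with a minimal sketch. This is not MeStanG's implementation or interface: the function `simulate_reads`, its parameters, and the uniform substitution-only error model are all assumptions made for illustration, and real Nanopore error profiles also include insertions and deletions.

```python
import random


def simulate_reads(references, read_counts, read_length=1000,
                   error_rate=0.05, seed=0):
    """Draw an exact number of reads per organism from its reference.

    Hypothetical helper for illustration only: every base is substituted
    with probability `error_rate`; indels are not modeled.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    reads = []
    for organism, count in read_counts.items():
        genome = references[organism]
        # Reads cannot be longer than the reference they are drawn from.
        length = min(read_length, len(genome))
        for i in range(count):
            start = rng.randrange(len(genome) - length + 1)
            fragment = genome[start:start + length]
            # Replace a base with a different one at the given error rate.
            noisy = "".join(
                rng.choice([b for b in bases if b != base])
                if rng.random() < error_rate else base
                for base in fragment
            )
            reads.append((f"{organism}_read{i}", noisy))
    # Shuffle so reads from different organisms are interleaved.
    rng.shuffle(reads)
    return reads


# Toy references and a sample design expressed as reads per organism.
toy_references = {"orgA": "ACGT" * 50, "orgB": "GGCC" * 50}
toy_design = {"orgA": 7, "orgB": 3}
toy_reads = simulate_reads(toy_references, toy_design,
                           read_length=40, error_rate=0.1, seed=1)
```

Because the design is given as read counts rather than proportions, the number of reads per organism in the output equals the design exactly, which is the property a downstream pipeline evaluation would check against.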
- Published
- 2025