Machine Learning-Based Determination of Sampling Depth for Complex Environmental Systems: Case Study with Single-Cell Raman Spectroscopy Data in EBPR Systems
- Source :
- Environmental Science & Technology. 56(18)
- Publication Year :
- 2022
Abstract
- Rapid progress in advanced analytical methods such as single-cell technologies enables an unprecedented and deeper understanding of microbial ecology, beyond the resolution of conventional approaches. A major application challenge lies in determining a sufficient sample size without sufficient prior knowledge of the community complexity, and in balancing statistical power against limited time or resources. This hinders the desired standardization and wider application of these technologies. Here, we proposed, tested, and validated a computational sampling size assessment protocol that takes advantage of a metric named kernel divergence. This metric has two advantages. First, it directly compares dataset-wise distributional differences, with no requirement for human intervention or prior knowledge-based pre-classification. Second, it makes minimal assumptions about the distribution and sample space during data processing, which widens its application domain. As verified by testing, this allows appropriate handling of datasets with both linear and non-linear relationships. The model was then validated in a case study with eight single-cell Raman spectroscopy (SCRS) phenotyping datasets, each sampled from a different enhanced biological phosphorus removal (EBPR) activated sludge community located across North America. The model allows the determination of a sufficient sampling size for any targeted or customized information capture capacity or resolution level. For example, a sampling size of approximately 50 or 100 spectra is sufficient for full-scale EBPR-related ecosystems at 5% or 2% OPU cluster resolution, respectively. Given its flexibility and minimal restrictions on input data types, the proposed method is expected to serve as a standardized approach for sampling size optimization, enabling more comparable and reproducible experiments and analyses on complex environmental samples. Finally, these advantages suggest that the method can generalize to other single-cell technologies or environmental applications, provided that the input datasets contain only continuous features.
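The general idea of the protocol described above (compare subsamples of increasing size against a reference dataset with a kernel-based distributional distance, then pick the smallest size that falls below a tolerance) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define kernel divergence, so the sketch substitutes the squared maximum mean discrepancy (MMD) with an RBF kernel, which is a different, well-known kernel two-sample statistic. The function names, the `gamma` bandwidth default, the tolerance, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def rbf_mmd2(x, y, gamma):
    """Biased estimate of squared MMD between samples x and y (RBF kernel).

    NOTE: MMD is a stand-in here, not the paper's kernel divergence.
    """
    def gram(a, b):
        # Pairwise squared Euclidean distances, clipped at 0 for rounding error.
        sq = (np.sum(a ** 2, axis=1)[:, None]
              + np.sum(b ** 2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-gamma * np.maximum(sq, 0.0))

    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()


def divergence_curve(data, sizes, n_repeats=20, gamma=None, seed=0):
    """Mean divergence between random subsamples of each size and the full data."""
    rng = np.random.default_rng(seed)
    if gamma is None:
        gamma = 1.0 / data.shape[1]  # assumed heuristic bandwidth
    curve = {}
    for n in sizes:
        vals = []
        for _ in range(n_repeats):
            # Draw n rows without replacement and compare them to the full set.
            idx = rng.choice(len(data), size=n, replace=False)
            vals.append(rbf_mmd2(data[idx], data, gamma))
        curve[n] = float(np.mean(vals))
    return curve


def smallest_sufficient_size(curve, tol):
    """Smallest tested size whose mean divergence is at or below tol, else None."""
    for n in sorted(curve):
        if curve[n] <= tol:
            return n
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-in for a spectra matrix: rows are cells, columns are
    # continuous features (two artificial clusters, not real SCRS data).
    data = np.vstack([rng.normal(0.0, 1.0, (300, 5)),
                      rng.normal(3.0, 1.0, (300, 5))])
    curve = divergence_curve(data, sizes=[10, 25, 50, 100, 200])
    for n, d in curve.items():
        print(f"n={n:4d}  mean divergence={d:.5f}")
    print("smallest size under tol:", smallest_sufficient_size(curve, tol=0.01))
```

In this sketch the tolerance plays the role of the target resolution level: a stricter tolerance demands a larger sample. How the paper maps its metric to the 5% or 2% OPU cluster resolution is not stated in the abstract and is not reproduced here.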
- Subjects :
- Biological Products
Sewage
Computer science
Phosphorus
General Chemistry
Spectrum Analysis, Raman
Statistical power
Machine Learning
Enhanced biological phosphorus removal
Activated sludge
Sample size determination
Polyphosphates
Kernel (statistics)
Metric (mathematics)
Environmental Chemistry
Humans
Data mining
Divergence (statistics)
Details
- ISSN :
- 1520-5851
- Volume :
- 56
- Issue :
- 18
- Database :
- OpenAIRE
- Journal :
- Environmental Science & Technology
- Accession number :
- edsair.doi.dedup.....c6a884894beecbd0d30b4606a5660aa6