Adjusting for Principal Components of Molecular Phenotypes Induces Replicating False Positives
- Source :
- Genetics, Genetics Society of America, 2019, 211 (4), pp. 1179-1189. ⟨10.1534/genetics.118.301768⟩
- Publication Year :
- 2019
- Publisher :
- eScholarship, University of California, 2019.
Abstract
- Biological, technical, and environmental confounders are ubiquitous in the high-dimensional, high-throughput functional genomic measurements being used to understand cellular biology and disease processes, and many approaches have been developed to estimate and correct for unmeasured confounders... High-throughput measurements of molecular phenotypes provide an unprecedented opportunity to model cellular processes and their impact on disease. These highly structured datasets are usually strongly confounded, creating false positives and reducing power. This has motivated many approaches based on principal components analysis (PCA) to estimate and correct for confounders, which have become indispensable elements of association tests between molecular phenotypes and both genetic and nongenetic factors. Here, we show that these correction approaches induce a bias, and that it persists for large sample sizes and replicates out-of-sample. We prove this theoretically for PCA by deriving an analytic, deterministic, and intuitive bias approximation. We assess other methods with realistic simulations, which show that perturbing any of several basic parameters can cause false positive rate (FPR) inflation. Our experiments show the bias depends on covariate and confounder sparsity, effect sizes, and their correlation. Surprisingly, when the covariate and confounder have ρ² ≈ 10%, standard two-step methods all have >10-fold FPR inflation. Our analysis informs best practices for confounder correction in genomic studies, and suggests many false discoveries have been made and replicated in some differential expression analyses.
- Subjects :
- Computer science
01 natural sciences
Correlation
010104 statistics & probability
0302 clinical medicine
Models
Statistics
False positive paradox
MESH: Animals
MESH: Models, Genetic
Genetics
0303 health sciences
[STAT.AP]Statistics [stat]/Applications [stat.AP]
Principal Component Analysis
Confounding
Phenotype
confounder
MESH: Reproducibility of Results
Principal component analysis
molecular trait
[STAT.ME]Statistics [stat]/Methodology [stat.ME]
Quantitative Trait Loci
Genomics
Computational biology
Quantitative trait locus
Biology
Investigations
MESH: Phenotype
03 medical and health sciences
Genetic
Covariate
Animals
Humans
0101 mathematics
030304 developmental biology
MESH: Principal Component Analysis
MESH: Humans
Models, Genetic
Human Genome
Reproducibility of Results
MESH: Quantitative Trait Loci
[SDV.GEN.GH]Life Sciences [q-bio]/Genetics/Human genetics
eigenvector perturbation
quantitative trait loci
MESH: Genome-Wide Association Study
[SDV.SPEE]Life Sciences [q-bio]/Santé publique et épidémiologie
False positive rate
[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
030217 neurology & neurosurgery
Genome-Wide Association Study
Developmental Biology
Subjects
Details
- ISSN :
- 00166731
- Database :
- OpenAIRE
- Journal :
- Genetics, Genetics Society of America, 2019, 211 (4), pp. 1179-1189. ⟨10.1534/genetics.118.301768⟩
- Accession number :
- edsair.doi.dedup.....7958be379736ce8b9bc528532e2bd693
- Full Text :
- https://doi.org/10.1534/genetics.118.301768