Missing data estimation in morphometrics: how much is too much?
- Source :
- Systematic Biology, Oxford University Press (OUP), 2014, 63 (2), pp. 203-218. ⟨10.1093/sysbio/syt100⟩
- Publication Year :
- 2014
- Publisher :
- HAL CCSD, 2014.
Abstract
- Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. In recent years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies showed that this limit actually depends on various properties of the study data set and of the selected imputation method, and is in no way generalizable. We evaluate the relative performance of seven multiple imputation (MI) techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. MI techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of MIs with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
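The procedure outlined at the end of the abstract (impute several times, ordinate each imputed data set by PCA, superimpose the ordinations by Procrustes fitting, then inspect how far each specimen moves) can be sketched roughly as follows. This is an illustrative Python sketch and not the authors' R function. All function names are hypothetical, and the imputation step is a deliberately crude stand-in (column mean plus Gaussian noise), not Fully Conditional Specification or Expectation-Maximization.

```python
import numpy as np

def impute_once(X, rng):
    """Placeholder stochastic imputation (an assumption, not FCS or EM):
    fill each missing cell with a draw from N(column mean, column SD)."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]                      # view into the copied matrix
        miss = np.isnan(col)
        mu, sd = np.nanmean(col), np.nanstd(col)
        col[miss] = rng.normal(mu, sd, size=miss.sum())
    return X

def pca_scores(X, k=2):
    """Scores on the first k principal components, via SVD of centered data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]

def procrustes_align(A, ref):
    """Superimpose configuration A onto ref by translation, uniform scaling,
    and an orthogonal transform (rotation or reflection)."""
    A0 = A - A.mean(axis=0)
    R0 = ref - ref.mean(axis=0)
    U, s, Vt = np.linalg.svd(A0.T @ R0)
    Q = U @ Vt                             # optimal orthogonal transform
    scale = s.sum() / (A0 ** 2).sum()      # optimal uniform scale
    return scale * A0 @ Q + ref.mean(axis=0)

def mi_pca_spread(X, m=20, k=2, seed=0):
    """Run m imputations, align each PCA ordination to a reference ordination,
    and return the aligned scores plus a per-specimen spread measure."""
    rng = np.random.default_rng(seed)
    imputed = [impute_once(X, rng) for _ in range(m)]
    # Reference: PCA of the cell-wise average of the imputed data sets.
    ref = pca_scores(np.mean(imputed, axis=0), k)
    aligned = np.stack(
        [procrustes_align(pca_scores(Xi, k), ref) for Xi in imputed]
    )
    # Spread: mean distance of each specimen from its own average position.
    centroid = aligned.mean(axis=0)
    spread = np.linalg.norm(aligned - centroid, axis=2).mean(axis=0)
    return aligned, spread
```

Calling `mi_pca_spread(X)` on a specimens-by-variables array with `NaN` for missing cells returns an array of shape `(m, n_specimens, k)` and one spread value per specimen. Plotting the aligned scores of each specimen as a cloud gives a direct picture of how sensitive its position is to the imputation, which is the kind of visual check the abstract describes in place of a fixed missing-data threshold.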
- Subjects :
- MESH: Principal Component Analysis
Principal Component Analysis
Multivariate statistics
Threshold limit value
Coverage probability
[SDV.BID]Life Sciences [q-bio]/Biodiversity
Biology
Classification
Missing data
MESH: Classification
MESH: Computer Simulation
Statistics
Genetics
Superimposition
Computer Simulation
Data mining
Imputation (statistics)
Evolutionary dynamics
MESH: Phylogeny
[SDU.STU.PG]Sciences of the Universe [physics]/Earth Sciences/Paleontology
Phylogeny
Ecology, Evolution, Behavior and Systematics
Details
- Language :
- English
- ISSN :
- 1063-5157 and 1076-836X
- Database :
- OpenAIRE
- Journal :
- Systematic Biology, Oxford University Press (OUP), 2014, 63 (2), pp. 203-218. ⟨10.1093/sysbio/syt100⟩
- Accession number :
- edsair.doi.dedup.....866e877afd5c6c612411a56105a94594
- Full Text :
- https://doi.org/10.1093/sysbio/syt100