Missing data estimation in morphometrics: how much is too much?

Authors :: Gilles Escarguel
Julien Clavel
Gildas Merceron
Laboratoire de Géologie de Lyon - Terre, Planètes, Environnement (LGL-TPE)
École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-Institut national des sciences de l'Univers (INSU - CNRS)-Université Jean Monnet - Saint-Étienne (UJM)-Centre National de la Recherche Scientifique (CNRS)
Institut International de Paléoprimatologie, Paléontologie Humaine : Evolution et Paléoenvironnement (IPHEP)
Université de Poitiers-Centre National de la Recherche Scientifique (CNRS)
Laboratoire de Géologie de Lyon - Terre, Planètes, Environnement [Lyon] (LGL-TPE)
Centre National de la Recherche Scientifique (CNRS)-Institut national des sciences de l'Univers (INSU - CNRS)-Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-École normale supérieure - Lyon (ENS Lyon)
Centre National de la Recherche Scientifique (CNRS)-Université de Poitiers
Source :: Systematic Biology, Oxford University Press (OUP), 2014, 63 (2), pp.203-218. ⟨10.1093/sysbio/syt100⟩
Publication Year :: 2014
Publisher :: HAL CCSD, 2014.
Abstract: Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such a loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. In recent years, several empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on various properties of the study data set and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation (MI) techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. MI techniques appear remarkably robust to the violation of basic assumptions such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of MIs with procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
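
The general idea named in the abstract (several imputed copies of the data, a principal component analysis of each, and a Procrustes superimposition of the resulting ordinations) can be illustrated with a minimal Python sketch. This is not the authors' R function and does not reproduce their procedure. All function names and the simulated data are hypothetical, and the imputation step is a deliberately naive stand-in (random draws from each variable's observed mean and standard deviation), not Fully Conditional Specification or Expectation-Maximization.

```python
import numpy as np

def impute_draw(X, rng):
    """Return a copy of X with each NaN replaced by a random draw.

    ASSUMPTION: naive stand-in for a real MI method; each missing value is
    drawn from a normal distribution fitted to that column's observed values.
    """
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X[miss, j] = rng.normal(obs.mean(), obs.std(ddof=1), miss.sum())
    return X

def pca_scores(X, k=2):
    """Scores of the specimens on the first k principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]

def procrustes_rotate(ref, B):
    """Rotate/reflect B onto ref (orthogonal Procrustes, no scaling)."""
    U, _, Vt = np.linalg.svd(B.T @ ref)
    return B @ (U @ Vt)

def mi_pca_spread(X, m=20, k=2, seed=0):
    """Align the PCA scores of m imputed data sets onto a reference ordination."""
    rng = np.random.default_rng(seed)
    # Reference ordination: PCA of the column-mean-imputed data (an assumption).
    ref = pca_scores(np.where(np.isnan(X), np.nanmean(X, axis=0), X), k)
    aligned = np.stack([
        procrustes_rotate(ref, pca_scores(impute_draw(X, rng), k))
        for _ in range(m)
    ])
    return ref, aligned

# Simulated example: 30 specimens, 4 correlated measurements, about 10% missing.
rng = np.random.default_rng(1)
X_full = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 4))
X = X_full.copy()
X[rng.random(X.shape) < 0.1] = np.nan
X[0] = X_full[0]  # keep specimen 0 complete for comparison

ref, aligned = mi_pca_spread(X, m=20, k=2)
# Per-specimen spread of the aligned scores across imputations.
spread = aligned.std(axis=0).sum(axis=1)
```

Plotting the rows of `aligned` for each specimen around its position in `ref` gives a cloud whose size reflects how strongly the imputed values move that specimen in the ordinated space.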

Details

Language :: English
ISSN :: 1063-5157 and 1076-836X
Database :: OpenAIRE
Journal :: Systematic Biology
Accession number :: edsair.doi.dedup.....866e877afd5c6c612411a56105a94594
Full Text :: https://doi.org/10.1093/sysbio/syt100