
PLS-based and regularization-based methods for the selection of relevant variables in non-targeted metabolomics data

Authors :
Renata Bujak
Emilia Daghir-Wojtkowiak
Roman Kaliszan
Michal Jan Markuszewski
Source :
Frontiers in Molecular Biosciences, Vol 3 (2016)
Publication Year :
2016
Publisher :
Frontiers Media S.A., 2016.

Abstract

Non-targeted metabolomics is a part of systems biology and aims to determine many metabolites in complex biological samples. Datasets obtained in non-targeted metabolomics studies are multivariate and high-dimensional owing to the sensitivity of mass spectrometry-based detection methods and the complexity of biological matrices. Proper selection of the variables that contribute to group classification is a crucial step, especially in metabolomics studies focused on searching for disease biomarker candidates. In the present study, three statistical approaches were tested and compared using two metabolomics datasets (the RH and PH studies): orthogonal projections to latent structures-discriminant analysis (OPLS-DA) without multiple testing correction, OPLS-DA with multiple testing correction, and the least absolute shrinkage and selection operator (LASSO). For the RH study, the OPLS-DA model built without multiple testing correction selected 46 and 218 variables based on VIP criteria using Pareto and UV scaling, respectively. For the PH study, 217 and 320 variables were selected based on VIP criteria using Pareto and UV scaling, respectively. In the RH study, the OPLS-DA model built with multiple testing correction selected 4 and 19 variables as statistically significant using Pareto and UV scaling, respectively. For the PH study, 14 and 18 variables were selected based on VIP criteria using Pareto and UV scaling, respectively. Additionally, the concept and fundamentals of LASSO, with a bootstrap procedure evaluating the reproducibility of results, were demonstrated. In the RH and PH studies, LASSO selected 14 and 4 variables, respectively, with reproducibility between 99.3% and 100%. However, despite the popularity of PLS-DA and OPLS-DA in metabolomics, it should be highlighted that they do not control type I or type II error, but only arbitrarily establish a cut-off value for PLS-DA loadings. Such a multivariate model shows high goodness-of-fit to the data, but the risk of overfitting increases substantially. Therefore, the LASSO method was applied for the first time to the statistical analysis of datasets obtained in untargeted metabolomics studies. The advantage of LASSO lies in its ability to model different types of omics data and to account for multicollinearity and p >> n problems.
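The abstract mentions Pareto and UV scaling and a bootstrap procedure for judging how reproducibly LASSO selects variables. The sketch below illustrates those ideas on synthetic data and is not the authors' implementation. Several things in it are assumptions: the penalty `lam` is fixed rather than tuned, the LASSO is a plain linear-regression LASSO fitted by coordinate descent on class labels coded as -1/+1, "reproducibility" is read as the fraction of bootstrap resamples in which a variable receives a nonzero coefficient, and all function names are hypothetical.

```python
import numpy as np

def scale(X, method="uv"):
    """Mean-center each column, then divide by its SD (UV) or sqrt(SD) (Pareto)."""
    centered = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # guard: constant columns stay at zero after centering
    if method == "uv":
        return centered / sd
    if method == "pareto":
        return centered / np.sqrt(sd)
    raise ValueError(f"unknown scaling method: {method}")

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the residual that excludes column j
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta

def bootstrap_selection_frequency(X, y, lam, n_boot=30, scaling="uv", seed=0):
    """Fraction of bootstrap resamples in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample rows with replacement
        Xb = scale(X[idx], scaling)        # rescale inside each resample
        yb = y[idx] - y[idx].mean()
        counts += lasso_cd(Xb, yb, lam) != 0
    return counts / n_boot

# Synthetic p > n example: 40 samples, 60 variables, first 3 carry group signal.
rng = np.random.default_rng(42)
y = np.repeat([-1.0, 1.0], 20)
X = rng.normal(10.0, 2.0, size=(40, 60))
X[:, :3] += 3.0 * y[:, None]

freq = bootstrap_selection_frequency(X, y, lam=0.15)
for j in np.argsort(freq)[::-1][:5]:
    print(f"variable {j}: selected in {100 * freq[j]:.1f}% of resamples")
```

In practice the penalty would normally be chosen by cross-validation, and a two-group comparison might use an L1-penalised logistic model instead; the fixed `lam` and the squared-error loss are used here only to keep the sketch short and dependency-light.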

Details

Language :
English
ISSN :
2296889X
Volume :
3
Database :
Directory of Open Access Journals
Journal :
Frontiers in Molecular Biosciences
Publication Type :
Academic Journal
Accession number :
edsdoj.26fd99f17fd84e98a8bfe02c3f3e283a
Document Type :
article
Full Text :
https://doi.org/10.3389/fmolb.2016.00035