
Robust and nonparametric classification of gene expression data

Authors :
Szekely, Robin
Kestler, Hans A.
Schwenker, Friedhelm
Lausen, Berthold
Hoffmann, Steve
Publication Year :
2021
Publisher :
Universität Ulm, 2021.

Abstract

Interpretable and accurate diagnostic models are a major component of personalised medicine. This setting shifts the conventional therapy of patients towards a treatment based on multiple molecular markers. Nowadays, large amounts of gene expression data can be obtained within hours or days by technologies such as microarrays or RNA-Seq. The analysis of gene expression data is special because the data are high-dimensional: a very large number of measurements usually comes with a small number of samples. Due to this high dimensionality, manual verification by human experts is typically impossible, and automatic systems are needed for further analysis. Supervised learning is one way to predict the diagnostic classes of unseen samples based on a set of molecular measurements. In this thesis, I analysed three topics that combine the aspects of data acquisition, feature selection and complexity reduction for the classification of gene expression data.

In the first part, feature selection is analysed in a multi-class context. Multi-class classifier systems decompose a multi-class classification problem into numerous two-class base classifiers. The main question was how an individually tailored feature selection for each base classifier compares to a common feature selection for the whole system (both strategies are sketched below). Cross-validation experiments were performed on 9 multi-class gene expression datasets with different feature set sizes and two filters as feature selection algorithms. I analysed the predictive performance of such systems along with the relative feature overlaps and the stability against resampling. The use of feature selection led to better performance in over 80% of all experiments, and in nearly 60% of these cases the best result was achieved by the individual feature selection. However, a common feature signature for the whole multi-class classifier system showed only slightly lower performance for larger signatures. In this case, multi-class classifier systems achieved high accuracies even if individual members worked on possibly unspecific measurements.

In the second part of this thesis, I performed cross-validation experiments on the same 9 multi-class datasets as in the previous part and analysed whether a transfer of foreign features can improve the prediction accuracy compared to using the original subspace. Indeed, equal or higher accuracies were achieved with foreign features instead of the original features in over 60% of all experiments. One possible reason might be overfitting to the original features. In addition, I analysed the behaviour of a correlation-based feature selection that incorporates one foreign class, used as an intermediate class, to estimate the relationship between two original classes (also sketched below). The foreign feature selection algorithm was analysed empirically, and its impact on the predictive performance was discussed. Using external knowledge from a foreign class to construct a feature selection improved the accuracies, or led to equal ones, in over 67% of all experiments. An empirical analysis additionally showed that incorporating a foreign class underestimates the original feature score in the vast majority of cases; for one aggregation strategy, the original score was underestimated in over 98% of all cases. This makes it possible to find markers suitable for the original classification task without explicitly knowing their relationship, by using an intermediate foreign class.
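As an illustration of the two feature selection strategies from the first part, the following minimal sketch decomposes a multi-class problem one-vs-one and equips the base classifiers either with individually selected signatures or with one common signature. The t-test filter, the linear support vector machines and the averaging used to build the common signature are assumptions for this sketch, not necessarily the thesis's exact configuration.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind
from sklearn.svm import LinearSVC

def filter_scores(X, y01):
    """Absolute t-statistic of every feature for a two-class problem."""
    t, _ = ttest_ind(X[y01 == 0], X[y01 == 1], axis=0)
    return np.abs(t)

def pairwise_filter(X, y, a, b):
    """Filter scores for the two-class sub-problem 'class a vs class b'."""
    mask = np.isin(y, (a, b))
    return filter_scores(X[mask], (y[mask] == b).astype(int))

def train_one_vs_one(X, y, n_features, individual=True):
    """One-vs-one decomposition with either feature selection strategy.

    individual=True : every base classifier selects its own signature.
    individual=False: one common signature for the whole system, obtained
                      here by averaging the pairwise scores (an assumption,
                      not necessarily the thesis's aggregation).
    """
    pairs = list(combinations(np.unique(y), 2))
    if not individual:
        common = np.mean([pairwise_filter(X, y, a, b) for a, b in pairs], axis=0)
        shared_idx = np.argsort(common)[-n_features:]
    models = []
    for a, b in pairs:
        mask = np.isin(y, (a, b))
        Xp, yp = X[mask], (y[mask] == b).astype(int)
        idx = (np.argsort(pairwise_filter(X, y, a, b))[-n_features:]
               if individual else shared_idx)
        models.append((a, b, idx, LinearSVC().fit(Xp[:, idx], yp)))
    return models

def predict(models, X):
    """Majority vote of all base classifiers (labels: non-negative ints)."""
    votes = np.array([np.where(clf.predict(X[:, idx]) == 1, b, a)
                      for a, b, idx, clf in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The individual strategy lets each base classifier work on the features that separate its own pair of classes best; the common strategy forces all base classifiers onto the same, possibly less specific, signature.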
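The correlation-based foreign feature selection of the second part can be sketched as follows: a feature is scored for separating the two original classes A and B only indirectly, via the two sub-problems "A vs F" and "F vs B" with a foreign intermediate class F, and the two scores are aggregated per feature. The Pearson score and the minimum aggregation below are assumptions; the thesis considers more than one aggregation strategy.

```python
import numpy as np

def correlation_scores(X0, X1):
    """Absolute Pearson correlation of every feature with the class label
    of the two-class problem 'X0 vs X1'."""
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(len(X0)), np.ones(len(X1))]
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    return np.abs(Xc.T @ yc) / len(y)

def foreign_scores(X_a, X_b, X_f, agg=np.minimum):
    """Score features for separating classes A and B without contrasting
    A and B directly: score them against the foreign class F in the
    sub-problems 'A vs F' and 'F vs B', then aggregate per feature."""
    return agg(correlation_scores(X_a, X_f), correlation_scores(X_f, X_b))

# Rank the features and keep, e.g., the ten best ones:
# top10 = np.argsort(foreign_scores(X_a, X_b, X_f))[-10:]
```

A minimum aggregation is naturally conservative, which fits the reported tendency of foreign feature selections to underestimate the original feature score.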
The last part of this thesis focused on invariant linear classifiers, whose predictions are not affected if test samples are distorted by certain types of noise. These classification models come with a lower model complexity and are more robust than a traditional linear classifier. I discussed the invariance attributes and compared the performance of four subclasses of invariant linear classifiers on 27 two-class gene expression datasets as well as on artificial datasets of multiple dimensions. Experiments were performed in cross-validation and with two variants of invariant support vector machines. In addition, the invariance attributes were confirmed experimentally on noisy artificial data; overall, 15000 noisy artificial datasets were used in these experiments. The results showed that for high dimensions, three of the four invariant linear classifiers achieved results very similar to those of the standard linear classifier, despite having a reduced model complexity. For the 27 gene expression datasets and all artificial datasets, the differences in accuracy were under 2%. An exception is the strictest invariant subclass, which operates on only two markers (a sketch of such a model is given below); for this model, the accuracies could drop by up to 30% compared to the general linear classifier. Additionally, an analysis of feature-selecting invariant support vector machines on the 27 gene expression datasets revealed that, on average, only up to 0.51% of all markers are needed to construct an invariant set of features.
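One way to realise the strictest, two-marker invariant subclass (the concrete construction below is an illustration, not necessarily the thesis's definition) is a classifier whose decision depends only on the order of the two selected measurements; any distortion of a test sample that preserves this order, for example a common shift or rescaling of all measurements, cannot change the prediction.

```python
import numpy as np
from itertools import combinations

class TwoMarkerOrderClassifier:
    """Predicts from the sign of x[i] - x[j] for one selected marker pair.

    The decision only depends on which of the two measurements is larger,
    so it is unchanged by any distortion of a test sample that preserves
    this order (e.g. x -> a*x + b with a > 0 on all measurements).
    Choosing the pair by training accuracy is an assumption.
    """

    def fit(self, X, y):
        best = -1.0
        for i, j in combinations(range(X.shape[1]), 2):
            pred = (X[:, i] > X[:, j]).astype(int)
            acc = np.mean(pred == y)
            # also consider the label-flipped rule x[i] <= x[j]
            for flipped, a in ((False, acc), (True, 1.0 - acc)):
                if a > best:
                    best, self.pair_, self.flip_ = a, (i, j), flipped
        return self

    def predict(self, X):
        i, j = self.pair_
        pred = (X[:, i] > X[:, j]).astype(int)
        return 1 - pred if self.flip_ else pred

# The invariance can be checked directly: a shifted and rescaled copy of
# the test data yields identical predictions.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))
y = (X[:, 3] > X[:, 7]).astype(int)
clf = TwoMarkerOrderClassifier().fit(X, y)
assert (clf.predict(X) == clf.predict(2.0 * X + 5.0)).all()
```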

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....46a13c35784f42a8656275eb95ed2a37
Full Text :
https://doi.org/10.18725/oparu-38329