Abstract 2108: Application of random forest machine learning techniques on mixed data from breast cancer studies

Authors :
Lawson Taylor
Anita Grigoriadis
Johan Staaf
Jelmar Quist
Source :
Cancer Research. 80:2108-2108
Publication Year :
2020
Publisher :
American Association for Cancer Research (AACR), 2020.

Abstract

Integrative analysis of diverse high-dimensional molecular, histopathological and clinical data provides an effective way to identify biologically and clinically relevant subclasses across multi-level data. However, identifying distinctive features and group structure in such data in an unbiased, integrative manner remains problematic. We propose a machine learning clustering technique based on random forest methods that enables unbiased integration. Using a permutation-based framework for tree construction and for measuring feature importance, the technique produces robust and pure clusters. The performance of standard, regularised, and conditional inference random forest methods was evaluated using the adjusted Rand index, the Calinski-Harabasz index, and cluster and feature purity. In simulation studies, random forest clustering techniques were able to identify clusters of high purity. Using datasets from the UCI Machine Learning Repository as a proof of concept, all three techniques were able to identify clusters in mixed data, with the conditional inference method producing clusters with the highest feature purity. Next, we applied our clustering techniques to high-dimensional data obtained from two independent breast cancer studies: (i) International Cancer Genome Consortium (ICGC), consisting of 560 cases and 147 features, and (ii) Sweden Cancerome Analysis Network - Breast (SCAN-B), incorporating 241 cases and 53 features. Features included rearrangement and mutational signatures, somatic mutations in cancer drivers, germline mutations in BRCA1/2, genomic instability measures, intrinsic molecular breast cancer subtypes, and clinico-pathological characteristics. Despite dissimilarities in the breast cancer subtype composition between these two datasets, the conditional inference random forest method was able to identify concordant subgroups between the studies, supported by molecular and histopathological characteristics. Moreover, novel relationships amongst molecular features with potential clinical relevance were revealed. For example, one cluster was enriched for BRCA2-deficient breast cancer cases with MYC amplifications, while another predominantly consisted of non-basal-like triple-negative breast cancers with PIK3CA mutations. Together, these results support the use of our machine learning clustering technique based on random forest methods to identify robust and biologically relevant group structures using complex high-dimensional mixed data.

Citation Format: Jelmar Quist, Lawson Taylor, Johan Staaf, Anita Grigoriadis. Application of random forest machine learning techniques on mixed data from breast cancer studies [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2108.
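The abstract names the adjusted Rand index as one of the measures used to evaluate the clusterings. As an illustration only, the standard pair-counting definition of that index can be sketched in plain Python. This is not the authors' implementation; the function name `adjusted_rand_index` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cases.

    Hypothetical helper for illustration: it follows the standard
    pair-counting formula and is not taken from the abstract.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cases")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of cases placed together in both labelings (contingency cells).
    cells = Counter(zip(labels_a, labels_b))
    together_both = sum(comb(count, 2) for count in cells.values())

    # Pairs of cases placed together within each labeling on its own.
    together_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Toy labels (made up): cluster names differ, but the grouping is identical.
reference = ["basal", "basal", "luminal", "luminal"]
clusters = [2, 2, 1, 1]
print(adjusted_rand_index(reference, clusters))
```

The index depends only on which cases are grouped together, so it is unaffected by how the clusters are named, which makes it suitable for comparing a clustering against known subtype labels.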

Details

ISSN :
15387445 and 00085472
Volume :
80
Database :
OpenAIRE
Journal :
Cancer Research
Accession number :
edsair.doi...........1b133c471ff4bf3f8bb400ccf828cdf0
Full Text :
https://doi.org/10.1158/1538-7445.am2020-2108