Clustering of immune-mediated diseases using genomic data

Authors :: Nicholls, Katherine
Publication Year :: 2023
Publisher :: Apollo - University of Cambridge Repository, 2023.
Abstract: Studying immune-mediated diseases (IMD) yields insights into the active immune system. In this thesis I cluster patients with IMD based on RNA-seq data using two extensions to clustering, and also cluster the diseases themselves using GWAS summary statistics. I conducted an extensive study of biclustering methods and found that a Bayesian biclustering method called SSLB had the best performance on simulated datasets and recovered biologically relevant biclusters in a knockout mouse dataset and a sorted blood cell dataset. Through this study I developed tools for processing and analysing biclustering results. I applied this knowledge to perform biclustering, with SSLB, of a sorted blood cell dataset containing patients with six immune-mediated diseases. Among the biclusters were a bicluster recovering the genes that escape X-inactivation and a bicluster capturing the type 1 interferon response that was enriched for samples from patients with systemic lupus erythematosus. I found that whilst it is an advantage that biclustering has sufficient complexity to describe the immune system comprehensively, this also means that the biclustering results are themselves complex and can be challenging to interpret. In order to study RNA-seq data summarised using key immune gene signatures, I developed DPMUnc, an extension to a Dirichlet process Bayesian clustering model which allows the uncertainty associated with the data points to be taken into account. I was then able to cluster patients based on their average gene expression across gene signatures, for example finding the expected enrichment of lupus patients in a cluster with high expression of interferon genes. I also clustered the immune-mediated diseases themselves using GWAS summary statistics. DPMUnc separated autoimmune from autoinflammatory diseases and isolated other subgroups such as the EGPA subtypes and multiple sclerosis.
This thesis thus reveals an interesting trade-off between the complexity of the data and the utility of the results. Biclustering is sufficiently complex to capture the heterogeneity of the data, but its results are difficult to interpret. In contrast, although some information is lost by summarising the data into key gene signatures, the summarisation allows for more concrete analysis.