1. A mixture model for expression deconvolution from RNA-seq in heterogeneous tissues
- Author
- Xiaohui Xie and Yi Li
- Subjects
Cell type, Sequence analysis, RNA-Seq, Computational biology, Biology, Biochemistry, Cell Line, Transcriptome, 03 medical and health sciences, 0302 clinical medicine, Structural Biology, Humans, Molecular Biology, 030304 developmental biology, Genetics, 0303 health sciences, Abundance estimation, Models, Statistical, Sequence Analysis, RNA, Gene Expression Profiling, Applied Mathematics, High-Throughput Nucleotide Sequencing, Computer Science Applications, Gene expression profiling, Proceedings, 030220 oncology & carcinogenesis, Deconvolution, DNA microarray, Algorithms
- Abstract
Background: RNA-seq, a next-generation sequencing based method for transcriptome analysis, is rapidly emerging as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNA-seq can be strongly affected by the purity of samples. A prominent, outstanding problem in RNA-seq is how to estimate transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. Although experimental methods have been proposed to dissect multiple distinct cell types, computationally "deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as the subsequent molecular content yield intact.
Results: Here we propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT), to estimate the transcript abundances of each cell type of interest from RNA-seq data of heterogeneous tissue samples. TEMT incorporates positional and sequence-specific biases, and its online EM algorithm requires only a runtime proportional to the data size and a small, constant amount of memory. We test the proposed method on both simulated data and recently released ENCODE data, and show that TEMT significantly outperforms current state-of-the-art methods that do not take tissue heterogeneity into account. Currently, TEMT resolves only the tissue heterogeneity resulting from two cell types, but it can be extended to handle heterogeneity resulting from multiple cell types. TEMT is written in Python and is freely available at https://github.com/uci-cbcl/TEMT.
Conclusions: The probabilistic model-based approach proposed here provides a new method for analyzing RNA-seq data from heterogeneous tissue samples. By applying the method to both simulated data and ENCODE data, we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
- Published
- 2013