121 results for "Batch correction"
Search Results
2. Mitigating Interobserver Variability in Radiomics with ComBat: A Feasibility Study.
- Author
-
D'Anna, Alessia, Stella, Giuseppe, Gueli, Anna Maria, Marino, Carmelo, and Pulvirenti, Alfredo
- Subjects
NON-small-cell lung carcinoma, FEATURE extraction, RADIOMICS, COMPUTED tomography, INDIVIDUALIZED medicine - Abstract
This study investigates Interobserver Features Variability (IFV) in radiomics studies and assesses the effectiveness of the ComBat harmonization method in mitigating these effects. Methods: This study utilizes data from the NSCLC-Radiomics-Interobserver1 dataset, comprising CT scans of 22 Non-Small Cell Lung Cancer (NSCLC) patients, with multiple Gross Tumor Volume (GTV) delineations performed by five radiation oncologists. Segmentation was completed manually ("vis") or by autosegmentation with manual editing ("auto"). A total of 1229 radiomic features were extracted from each GTV, segmentation method, and oncologist. Features extracted included first order, shape, GLCM, GLRLM, GLSZM, and GLDM from original, wavelet-filtered, and LoG-filtered images. Results: Before implementing ComBat harmonization, 83% of features exhibited p-values below 0.05 in the "vis" approach; this percentage decreased to 34% post-harmonization. Similarly, for the "auto" approach, 75% of features demonstrated statistical significance prior to ComBat, but this figure declined to 33% after its application. Among a subset of three expert radiation oncologists, percentages changed from 77% to 25% for "vis" contouring and from 64% to 23% for "auto" contouring. This study demonstrates that ComBat harmonization could effectively reduce IFV, enhancing the feasibility of multicenter radiomics studies. It also highlights the significant impact of physician experience on radiomics analysis outcomes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
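At its core, the ComBat harmonization applied in the record above is a per-batch location and scale adjustment of each feature. The Python sketch below illustrates that idea under simplifying assumptions: it standardizes each batch against pooled statistics and omits the empirical Bayes shrinkage that full ComBat applies to the batch estimates; the variable names are illustrative only.

```python
import numpy as np

def combat_like_adjust(X, batches):
    """Simplified location/scale batch adjustment in the spirit of ComBat.

    X       : (n_samples, n_features) radiomic feature matrix
    batches : length n_samples array of batch labels (e.g., observer IDs)

    Full ComBat additionally shrinks the per-batch location/scale estimates
    with an empirical Bayes prior; this sketch skips that step.
    """
    X, batches = np.asarray(X, dtype=float), np.asarray(batches)
    grand_mean = X.mean(axis=0)
    pooled_std = X.std(axis=0, ddof=1)
    X_adj = np.empty_like(X)
    for b in np.unique(batches):
        idx = batches == b
        mu = X[idx].mean(axis=0)          # batch-specific location
        sd = X[idx].std(axis=0, ddof=1)   # batch-specific scale
        sd[sd == 0] = 1.0                 # guard constant features
        X_adj[idx] = (X[idx] - mu) / sd * pooled_std + grand_mean
    return X_adj
```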
3. Benchmarking clustering, alignment, and integration methods for spatial transcriptomics
- Author
-
Yunfei Hu, Manfei Xie, Yikang Li, Mingxing Rao, Wenjun Shen, Can Luo, Haoran Qin, Jihoon Baek, and Xin Maizie Zhou
- Subjects
Spatial transcriptomics, Benchmarking, Clustering, Alignment, Integration, Batch correction, Biology (General), QH301-705.5, Genetics, QH426-470 - Abstract
Abstract Background Spatial transcriptomics (ST) is advancing our understanding of complex tissues and organisms. However, building a robust clustering algorithm to define spatially coherent regions in a single tissue slice and aligning or integrating multiple tissue slices originating from diverse sources for essential downstream analyses remains challenging. Numerous clustering, alignment, and integration methods have been specifically designed for ST data by leveraging its spatial information. The absence of comprehensive benchmark studies complicates the selection of methods and future method development. Results In this study, we systematically benchmark a variety of state-of-the-art algorithms with a wide range of real and simulated datasets of varying sizes, technologies, species, and complexity. We analyze the strengths and weaknesses of each method using diverse quantitative and qualitative metrics and analyses, including eight metrics for spatial clustering accuracy and contiguity, uniform manifold approximation and projection visualization, layer-wise and spot-to-spot alignment accuracy, and 3D reconstruction, which are designed to assess method performance as well as data quality. The code used for evaluation is available on our GitHub. Additionally, we provide online notebook tutorials and documentation to facilitate the reproduction of all benchmarking results and to support the study of new methods and new datasets. Conclusions Our analyses lead to comprehensive recommendations that cover multiple aspects, helping users to select optimal tools for their specific needs and guide future method development.
- Published
- 2024
- Full Text
- View/download PDF
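Benchmarks like the one above typically score spatial clustering against ground-truth layer annotations with label-agreement metrics. The toy example below computes two commonly used ones, ARI and NMI, with scikit-learn; the label vectors are invented for illustration and stand in for the paper's larger panel of eight metrics.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Ground-truth layer annotations vs. labels from a spatial clustering method
true_layers = ["L1", "L1", "L2", "L2", "L3", "L3"]
pred_labels = [0, 0, 1, 1, 1, 2]

print("ARI:", adjusted_rand_score(true_layers, pred_labels))  # 1.0 = perfect
print("NMI:", normalized_mutual_info_score(true_layers, pred_labels))
```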
4. Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis.
- Author
-
Beibei Wang and Yihui Luan
- Subjects
METAGENOMICS, PHENOTYPES, STANDARD deviations, EVALUATION methodology, DATA analysis, FORECASTING - Abstract
Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, the presence of heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods aimed at removing heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrates significant superiority in predicting quantitative phenotypes or attains a noteworthy reduction in the Root Mean Squared Error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the evaluation of normalization methods for predicting quantitative phenotypes from metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
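The evaluation design described above (train on one study, predict a quantitative phenotype in another, compare normalizations by RMSE) can be captured in a small harness. The sketch below is a generic illustration, not the paper's code; the `normalize` argument and the example total-sum-scaling function are stand-ins for any of the 22 methods compared.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def cross_study_rmse(X_train, y_train, X_test, y_test, normalize):
    """Fit on one study, predict a quantitative phenotype in another,
    and report RMSE under a chosen normalization/batch correction."""
    model = Ridge().fit(normalize(X_train), y_train)
    pred = model.predict(normalize(X_test))
    return np.sqrt(mean_squared_error(y_test, pred))

# One candidate normalizer: total-sum scaling to relative abundances
def tss(X):
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)
```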
5. Processing-Bias Correction with DEBIAS-M Improves Cross-Study Generalization of Microbiome-Based Prediction Models
- Author
-
Austin, George I., Brown Kav, Aya, Park, Heekuk, Biermann, Jana, Uhlemann, Anne-Catrin, Korem, Tal, and Ma, Jian, editor
- Published
- 2024
- Full Text
- View/download PDF
6. SuperCellCyto: enabling efficient analysis of large scale cytometry datasets
- Author
-
Putri, Givanna H., Howitt, George, Marsh-Wakefield, Felix, Ashhurst, Thomas M., and Phipson, Belinda
- Published
- 2024
- Full Text
- View/download PDF
7. scCorrector: a robust method for integrating multi-study single-cell data.
- Author
-
Guo, Zhen-Hao, Wang, Yan-Bin, Wang, Siguo, Zhang, Qinhu, and Huang, De-Shuang
- Subjects
MULTIOMICS, CYTOLOGY, KNOWLEDGE transfer, TRANSCRIPTOMES - Abstract
The advent of single-cell sequencing technologies has revolutionized cell biology studies. However, integrative analyses of diverse single-cell data face serious challenges, including technological noise, sample heterogeneity, and different modalities and species. To address these problems, we propose scCorrector, a variational autoencoder-based model that can integrate single-cell data from different studies and map them into a common space. Specifically, we designed a Study Specific Adaptive Normalization for each study in the decoder to implement these features. scCorrector achieves competitive and robust performance compared with state-of-the-art methods and brings novel insights under various circumstances (e.g., various batches, multi-omics, cross-species, and development stages). In addition, the integration of single-cell data and spatial data makes it possible to transfer information between different studies, which greatly expands the narrow range of genes covered by MERFISH technology. In summary, scCorrector can efficiently integrate multi-study single-cell datasets, thereby providing broad opportunities to tackle challenges emerging from noisy resources. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
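The "Study Specific Adaptive Normalization" in the decoder can be read as a conditional normalization layer with per-study affine parameters. The PyTorch sketch below shows one plausible construction of such a layer; it is an interpretation, and the actual scCorrector layer may differ in detail.

```python
import torch
import torch.nn as nn

class StudySpecificNorm(nn.Module):
    """Layer normalization with a separate affine (gain, bias) per study.

    Mirrors the general idea of conditioning a decoder on the study of
    origin; not the authors' implementation.
    """
    def __init__(self, dim, n_studies):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.gain = nn.Embedding(n_studies, dim)  # per-study scale
        self.bias = nn.Embedding(n_studies, dim)  # per-study shift
        nn.init.ones_(self.gain.weight)
        nn.init.zeros_(self.bias.weight)

    def forward(self, h, study_id):
        return self.gain(study_id) * self.norm(h) + self.bias(study_id)

h = torch.randn(8, 64)             # decoder hidden states for 8 cells
study = torch.randint(0, 3, (8,))  # study index per cell
out = StudySpecificNorm(64, n_studies=3)(h, study)
```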
8. scInterpreter: a knowledge-regularized generative model for interpretably integrating scRNA-seq data
- Author
-
Zhen-Hao Guo, Yan Wu, Siguo Wang, Qinhu Zhang, Jin-Ming Shi, Yan-Bin Wang, and Zhan-Heng Chen
- Subjects
Single-cell RNA-seq, Batch correction, Integration, Deep learning, Knowledge-regularized, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5 - Abstract
Abstract Background The rapid emergence of single-cell RNA-seq (scRNA-seq) data presents remarkable opportunities for broad investigations through integration analyses. However, most integration models are black boxes that lack interpretability or are hard to train. Results To address the above issues, we propose scInterpreter, a deep learning-based interpretable model. scInterpreter substantially outperforms other state-of-the-art (SOTA) models in multiple benchmark datasets. In addition, scInterpreter is extensible and can integrate and annotate atlas scRNA-seq data. We evaluated the robustness of scInterpreter in a variety of situations. Through comparison experiments, we found that with a knowledge prior, the training process can be significantly accelerated. Finally, we conducted interpretability analysis for each dimension (pathway) of cell representation in the embedding space. Conclusions The results showed that the cell representations obtained by scInterpreter are full of biological significance. Through weight sorting, we found several new genes related to pathways in the PBMC dataset. In general, scInterpreter is an effective and interpretable integration tool. It is expected that scInterpreter will bring great convenience to the study of single-cell transcriptomics.
- Published
- 2023
- Full Text
- View/download PDF
12. scRNASequest: an ecosystem of scRNA-seq analysis, visualization, and publishing
- Author
-
Kejie Li, Yu H. Sun, Zhengyu Ouyang, Soumya Negi, Zhen Gao, Jing Zhu, Wanli Wang, Yirui Chen, Sarbottam Piya, Wenxing Hu, Maria I. Zavodszky, Hima Yalamanchili, Shaolong Cao, Andrew Gehrke, Mark Sheehan, Dann Huh, Fergal Casey, Xinmin Zhang, and Baohong Zhang
- Subjects
Single-cell RNA-seq, Single-nucleus RNA-seq, Transcriptome, Data integration, Cell-type label transfer, Batch correction, Biotechnology, TP248.13-248.65, Genetics, QH426-470 - Abstract
Abstract Background Single-cell RNA sequencing is a state-of-the-art technology to understand gene expression in complex tissues. With the growing amount of data being generated, the standardization and automation of data analysis are critical to generating hypotheses and discovering biological insights. Results Here, we present scRNASequest, a semi-automated single-cell RNA-seq (scRNA-seq) data analysis workflow which allows (1) preprocessing from raw UMI count data, (2) harmonization by one or multiple methods, (3) reference-dataset-based cell type label transfer and embedding projection, (4) multi-sample, multi-condition single-cell level differential gene expression analysis, and (5) seamless integration with cellxgene VIP for visualization and with CellDepot for data hosting and sharing by generating compatible h5ad files. Conclusions We developed scRNASequest, an end-to-end pipeline for single-cell RNA-seq data analysis, visualization, and publishing. The source code under MIT open-source license is provided at https://github.com/interactivereport/scRNASequest . We also prepared a bookdown tutorial for the installation and detailed usage of the pipeline: https://interactivereport.github.io/scRNAsequest/tutorial/docs/ . Users have the option to run it on a local computer with a Linux/Unix system including macOS, or interact with SGE/Slurm schedulers on high-performance computing (HPC) clusters.
- Published
- 2023
- Full Text
- View/download PDF
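For orientation, the normalization and harmonization steps of such a workflow map onto a few standard scanpy calls. The snippet below is a generic scanpy sketch (using Harmony via `scanpy.external`), not the scRNASequest API; the input file name and the `batch` column are assumptions.

```python
import scanpy as sc

adata = sc.read_h5ad("raw_counts.h5ad")        # UMI counts; obs has a 'batch' column
sc.pp.normalize_total(adata, target_sum=1e4)   # depth normalization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.external.pp.harmony_integrate(adata, key="batch")   # harmonization step
sc.pp.neighbors(adata, use_rep="X_pca_harmony")        # batch-corrected embedding
sc.tl.umap(adata)
```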
13. Batch correction and harmonization of -Omics datasets with a tunable median polish of ratio.
- Author
-
Dammer, Eric B., Seyfried, Nicholas T., and Johnson, Erik C. B.
- Subjects
SYSTEMS biology, DATA harmonization, QUALITY control, DATA analysis, PROTEOMICS - Abstract
Large-scale -omics datasets can provide new insights into normal and disease-related biology when analyzed through a systems biology framework. However, technical artefacts present in most -omics datasets due to variations in sample preparation, batching, platform settings, personnel, and other experimental procedures prevent useful analyses of such data without prior adjustment for these technical factors. Here, we demonstrate a tunable median polish of ratio (TAMPOR) approach for batch effect correction and agglomeration of multiple, multi-batch, site-specific cohorts into a single analyte abundance data matrix that is suitable for systems biology analyses. We illustrate the utility and versatility of TAMPOR through four distinct use cases where the method has been applied to different proteomic datasets, some of which contain a specific defect that must be addressed prior to analysis. We compare quality control metrics and sources of variance before and after application of TAMPOR to show that TAMPOR is effective at removing batch effects and other unwanted sources of variance in -omics data. We also show how TAMPOR can be used to harmonize -omics datasets even when the data are acquired using different analytical approaches. TAMPOR is a powerful and flexible approach for cleaning and harmonization of -omics data prior to downstream systems biology analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
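TAMPOR builds on Tukey's median polish applied to abundance ratios. The sketch below implements only the bare two-way median polish on a log2-ratio matrix; TAMPOR's tunable weighting and its choice of ratio denominator (for example, a global internal standard) are omitted here.

```python
import numpy as np

def median_polish(M, n_iter=10):
    """Two-way median polish of a features x samples matrix of log2 ratios.

    Alternately removes row and column medians; the returned residuals are
    the matrix with row/column (batch-like) effects subtracted out.
    """
    resid = np.array(M, dtype=float)
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        r = np.nanmedian(resid, axis=1)   # per-feature effect
        resid -= r[:, None]
        row_eff += r
        c = np.nanmedian(resid, axis=0)   # per-sample (batch) effect
        resid -= c[None, :]
        col_eff += c
    return row_eff, col_eff, resid
```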
14. Perspectives for better batch effect correction in mass-spectrometry-based proteomics
- Author
-
Ser-Xian Phua, Kai-Peng Lim, and Wilson Wen-Bin Goh
- Subjects
Proteomics, Batch effects, Batch correction, Batch visualization, Biotechnology, TP248.13-248.65 - Abstract
Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when and on what to apply batch correction is often unclear. Here, we explore several issues pertinent to batch effect correction. The first involves applications of batch effect correction requiring prior knowledge of batch factors, and exploring data to uncover new/unknown batch factors. The second considers recent literature suggesting there is no single best batch effect correction algorithm; instead of seeking a best approach, one may instead ask what constitutes a suitable approach. The third section considers issues of batch effect detection. Finally, we look at potential developments for proteomics-specific batch effect correction methods and at how to do better functional evaluations on batch-corrected data.
- Published
- 2022
- Full Text
- View/download PDF
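On the detection question raised above, one common diagnostic (generic, not specific to this article) is to ask how much of each principal component's variance is explained by batch membership; components dominated by batch signal a problem before any correction is attempted.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pc_batch_r2(X, batch, n_pcs=10):
    """Eta-squared of batch membership for each leading PC: values near 1
    flag principal components dominated by batch rather than biology."""
    pcs = PCA(n_components=n_pcs).fit_transform(X)
    batch = np.asarray(batch)
    scores = []
    for j in range(n_pcs):
        pc = pd.Series(pcs[:, j])
        group_mean = pc.groupby(batch).transform("mean")
        ss_between = ((group_mean - pc.mean()) ** 2).sum()
        ss_total = ((pc - pc.mean()) ** 2).sum()
        scores.append(ss_between / ss_total)
    return np.array(scores)
```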
15. Computational approaches for metagenomic analysis of the microbiome
- Author
-
Briscoe, Leah Pritnapah
- Subjects
Bioinformatics, Genetics, Ecology, batch correction, evolution, gut microbiome, metagenomics, microbiome, source tracking - Abstract
The microbiome is a community of microorganisms living in our bodies and throughout the environment. The genomic data researchers can extract from microbiomes, known as metagenomic data, can be used to predict traits about a host or environment. By identifying microbiome biomarkers associated with disease or health, researchers can develop better therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Here, we evaluate methods to remove background noise due to such technical variables, thereby improving our ability to find accurate biomarkers of human disease. Also crucial in understanding host health is elucidating the sources of host microbiomes, as it allows researchers to understand the dynamics behind how microbial communities form and how they respond to changing environments. In this work, we introduce a method that uses metagenomic variants obtained from hundreds of species in microbiome data to perform source tracking, which is a method of estimating colonization sources for a sample of interest. These analyses shed light on phenomena like the colonization of the early infant gut microbiome, or spatial patterns in the ocean microbiomes around the world. Lastly, we analyze metagenomic data to understand how genetic diversity changes along the human gut on the species, strain and gene level. In sum, this work leverages the genomic information contained in our microbiomes to find universal patterns in microbiomes, allowing us to better understand the relationship between microbiome and phenotypes, the colonization sources of microbiomes, and also the colonization dynamics on the species and strain level.
- Published
- 2023
16. Influence of single-cell RNA sequencing data integration on the performance of differential gene expression analysis.
- Author
-
Kujawa, Tomasz, Marczyk, Michał, and Polanska, Joanna
- Subjects
DATA integration, GENE expression, BIOLOGICAL variation, STATISTICAL correlation, DATA distribution, RNA sequencing - Abstract
Large-scale comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or taking measurements at various times. This inevitably leads to batch effects: systematic variations in the data that might occur due to different technology platforms, reagent lots, or handling personnel. Such technical differences confound biological variations of interest and need to be corrected during the data integration process. Data integration is a challenging task due to the overlapping of biological and technical factors, which makes it difficult to distinguish their individual contributions to the overall observed effect. Moreover, the choice of integration method may impact the downstream analyses, including searching for differentially expressed genes. From the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets with the same biological study design that differ only in the way the measurement was done: one dataset manifests strong batch effects due to the measurements of each sample at a different time. Integrated data were visualized using the UMAP method. The evaluation was done both on the individual gene level, using parametric and non-parametric approaches for finding differentially expressed genes, and on the gene set level, using gene set enrichment analysis. As an evaluation metric, we used two correlation coefficients, Pearson and Spearman, of the obtained test statistics between reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects and preserved differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively impacts the use of parametric methods for the analysis. Two algorithms, MNN and Scanorama, gave very poor results in terms of differential analysis on the gene and gene set levels. Finally, we highlight ComBat-seq as it led to the highest correlation of test statistics between the reference and corrected dataset among others. Moreover, it does not distort the original distribution of gene expression data, so it can be used in all types of downstream analyses. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
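The evaluation metric used in this study reduces to correlating per-gene test statistics between a reference analysis and a batch-corrected one. A minimal illustration with SciPy, with toy vectors standing in for genome-wide statistics:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# t-statistics per gene from the reference study and from the
# batch-corrected study (toy values; in practice, thousands of genes)
t_ref = np.array([2.1, -0.5, 3.3, 0.2, -1.8])
t_cor = np.array([1.9, -0.4, 2.8, 0.1, -2.0])

print("Pearson:", pearsonr(t_ref, t_cor)[0])    # linear agreement
print("Spearman:", spearmanr(t_ref, t_cor)[0])  # rank agreement
```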
18. An R-based reproducible and user-friendly preprocessing pipeline for CyTOF data [version 2; peer review: 2 approved]
- Author
-
Helena L. Crowell, Stéphane Chevrier, Andrea Jacobs, Sujana Sivapatham, Bernd Bodenmiller, and Mark D. Robinson
- Subjects
Method Article, Articles, CyTOF, Preprocessing, Normalization, Debarcoding, Compensation, Gating, Batch correction, Reproducibility - Abstract
Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes it difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor's SingleCellExperiment class and fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance while also allowing for batch correction. This pipeline is based on open source R packages and can easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology.
- Published
- 2022
- Full Text
- View/download PDF
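The pipeline itself is R/CATALYST-based; as background, mass cytometry intensities are conventionally variance-stabilized with an arcsinh transform (cofactor 5 is the usual choice for CyTOF) before gating and downstream analysis. A Python illustration of just that transform:

```python
import numpy as np

def arcsinh_transform(counts, cofactor=5.0):
    """Variance-stabilizing transform conventionally used for mass
    cytometry (cofactor 5 for CyTOF; ~150 is common for flow cytometry)."""
    return np.arcsinh(np.asarray(counts, dtype=float) / cofactor)

raw = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
print(arcsinh_transform(raw))  # compresses the heavy right tail
```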
19. Analysis of single-cell RNA sequencing data based on autoencoders
- Author
-
Andrea Tangherloni, Federico Ricciuti, Daniela Besozzi, Pietro Liò, and Ana Cvejic
- Subjects
Autoencoders, scRNA-Seq, Dimensionality reduction, Clustering, Batch correction, Data integration, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5 - Abstract
Abstract Background Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. Results Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data by different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. Conclusions scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics.
- Published
- 2021
- Full Text
- View/download PDF
21. An evaluation of processing methods for HumanMethylation450 BeadChip data
- Author
-
Liu, Jie and Siegmund, Kimberly D
- Subjects
Biological Sciences, Genetics, Alzheimer Disease, Brain, DNA Methylation, Epigenesis, Genetic, Epigenomics, Female, High-Throughput Nucleotide Sequencing, Humans, Male, Oligonucleotide Array Sequence Analysis, ROC Curve, Reproducibility of Results, Batch correction, Concordance plot, HumanMethylation450 BeadChip, Normalization, Preprocessing, Information and Computing Sciences, Medical and Health Sciences, Bioinformatics, Biological sciences, Biomedical and clinical sciences - Abstract
Background: Illumina's HumanMethylation450 arrays provide the most cost-effective means of high-throughput DNA methylation analysis. As with other types of microarray platforms, technical artifacts are a concern, including background fluorescence, dye-bias from the use of two color channels, bias caused by type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluate the effect of combining separate approaches to improve signal processing. Results: In this study nine processing methods, including both within- and between-array methods, are applied and compared in four datasets. For technical replicates, we found both within- and between-array methods did a comparable job in reducing variance across replicates. For evaluating biological differences, within-array processing always improved differential DNA methylation signal detection over no processing, and always benefitted from performing background correction first. Combinations of within-array procedures were always among the best performing methods, with a slight advantage appearing for the between-array method Funnorm when batch effects explained more variation in the data than the methylation alterations between cases and controls. However, when this occurred, RUVm, a new batch correction method, noticeably improved reproducibility of differential methylation results over any of the signal-processing methods alone. Conclusions: The comparisons in our study provide valuable insights in preprocessing HumanMethylation450 BeadChip data. We found the within-array combination of Noob + BMIQ always improved signal sensitivity, and when combined with the RUVm batch-correction method, outperformed all other approaches in performing differential DNA methylation analysis. The effect of the data processing method, in any given data set, was a function of both the signal and noise.
- Published
- 2016
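A detail worth having at hand with 450K data of this kind: beta-values (methylation fractions) are often converted to M-values for linear modeling, since M-values are closer to homoscedastic. The conversion is a fixed formula, independent of the preprocessing methods compared above.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """M-value = log2(beta / (1 - beta)), clipped to avoid infinities at 0/1."""
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)
    return np.log2(beta / (1 - beta))

print(beta_to_m([0.1, 0.5, 0.9]))  # approx [-3.17, 0.0, 3.17]
```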
22. protGear: A protein microarray data pre-processing suite
- Author
-
Kennedy Mwai, Nelson Kibinge, James Tuju, Gathoni Kamuyu, Rinter Kimathi, James Mburu, Emily Chepsat, Lydia Nyamako, Timothy Chege, Irene Nkumama, Samson Kinyanjui, Eustasius Musenge, and Faith Osier
- Subjects
Protein microarray, Normalisation, Background correction, Batch correction, Reproducibility, Biotechnology, TP248.13-248.65 - Abstract
Protein microarrays are versatile tools for high throughput study of the human proteome, but systematic and non-systematic sources of bias constrain optimal interpretation and the ultimate utility of the data. Published guidelines to limit technical variability whilst maintaining important biological variation favour DNA-based microarrays that often differ fundamentally in their experimental design. Rigorous tools to guide background correction, the quantification of within-sample variation, normalisation, and batch correction specifically for protein microarrays are limited, require extensive investigation and are not centrally accessible. Here, we develop a generic one-stop-shop pre-processing suite for protein microarrays that is compatible with data from the major protein microarray scanners. Our graphical and tabular interfaces facilitate a detailed inspection of data and are coupled with supporting guidelines that enable users to select the most appropriate algorithms to systematically address bias arising in customized experiments. The localization and distribution of background signal intensities determine the optimal correction strategy. A novel function overcomes the limitations in the interpretation of the coefficient of variation when signal intensities are at the lower end of the detection threshold. We demonstrate essential considerations in the experimental design and their impact on a range of algorithms for normalization and minimization of batch effects. Our user-friendly interactive web-based platform eliminates the need for prowess in programming. The open-source R interface includes illustrative examples, generates an auditable record, enables reproducibility, and can incorporate additional custom scripts through its online repository. This versatility will enhance its broad uptake in the infectious disease and vaccine development community.
- Published
- 2021
- Full Text
- View/download PDF
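protGear is an R suite; the coefficient-of-variation caveat it addresses can still be illustrated in a few lines. Near the detection threshold the mean intensity in the CV denominator is mostly noise, so a CV is best reported together with a low-signal flag (the floor value below is a hypothetical placeholder, not a protGear default).

```python
import numpy as np

def replicate_cv(intensities, detection_floor=100.0):
    """Coefficient of variation across replicate spots, flagged when the
    mean signal sits near the detection threshold, where CV is unreliable
    because the denominator is dominated by noise."""
    x = np.asarray(intensities, dtype=float)
    cv = x.std(ddof=1) / x.mean()
    low_signal = x.mean() < detection_floor   # hypothetical floor
    return cv, low_signal

print(replicate_cv([110, 120, 105]))  # interpretable CV
print(replicate_cv([8, 20, 5]))       # flagged: near detection floor
```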
23. Complex hierarchical structures analysis in single-cell data with Poincaré deep manifold transformation.
- Author
-
Xu Y, Zang Z, Hu B, Yuan Y, Tan C, Xia J, and Li SZ
- Subjects
- Humans, Algorithms, Computational Biology methods, Gene Expression Profiling methods, RNA-Seq methods, Cell Differentiation, Single-Cell Analysis methods, Sequence Analysis, RNA methods
- Abstract
Single-cell RNA sequencing (scRNA-seq) offers remarkable insights into cellular development and differentiation by capturing the gene expression profiles of individual cells. The role of dimensionality reduction and visualization in the interpretation of scRNA-seq data has gained wide acceptance. However, current methods face several challenges, including incomplete structure-preserving strategies and high distortion in embeddings, which fail to effectively model complex cell trajectories with multiple branches. To address these issues, we propose the Poincaré deep manifold transformation (PoincaréDMT) method, which maps high-dimensional scRNA-seq data to a hyperbolic Poincaré disk. This approach preserves global structure from a graph Laplacian matrix while achieving local structure correction through a structure module combined with data augmentation. Additionally, PoincaréDMT alleviates batch effects by integrating a batch graph that accounts for batch labels into the low-dimensional embeddings during network training. Furthermore, PoincaréDMT introduces the Shapley additive explanations method based on the trained model to identify important marker genes in specific clusters and in the cell differentiation process. Therefore, PoincaréDMT provides a unified framework for multiple key tasks essential for scRNA-seq analysis, including trajectory inference, pseudotime inference, batch correction, and marker gene selection. We validate PoincaréDMT through extensive evaluations on both simulated and real scRNA-seq datasets, demonstrating its superior performance in preserving global and local data structures compared to existing methods.
- Published
- 2024
- Full Text
- View/download PDF
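The hyperbolic embedding at the heart of PoincaréDMT uses the standard geodesic distance of the Poincaré disk, d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))). A direct NumPy transcription of that formula:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare disk."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2)) + eps
    return np.arccosh(1.0 + num / den)

# Points near the boundary are far apart, matching depth in a hierarchy
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))
print(poincare_distance([0.9, 0.0], [0.0, 0.9]))
```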
24. GEDI: An R Package for Integration of Transcriptomic Data from Multiple Platforms for Bioinformatics Applications.
- Author
-
Stokholm MN, Rabaglino MB, and Kadarmideen HN
- Subjects
- Software, Animals, High-Throughput Nucleotide Sequencing, Cattle, Computational Biology methods, Transcriptome, Gene Expression Profiling methods
- Abstract
Transcriptomic data is often expensive and difficult to generate in large cohorts relative to genomic data; therefore, it is often important to integrate multiple transcriptomic datasets from both microarray- and next generation sequencing (NGS)-based transcriptomic data across similar experiments or clinical trials to improve analytical power and discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges including reannotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically reannotating the data and removing the batch effect. The removal of the batch effect is verified with principal component analysis and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. These transcriptomic datasets were from multiple high-throughput platforms, namely, array-based Affymetrix and Agilent platforms, and the NGS-based Illumina paired-end RNA-seq platform. Furthermore, we compared the GEDI package to existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline, including verification of both batch effect removal and data integration for downstream genomic and bioinformatics applications. Basic Protocol 1: ReadGE, a function to import gene expression datasets. Basic Protocol 2: GEDI, a function to reannotate and merge gene expression datasets. Basic Protocol 3: BatchCorrection, a function to remove batch effects from gene expression data. Basic Protocol 4: VerifyGEDI, a function to confirm successful integration of gene expression data.
- Published
- 2024
- Full Text
- View/download PDF
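GEDI itself is an R package; the idea behind its BatchCorrection and verification steps can be illustrated generically in Python: regress each gene on batch indicators, keep the residuals (plus the grand mean), then check that batches no longer separate along the leading principal components. This is a limma-style sketch of the concept, not GEDI's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def remove_batch_ols(X, batch):
    """Regress each gene (column of X) on one-hot batch indicators and keep
    residuals plus the grand mean, removing additive batch offsets."""
    B = pd.get_dummies(batch).to_numpy(dtype=float)  # n_samples x n_batches
    coef, *_ = np.linalg.lstsq(B, X, rcond=None)
    return X - B @ coef + X.mean(axis=0)

def leading_pcs(X, n=2):
    """PCA projection used to verify that batches no longer separate."""
    return PCA(n_components=n).fit_transform(X)
```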
25. Deep learning tackles single-cell analysis—a survey of deep learning for scRNA-seq analysis.
- Author
-
Flores, Mario, Liu, Zhentao, Zhang, Tinghe, Hasib, Md Musaddaqui, Chiu, Yu-Chiao, Ye, Zhenqing, Paniagua, Karla, Jo, Sumin, Zhang, Jianqiu, Gao, Shou-Jiang, Jin, Yu-Fang, Chen, Yidong, and Huang, Yufei
- Subjects
DEEP learning, GENERATIVE adversarial networks, MACHINE learning, ELECTRONIC data processing - Abstract
Since its selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in data collected from single-cell profilings, resulting in computational challenges to process these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative for single-cell analyses besides the traditional machine learning approaches. Here, we survey a total of 25 DL algorithms and their applicability for a specific step in the single cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
26. scAdapt: virtual adversarial domain adaptation network for single cell RNA-seq data classification across platforms and species.
- Author
-
Zhou, Xiang, Chai, Hua, Zeng, Yuansong, Zhao, Huiying, and Yang, Yuedong
- Subjects
RNA sequencing, BIOLOGICAL variation, CELL analysis, SPECIES, DATA distribution, DNA microarrays, MICROARRAY technology - Abstract
In single cell analyses, cell types are conventionally identified based on expressions of known marker genes, whose identifications are time-consuming and irreproducible. To solve this issue, many supervised approaches have been developed to identify cell types based on the rapid accumulation of public datasets. However, these approaches are sensitive to batch effects or biological variations since the data distributions differ in cross-platform or cross-species predictions. In this study, we developed scAdapt, a virtual adversarial domain adaptation network, to transfer cell labels between datasets with batch effects. scAdapt used both the labeled source and unlabeled target data to train an enhanced classifier and aligned the labeled source centroids and pseudo-labeled target centroids to generate a joint embedding. scAdapt was demonstrated to outperform existing methods for classification in simulated, cross-platform, cross-species, spatial transcriptomic and COVID-19 immune datasets. Further quantitative evaluations and visualizations for the aligned embeddings confirm the superiority in cell mixing and the ability to preserve discriminative cluster structure present in the original datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
27. Replication of single-cell proteomics data reveals important computational challenges.
- Author
-
Vanderaa, Christophe and Gatto, Laurent
- Abstract
Mass spectrometry-based proteomics is actively embracing quantitative, single-cell level analyses. Indeed, recent advances in sample preparation and mass spectrometry (MS) have enabled the emergence of quantitative MS-based single-cell proteomics (SCP). While exciting and promising, SCP still has many rough edges. The current analysis workflows are custom and built from scratch. The field is therefore craving standardized software that promotes principled and reproducible SCP data analyses. This special report is the first step toward the formalization and standardization of SCP data analysis. scp, the software that accompanies this work, successfully replicates one of the landmark SCP studies and is applicable to other experiments and designs. We created a repository containing the replicated workflow with comprehensive documentation in order to favor further dissemination and improvements of SCP data analyses. Replicating SCP data analyses uncovers important challenges in SCP data analysis. We describe two such challenges in detail: batch correction and data missingness. We provide the current state-of-the-art and illustrate the associated limitations. We also highlight the intimate dependence that exists between batch effects and data missingness and offer avenues for dealing with these exciting challenges. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
28. A benchmark of batch-effect correction methods for single-cell RNA sequencing data
- Author
-
Hoa Thi Nhu Tran, Kok Siong Ang, Marion Chevrier, Xiaomeng Zhang, Nicole Yee Shin Lee, Michelle Goh, and Jinmiao Chen
- Subjects
Single-cell RNA-seq, Batch correction, Batch effect, Integration, Differential gene expression, Biology (General), QH301-705.5, Genetics, QH426-470 - Abstract
Abstract Background Large-scale single-cell transcriptomic datasets generated using different technologies contain batch-specific systematic variations that present a challenge to batch-effect removal and data integration. With continued growth expected in scRNA-seq data, achieving effective batch integration with available computational resources is crucial. Here, we perform an in-depth benchmark study on available batch correction methods to determine the most suitable method for batch-effect removal. Results We compare 14 methods in terms of computational runtime, the ability to handle large datasets, and batch-effect correction efficacy while preserving cell type purity. Five scenarios are designed for the study: identical cell types with different technologies, non-identical cell types, multiple batches, big data, and simulated data. Performance is evaluated using four benchmarking metrics including kBET, LISI, ASW, and ARI. We also investigate the use of batch-corrected data to study differential gene expression. Conclusion Based on our results, Harmony, LIGER, and Seurat 3 are the recommended methods for batch integration. Due to its significantly shorter runtime, Harmony is recommended as the first method to try, with the other methods as viable alternatives.
- Published
- 2020
- Full Text
- View/download PDF
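Of the four benchmark metrics, ASW is the most direct to reproduce: a silhouette score computed on batch labels over the integrated embedding. After good correction, batch silhouette should approach zero while cell-type silhouette stays high; the embedding below is random toy data for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def batch_asw(embedding, batch_labels):
    """Average silhouette width with respect to batch: near 0 (or negative)
    indicates well-mixed batches after integration."""
    return silhouette_score(embedding, batch_labels)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 10))               # toy integrated embedding
batches = np.repeat(["batch1", "batch2"], 100)
print(batch_asw(emb, batches))                 # ~0 for well-mixed batches
```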
29. Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset
- Author
-
Judith Somekh, Shai S Shen-Orr, and Isaac S Kohane
- Subjects
Batch correction, Batch effect, Gene expression, ComBat, Principal component analysis, GTEx, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5 - Abstract
Abstract Background Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing if the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. Results We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over or under correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. Our framework includes three steps: (1) data adjustment with the desired methods, (2) calculating gene-gene co-expression measurements for adjusted datasets, and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project. Conclusions Our framework enables the evaluation of batch correction methods to better preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
- Published
- 2019
- Full Text
- View/download PDF
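Steps (2) and (3) of the framework amount to scoring whether a-priori associated gene pairs still out-correlate unrelated pairs after adjustment. A compact sketch of that scoring, where the pair lists are placeholders for the external gold standard:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def coexpression_auc(X_adj, known_pairs, null_pairs):
    """Score a batch-corrected matrix (samples x genes) by how strongly
    a-priori associated gene pairs out-correlate unrelated pairs."""
    C = np.corrcoef(X_adj.T)                  # gene-gene co-expression
    score = lambda pairs: [abs(C[i, j]) for i, j in pairs]
    y_true = [1] * len(known_pairs) + [0] * len(null_pairs)
    y_score = score(known_pairs) + score(null_pairs)
    return roc_auc_score(y_true, y_score)     # 1.0 = biology preserved
```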
31. CBA: Cluster-Guided Batch Alignment for Single Cell RNA-seq
- Author
-
Wenbo Yu, Ahmed Mahfouz, and Marcel J. T. Reinders
- Subjects
batch correction, auto-encoder, single-cell RNA sequencing, clustering, data integration, Genetics, QH426-470 - Abstract
The power of single-cell RNA sequencing (scRNA-seq) in detecting cell heterogeneity or developmental process is becoming more and more evident every day. The granularity of this knowledge is further propelled when combining two batches of scRNA-seq into a single large dataset. This strategy is however hampered by technical differences between these batches. Typically, these batch effects are resolved by matching similar cells across the different batches. Current approaches, however, do not take into account that we can constrain this matching further as cells can also be matched on their cell type identity. We use an auto-encoder to embed two batches in the same space such that cells are matched. To accomplish this, we use a loss function that preserves: (1) cell-cell distances within each of the two batches, as well as (2) cell-cell distances between two batches when the cells are of the same cell-type. The cell-type guidance is unsupervised, i.e., a cell-type is defined as a cluster in the original batch. We evaluated the performance of our cluster-guided batch alignment (CBA) using pancreas and mouse cell atlas datasets, against six state-of-the-art single cell alignment methods: Seurat v3, BBKNN, Scanorama, Harmony, LIGER, and BERMUDA. Compared to other approaches, CBA preserves the cluster separation in the original datasets while still being able to align the two datasets. We confirm that this separation is biologically meaningful by identifying relevant differential expression of genes for these preserved clusters.
- Published
- 2021
- Full Text
- View/download PDF
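The record above specifies the two ingredients of CBA's loss concretely: preserve within-batch cell–cell distances, and pull same-cluster cells from the two batches together. A simplified PyTorch sketch of those two terms follows; the tensor shapes, the MSE formulation of term 1, and the cluster-matching mask are assumptions made for illustration, not the authors' implementation.

```python
# Simplified sketch of CBA's two loss terms (not the authors' code).
import torch

def within_batch_loss(x, z):
    # Term 1: embedding distances should mirror input-space distances
    # for cells of the same batch.
    return torch.nn.functional.mse_loss(torch.cdist(z, z), torch.cdist(x, x))

def between_batch_loss(z1, z2, clusters1, clusters2):
    # Term 2: cells from different batches that fall in the same
    # (unsupervised) cluster are pulled together in the shared embedding.
    d = torch.cdist(z1, z2)
    same = (clusters1[:, None] == clusters2[None, :]).float()
    return (d * same).sum() / same.sum().clamp(min=1)

# Toy usage with random embeddings and cluster labels.
z1, z2 = torch.randn(30, 8), torch.randn(40, 8)
c1, c2 = torch.randint(0, 3, (30,)), torch.randint(0, 3, (40,))
loss = between_batch_loss(z1, z2, c1, c2)
```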
32. An R-based reproducible and user-friendly preprocessing pipeline for CyTOF data [version 1; peer review: 1 approved, 1 approved with reservations]
- Author
-
Helena L. Crowell, Stéphane Chevrier, Andrea Jacobs, Sujana Sivapatham, Bernd Bodenmiller, and Mark D. Robinson
- Subjects
Method Article, Articles, CyTOF, Preprocessing, Normalization, Debarcoding, Compensation, Gating, Batch correction, Reproducibility - Abstract
Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes it difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor’s SingleCellExperiment class and fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance while also allowing for batch correction. This pipeline is based on open source R packages and can easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology.
- Published
- 2020
- Full Text
- View/download PDF
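CATALYST itself is an R/Bioconductor package, so the following is only a language-neutral illustration (in Python, for consistency with the other sketches here) of two steps such a pipeline automates: the arcsinh transformation, for which a cofactor of 5 is conventional in CyTOF, and a toy bead-style correction for sensitivity drift. The rolling-median window and simulated signals are placeholders, not CATALYST's algorithms.

```python
# Illustration of two preprocessing steps (not CATALYST's R implementation):
# arcsinh transformation and a toy bead-style drift correction.
import numpy as np
import pandas as pd

def arcsinh_transform(counts, cofactor=5.0):
    # Cofactor 5 is the conventional choice for CyTOF data.
    return np.arcsinh(counts / cofactor)

def bead_drift_correct(signal, bead_signal, window=501):
    # Estimate slow sensitivity drift from the bead channel with a
    # rolling median, then divide the drift out of the marker signal.
    drift = pd.Series(bead_signal).rolling(window, center=True,
                                           min_periods=1).median()
    gain = (drift / drift.mean()).to_numpy()
    return signal / gain

rng = np.random.default_rng(0)
events = rng.gamma(2.0, 50.0, size=10_000)         # placeholder marker counts
beads = 100 + np.linspace(0, -20, 10_000) + rng.normal(0, 5, 10_000)
corrected = arcsinh_transform(bead_drift_correct(events, beads))
```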
33. CBA: Cluster-Guided Batch Alignment for Single Cell RNA-seq.
- Author
-
Yu, Wenbo, Mahfouz, Ahmed, and Reinders, Marcel J. T.
- Subjects
RNA sequencing, IDENTITY (Psychology), GENE clusters, DATA integration, PANCREAS - Abstract
The power of single-cell RNA sequencing (scRNA-seq) in detecting cell heterogeneity or developmental process is becoming more and more evident every day. The granularity of this knowledge is further propelled when combining two batches of scRNA-seq into a single large dataset. This strategy is however hampered by technical differences between these batches. Typically, these batch effects are resolved by matching similar cells across the different batches. Current approaches, however, do not take into account that we can constrain this matching further as cells can also be matched on their cell type identity. We use an auto-encoder to embed two batches in the same space such that cells are matched. To accomplish this, we use a loss function that preserves: (1) cell-cell distances within each of the two batches, as well as (2) cell-cell distances between two batches when the cells are of the same cell-type. The cell-type guidance is unsupervised, i.e., a cell-type is defined as a cluster in the original batch. We evaluated the performance of our cluster-guided batch alignment (CBA) using pancreas and mouse cell atlas datasets, against six state-of-the-art single cell alignment methods: Seurat v3, BBKNN, Scanorama, Harmony, LIGER, and BERMUDA. Compared to other approaches, CBA preserves the cluster separation in the original datasets while still being able to align the two datasets. We confirm that this separation is biologically meaningful by identifying relevant differential expression of genes for these preserved clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
34. Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis.
- Author
-
Wang B and Luan Y
- Abstract
Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, the presence of heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods aimed at removing heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrate significant superiority in predicting quantitative phenotypes or attain a noteworthy reduction in Root Mean Squared Error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the performance of normalization methods in predicting metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes. Competing Interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. (Copyright © 2024 Wang and Luan.)
- Published
- 2024
- Full Text
- View/download PDF
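The evaluation design described above reduces, per normalization method, to: normalize, train a regressor on one dataset, predict a held-out dataset, and compare RMSEs. A minimal sketch of that loop for one normalization (total-sum scaling plus log), on simulated data only:

```python
# Sketch of the cross-dataset evaluation loop on simulated data: normalize,
# fit on one dataset, predict another, report RMSE of the phenotype.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def tss_log(counts, pseudo=1e-6):
    # Total-sum scaling to relative abundances, then log transform.
    rel = counts / counts.sum(axis=1, keepdims=True)
    return np.log(rel + pseudo)

train_counts = rng.poisson(10, size=(100, 200))    # taxa counts, dataset A
test_counts = rng.poisson(12, size=(50, 200))      # shifted dataset B
y_train, y_test = rng.normal(size=100), rng.normal(size=50)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(tss_log(train_counts), y_train)
rmse = mean_squared_error(y_test, model.predict(tss_log(test_counts))) ** 0.5
print(f"cross-dataset RMSE: {rmse:.3f}")
```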
35. PhosR enables processing and functional analysis of phosphoproteomic data
- Author
-
Hani Jieun Kim, Taiyun Kim, Nolan J. Hoffman, Di Xiao, David E. James, Sean J. Humphrey, and Pengyi Yang
- Subjects
phosphoproteomics, signalling networks, imputation, normalisation, batch correction, stably phosphorylated sites, Biology (General), QH301-705.5 - Abstract
Summary: Mass spectrometry (MS)-based phosphoproteomics has revolutionized our ability to profile phosphorylation-based signaling in cells and tissues on a global scale. To infer the action of kinases and signaling pathways in phosphoproteomic experiments, we present PhosR, a set of tools and methodologies implemented in a suite of R packages facilitating comprehensive analysis of phosphoproteomic data. By applying PhosR to both published and new phosphoproteomic datasets, we demonstrate capabilities in data imputation and normalization by using a set of “stably phosphorylated sites” and in functional analysis for inferring active kinases and signaling pathways. In particular, we introduce a “signalome” construction method for identifying a collection of signaling modules to summarize and visualize the interaction of kinases and their collective actions on signal transduction. Together, our data and findings demonstrate the utility of PhosR in processing and generating biological knowledge from MS-based phosphoproteomic data.
- Published
- 2021
- Full Text
- View/download PDF
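One of PhosR's ideas is easy to sketch: if a set of phosphosites is assumed stable across conditions, per-sample shifts estimated from those sites can be removed. The sketch below is a deliberately naive median-offset version on simulated data; PhosR's actual normalization and imputation are more sophisticated, and the stable-site indices here are hypothetical.

```python
# Naive stable-site normalization on simulated log intensities
# (PhosR's real procedure is more sophisticated).
import numpy as np

def normalize_by_stable_sites(log_intensity, stable_idx):
    # log_intensity: phosphosites x samples. Remove per-sample shifts
    # estimated from sites assumed to be stably phosphorylated.
    offsets = np.median(log_intensity[stable_idx, :], axis=0)
    return log_intensity - (offsets - offsets.mean())

rng = np.random.default_rng(1)
shifts = rng.normal(0, 0.5, size=12)            # per-sample bias
mat = rng.normal(size=(5000, 12)) + shifts
stable_idx = np.arange(100)                     # hypothetical stable sites
normalized = normalize_by_stable_sites(mat, stable_idx)
```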
36. Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset.
- Author
-
Somekh, Judith, Shen-Orr, Shai S, and Kohane, Isaac S
- Subjects
BIOINFORMATICS, RNA sequencing, GENE expression, BREAST cancer patients, GENE ontology - Abstract
Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing if the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. Our framework includes three steps: (1) data adjustment with the desired methods, (2) calculating gene-gene co-expression measurements for adjusted datasets, and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project. Conclusions: Our framework enables the evaluation of batch correction methods to better preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
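B-CeF's three steps map directly onto a few lines of analysis code. The toy sketch below assumes step 1 (adjustment) has already produced `expr`, computes gene-gene correlations for step 2, and scores them against a placeholder gold-standard pair set with AUROC for step 3; the real framework uses curated gene-gene associations from external references.

```python
# Toy version of the three B-CeF steps; all inputs are simulated.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
expr = rng.normal(size=(200, 50))            # step 1 assumed done: samples x genes
corr = np.corrcoef(expr, rowvar=False)       # step 2: gene-gene co-expression

iu = np.triu_indices_from(corr, k=1)         # unique gene pairs
scores = np.abs(corr[iu])
gold = rng.integers(0, 2, size=scores.size)  # placeholder known-association labels
print("AUROC vs. gold standard:", roc_auc_score(gold, scores))   # step 3
```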
37. Adjusting for Batch Effects in DNA Methylation Microarray Data, a Lesson Learned
- Author
-
E. M. Price and Wendy P. Robinson
- Subjects
DNA methylation, 450k array, Illumina, batch correction, batch effects, ComBat, Genetics, QH426-470 - Abstract
It is well-known, but frequently overlooked, that low- and high-throughput molecular data may contain batch effects, i.e., systematic technical variation. Confounding of experimental batches with the variable(s) of interest is especially concerning, as a batch effect may then be interpreted as a biologically significant finding. An integral step toward reducing false discovery in molecular data analysis includes inspection for batch effects and accounting for this signal if present. In a 30-sample pilot Illumina Infinium HumanMethylation450 (450k array) experiment, we identified two sources of batch effects: row and chip. Here, we demonstrate two approaches taken to process the 450k data in which an R function, ComBat, was applied to adjust for the non-biological signal. In the “initial analysis,” the application of ComBat to an unbalanced study design resulted in 9,612 and 19,214 significant (FDR < 0.05) DNA methylation differences, despite none being present prior to correction. Suspicious of this dramatic change, a “revised processing” included changes to our analysis as well as a greater number of samples, and successfully reduced batch effects without introducing false signal. Our work supports conclusions made by an article previously published in this journal: though the ultimate antidote to batch effects is thoughtful study design, every DNA methylation microarray analysis should inspect, assess and, if necessary, account for batch effects. The analysis experience presented here can serve as a reminder to the broader community to establish research questions a priori, ensure that they match with study design and encourage communication between technicians and analysts.
- Published
- 2018
- Full Text
- View/download PDF
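ComBat proper fits an empirical Bayes model with shrinkage and can protect biological covariates; the record above is a reminder that applying it to an unbalanced design without care can manufacture signal. As a point of reference only, a stripped-down location/scale batch standardization (no shrinkage, no covariates, and therefore not ComBat itself) looks like this:

```python
# Stripped-down location/scale batch standardization (no empirical Bayes
# shrinkage, no protected covariates; not ComBat itself).
import numpy as np

def naive_batch_standardize(x, batches):
    # x: features x samples; batches: per-sample batch labels.
    out = x.astype(float).copy()
    grand_mean = x.mean(axis=1, keepdims=True)
    grand_sd = x.std(axis=1, keepdims=True) + 1e-8
    for b in np.unique(batches):
        cols = batches == b
        mu = x[:, cols].mean(axis=1, keepdims=True)
        sd = x[:, cols].std(axis=1, keepdims=True) + 1e-8
        out[:, cols] = (x[:, cols] - mu) / sd * grand_sd + grand_mean
    return out

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 30))                    # probes x samples
adjusted = naive_batch_standardize(data, np.repeat([0, 1, 2], 10))
```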
38. Evaluating batch correction methods for image-based cell profiling.
- Author
-
Arevalo J, Su E, van Dijk R, Carpenter AE, and Singh S
- Abstract
High-throughput image-based profiling platforms are powerful technologies capable of collecting data from billions of cells exposed to thousands of perturbations in a time- and cost-effective manner. Therefore, image-based profiling data has been increasingly used for diverse biological applications, such as predicting drug mechanism of action or gene function. However, batch effects pose severe limitations to community-wide efforts to integrate and interpret image-based profiling data collected across different laboratories and equipment. To address this problem, we benchmarked seven high-performing scRNA-seq batch correction techniques, representing diverse approaches, using a newly released Cell Painting dataset, the largest publicly accessible image-based dataset. We focused on five different scenarios with varying complexity, and we found that Harmony, a mixture-model based method, consistently outperformed the other tested methods. Our proposed framework, benchmark, and metrics can additionally be used to assess new batch correction methods in the future. Overall, this work paves the way for improvements that allow the community to make best use of public Cell Painting data for scientific discovery. Competing Interests: The Authors declare the following competing interests: S.S. and A.E.C. serve as scientific advisors for companies that use image-based profiling and Cell Painting (A.E.C: Recursion, SyzOnc; S.S.: Waypoint Bio, Dewpoint Therapeutics, Deepcell) and receive research funding and occasional talk honoraria from various pharmaceutical and biotechnology companies. All other authors declare no competing interests.
- Published
- 2024
- Full Text
- View/download PDF
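Harmony, the best performer in this benchmark, operates on a PCA embedding. The hedged sketch below runs it through scanpy's wrapper (which requires the harmonypy package); the profile matrix and "batch" labels are simulated placeholders, and this is not the paper's benchmarking pipeline.

```python
# Hedged sketch: Harmony on a placeholder profile matrix via scanpy's
# wrapper (requires scanpy, anndata, and harmonypy to be installed).
import numpy as np
import anndata as ad
import scanpy as sc

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 300))                  # profiles x features
adata = ad.AnnData(X)
adata.obs["batch"] = rng.choice(["plate1", "plate2"], size=1000)

sc.pp.pca(adata, n_comps=30)                      # Harmony runs on PCs
sc.external.pp.harmony_integrate(adata, "batch")  # writes X_pca_harmony
corrected = adata.obsm["X_pca_harmony"]           # batch-corrected embedding
```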
39. BUTTERFLY: addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq
- Author
-
Gustafsson, Johan, Robinson, Jonathan, Nielsen, Jens, and Pachter, Lior
- Published
- 2021
- Full Text
- View/download PDF
40. Inferring single-cell transcriptomic dynamics with structured latent gene expression dynamics.
- Author
-
Farrell S, Mani M, and Goyal S
- Subjects
- Cell Differentiation genetics, RNA, RNA Splicing genetics, Transcriptome genetics, Gene Expression Profiling
- Abstract
Gene expression dynamics provide directional information for trajectory inference from single-cell RNA sequencing data. Traditional approaches compute RNA velocity using strict modeling assumptions about transcription and splicing of RNA. This can fail in scenarios where multiple lineages have distinct gene dynamics or where rates of transcription and splicing are time dependent. We present "LatentVelo," an approach to compute a low-dimensional representation of gene dynamics with deep learning. LatentVelo embeds cells into a latent space with a variational autoencoder and models differentiation dynamics on this "dynamics-based" latent space with neural ordinary differential equations. LatentVelo infers a latent regulatory state that controls the dynamics of an individual cell to model multiple lineages. LatentVelo can predict latent trajectories, describing the inferred developmental path for individual cells rather than just local RNA velocity vectors. The dynamics-based embedding batch corrects cell states and velocities, outperforming comparable autoencoder batch correction methods that do not consider gene expression dynamics. Competing Interests: The authors declare no competing interests. (Copyright © 2023 The Authors. Published by Elsevier Inc. All rights reserved.)
- Published
- 2023
- Full Text
- View/download PDF
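LatentVelo couples a variational autoencoder with neural-ODE dynamics in latent space. The schematic below conveys only the structure of that idea, using a plain linear encoder and explicit Euler integration instead of the authors' variational model and ODE solver; every size and step count is an illustrative assumption.

```python
# Structural schematic only: encode cells to a latent space and integrate
# learned dynamics dz/dt = f(z) with plain Euler steps (LatentVelo itself
# uses a variational autoencoder and a neural-ODE solver).
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, n_genes, latent_dim=10):
        super().__init__()
        self.encoder = nn.Linear(n_genes, latent_dim)
        self.vector_field = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))

    def trajectory(self, x, steps=20, dt=0.05):
        z = self.encoder(x)
        path = [z]
        for _ in range(steps):              # Euler integration of dz/dt
            z = z + dt * self.vector_field(z)
            path.append(z)
        return torch.stack(path)            # (steps+1) x cells x latent_dim

x = torch.randn(100, 500)                   # placeholder expression matrix
path = LatentDynamics(n_genes=500).trajectory(x)
```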
41. protGear: A protein microarray data pre-processing suite
- Author
-
Nelson Kibinge, Eustasius Musenge, Gathoni Kamuyu, Samson M. Kinyanjui, Timothy Chege, James Tuju, Kennedy Mwai, Emily Chepsat, Lydia Nyamako, James Mburu, Faith H. A. Osier, Rinter Kimathi, and Irene Nkumama
- Subjects
Protein microarray, Normalisation, Background correction, Batch correction, Reproducibility, Data mining, DNA microarray, Biophysics, Biochemistry, Structural Biology, Genetics, Computer Science Applications, Biotechnology, Research Article, TP248.13-248.65 - Abstract
Protein microarrays are versatile tools for high throughput study of the human proteome, but systematic and non-systematic sources of bias constrain optimal interpretation and the ultimate utility of the data. Published guidelines to limit technical variability whilst maintaining important biological variation favour DNA-based microarrays that often differ fundamentally in their experimental design. Rigorous tools to guide background correction, the quantification of within-sample variation, normalisation, and batch correction specifically for protein microarrays are limited, require extensive investigation and are not centrally accessible. Here, we develop a generic one-stop-shop pre-processing suite for protein microarrays that is compatible with data from the major protein microarray scanners. Our graphical and tabular interfaces facilitate a detailed inspection of data and are coupled with supporting guidelines that enable users to select the most appropriate algorithms to systematically address bias arising in customized experiments. The localization and distribution of background signal intensities determine the optimal correction strategy. A novel function overcomes the limitations in the interpretation of the coefficient of variation when signal intensities are at the lower end of the detection threshold. We demonstrate essential considerations in the experimental design and their impact on a range of algorithms for normalization and minimization of batch effects. Our user-friendly interactive web-based platform eliminates the need for prowess in programming. The open-source R interface includes illustrative examples, generates an auditable record, enables reproducibility, and can incorporate additional custom scripts through its online repository. This versatility will enhance its broad uptake in the infectious disease and vaccine development community.
- Published
- 2021
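One protGear detail called out above, the unreliability of the coefficient of variation near the detection threshold, can be illustrated with a small helper that flags low-signal spots instead of trusting their CV. The detection floor and simulated replicates below are arbitrary placeholders, not protGear's actual function.

```python
# CV of replicate spots with a guard for signals near the detection floor,
# where raw CV reflects background noise rather than assay variability.
# The threshold is an arbitrary placeholder, not protGear's.
import numpy as np

def replicate_cv(replicates, detection_floor=100.0):
    # replicates: spots x replicate measurements of signal intensity.
    mean = replicates.mean(axis=1)
    cv = replicates.std(axis=1) / np.maximum(mean, 1e-8)
    return cv, mean < detection_floor         # CV plus a low-signal flag

rng = np.random.default_rng(5)
reps = rng.gamma(2.0, 200.0, size=(1000, 3))  # simulated triplicate spots
cv, low_signal = replicate_cv(reps)
print(f"median CV: {np.median(cv):.2f}; low-signal spots: {low_signal.sum()}")
```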
42. An evaluation of processing methods for HumanMethylation450 BeadChip data.
- Author
-
Liu, Jie and Siegmund, Kimberly D.
- Subjects
- *DNA methylation, *SIGNAL processing, *NUCLEOTIDE sequencing, *TETRACYCLINE, *LOCUS (Genetics) - Abstract
Background: Illumina's HumanMethylation450 arrays provide the most cost-effective means of high-throughput DNA methylation analysis. As with other types of microarray platforms, technical artifacts are a concern, including background fluorescence, dye-bias from the use of two color channels, bias caused by type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluate the effect of combining separate approaches to improve signal processing. Results: In this study, nine processing methods, including both within- and between-array methods, are applied and compared in four datasets. For technical replicates, we found both within- and between-array methods did a comparable job in reducing variance across replicates. For evaluating biological differences, within-array processing always improved differential DNA methylation signal detection over no processing, and always benefitted from performing background correction first. Combinations of within-array procedures were always among the best performing methods, with a slight advantage appearing for the between-array method Funnorm when batch effects explained more variation in the data than the methylation alterations between cases and controls. However, when this occurred, RUVm, a new batch correction method, noticeably improved reproducibility of differential methylation results over any of the signal-processing methods alone. Conclusions: The comparisons in our study provide valuable insights into preprocessing HumanMethylation450 BeadChip data. We found the within-array combination of Noob + BMIQ always improved signal sensitivity, and when combined with the RUVm batch-correction method, outperformed all other approaches in performing differential DNA methylation analysis. The effect of the data processing method, in any given data set, was a function of both the signal and noise. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
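As background for the between-array methods compared above: the generic between-array operation is to force all arrays onto a common intensity distribution, as quantile normalization does. Funnorm and RUVm add considerably more structure (control probes, unwanted-variation factors) in their R implementations; the numpy sketch below shows only the generic idea on a toy beta-value matrix.

```python
# Generic quantile normalization (the prototypical between-array step);
# Funnorm and RUVm add much more structure in their R implementations.
import numpy as np

def quantile_normalize(x):
    # x: probes x arrays. Give every array the same (mean) distribution.
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

rng = np.random.default_rng(6)
betas = rng.beta(2, 5, size=(10_000, 8)) + rng.normal(0, 0.02, size=8)
normalized = quantile_normalize(np.clip(betas, 0, 1))  # toy beta-value matrix
```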
43. CBA: Cluster-Guided Batch Alignment for Single Cell RNA-seq
- Author
-
Yu, W. (author), Mahfouz, A.M.E.T.A. (author), and Reinders, M.J.T. (author)
- Abstract
The power of single-cell RNA sequencing (scRNA-seq) in detecting cell heterogeneity or developmental process is becoming more and more evident every day. The granularity of this knowledge is further propelled when combining two batches of scRNA-seq into a single large dataset. This strategy is however hampered by technical differences between these batches. Typically, these batch effects are resolved by matching similar cells across the different batches. Current approaches, however, do not take into account that we can constrain this matching further as cells can also be matched on their cell type identity. We use an auto-encoder to embed two batches in the same space such that cells are matched. To accomplish this, we use a loss function that preserves: (1) cell-cell distances within each of the two batches, as well as (2) cell-cell distances between two batches when the cells are of the same cell-type. The cell-type guidance is unsupervised, i.e., a cell-type is defined as a cluster in the original batch. We evaluated the performance of our cluster-guided batch alignment (CBA) using pancreas and mouse cell atlas datasets, against six state-of-the-art single cell alignment methods: Seurat v3, BBKNN, Scanorama, Harmony, LIGER, and BERMUDA. Compared to other approaches, CBA preserves the cluster separation in the original datasets while still being able to align the two datasets. We confirm that this separation is biologically meaningful by identifying relevant differential expression of genes for these preserved clusters., Pattern Recognition and Bioinformatics
- Published
- 2021
- Full Text
- View/download PDF
44. Analysis of single-cell RNA sequencing data based on autoencoders
- Author
-
Tangherloni, A., Ricciuti, F., Besozzi, D., Lio, P., and Cvejic, A.
- Abstract
Background: Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. Results: Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data by different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. Conclusions: scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics.
- Published
- 2021
45. Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis.
- Author
-
Jaffe, Andrew E., Hyde, Thomas, Kleinman, Joel, Weinberger, Daniel R., Chenoweth, Joshua G., McKay, Ronald D., Leek, Jeffrey T., and Colantuoni, Carlo
- Subjects
- *DATA scrubbing, *DATA analysis, *GENETIC databases, *GENOMICS, *HUMAN stem cells, *PLURIPOTENT stem cells - Abstract
Background: Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for widespread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature. Methods: We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272). Results: Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the "cleaned" data, including sex, common copy number effects and sample or cell line-specific molecular behavior. Conclusions: Our analyses indicate that data "cleaning" can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. It is also important to note that open data exploration is not possible after such supervised "cleaning", because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the "cleaning" process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
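The skeleton of the surrogate variable idea is: protect the specified biological effects by regressing them out, then look for remaining structure in the residuals. The sketch below does exactly that with ordinary least squares and an SVD on simulated data; real SVA adds permutation-based significance testing and iterative reweighting, and taking the top two components here is an arbitrary choice.

```python
# Skeleton of the surrogate variable idea: residualize on the protected
# design, then take principal directions of the residuals as candidate
# surrogate variables. Real SVA adds testing and iterative reweighting.
import numpy as np

rng = np.random.default_rng(7)
n_samples, n_genes = 60, 2000
group = rng.integers(0, 2, n_samples)            # protected effect of interest
design = np.column_stack([np.ones(n_samples), group])
expr = rng.normal(size=(n_samples, n_genes))     # placeholder expression

beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
residuals = expr - design @ beta                 # biology of interest removed
u, s, vt = np.linalg.svd(residuals, full_matrices=False)
surrogate_vars = u[:, :2]   # candidate SVs to adjust for in downstream models
```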
46. High-dimensional immunotyping of tumors grown in obese and non-obese mice
- Author
-
Sturla Magnus Grøndal, Nils Halberg, James B. Lorens, Hilde E. Lien, Line Pedersen, Pahul Hanjra, Cara E. Wogsland, and Rolf A. Brekken
- Subjects
Cell type, Cancer Model, Mice, Obese, CD8-Positive T-Lymphocytes, Diet, High-Fat, Systemic inflammation, Suspension mass cytometry, Batch correction, Immunophenotyping, Lymphocytes, Tumor-Infiltrating, Breast cancer, Immune system, Neoplasms, Pancreatic cancer, Tumor Microenvironment, Animals, Myeloid Cells, Mass cytometry, Resource Article, Obesity, Cancer, Mice, Inbred C57BL, Disease Models, Animal, Tumor immunology, Cancer research, Female, Algorithms - Abstract
Obesity is a disease characterized by chronic low-grade systemic inflammation and has been causally linked to the development of 13 cancer types. Several studies have been undertaken to determine whether tumors evolving in obese environments adapt differential interactions with immune cells and whether this can be connected to disease outcome. Most of these studies have been limited to single cell lines and tumor models and analysis of limited immune cell populations. Given the multicellular complexity of the immune system and its dysregulation in obesity, we applied high-dimensional suspension mass cytometry to investigate how obesity affects tumor immunity. We used a 36-marker immune-focused mass cytometry panel to interrogate the immune landscape of orthotopic syngeneic mouse models of pancreatic and breast cancer. Unanchored batch correction was implemented to enable simultaneous analysis of tumor cohorts to uncover the immunotypes of each cancer model and reveal remarkably model-specific immune regulation. In the E0771 breast cancer model, we demonstrate an important link to obesity with an increase in two T-cell-suppressive cell types and a decrease in CD8 T cells. Summary: We used a mass cytometry analysis pipeline to immunophenotype tumor immune infiltrate from two pancreatic and three breast cancer murine models, identifying a link to obesity in the E0771 breast cancer model.
- Published
- 2021
- Full Text
- View/download PDF
47. PhosR enables processing and functional analysis of phosphoproteomic data
- Author
-
David E. James, Di Xiao, Nolan J. Hoffman, Sean J. Humphrey, Hani Jieun Kim, Taiyun Kim, and Pengyi Yang
- Subjects
Proteomics, Proteome, Computational biology, Muscle Fibers, Skeletal, imputation, AMP-Activated Protein Kinases, Mass Spectrometry, normalisation, Mice, batch correction, Software Design, Cell Line, Tumor, Animals, Insulin, signalling networks, Phosphorylation, Kinase, phosphoproteomics, Ribonucleotides, Aminoimidazole Carboxamide, Rats, Enzyme Activation, Signalling, Liver, stably phosphorylated sites, Signalling pathways, Kinase substrate, Signal Transduction - Abstract
Summary: Mass spectrometry (MS)-based phosphoproteomics has revolutionised our ability to profile phosphorylation-based signalling in cells and tissues on a global scale. To infer the action of kinases and signalling pathways in phosphoproteomic experiments, we present PhosR, a set of tools and methodologies implemented in a suite of R packages facilitating comprehensive analysis of phosphoproteomic data. By applying PhosR to both published and new phosphoproteomic datasets, we demonstrate capabilities in data imputation and normalisation using a novel set of ‘stably phosphorylated sites’, and in functional analysis for inferring active kinases and signalling pathways. In particular, we introduce a ‘signalome’ construction method for identifying a collection of signalling modules to summarise and visualise the interaction of kinases and their collective actions on signal transduction. Together, our data and findings demonstrate the utility of PhosR in processing and generating novel biological knowledge from MS-based phosphoproteomic data.
- Published
- 2021
48. reComBat: Batch effect removal in large-scale, multi-source omics data integration
- Author
-
Adamer, Michael F., Brüningk, Sarah, Tejada-Arranz, Alejandro, Estermann, Fabienne, Basler, Marek, and Borgwardt, Karsten M.
- Subjects
Batch effects, ComBat, Pseudomonas aeruginosa, Batch correction, Empirical Bayes, Microarray - Abstract
With the steadily increasing abundance of omics data produced all over the world, sometimes decades apart and under vastly different experimental conditions, residing in public databases, a crucial step in many data-driven bioinformatics applications is that of data integration. The challenge of batch effect removal for entire databases lies in the large number and coincidence of both batches and desired biological variation, resulting in design matrix singularity. This problem currently cannot be solved by any common batch correction algorithm. In this study, we present reComBat, a regularised version of the empirical Bayes method to overcome this limitation. We demonstrate our approach for the harmonisation of public gene expression data of the human opportunistic pathogen Pseudomonas aeruginosa and study several metrics to empirically demonstrate that batch effects are successfully mitigated while biologically meaningful gene expression variation is retained. reComBat fills the gap in batch correction approaches applicable to large-scale, public omics databases and opens up new avenues for data-driven analysis of complex biological processes beyond the scope of a single study. (bioRxiv)
- Published
- 2021
- Full Text
- View/download PDF
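The singularity problem described above is the classic motivation for regularization: with many batches and confounded biology, an unpenalized design matrix cannot be fit stably. A simplified sketch of that idea, fitting batch and biology jointly with a ridge penalty and subtracting only the fitted batch term, follows; reComBat itself regularizes the empirical Bayes ComBat model rather than a plain ridge regression.

```python
# Ridge-regularized joint fit of batch and biology, subtracting only the
# batch term (simplified; reComBat regularizes the empirical Bayes model).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(8)
n, p = 80, 500
batch = rng.integers(0, 10, n)                      # many small batches
bio = rng.normal(size=(n, 1))                       # desired variation
B = OneHotEncoder().fit_transform(batch.reshape(-1, 1)).toarray()
X = np.hstack([B, bio])                             # near-singular design
expr = rng.normal(size=(n, p)) + B @ rng.normal(size=(B.shape[1], p))

fit = Ridge(alpha=1.0).fit(X, expr)                 # penalty stabilizes the fit
batch_effect = B @ fit.coef_[:, :B.shape[1]].T      # fitted batch component
corrected = expr - batch_effect                     # biology term retained
```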
49. Integration of Single-Cell RNA-Seq Datasets: A Review of Computational Methods.
- Author
-
Ryu Y, Han GH, Jung E, and Hwang D
- Subjects
- Gene Library, Single-Cell Gene Expression Analysis
- Abstract
With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technology, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features, such as cell subpopulations and marker genes, identified from a certain dataset, are consistently present, or whether their condition-dependent variations, such as increases in cell subpopulations in particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as has been reported in previous literature.
- Published
- 2023
- Full Text
- View/download PDF
50. BIRCH: An Automated Workflow for Evaluation, Correction, and Visualization of Batch Effect in Bottom-Up Mass Spectrometry-Based Proteomics Data.
- Author
-
Sundararaman N, Bhat A, Venkatraman V, Binek A, Dwight Z, Ariyasinghe NR, Escopete S, Joung SY, Cheng S, Parker SJ, Fert-Bober J, and Van Eyk JE
- Subjects
- Humans, Betula, Workflow, COVID-19 Vaccines, Mass Spectrometry methods, Proteomics methods, COVID-19
- Abstract
Recent surges in large-scale mass spectrometry (MS)-based proteomics studies demand a concurrent rise in methods to facilitate reliable and reproducible data analysis. Quantification of proteins in MS analysis can be affected by variations in technical factors such as sample preparation and data acquisition conditions, leading to batch effects, which add noise to the data set. This may in turn affect the effectiveness of any biological conclusions derived from the data. Here we present Batch-effect Identification, Representation, and Correction of Heterogeneous data (BIRCH), a workflow for analysis and correction of batch effects through an automated, versatile, and easy-to-use web-based tool with the goal of eliminating technical variation. BIRCH also supports diagnosis of the data to check for the presence of batch effects, the feasibility of batch correction, and imputation to deal with missing values in the data set. To illustrate the relevance of the tool, we explore two case studies, including an iPSC-derived cell study and a Covid vaccine study, to show different context-specific use cases. Ultimately, this tool can be used as an extremely powerful approach for eliminating technical bias while retaining biological bias, toward understanding disease mechanisms and potential therapeutics.
- Published
- 2023
- Full Text
- View/download PDF
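A diagnosis step of the kind BIRCH automates can be approximated by asking how well samples separate by batch in a low-dimensional projection. The sketch below uses PCA plus a silhouette score on batch labels; the 0.1 decision threshold and the simulated intensity matrix are illustrative assumptions, not BIRCH's internal criteria.

```python
# Batch-effect diagnosis sketch: do samples cluster by batch in PCA space?
# Silhouette near 1 = strong batch structure; near 0 = little. The 0.1
# threshold and data are illustrative, not BIRCH's internal criteria.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(9)
intensities = rng.normal(size=(48, 1000))           # samples x proteins
batch = np.repeat(["run1", "run2", "run3"], 16)
intensities[16:32] += 0.8                           # inject a shift in run2

pcs = PCA(n_components=5).fit_transform(intensities)
score = silhouette_score(pcs, batch)
advice = "correction advisable" if score > 0.1 else "little batch structure"
print(f"batch silhouette: {score:.2f} -> {advice}")
```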