31 results on '"Segun Jung"'
Search Results
2. Genetic deletion of Sphk2 confers protection against Pseudomonas aeruginosa mediated differential expression of genes related to virulent infection and inflammation in mouse lung
- Author
David L. Ebenezer, Panfeng Fu, Yashaswin Krishnan, Mark Maienschein-Cline, Hong Hu, Segun Jung, Ravi Madduri, Zarema Arbieva, Anantha Harijith, and Viswanathan Natarajan
- Subjects
Pseudomonas aeruginosa ,Pneumonia ,Sphingosine kinase 2 ,Sphingolipids ,Genomics, bacterial resistance ,Biotechnology ,TP248.13-248.65 ,Genetics ,QH426-470
- Abstract
Background: Pseudomonas aeruginosa (PA) is an opportunistic Gram-negative bacterium that causes serious life-threatening and nosocomial infections, including pneumonia. PA has the ability to alter the host genome to facilitate its invasion, thus increasing the virulence of the organism. Sphingosine-1-phosphate (S1P), a bioactive lipid, is known to play a key role in facilitating infection. Sphingosine kinases (SPHK) 1 and 2 phosphorylate sphingosine to generate S1P in mammalian cells. We reported earlier that Sphk2 −/− mice were significantly protected against lung inflammation compared to wild type (WT) animals. Therefore, we profiled the differential expression of genes between the protected Sphk2 −/− group and the WT controls to better understand the protective mechanisms underlying Sphk2 deletion in lung inflammatory injury. Whole transcriptome shotgun sequencing (RNA-Seq) was performed on mouse lung tissue using the NextSeq 500 sequencing system.
Results: Two-way analysis of variance (ANOVA) was performed, and genes differentially expressed following PA infection were identified using the whole transcriptome of Sphk2 −/− mice and their WT counterparts. Pathway (PW) enrichment analyses of the RNA-Seq data identified several signaling pathways that are likely to play a crucial role in pneumonia caused by PA, such as those involved in: 1. Immune response to PA infection and NF-κB signal transduction; 2. PKC signal transduction; 3. Impact on epigenetic regulation; 4. Epithelial sodium channel pathway; 5. Mucin expression; and 6. Bacterial infection related pathways. Our genomic data suggest a potential role for SPHK2 in PA-induced pneumonia through elevated expression of inflammatory genes in lung tissue. Further, validation by RT-PCR on 10 differentially expressed genes showed 100% concordance in the direction of change as well as significant fold change.
Conclusion: Using Sphk2 −/− mice and differential gene expression analysis, we have shown here that S1P/SPHK2 signaling could play a key role in promoting PA pneumonia. The identified genes promote inflammation and suppress others that naturally inhibit inflammation and host defense. Thus, targeting SPHK2/S1P signaling in PA-induced lung inflammation could serve as a potential therapy to combat PA-induced pneumonia.
- Published
- 2019
3. Identification of Genetic and Epigenetic Variants Associated with Breast Cancer Prognosis by Integrative Bioinformatics Analysis
- Author
Arunima Shilpi, Yingtao Bi, Segun Jung, Samir K. Patra, and Ramana V. Davuluri
- Subjects
Neoplasms. Tumors. Oncology. Including cancer and carcinogens ,RC254-282
- Abstract
Introduction: Breast cancer is a multifaceted disease that spans a wide spectrum of histological and molecular variability in tumors. However, identifying these variations is complicated by the interplay between inherited genetic and epigenetic aberrations. This study therefore examines the interplay between DNA methylation and single-nucleotide polymorphisms (SNPs) with a view to identifying prognostic markers in breast cancer. The effect of these SNPs on methylation is defined as methylation quantitative trait loci (meQTL).
Materials and Methods: We developed a novel method to identify prognostic gene signatures for breast cancer by integrating genomic and epigenomic data. This is based on the hypothesis that multiple sources of evidence pointing to the same gene or pathway are likely to lead to reduced false positives. We also apply random resampling to reduce overfitting noise by dividing samples into training and testing data sets. Specifically, the samples common to the Illumina 450 DNA methylation, Affymetrix SNP array, and clinical data sets obtained from The Cancer Genome Atlas (TCGA) for breast invasive carcinoma (BRCA) were randomly divided into training and test sets. Statistical analysis based on the log-rank test and the Cox proportional hazards model established a significant association between differential methylation and the stratification of breast cancer patients into high- and low-risk groups.
Results: The assessment based on the joint effect of CpG–SNP pairs guided the stratification of breast cancer patients into high- and low-risk groups. In particular, the most significant associations were found for the cg05370838–rs2230576, cg00956490–rs940453, and cg11340537–rs2640785 CpG–SNP pairs. These CpG–SNP pairs were strongly associated with differential expression of the ADAM8, CREB5, and EXPH5 genes, respectively. In addition, SNPs such as rs10101376, rs140679, and rs1538146 on their own also showed significant prognostic value.
Conclusions: Thus, the analysis based on DNA methylation and SNPs has identified novel susceptibility loci with prognostic relevance in breast cancer.
- Published
- 2017
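The abstract of entry 3 mentions randomly dividing the common samples into training and test sets. A minimal sketch of such a split is shown below, using only the Python standard library. The sample identifiers, the 50/50 split fraction, and the seed are assumptions for illustration; the abstract does not state the values the authors used.

```python
import random

def split_samples(sample_ids, test_fraction=0.5, seed=0):
    """Randomly partition sample IDs into (training, held_out) lists.

    test_fraction and seed are illustrative assumptions, not values
    taken from the abstract.
    """
    rng = random.Random(seed)          # local generator, reproducible
    shuffled = list(sample_ids)        # copy so the input is not mutated
    rng.shuffle(shuffled)
    n_held_out = int(len(shuffled) * test_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]

# Hypothetical sample identifiers standing in for the shared samples.
sample_ids = [f"SAMPLE-{i:03d}" for i in range(10)]
training, held_out = split_samples(sample_ids)
print(len(training), "training samples;", len(held_out), "held-out samples")
```

Fixing the seed makes the partition reproducible, which matters when a signature derived on the training set is later evaluated on the held-out set.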
4. Predicting helical topologies in RNA junctions as tree graphs.
- Author
Christian Laing, Segun Jung, Namhee Kim, Shereef Elmetwaly, Mai Zahran, and Tamar Schlick
- Subjects
Medicine ,Science
- Abstract
RNA molecules are important cellular components involved in many fundamental biological processes. Understanding the mechanisms behind their functions requires knowledge of their tertiary structures. Though computational RNA folding approaches exist, they often require manual manipulation and expert intuition; predicting global long-range tertiary contacts remains challenging. Here we develop a computational approach and associated program module (RNAJAG) to predict helical arrangements/topologies in RNA junctions. Our method has two components: junction topology prediction and graph modeling. First, junction topologies are determined by a data mining approach from a given secondary structure of the target RNAs; second, the predicted topology is used to construct a tree graph consistent with geometric preferences analyzed from solved RNAs. The predicted graphs, which model the helical arrangements of RNA junctions for a large set of 200 junctions using a cross validation procedure, yield fairly good representations compared to the helical configurations in native RNAs, and can be further used to develop all-atom models as we show for two examples. Because junctions are among the most complex structural elements in RNA, this work advances folding structure prediction methods of large RNAs. The RNAJAG module is available to academic users upon request.
- Published
- 2013
5. Abstract 1254: Actionable fusions detected by RNA-seq co-occur with PD-L1 expression and driver mutations in solid tumors patients
- Author
Miyoung Shin, Paris Petersen, Steven Lau-Rivera, Segun Jung, Sally Agersborg, Jacyln Hechtman, Fernando Lopez-Diaz, and Vincent Funari
- Subjects
Cancer Research ,Oncology
- Abstract
Background: Incorporating gene fusions into a comprehensive profile is critical, not only because very effective therapies targeting oncogenic fusion proteins exist but also because fusions may negate the response to therapies targeting actionable SNV and InDel mutations. We examined the co-occurrence of actionable gene fusions detected by RNA-seq with actionable SNVs/InDels and with the immunotherapy response biomarker PD-L1. Methods: In 2021, 5341 FFPE samples were analyzed by our clinical laboratory using a novel hybridization-based RNA sequencing assay. DNA mutations (SNVs/InDels) were detected with a clinical-grade NGS assay, and PD-L1 protein expression was determined by IHC using an appropriate LDT or FDA-approved assay. De-identified data were analyzed following an approved IRB protocol. Results: Of the 5341 patients tested for gene fusions, only 0.7% were profiled with a comprehensive fusion detection panel for 250 clinically relevant fusion genes. Conversely, 67% of all patients were tested only for NTRK gene fusions. The prevalence of the most relevant fusions was: ALK=3.16% (in particular EML4-ALK), NTRKs=0.85%, FGFR2=5.60%, RET=1.34% and ROS1=1.05%. Among NTRK fusions, NTRK3 was detected in 52.9% of the positive cases; in particular, the dominant fusions were ETV6-NTRK3 (26%) and EML4-NTRK3 (20%). Of the 163 fusion-positive cases, 37 had available DNA mutation testing and PD-L1 expression testing. These patients presented 24 different actionable fusions, including ALK (2), FGFR1/2 (5), NTRK 1/2/3 (10), NRG1 (3), RET (3), and ROS1 (3) fusions. Interestingly, while 73% (27/37) of those tumors were PD-L1 positive, similar to the 75% found in the fusion-negative samples, PD-L1 was positive in 88% (15/17) of the lung samples with pathogenic mutations from this subset. This was strikingly higher than the 45% (192/424) found in the fusion-negative cohort.
Finally, excluding TP53 mutations, NTRK fusions frequently co-occurred with RNF43, FBXW7, TERT promoter, and ARID1A pathogenic mutations. FGFR fusions co-occurred with BAP1, PBRM1, or KRAS mutations, which correlates with the fact that both FGFR2 fusions and SWI/SNF alterations are enriched in intrahepatic cholangiocarcinoma, where the FGFR fusions were detected. RET fusions co-occurred with ARID1A and TERT. ROS1 fusions co-occurred with SMARCA4 and KRAS pathogenic mutations simultaneously. NRG1 fusions co-occurred with ARID1A and FBXW7 mutations. EML4-ALK fusions co-occurred with FGFR2 and KMT2D variants of unknown significance. Conclusion: Actionable cancer-driving gene fusions were detected by RNA sequencing and co-occurred with other biomarkers that can guide the selection of therapies, such as immune checkpoint inhibitors (ICI) and olaparib. The data suggest that gene fusion testing is an important addition to genomic profiling for therapy selection in solid tumors. Citation Format: Miyoung Shin, Paris Petersen, Steven Lau-Rivera, Segun Jung, Sally Agersborg, Jacyln Hechtman, Fernando Lopez-Diaz, Vincent Funari. Actionable fusions detected by RNA-seq co-occur with PD-L1 expression and driver mutations in solid tumors patients [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1254.
- Published
- 2022
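The PD-L1 proportions quoted in the abstract of entry 5 can be re-derived from the counts it gives. The following is a minimal sketch in Python that uses only the counts stated in the text; the dictionary labels and the helper name `percent` are illustrative and do not come from the abstract.

```python
# Counts copied from the abstract as (positives, total); labels are illustrative.
reported_counts = {
    "fusion-positive tumors with DNA and PD-L1 testing": (27, 37),
    "lung samples with pathogenic mutations": (15, 17),
    "fusion-negative cohort": (192, 424),
}

def percent(positives: int, total: int) -> int:
    """Return positives/total as a percentage rounded to the nearest integer."""
    return round(100 * positives / total)

percentages = {
    label: percent(positives, total)
    for label, (positives, total) in reported_counts.items()
}

for label, value in percentages.items():
    print(f"PD-L1 positive, {label}: {value}%")
```

The rounded values agree with the 73%, 88%, and 45% stated in the abstract.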
6. Integration of genomics and transcriptomics predicts diabetic retinopathy susceptibility genes
- Author
Segun Jung, Barbara E. Stranger, Maria Sverdlov, Dingcai Cao, Ionut Bebu, Amy Y. Lin, Ana Marija Sokovic, Michael A. Grassi, Dcct, Sarah Fazal, Andrew D Skol, Siquan Chen, Poulami P Borkar, Olukayode A. Sosina, and Anand Swaroop
- Subjects
Male ,0301 basic medicine ,folliculin ,Disease ,Bioinformatics ,Transcriptome ,0302 clinical medicine ,Lymphocytes ,Biology (General) ,Cell Line, Transformed ,General Neuroscience ,General Medicine ,Diabetic retinopathy ,diabetic retinopathy ,030220 oncology & carcinogenesis ,Medicine ,Female ,Research Article ,Human ,Retinopathy ,Adult ,QH301-705.5 ,Science ,Quantitative Trait Loci ,Blood sugar ,eQTL ,Polymorphism, Single Nucleotide ,General Biochemistry, Genetics and Molecular Biology ,Young Adult ,03 medical and health sciences ,LCLs ,Proto-Oncogene Proteins ,Diabetes mellitus ,medicine ,Humans ,Genetic Predisposition to Disease ,Folliculin ,Gene ,General Immunology and Microbiology ,business.industry ,Gene Expression Profiling ,Tumor Suppressor Proteins ,Genetics and Genomics ,Mendelian Randomization Analysis ,medicine.disease ,Diabetes Mellitus, Type 1 ,Glucose ,030104 developmental biology ,Case-Control Studies ,gene expression ,mendelian randomization ,business ,Genome-Wide Association Study
- Abstract
We determined differential gene expression in response to high glucose in lymphoblastoid cell lines derived from matched individuals with type 1 diabetes with and without retinopathy. Those genes exhibiting the largest difference in glucose response were assessed for association with diabetic retinopathy in a genome-wide association study meta-analysis. Expression quantitative trait loci (eQTLs) of the glucose response genes were tested for association with diabetic retinopathy. We detected an enrichment of the eQTLs from the glucose response genes among small association p-values and identified folliculin (FLCN) as a susceptibility gene for diabetic retinopathy. Expression of FLCN in response to glucose was greater in individuals with diabetic retinopathy. Independent cohorts of individuals with diabetes revealed an association of FLCN eQTLs with diabetic retinopathy. Mendelian randomization confirmed a direct positive effect of increased FLCN expression on retinopathy. Integrating genetic association with gene expression implicated FLCN as a disease gene for diabetic retinopathy.
eLife digest: One of the side effects of diabetes is loss of vision from diabetic retinopathy, which is caused by injury to the light sensing tissue in the eye, the retina. Almost all individuals with diabetes develop diabetic retinopathy to some extent, and it is the leading cause of irreversible vision loss in working-age adults in the United States. How long a person has been living with diabetes, the extent of increased blood sugars and genetics all contribute to the risk and severity of diabetic retinopathy. Unfortunately, virtually no genes associated with diabetic retinopathy have yet been identified. When a gene is activated, it produces messenger molecules known as mRNA that are used by cells as instructions to produce proteins. The analysis of mRNA molecules, as well as genes themselves, can reveal the role of certain genes in disease.
The studies of all genes and their associated mRNAs are respectively called genomics and transcriptomics. Genomics reveals what genes are present, while transcriptomics shows how active genes are in different cells. Skol et al. developed methods to study genomics and transcriptomics together to help discover genes that cause diabetic retinopathy. Genes involved in how cells respond to high blood sugar were first identified using cells grown in the lab. By comparing the activity of these genes in people with and without retinopathy the study identified genes associated with an increased risk of retinopathy in diabetes. In people with retinopathy, the activity of the folliculin gene (FLCN) increased more in response to high blood sugar. This was further verified with independent groups of people and using computer models to estimate the effect of different versions of the folliculin gene. The methods used here could be applied to understand complex genetics in other diseases. The results provide new understanding of the effects of diabetes. They may also help in the development of new treatments for diabetic retinopathy, which are likely to improve on the current approach of using laser surgery or injections into the eye.
- Published
- 2020
7. Author response: Integration of genomics and transcriptomics predicts diabetic retinopathy susceptibility genes
- Author
Maria Sverdlov, Ionut Bebu, Michael A. Grassi, Poulami P Borkar, Olukayode A. Sosina, Anand Swaroop, Ana Marija Sokovic, Barbara E. Stranger, Segun Jung, Siquan Chen, Andrew D Skol, Amy Y. Lin, Dingcai Cao, Sarah Fazal, and Dcct
- Subjects
Transcriptome ,medicine ,Genomics ,Susceptibility gene ,Computational biology ,Diabetic retinopathy ,Biology ,medicine.disease
- Published
- 2020
8. Mendelian randomization identifies folliculin expression as a mediator of diabetic retinopathy
- Author
Barbara E. Stranger, Maria Sverdlov, Ionut Bebu, Andrew D Skol, Amy Y. Lin, Michael A. Grassi, Segun Jung, Sarah Fazal, Siquan Chen, Poulami P Borkar, Dcct, Olukayode A. Sosina, Anand Swaroop, Dingcai Cao, and Ana Marija Sokovic
- Subjects
Type 1 diabetes ,business.industry ,Diabetes mellitus ,Mendelian randomization ,Expression quantitative trait loci ,Medicine ,Diabetic retinopathy ,Folliculin ,business ,medicine.disease ,Bioinformatics ,Genetic association ,Retinopathy
- Abstract
The goal of the study was to identify genes whose aberrant expression can contribute to diabetic retinopathy. We determined differential gene expression in response to high glucose in lymphoblastoid cell lines derived from matched individuals with type 1 diabetes (T1D) with and without retinopathy. Those genes exhibiting the largest difference in glucose response between individuals with diabetes with and without retinopathy were assessed for association with diabetic retinopathy using genotype data from a genome-wide association study meta-analysis. All genetic variants associated with gene expression (expression quantitative trait loci, eQTLs) of the glucose response genes were tested for association with diabetic retinopathy. We detected an enrichment of the eQTLs from the glucose response genes among small association p-values and identified folliculin (FLCN) as a susceptibility gene for diabetic retinopathy. We show that expression of FLCN in response to glucose was greater in individuals with diabetic retinopathy than in individuals with diabetes without retinopathy. Three large, independent cohorts of individuals with diabetes revealed an association of FLCN eQTLs with diabetic retinopathy. Mendelian randomization further confirmed a direct positive effect of increased FLCN expression on retinopathy in individuals with diabetes. Together, our studies integrating genetic association and gene expression implicate FLCN as a disease gene for diabetic retinopathy.
- Published
- 2020
9. A novel MERTK mutation causing retinitis pigmentosa
- Author
Segun Jung, Kaanan P. Shah, Michael A. Grassi, Ravi Madduri, Hasenin Al-khersan, and Alex Rodriguez
- Subjects
Male ,0301 basic medicine ,Proband ,DNA Mutational Analysis ,Nonsense mutation ,Biology ,Retina ,Article ,03 medical and health sciences ,Cellular and Molecular Neuroscience ,0302 clinical medicine ,Locus heterogeneity ,medicine ,Humans ,Exome ,Exome sequencing ,Genetic testing ,Genetics ,c-Mer Tyrosine Kinase ,medicine.diagnostic_test ,Genetic heterogeneity ,DNA ,MERTK ,medicine.disease ,Sensory Systems ,Pedigree ,Ophthalmoscopy ,Ophthalmology ,030104 developmental biology ,Mutation ,030221 ophthalmology & optometry ,Female ,Allelic heterogeneity ,Retinitis Pigmentosa
- Abstract
Retinitis pigmentosa (RP) is a genetically heterogeneous inherited retinal dystrophy. To date, over 80 genes have been implicated in RP. However, the disease demonstrates significant locus and allelic heterogeneity not entirely captured by current testing platforms. The purpose of the present study was to characterize the underlying mutation in a patient with RP without a molecular diagnosis after initial genetic testing. Whole-exome sequencing of the affected proband was performed. Candidate gene mutations were selected based on adherence to expected genetic inheritance pattern and predicted pathogenicity. Sanger sequencing of MERTK was completed on the patient’s unaffected mother, affected brother, and unaffected sister to determine genetic phase. Eight sequence variants were identified in the proband in known RP-associated genes. Sequence analysis revealed that the proband was a compound heterozygote with two independent mutations in MERTK, a novel nonsense mutation (c.2179C > T) and a previously reported missense variant (c.2530C > T). The proband’s affected brother also had both mutations. Predicted phase was confirmed in unaffected family members. Our study identifies a novel nonsense mutation in MERTK in a family with RP and no prior molecular diagnosis. The present study also demonstrates the clinical value of exome sequencing in determining the genetic basis of Mendelian diseases when standard genetic testing is unsuccessful.
- Published
- 2017
10. Identification and validation of regulatory SNPs that modulate transcription factor chromatin binding and gene expression in prostate cancer
- Author
Segun Jung, Ramana V. Davuluri, Hongjian Jin, and Auditi R. DebRoy
- Subjects
Male ,0301 basic medicine ,SNP ,Single-nucleotide polymorphism ,Genome-wide association study ,Kaplan-Meier Estimate ,Biology ,eQTL ,Polymorphism, Single Nucleotide ,03 medical and health sciences ,0302 clinical medicine ,Cell Line, Tumor ,Humans ,Genetic Predisposition to Disease ,Enhancer ,CRISPR/Cas9 ,Gene ,Alleles ,transcription factor ,Genetics ,Base Sequence ,Chromatin binding ,Prostatic Neoplasms ,prostate cancer ,Chromatin ,3. Good health ,Gene Expression Regulation, Neoplastic ,030104 developmental biology ,Oncology ,030220 oncology & carcinogenesis ,Expression quantitative trait loci ,Functional genomics ,Chromatin immunoprecipitation ,Genome-Wide Association Study ,Protein Binding ,Transcription Factors ,Research Paper
- Abstract
Hong-Jian Jin 1, Segun Jung 1, Auditi R. DebRoy 1, Ramana V. Davuluri 1
1 Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
Correspondence to: Ramana V. Davuluri, email: ramana.davuluri@northwestern.edu
Keywords: SNP, prostate cancer, transcription factor, CRISPR/Cas9, eQTL
Received: March 23, 2016 Accepted: May 23, 2016 Published: July 09, 2016
ABSTRACT
Prostate cancer (PCa) is the second most common solid tumor for cancer-related deaths in American men. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with the increased risk of PCa. Because most of the susceptibility SNPs are located in noncoding regions, little is known about their functional mechanisms. We hypothesize that functional SNPs reside in cell type-specific regulatory elements that mediate the binding of critical transcription factors (TFs), which in turn results in changes in target gene expression. Using PCa-specific functional genomics data, here we identify 38 regulatory candidate SNPs and their target genes in PCa. Through risk analysis incorporating gene expression and clinical data, we identify 6 target genes (ZG16B, ANKRD5, RERE, FAM96B, NAALADL2 and GTPBP10) as significant predictors of PCa biochemical recurrence. In addition, 5 SNPs (rs2659051, rs10936845, rs9925556, rs6057110 and rs2742624) are selected for experimental validation using chromatin immunoprecipitation (ChIP) and dual-luciferase reporter assays in LNCaP cells, showing allele-specific enhancer activity. Furthermore, we delete the rs2742624-containing region using CRISPR/Cas9 genome editing and observe drastic downregulation of its target gene UPK3A. Taken together, our results illustrate that this new methodology can be applied to identify regulatory SNPs and their target genes that likely impact PCa risk. We suggest that similar studies can be performed to characterize regulatory variants in other diseases.
- Published
- 2016
11. Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types
- Author
Cory C. Funk, Nilufer Ertekin-Taner, Leroy Hood, Matthew A. Richards, Nathan D. Price, Alex Rodriguez, Gustavo Glusman, Yukai Xiao, Alex M. Casella, Segun Jung, Ben Heavner, Ian Foster, Kyle Chard, Paul Shannon, Rory Donovan-Maiye, Carl Kesselman, John D. Van Horn, Todd E. Golde, Arthur W. Toga, Ravi Madduri, and Seth A. Ament
- Subjects
0301 basic medicine ,genetic processes ,information science ,Gene regulatory network ,Computational biology ,Biology ,ENCODE ,General Biochemistry, Genetics and Molecular Biology ,DNase-Seq ,Article ,03 medical and health sciences ,0302 clinical medicine ,Genetic variation ,Humans ,natural sciences ,Transcription factor ,Binding Sites ,Deoxyribonucleases ,Genomics ,Footprinting ,DNA binding site ,030104 developmental biology ,health occupations ,Human genome ,Sequence motif ,Hypersensitive site ,030217 neurology & neurosurgery ,Transcription Factors
- Abstract
There is intense interest in mapping the tissue-specific binding sites of transcription factors in the human genome to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting provides a means to predict genome-wide binding sites for hundreds of transcription factors (TFs) simultaneously. However, despite the public availability of DNase-seq data for hundreds of samples, there is neither a unified analytical workflow nor a publicly accessible database providing the locations of footprints across all available samples. Here, we implemented a workflow for uniform processing of footprints using two state-of-the-art footprinting algorithms: Wellington and HINT. Our workflow scans the footprints generated by these algorithms for 1,530 sequence motifs to predict binding sites for 1,515 human transcription factors. We applied our workflow to detect footprints in 192 DNase-seq experiments from ENCODE spanning 27 human tissues. This collection of footprints describes an expansive landscape of potential TF occupancy. At thresholds optimized through machine learning, we report high-quality footprints covering 9.8% of the human genome. These footprints were enriched for true positive TF binding sites as defined by ChIP-seq peaks, as well as for genetic variants associated with changes in gene expression. Integrating our footprint atlas with summary statistics from genome-wide association studies revealed that risk for neuropsychiatric traits was enriched specifically at high-scoring footprints in human brain, while risk for immune traits was enriched specifically at high-scoring footprints in human lymphoblasts. Our cloud-based workflow is available at github.com/globusgenomics/genomics-footprint, and a database with all footprints and TF binding site predictions is publicly available at http://data.nemoarchive.org/other/grant/sament/sament/footprint_atlas.
- Published
- 2018
12. Reproducible big data science: A case study in continuous FAIRness
- Author
Nathan D. Price, Kyle Chard, Paul Shannon, Cory C. Funk, Alexis A. Rodriguez, Matthew A. Richards, Eric W. Deutsch, Dinanath Sulakhe, Ravi Madduri, Segun Jung, Mike D'Arcy, Gustavo Glusman, Carl Kesselman, Ben Heavner, and Ian Foster
- Subjects
Big Data ,Databases, Factual ,Process (engineering) ,Computer science ,Science ,Data management ,Big data ,Interoperability ,Cloud computing ,Terabyte ,03 medical and health sciences ,0302 clinical medicine ,Humans ,Longitudinal Studies ,030304 developmental biology ,0303 health sciences ,Multidisciplinary ,business.industry ,Information Dissemination ,Data Science ,Usability ,Data science ,Metadata ,Medicine ,business ,030217 neurology & neurosurgery ,Algorithms ,Software
- Abstract
Big biomedical data create exciting opportunities for discovery but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. In this talk, we will describe the enhancements made to the Galaxy framework to support working with datasets referred to by minids, analyzing BagIt-based research objects called BDBags, and executing software encapsulated in Docker containers with unique identifiers. We will also describe the tools and services developed to create end-to-end reproducible analysis pipelines while adhering to FAIR principles.
Accepted talk at RO2018
- Published
- 2018
13. Expression profiling of genes regulated by Sphingosine kinase 2 in a murine model of Pseudomonas aeruginosa mediated acute lung inflammation
- Author
Anantha Harijith, Panfeng Fu, Yashaswin Krishnan, Ravi Madduri, Hong Hu, David L. Ebenezer, Segun Jung, Zarema Arbieva, and Viswanathan Natarajan
- Subjects
Lung ,Pseudomonas aeruginosa ,Sphingosine Kinase 2 ,Inflammation ,Biology ,medicine.disease_cause ,Biochemistry ,Gene expression profiling ,medicine.anatomical_structure ,Murine model ,Genetics ,medicine ,Cancer research ,medicine.symptom ,Molecular Biology ,Gene ,Biotechnology
- Published
- 2018
14. Abstract A2-47: Informatics framework for clustering and deriving gene signatures for prognostic stratification of cancer patients
- Author
Yingtao Bi, Segun Jung, and Ramana V. Davuluri
- Subjects
Cancer Research ,Oncology ,business.industry ,Informatics ,Medicine ,Cancer ,Bioinformatics ,business ,Cluster analysis ,medicine.disease ,Gene ,Prognostic stratification
- Abstract
Stratification of cancer patients into different molecular groups is essential for developing targeted therapies. High-throughput technologies, such as microarrays and next-generation sequencing, have been extensively used for generating multi-omics data. Indeed, The Cancer Genome Atlas (TCGA) consortium, one of the most comprehensive and popular databases of cancer, has been accumulating large volumes of invaluable data for more than 30 cancer types, offering an unprecedented opportunity to attain new insights in cancer biology. Despite the technological advances, analyzing, integrating and translating gene signatures across different platforms remains a computationally challenging task. Here, we developed a novel computational framework that integrates genomic and clinical data to stratify cancer patients into different molecular subgroups and predict clinically applicable phenotypes, such as survival. Application of this user-friendly framework derives platform-independent isoform-level gene signatures that can be translated from high-dimensional platforms (e.g., RNA-Seq) to clinically adaptable low-dimensional molecular assays (e.g., RT-PCR) for prognostic stratification. We applied the pipeline to two TCGA lung cancer datasets: Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC). Using independent test data, we achieved about 93% (LUAD) and 98% (LUSC) classification accuracy using a gene signature of fewer than 70 isoforms. The proposed informatics platform is applicable to other cancer types in the TCGA data portal. Citation Format: Segun Jung, Yingtao Bi, Ramana V. Davuluri. Informatics framework for clustering and deriving gene signatures for prognostic stratification of cancer patients. [abstract]. In: Proceedings of the AACR Special Conference on Translation of the Cancer Genome; Feb 7-9, 2015; San Francisco, CA. Philadelphia (PA): AACR; Cancer Res 2015;75(22 Suppl 1):Abstract nr A2-47.
- Published
- 2015
15. Interconversion between Parallel and Antiparallel Conformations of a 4H RNA junction in Domain 3 of Foot-and-Mouth Disease Virus IRES Captured by Dynamics Simulations
- Author
Tamar Schlick and Segun Jung
- Subjects
Principal Component Analysis ,Base Sequence ,Biophysics ,Stacking ,RNA ,Molecular Dynamics Simulation ,Biology ,Antiparallel (biochemistry) ,Internal ribosome entry site ,Crystallography ,Molecular dynamics ,Förster resonance energy transfer ,Foot-and-Mouth Disease Virus ,Nucleic Acid Conformation ,RNA, Viral ,Proteins and Nucleic Acids ,Peptide Chain Initiation, Translational ,Conformational isomerism ,Protein secondary structure - Abstract
RNA junctions are common secondary structural elements present in a wide range of RNA species. They play crucial roles in directing the overall folding of RNA molecules as well as in a variety of biological functions. In particular, there has been great interest in the dynamics of RNA junctions, including conformational pathways of fully base-paired 4-way (4H) RNA junctions. In such constructs, all nucleotides participate in one of the four double-stranded stem regions, with no connecting loops. Dynamical aspects of these 4H RNAs are interesting because frequent interchanges between parallel and antiparallel conformations are thought to occur without binding of other factors. Gel electrophoresis and single-molecule fluorescence resonance energy transfer experiments have suggested two possible pathways: one involves a helical rearrangement via disruption of coaxial stacking, and the other occurs by a rotation between the helical axes of coaxially stacked conformers. Employing molecular dynamics simulations, we explore this conformational variability in a 4H junction derived from domain 3 of the foot-and-mouth disease virus internal ribosome entry site (IRES); this junction contains highly conserved motifs for RNA-RNA and RNA-protein interactions, important for IRES activity. Our simulations capture transitions of the 4H junction between parallel and antiparallel conformations. The interconversion is virtually barrier-free and occurs via a rotation between the axes of coaxially stacked helices with a transient perpendicular intermediate. We characterize this transition, with various interhelical orientations, by pseudodihedral angle and interhelical distance measures. The high flexibility of the junction, as also demonstrated experimentally, is suitable for IRES activity. 
Because foot-and-mouth disease virus IRES structure depends on long-range interactions involving domain 3, the perpendicular intermediate, which maintains coaxial stacking of helices and thereby consensus primary and secondary structure information, may be beneficial for guiding the overall organization of the RNA system in domain 3.
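The two measures named in this abstract, a pseudodihedral angle and an interhelical distance, can be illustrated with a minimal sketch. The abstract does not say which atoms or axis points define the angle, so the four reference points below (the far and junction-side ends of two helical axes) are an assumption made for illustration only, as are the function names.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pseudodihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points.

    Assumed convention: p0 -> p1 runs along one helical axis towards the
    junction, p2 -> p3 runs along the other axis away from it, and the
    angle measures rotation about the p1 -> p2 connector.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def interhelical_distance(p, q):
    """Euclidean distance between two reference points, e.g. helix midpoints."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))

# Toy coordinates (hypothetical, not taken from the simulations):
same_side = pseudodihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
opposite = pseudodihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
perpendicular = pseudodihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
```

Under this assumed convention, an angle near 0° corresponds to the two axes pointing the same way, near 180° to opposite directions, and near 90° to a perpendicular arrangement like the transient intermediate the abstract describes.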
- Published
- 2014
16. Identification of Genetic and Epigenetic Variants Associated with Breast Cancer Prognosis by Integrative Bioinformatics Analysis
- Author
-
Ramana V. Davuluri, Arunima Shilpi, Yingtao Bi, Segun Jung, and Samir Kumar Patra
- Subjects
0301 basic medicine ,Cancer Research ,overall survival ,Single-nucleotide polymorphism ,Genomics ,Computational biology ,Quantitative trait locus ,Biology ,Bioinformatics ,lcsh:RC254-282 ,03 medical and health sciences ,0302 clinical medicine ,Breast cancer ,medicine ,Epigenetics ,Original Research ,Epigenomics ,DNA methylation ,meQTLs ,single-nucleotide polymorphism ,medicine.disease ,lcsh:Neoplasms. Tumors. Oncology. Including cancer and carcinogens ,3. Good health ,030104 developmental biology ,Oncology ,030220 oncology & carcinogenesis ,SNP array - Abstract
Introduction Breast cancer, being a multifaceted disease, constitutes a wide spectrum of histological and molecular variability in tumors. However, the task for the identification of these variances is complicated by the interplay between inherited genetic and epigenetic aberrations. Therefore, this study provides an extrapolate outlook to the sinister partnership between DNA methylation and single-nucleotide polymorphisms (SNPs) in relevance to the identification of prognostic markers in breast cancer. The effect of these SNPs on methylation is defined as methylation quantitative trait loci (meQTL). Materials and Methods We developed a novel method to identify prognostic gene signatures for breast cancer by integrating genomic and epigenomic data. This is based on the hypothesis that multiple sources of evidence pointing to the same gene or pathway are likely to lead to reduced false positives. We also apply random resampling to reduce overfitting noise by dividing samples into training and testing data sets. Specifically, the common samples between Illumina 450 DNA methylation, Affymetrix SNP array, and clinical data sets obtained from The Cancer Genome Atlas (TCGA) for breast invasive carcinoma (BRCA) were randomly divided into training and test models. An intensive statistical analysis based on log-rank test and Cox proportional hazard model has established a significant association between differential methylation and the stratification of breast cancer patients into high- and low-risk groups, respectively. Results The comprehensive assessment based on the conjoint effect of CpG–SNP pair has guided in delaminating the breast cancer patients into the high- and low-risk groups. In particular, the most significant association was found with respect to cg05370838–rs2230576, cg00956490–rs940453, and cg11340537–rs2640785 CpG–SNP pairs. These CpG–SNP pairs were strongly associated with differential expression of ADAM8, CREB5, and EXPH5 genes, respectively. 
Besides, the exclusive effect of SNPs such as rs10101376, rs140679, and rs1538146 is also a significant prognostic determinant. Conclusions Thus, the analysis based on DNA methylation and SNPs has resulted in the identification of novel susceptible loci that hold prognostic relevance in breast cancer.
- Published
- 2017
17. Candidate RNA structures for domain 3 of the foot-and-mouth-disease virus internal ribosome entry site
- Author
-
Segun Jung and Tamar Schlick
- Subjects
Models, Molecular ,Computational biology ,Biology ,Molecular Dynamics Simulation ,010402 general chemistry ,01 natural sciences ,Tetraloop ,03 medical and health sciences ,Eukaryotic translation ,Untranslated Regions ,Genetics ,Nucleic acid structure ,Binding site ,Peptide Chain Initiation, Translational ,Conserved Sequence ,030304 developmental biology ,0303 health sciences ,Base Sequence ,RNA ,Computational Biology ,Translation (biology) ,Virology ,Protein tertiary structure ,0104 chemical sciences ,Internal ribosome entry site ,Foot-and-Mouth Disease Virus ,Nucleic Acid Conformation ,RNA, Viral - Abstract
The foot-and-mouth-disease virus (FMDV) utilizes non-canonical translation initiation for viral protein synthesis, by forming a specific RNA structure called internal ribosome entry site (IRES). Domain 3 in FMDV IRES is phylogenetically conserved and highly structured; it contains four-way junctions where intramolecular RNA-RNA interactions serve as a scaffold for the RNA to fold for efficient IRES activity. Although the 3D structure of domain 3 is crucial to exploring and deciphering the initiation mechanism of translation, little is known. Here, we employ a combination of various modeling approaches to propose candidate tertiary structures for the apical region of domain 3, thought to be crucial for IRES function. We begin by modeling junction topology candidates and build atomic 3D models consistent with available experimental data. We then investigate each of the four candidate 3D structures by molecular dynamics simulations to determine the most energetically favorable configurations and to analyze specific tertiary interactions. Only one model emerges as viable containing not only the specific binding site for the GNRA tetraloop but also helical arrangements which enhance the stability of domain 3. These collective findings, together with available experimental data, suggest a plausible theoretical tertiary structure of the apical region in FMDV IRES domain 3.
- Published
- 2012
18. Tertiary Motifs Revealed in Analyses of Higher-Order RNA Junctions
- Author
-
Tamar Schlick, Abdul Iqbal, Segun Jung, and Christian Laing
- Subjects
Models, Molecular ,Base pair ,Stacking ,Protein Data Bank (RCSB PDB) ,RNA ,Biology ,Article ,Crystallography ,chemistry.chemical_compound ,Models, Chemical ,chemistry ,Structural Biology ,Chemical physics ,Nucleic Acid Conformation ,Nucleic acid structure ,Coaxial ,Base Pairing ,Molecular Biology ,Cytosine - Abstract
RNA junctions are secondary structure elements formed when three or more helices come together. They are present in diverse RNA molecules with various fundamental functions in the cell. To better understand the intricate architecture of three-dimensional RNAs, we analyze currently solved 3D RNA junctions in terms of basepair interactions and three-dimensional configurations. First, we study basepair interaction diagrams for solved RNA junctions with five to ten helices and discuss common features. Second, we compare these higher-order junctions to those containing three or four helices and identify global motif patterns such as coaxial-stacking and parallel and perpendicular helical configurations. These analyses show that higher order junctions organize their helical components in parallel and helical configurations similar to lower order junctions. Their sub-junctions also resemble local helical configurations found in three and four-way junctions, and are stabilized by similar long-range interaction preferences such as A-minor interactions. Furthermore, loop regions within junctions are high in adenine but low in cytosine. And, in agreement with previous studies, we suggest that coaxial stacking between helices likely forms when the common single stranded loop is small in size; however, other factors such as stacking interactions involving non-canonical basepairs and proteins can greatly determine or disrupt coaxial stacking. Finally, we introduce the ribo-base interactions: when combined with the along-groove packing motif, these ribo-base interactions form novel motifs involved in perpendicular helix-helix interactions. Overall, these analyses suggest recurrent tertiary motifs that stabilize junction architecture, pack helices, and help form helical configurations that occur as sub-elements of larger junction networks. 
The frequent occurrence of similar helical motifs suggests Nature’s finite and perhaps limited repertoire of RNA helical conformation preferences. More generally, studies of RNA junctions and tertiary building blocks can ultimately help in the difficult task of RNA 3D structure prediction.
- Published
- 2009
19. Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping
- Author
-
Segun Jung, Yingtao Bi, and Ramana V. Davuluri
- Subjects
Discretization ,exon-array ,Feature selection ,Biology ,Machine Learning ,Multiclass classification ,03 medical and health sciences ,0302 clinical medicine ,RNA Isoforms ,Genetics ,Cluster Analysis ,Humans ,030304 developmental biology ,data discretization ,0303 health sciences ,business.industry ,Gene Expression Profiling ,Research ,Computational Biology ,platform transition ,Pattern recognition ,multi-class classification ,Class (biology) ,Expression (mathematics) ,Random forest ,Statistical classification ,Identification (information) ,ComputingMethodologies_PATTERNRECOGNITION ,030220 oncology & carcinogenesis ,Artificial intelligence ,RNA-seq ,Glioblastoma ,business ,Algorithms ,Biotechnology - Abstract
Background Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring the multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods--Equal-width binning, Equal-frequency binning, and k-means clustering--in accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on the isoform-level gene expression profiles from exon-array platform and tested on the corresponding profiles from RNA-seq data. Results We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, the addition of data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, the addition of data discretization step dramatically improved the classification accuracies with Equal-frequency binning showing the highest improvement with more than 90% accuracies for all the models with features chosen by Random Forest based feature selection. Overall, SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, only 73.6% accuracy was achieved at most. Conclusions The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with highest classification accuracies on data transformed by Equal-frequency binning. 
The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
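Two of the three discretization methods compared in this abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names, bin counts, and toy values are assumptions, and the k-means variant is omitted for brevity.

```python
import math

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts.

    Bins are derived from ranks; tied values are split by input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

# Hypothetical expression values for one isoform across six samples.
raw = [1.0, 2.0, 3.0, 4.0, 50.0, 100.0]
log_scaled = [math.log(v) for v in raw]  # stand-in for a second platform's scale

width_labels = equal_width_bins(raw, 3)       # outliers dominate the bin edges
freq_labels = equal_frequency_bins(raw, 3)    # two samples per bin
freq_labels_log = equal_frequency_bins(log_scaled, 3)
```

Because equal-frequency labels depend only on the rank order of the samples, they are unchanged by any monotone rescaling of the measurements. That property is one plausible reason a rank-based discretization would transfer well between platforms, although the abstract itself reports only the empirical accuracies.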
- Published
- 2015
20. Development of Bioinformatics Infrastructure for Genomics Research
- Author
-
Nicola J. Mulder, Ezekiel Adebiyi, Marion Adebiyi, Seun Adeyemi, Azza Ahmed, Rehab Ahmed, Bola Akanle, Mohamed Alibi, Don L. Armstrong, Shaun Aron, Efejiro Ashano, Shakuntala Baichoo, Alia Benkahla, David K. Brown, Emile R. Chimusa, Faisal M. Fadlelmola, Dare Falola, Segun Fatumo, Kais Ghedira, Amel Ghouila, Scott Hazelhurst, Itunuoluwa Isewon, Segun Jung, Samar Kamal Kassim, Jonathan K. Kayondo, Mamana Mbiyavanga, Ayton Meintjes, Somia Mohammed, Abayomi Mosaku, Ahmed Moussa, Mustafa Muhammd, Zahra Mungloo-Dilmohamud, Oyekanmi Nashiru, Trust Odia, Adaobi Okafor, Olaleye Oladipo, Victor Osamor, Jellili Oyelade, Khalid Sadki, Samson Pandam Salifu, Jumoke Soyemi, Sumir Panji, Fouzia Radouani, Oussama Souiai, Özlem Tastan Bishop, The H3ABioNet Consortium, as Members of the H3Africa Consortium, University of Cape Town, Department of Computer and Information Sciences, Covenant University, Covenant University Bioinformatics Research (CUBRe), University of Khartoum, Laboratoire de Bioinformatique, biomathématiques, biostatistiques (BIMS) (LR11IPT09), Institut Pasteur de Tunis, Réseau International des Instituts Pasteur (RIIP)-Réseau International des Instituts Pasteur (RIIP)-Université de Tunis El Manar (UTM), University of Illinois at Urbana-Champaign [Urbana], University of Illinois System, University of the Witwatersrand [Johannesburg] (WITS), Federal Ministry of Science and Technology [Abuja] (FMST), University of Mauritius, Rhodes University, Grahamstown, Institute of Infectious Diseases and Molecular Medicine (IDM), Future University of Sudan, Laboratoire de Transmission, Contrôle et Immunobiologie des Infections - Laboratory of Transmission, Control and Immunobiology of Infection (LR11IPT02), Réseau International des Instituts Pasteur (RIIP)-Réseau International des Instituts Pasteur (RIIP), Computation Institute [Chicago], University of Chicago, Université Ain Shams, Uganda Virus Research Institute (UVRI), Laboratoire des Technologies de l'Information et de la 
Communication [Tanger] (Labtic), Ecole Nationale des Sciences Appliquées [Tanger] (ENSAT), Landmark University [Omu-Aran], Université Mohammed V, Kwame Nkrumah University of Science and Technology [GHANA] (KNUST), École polytechnique fédérale d'Ilaro, Institut Pasteur du Maroc, Réseau International des Instituts Pasteur (RIIP), and H3ABioNet is supported by the National Institutes of Health Common Fund (grant number U41HG006941)
- Subjects
0301 basic medicine ,MESH: Genomics/methods ,Epidemiology ,Computer science ,[SDV]Life Sciences [q-bio] ,media_common.quotation_subject ,Genomics ,MESH: Africa ,Bioinformatics ,Data type ,03 medical and health sciences ,0302 clinical medicine ,Excellence ,Controlled vocabulary ,media_common ,MESH: Computational Biology/trends ,Community and Home Care ,Spatial data infrastructure ,MESH: Humans ,Data collection ,MESH: Biomedical Research/methods ,Data science ,Metadata ,030104 developmental biology ,Workflow ,Cardiology and Cardiovascular Medicine ,030217 neurology & neurosurgery - Abstract
Background: Although pockets of bioinformatics excellence have developed in Africa, generally, large-scale genomic data analysis has been limited by the availability of expertise and infrastructure. H3ABioNet, a pan-African bioinformatics network, was established to build capacity specifically to enable H3Africa (Human Heredity and Health in Africa) researchers to analyze their data in Africa. Since the inception of the H3Africa initiative, H3ABioNet’s role has evolved in response to changing needs from the consortium and the African bioinformatics community. Objectives: H3ABioNet set out to develop core bioinformatics infrastructure and capacity for genomics research in various aspects of data collection, transfer, storage, and analysis. Methods and Results: Various resources have been developed to address genomic data management and analysis needs of H3Africa researchers and other scientific communities on the continent. NetMap was developed and used to build an accurate picture of network performance within Africa and between Africa and the rest of the world, and Globus Online has been rolled out to facilitate data transfer. A participant recruitment database was developed to monitor participant enrollment, and data is being harmonized through the use of ontologies and controlled vocabularies. The standardized metadata will be integrated to provide a search facility for H3Africa data and biospecimens. Because H3Africa projects are generating large-scale genomic data, facilities for analysis and interpretation are critical. H3ABioNet is implementing several data analysis platforms that provide a large range of bioinformatics tools or workflows, such as Galaxy, the Job Management System, and eBiokits. A set of reproducible, portable, and cloud-scalable pipelines to support the multiple H3Africa data types are also being developed and dockerized to enable execution on multiple computing infrastructures. 
In addition, new tools have been developed for analysis of the uniquely divergent African data and for downstream interpretation of prioritized variants. To provide support for these and other bioinformatics queries, an online bioinformatics helpdesk backed by broad consortium expertise has been established. Further support is provided by means of various modes of bioinformatics training. Conclusions: For the past 4 years, the development of infrastructure support and human capacity through H3ABioNet has significantly contributed to the establishment of African scientific networks, data analysis facilities, and training programs. Here, we describe the infrastructure and how it has affected genomics and bioinformatics research in Africa. Highlights: H3ABioNet is building capacity to enable analysis of genomic data in Africa. Infrastructure has been built for clinical and genomic data storage, management, and analysis. New algorithms and pipelines for African genomic data analysis have been developed. Data are being harmonized using ontologies to enable easy search and retrieval. Genomics training is implemented using various online and face-to-face approaches.
- Published
- 2017
21. Naïve Bayes for microRNA target predictions--machine learning for microRNA targets
- Author
-
Louise C. Showe, Segun Jung, Michael K. Showe, Andrew V. Kossenkov, and Malik Yousef
- Subjects
Statistics and Probability ,Molecular Sequence Data ,Sequence alignment ,Biology ,Machine learning ,computer.software_genre ,Biochemistry ,Pattern Recognition, Automated ,Naive Bayes classifier ,Bayes' theorem ,Artificial Intelligence ,microRNA ,Base sequence ,Molecular Biology ,Sequence ,Base Sequence ,business.industry ,Sequence Analysis, RNA ,Gene targeting ,Pattern recognition ,Bayes Theorem ,RNA Probes ,Computer Science Applications ,Computational Mathematics ,MicroRNAs ,Computational Theory and Mathematics ,Gene Targeting ,Artificial intelligence ,Target gene ,business ,computer ,Sequence Alignment ,Algorithms - Abstract
Motivation: Most computational methodologies for miRNA:mRNA target gene prediction use the seed segment of the miRNA and require cross-species sequence conservation in this region of the mRNA target. Methods that do not rely on conservation generate numbers of predictions, which are too large to validate. We describe a target prediction method (NBmiRTar) that does not require sequence conservation, using instead, machine learning by a naïve Bayes classifier. It generates a model from sequence and miRNA:mRNA duplex information from validated targets and artificially generated negative examples. Both the ‘seed’ and ‘out-seed’ segments of the miRNA:mRNA duplex are used for target identification. Results: The application of machine-learning techniques to the features we have used is a useful and general approach for microRNA target gene prediction. Our technique produces fewer false positive predictions and fewer target candidates to be tested. It exhibits higher sensitivity and specificity than algorithms that rely on conserved genomic regions to decrease false positive predictions. Availability: The NBmiRTar program is available at http://wotan.wistar.upenn.edu/NBmiRTar/ Contact: yousef@wistar.org Supplementary information: http://wotan.wistar.upenn.edu/NBmiRTar/
- Published
- 2007
22. Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data
- Author
-
Michael K. Showe, Louise C. Showe, Malik Yousef, and Segun Jung
- Subjects
Male ,Clustering high-dimensional data ,Computer science ,Gene Expression ,Feature selection ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Biochemistry ,Gene interaction ,Structural Biology ,Databases, Genetic ,Feature (machine learning) ,Humans ,Cluster analysis ,lcsh:QH301-705.5 ,Molecular Biology ,Regulation of gene expression ,business.industry ,Gene Expression Profiling ,Applied Mathematics ,Prostatic Neoplasms ,Pattern recognition ,Linear discriminant analysis ,Computer Science Applications ,Gene Expression Regulation, Neoplastic ,Support vector machine ,Gene expression profiling ,Statistical classification ,lcsh:Biology (General) ,Head and Neck Neoplasms ,Multigene Family ,lcsh:R858-859.7 ,Data mining ,Artificial intelligence ,business ,computer ,Research Article - Abstract
Background Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared, poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes as recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. Results We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) are used to remove genes based on their individual discriminant weights. Conclusion SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. 
SVM-RCE identifies clusters of correlated genes that when considered together provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
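The elimination loop described in this abstract can be sketched on its own. This is a simplified illustration and not the published SVM-RCE code: the clusters are supplied up front rather than produced by K-means, and a class-mean separation score stands in for the SVM-based cluster score. All names, parameters, and data below are hypothetical.

```python
def mean_separation_scorer(samples, labels):
    """Build a stand-in cluster scorer (NOT an SVM).

    A cluster is scored by the absolute difference between the two class
    means of its per-sample average expression.
    """
    def score(genes):
        per_sample = [sum(row[g] for g in genes) / len(genes) for row in samples]
        pos = [v for v, y in zip(per_sample, labels) if y == 1]
        neg = [v for v, y in zip(per_sample, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return score

def recursive_cluster_elimination(clusters, score_cluster, keep=1, drop_fraction=0.5):
    """Repeatedly drop the lowest-scoring gene clusters until `keep` remain."""
    surviving = dict(clusters)
    while len(surviving) > keep:
        scores = {cid: score_cluster(genes) for cid, genes in surviving.items()}
        # Drop a fraction of clusters per round, at least one, never past `keep`.
        n_drop = max(1, int(len(surviving) * drop_fraction))
        n_drop = min(n_drop, len(surviving) - keep)
        for cid in sorted(scores, key=scores.get)[:n_drop]:
            del surviving[cid]
    return surviving

# Toy data: rows are samples, columns are genes; two samples per class.
samples = [
    [5.0, 5.2, 1.0, 1.1, 3.0],
    [5.1, 4.9, 1.2, 0.9, 2.0],
    [1.0, 1.1, 1.1, 1.0, 3.1],
    [0.9, 1.0, 0.9, 1.2, 2.1],
]
labels = [1, 1, 0, 0]
clusters = {"A": [0, 1], "B": [2, 3], "C": [4]}

survivors = recursive_cluster_elimination(
    clusters, mean_separation_scorer(samples, labels), keep=1
)
```

Scoring whole clusters rather than single genes means correlated genes are kept or discarded together, which is the contrast with recursive feature elimination that the abstract draws.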
- Published
- 2007
23. Abstract 2185: Informatics framework for clustering and deriving gene signatures for prognostic stratification of cancer patients
- Author
-
Segun Jung, Yingtao Bi, and Ramana V. Davuluri
- Subjects
Cancer Research ,Oncology - Abstract
Stratification of cancer patients into different molecular groups is essential for developing targeted therapies. High-throughput technologies, such as microarrays and next-generation sequencing, have been extensively used for generating multi-omics data. Indeed, The Cancer Genome Atlas (TCGA) consortium, one of the most comprehensive and popular databases of cancer, has been accumulating large volumes of invaluable data for more than 30 cancer types, offering an unprecedented opportunity to gain new insights into cancer biology. Despite the technological advances, analyzing, integrating and translating the gene signatures across different platforms remains a computationally challenging task. Here, we developed a novel computational framework that integrates genomic and clinical data to stratify cancer patients into different molecular subgroups and predict clinically applicable phenotypes, such as survival. Application of this user-friendly framework derives platform-independent isoform-level gene signatures that can be translated from high-dimensional platforms (e.g., RNA-Seq) to clinically adaptable low-dimensional molecular assays (e.g., RT-PCR) for prognostic stratification. We applied the pipeline to two TCGA lung cancer datasets: Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC). Using independent test data, we achieved about 93% (LUAD) and 98% (LUSC) classification accuracy using fewer than 70 isoform-level gene signatures. The proposed informatics platform is applicable to other cancer types in the TCGA data portal. Note: This abstract was not presented at the meeting. Citation Format: Segun Jung, Yingtao Bi, Ramana V. Davuluri. Informatics framework for clustering and deriving gene signatures for prognostic stratification of cancer patients. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 2185. 
doi:10.1158/1538-7445.AM2015-2185
- Published
- 2015
24. Learning from positive examples when the negative class is undetermined- microRNA gene identification
- Author
-
Malik Yousef, Segun Jung, Michael K. Showe, and Louise C. Showe
- Subjects
lcsh:QH426-470 ,business.industry ,Computer science ,Applied Mathematics ,Research ,MicroRNA Gene ,External validation ,Machine learning ,computer.software_genre ,Matthews correlation coefficient ,Support vector machine ,Naive Bayes classifier ,lcsh:Genetics ,lcsh:Biology (General) ,Computational Theory and Mathematics ,Structural Biology ,Artificial intelligence ,Data mining ,business ,computer ,Classifier (UML) ,lcsh:QH301-705.5 ,Molecular Biology - Abstract
Background The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designation of the negative class can be problematic and if it is not properly done can affect the performance of the classifier dramatically and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species. Results Of all methods tested, we found that 2-class naive Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One class methods showed average accuracies of 70–80% versus 90% for the two 2-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs. Conclusion One and two class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one class methods is that they eliminate guessing at the optimal features for the negative class when they are not well defined. In these cases one-class methods can be superior to two-class methods when the features which are chosen as representative of that positive class are well defined. 
Availability The OneClassmiRNA program is available at: [1]
- Published
- 2008
25. Graph-based sampling for approximating global helical topologies of RNA.
- Author
-
Namhee Kim, Laing, Christian, Elmetwaly, Shereef, Segun Jung, Curuksu, Jeremy, and Schlick, Tamar
- Subjects
MOLECULAR structure of RNA ,TOPOLOGY ,NUCLEOTIDES ,DATA mining ,STATISTICAL sampling - Abstract
A current challenge in RNA structure prediction is the description of global helical arrangements compatible with a given secondary structure. Here we address this problem by developing a hierarchical graph sampling/data mining approach to reduce conformational space and accelerate global sampling of candidate topologies. Starting from a 2D structure, we construct an initial graph from size measures deduced from solved RNAs and junction topologies predicted by our data-mining algorithm RNAJAG trained on known RNAs. We sample these graphs in 3D space guided by knowledge-based statistical potentials derived from bending and torsion measures of internal loops as well as radii of gyration for known RNAs. Graph sampling results for 30 representative RNAs are analyzed and compared with reference graphs from both solved structures and predicted structures by available programs. This comparison indicates promise for our graph-based sampling approach for characterizing global helical arrangements in large RNAs: graph rmsds range from 2.52 to 28.24 Å for RNAs of size 25-158 nucleotides, and more than half of our graph predictions improve upon other programs. The efficiency in graph sampling, however, implies an additional step of translating candidate graphs into atomic models. Such models can be built with the same idea of graph partitioning and build-up procedures we used for RNA design. [ABSTRACT FROM AUTHOR]
- Published
- 2014
26. Learning from positive examples when the negative class is undetermined- microRNA gene identification.
- Author
-
Yousef, Malik, Segun Jung, Showe, Louise C., and Showe, Michael K.
- Subjects
*MACHINE learning , *RNA , *NUCLEOTIDE sequence , *GENETICS , *MATHEMATICAL models , *NUCLEIC acid probes , *COMPUTATIONAL biology - Abstract
Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designation of the negative class can be problematic and if it is not properly done can affect the performance of the classifier dramatically and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species. Results: Of all methods tested, we found that 2-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70-80% versus 90% for the two 2-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs. Conclusion: One- and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when they are not well defined. In these cases one-class methods can be superior to two-class methods when the features which are chosen as representative of that positive class are well defined.
Availability: The OneClassmiRNA program is available at: [1] [ABSTRACT FROM AUTHOR]
- Published
- 2008
27. MOESM1 of Genetic deletion of Sphk2 confers protection against Pseudomonas aeruginosa mediated differential expression of genes related to virulent infection and inflammation in mouse lung
- Author
-
Ebenezer, David, Panfeng Fu, Yashaswin Krishnan, Maienschein-Cline, Mark, Hu, Hong, Segun Jung, Madduri, Ravi, Arbieva, Zarema, Anantha Harijith, and Viswanathan Natarajan
- Subjects
3. Good health - Abstract
Additional file 1 : Figure S1. Lungs from WT or Sphk2−/− mice were removed for protein extraction as described in Materials and Methods. Whole lung homogenates were subjected to SDS-PAGE and Western blotting (A). Immunoblot showed almost absent expression of SPHK2 in the Sphk2−/− mice compared to the WT mice.
28. MOESM2 of Genetic deletion of Sphk2 confers protection against Pseudomonas aeruginosa mediated differential expression of genes related to virulent infection and inflammation in mouse lung
- Author
-
Ebenezer, David, Panfeng Fu, Yashaswin Krishnan, Maienschein-Cline, Mark, Hu, Hong, Segun Jung, Madduri, Ravi, Arbieva, Zarema, Anantha Harijith, and Viswanathan Natarajan
- Subjects
3. Good health - Abstract
Additional file 2. Details of the primers used to perform RT-PCR on genes used to validate the RNAseq data.
29. MOESM1 of Genetic deletion of Sphk2 confers protection against Pseudomonas aeruginosa mediated differential expression of genes related to virulent infection and inflammation in mouse lung
- Author
-
Ebenezer, David, Panfeng Fu, Yashaswin Krishnan, Maienschein-Cline, Mark, Hu, Hong, Segun Jung, Madduri, Ravi, Arbieva, Zarema, Anantha Harijith, and Viswanathan Natarajan
- Subjects
3. Good health - Abstract
Additional file 1 : Figure S1. Lungs from WT or Sphk2−/− mice were removed for protein extraction as described in Materials and Methods. Whole lung homogenates were subjected to SDS-PAGE and Western blotting (A). Immunoblot showed almost absent expression of SPHK2 in the Sphk2−/− mice compared to the WT mice.
30. MOESM2 of Genetic deletion of Sphk2 confers protection against Pseudomonas aeruginosa mediated differential expression of genes related to virulent infection and inflammation in mouse lung
- Author
-
Ebenezer, David, Panfeng Fu, Yashaswin Krishnan, Maienschein-Cline, Mark, Hu, Hong, Segun Jung, Madduri, Ravi, Arbieva, Zarema, Anantha Harijith, and Viswanathan Natarajan
- Subjects
3. Good health - Abstract
Additional file 2. Details of the primers used to perform RT-PCR on genes used to validate the RNAseq data.
31. Naive Bayes for microRNA target predictions machine learning for microRNA targets.
- Author
-
Malik Yousef, Segun Jung, Andrew V. Kossenkov, Louise C. Showe, and Michael K. Showe
- Subjects
*MESSENGER RNA , *GENETICS , *NUCLEOTIDE sequence , *ALGORITHMS - Abstract
Motivation: Most computational methodologies for miRNA:mRNA target gene prediction use the seed segment of the miRNA and require cross-species sequence conservation in this region of the mRNA target. Methods that do not rely on conservation generate numbers of predictions, which are too large to validate. We describe a target prediction method (NBmiRTar) that does not require sequence conservation, using instead, machine learning by a naïve Bayes classifier. It generates a model from sequence and miRNA:mRNA duplex information from validated targets and artificially generated negative examples. Both the “seed” and “out-seed” segments of the miRNA:mRNA duplex are used for target identification. Results: The application of machine-learning techniques to the features we have used is a useful and general approach for microRNA target gene prediction. Our technique produces fewer false positive predictions and fewer target candidates to be tested. It exhibits higher sensitivity and specificity than algorithms that rely on conserved genomic regions to decrease false positive predictions. Availability: The NBmiRTar program is available at http://wotan.wistar.upenn.edu/NBmiRTar/ Contact: yousef@wistar.org Supplementary information: http://wotan.wistar.upenn.edu/NBmiRTar/ [ABSTRACT FROM AUTHOR]
- Published
- 2007