10 results
Search Results
2. Fuzzy logic based approaches for gene regulatory network inference.
- Author
-
Raza, Khalid
- Subjects
-
*GENE regulatory networks, *FUZZY logic, *COMPUTATIONAL intelligence, *FUZZY neural networks, *INFORMATION theory, *SYSTEMS biology, *LOGIC, *MOLECULAR structure, *BIOINFORMATICS
- Abstract
The rapid advancements in high-throughput techniques have fueled large-scale production of biological data at very affordable costs. Some of these techniques are microarrays and next-generation sequencing, which provide genome-level insight into living cells. As a result, the size of most biological databases, such as NCBI-GEO and NCBI-SRA, is growing exponentially. These biological data are analyzed using various computational techniques for knowledge discovery, which is also one of the objectives of bioinformatics research. A gene regulatory network (GRN) is a gene-gene interaction network that plays a pivotal role in understanding gene regulation processes and disease mechanisms at the molecular level. Over the last couple of decades, researchers have been interested in developing computational algorithms for GRN inference (GRNI) from high-throughput experimental data. Several computational approaches have been proposed for inferring GRNs from gene expression data, including statistical techniques (correlation coefficient), information theory (mutual information), regression-based approaches, probabilistic approaches (Bayesian networks, naïve Bayes), artificial neural networks and fuzzy logic. Fuzzy logic, along with its hybridization with other intelligent approaches, is a well-studied technique in GRNI due to its several advantages. In this paper, we present a consolidated review of fuzzy logic and its hybrid approaches developed during the last two decades for GRNI. [ABSTRACT FROM AUTHOR]
- Published
- 2019
3. isGPT: An optimized model to identify sub-Golgi protein types using SVM and Random Forest based feature selection.
- Author
-
Rahman, M. Saifur, Rahman, Md. Khaledur, Kaykobad, M., and Rahman, M. Sohel
- Subjects
-
*GOLGI apparatus, *PROTEIN synthesis, *EUKARYOTIC cells, *RANDOM forest algorithms, *SUPPORT vector machines, *PROTEIN analysis, *AMINO acids, *ANIMAL experimentation, *COMPARATIVE studies, *CYTOPLASM, *DATABASES, *RESEARCH methodology, *MEDICAL cooperation, *OLIGOPEPTIDES, *PROTEINS, *RESEARCH, *BIOINFORMATICS, *EVALUATION research, RESEARCH evaluation
- Abstract
The Golgi Apparatus (GA) is a key organelle for protein synthesis within the eukaryotic cell. The main task of the GA is to modify and sort proteins for transport throughout the cell. Proteins permeate through the GA on the ER (Endoplasmic Reticulum) facing side (cis side) and depart on the other side (trans side). Based on this phenomenon, we get two types of GA proteins, namely, cis-Golgi proteins and trans-Golgi proteins. Any dysfunction of GA proteins can result in congenital glycosylation disorders and some other forms of difficulties that may lead to neurodegenerative and inherited diseases like diabetes, cancer and cystic fibrosis. So, the exact classification of GA proteins may contribute to drug development, which will further help in medication. In this paper, we focus on building a new computational model that not only introduces easy ways to extract features from protein sequences but also optimizes classification of trans-Golgi and cis-Golgi proteins. After feature extraction, we have employed a Random Forest (RF) model to rank the features based on the importance scores obtained from it. After selecting the top-ranked features, we have applied a Support Vector Machine (SVM) to classify the sub-Golgi proteins. We have trained a regression model as well as a classification model and found the former to be superior. The model shows improved performance over all previous methods. As the benchmark dataset is significantly imbalanced, we have applied the Synthetic Minority Over-sampling Technique (SMOTE) to the dataset to make it balanced and have conducted experiments on both versions. Our method, namely, identification of sub-Golgi Protein Types (isGPT), achieves accuracy values of 95.4%, 95.9% and 95.3% for the 10-fold cross-validation test, jackknife test and independent test, respectively. According to different performance metrics, isGPT performs better than state-of-the-art techniques.
The source code of isGPT, along with relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/isGPT. [ABSTRACT FROM AUTHOR]
- Published
- 2018
4. Subcellular localization prediction of apoptosis proteins based on evolutionary information and support vector machine.
- Author
-
Xiang, Qilin, Liao, Bo, Li, Xianhong, Xu, Huimin, Chen, Jing, Shi, Zhuoxing, Dai, Qi, and Yao, Yuhua
- Subjects
-
*APOPTOSIS, *PROTEIN analysis, *SUPPORT vector machines, *PREDICTION models, *ACCURACY, *AMINO acids, *DATABASES, *BIOLOGICAL evolution, *PROTEINS, *BIOINFORMATICS
- Abstract
Objectives: In this paper, a high-quality sequence encoding scheme is proposed for predicting the subcellular location of apoptosis proteins. Methods: In the proposed methodology, novel evolutionary-conservative information is introduced to represent protein sequences. Meanwhile, based on the golden section proportion in mathematics, the position-specific scoring matrix (PSSM) is divided into several blocks. These features are then classified by a support vector machine (SVM), and the predictive capability of the proposed method is evaluated by the jackknife test. Results: The results show that the golden section method is better than the method without segmentation. The overall accuracy for ZD98 and CL317 is 98.98% and 91.11%, respectively, which indicates that our method can play a complementary role to the existing methods in the relevant areas. Conclusions: The proposed feature representation is powerful and greatly improves prediction accuracy, which indicates that our method provides state-of-the-art performance for predicting the subcellular location of apoptosis proteins. [ABSTRACT FROM AUTHOR]
- Published
- 2017
5. A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli.
- Author
-
Zhang, Xiongpan, Peng, Qinke, and Sun, Shiquan
- Subjects
-
*TRANSCRIPTION factors, *ESCHERICHIA coli, *BINDING site assay, *CALCULUS of tensors, *DNA, *ALGORITHMS, *BINDING sites, *BIOINFORMATICS
- Abstract
Background: Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding of transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-sample rate (i.e., number of dimensions/number of samples) is large, concatenating different physicochemical properties into a one-dimensional vector is not only likely to lose some structural information, but also poses significant challenges to recognition methods. Materials and Method: In this paper, we introduce a purely geometric representation method, the tensor (also called a multidimensional array), to represent TFs using their physicochemical properties. Accompanying the multidimensional array representation, we also develop a tensor-based recognition method, the tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by the average F-measure on 51 Escherichia coli TFs from the RegulonDB database. Results: In our first experiment, the results show that multiple nucleotide properties can obtain more power than dinucleotide properties. In the second experiment, the results demonstrate that our method can gain increased prediction power, roughly a 33% improvement over the best result from existing methods. Conclusion: The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation on a real data application via a series of experiments. This method can provide further insight into the mechanism of TF binding and be of great use for metabolic engineering applications. [ABSTRACT FROM AUTHOR]
- Published
- 2017
6. Predicting overlapping protein complexes from weighted protein interaction graphs by gradually expanding dense neighborhoods.
- Author
-
Dimitrakopoulos, Christos, Theofilatos, Konstantinos, Pegkas, Andreas, Likothanassis, Spiros, and Mavroudi, Seferina
- Subjects
-
*PROTEIN-protein interactions, *MICROCLUSTERS, *SACCHAROMYCES cerevisiae, *GENE ontology, *MOLECULAR weights, *ALGORITHMS, *CLUSTER analysis (Statistics), *MOLECULAR probes, *METABOLISM, *YEAST, *BIOINFORMATICS
- Abstract
Objective: Proteins are vital biological molecules driving many fundamental cellular processes. They rarely act alone, but form interacting groups called protein complexes. The study of protein complexes is a key goal in systems biology. Recently, large protein-protein interaction (PPI) datasets have been published and a plethora of computational methods that provide new ideas for the prediction of protein complexes have been implemented. However, most of the methods suffer from two major limitations: first, they do not account for proteins participating in multiple functions and second, they are unable to handle weighted PPI graphs. Moreover, the problem remains open as existing algorithms and tools are insufficient in terms of predictive metrics. Method: In the present paper, we propose gradually expanding neighborhoods with adjustment (GENA), a new algorithm that gradually expands neighborhoods in a graph starting from highly informative "seed" nodes. GENA considers proteins as multifunctional molecules, allowing them to participate in more than one protein complex. In addition, GENA accepts weighted PPI graphs by using a weighted evaluation function for each cluster. Results: In experiments with datasets from Saccharomyces cerevisiae and human, GENA outperformed Markov clustering, restricted neighborhood search and clustering with overlapping neighborhood expansion, three state-of-the-art methods for computationally predicting protein complexes. Seven PPI networks and seven evaluation datasets were used in total. GENA outperformed existing methods in 16 out of 18 experiments, achieving an average improvement of 5.5% when the maximum matching ratio metric was used. Our method was able to discover functionally homogeneous protein clusters and uncover important network modules in a Parkinson expression dataset. When used on the human networks, around 47% of the detected clusters were enriched in gene ontology (GO) terms with depth higher than five in the GO hierarchy. Conclusions: In the present manuscript, we introduce a new method for the computational prediction of protein complexes by making the realistic assumption that proteins participate in multiple protein complexes and cellular functions. Our method can detect accurate and functionally homogeneous clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2016
7. Effective gene expression data generation framework based on multi-model approach.
- Author
-
Sirin, Utku, Erdogdu, Utku, Polat, Faruk, Tan, Mehmet, and Alhajj, Reda
- Subjects
-
*GENETIC regulation, *ACQUISITION of data, *COMPUTATIONAL biology, *ARTIFICIAL intelligence in medicine, *BOOLEAN algebra, *ALGORITHMS, *MATHEMATICAL models, *MOLECULAR structure, *BIOINFORMATICS, *THEORY, *GENE expression profiling
- Abstract
Objective: To overcome the lack of samples in gene expression data sets that have thousands of genes but only a small number of samples, which challenges the computational methods that use them. Methods and Material: This paper introduces a multi-model artificial gene expression data generation framework where different gene regulatory network (GRN) models contribute to the final set of samples based on the characteristics of their underlying paradigms. In the first stage, we build different GRN models and sample data from each of them separately. Then, we pool the generated samples into a rich set of gene expression samples, and finally select the best of the generated samples using a multi-objective selection method that measures the quality of the generated samples from three different aspects: compatibility, diversity and coverage. We use four alternative GRN models, namely, ordinary differential equations, probabilistic Boolean networks, multi-objective genetic algorithm and hierarchical Markov model. Results: We conducted a comprehensive set of experiments based on both real-life biological and synthetic gene expression data sets. We show that our multi-objective sample selection mechanism effectively combines samples from different models having up to 95% compatibility, 10% diversity and 50% coverage. We show that the samples generated by our framework have up to 1.5x higher compatibility, 2x higher diversity and 2x higher coverage than the samples generated by the individual models that the multi-model framework uses. Moreover, the results show that the GRNs inferred from the samples generated by our framework can have 2.4x higher precision, 12x higher recall, and 5.4x higher f-measure values than the GRNs inferred from the original gene expression samples. Conclusions: We show that we can significantly improve the quality of generated gene expression samples by integrating different computational models into one unified framework without dealing with the complex internal details of each individual model. Moreover, the rich set of artificial gene expression samples is able to capture some biological relations that cannot even be captured by the original gene expression data set. [ABSTRACT FROM AUTHOR]
- Published
- 2016
8. The feature selection bias problem in relation to high-dimensional gene data.
- Author
-
Krawczuk, Jerzy and Łukaszuk, Tomasz
- Subjects
-
*FEATURE selection, *GENETIC databases, *DATA mining, *REGRESSION analysis, *LEUKEMIA diagnosis, *ALGORITHMS, *COMPARATIVE studies, *DATABASES, *DECISION making, *GENES, *INFORMATION science, *RESEARCH methodology, *MEDICAL cooperation, *RESEARCH, *BIOINFORMATICS, *EVALUATION research, *RESEARCH bias, *OLIGONUCLEOTIDE arrays, *GENE expression profiling, RESEARCH evaluation
- Abstract
Objective: Feature selection is a technique widely used in data mining. The aim is to select the best subset of features relevant to the problem being considered. In this paper, we consider feature selection for the classification of gene datasets. Gene data is usually composed of just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data. However, it is not easy to find one that will evaluate new data as well as it fits the learning data. This overfitting issue is well known in classification and regression, but it also applies to feature selection. Methods and Materials: We address this problem and investigate its importance in an empirical study of four feature selection methods applied to seven high-dimensional gene datasets. We chose datasets that are well studied in the literature: colon cancer, leukemia and breast cancer. All the datasets are characterized by a significant number of features and the presence of exactly two decision classes. The feature selection methods used are ReliefF, minimum redundancy maximum relevance, support vector machine-recursive feature elimination and relaxed linear separability. Results: Our main result reveals the existence of positive feature selection bias in all 28 experiments (7 datasets and 4 feature selection methods). Bias was calculated as the difference between validation and test accuracies and ranges from 2.6% to as much as 41.67%. The validation accuracy (biased accuracy) was calculated on the same dataset on which the feature selection was performed. The test accuracy was calculated for data that was not used for feature selection (by so-called external cross-validation). Conclusions: This work provides evidence that using the same dataset for feature selection and learning is not appropriate. We recommend using cross-validation for feature selection in order to reduce selection bias. [ABSTRACT FROM AUTHOR]
- Published
- 2016
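The selection bias reported in entry 8 can be reproduced in miniature. The sketch below is only an illustration and not the authors' experimental setup: it assumes pure-noise data with random labels, a simple mean-difference filter in place of ReliefF, minimum redundancy maximum relevance, SVM-RFE or relaxed linear separability, and a nearest-centroid classifier. All function names are hypothetical.

```python
import random

def top_k_features(X, y, rows, k):
    """Rank features by |mean(class 1) - mean(class 0)| using only the given rows."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        m1 = sum(X[i][j] for i in pos) / len(pos)
        m0 = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid_predict(X, y, train, test, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    cents = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cents[c] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    preds = []
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in (0, 1)}
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds

def cv_accuracy(X, y, k, n_folds, select_inside):
    """Cross-validated accuracy; select_inside=True redoes selection on each training fold."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # Biased variant: features are chosen once, using every sample (test folds included).
    global_feats = None if select_inside else top_k_features(X, y, list(range(n)), k)
    correct = 0
    for test in folds:
        train = [i for i in range(n) if i not in test]
        feats = top_k_features(X, y, train, k) if select_inside else global_feats
        preds = centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / n

rng = random.Random(0)
n_samples, n_features = 40, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels carry no information about X

biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False)
unbiased = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True)
print(f"selection on all data: {biased:.2f}; selection inside folds: {unbiased:.2f}")
```

Because the labels are unrelated to the features, an honest estimate should sit near chance. Selecting features on all samples before cross-validating typically inflates the estimate well above that, which is the kind of gap between validation and test accuracy the abstract calls positive feature selection bias.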
9. Identification of transcription factors that may reprogram lung adenocarcinoma
- Author
-
Yu-Hang Zhang, Tao Huang, Yu-Dong Cai, and Chenglin Liu
- Subjects
0301 basic medicine, Lung Neoplasms, Transcription, Genetic, Gene regulatory network, Medicine (miscellaneous), Adenocarcinoma of Lung, Adenocarcinoma, Biology, medicine.disease_cause, Bioinformatics, Epigenesis, Genetic, Malignant transformation, 03 medical and health sciences, Artificial Intelligence, Transcription (biology), Databases, Genetic, medicine, Humans, Gene Regulatory Networks, Epigenetics, Transcription factor, Computational Biology, medicine.disease, Gene Expression Regulation, Neoplastic, Cell Transformation, Neoplastic, 030104 developmental biology, Cancer research, Carcinogenesis, Reprogramming, Signal Transduction, Transcription Factors
- Abstract
The method can identify the core transcription factors that regulate lung adenocarcinoma associated genes. Seven core transcription factors are detected, and have been reported to relate to tumorigenesis of lung adenocarcinoma. The discovered functional core set may reverse malignant transformation and reprogram cancer cells. Background: Lung adenocarcinoma is one of the most threatening diseases to human health. Although many efforts have been devoted to its genetic study, little research has focused on the transcription factors that regulate tumor initiation and progression by affecting the transcription of multiple downstream genes. It has been shown that proper transcription factors may mediate the direct reprogramming of cancer cells and reverse tumorigenesis at the epigenetic and transcriptional levels. Methods: In this paper, a computational method is proposed to identify the core transcription factors that can regulate as many lung adenocarcinoma associated genes as possible with as little redundancy as possible. A greedy strategy is applied to find the smallest collection of transcription factors whose downstream targets cover the differentially expressed genes. The optimal subset, which is most enriched in the differentially expressed genes, is then selected. Results: Seven core transcription factors (MCM4, VWF, ECT2, RBMS3, LIMCH1, MYBL2 and FBXL7) are detected, and have been reported to contribute to tumorigenesis of lung adenocarcinoma. The identification of these transcription factors provides new insight into their oncogenic role in tumor initiation and progression, and benefits the discovery of a functional core set that may reverse malignant transformation and reprogram cancer cells.
- Published
- 2017
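The greedy covering step mentioned in the Methods of entry 9 resembles the classic greedy heuristic for set cover. The sketch below shows that generic heuristic only, under stated assumptions: the transcription factor and gene names are made up, the function name is hypothetical, and the enrichment-based selection of the final subset described in the abstract is not modeled. A greedy pass gives a small cover but is not guaranteed to find the smallest one.

```python
def greedy_cover(targets_by_tf, genes_to_cover):
    """Repeatedly pick the factor whose targets cover the most still-uncovered genes."""
    uncovered = set(genes_to_cover)
    chosen = []
    while uncovered:
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(targets_by_tf),
                   key=lambda tf: len(targets_by_tf[tf] & uncovered))
        gain = targets_by_tf[best] & uncovered
        if not gain:
            break  # the remaining genes are not targeted by any factor
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

# Hypothetical toy input: factor -> set of downstream target genes.
targets = {
    "TF_A": {"g1", "g2", "g3"},
    "TF_B": {"g3", "g4"},
    "TF_C": {"g4", "g5"},
    "TF_D": {"g1"},
}
chosen, uncovered = greedy_cover(targets, {"g1", "g2", "g3", "g4", "g5"})
print(chosen, uncovered)
```

In this toy input, TF_A is picked first because it covers three genes, and TF_C then covers the remaining two, so TF_B and TF_D are redundant.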
10. A hierarchical classifier based on human blood plasma fluorescence for non-invasive colorectal cancer screening
- Author
-
Karin Becker, Felipe Soares, and Michel José Anzanello
- Subjects
0301 basic medicine, Support Vector Machine, Computer science, Colorectal cancer, Population, Colonic Polyps, Medicine (miscellaneous), Feature selection, Bioinformatics, Hierarchical classifier, Adenomatous Polyps, 03 medical and health sciences, 0302 clinical medicine, Predictive Value of Tests, Artificial Intelligence, Biomarkers, Tumor, medicine, Humans, education, Early Detection of Cancer, education.field_of_study, business.industry, Reproducibility of Results, Cancer, Pattern recognition, medicine.disease, Support vector machine, Identification (information), Spectrometry, Fluorescence, 030104 developmental biology, Colorectal cancer screening, Case-Control Studies, 030211 gastroenterology & hepatology, Artificial intelligence, Colorectal Neoplasms, business
- Abstract
Colorectal cancer (CRC) is a leading cause of death by cancer, and screening programs for its early identification are at the heart of the increasing survival rates. To motivate population participation, non-invasive, accurate, scalable and cost-effective diagnosis methods are required. Blood fluorescence spectroscopy provides rich information that can be used for cancer identification. The main challenges in analyzing blood fluorescence data for CRC classification are related to its high dimensionality and inherent variability, especially when analyzing a small number of samples. In this paper, we present a hierarchical classification method based on plasma fluorescence to identify not only CRC, but also adenomas and other non-malignant colorectal findings that may require further medical investigation. A feature selection algorithm is proposed to deal with the high dimensionality and select discriminant fluorescence wavelengths. These are used to train a binary support vector machine (SVM) in the first level to identify the CRC samples. The remaining samples are then presented to a one-class SVM trained on healthy subjects to detect deviant samples, and thus non-malignant findings. This hierarchical design, together with the one-class SVM, aims to reduce the effects of small samples and high variability. Using a dataset analyzed in previous studies comprised of 12,341 wavelengths, we achieved much superior results. Sensitivity and specificity are 0.87 and 0.95 for CRC detection, and 0.60 and 0.79 for non-malignant findings, respectively. Compared to related work, the proposed method achieved better accuracy, required fewer features, and provides a unified approach that expands CRC detection to non-malignant findings.
- Published
- 2017