22 results
Search Results
2. Fuzzy logic based approaches for gene regulatory network inference.
- Author
-
Raza, Khalid
- Subjects
-
*GENE regulatory networks, *FUZZY logic, *COMPUTATIONAL intelligence, *FUZZY neural networks, *INFORMATION theory, *SYSTEMS biology, *LOGIC, *MOLECULAR structure, *BIOINFORMATICS
- Abstract
The rapid advancements in high-throughput techniques have fueled large-scale production of biological data at very affordable costs. These techniques include microarrays and next-generation sequencing, which provide genome-level insight into living cells. As a result, most biological databases, such as NCBI-GEO and NCBI-SRA, are growing exponentially. These biological data are analyzed using various computational techniques for knowledge discovery, which is also one of the objectives of bioinformatics research. A gene regulatory network (GRN) is a gene-gene interaction network that plays a pivotal role in understanding gene regulation processes and disease mechanisms at the molecular level. Over the last couple of decades, researchers have been interested in developing computational algorithms for GRN inference (GRNI) from high-throughput experimental data. Several computational approaches have been proposed for inferring GRNs from gene expression data, including statistical techniques (correlation coefficient), information theory (mutual information), regression-based approaches, probabilistic approaches (Bayesian networks, naïve Bayes), artificial neural networks and fuzzy logic. Fuzzy logic, along with its hybridization with other intelligent approaches, is a well-studied technique in GRNI due to its several advantages. In this paper, we present a consolidated review of fuzzy logic and its hybrid approaches developed during the last two decades for GRNI. [ABSTRACT FROM AUTHOR]
- Published
- 2019
3. A graph theoretic approach to protein structure selection
- Author
-
Vassura, Marco, Margara, Luciano, Fariselli, Piero, and Casadio, Rita
- Subjects
-
*PROTEIN folding, *BIOINFORMATICS, *LIFE sciences, *GRAPHIC methods
- Abstract
Summary: Objective: Protein structure prediction (PSP) aims to reconstruct the 3D structure of a given protein starting from its primary structure (chain of amino acid residues). It is a well-known fact that the 3D structure of a protein only depends on its primary structure. PSP is one of the most important and still unsolved problems in computational biology. Protein structure selection (PSS), instead of reconstructing a 3D model for the given chain, aims to select among a given, possibly large, number of 3D structures (called decoys) those that are closer (according to a given notion of distance) to the original (unknown) one. In this paper we address the PSS problem using graph theoretic techniques. Methods and materials: Existing methods for solving PSS make use of suitably defined energy functions which heavily rely on the primary structure of the protein and on protein chemistry. In this paper we present a new approach to PSS which does not take advantage of the knowledge of the primary structure of the protein but only depends on the graph theoretic properties of the decoy graphs (vertices represent residues and edges represent pairs of residues whose Euclidean distance is less than or equal to a fixed threshold). Results: Even if our methods only rely on approximate geometric information, experimental results show that some of the adopted graph properties score similarly to energy-based filtering functions in selecting the best decoys. Conclusion: Our results highlight the principal role of geometric information in PSS, setting a new starting point and filtering method for existing energy function-based techniques. [Copyright Elsevier]
- Published
- 2009
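The decoy graph described in the abstract above (vertices are residues, edges join residues within a fixed distance threshold) can be sketched as follows. This is only an illustration: the one-point-per-residue coordinates, the 8.0 threshold, and the `average_degree` property are assumptions, not the graph properties the authors actually evaluated.

```python
from itertools import combinations
from math import dist


def contact_graph(coords, threshold=8.0):
    """Build adjacency sets: residues i and j are linked if their
    Euclidean distance is <= threshold (threshold value is assumed)."""
    graph = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def average_degree(graph):
    """One simple graph property that could be used to score a decoy."""
    if not graph:
        return 0.0
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)


# Toy decoy with four residues, one 3D point per residue (made-up data).
decoy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
graph = contact_graph(decoy)
print(graph)                  # {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
print(average_degree(graph))  # 1.5
```

Because the graph uses only geometry, no sequence or chemistry information is needed to compute such a score, which matches the claim made in the abstract.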
4. isGPT: An optimized model to identify sub-Golgi protein types using SVM and Random Forest based feature selection.
- Author
-
Rahman, M. Saifur, Rahman, Md. Khaledur, Kaykobad, M., and Rahman, M. Sohel
- Subjects
-
*GOLGI apparatus, *PROTEIN synthesis, *EUKARYOTIC cells, *RANDOM forest algorithms, *SUPPORT vector machines, *PROTEIN analysis, *AMINO acids, *ANIMAL experimentation, *COMPARATIVE studies, *CYTOPLASM, *DATABASES, *RESEARCH methodology, *MEDICAL cooperation, *OLIGOPEPTIDES, *PROTEINS, *RESEARCH, *BIOINFORMATICS, *EVALUATION research, RESEARCH evaluation
- Abstract
The Golgi Apparatus (GA) is a key organelle for protein synthesis within the eukaryotic cell. The main task of the GA is to modify and sort proteins for transport throughout the cell. Proteins permeate through the GA on the ER (Endoplasmic Reticulum) facing side (cis side) and depart on the other side (trans side). Based on this phenomenon, we get two types of GA proteins, namely, cis-Golgi proteins and trans-Golgi proteins. Any dysfunction of GA proteins can result in congenital glycosylation disorders and some other forms of difficulties that may lead to neurodegenerative and inherited diseases like diabetes, cancer and cystic fibrosis. So, the exact classification of GA proteins may contribute to drug development which will further help in medication. In this paper, we focus on building a new computational model that not only introduces easy ways to extract features from protein sequences but also optimizes classification of trans-Golgi and cis-Golgi proteins. After feature extraction, we employed a Random Forest (RF) model to rank the features based on the importance scores obtained from it. After selecting the top-ranked features, we applied a Support Vector Machine (SVM) to classify the sub-Golgi proteins. We trained a regression model as well as a classification model and found the former to be superior. The model shows improved performance over all previous methods. As the benchmark dataset is significantly imbalanced, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to the dataset to make it balanced and conducted experiments on both versions. Our method, namely, identification of sub-Golgi Protein Types (isGPT), achieves accuracy values of 95.4%, 95.9% and 95.3% for the 10-fold cross-validation test, jackknife test and independent test, respectively. According to different performance metrics, isGPT performs better than state-of-the-art techniques.
The source code of isGPT, along with relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/isGPT. [ABSTRACT FROM AUTHOR]
- Published
- 2018
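The balancing step named in the isGPT abstract, SMOTE, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that core idea in plain Python; the function name `smote`, the choice `k=2`, and the toy points are assumptions, and it is not the code from the isGPT repository.

```python
import random
from math import dist


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Made-up minority class: four points on the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, n_new=6)
print(synthetic)
```

Every synthetic point lies on a segment between two real minority samples, so the new points stay inside the region the minority class already occupies.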
5. iACP-GAEnsC: Evolutionary genetic algorithm based ensemble classification of anticancer peptides by utilizing hybrid feature space.
- Author
-
Akbar, Shahid, Hayat, Maqsood, Iqbal, Muhammad, and Jan, Mian Ahmad
- Subjects
-
*IMAGING of cancer, *GENETIC algorithms, *CLASSIFICATION algorithms, *ANTINEOPLASTIC agents, *CANCER chemotherapy, *DRUG design, *TUMOR treatment, *COMPUTER simulation, *PEPTIDES, *BIOINFORMATICS, *AMINO acids, *SEQUENCE analysis
- Abstract
Cancer is a fatal disease, responsible for one-quarter of all deaths in developed countries. Traditional anticancer therapies, such as chemotherapy and radiation, are highly expensive, susceptible to errors and ineffective. These conventional techniques induce severe side-effects on human cells. Due to the perilous impact of cancer, the development of an accurate and highly efficient intelligent computational model is desirable for the identification of anticancer peptides. In this paper, an evolutionary intelligent genetic algorithm-based ensemble model, 'iACP-GAEnsC', is proposed for the identification of anticancer peptides. In this model, the protein sequences are formulated using three different discrete feature representation methods, i.e., amphiphilic pseudo amino acid composition, g-gap dipeptide composition, and reduced amino acid alphabet composition. The performance of the extracted feature spaces is investigated separately, and the spaces are then merged to exhibit the significance of hybridization. In addition, the predicted results of individual classifiers are combined using an optimized genetic algorithm and the simple majority technique in order to enhance the true classification rate. It is observed that genetic algorithm-based ensemble classification outperforms the individual classifiers as well as the simple majority voting-based ensemble. The performance of genetic algorithm-based ensemble classification is highest on the hybrid feature space, with an accuracy of 96.45%. In comparison to the existing techniques, the 'iACP-GAEnsC' model has achieved remarkable improvement in terms of various performance metrics. Based on the simulation results, it is observed that the 'iACP-GAEnsC' model might be a leading tool in the field of drug design and proteomics for researchers. [ABSTRACT FROM AUTHOR]
- Published
- 2017
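One of the three feature representations listed in the abstract above, g-gap dipeptide composition, can be illustrated with a short sketch. It assumes the common convention that a g-gap pair consists of residues at positions i and i + g + 1 (so g = 0 gives ordinary dipeptides) and that the feature vector holds normalized frequencies over the 400 possible pairs; the paper's exact definition may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def g_gap_dipeptide_composition(sequence, g=1):
    """Return a 400-dimensional vector of frequencies of residue pairs
    separated by g intervening residues (convention assumed here)."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


# Made-up peptide: with g=1 the pairs are A_D, C_A and D_C.
features = g_gap_dipeptide_composition("ACDAC", g=1)
print(len(features))                  # 400
print(features[PAIRS.index("AD")])    # 0.3333333333333333
```

Each peptide maps to a fixed-length vector regardless of its length, which is what allows sequences of different sizes to be fed to the same classifier.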
6. Subcellular localization prediction of apoptosis proteins based on evolutionary information and support vector machine.
- Author
-
Xiang, Qilin, Liao, Bo, Li, Xianhong, Xu, Huimin, Chen, Jing, Shi, Zhuoxing, Dai, Qi, and Yao, Yuhua
- Subjects
-
*APOPTOSIS, *PROTEIN analysis, *SUPPORT vector machines, *PREDICTION models, *ACCURACY, *AMINO acids, *DATABASES, *BIOLOGICAL evolution, *PROTEINS, *BIOINFORMATICS
- Abstract
Objectives: In this paper, a high-quality sequence encoding scheme is proposed for predicting the subcellular location of apoptosis proteins. Methods: In the proposed methodology, novel evolutionary-conservative information is introduced to represent protein sequences. Meanwhile, based on the golden section proportion in mathematics, the position-specific scoring matrix (PSSM) is divided into several blocks. These features are then classified by a support vector machine (SVM), and the predictive capability of the proposed method is evaluated by the jackknife test. Results: The results show that the golden section method is better than the method without segmentation. The overall accuracy for ZD98 and CL317 is 98.98% and 91.11%, respectively, which indicates that our method can play a complementary role to the existing methods in the relevant areas. Conclusions: The proposed feature representation is powerful and the prediction accuracy is improved greatly, which indicates that our method provides state-of-the-art performance for predicting the subcellular location of apoptosis proteins. [ABSTRACT FROM AUTHOR]
- Published
- 2017
7. A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli.
- Author
-
Zhang, Xiongpan, Peng, Qinke, and Sun, Shiquan
- Subjects
-
*TRANSCRIPTION factors, *ESCHERICHIA coli, *BINDING site assay, *CALCULUS of tensors, *DNA, *ALGORITHMS, *BINDING sites, *BIOINFORMATICS
- Abstract
Background: Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding for transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-sample rate is large (i.e., number of dimensions/number of samples), concatenating different physicochemical properties into a one-dimensional vector not only is likely to lose some structural information, but also poses significant challenges to recognition methods. Materials and Method: In this paper, we introduce a purely geometric representation method, the tensor (also called multidimensional array), to represent TFs using their physicochemical properties. Accompanying the multidimensional array representation, we also develop a tensor-based recognition method, the tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by average F-measure on 51 Escherichia coli TFs from the RegulonDB database. Results: In our first experiment, the results show that multiple nucleotide properties can obtain more power than dinucleotide properties. In the second experiment, the results demonstrate that our method can gain increased prediction power, roughly a 33% improvement over the best result from existing methods. Conclusion: The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation on a real data application via a series of experiments. This method can provide further insights into the mechanism of TF binding and be of great use for metabolic engineering applications. [ABSTRACT FROM AUTHOR]
- Published
- 2017
8. Predicting overlapping protein complexes from weighted protein interaction graphs by gradually expanding dense neighborhoods.
- Author
-
Dimitrakopoulos, Christos, Theofilatos, Konstantinos, Pegkas, Andreas, Likothanassis, Spiros, and Mavroudi, Seferina
- Subjects
-
*PROTEIN-protein interactions, *MICROCLUSTERS, *SACCHAROMYCES cerevisiae, *GENE ontology, *MOLECULAR weights, *ALGORITHMS, *CLUSTER analysis (Statistics), *MOLECULAR probes, *METABOLISM, *YEAST, *BIOINFORMATICS
- Abstract
Objective: Proteins are vital biological molecules driving many fundamental cellular processes. They rarely act alone, but form interacting groups called protein complexes. The study of protein complexes is a key goal in systems biology. Recently, large protein-protein interaction (PPI) datasets have been published and a plethora of computational methods that provide new ideas for the prediction of protein complexes have been implemented. However, most of the methods suffer from two major limitations: First, they do not account for proteins participating in multiple functions and second, they are unable to handle weighted PPI graphs. Moreover, the problem remains open as existing algorithms and tools are insufficient in terms of predictive metrics. Method: In the present paper, we propose gradually expanding neighborhoods with adjustment (GENA), a new algorithm that gradually expands neighborhoods in a graph starting from highly informative "seed" nodes. GENA considers proteins as multifunctional molecules allowing them to participate in more than one protein complex. In addition, GENA accepts weighted PPI graphs by using a weighted evaluation function for each cluster. Results: In experiments with datasets from Saccharomyces cerevisiae and human, GENA outperformed Markov clustering, restricted neighborhood search and clustering with overlapping neighborhood expansion, three state-of-the-art methods for computationally predicting protein complexes. Seven PPI networks and seven evaluation datasets were used in total. GENA outperformed existing methods in 16 out of 18 experiments achieving an average improvement of 5.5% when the maximum matching ratio metric was used. Our method was able to discover functionally homogeneous protein clusters and uncover important network modules in a Parkinson expression dataset.
When used on the human networks, around 47% of the detected clusters were enriched in gene ontology (GO) terms with depth higher than five in the GO hierarchy. Conclusions: In the present manuscript, we introduce a new method for the computational prediction of protein complexes by making the realistic assumption that proteins participate in multiple protein complexes and cellular functions. Our method can detect accurate and functionally homogeneous clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2016
9. Effective gene expression data generation framework based on multi-model approach.
- Author
-
Sirin, Utku, Erdogdu, Utku, Polat, Faruk, Tan, Mehmet, and Alhajj, Reda
- Subjects
-
*GENETIC regulation, *ACQUISITION of data, *COMPUTATIONAL biology, *ARTIFICIAL intelligence in medicine, *BOOLEAN algebra, *ALGORITHMS, *MATHEMATICAL models, *MOLECULAR structure, *BIOINFORMATICS, *THEORY, *GENE expression profiling
- Abstract
Objective: To overcome the lack of samples in gene expression data sets, which have thousands of genes but only a small number of samples, a limitation that challenges the computational methods using them. Methods and Material: This paper introduces a multi-model artificial gene expression data generation framework where different gene regulatory network (GRN) models contribute to the final set of samples based on the characteristics of their underlying paradigms. In the first stage, we build different GRN models, and sample data from each of them separately. Then, we pool the generated samples into a rich set of gene expression samples, and finally try to select the best of the generated samples based on a multi-objective selection method measuring the quality of the generated samples from three aspects: compatibility, diversity and coverage. We use four alternative GRN models, namely, ordinary differential equations, probabilistic Boolean networks, multi-objective genetic algorithm and hierarchical Markov model. Results: We conducted a comprehensive set of experiments based on both real-life biological and synthetic gene expression data sets. We show that our multi-objective sample selection mechanism effectively combines samples from different models having up to 95% compatibility, 10% diversity and 50% coverage. We show that the samples generated by our framework have up to 1.5x higher compatibility, 2x higher diversity and 2x higher coverage than the samples generated by the individual models that the multi-model framework uses.
Moreover, the results show that the GRNs inferred from the samples generated by our framework can have 2.4x higher precision, 12x higher recall, and 5.4x higher f-measure values than the GRNs inferred from the original gene expression samples. Conclusions: We show that we can significantly improve the quality of generated gene expression samples by integrating different computational models into one unified framework without dealing with the complex internal details of each individual model. Moreover, the rich set of artificial gene expression samples is able to capture some biological relations that cannot be captured even by the original gene expression data set. [ABSTRACT FROM AUTHOR]
- Published
- 2016
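The precision, recall and f-measure figures quoted in the abstract above compare the edges of an inferred GRN with a reference network. A minimal sketch of that comparison, assuming directed edges stored as (regulator, target) tuples and using made-up gene names, might look like this:

```python
def edge_scores(predicted, reference):
    """Precision, recall and F-measure of predicted edges against a
    reference edge set; edges are (regulator, target) tuples."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical networks: note ("g4", "g1") is not the same edge as ("g1", "g4").
reference = {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")}
predicted = {("g1", "g2"), ("g2", "g3"), ("g4", "g1")}

precision, recall, f_measure = edge_scores(predicted, reference)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.667 0.5 0.571
```

Treating edges as directed is a design choice made here for illustration; an undirected comparison would sort each pair before building the sets.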
10. The feature selection bias problem in relation to high-dimensional gene data.
- Author
-
Krawczuk, Jerzy and Łukaszuk, Tomasz
- Subjects
-
*FEATURE selection, *GENETIC databases, *DATA mining, *REGRESSION analysis, *LEUKEMIA diagnosis, *ALGORITHMS, *COMPARATIVE studies, *DATABASES, *DECISION making, *GENES, *INFORMATION science, *RESEARCH methodology, *MEDICAL cooperation, *RESEARCH, *BIOINFORMATICS, *EVALUATION research, *RESEARCH bias, *OLIGONUCLEOTIDE arrays, *GENE expression profiling, RESEARCH evaluation
- Abstract
Objective: Feature selection is a technique widely used in data mining. The aim is to select the best subset of features relevant to the problem being considered. In this paper, we consider feature selection for the classification of gene datasets. Gene data is usually composed of just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data. However, it is not easy to find one that will perform equally well on new data as on the learning data. This overfitting issue is well known as regards classification and regression, but it also applies to feature selection. Methods and Materials: We address this problem and investigate its importance in an empirical study of four feature selection methods applied to seven high-dimensional gene datasets. We chose datasets that are well studied in the literature: colon cancer, leukemia and breast cancer. All the datasets are characterized by a significant number of features and the presence of exactly two decision classes. The feature selection methods used are ReliefF, minimum redundancy maximum relevance, support vector machine-recursive feature elimination and relaxed linear separability. Results: Our main result reveals the existence of positive feature selection bias in all 28 experiments (7 datasets and 4 feature selection methods). Bias was calculated as the difference between validation and test accuracies and ranges from 2.6% to as much as 41.67%. The validation accuracy (biased accuracy) was calculated on the same dataset on which the feature selection was performed. The test accuracy was calculated for data that was not used for feature selection (by so-called external cross-validation). Conclusions: This work provides evidence that using the same dataset for feature selection and learning is not appropriate. We recommend using cross-validation for feature selection in order to reduce selection bias. [ABSTRACT FROM AUTHOR]
- Published
- 2016
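The selection bias described in the abstract above can be reproduced on pure noise. The sketch below compares leave-one-out accuracy when features are selected once on the full dataset (biased) with accuracy when selection is repeated inside every fold (external cross-validation). Everything in it is an assumption made for illustration: the random data, the mean-difference ranking and the nearest-centroid classifier stand in for the four selection methods and datasets used in the paper.

```python
import random


def top_features(X, y, k):
    """Rank features by absolute difference of class means; keep the top k."""
    def score(j):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def predict(X_train, y_train, x, features):
    """Nearest-centroid prediction restricted to the chosen features."""
    best_label, best_distance = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        distance = sum(
            (x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in features
        )
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy; selection happens either once on all data
    (biased) or separately inside each fold (external cross-validation)."""
    fixed = None if select_inside_fold else top_features(X, y, k)
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = top_features(X_train, y_train, k) if select_inside_fold else fixed
        hits += predict(X_train, y_train, X[i], features) == y[i]
    return hits / len(X)


# Pure-noise data: 30 samples, 300 features, labels unrelated to the features.
rng = random.Random(0)
n_samples, n_features, k = 30, 300, 10
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

biased = loo_accuracy(X, y, k, select_inside_fold=False)
external = loo_accuracy(X, y, k, select_inside_fold=True)
print(f"selection on all data: {biased:.2f}, selection inside folds: {external:.2f}")
```

Since the labels carry no signal, an honest estimate should hover around chance; the first figure is typically much higher because the held-out sample already influenced which features were kept.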
11. An fMRI framework for identifying statistical differences in blood oxygenated level dependent response levels: A brain injury demonstration
- Author
-
Sumrall, Jeffrey G., Chaudry, Maryam S., and Chakravarthy, Ramya
- Subjects
-
*ARTIFICIAL intelligence in medicine, *DATA mining, *DECISION support systems, *BIOINFORMATICS, *DIAGNOSTIC imaging, *BRAIN injuries
- Abstract
Objective: The general concept surrounding fMRI data analysis for decision support is leveraging previously hidden knowledge from publicly available metadata sources with a high degree of precision. Methods and materials: Normalized fMRI scans are used to calculate cumulative voxel intensity curves for every subject in the dataset that fits chosen demographic criteria. The voxel intensity curve has a direct linear relationship to the subject's neuronal activity. In the case of head trauma, a subject's voxel intensity curve would be statistically compared to the weighted average curve for every subject in the dataset that is demographically similar. If the new subject's neuronal activity falls below the threshold for their demographic group, the brain injury detection (BID) system would then pinpoint the areas of deficiency based on Brodmann's cortical areas. Analysis: The analysis presented in this paper indicates that statistical differences among demographic groups exist in BOLD fMRI responses. Conclusion: Useful knowledge can in fact be leveraged from mining stockpiled fMRI data without the need for unique human identifiers. The BID system offers the radiologist a statistically based decision support for brain injury. [Copyright Elsevier]
- Published
- 2007
12. Case-based reasoning in the health sciences: What's next?
- Author
-
Bichindaritz, Isabelle and Marling, Cindy
- Subjects
-
*MEDICAL sciences, *MEDICAL care, *CONFERENCES & conventions, *MEDICINE
- Abstract
Summary: Objectives: This paper presents current work in case-based reasoning (CBR) in the health sciences, describes current trends and issues, and projects future directions for work in this field. Methods and material: It represents the contributions of researchers at two workshops on case-based reasoning in the health sciences. These workshops were held at the Fifth International Conference on Case-Based Reasoning (ICCBR-03) and the Seventh European Conference on Case-Based Reasoning (ECCBR-04). Results: Current research in CBR in the health sciences is marked by its richness. Highlighted trends include work in bioinformatics, support to the elderly and people with disabilities, formalization of CBR in biomedicine, and feature and case mining. Conclusion: CBR systems are being better designed to account for the complexity of biomedicine, to integrate into clinical settings and to communicate and interact with diverse systems and methods. [Copyright Elsevier]
- Published
- 2006
13. Granular support vector machines with association rules mining for protein homology prediction
- Author
-
Tang, Yuchun, Jin, Bo, and Zhang, Yan-Qing
- Subjects
-
*BIOINFORMATICS, *AMINO acid sequence, *COMPUTERS in biology, *INFORMATION science
- Abstract
Summary: Objective: Protein homology prediction between protein sequences is one of the critical problems in computational biology. Such a complex classification problem is common in medical or biological information processing applications. How to build a model with superior generalization capability from training samples is an essential issue for mining knowledge to accurately predict/classify unseen new samples and to effectively support human experts in making correct decisions. Methodology: A new learning model called granular support vector machines (GSVM) is proposed based on our previous work. GSVM systematically and formally combines the principles from statistical learning theory and granular computing theory and thus provides an interesting new mechanism to address complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVM) in some of these information granules on demand. A good granulation method to find suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. For the granules induced by association rules with high enough confidence and significant support, we leave them as they are because of their high “purity” and significant effect on simplifying the classification task. For every other granule, an SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems so that the learning task is simplified. Results and conclusions: The proposed algorithm, here named GSVM-AR, is compared with SVM on the KDDCUP04 protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (we should be careful to select the association rules to avoid overfitting) and GSVM-AR does show significant improvement compared to building one single SVM in the whole feature space.
Another advantage is that the utility of GSVM-AR is very good because it is easy to implement. More importantly and more interestingly, GSVM provides a new mechanism to address complex classification problems. [Copyright Elsevier]
- Published
- 2005
14. Identification of transcription factors that may reprogram lung adenocarcinoma
- Author
-
Yu-Hang Zhang, Tao Huang, Yu-Dong Cai, and Chenglin Liu
- Subjects
0301 basic medicine, Lung Neoplasms, Transcription, Genetic, Gene regulatory network, Medicine (miscellaneous), Adenocarcinoma of Lung, Adenocarcinoma, Biology, medicine.disease_cause, Bioinformatics, Epigenesis, Genetic, Malignant transformation, 03 medical and health sciences, Artificial Intelligence, Transcription (biology), Databases, Genetic, medicine, Humans, Gene Regulatory Networks, Epigenetics, Transcription factor, Computational Biology, medicine.disease, Gene Expression Regulation, Neoplastic, Cell Transformation, Neoplastic, 030104 developmental biology, Cancer research, Carcinogenesis, Reprogramming, Signal Transduction, Transcription Factors
- Abstract
The method can identify the core transcription factors that regulate lung adenocarcinoma associated genes. Seven core transcription factors are detected, and have been reported to relate to tumorigenesis of lung adenocarcinoma. The discovered functional core set may reverse malignant transformation and reprogram cancer cells. Background: Lung adenocarcinoma is one of the most threatening diseases to human health. Although many efforts have been devoted to its genetic study, few studies have focused on the transcription factors which regulate tumor initiation and progression by affecting the transcription of multiple downstream genes. It has been shown that proper transcription factors may mediate the direct reprogramming of cancer cells, and reverse tumorigenesis on the epigenetic and transcription levels. Methods: In this paper, a computational method is proposed to identify the core transcription factors that can regulate as many lung adenocarcinoma associated genes as possible with as little redundancy as possible. A greedy strategy is applied to find the smallest collection of transcription factors whose downstream targets cover the differentially expressed genes. The optimal subset which is most enriched in the differentially expressed genes is then selected. Results: Seven core transcription factors (MCM4, VWF, ECT2, RBMS3, LIMCH1, MYBL2 and FBXL7) are detected, and have been reported to contribute to tumorigenesis of lung adenocarcinoma. The identification of the transcription factors provides a new insight into their oncogenic role in tumor initiation and progression, and benefits the discovery of a functional core set that may reverse malignant transformation and reprogram cancer cells.
- Published
- 2017
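The greedy strategy mentioned in the abstract above resembles the classic greedy heuristic for set cover: repeatedly pick the transcription factor whose targets cover the most still-uncovered genes. The sketch below shows that generic heuristic only; the factor and gene names are hypothetical placeholders, and the enrichment-based subset selection described in the abstract is not modeled.

```python
def greedy_cover(tf_targets, genes):
    """Greedy set cover: pick transcription factors until every gene that
    can be covered is covered. Returns (chosen factors, uncovered genes)."""
    uncovered = set(genes)
    chosen = []
    while uncovered and tf_targets:
        # Factor whose targets cover the most still-uncovered genes
        best = max(tf_targets, key=lambda tf: len(tf_targets[tf] & uncovered))
        gained = tf_targets[best] & uncovered
        if not gained:
            break  # the remaining genes are not targeted by any factor
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical regulator-to-target map and differentially expressed genes.
tf_targets = {
    "TF_A": {"g1", "g2", "g3"},
    "TF_B": {"g3", "g4"},
    "TF_C": {"g4", "g5"},
    "TF_D": {"g1"},
}
genes = {"g1", "g2", "g3", "g4", "g5", "g6"}

chosen, uncovered = greedy_cover(tf_targets, genes)
print(chosen)     # ['TF_A', 'TF_C']
print(uncovered)  # {'g6'}
```

The greedy heuristic gives a small cover quickly but is not guaranteed to return the smallest possible one, which is worth keeping in mind when reading "smallest collection" in the abstract.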
15. A hierarchical classifier based on human blood plasma fluorescence for non-invasive colorectal cancer screening
- Author
-
Karin Becker, Felipe Soares, and Michel José Anzanello
- Subjects
0301 basic medicine, Support Vector Machine, Computer science, Colorectal cancer, Population, Colonic Polyps, Medicine (miscellaneous), Feature selection, Bioinformatics, Hierarchical classifier, Adenomatous Polyps, 03 medical and health sciences, 0302 clinical medicine, Predictive Value of Tests, Artificial Intelligence, Biomarkers, Tumor, medicine, Humans, education, Early Detection of Cancer, education.field_of_study, business.industry, Reproducibility of Results, Cancer, Pattern recognition, medicine.disease, Support vector machine, Identification (information), Spectrometry, Fluorescence, 030104 developmental biology, Colorectal cancer screening, Case-Control Studies, 030211 gastroenterology & hepatology, Artificial intelligence, Colorectal Neoplasms, business
- Abstract
Colorectal cancer (CRC) is a leading cause of death by cancer, and screening programs for its early identification are at the heart of the increasing survival rates. To motivate population participation, non-invasive, accurate, scalable and cost-effective diagnosis methods are required. Blood fluorescence spectroscopy provides rich information that can be used for cancer identification. The main challenges in analyzing blood fluorescence data for CRC classification are related to its high dimensionality and inherent variability, especially when analyzing a small number of samples. In this paper, we present a hierarchical classification method based on plasma fluorescence to identify not only CRC, but also adenomas and other non-malignant colorectal findings that may require further medical investigation. A feature selection algorithm is proposed to deal with the high dimensionality and select discriminant fluorescence wavelengths. These are used to train a binary support vector machine (SVM) in the first level to identify the CRC samples. The remaining samples are then presented to a one-class SVM trained on healthy subjects to detect deviant samples, and thus non-malignant findings. This hierarchical design, together with the one-class SVM, aims to reduce the effects of small samples and high variability. Using a dataset analyzed in previous studies comprising 12,341 wavelengths, we achieved superior results. Sensitivity and specificity are 0.87 and 0.95 for CRC detection, and 0.60 and 0.79 for non-malignant findings, respectively. Compared to related work, the proposed method achieved better accuracy, required fewer features, and provides a unified approach that expands CRC detection to non-malignant findings.
- Published
- 2017
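The two-level design in the abstract above (a binary classifier that separates CRC first, then a one-class model of healthy subjects applied to the remaining samples) boils down to a simple control flow. The sketch below shows only that flow; the two decision functions are stand-in threshold rules on a single made-up value, not trained SVMs, and all names are hypothetical.

```python
def hierarchical_label(sample, looks_like_crc, looks_healthy):
    """Two-level decision: test for CRC first, then check whether the
    sample fits the model of healthy subjects."""
    if looks_like_crc(sample):   # level 1: binary classifier
        return "CRC"
    if looks_healthy(sample):    # level 2: one-class model of healthy subjects
        return "healthy"
    return "non-malignant finding"  # deviates from healthy, but not CRC


# Stand-in rules on a single invented intensity value (assumptions only).
def looks_like_crc(value):
    return value > 0.8


def looks_healthy(value):
    return value < 0.3


labels = [hierarchical_label(v, looks_like_crc, looks_healthy) for v in (0.9, 0.1, 0.5)]
print(labels)  # ['CRC', 'healthy', 'non-malignant finding']
```

Passing the two decision functions as arguments keeps the flow independent of how each level is trained, so real classifiers could be swapped in without changing the hierarchy.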
16. Data mining of gene expression changes in Alzheimer brain
- Author
-
Qing Yan Liu, Boleslaw Lach, A. Fazel Famili, P. Roy Walker, Ziying Liu, Julio J. Valdés, and Brandon Smith
- Subjects
Information Storage and Retrieval ,Medicine (miscellaneous) ,Context (language use) ,Disease ,Biology ,Bioinformatics ,computer.software_genre ,gene identifications ,Alzheimer Disease ,Artificial Intelligence ,expression ,Databases, Genetic ,genomics ,medicine ,Humans ,Genetic Predisposition to Disease ,gene ,Oligonucleotide Array Sequence Analysis ,Expressed Sequence Tags ,Neurons ,Expressed sequence tag ,Microarray analysis techniques ,Gene Expression Profiling ,data mining ,medicine.disease ,Alzheimer's disease and microarray ,Gene expression profiling ,gene expression ,Data mining ,DNA microarray ,Alzheimer's disease ,Toxicogenomics ,computer - Abstract
Genome-wide transcription profiling is a powerful technique for studying the enormous complexity of cellular states. Moreover, when applied to disease tissue it may reveal quantitative and qualitative alterations in gene expression that give information on the context or underlying basis for the disease and may provide a new diagnostic approach. However, the data obtained from high-density microarrays is highly complex and poses considerable challenges in data mining. The data requires care in both pre-processing and the application of data mining techniques. This paper addresses the problem of dealing with microarray data that come from two known classes (Alzheimer and normal). We have applied three separate techniques to discover genes associated with Alzheimer disease (AD). The 67 genes identified in this study included a total of 17 genes that are already known to be associated with Alzheimer's or other neurological diseases. This is higher than any of the previously published Alzheimer's studies. Twenty known genes, not previously associated with the disease, have been identified as well as 30 uncharacterized expressed sequence tags (ESTs). Given the success in identifying genes already associated with AD, we can have some confidence in the involvement of the latter genes and ESTs. From these studies we can attempt to define therapeutic strategies that would prevent the loss of specific components of neuronal function in susceptible patients or be in a position to stimulate the replacement of lost cellular function in damaged neurons. Although our study is based on a relatively small number of patients (four AD and five normal), we think our approach sets the stage for a major step in using gene expression data for disease modeling (i.e. classification and diagnosis). It can also contribute to the future of gene function identification, pathology, toxicogenomics, and pharmacogenomics.
- Published
- 2004
17. Association of genetic profiles to Crohn's disease by linear combinations of single nucleotide polymorphisms
- Author
-
Rosalia Maglietta, Annarita D'Addabbo, Anna Latiano, Orazio Palmieri, Nicola Ancona, Vito Annese, and Maria Teresa Creanza
- Subjects
Genetics ,Gene Expression Profiling ,Medicine (miscellaneous) ,Locus (genetics) ,Single-nucleotide polymorphism ,Epistasis, Genetic ,Odds ratio ,Biology ,Bioinformatics ,Phenotype ,Polymorphism, Single Nucleotide ,PTPN22 ,Crohn Disease ,Artificial Intelligence ,Genetic marker ,Humans ,Gene ,SNP array - Abstract
Motivations: A large number of single nucleotide polymorphisms (SNPs) are supposed to be involved in onset, differentiation and development of complex diseases. Univariate analysis is limited in studying complex traits since it does not take into account gene-gene interaction and the correlation of multiple SNPs with a specific phenotype. Moreover, it might underestimate gene variants with weaker genetic contribution. Therefore more sophisticated techniques should be adopted when investigating the role of a panel of genetic markers in disease predisposition. Methods: In this paper we describe a general method to simultaneously investigate the association between SNP profiles and Crohn's disease (CD), by evaluating the susceptibility or protective role of single or groups of markers. As an association measure we adopted a weighted linear combination of SNPs in which suitable weighting vectors belonged to predefined and over-complete vocabularies of vectors (frames), or were determined by the data. Results: The proposed method found a weighted linear combination of SNPs statistically associated with CD (p = 3.81 × 10^-10) describing the role of the markers in the pathology. In particular, MCP1-A2518G gave the major contribution as protective locus, similarly to the TNF-α-C857T, DLG5 rs124869 and PTPN22 C1858T variants. The NFκB -94ATTG variant was found to be irrelevant for CD. For the remaining markers, a susceptibility role was attributed, also confirming that markers on the CARD15 gene, in particular G908R and L1007fsinsC, are involved with CD to the same extent as FcGIIIA G559T and TNF-α-G308A. Moreover, an odds ratio of 3.99(p
- Published
- 2008
18. Adaptive bandwidth selection for biomarker discovery in mass spectrometry
- Author
-
Volker Roth, Bernd Fischer, and Joachim M. Buhmann
- Subjects
Model selection ,Bandwidth (signal processing) ,Medicine (miscellaneous) ,Models, Theoretical ,Bioinformatics ,Mass spectrometry ,Global model ,Mass Spectrometry ,Scale space ,Time changes ,Artificial Intelligence ,Biomarker discovery ,Biological system ,Canonical correlation ,Biomarkers ,Mathematics ,Chromatography, Liquid - Abstract
Objective: Differential quantification of proteins by liquid chromatography/mass spectrometry requires the alignment of a retention time axis. The alignment automatically corrects for time changes in the liquid chromatography unit when repeating two experiments. Methods: In this paper we present an extension of non-negative canonical correlation analysis. We introduce an adaptive scale space estimation that adapts the complexity of a monotone regression function to the density of measurements across the retention time. Furthermore, the global model selection of the scale is replaced by a local one: we estimate the scale for each individual time axis, instead of a single global parameter that holds for all time axes. Results: Experiments show a 13% performance gain, measured in the number of proteins that are detected to differ significantly in abundance between two different biological samples. Conclusion: We conclude that the adaptive scale estimation and the local model selection can outperform the global model selection, yielding a more effective selection of differentially abundant proteins.
- Published
- 2007
19. A multi-approaches-guided genetic algorithm with application to operon prediction
- Author
-
Fangxun Sun, Yanchun Liang, Shuqin Wang, Chunguang Zhou, Wei Du, Xiumei Wang, and Yan Wang
- Subjects
Computer science ,Operon ,Medicine (miscellaneous) ,Computational biology ,Bioinformatics ,Data type ,Genome ,symbols.namesake ,Cog ,Artificial Intelligence ,Databases, Genetic ,Preprocessor ,Gene Regulatory Networks ,Gene ,Molecular Biology ,Oligonucleotide Array Sequence Analysis ,biology ,Bacteria ,Systems Biology ,Computational Biology ,Prokaryote ,Gene Expression Regulation, Bacterial ,biology.organism_classification ,Pearson product-moment correlation coefficient ,symbols ,Algorithms ,Genome, Bacterial - Abstract
Objective: The prediction of operons is critical to the reconstruction of regulatory networks at the whole genome level. Multiple genome features have been used for predicting operons. However, in the literature multiple genome features are usually handled with only a single method. The aim of this paper is to develop a combined method for operon prediction that uses different methods to preprocess different genome features in order to exploit their unique characteristics. Methods: A novel multi-approach-guided genetic algorithm for operon prediction is presented. We exploit different methods for intergenic distance, cluster of orthologous groups (COG) gene functions, metabolic pathway and microarray expression data. A novel local-entropy-minimization method is proposed to partition intergenic distance. Our program can be used for other newly sequenced genomes by transferring the knowledge that has been obtained from Escherichia coli data. We calculate the log-likelihood for COG gene functions and the Pearson correlation coefficient for microarray expression data. The genetic algorithm is used for integrating the four types of data. Results: The proposed method is examined on the E. coli K12 genome, the Bacillus subtilis genome, and the Pseudomonas aeruginosa PAO1 genome. The accuracies of prediction for these three genomes are 85.9987%, 88.296%, and 81.2384%, respectively. Conclusion: Simulated experimental results demonstrate that in the genetic algorithm the preprocessing for genome data using multiple approaches ensures the effective utilization of different biological characteristics. Experimental results also show that the proposed method is applicable for predicting operons in prokaryotes.
- Published
- 2006
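The abstract of record 19 mentions computing a Pearson correlation coefficient on microarray expression data. As a minimal sketch of that one step only, the snippet below computes the coefficient for two expression profiles. The function name `pearson` and the example profiles `gene_a`, `gene_b`, `gene_c` are hypothetical and purely illustrative; nothing here reproduces the authors' preprocessing or their genetic algorithm.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]   # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```

Under the usual reading, a coefficient near +1 for adjacent genes suggests co-expression, which is one kind of evidence an operon predictor can combine with other genome features.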
20. Selection of relevant genes in cancer diagnosis based on their prediction accuracy
- Author
-
Annarita D'Addabbo, Ada Piepoli, Francesco Perri, Sabino Liuni, Graziano Pesole, Nicola Ancona, and Rosalia Maglietta
- Subjects
Colorectal cancer ,Medicine (miscellaneous) ,Disease ,Biology ,Bioinformatics ,Artificial Intelligence ,Predictive Value of Tests ,medicine ,Biomarkers, Tumor ,Humans ,Genetic Testing ,Gene ,Genetic testing ,Oligonucleotide Array Sequence Analysis ,Models, Statistical ,medicine.diagnostic_test ,Models, Genetic ,Gene Expression Profiling ,Supervised learning ,medicine.disease ,Prognosis ,Phenotype ,Gene expression profiling ,Gene Expression Regulation, Neoplastic ,Colonic Neoplasms ,DNA microarray - Abstract
Motivations: One of the main problems in cancer diagnosis by using DNA microarray data is selecting genes relevant for the pathology by analyzing their expression profiles in tissues in two different phenotypical conditions. The question we pose is the following: how do we measure the relevance of a single gene in a given pathology? Methods: A gene is relevant for a particular disease if we are able to correctly predict the occurrence of the pathology in new patients on the basis of its expression level only. In other words, a gene is informative for the disease if its expression levels are useful for training a classifier able to generalize, that is, able to correctly predict the status of new patients. In this paper we present a selection-bias-free, statistically well-founded method for finding relevant genes on the basis of their classification ability. Results: We applied the method on a colon cancer data set and produced a list of relevant genes, ranked on the basis of their prediction accuracy. We found, out of more than 6500 available genes, 54 overexpressed in normal tissues and 77 overexpressed in tumor tissues having prediction accuracy greater than 70% at the 0.05 significance level. Conclusions: The relevance of the selected genes was assessed (a) statistically, evaluating the p-value of the estimated prediction accuracy of each gene; (b) biologically, confirming the involvement of many genes in generic carcinogenic processes and in particular for the colon; (c) comparatively, verifying the presence of these genes in other studies on the same data set.
- Published
- 2006
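The abstract of record 20 defines a gene as relevant when its expression level alone lets a classifier predict the status of new patients. The sketch below illustrates that idea under loudly stated assumptions: it uses a simple midpoint-threshold classifier and leave-one-out validation, neither of which is stated in the abstract, and it omits the significance test the authors describe. The name `loo_accuracy` and the example data are hypothetical.

```python
def loo_accuracy(expr, labels):
    """Leave-one-out accuracy of a one-gene midpoint-threshold classifier.

    expr   -- expression levels of a single gene, one value per patient
    labels -- class labels (1 = tumor, 0 = normal), same order as expr
    """
    if len(expr) != len(labels):
        raise ValueError("expr and labels must have the same length")
    if labels.count(1) < 2 or labels.count(0) < 2:
        raise ValueError("need at least two samples per class")
    correct = 0
    for i in range(len(expr)):
        # Train on every patient except the held-out one.
        pos = [e for j, (e, l) in enumerate(zip(expr, labels)) if j != i and l == 1]
        neg = [e for j, (e, l) in enumerate(zip(expr, labels)) if j != i and l == 0]
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        threshold = (mean_pos + mean_neg) / 2
        # Predict class 1 when the held-out value lies on the class-1 side.
        above = expr[i] > threshold
        predicted = 1 if above == (mean_pos > mean_neg) else 0
        correct += predicted == labels[i]
    return correct / len(expr)


# Hypothetical gene that is clearly overexpressed in class 1.
informative = [5.1, 4.8, 5.5, 1.2, 0.9, 1.5]
status = [1, 1, 1, 0, 0, 0]
print(loo_accuracy(informative, status))
```

Because the threshold is refitted without the held-out patient each time, the estimate reflects prediction on unseen samples rather than fit to the training data. Selecting genes and estimating accuracy on the same patients would reintroduce the selection bias the abstract warns about.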
21. Computational modeling of oligonucleotide positional densities for human promoter prediction
- Author
-
Wing-Kin Sung, Ankush Mittal, and Vipin Narang
- Subjects
Regulation of gene expression ,Base Sequence ,Models, Genetic ,Oligonucleotide ,Mammalian promoter database ,Medicine (miscellaneous) ,Promoter ,Statistical model ,Bayes Theorem ,Computational biology ,Biology ,Bioinformatics ,Genome ,DNA binding site ,Oligodeoxyribonucleotides ,Artificial Intelligence ,Computer Simulation ,Promoter Regions, Genetic ,Gene ,Software ,Transcription Factors - Abstract
Objective: The gene promoter region controls transcriptional initiation of a gene, which is the most important step in gene regulation. In-silico detection of promoter region in genomic sequences has a number of applications in gene discovery and understanding gene expression regulation. However, computational prediction of eukaryotic poly-II promoters has remained a difficult task. This paper introduces a novel statistical technique for detecting promoter regions in long genomic sequences. Method: A number of existing techniques analyze the occurrence frequencies of oligonucleotides in promoter sequences as compared to other genomic regions. In contrast, the present work studies the positional densities of oligonucleotides in promoter sequences. The analysis does not require any non-promoter sequence dataset or any model of the background oligonucleotide content of the genome. The statistical model learnt from a dataset of promoter sequences automatically recognizes a number of transcription factor binding sites simultaneously with their occurrence positions relative to the transcription start site. Based on this model, a continuous naive Bayes classifier is developed for the detection of human promoters and transcription start sites in genomic sequences. Results: The present study extends the scope of statistical models in general promoter modeling and prediction. Promoter sequence features learnt by the model correlate well with known biological facts. Results of human transcription start site prediction compare favorably with existing 2nd generation promoter prediction tools.
- Published
- 2004
22. Active subgroup mining: a case study in coronary heart disease risk group detection
- Author
-
Nada Lavrač, Dragan Gamberger, and Goran Krstačić
- Subjects
medicine.medical_specialty ,Medical Records Systems, Computerized ,business.industry ,coronary heart disease ,active mining ,machine learning ,subgroup discovery ,risk group detection ,non-invasive cardiovascular tests ,Patient screening ,Medicine (miscellaneous) ,Early detection ,Information Storage and Retrieval ,Coronary Artery Disease ,medicine.disease ,Bioinformatics ,Prognosis ,Coronary heart disease ,Coronary artery disease ,Risk groups ,Knowledge extraction ,Artificial Intelligence ,Risk Factors ,medicine ,Humans ,Intensive care medicine ,business ,High potential - Abstract
This paper presents an approach to active mining of patient records aimed at discovering patient groups at high risk for coronary heart disease. The approach proposes active expert involvement in the following steps of the knowledge discovery process: data gathering, cleaning and transformation, subgroup discovery, statistical characterization of induced subgroups, their interpretation, and the evaluation of results. Because the main risk factors are made explicit in the discovery and characterization of risk subgroups, the proposed methodology has high potential for patient screening and early detection of patient groups at risk for coronary heart disease.
- Published
- 2003
Discovery Service for Jio Institute Digital Library