1,413 results on '"Position-Specific Scoring Matrices"'
Search Results
2. Early Diverging Insect-Pathogenic Fungi of the Order Entomophthorales Possess Diverse and Unique Subtilisin-Like Serine Proteases.
- Author
-
Arnesen, Jonathan A, Małagocka, Joanna, Gryganskyi, Andrii, Grigoriev, Igor V, Voigt, Kerstin, Stajich, Jason E, and De Fine Licht, Henrik H
- Subjects
Animals, Entomophthorales, Subtilisins, Cluster Analysis, Sequence Analysis, DNA, Phylogeny, Amino Acid Sequence, Amino Acid Motifs, Catalytic Domain, Databases, Nucleic Acid, Position-Specific Scoring Matrices, Protein Domains, Insecta, Subtilase, early-diverging fungi, insect-pathogen, phylogenomics, proteases, Sequence Analysis, DNA, Databases, Nucleic Acid, Emerging Infectious Diseases, Infectious Diseases, 2.2 Factors relating to physical environment, Infection, Genetics
- Abstract
Insect-pathogenic fungi use subtilisin-like serine proteases (SLSPs) to degrade chitin-associated proteins in the insect procuticle. Most insect-pathogenic fungi in the order Hypocreales (Ascomycota) are generalist species with a broad host range, and most species possess a high number of SLSPs. The other major clade of insect-pathogenic fungi is part of the subphylum Entomophthoromycotina (Zoopagomycota, formerly Zygomycota), which consists of highly host-specific insect-pathogenic fungi that naturally infect only a single or very few host species. The extent to which insect-pathogenic fungi in the order Entomophthorales rely on SLSPs is unknown. Here we take advantage of recently available transcriptomic and genomic datasets from four genera within Entomophthoromycotina: the saprobic or opportunistic pathogens Basidiobolus meristosporus, Conidiobolus coronatus, C. thromboides, C. incongruus, and the host-specific insect pathogens Entomophthora muscae and Pandora formicae, specific pathogens of house flies (Musca domestica) and wood ants (Formica polyctena), respectively. In total, 154 SLSPs from six fungi in the subphylum Entomophthoromycotina were identified: E. muscae (n = 22), P. formicae (n = 6), B. meristosporus (n = 60), C. thromboides (n = 18), C. coronatus (n = 36), and C. incongruus (n = 12). A unique group of 11 SLSPs that loosely resembles bacillopeptidase F-like SLSPs was discovered in the genomes of the obligate biotrophic fungi E. muscae and P. formicae and the saprobic human pathogen C. incongruus. Phylogenetics and protein domain analysis show this class represents a unique group of SLSPs so far only observed among Bacteria, Oomycetes and early diverging fungi such as Cryptomycota, Microsporidia, and Entomophthoromycotina. This group of SLSPs is missing in the sister fungal lineage Kickxellomycotina and in the fungal phyla Mucoromycota, Ascomycota and Basidiomycota, suggesting interesting gene loss patterns.
- Published
- 2018
3. MIPPIS: protein-protein interaction site prediction network with multi-information fusion.
- Author
-
Wang S, Dong K, Liang D, Zhang Y, Li X, and Song T
- Subjects
- Protein Interaction Mapping methods, Proteins chemistry, Proteins metabolism, Databases, Protein, Markov Chains, Algorithms, Amino Acid Sequence, Position-Specific Scoring Matrices, Computational Biology methods
- Abstract
Background: The prediction of protein-protein interaction sites plays a crucial role in biochemical processes. Investigating the interaction between viruses and receptor proteins through biological techniques aids in understanding disease mechanisms and guides the development of corresponding drugs. While various methods have been proposed in the past, they often suffer from drawbacks such as long processing times, high costs, and low accuracy. Results: Addressing these challenges, we propose a novel protein-protein interaction site prediction network based on multi-information fusion. In our approach, the initial amino acid features are depicted by the position-specific scoring matrix, hidden Markov model, dictionary of protein secondary structure, and one-hot encoding. Simultaneously, we adopt a multi-channel approach to extract deep-level amino acid features from different perspectives. The graph convolutional network channel effectively extracts spatial structural information. The bidirectional long short-term memory channel treats the amino acid sequence as natural language, capturing the protein's primary structure information. The ProtT5 protein large language model channel outputs a more comprehensive amino acid embedding representation, providing a robust complement to the two aforementioned channels. Finally, the obtained amino acid features are fed into the prediction layer for the final prediction. Conclusion: Compared with six protein structure-based methods and six protein sequence-based methods, our model achieves optimal performance across evaluation metrics, including accuracy, precision, F1, Matthews correlation coefficient, and area under the precision-recall curve, which demonstrates the superiority of our model. (© 2024. The Author(s).)
- Published
- 2024
4. CompariPSSM: a PSSM-PSSM comparison tool for motif-binding determinant analysis.
- Author
-
Tsitsa I, Krystkowiak I, and Davey NE
- Subjects
- Position-Specific Scoring Matrices, Databases, Protein, Protein Binding, Proteins chemistry, Proteins metabolism, Binding Sites, Proteomics methods, Amino Acid Motifs, Software
- Abstract
Motivation: Short linear motifs (SLiMs) are compact functional modules that mediate low-affinity protein-protein interactions. SLiMs direct the function of many dynamic signalling and regulatory complexes, playing a central role in most biological processes of the cell. Motif-binding determinants describe the contribution of each residue in a motif-containing peptide to the affinity and specificity of binding to the motif-binding partner. Motif-binding determinants are generally defined as a motif consensus pattern or a position-specific scoring matrix (PSSM) encoding quantitative preferences. Motif-binding determinant comparison is an important motif analysis task and can be applied to motif annotation, classification, clustering, discovery and benchmarking. Currently, binding determinant comparison is generally performed by analysing consensus similarity; however, this ignores important quantitative information in both the consensus and non-consensus positions. Results: We have created a new tool, CompariPSSM, that quantifies the similarity between motif-binding determinants using sliding window PSSM-PSSM comparison and scores PSSM similarity using a randomisation-based probabilistic framework. The tool has been benchmarked on curated data from the eukaryotic linear motif database and experimental data from proteomic peptide-phage display. CompariPSSM can be used for peptide classification to validate motif classes, peptide clustering to group functionally related conserved disordered regions, and benchmarking experimental motif discovery methods. Availability and Implementation: CompariPSSM is available at https://slim.icr.ac.uk/projects/comparipssm. (© The Author(s) 2024. Published by Oxford University Press.)
- Published
- 2024
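The sliding-window PSSM-PSSM comparison mentioned in the abstract above can be illustrated with a minimal sketch. This is not CompariPSSM's scoring scheme: the abstract describes a randomisation-based probabilistic framework, whereas the sketch below simply assumes Pearson correlation between columns and an ungapped window, and the function names `column_similarity` and `best_window` are hypothetical. PSSMs are assumed to be NumPy arrays of shape (positions, 20).

```python
import numpy as np

def column_similarity(a, b):
    """Pearson correlation between two PSSM columns (an assumed similarity measure)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def best_window(query, target):
    """Slide the shorter PSSM along the longer one without gaps.

    Returns (offset, score), where score is the mean column similarity
    of the best-scoring window and offset is its start in the longer PSSM.
    """
    if query.shape[0] > target.shape[0]:
        query, target = target, query
    width = query.shape[0]
    best_offset, best_score = 0, -np.inf
    for offset in range(target.shape[0] - width + 1):
        score = np.mean([
            column_similarity(query[i], target[offset + i])
            for i in range(width)
        ])
        if score > best_score:
            best_offset, best_score = offset, float(score)
    return best_offset, best_score

# Usage: a 4-position query cut out of a 10-position target is found at its origin.
rng = np.random.default_rng(0)
target = rng.normal(size=(10, 20))
query = target[3:7].copy()
offset, score = best_window(query, target)
```

A raw similarity score like this says nothing about significance; assessing it against randomised matrices is the part the tool itself is described as handling.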
5. StackDPPred: Multiclass prediction of defensin peptides using stacked ensemble learning with optimized features.
- Author
-
Arif M, Musleh S, Ghulam A, Fida H, Alqahtani Y, and Alam T
- Subjects
- Computational Biology methods, Principal Component Analysis, Amino Acid Sequence, Algorithms, Position-Specific Scoring Matrices, Defensins chemistry, Machine Learning
- Abstract
Host defense peptides, or antimicrobial peptides (AMPs), are promising candidates for protecting the host against microbial pathogens such as bacteria, viruses, fungi, and yeast. Defensins are a type of AMP that act as potential therapeutic agents and play a vital role in various biological processes. Conventional experiments to identify defensin peptides (DPs) are time-consuming and expensive. Computational methods can therefore address the shortcomings of wet-lab experiments by accurately predicting the functional types of DPs. In this paper, we propose a novel multi-class ensemble-based prediction model called StackDPPred for identifying the properties of DPs. The peptide sequences are encoded using split amino acid composition (SAAC), segmented position-specific scoring matrix (SegPSSM), histogram of oriented gradients-based PSSM (HOGPSSM) and feature extraction based graphical and statistical (FEGS) descriptors. Next, principal component analysis (PCA) is used to select the best subset of attributes. After that, the optimized features are fed into single machine learning and stacking-based ensemble classifiers. Furthermore, the ablation study demonstrates the robustness and efficacy of the stacking approach using reduced features for predicting DPs and their families. The proposed StackDPPred method improves the overall accuracy by 13.41% and 7.62% compared to the existing DP predictors iDPF-PseRAAC and iDEF-PseRAAC, respectively, on the validation test. Additionally, we applied the local interpretable model-agnostic explanations (LIME) algorithm to understand the contribution of selected features to the overall prediction. We believe StackDPPred could serve as a valuable tool for accelerating the screening of large-scale DPs and the peptide-based drug discovery process. Competing Interests: Declaration of Competing Interest I, Muhammad Arif, hereby declare that the authors have no conflict of interest. (Copyright © 2024 The Author(s). Published by Elsevier Inc. All rights reserved.)
- Published
- 2024
6. Conservation and divergence of small RNA pathways and microRNAs in land plants
- Author
-
You, Chenjiang, Cui, Jie, Wang, Hui, Qi, Xinping, Kuo, Li-Yaung, Ma, Hong, Gao, Lei, Mo, Beixin, and Chen, Xuemei
- Subjects
Climate Change Impacts and Adaptation, Biological Sciences, Bioinformatics and Computational Biology, Evolutionary Biology, Genetics, Environmental Sciences, Biotechnology, Generic health relevance, Life on Land, Base Sequence, Conserved Sequence, Embryophyta, Evolution, Molecular, Ferns, Gene Expression Regulation, Plant, Genes, Plant, Genetic Variation, MicroRNAs, Phenotype, Phylogeny, Position-Specific Scoring Matrices, RNA Isoforms, Small RNA, Lycophyte, Fern, miRNA, Evolution, Argonaute, DICER-LIKE, RdDM, Information and Computing Sciences, Bioinformatics
- Abstract
Background: As key regulators of gene expression in eukaryotes, small RNAs have been characterized in many seed plants, and pathways for their biogenesis, degradation, and action have been defined in model angiosperms. However, both small RNAs themselves and small RNA pathways are not well characterized in other land plants such as lycophytes and ferns, preventing a comprehensive evolutionary perspective on small RNAs in land plants. Results: Using 25 representatives from major lineages of lycophytes and ferns, most of which lack sequenced genomes, we characterized small RNAs and small RNA pathways in these plants. We identified homologs of DICER-LIKE (DCL), ARGONAUTE (AGO), and other genes involved in small RNA pathways, predicted over 2600 conserved microRNA (miRNA) candidates, and performed phylogenetic analyses on small RNA pathways as well as miRNAs. Pathways underlying miRNA biogenesis, degradation, and activity were established in the common ancestor of land plants, but the 24-nucleotide siRNA pathway that guides DNA methylation is incomplete in sister species of seed plants, especially lycophytes. We show that the functional diversification of key gene families such as DCL and AGO as observed in angiosperms occurred early in land plants followed by parallel expansion of the AGO family in ferns and angiosperms. We uncovered a conserved AGO subfamily absent in angiosperms. Conclusions: Our phylogenetic analyses of miRNAs in bryophytes, lycophytes, ferns, and angiosperms refine the time-of-origin for conserved miRNA families as well as small RNA machinery in land plants.
- Published
- 2017
7. A Novel Ensemble Learning-Based Computational Method to Predict Protein-Protein Interactions from Protein Primary Sequences.
- Author
-
Pan, Jie, Wang, Shiwei, Yu, Changqing, Li, Liping, You, Zhuhong, and Sun, Yanmei
- Subjects
- *AMINO acid sequence, *HILBERT transform, *SUPPORT vector machines, *K-nearest neighbor classification, *RANDOM forest algorithms
- Abstract
Simple Summary: Protein–protein interactions (PPIs) play a central role in the evolution and progression of various biological processes. In this article, we constructed a novel ensemble-learning-based model to predict potential PPIs, which only utilized the protein sequence information. The presented method used the Discrete Hilbert transform to extract amino acid sequence information from position-specific scoring matrices. Then these extracted features were fed into a rotation forest for training and prediction. When applying our method to three datasets (Yeast, Human, and Oryza sativa) for detecting PPIs, we obtained excellent prediction performance. Furthermore, the comparison results indicated that our computational model is effective and robust in predicting potential PPI pairs. Protein–protein interactions (PPIs) are crucial for understanding cellular processes, including signal cascades, DNA transcription, metabolic cycles, and repair. In the past decade, a multitude of high-throughput methods have been introduced to detect PPIs. However, these techniques are time-consuming, laborious, and always suffer from high false negative rates. Therefore, there is a great need for new computational methods as a supplemental tool for PPI prediction. In this article, we present a novel sequence-based model to predict PPIs that combines the Discrete Hilbert transform (DHT) and Rotation Forest (RoF). This method contains three stages: firstly, the position-specific scoring matrix (PSSM) was adopted to transform the amino acid sequence into a PSSM matrix, which can contain rich information about protein evolution. Then, the 400-dimensional DHT descriptor was constructed for each protein pair. Finally, these feature descriptors were fed to the RoF classifier for identifying the potential PPI class. When exploring the proposed model on the Yeast, Human, and Oryza sativa PPI datasets, it yielded excellent prediction accuracies of 91.93, 96.35, and 94.24%, respectively. In addition, we also conducted numerous experiments on cross-species PPI datasets, and the predictive capacity of our method was also excellent. To further assess the prediction ability of the proposed approach, we present the comparison of RoF with four powerful classifiers, including Support Vector Machine (SVM), Random Forest (RF), K-nearest Neighbor (KNN), and AdaBoost. We also compared it with some existing state-of-the-art works. These comprehensive experimental results further confirm the excellence and feasibility of the proposed approach. In future work, we hope it can be a supplemental tool for proteomics analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2022
8. Identification, characterization, and gene expression analysis of nucleotide binding site (NB)-type resistance gene homologues in switchgrass
- Author
-
Frazier, Taylor P, Palmer, Nathan A, Xie, Fuliang, Tobias, Christian M, Donze-Reiner, Teresa J, Bombarely, Aureliano, Childs, Kevin L, Shu, Shengqiang, Jenkins, Jerry W, Schmutz, Jeremy, Zhang, Baohong, Sarath, Gautam, and Zhao, Bingyu
- Subjects
Biological Sciences, Bioinformatics and Computational Biology, Genetics, Alleles, Amino Acid Sequence, Computational Biology, Databases, Nucleic Acid, Disease Resistance, Gene Expression Profiling, Genes, Plant, Genetic Association Studies, Genetic Predisposition to Disease, Genome, Plant, Genomics, Panicum, Phylogeny, Polymorphism, Single Nucleotide, Position-Specific Scoring Matrices, Protein Interaction Domains and Motifs, Reproducibility of Results, Biofuel, Disease resistance, Gene expression, NB-LRR, Panicum virgatum, RNA-seq, SNP, Information and Computing Sciences, Medical and Health Sciences, Bioinformatics, Biological sciences, Biomedical and clinical sciences
- Abstract
Background: Switchgrass (Panicum virgatum L.) is a warm-season perennial grass that can be used as a second generation bioenergy crop. However, foliar fungal pathogens, like switchgrass rust, have the potential to significantly reduce switchgrass biomass yield. Despite its importance as a prominent bioenergy crop, a genome-wide comprehensive analysis of NB-LRR disease resistance genes has yet to be performed in switchgrass. Results: In this study, we used a homology-based computational approach to identify 1011 potential NB-LRR resistance gene homologs (RGHs) in the switchgrass genome (v 1.1). In addition, we identified 40 RGHs that potentially contain unique domains including major sperm protein domain, jacalin-like binding domain, calmodulin-like binding, and thioredoxin. RNA-sequencing analysis of leaf tissue from 'Alamo', a rust-resistant switchgrass cultivar, and 'Dacotah', a rust-susceptible switchgrass cultivar, identified 2634 high quality variants in the RGHs between the two cultivars. RNA-sequencing data from field-grown cultivar 'Summer' plants indicated that the expression of some of these RGHs was developmentally regulated. Conclusions: Our results provide useful insight into the molecular structure, distribution, and expression patterns of members of the NB-LRR gene family in switchgrass. These results also provide a foundation for future work aimed at elucidating the molecular mechanisms underlying disease resistance in this important bioenergy crop.
- Published
- 2016
9. Probing RNA recognition by human ADAR2 using a high-throughput mutagenesis method
- Author
-
Wang, Yuru and Beal, Peter A
- Subjects
Biochemistry and Cell Biology, Bioinformatics and Computational Biology, Biological Sciences, Genetics, 1.1 Normal biological development and functioning, Underpinning research, Generic health relevance, Adenosine Deaminase, Amino Acid Sequence, Base Sequence, Binding Sites, Flow Cytometry, Gene Expression, Genes, Reporter, Humans, Models, Molecular, Molecular Conformation, Mutagenesis, Nucleic Acid Conformation, Position-Specific Scoring Matrices, Protein Binding, RNA, RNA Editing, RNA-Binding Proteins, Single-Cell Analysis, Structure-Activity Relationship, Yeasts, Environmental Sciences, Information and Computing Sciences, Developmental Biology, Biological sciences, Chemical sciences, Environmental sciences
- Abstract
Adenosine deamination is one of the most prevalent post-transcriptional modifications in mRNA. In humans, ADAR1 and ADAR2 catalyze this modification and their malfunction correlates with disease. Recently our laboratory reported crystal structures of the human ADAR2 deaminase domain bound to duplex RNA revealing a protein loop that binds the RNA on the 5' side of the modification site. This 5' binding loop appears to be one contributor to substrate specificity differences between ADAR family members. In this study, we endeavored to reveal detailed structure-activity relationships in this loop to advance our understanding of RNA recognition by ADAR2. To achieve this goal, we established a high-throughput mutagenesis approach which allows rapid screening of ADAR variants in single yeast cells and provides quantitative evaluation for enzymatic activity. Using this approach, we determined the importance of specific amino acids at 19 different positions in the ADAR2 5' binding loop and revealed six residues that provide essential structural elements supporting the fold of the loop and key RNA-binding functional groups. This work provided new insight into RNA recognition by ADAR2 and established a new tool for defining structure-function relationships in ADAR reactions.
- Published
- 2016
10. Identification and Characterization of a cis-Regulatory Element for Zygotic Gene Expression in Chlamydomonas reinhardtii
- Author
-
Hamaji, Takashi, Lopez, David, Pellegrini, Matteo, and Umen, James
- Subjects
Biological Sciences, Bioinformatics and Computational Biology, Biotechnology, Genetics, Base Sequence, Chlamydomonas reinhardtii, Gene Expression, Gene Expression Regulation, Plant, Genes, Reporter, Homeodomain Proteins, Models, Biological, Nucleotide Motifs, Position-Specific Scoring Matrices, Promoter Regions, Genetic, Protein Multimerization, Regulatory Sequences, Nucleic Acid, Zygote, Chlamydomonas, cis-regulatory element, fertilization, homeodomain protein, zygote, Biochemistry and cell biology, Statistics
- Abstract
Upon fertilization, Chlamydomonas reinhardtii zygotes undergo a program of differentiation into a diploid zygospore that is accompanied by transcription of hundreds of zygote-specific genes. We identified a distinct sequence motif we term a zygotic response element (ZYRE) that is highly enriched in promoter regions of C. reinhardtii early zygotic genes. A luciferase reporter assay was used to show that native ZYRE motifs within the promoter of zygotic gene ZYS3 or intron of zygotic gene DMT4 are necessary for zygotic induction. A synthetic luciferase reporter with a minimal promoter was used to show that ZYRE motifs introduced upstream are sufficient to confer zygotic upregulation, and that ZYRE-controlled zygotic transcription is dependent on the homeodomain transcription factor GSP1. We predict that ZYRE motifs will correspond to binding sites for the homeodomain proteins GSP1-GSM1 that heterodimerize and activate zygotic gene expression in early zygotes.
- Published
- 2016
11. PSSM-Sumo: deep learning based intelligent model for prediction of sumoylation sites using discriminative features.
- Author
-
Khan S, AlQahtani SA, Noor S, and Ahmad N
- Subjects
- Support Vector Machine, Computational Biology methods, Algorithms, Humans, Position-Specific Scoring Matrices, Protein Processing, Post-Translational, Sumoylation, Deep Learning
- Abstract
Post-translational modifications (PTMs) are fundamental to essential biological processes, exerting significant influence over gene expression, protein localization, stability, and genome replication. Sumoylation, a PTM involving the covalent addition of a chemical group to a specific protein sequence, profoundly impacts the functional diversity of proteins. Notably, identifying sumoylation sites has garnered significant attention due to their crucial roles in proteomic functions and their implications in various diseases, including Parkinson's and Alzheimer's. Although several computational models have been proposed for identifying sumoylation sites, their effectiveness is limited by conventional learning methodologies. In this study, we introduce pseudo-position-specific scoring matrix (PsePSSM), a robust computational model designed for accurately predicting sumoylation sites using an optimized deep learning algorithm and efficient feature extraction techniques. Moreover, to streamline computational processes and eliminate irrelevant and noisy features, sequential forward selection using a support vector machine (SFS-SVM) is implemented to identify optimal features. The multi-layer Deep Neural Network (DNN) is a robust classifier, facilitating precise sumoylation site prediction. We meticulously assess the performance of PSSM-Sumo through a tenfold cross-validation approach, employing various statistical metrics such as the Matthews Correlation Coefficient (MCC), accuracy, sensitivity, specificity, and the Area under the ROC Curve (AUC). Comparative analyses reveal that PSSM-Sumo achieves an exceptional average prediction accuracy of 98.71%, surpassing existing models. The robustness and accuracy of the proposed model position it as a promising tool for advancing drug discovery and the diagnosis of diverse diseases linked to sumoylation sites. (© 2024. The Author(s).)
- Published
- 2024
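For readers unfamiliar with the PsePSSM encoding named in the abstract above, the following is a minimal sketch of one commonly used formulation: the per-column means of the PSSM, followed by lagged squared differences along the sequence. It is not taken from the paper; the function name `pse_pssm`, the `max_lag` parameter, and the assumption that the input has already been normalized are all illustrative.

```python
import numpy as np

def pse_pssm(pssm, max_lag=2):
    """Fixed-length PsePSSM-style vector from an (L, 20) PSSM.

    Assumes the matrix is already normalized. Output length is
    20 * (1 + max_lag): 20 column means, then for each lag 1..max_lag
    the mean squared difference between rows i and i + lag, per column.
    """
    pssm = np.asarray(pssm, dtype=float)
    length = pssm.shape[0]
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = [pssm.mean(axis=0)]
    for lag in range(1, max_lag + 1):
        diff = pssm[:-lag] - pssm[lag:]
        features.append((diff ** 2).mean(axis=0))
    return np.concatenate(features)

# Usage: proteins of different lengths map to vectors of the same size.
rng = np.random.default_rng(1)
short_vec = pse_pssm(rng.normal(size=(30, 20)))
long_vec = pse_pssm(rng.normal(size=(250, 20)))
```

The point of such an encoding is that a variable-length PSSM becomes a fixed-length vector while retaining some sequence-order information through the lag terms.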
12. StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning.
- Author
-
Rukh G, Akbar S, Rehman G, Alarfaj FK, and Zou Q
- Subjects
- Computational Biology methods, Machine Learning, Algorithms, Wavelet Analysis, Support Vector Machine, Databases, Protein, Position-Specific Scoring Matrices, Antioxidants chemistry, Proteins chemistry, Proteins metabolism
- Abstract
Background: Antioxidant proteins are involved in several biological processes and can protect DNA and cells from the damage of free radicals. These proteins regulate the body's oxidative stress and perform a significant role in many antioxidant-based drugs. The current in vitro-based medications are costly, time-consuming, and unable to efficiently screen and identify the targeted motif of antioxidant proteins. Methods: In this study, we propose an accurate prediction method for discriminating antioxidant proteins, named StackedEnC-AOP. The training sequences are encoded by incorporating a discrete wavelet transform (DWT) into the evolutionary matrix, decomposing the PSSM-based images via two levels of DWT to form a pseudo position-specific scoring matrix (PsePSSM-DWT) based embedded vector. Additionally, the evolutionary difference formula and composite physicochemical properties methods are employed to collect the structural and sequential descriptors. The combined vector of sequential features, evolutionary descriptors, and physicochemical properties is then produced to cover the flaws of individual encoding schemes. To reduce the computational cost of the combined feature vector, the optimal features are chosen using minimum redundancy and maximum relevance (mRMR). The optimal feature vector is trained using a stacking-based ensemble meta-model. Results: Our developed StackedEnC-AOP method reported a prediction accuracy of 98.40% and an AUC of 0.99 on the training sequences. On an independent set, the trained StackedEnC-AOP model achieved an accuracy of 96.92% and an AUC of 0.98. Conclusion: Our proposed StackedEnC-AOP strategy performed significantly better than current computational models, with ~5% and ~3% improved accuracy on the training and independent sets, respectively. The efficacy and consistency of our proposed StackedEnC-AOP make it a valuable tool for data scientists, and it can play a key role in academic research and drug design. (© 2024. The Author(s).)
- Published
- 2024
13. Hybrid framework for membrane protein type prediction based on the PSSM.
- Author
-
Ruan X, Xia S, Li S, Su Z, and Yang J
- Subjects
- Machine Learning, Deep Learning, Databases, Protein, Humans, Algorithms, Membrane Proteins chemistry, Neural Networks, Computer, Computational Biology methods, Position-Specific Scoring Matrices
- Abstract
Membrane proteins are considered the major source of drug targets and are indispensable for drug design and disease prevention. However, traditional biomechanical experiments are costly and time-consuming; thus, many computational methods for predicting membrane protein types are gaining popularity. The position-specific scoring matrix (PSSM) is an excellent method for describing the evolutionary information of protein sequences. In this study, we propose an improved capsule neural network (ICNN) model based on a capsule neural network to acquire sufficient relevant information from the PSSM. Furthermore, accounting for the complementarity between traditional machine learning and deep learning, we propose a hybrid framework that combines both approaches to predict protein types. This framework trains 41 baseline models based on the PSSM. The optimal subset features, selected after traversal, are fused using a two-level decision-level feature fusion approach. Subsequently, comparisons are made using three combined strategies within an ensemble learning framework. The experimental results demonstrate that, relying solely on PSSM input, the proposed method not only surpasses the optimal methods by 1.52%, 2.26% and 2.67% on Dataset1, Dataset2, and Dataset3, respectively, but also exhibits superior generalizability. Furthermore, the code and dataset can be freely downloaded at https://github.com/ruanxiaoli/membrane-protein-types. (© 2024. The Author(s).)
- Published
- 2024
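As background to the PSSM that the abstract above builds on, here is a minimal sketch of how a log-odds PSSM can be derived from a set of aligned sequences. It is a textbook-style simplification, not the procedure used in the paper: it assumes a uniform background frequency and a flat pseudocount, whereas PSSMs used in practice are usually produced by profile-search tools with sequence weighting and empirical backgrounds. The names `build_pssm` and `AMINO_ACIDS` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log2-odds PSSM from equal-length aligned sequences.

    Returns one dict per alignment column, mapping each amino acid to
    log2(observed frequency / background), with a uniform background
    (an assumption) and a flat pseudocount. Gap characters are ignored.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

# Usage: residues seen at a position score positive, unseen ones negative.
pssm = build_pssm(["ACD", "ACE", "ACD"])
```

Positive scores mark residues that occur at a position more often than the background predicts, which is the sense in which a PSSM carries evolutionary information.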
14. Functionally conserved enhancers with divergent sequences in distant vertebrates
- Author
-
Yang, Song, Oksenberg, Nir, Takayama, Sachiko, Heo, Seok-Jin, Poliakov, Alexander, Ahituv, Nadav, Dubchak, Inna, and Boffelli, Dario
- Subjects
Biological Sciences, Bioinformatics and Computational Biology, Genetics, Human Genome, Biotechnology, Underpinning research, 1.1 Normal biological development and functioning, Generic health relevance, Animals, Animals, Genetically Modified, Binding Sites, Computational Biology, Conserved Sequence, Enhancer Elements, Genetic, Evolution, Molecular, Gene Expression Profiling, Genetic Variation, Genome-Wide Association Study, Nucleotide Motifs, Position-Specific Scoring Matrices, Protein Binding, Reproducibility of Results, Transcription Factors, Vertebrates, Information and Computing Sciences, Medical and Health Sciences, Bioinformatics, Biological sciences, Biomedical and clinical sciences
- Abstract
Background: To examine the contributions of sequence and function conservation in the evolution of enhancers, we systematically identified enhancers whose sequences are not conserved among distant groups of vertebrate species, but have homologous function and are likely to be derived from a common ancestral sequence. Our approach combined comparative genomics and epigenomics to identify potential enhancer sequences in the genomes of three groups of distantly related vertebrate species. Results: We searched for sequences that were conserved within groups of closely related species but not between groups of more distant species, and were associated with an epigenetic mark of enhancer activity. To facilitate inferring orthology between non-conserved sequences, we limited our search to introns whose orthology could be unambiguously established by mapping the bracketing exons. We show that a subset of these non-conserved but syntenic sequences from the mouse and zebrafish genomes have homologous functions in a zebrafish transgenic enhancer assay. The conserved expression patterns driven by these enhancers are probably associated with short transcription factor-binding motifs present in the divergent sequences. Conclusions: We have identified numerous potential enhancers with divergent sequences but a conserved function. These results indicate that selection on function, rather than sequence, may be a common mode of enhancer evolution; evidence for selection at the sequence level is not a necessary criterion to define a gene regulatory element.
- Published
- 2015
15. σ54-dependent regulome in Desulfovibrio vulgaris Hildenborough
- Author
-
Kazakov, Alexey E, Rajeev, Lara, Chen, Amy, Luning, Eric G, Dubchak, Inna, Mukhopadhyay, Aindrila, and Novichkov, Pavel S
- Subjects
Biochemistry and Cell Biology, Biological Sciences, Genetics, Bacterial Proteins, Binding Sites, Cluster Analysis, DNA-Binding Proteins, Desulfovibrio vulgaris, Enhancer Elements, Genetic, Gene Expression Regulation, Bacterial, Nucleotide Motifs, Phylogeny, Position-Specific Scoring Matrices, Promoter Regions, Genetic, Protein Binding, Sigma Factor, Transcription Factors, Type III Secretion Systems, Transcription factor, Transcriptional regulation, Sigma factor, Enhancer binding proteins, Information and Computing Sciences, Medical and Health Sciences, Bioinformatics, Biological sciences, Biomedical and clinical sciences
- Abstract
Background: The σ(54) subunit controls a unique class of promoters in bacteria. Such promoters, without exception, require enhancer binding proteins (EBPs) for transcription initiation. Desulfovibrio vulgaris Hildenborough, a model bacterium for sulfate reduction studies, has a high number of EBPs, more than most sequenced bacteria. The cellular processes regulated by many of these EBPs remain unknown. Results: To characterize the σ(54)-dependent regulome of D. vulgaris Hildenborough, we identified EBP binding motifs and regulated genes by a combination of computational and experimental techniques. These predictions were supported by our reconstruction of σ(54)-dependent promoters by comparative genomics. We reassessed and refined the results of earlier studies on regulation in D. vulgaris Hildenborough and consolidated them with our new findings. It allowed us to reconstruct the σ(54) regulome in D. vulgaris Hildenborough. This regulome includes 36 regulons that consist of 201 coding genes and 4 non-coding RNAs, and is involved in nitrogen, carbon and energy metabolism, regulation, transmembrane transport and various extracellular functions. To the best of our knowledge, this is the first report of direct regulation of alanine dehydrogenase, pyruvate metabolism genes and type III secretion system by σ(54)-dependent regulators. Conclusions: The σ(54)-dependent regulome is an important component of transcriptional regulatory network in D. vulgaris Hildenborough and related free-living Deltaproteobacteria. Our study provides a representative collection of σ(54)-dependent regulons that can be used for regulation prediction in Deltaproteobacteria and other taxa.
- Published
- 2015
16. NF2 Loss Promotes Oncogenic RAS-Induced Thyroid Cancers via YAP-Dependent Transactivation of RAS Proteins and Sensitizes Them to MEK Inhibition.
- Author
-
Garcia-Rendueles, Maria, Ricarte-Filho, Julio, Untch, Brian, Landa, Iñigo, Knauf, Jeffrey, Voza, Francesca, Smith, Vicki, Ganly, Ian, Persaud, Yogindra, Oler, Gisele, Fang, Yuqiang, Jhanwar, Suresh, Viale, Agnes, Heguy, Adriana, Huberman, Kety, Giancotti, Filippo, Ghossein, Ronald, Fagin, James, and Taylor, Barry
- Subjects
Animals ,Binding Sites ,Cell Cycle Proteins ,Cell Line ,Tumor ,Cell Transformation ,Neoplastic ,Chromosome Deletion ,Chromosomes ,Human ,Pair 22 ,DNA Copy Number Variations ,Disease Models ,Animal ,Drug Resistance ,Neoplasm ,Gene Deletion ,Gene Expression Regulation ,Neoplastic ,Gene Order ,Gene Targeting ,Genes ,ras ,Humans ,Mice ,Mice ,Transgenic ,Mitogen-Activated Protein Kinases ,Models ,Biological ,Neoplasm Staging ,Neurofibromin 2 ,Nuclear Proteins ,Nucleotide Motifs ,Position-Specific Scoring Matrices ,Promoter Regions ,Genetic ,Protein Binding ,Protein Kinase Inhibitors ,Signal Transduction ,Thyroid Neoplasms ,Transcription Factors ,Transcriptional Activation - Abstract
Ch22q LOH is preferentially associated with RAS mutations in papillary and in poorly differentiated thyroid cancer (PDTC). The 22q tumor suppressor NF2, encoding merlin, is implicated in this interaction because of its frequent loss of function in human thyroid cancer cell lines. Nf2 deletion or Hras mutation is insufficient for transformation, whereas their combined disruption leads to murine PDTC with increased MAPK signaling. Merlin loss induces RAS signaling in part through inactivation of Hippo, which activates a YAP-TEAD transcriptional program. We find that the three RAS genes are themselves YAP-TEAD1 transcriptional targets, providing a novel mechanism of promotion of RAS-induced tumorigenesis. Moreover, pharmacologic disruption of YAP-TEAD with verteporfin blocks RAS transcription and signaling and inhibits cell growth. The increased MAPK output generated by NF2 loss in RAS-mutant cancers may inform therapeutic strategies, as it generates greater dependency on the MAPK pathway for viability. SIGNIFICANCE: Intensification of mutant RAS signaling through copy-number imbalances is commonly associated with transformation. We show that NF2/merlin inactivation augments mutant RAS signaling by promoting YAP/TEAD-driven transcription of oncogenic and wild-type RAS, resulting in greater MAPK output and increased sensitivity to MEK inhibitors.
- Published
- 2015
17. Comparative validation of the D. melanogaster modENCODE transcriptome annotation
- Author
-
Chen, Zhen-Xia, Sturgill, David, Qu, Jiaxin, Jiang, Huaiyang, Park, Soo, Boley, Nathan, Suzuki, Ana Maria, Fletcher, Anthony R, Plachetzki, David C, FitzGerald, Peter C, Artieri, Carlo G, Atallah, Joel, Barmina, Olga, Brown, James B, Blankenburg, Kerstin P, Clough, Emily, Dasgupta, Abhijit, Gubbala, Sai, Han, Yi, Jayaseelan, Joy C, Kalra, Divya, Kim, Yoo-Ah, Kovar, Christie L, Lee, Sandra L, Li, Mingmei, Malley, James D, Malone, John H, Mathew, Tittu, Mattiuzzo, Nicolas R, Munidasa, Mala, Muzny, Donna M, Ongeri, Fiona, Perales, Lora, Przytycka, Teresa M, Pu, Ling-Ling, Robinson, Garrett, Thornton, Rebecca L, Saada, Nehad, Scherer, Steven E, Smith, Harold E, Vinson, Charles, Warner, Crystal B, Worley, Kim C, Wu, Yuan-Qing, Zou, Xiaoyan, Cherbas, Peter, Kellis, Manolis, Eisen, Michael B, Piano, Fabio, Kionte, Karin, Fitch, David H, Sternberg, Paul W, Cutter, Asher D, Duff, Michael O, Hoskins, Roger A, Graveley, Brenton R, Gibbs, Richard A, Bickel, Peter J, Kopp, Artyom, Carninci, Piero, Celniker, Susan E, Oliver, Brian, and Richards, Stephen
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Genetics ,Human Genome ,Biotechnology ,Generic health relevance ,Animals ,Cluster Analysis ,Computational Biology ,Drosophila melanogaster ,Evolution ,Molecular ,Exons ,Female ,Gene Expression Profiling ,Genome ,Insect ,Humans ,Male ,Molecular Sequence Annotation ,Nucleotide Motifs ,Phylogeny ,Position-Specific Scoring Matrices ,Promoter Regions ,Genetic ,RNA Editing ,RNA Splice Sites ,RNA Splicing ,Reproducibility of Results ,Transcription Initiation Site ,Transcriptome ,Medical and Health Sciences ,Bioinformatics - Abstract
Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.
- Published
- 2014
18. A Novel Ensemble Learning-Based Computational Method to Predict Protein-Protein Interactions from Protein Primary Sequences
- Author
-
Jie Pan, Shiwei Wang, Changqing Yu, Liping Li, Zhuhong You, and Yanmei Sun
- Subjects
protein–protein interaction ,Discrete Hilbert transform ,rotation forest ,position-specific scoring matrices ,Biology (General) ,QH301-705.5 - Abstract
Protein–protein interactions (PPIs) are crucial for understanding cellular processes, including signal cascades, DNA transcription, metabolic cycles, and repair. In the past decade, a multitude of high-throughput methods have been introduced to detect PPIs. However, these techniques are time-consuming, laborious, and always suffer from high false negative rates. Therefore, there is a great need for new computational methods as a supplemental tool for PPI prediction. In this article, we present a novel sequence-based model to predict PPIs that combines the Discrete Hilbert transform (DHT) and Rotation Forest (RoF). This method contains three stages: firstly, the Position-Specific Scoring Matrix (PSSM) was adopted to transform the amino acid sequence into a PSSM matrix, which can contain rich information about protein evolution. Then, the 400-dimensional DHT descriptor was constructed for each protein pair. Finally, these feature descriptors were fed to the RoF classifier for identifying the potential PPI class. When exploring the proposed model on the Yeast, Human, and Oryza sativa PPI datasets, it yielded excellent prediction accuracies of 91.93, 96.35, and 94.24%, respectively. In addition, we also conducted numerous experiments on cross-species PPI datasets, and the predictive capacity of our method was also excellent. To further assess the prediction ability of the proposed approach, we present the comparison of RoF with four powerful classifiers, including Support Vector Machine (SVM), Random Forest (RF), K-nearest Neighbor (KNN), and AdaBoost. We also compared it with some existing state-of-the-art works. These comprehensive experimental results further confirm the excellence and feasibility of the proposed approach. In future work, we hope it can be a supplemental tool for proteomics analysis.
- Published
- 2022
19. Polysaccharides utilization in human gut bacterium Bacteroides thetaiotaomicron: comparative genomics reconstruction of metabolic and regulatory networks.
- Author
-
Ravcheev, Dmitry A, Godzik, Adam, Osterman, Andrei L, and Rodionov, Dmitry A
- Subjects
Gastrointestinal Tract ,Humans ,Bacteroides ,Polysaccharides ,Bacterial Proteins ,Trans-Activators ,Transcription Factors ,Genomics ,Phylogeny ,Gene Expression Regulation ,Bacterial ,Binding Sites ,Base Sequence ,Gene Regulatory Networks ,Metabolic Networks and Pathways ,Position-Specific Scoring Matrices ,Nucleotide Motifs ,Regulatory network ,Regulon ,Transcription factor ,BACTEROIDES ,Carbohydrate utilization ,Gene Expression Regulation ,Bacterial ,Biological Sciences ,Information and Computing Sciences ,Medical and Health Sciences ,Bioinformatics - Abstract
Background: Bacteroides thetaiotaomicron, a predominant member of the human gut microbiota, is characterized by its ability to utilize a wide variety of polysaccharides using the extensive saccharolytic machinery that is controlled by an expanded repertoire of transcription factors (TFs). The availability of genomic sequences for multiple Bacteroides species opens an opportunity for their comparative analysis to enable characterization of their metabolic and regulatory networks. Results: A comparative genomics approach was applied for the reconstruction and functional annotation of the carbohydrate utilization regulatory networks in 11 Bacteroides genomes. Bioinformatics analysis of promoter regions revealed putative DNA-binding motifs and regulons for 31 orthologous TFs in the Bacteroides. Among the analyzed TFs there are 4 SusR-like regulators, 16 AraC-like hybrid two-component systems (HTCSs), and 11 regulators from other families. Novel DNA motifs of HTCSs and SusR-like regulators in the Bacteroides have the common structure of direct repeats with a long spacer between two conserved sites. Conclusions: The inferred regulatory network in B. thetaiotaomicron contains 308 genes encoding polysaccharide and sugar catabolic enzymes, carbohydrate-binding and transport systems, and TFs. The analyzed TFs control pathways for utilization of host and dietary glycans to monosaccharides and their further interconversions to intermediates of the central metabolism. The reconstructed regulatory network allowed us to suggest and refine specific functional assignments for sugar catabolic enzymes and transporters, providing a substantial improvement to the existing metabolic models for B. thetaiotaomicron. The obtained collection of reconstructed TF regulons is available in the RegPrecise database (http://regprecise.lbl.gov).
- Published
- 2013
20. Fine-scale mapping of the FGFR2 breast cancer risk locus: putative functional variants differentially bind FOXA1 and E2F1.
- Author
-
Meyer, Kerstin B, O'Reilly, Martin, Michailidou, Kyriaki, Carlebur, Saskia, Edwards, Stacey L, French, Juliet D, Prathalingham, Radhika, Dennis, Joe, Bolla, Manjeet K, Wang, Qin, de Santiago, Ines, Hopper, John L, Tsimiklis, Helen, Apicella, Carmel, Southey, Melissa C, Schmidt, Marjanka K, Broeks, Annegien, Van 't Veer, Laura J, Hogervorst, Frans B, Muir, Kenneth, Lophatananon, Artitaya, Stewart-Brown, Sarah, Siriwanarangsan, Pornthep, Fasching, Peter A, Lux, Michael P, Ekici, Arif B, Beckmann, Matthias W, Peto, Julian, Dos Santos Silva, Isabel, Fletcher, Olivia, Johnson, Nichola, Sawyer, Elinor J, Tomlinson, Ian, Kerin, Michael J, Miller, Nicola, Marme, Federick, Schneeweiss, Andreas, Sohn, Christof, Burwinkel, Barbara, Guénel, Pascal, Truong, Thérèse, Laurent-Puig, Pierre, Menegaux, Florence, Bojesen, Stig E, Nordestgaard, Børge G, Nielsen, Sune F, Flyger, Henrik, Milne, Roger L, Zamora, M Pilar, Arias, Jose I, Benitez, Javier, Neuhausen, Susan, Anton-Culver, Hoda, Ziogas, Argyrios, Dur, Christina C, Brenner, Hermann, Müller, Heiko, Arndt, Volker, Stegmaier, Christa, Meindl, Alfons, Schmutzler, Rita K, Engel, Christoph, Ditsch, Nina, Brauch, Hiltrud, Brüning, Thomas, Ko, Yon-Dschun, GENICA Network, Nevanlinna, Heli, Muranen, Taru A, Aittomäki, Kristiina, Blomqvist, Carl, Matsuo, Keitaro, Ito, Hidemi, Iwata, Hiroji, Yatabe, Yasushi, Dörk, Thilo, Helbig, Sonja, Bogdanova, Natalia V, Lindblom, Annika, Margolin, Sara, Mannermaa, Arto, Kataja, Vesa, Kosma, Veli-Matti, Hartikainen, Jaana M, Chenevix-Trench, Georgia, kConFab Investigators, Australian Ovarian Cancer Study Group, Wu, Anna H, Tseng, Chiu-Chen, Van Den Berg, David, Stram, Daniel O, Lambrechts, Diether, Thienpont, Bernard, Christiaens, Marie-Rose, Smeets, Ann, Chang-Claude, Jenny, Rudolph, Anja, Seibold, Petra, Flesch-Janys, Dieter, and Radice, Paolo
- Subjects
GENICA Network ,kConFab Investigators ,Australian Ovarian Cancer Study Group ,Cell Line ,Tumor ,Humans ,Breast Neoplasms ,Case-Control Studies ,Chromatin Immunoprecipitation ,Chromosome Mapping ,Gene Expression Regulation ,Neoplastic ,RNA Interference ,Binding Sites ,Protein Binding ,Haplotypes ,Alleles ,Female ,E2F1 Transcription Factor ,Receptor ,Fibroblast Growth Factor ,Type 2 ,Hepatocyte Nuclear Factor 3-alpha ,Promoter Regions ,Genetic ,Genetic Loci ,Position-Specific Scoring Matrices ,Genetic Association Studies ,Asian People ,White People ,Black People ,Cancer ,Human Genome ,Breast Cancer ,Genetics ,Prevention ,2.1 Biological and endogenous factors ,Aetiology ,Biological Sciences ,Medical and Health Sciences ,Genetics & Heredity - Abstract
The 10q26 locus in the second intron of FGFR2 is the locus most strongly associated with estrogen-receptor-positive breast cancer in genome-wide association studies. We conducted fine-scale mapping in case-control studies genotyped with a custom chip (iCOGS), comprising 41 studies (n = 89,050) of European ancestry, 9 Asian ancestry studies (n = 13,983), and 2 African ancestry studies (n = 2,028) from the Breast Cancer Association Consortium. We identified three statistically independent risk signals within the locus. Within risk signals 1 and 3, genetic analysis identified five and two variants, respectively, highly correlated with the most strongly associated SNPs. By using a combination of genetic fine mapping, data on DNase hypersensitivity, and electrophoretic mobility shift assays to study protein-DNA binding, we identified rs35054928, rs2981578, and rs45631563 as putative functional SNPs. Chromatin immunoprecipitation showed that FOXA1 preferentially bound to the risk-associated allele (C) of rs2981578 and was able to recruit ERα to this site in an allele-specific manner, whereas E2F1 preferentially bound the risk variant of rs35054928. The risk alleles were preferentially found in open chromatin and bound by Ser5 phosphorylated RNA polymerase II, suggesting that the risk alleles are associated with changes in transcription. Chromatin conformation capture demonstrated that the risk region was able to interact with the promoter of FGFR2, the likely target gene of this risk region. A role for FOXA1 in mediating breast cancer susceptibility at this locus is consistent with the finding that the FGFR2 risk locus primarily predisposes to estrogen-receptor-positive disease.
- Published
- 2013
21. Selective regulation of lymphopoiesis and leukemogenesis by individual zinc fingers of Ikaros
- Author
-
Schjerven, Hilde, McLaughlin, Jami, Arenzana, Teresita L, Frietze, Seth, Cheng, Donghui, Wadsworth, Sarah E, Lawson, Gregory W, Bensinger, Steven J, Farnham, Peggy J, Witte, Owen N, and Smale, Stephen T
- Subjects
Biotechnology ,Rare Diseases ,Genetics ,Hematology ,Cancer ,Underpinning research ,1.1 Normal biological development and functioning ,Inflammatory and immune system ,Animals ,B-Lymphocytes ,Base Sequence ,Binding Sites ,Cell Differentiation ,Cell Transformation ,Neoplastic ,Chromatin Immunoprecipitation ,Cluster Analysis ,Fusion Proteins ,bcr-abl ,Gene Expression Profiling ,Gene Expression Regulation ,Germ-Line Mutation ,High-Throughput Nucleotide Sequencing ,Ikaros Transcription Factor ,Immunophenotyping ,Leukemia ,Lymphoma ,Lymphopoiesis ,Mice ,Mice ,Knockout ,Molecular Sequence Data ,Nucleotide Motifs ,Phenotype ,Position-Specific Scoring Matrices ,Protein Binding ,Thymocytes ,Immunology - Abstract
C2H2 zinc fingers are found in several key transcriptional regulators in the immune system. However, these proteins usually contain more fingers than are needed for sequence-specific DNA binding, which suggests that different fingers regulate different genes and functions. Here we found that mice lacking finger 1 or finger 4 of Ikaros exhibited distinct subsets of the hematological defects of Ikaros-null mice. Most notably, the two fingers controlled different stages of lymphopoiesis, and finger 4 was selectively required for tumor suppression. The distinct defects support the hypothesis that only a small number of genes that are targets of Ikaros are critical for each of its biological functions. The subcategorization of functions and target genes by mutagenesis of individual zinc fingers will facilitate efforts to understand how zinc-finger transcription factors regulate development, immunity and disease.
- Published
- 2013
22. EcoCyc: fusing model organism databases with systems biology
- Author
-
Keseler, Ingrid M, Mackie, Amanda, Peralta-Gil, Martin, Santos-Zavaleta, Alberto, Gama-Castro, Socorro, Bonavides-Martínez, César, Fulcher, Carol, Huerta, Araceli M, Kothari, Anamika, Krummenacker, Markus, Latendresse, Mario, Muñiz-Rascado, Luis, Ong, Quang, Paley, Suzanne, Schröder, Imke, Shearer, Alexander G, Subhraveti, Pallavi, Travers, Mike, Weerasinghe, Deepika, Weiss, Verena, Collado-Vides, Julio, Gunsalus, Robert P, Paulsen, Ian, and Karp, Peter D
- Subjects
Genetics ,Human Genome ,Infection ,Binding Sites ,Databases ,Genetic ,Escherichia coli K12 ,Escherichia coli Proteins ,Gene Expression Regulation ,Bacterial ,Internet ,Membrane Transport Proteins ,Models ,Genetic ,Molecular Sequence Annotation ,Phenotype ,Position-Specific Scoring Matrices ,Promoter Regions ,Genetic ,Systems Biology ,Transcription Factors ,Transcription ,Genetic ,Environmental Sciences ,Biological Sciences ,Information and Computing Sciences ,Developmental Biology - Abstract
EcoCyc (http://EcoCyc.org) is a model organism database built on the genome sequence of Escherichia coli K-12 MG1655. Expert manual curation of the functions of individual E. coli gene products in EcoCyc has been based on information found in the experimental literature for E. coli K-12-derived strains. Updates to EcoCyc content continue to improve the comprehensive picture of E. coli biology. The utility of EcoCyc is enhanced by new tools available on the EcoCyc web site, and the development of EcoCyc as a teaching tool is increasing the impact of the knowledge collected in EcoCyc.
- Published
- 2013
23. Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models.
- Author
-
Chu H and Liu T
- Subjects
- Computational Biology methods, Drug Discovery methods, Position-Specific Scoring Matrices, Databases, Protein, Humans, Algorithms, Proteins metabolism
- Abstract
Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model.
- Published
- 2024
24. DPI_CDF: druggable protein identifier using cascade deep forest.
- Author
-
Arif M, Fang G, Ghulam A, Musleh S, and Alam T
- Subjects
- Amino Acid Sequence, Position-Specific Scoring Matrices, Biological Evolution, Computational Biology methods, Proteins, Software
- Abstract
Background: Drug targets in living beings perform pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets, although accurate, is generally expensive, slow, and resource intensive. Therefore, computational methods are highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory. Methods: In this study, we developed a novel deep learning-based model DPI_CDF for predicting DPs based on protein sequence only. DPI_CDF utilizes evolutionary-based (i.e., histograms of oriented gradients for position-specific scoring matrix), physiochemical-based (i.e., component protein sequence representation), and compositional-based (i.e., normalized qualitative characteristic) properties of protein sequence to generate features. Then a hierarchical deep forest model fuses these three encoding schemes to build the proposed model DPI_CDF. Results: The empirical outcomes on 10-fold cross-validation demonstrate that the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model is further examined on an independent dataset and achieved 95.01% of maximum accuracy and 0.900 MCC. When compared to current state-of-the-art methods, DPI_CDF improves in terms of accuracy by 4.27% and 4.31% on training and testing datasets, respectively. We believe DPI_CDF will support the research community to identify druggable proteins and escalate the drug discovery process. Availability: The benchmark datasets and source codes are available in GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF.
- Published
- 2024
25. Basic leucine zipper transcription factor Hac1 binds DNA in two distinct modes as revealed by microfluidic analyses.
- Author
-
Fordyce, Polly, Pincus, David, Kimmig, Philipp, Nelson, Christopher, El-samad, Hana, Derisi, Joseph, and Walter, Peter
- Subjects
Alternative Splicing ,Base Pairing ,Base Sequence ,Basic-Leucine Zipper Transcription Factors ,Binding Sites ,DNA ,Genome ,Microfluidic Analytical Techniques ,Molecular Sequence Data ,Mutant Proteins ,Mutation ,Nucleotide Motifs ,Nucleotides ,Position-Specific Scoring Matrices ,Protein Binding ,Protein Structure ,Tertiary ,RNA ,Messenger ,Repressor Proteins ,Response Elements ,Saccharomyces cerevisiae ,Saccharomyces cerevisiae Proteins ,Sequence Homology ,Amino Acid ,Unfolded Protein Response - Abstract
A quantitative understanding of how transcription factors interact with genomic target sites is crucial for reconstructing transcriptional networks in vivo. Here, we use Hac1, a well-characterized basic leucine zipper (bZIP) transcription factor involved in the unfolded protein response (UPR) as a model to investigate interactions between bZIP transcription factors and their target sites. During the UPR, the accumulation of unfolded proteins leads to unconventional splicing and subsequent translation of HAC1 mRNA, followed by transcription of UPR target genes. Initial candidate-based approaches identified a canonical cis-acting unfolded protein response element (UPRE-1) within target gene promoters; however, subsequent studies identified a large set of Hac1 target genes lacking this UPRE-1 and containing a different motif (UPRE-2). Using a combination of unbiased and directed microfluidic DNA binding assays, we established that Hac1 binds in two distinct modes: (i) to short (6-7 bp) UPRE-2-like motifs and (ii) to significantly longer (11-13 bp) extended UPRE-1-like motifs. Using a genetic screen, we demonstrate that a region of extended homology N-terminal to the basic DNA binding domain is required for this dual site recognition. These results establish Hac1 as the first bZIP transcription factor known to adopt more than one binding mode and unify previously conflicting and discrepant observations of Hac1 function into a cohesive model of UPR target gene activation. Our results also suggest that even structurally simple transcription factors can recognize multiple divergent target sites of very different lengths, potentially enriching their downstream target repertoire.
- Published
- 2012
26. Bayesian multiple-instance motif discovery with BAMBI: inference of recombinase and transcription factor binding sites
- Author
-
Jajamovich, Guido H, Wang, Xiaodong, Arkin, Adam P, and Samoilov, Michael S
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Human Genome ,Genetics ,Generic health relevance ,Algorithms ,Bayes Theorem ,Binding Sites ,Cyclic AMP Receptor Protein ,Databases ,Nucleic Acid ,Databases ,Protein ,Nucleotide Motifs ,Position-Specific Scoring Matrices ,Recombinases ,Sequence Analysis ,DNA ,Transcription Factors ,Environmental Sciences ,Information and Computing Sciences ,Developmental Biology ,Biological sciences ,Chemical sciences ,Environmental sciences - Abstract
Finding conserved motifs in genomic sequences represents one of the essential bioinformatics problems. However, achieving high discovery performance without imposing substantial auxiliary constraints on possible motif features remains a key algorithmic challenge. This work describes BAMBI, a sequential Monte Carlo motif-identification algorithm, which is based on a position weight matrix model that does not require additional constraints and is able to estimate such motif properties as length, logo, number of instances and their locations solely on the basis of primary nucleotide sequence data. Furthermore, should biologically meaningful information about motif attributes be available, BAMBI takes advantage of this knowledge to further refine the discovery results. In practical applications, we show that the proposed approach can be used to find sites of such diverse DNA-binding molecules as the cAMP receptor protein (CRP) and Din-family site-specific serine recombinases. Results obtained by BAMBI in these and other settings demonstrate better statistical performance than any of the four widely used profile-based motif discovery methods (MEME, BioProspector with BioOptimizer, SeSiMCMC and Motif Sampler), as measured by the nucleotide-level correlation coefficient. Additionally, in the case of Din-family recombinase target site discovery, the BAMBI-inferred motif is found to be the only one functionally accurate from the underlying biochemical mechanism standpoint. C++ and Matlab code is available at http://www.ee.columbia.edu/~guido/BAMBI or http://genomics.lbl.gov/BAMBI/.
- Published
- 2011
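For readers unfamiliar with the position weight matrix model mentioned in the abstract above, the following is a minimal sketch of how such a matrix can be built from aligned sites and used to scan a sequence. It is not the BAMBI sequential Monte Carlo algorithm; the function names, the pseudocount of 1, the uniform background of 0.25, and the toy sites are all illustrative assumptions.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix (one dict per column).

    Assumes equal-length, pre-aligned sites and a uniform background.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in ALPHABET
        })
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score_window(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Hypothetical aligned sites, used only to exercise the functions.
toy_sites = ["TGTGA", "TGTGA", "TGCGA", "TGTGA"]
pwm = build_pwm(toy_sites)
score, offset = best_hit(pwm, "AAACCTGTGAGGC")
```

In this toy run the consensus `TGTGA` sits at offset 5 of the scanned sequence, so `best_hit` reports that offset. A motif discovery method has to infer the sites themselves, which this sketch takes as given.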
27. Transcription factor binding site clusters identify target genes with similar tissue-wide expression and buffer against mutations [version 2; peer review: 2 approved]
- Author
-
Ruipeng Lu and Peter K. Rogan
- Subjects
Research Article ,Articles ,Transcription factors ,position-specific scoring matrices ,chromatin ,binding sites ,gene expression profiles ,Bray-Curtis similarity ,mutation ,machine learning ,information theory - Abstract
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets using Machine Learning (ML). Methods: Bray-Curtis Similarity was used to identify genes with correlated expression patterns across 53 tissues. TF targets from knockdown experiments were also analyzed by this approach to set up the ML framework. TFBSs were selected within DNase I-accessible intervals of corresponding promoter sequences using information theory-based position weight matrices (iPWMs) for each TF. Features from information-dense clusters of TFBSs were input to ML classifiers which predict these gene targets along with their accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed in silico to examine their impact on TFBS clustering and predict changes in gene regulation. Results: The glucocorticoid receptor gene (NR3C1), whose regulation has been extensively studied, was selected to test this approach. SLC25A32 and TANK exhibited the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the best performance in detecting such genes, based on Area Under the Receiver Operating Characteristic curve (ROC). TF target gene prediction was confirmed using siRNA knockdown, which was more accurate than CRISPR/CAS9 inactivation. TFBS mutation analyses revealed that accurate target gene prediction required at least 1 information-dense TFBS cluster. Conclusions: ML based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
- Published
- 2019
28. Identifying SNARE Proteins Using an Alignment-Free Method Based on Multiscan Convolutional Neural Network and PSSM Profiles
- Author
-
Quang-Hien Kha, Quang-Thai Ho, and Nguyen Quoc Khanh Le
- Subjects
General Chemical Engineering ,Position-Specific Scoring Matrices ,Neural Networks, Computer ,General Chemistry ,Library and Information Sciences ,SNARE Proteins ,Computer Science Applications - Published
- 2022
29. Modeling Peptide-Protein Interactions by a Logo-Based Method: Application in Peptide-HLA Binding Predictions.
- Author
-
Doytchinova I, Atanasova M, Fernandez A, Moreno FJ, Koning F, and Dimitrov I
- Subjects
- Humans, Amino Acids, Molecular Biology, Position-Specific Scoring Matrices, Peptides, Celiac Disease
- Abstract
Peptide-protein interactions form a cornerstone in molecular biology, governing cellular signaling, structure, and enzymatic activities in living organisms. Improving computational models and experimental techniques to describe and predict these interactions remains an ongoing area of research. Here, we present a computational method for peptide-protein interactions' description and prediction based on leveraged amino acid frequencies within specific binding cores. Utilizing normalized frequencies, we construct quantitative matrices (QMs), termed 'logo models' derived from sequence logos. The method was developed to predict peptide binding to HLA-DQ2.5 and HLA-DQ8.1 proteins associated with susceptibility to celiac disease. The models were validated by more than 17,000 peptides demonstrating their efficacy in discriminating between binding and non-binding peptides. The logo method could be applied to diverse peptide-protein interactions, offering a versatile tool for predictive analysis in molecular binding studies.
- Published
- 2024
- Full Text
- View/download PDF
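The frequency-based quantitative matrix described in the abstract of entry 29 can be sketched roughly as follows. This is a minimal illustration and not the authors' logo method: it swaps in log-odds scores against a uniform background with a pseudocount, and the function names, the toy binding cores, and the pseudocount value are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_quantitative_matrix(cores, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per binding-core position.

    Assumes equal-length cores and a uniform background (both are
    simplifications chosen for this sketch).
    """
    length = len(cores[0])
    if any(len(core) != length for core in cores):
        raise ValueError("all binding cores must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(cores) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(length):
        counts = Counter(core[pos] for core in cores)
        # Normalized, pseudocount-smoothed frequency turned into a log-odds score.
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score_peptide(matrix, peptide):
    """Sum the per-position scores of a peptide the same length as the matrix."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the matrix length")
    return sum(column[aa] for column, aa in zip(matrix, peptide))


# Hypothetical three-residue binding cores, for illustration only.
toy_cores = ["AAK", "AAK", "ACK"]
toy_matrix = build_quantitative_matrix(toy_cores)
binder_score = score_peptide(toy_matrix, "AAK")
non_binder_score = score_peptide(toy_matrix, "WWW")
```

A peptide that matches the residues enriched in the training cores scores higher than one built from residues never seen in them, which is the property a binder/non-binder threshold would rely on.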
30. Machine learning-based model for accurate identification of druggable proteins using light extreme gradient boosting.
- Author
-
Alghushairy O, Ali F, Alghamdi W, Khalid M, Alsini R, and Asiry O
- Subjects
- Computational Biology methods, Humans, Drug Discovery methods, Amino Acids chemistry, Position-Specific Scoring Matrices, Machine Learning, Proteins chemistry, Proteins metabolism, Algorithms
- Abstract
The identification of druggable proteins (DPs) is significant for the development of new drugs, personalized medicine, understanding of disease mechanisms, drug repurposing, and economic benefits. By identifying new druggable targets, researchers can develop new therapies for a range of diseases, leading to better patient outcomes. Identification of DPs by machine learning strategies is more efficient and cost-effective than conventional methods. In this study, a computational predictor, namely Drug-LXGB, is introduced to enhance the identification of DPs. Features are discovered by composition, transition, and distribution (CTD), composition of K-spaced amino acid pair (CKSAAP), pseudo-position-specific scoring matrix (PsePSSM), and a novel descriptor, called multi-block pseudo amino acid composition (MB-PseAAC). The dimensions of CTD, CKSAAP, PsePSSM, and MB-PseAAC are integrated, and sequential forward selection is used as the feature selection algorithm. The best characteristics are provided by random forest, extreme gradient boosting, and light eXtreme gradient boosting (LXGB). The predictive analysis of these learning methods is measured via 10-fold cross-validation. The LXGB-based model achieves higher results than other existing predictors. Our novel protocol will play an active role in designing novel drugs and would be fruitful for exploring potential targets. This study will help to capture a more universal view of a potential target. Communicated by Ramaswamy H. Sarma.
- Published
- 2024
- Full Text
- View/download PDF
31. Prior knowledge facilitates low homologous protein secondary structure prediction with DSM distillation
- Author
-
Qin Wang, Jun Wei, Yuzhe Zhou, Mingzhi Lin, Ruobing Ren, Sheng Wang, Shuguang Cui, and Zhen Li
- Subjects
Statistics and Probability ,Computational Mathematics ,Computational Theory and Mathematics ,Humans ,Computational Biology ,Position-Specific Scoring Matrices ,Proteins ,Neural Networks, Computer ,Sequence Alignment ,Molecular Biology ,Biochemistry ,Protein Structure, Secondary ,Computer Science Applications
- Abstract
Motivation: Protein secondary structure prediction (PSSP) is one of the fundamental and challenging problems in the field of computational biology. Accurate PSSP relies on sufficient homologous protein sequences to build the multiple sequence alignment (MSA). Unfortunately, many proteins lack homologous sequences, which results in the low quality of MSA and poor performance. In this article, we propose the novel dynamic scoring matrix (DSM)-Distil to tackle this issue, which takes advantage of the pretrained BERT and exploits the knowledge distillation on the newly designed DSM features. Specifically, we propose the DSM to replace the widely used profile and PSSM (position-specific scoring matrix) features. DSM could automatically dig for the suitable feature for each residue, based on the original profile. Namely, DSM-Distil not only could adapt to the low homologous proteins but also is compatible with high homologous ones. Thanks to the dynamic property, DSM could adapt to the input data much better and achieve higher performance. Moreover, to compensate for low-quality MSA, we propose to generate the pseudo-DSM from a pretrained BERT model and aggregate it with the original DSM by adaptive residue-wise fusion, which helps to build richer and more complete input features. In addition, we propose to supervise the learning of low-quality DSM features using high-quality ones. To achieve this, a novel teacher–student model is designed to distill the knowledge from proteins with high homologous sequences to that of low ones. Combining all the proposed methods, our model achieves the new state-of-the-art performance for low homologous proteins. Results: Compared with the previous state-of-the-art method ‘Bagging’, DSM-Distil achieves improvements of about 5% and 7.3% for proteins with MSA count ≤30 and extremely low homologous cases, respectively. We also compare DSM-Distil with Alphafold2 which is a state-of-the-art framework for protein structure prediction. DSM-Distil outperforms Alphafold2 by 4.1% on extremely low-quality MSA on 8-state secondary structure prediction. Moreover, we release a large-scale up-to-date test dataset BC40 for low-quality MSA structure prediction evaluation. Availability and implementation: BC40 dataset: https://drive.google.com/drive/folders/15vwRoOjAkhhwfjDk6-YoKGf4JzZXIMC. HardCase dataset: https://drive.google.com/drive/folders/1BvduOr2b7cObUHy6GuEWk-aUkKJgzTUv. Code: https://github.com/qinwang-ai/DSM-Distil.
- Published
- 2022
- Full Text
- View/download PDF
32. Transcription factor binding site clusters identify target genes with similar tissue-wide expression and buffer against mutations [version 1; referees: 2 approved with reservations]
- Author
-
Ruipeng Lu and Peter K. Rogan
- Subjects
Research Article ,Articles ,Transcription factors ,position-specific scoring matrices ,chromatin ,binding sites ,gene expression profiles ,Bray-Curtis similarity ,mutation ,machine learning ,information theory
- Abstract
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets. Methods: Genes with correlated expression patterns across 53 tissues and TF targets were respectively identified from Bray-Curtis Similarity and TF knockdown experiments. Corresponding promoter sequences were reduced to DNase I-accessible intervals; TFBSs were then identified within these intervals using information theory-based position weight matrices for each TF (iPWMs) and clustered. Features from information-dense TFBS clusters predicted these genes with machine learning classifiers, which were evaluated for accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed to in silico examine their impact on cluster densities and the regulatory states of target genes. Results: We initially chose the glucocorticoid receptor gene ( NR3C1), whose regulation has been extensively studied, to test this approach. SLC25A32 and TANK were found to exhibit the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the largest area under the Receiver Operating Characteristic (ROC) curve in detecting such genes. Target gene prediction was confirmed using siRNA knockdown of TFs, which was found to be more accurate than those predicted after CRISPR/CAS9 inactivation. In-silico mutation analyses of TFBSs also revealed that one or more information-dense TFBS clusters in promoters are required for accurate target gene prediction. 
Conclusions: Machine learning based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
- Published
- 2018
- Full Text
- View/download PDF
33. Prediction of FMN Binding Sites in Electron Transport Chains Based on 2-D CNN and PSSM Profiles
- Author
-
Binh P. Nguyen and Nguyen-Quoc-Khanh Le
- Subjects
Flavin Mononucleotide ,0206 medical engineering ,02 engineering and technology ,Flavin group ,Computational biology ,Convolutional neural network ,Electron Transport ,Deep Learning ,FMN binding ,Genetics ,Position-Specific Scoring Matrices ,Binding site ,Binding Sites ,business.industry ,Chemistry ,Applied Mathematics ,Deep learning ,Computational Biology ,Electron transport chain ,Neural Networks, Computer ,Artificial intelligence ,business ,Algorithms ,020602 bioinformatics ,Biotechnology
- Abstract
Flavin mono-nucleotides (FMNs) are cofactors that hold responsibility for carrying and transferring electrons in the electron transport chain stage of cellular respiration. Without being facilitated by FMNs, energy production is stagnant due to the interruption in most of the cellular processes. Investigation on FMN's functions, therefore, can gain holistic understanding about human diseases and molecular information on drug targets. We proposed a deep learning model using a two-dimensional convolutional neural network and position specific scoring matrices that could identify FMN interacting residues with the sensitivity of 83.7 percent, specificity of 99.2 percent, accuracy of 98.2 percent, and Matthews correlation coefficients of 0.85 for an independent dataset containing 141 FMN binding sites and 1,920 non-FMN binding sites. The proposed method outperformed other previous studies using similar evaluation metrics. Our positive outcome can also promote the utilization of deep learning in dealing with various problems in bioinformatics and computational biology.
- Published
- 2021
- Full Text
- View/download PDF
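The evaluation metrics quoted in the abstract of entry 33 (sensitivity, specificity, accuracy, and the Matthews correlation coefficient) all follow from a binary confusion matrix. The sketch below shows the standard formulas. The counts passed in are hypothetical: they were chosen only so that the class sizes match the 141 / 1,920 split mentioned in the abstract, and they are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute common binary-classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # By convention MCC is reported as 0 when any marginal total is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 141 positive residues and 1,920 negative residues.
metrics = binary_metrics(tp=118, fn=23, tn=1905, fp=15)
```

With classes this imbalanced, accuracy is dominated by the negatives, which is why residue-level binding-site studies usually report MCC alongside it.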
34. iEnhancer-KL: A Novel Two-Layer Predictor for Identifying Enhancers by Position Specific of Nucleotide Composition
- Author
-
Fei Guo, Zhen Zhang, Yinuo Lyu, Jiawei Li, Wenying He, and Yijie Ding
- Subjects
Base Composition ,Support Vector Machine ,Computer science ,Applied Mathematics ,Feature extraction ,Computational Biology ,DNA ,Sequence Analysis, DNA ,Computational biology ,Identification (information) ,Dimension (vector space) ,Lasso (statistics) ,Genetics ,Position-Specific Scoring Matrices ,Transcription (software) ,Divergence (statistics) ,Enhancer ,Transcription factor ,Algorithms ,Software ,Biotechnology
- Abstract
An enhancer is a short region of DNA with the ability to recruit transcription factors and their complexes, increasing the likelihood of the transcription of a particular gene. Considering the importance of enhancers, enhancer identification is a prevailing problem in computational biology. In this paper, we propose a novel two-layer enhancer predictor called iEnhancer-KL, using computational biology algorithms to identify enhancers and then classify these enhancers into strong or weak types. Kullback-Leibler (KL) divergence is creatively taken into consideration to improve the feature extraction method PSTNPss. Then, LASSO is used to reduce the dimension of features and finally helps to get better prediction performance. Furthermore, the selected features are tested on several machine learning models, and the SVM algorithm achieves the best performance. The rigorous cross-validation indicates that our predictor is remarkably superior to the existing state-of-the-art methods with an Acc of 84.23 percent and the MCC of 0.6849 for identifying enhancers. Our code and results can be freely downloaded from https://github.com/Not-so-middle/iEnhancer-KL.git.
- Published
- 2021
- Full Text
- View/download PDF
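Entry 34 uses Kullback-Leibler (KL) divergence in a position-specific nucleotide feature. The sketch below illustrates the general idea only, under simplifying assumptions: it compares single-nucleotide frequencies per position between a positive and a negative set, whereas the paper's PSTNPss feature works on trinucleotides. The function names, the pseudocount, and the toy sequences are hypothetical.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def position_frequencies(sequences, pseudocount=1.0):
    """Smoothed nucleotide frequencies at each position of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    total = len(sequences) + pseudocount * len(NUCLEOTIDES)
    frequencies = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        frequencies.append(
            {n: (counts[n] + pseudocount) / total for n in NUCLEOTIDES}
        )
    return frequencies


def kl_divergence(p, q):
    """D_KL(p || q) in bits; the pseudocount keeps every probability non-zero."""
    return sum(p[n] * math.log2(p[n] / q[n]) for n in NUCLEOTIDES)


def per_position_kl(positives, negatives):
    """One KL value per position: how far the positive set departs from the negative set."""
    pos_freqs = position_frequencies(positives)
    neg_freqs = position_frequencies(negatives)
    return [kl_divergence(p, q) for p, q in zip(pos_freqs, neg_freqs)]


# Toy sequences that differ only at the first position.
toy_positives = ["ACGT", "ACGA", "ACGC"]
toy_negatives = ["TCGT", "GCGA", "CCGC"]
kl_profile = per_position_kl(toy_positives, toy_negatives)
```

Positions where the two sets share the same composition give a divergence of zero, so a profile like this could be used to weight or select the positions that actually separate the classes.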
35. plotnineSeqSuite: a Python package for visualizing sequence data using ggplot2 style.
- Author
-
Cao T, Li Q, Huang Y, and Li A
- Subjects
- Programming Languages, Computational Biology, Position-Specific Scoring Matrices, Artificial Intelligence, Software
- Abstract
Background: The visual sequence logo has been a hot area in the development of bioinformatics tools. ggseqlogo, written in R, has been the most popular API since it was published. With the popularity of artificial intelligence and deep learning, Python is currently the most popular programming language, and bioinformaticians have begun to shift to it. Providing APIs in Python that are similar to those in R can reduce the cost of learning a new programming language. Compared to ggplot2 in R, drawing frameworks in Python are not as easy to use. The appearance of plotnine (a Python version of ggplot2) makes it possible to unify the programming methods of bioinformatics visualization tools between R and Python. Results: Here, we introduce plotnineSeqSuite, a new plotnine-based Python package that provides a ggseqlogo-like API for programmatic drawing of sequence logos, sequence alignment diagrams and sequence histograms. To be more precise, it supports custom letters, color themes, and fonts. Moreover, the class for drawing layers is based on object-oriented design so that users can easily encapsulate and extend it. Conclusions: plotnineSeqSuite is the first ggplot2-style package to implement visualization of sequence-related graphs in Python. It enhances the uniformity of programmatic plotting between R and Python. Compared with existing tools, the categories supported by plotnineSeqSuite are much more complete. The source code of plotnineSeqSuite can be obtained on GitHub ( https://github.com/caotianze/plotnineseqsuite ) and PyPI ( https://pypi.org/project/plotnineseqsuite ), and the documentation homepage is freely available on GitHub at ( https://caotianze.github.io/plotnineseqsuite/ ). (© 2023. BioMed Central Ltd., part of Springer Nature.)
- Published
- 2023
- Full Text
- View/download PDF
36. Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier
- Author
-
Yang Li, Xue-Gang Hu, Zhu-Hong You, Li-Ping Li, Pei-Pei Li, Yan-Bin Wang, and Yu-An Huang
- Subjects
Structural Biology ,Applied Mathematics ,Leukocytes ,Humans ,Position-Specific Scoring Matrices ,Computational Biology ,Amino Acid Sequence ,Saccharomyces cerevisiae ,Biological Evolution ,Molecular Biology ,Biochemistry ,Computer Science Applications
- Abstract
Background: Self-interacting proteins (SIPs), two or more copies of the protein that can interact with each other expressed by one gene, play a central role in the regulation of most living cells and cellular functions. Although numerous SIPs data can be provided by using high-throughput experimental techniques, the experimental identification of SIPs still has several shortcomings: it is time-consuming, costly, inefficient, and inherently high in false-positive rates. Therefore, it is increasingly important to develop efficient and accurate automatic approaches as a supplement to experimental methods for assisting and accelerating the prediction of SIPs from protein sequence information. Results: In this paper, we present a novel framework, termed GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on protein evolutionary information from protein primary sequences. More specifically, we firstly convert the protein sequence into Position Specific Scoring Matrix (PSSM) containing protein sequence evolutionary information, exploiting the Position Specific Iterated BLAST (PSI-BLAST) tool. Secondly, using an efficient feature extraction approach, i.e., GLCM, we extract abstract salient and invariant feature vectors from the PSSM, and then perform a pre-processing operation, the adaptive synthetic (ADASYN) technique, to balance the SIPs dataset to generate new feature vectors for classification. Finally, we employ an efficient and reliable WSRC model to identify SIPs according to the known information of self-interacting and non-interacting proteins. Conclusions: Extensive experimental results show that the proposed approach exhibits high prediction performance with 98.10% accuracy on the yeast dataset, and 91.51% accuracy on the human dataset, which further reveals that the proposed model could be a useful tool for large-scale self-interacting protein prediction and other bioinformatics tasks detection in the future.
- Published
- 2022
- Full Text
- View/download PDF
37. Prediction of antifreeze proteins using machine learning
- Author
-
Adnan Khan, Jamal Uddin, Farman Ali, Ashfaq Ahmad, Omar Alghushairy, Ameen Banjar, and Ali Daud
- Subjects
Machine Learning ,Multidisciplinary ,Antifreeze Proteins ,Animals ,Position-Specific Scoring Matrices ,Agriculture ,alpha-Fetoproteins
- Abstract
Living organisms including fishes, microbes, and animals can live in extremely cold weather. To stay alive in cold environments, these species generate antifreeze proteins (AFPs), also referred to as ice-binding proteins. Moreover, AFPs are extensively utilized in many important fields including medical, agricultural, industrial, and biotechnological. Several predictors were constructed to identify AFPs. However, due to the sequence and structural heterogeneity of AFPs, correct identification is still a challenging task. It is highly desirable to develop a more promising predictor. In this research, a novel computational method, named AFP-LXGB, has been proposed to predict AFPs more precisely. The information is explored by Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), Position Specific Scoring Matrix-Segmentation-Autocorrelation Transformation (Sg-PSSM-ACT), and Pseudo Position Specific Scoring Matrix Tri-Slicing (PseTS-PSSM). Keeping the benefits of ensemble learning, these feature sets are concatenated into different combinations. The best feature set is selected by Extremely Randomized Tree-Recursive Feature Elimination (ERT-RFE). The models are trained by Light eXtreme Gradient Boosting (LXGB), Random Forest (RF), and Extremely Randomized Tree (ERT). Among classifiers, LXGB has obtained the best prediction results. The novel method (AFP-LXGB) improved the accuracies by 3.70% and 4.09% over the best methods. These results verified that AFP-LXGB can predict AFPs more accurately and can play a significant role in medical, agricultural, industrial, and biotechnological fields.
- Published
- 2022
- Full Text
- View/download PDF
38. Comprehensive Study on Enhancing Low-Quality Position-Specific Scoring Matrix with Deep Learning for Accurate Protein Structure Property Prediction: Using Bagging Multiple Sequence Alignment Learning
- Author
-
Junzhou Huang, Sheng Wang, Hehuan Ma, Jiaxiang Wu, and Yuzhi Guo
- Subjects
Scheme (programming language) ,Protein Conformation ,Property (programming) ,Computer science ,Context (language use) ,Protein Structure, Secondary ,03 medical and health sciences ,Deep Learning ,0302 clinical medicine ,Position (vector) ,Genetics ,Position-Specific Scoring Matrices ,Molecular Biology ,Protein secondary structure ,030304 developmental biology ,computer.programming_language ,0303 health sciences ,Multiple sequence alignment ,business.industry ,Deep learning ,Computational Biology ,Proteins ,Pattern recognition ,Computational Mathematics ,Computational Theory and Mathematics ,030220 oncology & carcinogenesis ,Modeling and Simulation ,Unsupervised learning ,Neural Networks, Computer ,Artificial intelligence ,business ,Sequence Alignment ,computer ,Algorithms
- Abstract
Accurate predictions of protein structure properties, for example, secondary structure and solvent accessibility, are essential in analyzing the structure and function of a protein. Position-specific scoring matrix (PSSM) features are widely used in the structure property prediction. However, some proteins may have low-quality PSSM features due to insufficient homologous sequences, leading to limited prediction accuracy. To address this limitation, we propose an enhancing scheme for PSSM features. We introduce the "Bagging MSA" (multiple sequence alignment) method to calculate PSSM features used to train our model, adopt a convolutional network to capture local context features and bidirectional long short-term memory for long-term dependencies, and integrate them under an unsupervised framework. Structure property prediction models are then built upon such enhanced PSSM features for more accurate predictions. Moreover, we develop two frameworks to evaluate the effectiveness of the enhanced PSSM features, which also bring proposed method into real-world scenarios. Empirical evaluation of CB513, CASP11, and CASP12 data sets indicates that our unsupervised enhancing scheme indeed generates more informative PSSM features for structure property prediction.
- Published
- 2021
- Full Text
- View/download PDF
39. A deep learning-based method for the prediction of DNA interacting residues in a protein
- Author
-
Sumeet Patiyal, Anjali Dhall, and Gajendra P S Raghava
- Subjects
DNA-Binding Proteins ,Deep Learning ,Position-Specific Scoring Matrices ,DNA ,Databases, Protein ,Molecular Biology ,Information Systems
- Abstract
DNA–protein interaction is one of the most crucial interactions in the biological system, which decides the fate of many processes such as transcription, regulation and splicing of genes. In this study, we trained our models on a training dataset of 646 DNA-binding proteins having 15 636 DNA interacting and 298 503 non-interacting residues. Our trained models were evaluated on an independent dataset of 46 DNA-binding proteins having 965 DNA interacting and 9911 non-interacting residues. All proteins in the independent dataset have less than 30% of sequence similarity with proteins in the training dataset. A wide range of traditional machine learning and deep learning (1D-CNN) techniques-based models have been developed using binary, physicochemical properties and Position-Specific Scoring Matrix (PSSM)/evolutionary profiles. In the case of machine learning technique, eXtreme Gradient Boosting-based model achieved a maximum area under the receiver operating characteristics (AUROC) curve of 0.77 on the independent dataset using PSSM profile. Deep learning-based model achieved the highest AUROC of 0.79 on the independent dataset using a combination of all three profiles. We evaluated the performance of existing methods on the independent dataset and observed that our proposed method outperformed all the existing methods. In order to facilitate scientific community, we developed standalone software and web server, which are accessible from https://webs.iiitd.edu.in/raghava/dbpred.
- Published
- 2022
40. Top-Down Crawl: a method for the ultra-rapid and motif-free alignment of sequences with associated binding metrics
- Author
-
Brendon H Cooper, Tsu-Pei Chiu, and Remo Rohs
- Subjects
Statistics and Probability ,Computational Mathematics ,Binding Sites ,Computational Theory and Mathematics ,Position-Specific Scoring Matrices ,Sequence Analysis, DNA ,Molecular Biology ,Biochemistry ,Software ,Computer Science Applications ,Protein Binding
- Abstract
Summary: Several high-throughput protein–DNA binding methods currently available produce highly reproducible measurements of binding affinity at the level of the k-mer. However, understanding where a k-mer is positioned along a binding site sequence depends on alignment. Here, we present Top-Down Crawl (TDC), an ultra-rapid tool designed for the alignment of k-mer level data in a rank-dependent and position weight matrix (PWM)-independent manner. As the framework only depends on the rank of the input, the method can accept input from many types of experiments (protein binding microarray, SELEX-seq, SMiLE-seq, etc.) without the need for specialized parameterization. Measuring the performance of the alignment using multiple linear regression with 5-fold cross-validation, we find TDC to perform as well as or better than computationally expensive PWM-based methods. Availability and implementation: TDC can be run online at https://topdowncrawl.usc.edu or locally as a python package available through pip at https://pypi.org/project/TopDownCrawl. Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2022
41. 3DCONS-DB: A Database of Position-Specific Scoring Matrices in Protein Structures.
- Author
-
Sanchez-Garcia, Ruben, Sorzano, Carlos Oscar Sanchez, Carazo, Jose Maria, and Segura, Joan
- Subjects
- *MEDICAL databases, *PROTEIN structure
- Abstract
Many studies have used position-specific scoring matrices (PSSM) profiles to characterize residues in protein structures and to predict a broad range of protein features. Moreover, PSSM profiles of Protein Data Bank (PDB) entries have been recalculated in many works for different purposes. Although the computational cost of calculating a single PSSM profile is affordable, many statistical studies or machine learning-based methods used thousands of profiles to achieve their goals, thereby leading to a substantial increase of the computational cost. In this work we present a new database compiling PSSM profiles for the proteins of the PDB. Currently, the database contains 333,532 protein chain profiles involving 123,135 different PDB entries. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
42. A convolutional network and attention mechanism-based approach to predict protein-RNA binding residues.
- Author
-
Li K, Wu H, Yue Z, Sun Y, and Xia C
- Subjects
- Protein Binding, Position-Specific Scoring Matrices, RNA chemistry, Amino Acids metabolism
- Abstract
Protein-RNA interactions play a key role in various biological cellular processes, and many experimental and computational studies have been initiated to analyze their interactions. However, experimental determination is quite complex and expensive. Therefore, researchers have worked to develop efficient computational tools to detect protein-RNA binding residues. The accuracy of existing methods is limited by the features of the target and the performance of the computational models; there remains room for improvement. To solve the problem of the accurate detection of protein-RNA binding residues, we propose a convolutional network model named PBRPre based on improved MobileNet. First, by extracting the position information of the target complex and the 3-mer amino acid feature data, the position-specific scoring matrix (PSSM) is improved by using spatial neighbor smoothing processing and discrete wavelet transform to fully exploit the spatial structure information of the target and enrich the feature dataset. Second, the deep learning model MobileNet is used to integrate and optimize the potential features in the target complexes; then, by introducing the Vision Transformer (ViT) network classification layer, the deep-level information of the target is mined to enhance the processing ability of the model for global information and to improve the detection accuracy of the classifiers. The results show that the AUC value of the model can reach 0.866 in the independent testing dataset, which shows that PBRPre can effectively realize the detection of protein-RNA binding residues. All datasets and resource codes of PBRPre are available at https://github.com/linglewu/PBRPre for academic use. Competing Interests: Declaration of Competing Interest. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. (Copyright © 2023 The Authors. Published by Elsevier Ltd. All rights reserved.)
- Published
- 2023
- Full Text
- View/download PDF
43. Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices.
- Author
-
Liu D and Steinegger M
- Subjects
- Position-Specific Scoring Matrices, Sequence Alignment, Sequence Analysis, Software, Algorithms, Proteins
- Abstract
Motivation: Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions). Results: We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm reaches over 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long read datasets, despite large gaps and low sequence identities. Availability and Implementation: Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner. (© The Author(s) 2023. Published by Oxford University Press.)
- Published
- 2023
- Full Text
- View/download PDF
44. Integrating reduced amino acid composition into PSSM for improving copper ion-binding protein prediction.
- Author
-
Liu S, Liang Y, Li J, Yang S, Liu M, Liu C, Yang D, and Zuo Y
- Subjects
- Position-Specific Scoring Matrices, Algorithms, Amino Acids chemistry, Databases, Protein, Computational Biology methods, Copper, Proteins chemistry
- Abstract
Copper ion-binding proteins play an essential role in metabolic processes and are critical factors in many diseases, such as breast cancer, lung cancer, and Menkes disease. Many algorithms have been developed for predicting metal ion classification and binding sites, but none have been applied to copper ion-binding proteins. In this study, we developed a copper ion-bound protein classifier, RPCIBP, which integrates the reduced amino acid composition into the position-specific scoring matrix (PSSM). The reduced amino acid composition filters out a large number of useless evolutionary features, improving the operational efficiency and predictive ability of the model (feature dimension from 2900 to 200, ACC from 83% to 85.1%). Compared with the basic model using only three sequence feature extraction methods (ACC in training set between 73.8% and 86.2%, ACC in test set between 69.3% and 87.5%), the model integrating the evolutionary features of the reduced amino acid composition showed higher accuracy and robustness (ACC in training set between 83.1% and 90.8%, ACC in test set between 79.1% and 91.9%). The best copper ion-binding protein classifiers filtered by the feature selection process were deployed in a user-friendly web server (http://bioinfor.imu.edu.cn/RPCIBP). RPCIBP can accurately predict copper ion-binding proteins, which is convenient for further structural and functional studies, and conducive to mechanism exploration and target drug development. Competing Interests: Declaration of competing interest. I would like to declare on behalf of my co-authors that the work described was original research that has not been published previously, and not under consideration for publication elsewhere, in whole or in part. No conflict of interest exists in the submission of this manuscript. (Copyright © 2023 Elsevier B.V. All rights reserved.)
- Published
- 2023
- Full Text
- View/download PDF
45. Deep-AGP: Prediction of angiogenic protein by integrating two-dimensional convolutional neural network with discrete cosine transform.
- Author
-
Ali F, Alghamdi W, Almagrabi AO, Alghushairy O, Banjar A, and Khalid M
- Subjects
- Position-Specific Scoring Matrices, Neural Networks, Computer, Machine Learning
- Abstract
Angiogenic proteins (AGPs) play a primary role in the formation of new blood vessels from pre-existing ones. AGPs have diverse applications in cancer, including serving as biomarkers, guiding anti-angiogenic therapies, and aiding in tumor imaging. Understanding the role of AGPs in cardiovascular and neurodegenerative diseases is vital for developing new diagnostic tools and therapeutic approaches. Considering the significance of AGPs, in this research we established, for the first time, a computational model using deep learning for identifying AGPs. First, we constructed a sequence-based dataset. Second, we explored features by designing a novel feature encoder, called position-specific scoring matrix-decomposition-discrete cosine transform (PSSM-DC-DCT), and existing descriptors including Dipeptide Deviation from Expected Mean (DDE) and bigram-position-specific scoring matrix (Bi-PSSM). Third, each feature set is fed into a two-dimensional convolutional neural network (2D-CNN) and machine learning classifiers. Finally, the performance of each learning model is validated by 10-fold cross-validation (CV). The experimental results demonstrate that the 2D-CNN with the proposed novel feature descriptor achieved the highest success rate on both training and testing datasets. In addition to being an accurate predictor for identification of angiogenic proteins, our proposed method (Deep-AGP) might be fruitful in understanding cancer, cardiovascular, and neurodegenerative diseases, development of their novel therapeutic methods and drug designing. Competing Interests: Declaration of competing interest. Authors intend no competing interest. (Copyright © 2023 Elsevier B.V. All rights reserved.)
- Published
- 2023
- Full Text
- View/download PDF
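Illustrative note on entry 45: the abstract names a discrete cosine transform (DCT) applied to PSSM-derived features but does not spell out the encoder. The sketch below is not the authors' PSSM-DC-DCT. It only shows, under stated assumptions, how a DCT-II taken along the sequence axis can compress a variable-length PSSM into a fixed-length vector. The function names `dct2` and `pssm_dct_features`, the `keep` parameter, and the toy matrix are hypothetical.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1-D sequence of numbers."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def pssm_dct_features(pssm, keep=2):
    """Keep the first `keep` DCT coefficients of every PSSM column.

    `pssm` is a list of rows, one row per residue position. A real PSSM
    has 20 columns (one per amino acid); the toy example uses 3.
    Assumption: the protein has at least `keep` residues.
    """
    features = []
    for column in zip(*pssm):  # iterate over columns (amino-acid types)
        features.extend(dct2(column)[:keep])
    return features


# Toy 4-residue "PSSM" with 3 columns (hypothetical values).
toy_pssm = [
    [1, 0, 2],
    [1, 1, 2],
    [1, 2, 2],
    [1, 3, 2],
]
features = pssm_dct_features(toy_pssm, keep=2)
```

Because low-order DCT coefficients capture the slowly varying part of each column, proteins of different lengths map to vectors of the same size (number of columns times `keep`), which is what a fixed-input classifier needs.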
46. Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells
- Author
-
Valentina Boeva
- Subjects
Binding Sites ,Transcription Factors ,ChIP-seq ,Position-Specific Scoring Matrices ,motif discovery ,regulation of gene transcription ,Genetics ,QH426-470
- Abstract
Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA- and RNA-associated proteins, splice sites and so on. Often, these structured patterns can be formalized as motifs and described using a proper mathematical model such as a position weight matrix or an IUPAC consensus. Two key tasks are typically carried out for motifs in the context of the analysis of genomic sequences: identification, in a set of DNA regions, of over-represented motifs from a particular motif database, and de novo discovery of over-represented motifs. Here we describe existing methodology to perform these two tasks for motifs characterizing transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow for the prediction of transcription factor binding events and enable identification of transcriptional regulators and co-regulators. The usefulness of motif analysis is further exemplified in this review by how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with information on gene expression, allows insights into physical mechanisms of transcriptional modulation.
- Published
- 2016
- Full Text
- View/download PDF
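Illustrative note on entry 46: the abstract refers to position weight matrices as a motif model. The following is a minimal sketch of how a log-odds matrix can be built from counts and slid along a DNA sequence. It is not taken from the review; the count matrix, the uniform background, the pseudocount of 1, and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical counts for a 4-bp motif: one dict per motif position.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_pwm(counts, background, pseudocount=1.0):
    """Turn per-position counts into log2(probability / background) scores."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background[base])
            for base in column
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; returns (offset, score) pairs.

    Assumes the sequence contains only A, C, G, T (other symbols raise KeyError).
    """
    width = len(pwm)
    return [
        (start, sum(pwm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


pwm = log_odds_pwm(COUNTS, BACKGROUND)
hits = scan("TTACGTGG", pwm)
best_offset, best_score = max(hits, key=lambda hit: hit[1])
```

In practice a score threshold, often derived from a p-value, decides which windows count as predicted binding sites; the sketch only reports the best-scoring window and scans one strand.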
47. Integrative approach for detecting membrane proteins
- Author
-
Gregory Butler and Munira Alballa
- Subjects
Computer science ,Feature extraction ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,k-nearest neighbors algorithm ,Structural Biology ,Prediction model ,Integrative approach ,Machine learning ,Position-Specific Scoring Matrices ,Amino Acids ,Databases, Protein ,Molecular Biology ,Integral membrane protein ,lcsh:QH301-705.5 ,Surface-bound membrane proteins ,business.industry ,Research ,Applied Mathematics ,Amino acid composition ,Membrane ,Membrane Proteins ,Pattern recognition ,Transmembrane ,Transmembrane protein ,Computer Science Applications ,ROC Curve ,Membrane protein ,lcsh:Biology (General) ,Area Under Curve ,Membrane topology ,lcsh:R858-859.7 ,Artificial intelligence ,DNA microarray ,business ,Algorithms ,Integral membrane proteins
- Abstract
Background: Membrane proteins are key gates that control various vital cellular functions. Membrane proteins are often detected using transmembrane topology prediction tools. While transmembrane topology prediction tools can detect integral membrane proteins, they do not address surface-bound proteins. In this study, we focused on finding the best techniques for distinguishing all types of membrane proteins. Results: This research first demonstrates the shortcomings of merely using transmembrane topology prediction tools to detect all types of membrane proteins. Then, the performance of various feature extraction techniques in combination with different machine learning algorithms was explored. The experimental results obtained by cross-validation and independent testing suggest that applying an integrative approach that combines the results of transmembrane topology prediction and position-specific scoring matrix (Pse-PSSM) optimized evidence-theoretic k nearest neighbor (OET-KNN) predictors yields the best performance. Conclusion: The integrative approach outperforms the state-of-the-art methods in terms of accuracy and MCC, reaching an accuracy of 92.51% in independent testing, compared to the 89.53% and 79.42% accuracies achieved by the state-of-the-art methods.
- Published
- 2020
48. Succinylation Site Prediction Based on Protein Sequences Using the IFS-LightGBM (BO) Model
- Author
-
Xinyi Qin, Min Liu, Guangzhong Liu, and Lu Zhang
- Subjects
Article Subject ,Computer science ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Succinic Acid ,Feature selection ,Bayesian optimization algorithm ,Models, Biological ,General Biochemistry, Genetics and Molecular Biology ,Machine Learning ,03 medical and health sciences ,Succinylation ,0302 clinical medicine ,Protein structure ,Animals ,Humans ,Position-Specific Scoring Matrices ,Amino Acid Sequence ,Databases, Protein ,Pseudo amino acid composition ,030304 developmental biology ,0303 health sciences ,Binding Sites ,General Immunology and Microbiology ,business.industry ,Lysine ,Applied Mathematics ,Computational Biology ,Proteins ,Bayes Theorem ,Pattern recognition ,General Medicine ,Matthews correlation coefficient ,Lysine residue ,Models, Chemical ,Modeling and Simulation ,Artificial intelligence ,business ,Protein Processing, Post-Translational ,Classifier (UML) ,Algorithms ,030217 neurology & neurosurgery ,Research Article
- Abstract
Succinylation is an important posttranslational modification of proteins, which plays a key role in protein conformation regulation and cellular function control. Many studies have shown that succinylation of protein lysine residues is closely related to the occurrence of many diseases. To understand the mechanism of succinylation in depth, it is necessary to identify succinylation sites in proteins accurately. In this study, we develop a new model, IFS-LightGBM (BO), which utilizes the incremental feature selection (IFS) method, the LightGBM feature selection method, the Bayesian optimization algorithm, and the LightGBM classifier, to predict succinylation sites in proteins. Specifically, pseudo amino acid composition (PseAAC), position-specific scoring matrix (PSSM), disorder status, and Composition of k-spaced Amino Acid Pairs (CKSAAP) are first employed to extract feature information. Then, the combination of the LightGBM feature selection method and the IFS method is used to select the optimal feature subset for the LightGBM classifier. Finally, to increase prediction accuracy and reduce the computation load, the Bayesian optimization algorithm is used to optimize the parameters of the LightGBM classifier. The results reveal that the IFS-LightGBM (BO)-based prediction model performs better when it is evaluated by some common metrics, such as accuracy, recall, precision, Matthews Correlation Coefficient (MCC), and F-measure.
- Published
- 2020
- Full Text
- View/download PDF
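Illustrative note on entry 48: incremental feature selection (IFS), as the term is commonly used, evaluates a classifier on the top 1, top 2, ... top n features of a ranked list and keeps the prefix with the best score. The sketch below shows only that loop. The `evaluate` callback stands in for a cross-validated classifier, and the ranking, the gain values, and all names are hypothetical rather than taken from the paper, where the ranking would come from LightGBM and the evaluation from a LightGBM classifier.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return the best-scoring prefix of a ranked feature list and its score.

    `ranked_features` is ordered from most to least important.
    `evaluate` maps a list of feature names to a score (higher is better).
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:  # strict comparison keeps the smallest best prefix
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical stand-in for cross-validated accuracy: a baseline of 0.5 plus
# a made-up contribution per feature (the last feature adds noise).
TOY_GAIN = {"pssm": 0.20, "pseaac": 0.10, "cksaap": 0.05, "disorder": -0.02}


def toy_evaluate(subset):
    return 0.5 + sum(TOY_GAIN[name] for name in subset)


ranking = ["pssm", "pseaac", "cksaap", "disorder"]
selected, score = incremental_feature_selection(ranking, toy_evaluate)
```

The loop costs one model evaluation per prefix, so with thousands of features a step size larger than one is a common shortcut.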
49. Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss
- Author
-
Wei Wang, Shicai Fan, Kai Zhang, Vu Ngo, Mengchi Wang, and David Y. Wang
- Subjects
Mean squared error ,Computer science ,Sequence analysis ,0206 medical engineering ,02 engineering and technology ,Information loss ,Investigations ,Biology ,sequence logo ,Information theory ,03 medical and health sciences ,0302 clinical medicine ,Methods, Technology, & Resources ,Genetics ,Consensus sequence ,Humans ,Position-Specific Scoring Matrices ,Nucleotide ,Binding site ,information theory ,030304 developmental biology ,chemistry.chemical_classification ,0303 health sciences ,motif ,Genome, Human ,business.industry ,Pattern recognition ,Sequence Analysis, DNA ,Mutual information ,Amino acid ,Sequence logo ,transcription factor binding ,chemistry ,consensus ,Human genome ,Motif (music) ,Artificial intelligence ,business ,Algorithm ,Algorithms ,020602 bioinformatics ,030217 neurology & neurosurgery ,Transcription Factors
- Abstract
Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, representing motifs by wildcard-style consensus sequences is compact and sufficient for interpreting the motif information and searching for motif matches. Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized alphabets. Here we show that this representation provides a simple and efficient way to identify the binding sites of 1156 common TFs in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves 0.81 area under the precision-recall curve, significantly (p-value < 0.01) outperforming all existing methods, including maximal positional weight, Douglas and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification. Availability: Motto is freely available at http://wanglab.ucsd.edu/star/motto.
- Published
- 2020
- Full Text
- View/download PDF
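Illustrative note on entry 49: to make "wildcard-style consensus sequence" concrete, the sketch below converts a position probability matrix into IUPAC nucleotide codes with a simple cumulative-coverage rule. This is a naive heuristic chosen for illustration, not the Motto algorithm, which the abstract describes as minimizing information loss using Jensen-Shannon divergence. The 0.8 coverage cutoff, the function name, and the example matrix are assumptions.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def naive_consensus(ppm, coverage=0.8):
    """At each position, add bases from most to least probable until their
    summed probability reaches `coverage`, then emit the matching IUPAC code."""
    letters = []
    for column in ppm:
        chosen, covered = set(), 0.0
        ranked = sorted(column.items(), key=lambda item: item[1], reverse=True)
        for base, probability in ranked:
            chosen.add(base)
            covered += probability
            if covered >= coverage:
                break
        letters.append(IUPAC[frozenset(chosen)])
    return "".join(letters)


# Hypothetical 4-position probability matrix.
example_ppm = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},  # strongly A
    {"A": 0.45, "C": 0.05, "G": 0.45, "T": 0.05},  # A or G
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative
    {"A": 0.10, "C": 0.50, "G": 0.05, "T": 0.35},  # C or T
]
consensus = naive_consensus(example_ppm)
```

A fixed cutoff treats every position alike, which is the kind of arbitrary choice an information-theoretic criterion is meant to replace.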
50. GTB-PPI: Predict Protein–protein Interactions Based on L1-regularized Logistic Regression and Gradient Tree Boosting
- Author
-
Hongyan Zhou, Bin Yu, Cheng Chen, Bingqiang Liu, and Qin Ma
- Subjects
Boosting (machine learning) ,L1-regularized logistic regression ,Computer science ,Feature vector ,Crossover ,Method ,Saccharomyces cerevisiae ,Computational biology ,Logistic regression ,Biochemistry ,Machine Learning ,Mice ,03 medical and health sciences ,0302 clinical medicine ,Gradient tree boosting ,Protein Interaction Mapping ,Genetics ,Feature (machine learning) ,Redundancy (engineering) ,Animals ,Humans ,Position-Specific Scoring Matrices ,Molecular Biology ,Pseudo amino acid composition ,030304 developmental biology ,Feature fusion ,0303 health sciences ,Computational Biology ,Protein–protein interaction ,Computational Mathematics ,Tree (data structure) ,Logistic Models ,030217 neurology & neurosurgery
- Abstract
Protein–protein interactions (PPIs) are of great importance for understanding genetic mechanisms, delineating disease pathogenesis, and guiding drug design. With the increase of PPI data and the development of machine learning technologies, prediction and identification of PPIs have become a research hotspot in proteomics. In this study, we propose a new prediction pipeline for PPIs based on gradient tree boosting (GTB). First, the initial feature vector is extracted by fusing pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Second, to remove redundancy and noise, we employ L1-regularized logistic regression (L1-RLR) to select an optimal feature subset. Finally, the GTB-PPI model is constructed. Five-fold cross-validation showed that GTB-PPI achieved accuracies of 95.15% and 90.47% on the Saccharomyces cerevisiae and Helicobacter pylori datasets, respectively. In addition, GTB-PPI could be applied to predict the independent test datasets for Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the one-core PPI network for CD9, and the crossover PPI network for the Wnt-related signaling pathways. The results show that GTB-PPI can significantly improve the accuracy of PPI prediction. The code and datasets of GTB-PPI can be downloaded from https://github.com/QUST-AIBBDRC/GTB-PPI/.
- Published
- 2020
- Full Text
- View/download PDF