Author: "Jia, Cangzhi" / Language: english - Searchworks@Jio Institute Digital Library Search Results

Search keyword "Jia, Cangzhi": 116 results in total

1. N-GlycoPred: A hybrid deep learning model for accurate identification of N-glycosylation sites

Author: Hu, Fengzhu, Gao, Jie, Zheng, Jia, Kwoh, Cheekeong, and Jia, Cangzhi
Published: 2024
Full Text: View/download PDF

2. An Interpretable Prediction Model for Identifying N7-Methylguanosine Sites Based on XGBoost and SHAP

Author: Bi, Yue, Xiang, Dongxu, Ge, Zongyuan, Li, Fuyi, Jia, Cangzhi, and Song, Jiangning
Published: 2020
Full Text: View/download PDF

3. AMPpred-MFA: An Interpretable Antimicrobial Peptide Predictor with a Stacking Architecture, Multiple Features, and Multihead Attention.

Author: Li, Changjiang, Zou, Quan, Jia, Cangzhi, and Zheng, Jia
Published: 2024
Full Text: View/download PDF

4. O‑GlcNAcPRED-DL: Prediction of Protein O‑GlcNAcylation Sites Based on an Ensemble Model of Deep Learning.

Author: Hu, Fengzhu, Li, Weiyu, Li, Yaoxiang, Hou, Chunyan, Ma, Junfeng, and Jia, Cangzhi
Published: 2024
Full Text: View/download PDF

5. PredLLPS_PSSM: a novel predictor for liquid–liquid protein separation identification based on evolutionary information and a deep neural network.

Author: Zhou, Shengming, Zhou, Yetong, Liu, Tian, Zheng, Jia, and Jia, Cangzhi
Subjects: DEEP learning, PROTEOMICS, PROTEIN fractionation, PHASE separation, AMINO acid sequence
Abstract: The formation of biomolecular condensates by liquid–liquid phase separation (LLPS) has become a universal mechanism for spatiotemporal coordination of biological activities in cells and has been widely observed to directly regulate the key cellular processes involved in cancer cell pathology. However, the complexity of protein sequences and the diversity of conformations are inherently disordered, which poses great challenges for LLPS protein calculations and experimental research. Herein, we proposed a novel predictor named PredLLPS_PSSM for LLPS protein identification based only on sequence evolution information. Because finding real and reliable samples is the cornerstone of building predictors, we collected anew and collated the LLPS proteins from the latest versions of three databases. By comparing the performance of the position-specific score matrix (PSSM) and word embedding, PredLLPS_PSSM combined PSSM-based information and two deep learning frameworks. Independent tests using three existing independent test datasets and two newly constructed independent test datasets demonstrated the superiority of PredLLPS_PSSM compared with state-of-the-art methods. Furthermore, we tested PredLLPS_PSSM on nine experimentally identified LLPS proteins from three insects that were not included in any of the databases. In addition, the powerful Shapley Additive exPlanation algorithm and heatmap were applied to find the most critical amino acids relevant to LLPS. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

6. TIMER is a Siamese neural network-based framework for identifying both general and species-specific bacterial promoters.

Author: Zhu, Yan, Li, Fuyi, Guo, Xudong, Wang, Xiaoyu, Coin, Lachlan J M, Webb, Geoffrey I, Song, Jiangning, and Jia, Cangzhi
Subjects: ARTIFICIAL neural networks, INTERNET servers, RNA polymerases, DNA sequencing
Abstract: Background: Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for synthesizing the gene-encoded products by bacteria to grow and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed specifically for a particular species. To date, only a few predictors are available for identifying general bacterial promoters with limited predictive performance. Results: In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with the attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

7. COPPER: an ensemble deep-learning approach for identifying exclusive virus-derived small interfering RNAs in plants.

Author: Bu, Yuanyuan, Jia, Cangzhi, Guo, Xudong, Li, Fuyi, and Song, Jiangning
Subjects: *SMALL interfering RNA, *COPPER, *PLANT RNA, *CONVOLUTIONAL neural networks, *RNA interference
Abstract: Antiviral defenses are one of the significant roles of RNA interference (RNAi) in plants. It has been reported that the host RNAi mechanism machinery can target viral RNAs for destruction because virus-derived small interfering RNAs (vsiRNAs) are found in infected host cells. Therefore, the recognition of plant vsiRNAs is the key to understanding the functional mechanisms of vsiRNAs and developing antiviral plants. In this work, we introduce a deep learning-based stacking ensemble approach, named computational prediction of plant exclusive virus-derived small interfering RNAs (COPPER), for plant vsiRNA prediction. COPPER used word2vec and fastText to generate sequence features and a hybrid deep learning framework, including a convolutional neural network, multiscale residual network and bidirectional long short-term memory network with a self-attention mechanism to enable precise predictions of plant vsiRNAs. Extensive benchmarking experiments with different sequence homology thresholds and ablation studies illustrated the comparative predictive performance of COPPER. In addition, the performance comparison with PVsiRNAPred conducted on an independent test dataset showed that COPPER significantly improved the predictive performance for plant vsiRNAs compared with other state-of-the-art methods. The datasets and source codes are publicly available at https://github.com/yuanyuanbu/COPPER. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

8. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features

Author: He, Wenying, Jia, Cangzhi, Duan, Yucong, and Zou, Quan
Published: 2018
Full Text: View/download PDF

9. Harmonic number identities via the Newton–Andrews method

Author: Wang, Weiping and Jia, Cangzhi
Published: 2014
Full Text: View/download PDF

10. Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations.

Author: Bi, Yue, Li, Fuyi, Guo, Xudong, Wang, Zhikang, Pan, Tong, Guo, Yuming, Webb, Geoffrey I, Yao, Jianhua, Jia, Cangzhi, and Song, Jiangning
Subjects: GENE regulatory networks, GENETIC regulation
Abstract: Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding of the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Despite several computational methods that have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for the multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated Clarion achieved competitive predictive performance and the weighted series method plays a crucial role in securing superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracy of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion is freely available at http://monash.bioweb.cloud.edu.au/Clarion/. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

11. Quartic theta hypergeometric series

Author: Chu, Wenchang and Jia, Cangzhi
Published: 2013
Full Text: View/download PDF

12. Some results on the Apostol–Bernoulli and Apostol–Euler polynomials

Author: Wang, Weiping, Jia, Cangzhi, and Wang, Tianming
Published: 2008
Full Text: View/download PDF

13. Abel's method on summation by parts and theta hypergeometric series

Author: Chu, Wenchang and Jia, Cangzhi
Published: 2008
Full Text: View/download PDF

14. Abel's method on summation by parts and terminating well-poised q-series identities

Author: Chu, Wenchang and Jia, Cangzhi
Published: 2007
Full Text: View/download PDF

15. Transformation and reduction formulae for double q-Clausen series of type [formula omitted]

Author: Jia, Cangzhi and Wang, Tianming
Published: 2007
Full Text: View/download PDF

16. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction.

Author: Zhang, Meng, Jia, Cangzhi, Li, Fuyi, Li, Chen, Zhu, Yan, Akutsu, Tatsuya, Webb, Geoffrey I, Zou, Quan, Coin, Lachlan J M, and Song, Jiangning
Subjects: *DEEP learning, *MACHINE learning, *DROSOPHILA melanogaster, *MICE, *CORN, *GENETIC regulation
Abstract: Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning–based approaches generally outperformed scoring function–based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

17. O-glycosylation site prediction for Homo sapiens by combining properties and sequence features with support vector machine.

Author: Zhu, Yan, Yin, Shuwan, Zheng, Jia, Shi, Yixia, and Jia, Cangzhi
Subjects: SUPPORT vector machines, HUMAN beings, DEEP learning, MACHINE learning, SOURCE code
Abstract: O-glycosylation is a protein posttranslational modification important in regulating almost all cells. It is related to a large number of physiological and pathological phenomena. Recognizing O-glycosylation sites is the key to further investigating the molecular mechanism of protein posttranslational modification. This study aimed to collect a reliable dataset on Homo sapiens and develop an O-glycosylation predictor for Homo sapiens, named Captor, through multiple features. A random undersampling method and a synthetic minority oversampling technique were employed to deal with imbalanced data. In addition, the Kruskal–Wallis (K–W) test was adopted to optimize feature vectors and improve the performance of the model. A support vector machine, due to its optimal performance, was used to train and optimize the final prediction model after a comprehensive comparison of various classifiers in traditional machine learning methods and deep learning. On the independent test set, Captor outperformed the existing O-glycosylation tool, suggesting that Captor could provide more instructive guidance for further experimental research on O-glycosylation. The source code and datasets are available at https://github.com/YanZhu06/Captor/. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

18. Positive-unlabeled learning in bioinformatics and computational biology: a brief review.

Author: Li, Fuyi, Dong, Shuangyu, Leier, André, Han, Meiya, Guo, Xudong, Xu, Jing, Wang, Xiaoyu, Pan, Shirui, Jia, Cangzhi, Zhang, Yang, Webb, Geoffrey I, Coin, Lachlan J M, Li, Chen, and Song, Jiangning
Subjects: COMPUTATIONAL biology, CLASSIFICATION algorithms, BIOINFORMATICS, MACHINE learning, SUPERVISED learning
Abstract: Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

19. Further bibasic hypergeometric transformations and their applications

Author: Liu, Hongmei, Jia, Cangzhi, and Wang, Tianming
Published: 2006
Full Text: View/download PDF

20. Formator: Predicting Lysine Formylation Sites Based on the Most Distant Undersampling and Safe-Level Synthetic Minority Oversampling.

Author: Jia, Cangzhi, Zhang, Meng, Fan, Cunshuo, Li, Fuyi, and Song, Jiangning
Abstract: Lysine formylation is a reversible type of protein post-translational modification and has been found to be involved in a myriad of biological processes, including modulation of chromatin conformation and gene expression in histones and other nuclear proteins. Accurate identification of lysine formylation sites is essential for elucidating the underlying molecular mechanisms of formylation. Traditional experimental methods are time-consuming and expensive. As such, it is desirable and necessary to develop computational methods for accurate prediction of formylation sites. In this study, we propose a novel predictor, termed Formator, for identifying lysine formylation sites from sequences information. Formator is developed using the ensemble learning (EL) strategy based on four individual support vector machine classifiers via a voting system. Moreover, the most distant undersampling and Safe-Level-SMOTE oversampling techniques were integrated to deal with the data imbalance problem of the training dataset. Four effective feature extraction methods, namely bi-profile Bayes (BPB), k-nearest neighbor (KNN), amino acid physicochemical properties (AAindex), and composition and transition (CTD) were employed to encode the surrounding sequence features of potential formylation sites. Extensive empirical studies show that Formator achieved the accuracy of 87.24 and 74.96 percent on jackknife test and the independent test, respectively. Performance comparison results on the independent test indicate that Formator outperforms current existing prediction tool, LFPred, suggesting that it has a great potential to serve as a useful tool in identifying novel lysine formylation sites and facilitating hypothesis-driven experimental efforts. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

21. Predicting the interaction biomolecule types for lncRNA: an ensemble deep learning approach.

Author: Zhang, Yu, Jia, Cangzhi, and Kwoh, Chee Keong
Subjects: *LINCRNA, *DEEP learning, *RECEIVER operating characteristic curves, *MOLECULAR interactions
Abstract: Long noncoding RNAs (lncRNAs) play significant roles in various physiological and pathological processes via their interactions with biomolecules like DNA, RNA and protein. The existing in silico methods used for predicting the functions of lncRNA mainly rely on calculating the similarity of lncRNA or investigating whether an lncRNA can interact with a specific biomolecule or disease. In this work, we explored the functions of lncRNA from a different perspective: we presented a tool for predicting the interaction biomolecule type for a given lncRNA. For this purpose, we first investigated the main molecular mechanisms of the interactions of lncRNA–RNA, lncRNA–protein and lncRNA–DNA. Then, we developed an ensemble deep learning model: lncIBTP (lncRNA Interaction Biomolecule Type Prediction). This model predicted the interactions between lncRNA and different types of biomolecules. On the 5-fold cross-validation, the lncIBTP achieves average values of 0.7042 in accuracy, 0.7903 and 0.6421 in macro-average area under receiver operating characteristic curve and precision–recall curve, respectively, which illustrates the model effectiveness. Besides, based on the analysis of the collected published data and prediction results, we hypothesized that the characteristics of lncRNAs that interacted with DNA may be different from those that interacted with only RNA. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

22. Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks.

Author: Zhu, Yan, Li, Fuyi, Xiang, Dongxu, Akutsu, Tatsuya, Song, Jiangning, and Jia, Cangzhi
Subjects: CAPSULE neural networks, CONVOLUTIONAL neural networks, DEEP learning, DNA sequencing, RNA polymerases, DROSOPHILA melanogaster
Abstract: A promoter is a region in the DNA sequence that defines where the transcription of a gene by RNA polymerase initiates, which is typically located proximal to the transcription start site (TSS). How to correctly identify the gene TSS and the core promoter is essential for our understanding of the transcriptional regulation of genes. As a complement to conventional experimental methods, computational techniques with easy-to-use platforms as essential bioinformatics tools can be effectively applied to annotate the functions and physiological roles of promoters. In this work, we propose a deep learning-based method termed Depicter (Deep learning for predicting promoter), for identifying three specific types of promoters, i.e. promoter sequences with the TATA-box (TATA model), promoter sequences without the TATA-box (non-TATA model), and indistinguishable promoters (TATA and non-TATA model). Depicter is developed based on an up-to-date, species-specific dataset which includes Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana promoters. A convolutional neural network coupled with capsule layers is proposed to train and optimize the prediction model of Depicter. Extensive benchmarking and independent tests demonstrate that Depicter achieves an improved predictive performance compared with several state-of-the-art methods. The webserver of Depicter is implemented and freely accessible at https://depicter.erc.monash.edu/. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

23. DeepTorrent: a deep learning-based approach for predicting DNA N4-methylcytosine sites.

Author: Liu, Quanzhong, Chen, Jinxiang, Wang, Yanze, Li, Shuqin, Jia, Cangzhi, Song, Jiangning, and Li, Fuyi
Subjects: DEEP learning, DNA sequencing, CONVOLUTIONAL neural networks, DNA
Abstract: DNA N4-methylcytosine (4mC) is an important epigenetic modification that plays a vital role in regulating DNA replication and expression. However, it is challenging to detect 4mC sites through experimental methods, which are time-consuming and costly. Thus, computational tools that can identify 4mC sites would be very useful for understanding the mechanism of this important type of DNA modification. Several machine learning-based 4mC predictors have been proposed in the past 3 years, although their performance is unsatisfactory. Deep learning is a promising technique for the development of more accurate 4mC site predictions. In this work, we propose a deep learning-based approach, called DeepTorrent, for improved prediction of 4mC sites from DNA sequences. It combines four different feature encoding schemes to encode raw DNA sequences and employs multi-layer convolutional neural networks with an inception module integrated with bidirectional long short-term memory to effectively learn the higher-order feature representations. Dimension reduction and concatenated feature maps from the filters of different sizes are then applied to the inception module. In addition, an attention mechanism and transfer learning techniques are also employed to train the robust predictor. Extensive benchmarking experiments demonstrate that DeepTorrent significantly improves the performance of 4mC site prediction compared with several state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

24. DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction.

Author: Zhang, Yu, Jia, Cangzhi, Fullwood, Melissa Jane, and Kwoh, Chee Keong
Subjects: *FEATURE selection, *LINCRNA, *DEEP learning, *RNA, *NON-coding RNA, *BOTTLENECKS (Manufacturing), *SPEECH processing systems
Abstract: The development of deep sequencing technologies has led to the discovery of novel transcripts. Many in silico methods have been developed to assess the coding potential of these transcripts to further investigate their functions. Existing methods perform well on distinguishing majority long noncoding RNAs (lncRNAs) and coding RNAs (mRNAs) but poorly on RNAs with small open reading frames (sORFs). Here, we present DeepCPP (deep neural network for coding potential prediction), a deep learning method for RNA coding potential prediction. Extensive evaluations on four previous datasets and six new datasets constructed in different species show that DeepCPP outperforms other state-of-the-art methods, especially on sORF type data, which overcomes the bottleneck of sORF mRNA identification by improving more than 4.31, 37.24 and 5.89% on its accuracy for newly discovered human, vertebrate and insect data, respectively. Additionally, we also revealed that discontinuous k-mer, and our newly proposed nucleotide bias and minimal distribution similarity feature selection method play crucial roles in this classification problem. Taken together, DeepCPP is an effective method for RNA coding potential prediction. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

25. PASSION: an ensemble neural network approach for identifying the binding sites of RBPs on circRNAs.

Author: Jia, Cangzhi, Bi, Yue, Chen, Jinxiang, Leier, André, Li, Fuyi, and Song, Jiangning
Subjects: *BINDING sites, *INTERNET servers, *NAIVE Bayes classification, *ARTIFICIAL neural networks, *ALGORITHMS, *SUPPORT vector machines, *FEATURE selection, *CIRCULAR RNA
Abstract: Motivation: Different from traditional linear RNAs (containing 5′ and 3′ ends), circular RNAs (circRNAs) are a special type of RNAs that have a closed ring structure. Accumulating evidence has indicated that circRNAs can directly bind proteins and participate in a myriad of different biological processes. Results: For identifying the interaction of circRNAs with 37 different types of circRNA-binding proteins (RBPs), we develop an ensemble neural network, termed PASSION, which is based on the concatenated artificial neural network (ANN) and hybrid deep neural network frameworks. Specifically, the input of the ANN is the optimal feature subset for each RBP, which has been selected from six types of feature encoding schemes through incremental feature selection and application of the XGBoost algorithm. In turn, the input of the hybrid deep neural network is a stacked codon-based scheme. Benchmarking experiments indicate that the ensemble neural network reaches the average best area under the curve (AUC) of 0.883 across the 37 circRNA datasets when compared with XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression and Naive Bayes. Moreover, each of the 37 RBP models is extensively tested by performing independent tests, with the varying sequence similarity thresholds of 0.8, 0.7, 0.6 and 0.5, respectively. The corresponding average AUC obtained are 0.883, 0.876, 0.868 and 0.883, respectively, highlighting the effectiveness and robustness of PASSION. Extensive benchmarking experiments demonstrate that PASSION achieves a competitive performance for identifying binding sites between circRNA and RBPs, when compared with several state-of-the-art methods. Availability and implementation: A user-friendly web server of PASSION is publicly accessible at http://flagship.erc.monash.edu/PASSION/. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

26. PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact.

Author: Li, Fuyi, Fan, Cunshuo, Marquez-Lago, Tatiana T, Leier, André, Revote, Jerico, Jia, Cangzhi, Zhu, Yan, Smith, A Ian, Webb, Geoffrey I, Liu, Quanzhong, Wei, Leyi, Li, Jian, and Song, Jiangning
Subjects: INTERNET servers, JAVASCRIPT programming language, WEB databases, AMINO acid sequence, DATA visualization, MORPHOLOGY, DATABASES
Abstract: Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological process. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTMs 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTMs sites and mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17 145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID equips a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites' data. We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence–structural–functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

27. Pippin: A random forest-based method for identifying presynaptic and postsynaptic neurotoxins.

Author: Li, Pengyu, Zhang, He, Zhao, Xuyang, Jia, Cangzhi, Li, Fuyi, and Song, Jiangning
Subjects: NEUROTOXIC agents, FEATURE selection, FORECASTING, K-nearest neighbor classification
Abstract: Presynaptic and postsynaptic neurotoxins are two types of neurotoxins from venomous animals and functionally important molecules in the neurosciences; however, their experimental characterization is difficult, time-consuming, and costly. Therefore, bioinformatics tools that can identify presynaptic and postsynaptic neurotoxins would be very useful for understanding their functions and mechanisms. In this study, we propose Pippin, a novel machine learning-based method that allows users to rapidly and accurately identify these two types of neurotoxins. Pippin was developed using the random forest (RF) algorithm and evaluated based on an up-to-date dataset. A variety of sequence and motif features were combined, and a two-step feature-selection algorithm was employed to characterize the optimal feature subset for presynaptic and postsynaptic neurotoxin prediction. Extensive benchmark tests illustrate that Pippin significantly improved predictive performance as compared with six other commonly used machine-learning algorithms, including the naïve Bayes classifier, Multinomial Naïve Bayes classifier (MNBC), AdaBoost, Bagging, K-nearest neighbors, and XGBoost. Additionally, we developed an online webserver for Pippin to facilitate public use. To the best of our knowledge, this is the first webserver for presynaptic and postsynaptic neurotoxin prediction. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

28. MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters.

Author: Zhang, Meng, Li, Fuyi, Marquez-Lago, Tatiana T, Leier, André, Fan, Cunshuo, Kwoh, Chee Keong, Chou, Kuo-Chen, Song, Jiangning, and Jia, Cangzhi
Subjects: INTERNET servers, PROMOTERS (Genetics), FEATURE selection, NUCLEOTIDE sequence, GENE enhancers
Abstract: Motivation: Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. Results: In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. Availability and implementation: The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

29. 4mCPred: machine learning methods for DNA N4-methylcytosine sites prediction.

Author: He, Wenying, Zou, Quan, and Jia, Cangzhi
Subjects: MACHINE learning, METHYLCYTOSINE, EPIGENETICS, DNA repair, ELECTRON-ion collisions
Abstract: Motivation: N4-methylcytosine (4mC), an important epigenetic modification formed by the action of specific methyltransferases, plays an essential role in DNA repair, expression and replication. The accurate identification of 4mC sites aids in-depth research into biological functions and mechanisms. Because experimental identification of 4mC sites is time-consuming and costly, especially given the rapid accumulation of gene sequences, supplementation with efficient computational methods is urgently needed. Results: In this study, we developed a new tool, 4mCPred, for predicting 4mC sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Escherichia coli, Geoalkalibacter subterraneus and Geobacter pickeringii. 4mCPred consists of two independent models, 4mCPred_I and 4mCPred_II, for each species. The predictive results of independent and cross-species tests demonstrated that 4mCPred_I is a useful tool. To identify position-specific trinucleotide propensity (PSTNP) and electron-ion interaction potential features, we used the F-score method to construct predictive models and to compare their PSTNP features. Compared with other existing predictors, 4mCPred achieved much higher accuracies in rigorous jackknife and independent tests. We also analyzed the importance of different features in detail. Availability and implementation: The web-server 4mCPred is accessible at http://server.malab.cn/4mCPred/index.jsp. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

30. NucPosPred: Predicting species-specific genomic nucleosome positioning via four different modes of general PseKNC.

Author: Jia, Cangzhi, Yang, Qing, and Zou, Quan
Subjects: *MOLECULAR structure of chromatin, *EUKARYOTIC cell genetics, *DNA replication, *GENOMES, *RNA splicing, *CAENORHABDITIS elegans genetics
Abstract: The nucleosome is the basic structure of chromatin in eukaryotic cells, with essential roles in the regulation of many biological processes, such as DNA transcription, replication and repair, and RNA splicing. Because of the importance of nucleosomes, the factors that determine their positioning within genomes should be investigated. High-resolution nucleosome-positioning maps are now available for organisms including Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans, enabling the identification of nucleosome positioning by application of computational tools. Here, we describe a novel predictor called NucPosPred, which was specifically designed for large-scale identification of nucleosome positioning in C. elegans and D. melanogaster genomes. NucPosPred was separately optimized for each species for four types of DNA sequence feature extraction, with consideration of two classification algorithms (gradient-boosting decision tree and support vector machine). The overall accuracy obtained with NucPosPred was 92.29% for C. elegans and 88.26% for D. melanogaster, outperforming previous methods and demonstrating the potential for species-specific prediction of nucleosome positioning. For the convenience of most experimental scientists, a web-server for the predictor NucPosPred is available at http://121.42.167.206/NucPosPred/index.jsp. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

31. O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique.

Author: Jia, Cangzhi, Zuo, Yun, and Zou, Quan
Subjects: *GENETIC algorithms, *K-means clustering, *THREONINE, *SERINE, *BAYESIAN analysis, *GENOMICS, *COMPUTER software
Abstract: Motivation: Protein O-GlcNAcylation (O-GlcNAc) is an important post-translational modification of serine (S)/threonine (T) residues that involves multiple molecular and cellular processes. Recent studies have suggested that abnormal O-GlcNAcylation causes many diseases, such as cancer and various neurodegenerative diseases. With the available protein O-GlcNAcylation sites experimentally verified, it is highly desired to develop automated methods to rapidly and effectively identify O-GlcNAcylation sites. Although some computational methods have been proposed, their performance has been unsatisfactory, particularly in terms of prediction sensitivity. Results: In this study, we developed an ensemble model O-GlcNAcPRED-II to identify potential O-GlcNAcylation sites. A K-means principal component analysis oversampling technique (KPCA) and fuzzy undersampling method (FUS) were first proposed and incorporated to reduce the proportion of the original positive and negative training samples. Then, rotation forest, a type of classifier-integrated system, was adopted to divide the eight types of feature space into several subsets using four sub-classifiers: random forest, k-nearest neighbour, naive Bayesian and support vector machine. We observed that O-GlcNAcPRED-II achieved a sensitivity of 81.05%, specificity of 95.91%, accuracy of 91.43% and Matthews correlation coefficient of 0.7928 for five-fold cross-validation run 10 times. Additionally, the results obtained by O-GlcNAcPRED-II on two independent datasets also indicated that the proposed predictor outperformed five published prediction tools. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

32. Prediction of aptamer–protein interacting pairs based on sparse autoencoder feature extraction and an ensemble classifier.

Author: Yang, Qing, Jia, Cangzhi, and Li, Taoying
Subjects: *APTAMERS, *FEATURE extraction, *BASE pairs, *FEATURE selection, *AMINO acid sequence, *SUPPORT vector machines
Abstract: Aptamer–protein interacting pairs play important roles in physiological functions and structural characterization. Identifying aptamer–protein interacting pairs is challenging and limited, despite the tremendous applications of aptamers. Therefore, it is vital to construct a high prediction performance model for identifying aptamer–target interacting pairs. In this study, a novel ensemble method is presented to predict aptamer–protein interacting pairs by integrating sequence characteristics derived from aptamers and the target proteins. The features extracted for aptamers were the compositions of amino acids and pseudo K-tuple nucleotides. In addition, a sparse autoencoder was used to characterize features for the target protein sequences. To remove redundant features, gradient boosting decision tree (GBDT) and incremental feature selection (IFS) methods were used to obtain the optimum combination of sequence characters. Based on 616 selected features, an ensemble of three sub-support vector machine (SVM) classifiers was used to construct our prediction model. Evaluated on an independent dataset, our predictor obtained an accuracy of 75.7%, Matthews Correlation Coefficient of 0.478, and Youden's Index of 0.538, which were superior to the values reached using other existing predictors. The results show that our model can be used to distinguish novel aptamer–protein interacting pairs and reveal the interrelation between aptamers and proteins. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

33. S-SulfPred: A sensitive predictor to capture S-sulfenylation sites based on a resampling one-sided selection undersampling-synthetic minority oversampling technique.

Author: Jia, Cangzhi and Zuo, Yun
Subjects: *HYDROXIDES, *CELL communication, *SENSITIVITY analysis, *BIG data, *BIOINFORMATICS
Abstract: Protein S-sulfenylation is a reversible post-translational modification involving covalent attachment of hydroxide to the thiol group of cysteine residues, which is involved in various biological processes including cell signaling, response to stress and protein functions. Herein we present S-SulfPred, a support vector machine based model to capture potential S-sulfenylation sites and improve the efficiency and relevance of experimental identification of protein S-sulfenylation sites. One-sided selection (OSS) undersampling and synthetic minority oversampling technique (SMOTE) oversampling were combined to establish balanced training datasets. This approach is shown to perform better than using only OSS or SMOTE in an independent test. The best combination of position-specific amino acid propensity and five physicochemical properties of amino acids were selected to optimize the predictor performance. Using S-SulfPred, we achieve an average sensitivity of 74.62%, and an average specificity of 71.62% on independent datasets. Compared with other published tools, S-SulfPred attains both higher sensitivity and specificity. We not only propose a highly accurate method to predict protein S-sulfenylation sites, but also provide insights that could improve the efficiency of other bioinformatics tools. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

34. EnhancerPred2.0: predicting enhancers and their strength based on position-specific trinucleotide propensity and electron–ion interaction potential feature selection.

Author: He, Wenying and Jia, Cangzhi
Published: 2017
Full Text: View/download PDF

35. Prediction of mitochondrial proteins of malaria parasite using bi-profile Bayes feature extraction

Author: Jia, Cangzhi, Liu, Tian, Chang, Alan K., and Zhai, Yingying
Subjects: *MITOCHONDRIA, *PROTEINS, *PLASMODIUM, *MALARIA, *SUPPORT vector machines, *ORGANELLES, *ANTIMALARIALS, *PLASMODIUM falciparum, *AMINO acid sequence
Abstract: Mitochondrial proteins of Plasmodium falciparum are considered as attractive targets for anti-malarial drugs, but the experimental identification of these proteins is a difficult and time-consuming task. Computational prediction of mitochondrial proteins offers an alternative approach. However, the commonly used subcellular location prediction methods are unsuited for P. falciparum mitochondrial proteins whereas the organism and organelle-specific methods were constructed on the basis of a rather small dataset. In this study, a novel dataset termed PfM233, which included 108 mitochondrial and 125 non-mitochondrial proteins with sequence similarity below 25%, was established and the methods for predicting mitochondrial proteins of P. falciparum were described. Both bi-profile Bayes and split amino acid composition were applied to extract the features from the N- and C-terminal sequences of these proteins, which were then used to construct two SVM based classifiers (PfMP-N25 and PfMP-30). Using PfM233 as the dataset, PfMP-N25 and PfMP-30 achieved accuracies (MCCs) of 90.13% (0.80) and 90.99% (0.82). When tested with the commonly used 40 mitochondrial proteins in PfM175 and the 108 mitochondrial proteins in PfM233, these two methods obviously outperformed the existing general, organelle-specific and organism and organelle-specific methods. [Copyright Elsevier]
Published: 2011
Full Text: View/download PDF

36. A high-accuracy protein structural class prediction algorithm using predicted secondary structural information

Author: Liu, Tian and Jia, Cangzhi
Subjects: *PROTEIN structure, *PREDICTION theory, *SUPPORT vector machines, *ALGORITHMS, *MATHEMATICAL models, *MATHEMATICAL programming
Abstract: One major problem with the existing algorithms for the prediction of protein structural classes is low accuracy for proteins from the α/β and α+β classes. In this study, three novel features were rationally designed to model the differences between proteins from these two classes. In combination with other rationally designed features, an 11-dimensional vector prediction method was proposed. By means of this method, the overall prediction accuracy based on the 25PDB dataset was 1.5% higher than that of the previous best-performing method, MODAS. Furthermore, the prediction accuracy for proteins from the α+β class based on the 25PDB dataset was 5% higher than that of the previous best-performing method, SCPRED. The prediction accuracies obtained with the D675 and FC699 datasets were also improved. [ABSTRACT FROM AUTHOR]
Published: 2010
Full Text: View/download PDF

37. Protein secondary structure class assignment on the basis of a new graphic representation.

Author: Jia, Cangzhi, Liu, Tian, Zhang, Xiangde, and Yan, Shijun
Subjects: *CLUSTER analysis (Statistics), *PROTEINS, *ORGANIC compounds, *AMINO acids, *AMINO compounds, *PEPTIDES
Abstract: A novel 2D representation (M-curve) has been provided to visualize the structure information of protein secondary structure sequences: (1) end point of the curve reflects difference of amino acid residue numbers in α-helices and β-strands of a protein; (2) Up/down ladder-like structures in the curve show that α-helices/β-strands appear sequentially in this region; (3) triangular-like/trapeziform-like structures of the curve show that α-helices and β-strands appear alternatively in this region. So from the M-curve, a protein could be directly assigned into corresponding secondary structure class. Moreover, a new numerical descriptor, four-component vector, is introduced and applied to cluster analysis. © 2008 Wiley Periodicals, Inc. Int J Quantum Chem, 2009 [ABSTRACT FROM AUTHOR]
Published: 2009
Full Text: View/download PDF

38. Transformation and reduction formulae for double q-Clausen hypergeometric series.

Author: Chu, Wenchang and Jia, Cangzhi
Published: 2008
Full Text: View/download PDF

39. Transformation and reduction formulae for double q-Clausen series of type $\Phi^{1:1;\mu}_{1:2;\lambda}$

Author: Jia, Cangzhi and Wang, Tianming
Subjects: Double q-Clausen series, q-Chu–Vandermonde convolution, Sears transformations, Mathematics::Classical Analysis and ODEs, Computer Science::Symbolic Computation, Basic hypergeometric series, q-Gauss summation formula
Abstract: The Sears transformations are employed to establish several general series transformations for double q-Clausen hypergeometric series of type $\Phi^{1:1;\mu}_{1:2;\lambda}$. These transformations yield further a number of reduction and summation formulae on the double basic hypergeometric series.
Full Text: View/download PDF

40. Inspector: a lysine succinylation predictor based on edited nearest-neighbor undersampling and adaptive synthetic oversampling.

Author: Zhu, Yan, Jia, Cangzhi, Li, Fuyi, and Song, Jiangning
Subjects: *POST-translational modification, *RANDOM forest algorithms, *K-nearest neighbor classification, *FEATURE extraction, *LYSINE
Abstract: Lysine succinylation is an important type of protein post-translational modification and plays a key role in regulating protein function and structural changes. The mechanism and function of succinylation have not been clarified. The key to better understanding the precise mechanism and functional role of succinylation is the identification of lysine succinylation sites. However, conventional experimental methods for succinylation identification are often expensive, time-consuming, and labor-intensive. Therefore, the development of new computational approaches to effectively identify lysine succinylation sites from sequence data is much needed. In this study, we proposed a novel predictor for lysine succinylation identification, Inspector, which was developed by using the random forest algorithm combined with a variety of sequence-based feature-encoding schemes. The edited nearest-neighbor undersampling method and the adaptive synthetic oversampling approach were employed to address dataset imbalance, and a two-step feature-selection strategy was applied to optimize the feature set and improve the accuracy of the prediction model. Empirical studies on performance comparison with existing tools showed that Inspector was able to achieve competitive predictive performance for distinguishing lysine succinylation sites. • Six feature extraction methods. • ENN undersampling and ADASYN oversampling are applied to balance the training dataset. • RF algorithm combined with a two-step feature-selection strategy to boost predictive performance. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

41. Recognition of Protein Pupylation Sites by Adopting Resampling Approach.

Author: Li, Tao, Chen, Yan, Li, Taoying, and Jia, Cangzhi
Subjects: UBIQUITINATION, POST-translational modification, PROTEIN genetics, MACHINE learning, MULTIPLE correspondence analysis (Statistics), FUZZY systems, RESAMPLING (Statistics)
Abstract: With the in-depth study of posttranslational modification sites, protein ubiquitination has become the key problem to study the molecular mechanism of posttranslational modification. Pupylation is a widely used process in which a prokaryotic ubiquitin-like protein (Pup) is attached to a substrate through a series of biochemical reactions. However, the experimental methods of identifying pupylation sites are often time-consuming and laborious. This study aims to propose an improved approach for predicting pupylation sites. Firstly, the Pearson correlation coefficient was used to reflect the correlation among different amino acid pairs calculated by the frequency of each amino acid. Secondly, according to a descending ranked order, the multiple types of features were filtered separately by values of the Pearson correlation coefficient. Thirdly, to get a qualified balanced dataset, the K-means principal component analysis (KPCA) oversampling technique was employed to synthesize new positive samples and the fuzzy undersampling method was employed to reduce the number of negative samples. Finally, the performance of our method was verified by means of jackknife and a 10-fold cross-validation test. The average results of 10-fold cross-validation showed that the sensitivity (Sn) was 90.53%, specificity (Sp) was 99.8%, accuracy (Acc) was 95.09%, and Matthews Correlation Coefficient (MCC) was 0.91. Moreover, an independent test dataset was used to further measure its performance, and the prediction results achieved an Acc of 83.75% and an MCC of 0.49, which was superior to previous predictors. The better performance and stability of our proposed method showed it is an effective way to predict pupylation sites. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

42. EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features.

Author: Jia, Cangzhi and He, Wenying
Abstract: Enhancers are cis elements that play an important role in regulating gene expression by enhancing it. Recent study of modifications revealed that enhancers are a large group of functional elements with many different subgroups, which have different biological activities and regulatory effects on target genes. As powerful auxiliary tools, several computational methods have been proposed to distinguish enhancers from other regulatory elements, but only one method has been considered to cluster them into subgroups. In this study, we developed a predictor (called EnhancerPred) to distinguish between enhancers and nonenhancers and to determine enhancers' strength. A two-step wrapper-based feature selection method was applied to the high-dimensional feature vector derived from bi-profile Bayes and pseudo-nucleotide composition. Finally, the combination of 104 features from bi-profile Bayes, 1 feature from nucleotide composition and 9 features from pseudo-nucleotide composition yielded the best performance for identifying enhancers and nonenhancers, with overall Acc of 77.39%. The combination of 89 features from bi-profile Bayes and 10 features from pseudo-nucleotide composition yielded the best performance for identifying strong and weak enhancers, with overall Acc of 68.19%. The process and steps of feature optimization illustrated that it is necessary to construct a particular model for identifying strong enhancers and weak enhancers. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

43. MAMLCDA: A Meta-Learning Model for Predicting circRNA-Disease Association Based on MAML Combined With CNN.

Author: Tian Y, Zou Q, Wang C, and Jia C
Subjects: Humans, Computational Biology methods, Machine Learning, Algorithms, Principal Component Analysis, RNA, Circular genetics, Neural Networks, Computer
Abstract: Circular RNAs (circRNAs) exist in vivo and are a class of noncoding RNA molecules. They have a single-stranded, closed, annular structure. Many studies have shown that circRNAs and diseases are linked. Therefore, it is critical to build a reliable and accurate predictor to find the circRNA-disease association. In this paper, we presented a meta-learning model named MAMLCDA to identify the circRNA-disease association, which is based on model-agnostic meta-learning (MAML) combined with CNN classification. Specifically, similarities between diseases and circRNAs are extracted and integrated to characterize their relationships, and k-means is used to cluster majority samples and select a certain number of samples from each cluster to obtain the same number of negative samples as the positive samples. To further reduce the dimension of the features and save operation time, we applied probabilistic principal component analysis (PPCA) to compact the integrated circRNA and disease similarity network feature vectors. The feature vectors are converted into images. At this time, the prediction problem is transformed into the 2-way 1-shot problem of the image and input into the model with MAML as the meta-learner and CNN as the base-learner. Comparison results of five-fold cross-validation on two benchmark datasets illustrate that MAMLCDA outperforms several state-of-the-art approaches with the best accuracies of 95.33% and 98%. Therefore, MAMLCDA can help to understand the pathogenesis of complex diseases at the circRNA level.
Published: 2024
Full Text: View/download PDF

44. Characterization of double-stranded RNA and its silencing efficiency for insects using hybrid deep-learning framework.

Author: Cheng H, Xu L, and Jia C
Abstract: RNA interference (RNAi) technology is widely used in the biological prevention and control of terrestrial insects. One of the main factors with the application of RNAi in insects is the difference in RNAi efficiency, which may vary not only in different insects, but also in different genes of the same insect, and even in different double-stranded RNAs (dsRNAs) of the same gene. This work focuses on the last question and establishes a bioinformatics software that can help researchers screen for the most efficient dsRNA targeting target genes. Among insects, the red flour beetle (Tribolium castaneum) is known to be one of the most sensitive to RNAi. From iBeetle-Base, we extracted 12 027 efficient dsRNA sequences with a lethality rate of ≥20% or with experimentation-induced phenotypic changes and processed these data to correspond to specific silence efficiency. Based on the first compiled novel benchmark dataset, we specifically designed a deep neural network to identify and characterize efficient dsRNA for RNAi in insects. The dna2vec word embedding model was trained to extract distributed feature representations, and three powerful modules, namely convolutional neural network, bidirectional long short-term memory network, and self-attention mechanism, were integrated to form our predictor model to characterize the extracted dsRNAs and their silencing efficiencies for T. castaneum. Our model dsRNAPredictor showed reliable performance in multiple independent tests based on different species, including both T. castaneum and Aedes aegypti. This indicates that dsRNAPredictor can facilitate prescreening for designing high-efficiency dsRNA targeting target genes of insects in advance. (© The Author(s) 2024. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
Published: 2024
Full Text: View/download PDF

45. GNet: An integrated context-aware neural framework for transcription factor binding signal at single nucleotide resolution prediction.

Author: Zhuang J, Feng K, Teng X, and Jia C
Subjects: Humans, Animals, Mice, Protein Binding, Chromatin, DNA, Nucleotides metabolism, Transcription Factors genetics
Abstract: Transcription factors (TFs) are important factors that regulate gene expression. Revealing the mechanism affecting the binding specificity of TFs is the key to understanding gene regulation. Most of the previous studies focus on TF-DNA binding sites at the sequence level, and they seldom utilize the contextual features of DNA sequences. In this paper, we develop an integrated spatiotemporal context-aware neural network framework, named GNet, for predicting TF-DNA binding signal at single nucleotide resolution by achieving three tasks: single nucleotide resolution signal prediction, identification of binding regions at the sequence level, and TF-DNA binding motif prediction. GNet extracts implicit spatial contextual information with a gated highway neural mechanism, which captures large context multi-level patterns using linear shortcut connections, and the idea of it permeates the encoder and decoder parts of GNet. The improved dual external attention mechanism learns implicit relationships both within and among samples and improves the performance of the model. Experimental results on 53 human TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that GNet outperforms the state-of-the-art methods in the three tasks, and the results of cross-species studies on 15 human and 18 mouse TF datasets of the corresponding TF families indicate that GNet also shows the best performance in cross-species prediction over the competitive methods.
Published: 2023
Full Text: View/download PDF

46. NeuroCNN_GNB: an ensemble model to predict neuropeptides based on a convolution neural network and Gaussian naive Bayes.

Author: Liu D, Lin Z, and Jia C
Abstract: Neuropeptides contain more chemical information than other classical neurotransmitters and have multiple receptor recognition sites. These characteristics allow neuropeptides to have a correspondingly higher selectivity for nerve receptors and fewer side effects. Traditional experimental methods, such as mass spectrometry and liquid chromatography technology, still need the support of a complete neuropeptide precursor database and the basic characteristics of neuropeptides. Incomplete neuropeptide precursor and information databases will lead to false-positives or reduce the sensitivity of recognition. In recent years, studies have proven that machine learning methods can rapidly and effectively predict neuropeptides. In this work, we have made a systematic attempt to create an ensemble tool based on four convolution neural network models. These baseline models were separately trained on one-hot encoding, AAIndex, G-gap dipeptide encoding and word2vec and integrated using Gaussian Naive Bayes (NB) to construct our predictor designated NeuroCNN_GNB. Both 5-fold cross-validation tests using benchmark datasets and independent tests showed that NeuroCNN_GNB outperformed other state-of-the-art methods. Furthermore, this novel framework provides essential interpretations that aid the understanding of model success by leveraging the powerful Shapley Additive exPlanation (SHAP) algorithm, thereby highlighting the most important features relevant for predicting neuropeptides. Competing Interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. (Copyright © 2023 Liu, Lin and Jia.)
Published: 2023
Full Text: View/download PDF

47. An efficient deep learning based predictor for identifying miRNA-triggered phasiRNA loci in plant.

Author: Bu Y, Zheng J, and Jia C
Subjects: Plant Development, RNA, Messenger, Software, MicroRNAs genetics, Deep Learning
Abstract: Phasic small interfering RNAs are plant secondary small interference RNAs that are typically generated by the convergence of miRNAs and polyadenylated mRNAs. A growing number of studies have shown that miRNA-initiated phasiRNA plays crucial roles in regulating plant growth and stress responses. Experimental verification of miRNA-initiated phasiRNA loci may take considerable time, energy and labor. Therefore, computational methods capable of processing high throughput data have been proposed one by one. In this work, we proposed a predictor (DIGITAL) for identifying miRNA-initiated phasiRNAs in plants, which combined a multi-scale residual network with a bi-directional long-short term memory network. The negative dataset was constructed based on positive data, through replacing 60% of nucleotides randomly in each positive sample. Our predictor achieved accuracies of 98.48% and 94.02%, respectively, on two independent test datasets with different sequence lengths. These independent testing results indicate the effectiveness of our model. Furthermore, DIGITAL is robust and generalizes well, and thus can be easily extended and applied for miRNA target recognition of other species. We provide the source code of DIGITAL, which is freely available at https://github.com/yuanyuanbu/DIGITAL.
Published: 2023
Full Text: View/download PDF
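
The negative-set construction described in the abstract above (randomly replacing 60% of the nucleotides in each positive sample) can be sketched as follows. This is an illustrative sketch only, not the authors' code: the function name make_negative, the ACGU alphabet, and the rule that each replaced base must differ from the original are assumptions that the abstract does not state.

```python
import random

# Assumed RNA alphabet; the abstract does not say whether sequences use U or T.
NUCLEOTIDES = "ACGU"


def make_negative(seq, fraction=0.6, rng=None):
    """Return a copy of seq with a given fraction of positions replaced.

    Hypothetical helper: positions are sampled without replacement, and each
    chosen base is swapped for a different base (an assumption).
    """
    rng = rng or random.Random()
    bases = list(seq)
    n_replace = round(len(bases) * fraction)
    # Pick distinct positions so exactly n_replace bases change.
    for i in rng.sample(range(len(bases)), n_replace):
        choices = [b for b in NUCLEOTIDES if b != bases[i]]
        bases[i] = rng.choice(choices)
    return "".join(bases)


positive = "ACGUACGUACGUACGUACGU"
negative = make_negative(positive, rng=random.Random(0))
```

Passing a seeded random.Random instance keeps the generated negative samples reproducible between runs.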

48. SPREAD: An ensemble predictor based on DNA autoencoder framework for discriminating promoters in Pseudomonas aeruginosa.

Author: Zhou S, Zheng J, and Jia C
Subjects: Promoter Regions, Genetic, DNA, Pseudomonas aeruginosa genetics, Pseudomonas aeruginosa metabolism, Bacteria
Abstract: Regulatory elements in DNA sequences, such as promoters, enhancers and terminators, are essential for gene expression in physiological and pathological processes. A promoter is the specific DNA sequence that is located upstream of the coding gene and acts as the "switch" for gene transcriptional regulation. Many promoter predictors have been developed for different bacterial species, but only a few are designed for Pseudomonas aeruginosa, a widespread Gram-negative conditional pathogen in nature. In this work, an ensemble model named SPREAD is proposed for the recognition of promoters in Pseudomonas aeruginosa. In SPREAD, the DNA sequence autoencoder model LSTM is employed to extract potential sequence information, and the mean output probability value of CNN and RF is applied as the final prediction. Compared with G4PromFinder, the only state-of-the-art classifier for promoters in Pseudomonas aeruginosa, SPREAD improves the prediction performance significantly, with an accuracy of 0.98, recall of 0.98, precision of 0.98, specificity of 0.97 and F1-score of 0.98.
Published: 2022
Full Text: View/download PDF
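
The final step described in the abstract above, taking the mean output probability of the CNN and RF models as the prediction, amounts to simple probability averaging. The sketch below is illustrative only and is not the SPREAD implementation: the function name, the example probabilities, and the 0.5 decision threshold are assumptions that the abstract does not state.

```python
def mean_probability_ensemble(p_cnn, p_rf, threshold=0.5):
    """Average two models' positive-class probabilities and threshold them.

    p_cnn and p_rf are per-sample probabilities from two already-trained
    models. The 0.5 threshold is an assumed default.
    """
    if len(p_cnn) != len(p_rf):
        raise ValueError("both models must score the same samples")
    mean = [(a + b) / 2 for a, b in zip(p_cnn, p_rf)]
    labels = [1 if p >= threshold else 0 for p in mean]
    return mean, labels


# Made-up probabilities for three sequences, for illustration only.
mean, labels = mean_probability_ensemble([0.9, 0.2, 0.6], [0.7, 0.4, 0.3])
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one on a given sample.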

49. UPFPSR: a ubiquitylation predictor for plant through combining sequence information and random forest.

Author: Yin S, Zheng J, Jia C, Zou Q, Lin Z, and Shi H
Subjects: Algorithms, Computational Biology methods, Protein Processing, Post-Translational, Ubiquitination, Lysine chemistry, Lysine metabolism, Software
Abstract: As one of the most significant protein post-translational modifications (PTMs) in eukaryotes, ubiquitylation plays an essential role in regulating diverse cellular functions, such as apoptosis, cell division, DNA repair and replication, intracellular transport and immune reactions. Traditional experimental methods are time-consuming, costly and labor-intensive. Therefore, it is highly desirable to develop automated computational methods that can recognize potential ubiquitylation sites rapidly and accurately. In this study, we propose a novel predictor, named UPFPSR, for predicting lysine ubiquitylation sites in plants. UPFPSR is developed using multiple physicochemical properties of amino acids and sequence-based statistical information. To find a suitable classification algorithm, four traditional algorithms and two deep learning networks are compared, and the random forest, which shows superior performance, is ultimately selected. Extensive benchmarking shows that UPFPSR outperforms the most advanced ubiquitylation prediction tool on each measurement indicator, with an accuracy of 77.3%, precision of 75%, recall of 81.7%, F1-score of 0.7824, and AUC of 0.84 on the independent test dataset. The results indicate that UPFPSR can provide new guidance for further experimental study on ubiquitylation. The data sets and source code used in this study are freely available at https://github.com/ysw-sunshine/UPFPSR.
Published: 2022
Full Text: View/download PDF

50. Staem5: A novel computational approach for accurate prediction of m5C site.

Author: Chai D, Jia C, Zheng J, Zou Q, and Li F
Abstract: 5-Methylcytosine (m5C) is an important post-transcriptional modification that has been extensively found in multiple types of RNAs. Many studies have shown that m5C plays vital roles in many biological functions, such as RNA structure stability and metabolism. Computational approaches act as an efficient way to identify m5C sites from high-throughput RNA sequence data and help interpret the functional mechanism of this important modification. This study proposed a novel species-specific computational approach, Staem5, to accurately predict RNA m5C sites in Mus musculus and Arabidopsis thaliana. Staem5 was developed by employing feature fusion tactics to leverage informative sequence profiles, and a stacking ensemble learning framework combined five popular machine learning algorithms. Extensive benchmarking tests demonstrated that Staem5 outperformed state-of-the-art approaches in both cross-validation and independent tests. We provide the source code of Staem5, which is publicly available at https://github.com/Cxd-626/Staem5.git. Competing Interests: The authors declare no competing interests. (© 2021 The Authors.)
Published: 2021
Full Text: View/download PDF
