Search Results (235 results)
202. Automated Detection of P. falciparum Using Machine Learning Algorithms with Quantitative Phase Images of Unstained Cells.
- Author
- Park, Han Sang, Rinehart, Matthew T., Walzer, Katelyn A., Chi, Jen-Tsan Ashley, and Wax, Adam
- Subjects
- MALARIA diagnosis; PLASMODIUM falciparum genetics; TROPHOZOITES; PARASITIC disease treatment; K-nearest neighbor classification; MACHINE learning
- Abstract
Malaria detection through microscopic examination of stained blood smears is a diagnostic challenge that relies heavily on the expertise of trained microscopists. This paper presents an automated analysis method for detection and staging of red blood cells infected by the malaria parasite Plasmodium falciparum at the trophozoite or schizont stage. Unlike previous efforts in this area, this study uses quantitative phase images of unstained cells. Erythrocytes are automatically segmented using thresholds of optical phase and refocused to enable quantitative comparison of phase images. Refocused images are analyzed to extract 23 morphological descriptors based on the phase information. While all individual descriptors are highly statistically different between infected and uninfected cells, no single descriptor separates the populations at a level satisfactory for clinical utility. To improve the diagnostic capacity, we applied various machine learning techniques, including linear discriminant classification (LDC), logistic regression (LR), and k-nearest neighbor classification (NNC), to formulate algorithms that combine all of the calculated physical parameters to distinguish cells more effectively. Results show that LDC provides the highest accuracy, up to 99.7%, in detecting schizont stage infected cells compared to uninfected RBCs. NNC showed slightly better accuracy (99.5%) than either LDC (99.0%) or LR (99.1%) for discriminating late trophozoites from uninfected RBCs. However, for early trophozoites, LDC produced the best accuracy of 98%. Discrimination of infection stage was less accurate, producing high specificity (99.8%) but only 45.0%-66.8% sensitivity, with early trophozoites most often mistaken for late trophozoites or schizonts, and late trophozoites and schizonts most often confused for each other. 
Overall, this methodology points to a significant clinical potential of using quantitative phase imaging to detect and stage malaria infection without staining or expert analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2016
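The abstract above combines many per-cell descriptors in a k-nearest-neighbor vote. A minimal sketch of that idea follows; the two-descriptor vectors and labels are invented for illustration (the paper uses 23 phase-derived descriptors):

```python
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest
    training points (Euclidean distance over all descriptors)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy descriptor vectors (e.g. mean phase, cell area) -- illustrative only.
train = [
    ((1.0, 1.1), "uninfected"), ((0.9, 1.0), "uninfected"),
    ((1.1, 0.9), "uninfected"), ((2.0, 2.2), "schizont"),
    ((2.1, 2.0), "schizont"),   ((1.9, 2.1), "schizont"),
]
print(knn_predict(train, (1.0, 1.0)))  # uninfected
print(knn_predict(train, (2.0, 2.0)))  # schizont
```

The vote over several descriptors is what lifts accuracy above what any single descriptor achieves on its own.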
203. Remaining Useful Life Prediction for Lithium-Ion Batteries Based on Gaussian Processes Mixture.
- Author
- Li, Lingling, Wang, Pengchong, Chao, Kuei-Hsiang, Zhou, Yatong, and Xie, Yang
- Subjects
- LITHIUM-ion batteries; GAUSSIAN processes; SUPPORT vector machines; ARTIFICIAL intelligence; COGNITIVE science; ARTIFICIAL neural networks; COMPUTATIONAL biology; COMPUTATIONAL neuroscience
- Abstract
The remaining useful life (RUL) prediction of Lithium-ion batteries is closely related to the capacity degeneration trajectories. Due to self-charging and capacity regeneration, the trajectories have the property of multimodality. Traditional prediction models such as support vector machines (SVM) or Gaussian Process regression (GPR) cannot accurately characterize this multimodality. This paper proposes a novel RUL prediction method based on the Gaussian Process Mixture (GPM). It handles multimodality by fitting different segments of a trajectory with different GPR models separately, such that the tiny differences among these segments can be revealed. The method is demonstrated to be effective by the predictive results of experiments on two commercial rechargeable Type 18650 Lithium-ion batteries, provided by NASA. The performance comparison among the models illustrates that the GPM is more accurate than the SVM and the GPR. In addition, GPM can yield the predictive confidence interval, which makes the prediction more reliable than that of traditional models. [ABSTRACT FROM AUTHOR]
- Published
- 2016
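The core idea of the GPM approach, fitting different segments of a capacity trajectory with separate models, can be illustrated with a minimal stand-in: split the trajectory at regeneration events (upward jumps) and fit each segment independently. Plain least-squares lines stand in for the paper's GPR experts, and the capacity values are invented:

```python
def split_segments(capacity):
    """Split a capacity trajectory wherever capacity regenerates
    (jumps upward), so each segment is monotonically degrading."""
    segments, start = [], 0
    for i in range(1, len(capacity)):
        if capacity[i] > capacity[i - 1]:      # regeneration event
            segments.append(capacity[start:i])
            start = i
    segments.append(capacity[start:])
    return segments

def fit_line(ys):
    """Ordinary least-squares slope/intercept over indices 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

cap = [1.00, 0.98, 0.96, 0.97, 0.95, 0.93]   # one regeneration at index 3
segs = split_segments(cap)
print(len(segs))                        # 2
print(round(fit_line(segs[0])[0], 3))   # -0.02 (degradation rate, segment 1)
```

A per-segment fit captures the local degradation rate that a single global model would smear across the regeneration jump.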
204. Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments.
- Author
- Han, Wenjing, Coutinho, Eduardo, Ruan, Huabin, Li, Haifeng, Schuller, Björn, Yu, Xiaojie, and Zhu, Xuan
- Subjects
- SUPERVISED learning; BLENDED learning; ANNOTATIONS; PASSIVE learning; ACTIVE learning
- Abstract
Coping with scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data that is often scarce, leading to models that do not generalize well. In this paper, we make an efficient combination of confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation in sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores; candidates with low scores are delivered to human annotators, while those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly fewer labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human labeled instances is achieved in both the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances. [ABSTRACT FROM AUTHOR]
- Published
- 2016
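The confidence-based triage at the heart of this method, machine-labeling high-confidence instances (Self-Training) and routing low-confidence ones to annotators (Active Learning), can be sketched as below; the classifier and threshold are hypothetical:

```python
def triage(instances, classifier, threshold=0.8):
    """Route each unlabeled instance: confident predictions are
    machine-labeled (self-training); the rest go to human
    annotators (active learning)."""
    auto, human = [], []
    for x in instances:
        label, conf = classifier(x)
        (auto if conf >= threshold else human).append((x, label, conf))
    return auto, human

# Hypothetical classifier returning (label, confidence in [0, 1]).
def toy_clf(x):
    return ("speech" if x > 0.5 else "music", abs(x - 0.5) * 2)

auto, human = triage([0.95, 0.55, 0.1, 0.48], toy_clf)
print(len(auto), len(human))  # 2 2
```

Only the low-confidence half reaches a human, which is where the reported reduction in annotation effort comes from.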
205. Human Detection Using Random Color Similarity Feature and Random Ferns Classifier.
- Author
- Zhang, Miaohui and Xin, Ming
- Subjects
- COLOR vision; FEATURE selection; SUPPORT vector machines; INTRACLASS correlation; BAYESIAN analysis; HISTOGRAMS
- Abstract
We explore a novel approach for human detection based on a random color similarity feature (RCS) and the random ferns classifier, also known as a semi-naive Bayesian classifier. In contrast to other existing features employed for human detection, color-based features are rarely used in vision-based human detection because of large intra-class variations. In this paper, we propose a novel color-based feature, the RCS feature, which is yielded by a simple color similarity computation between image cells randomly picked in still images, and can effectively characterize human appearances. In addition, a histogram of oriented gradient based local binary feature (HOG-LBF) is also introduced to enrich the human descriptor set. Furthermore, the random ferns classifier is used in the proposed approach because it is faster in training and testing than traditional classifiers such as the Support Vector Machine (SVM) classifier, without a loss in performance. Finally, the proposed method is evaluated on public datasets and achieves competitive detection results. [ABSTRACT FROM AUTHOR]
- Published
- 2016
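A rough sketch of the RCS idea, computing color similarity between randomly picked image cells, under the simplifying assumption that a cell is just a list of RGB pixels (the paper's actual feature construction may differ in detail):

```python
import random

def mean_color(cell):
    """Average RGB over a cell's pixels."""
    n = len(cell)
    return tuple(sum(p[c] for p in cell) / n for c in range(3))

def rcs_feature(image_cells, n_pairs=4, seed=0):
    """Random Color Similarity sketch: for randomly chosen cell
    pairs, emit a similarity in [0, 1] derived from the distance
    between their mean colors (255*sqrt(3) is the max RGB distance)."""
    rng = random.Random(seed)
    feats = []
    for _ in range(n_pairs):
        a, b = rng.sample(range(len(image_cells)), 2)
        ca, cb = mean_color(image_cells[a]), mean_color(image_cells[b])
        d = sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
        feats.append(1.0 - d / (255 * 3 ** 0.5))
    return feats

cells = [[(10, 10, 10)], [(12, 11, 10)], [(240, 240, 240)]]
f = rcs_feature(cells)
print(all(0.0 <= v <= 1.0 for v in f))  # True
```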
206. DyHAP: Dynamic Hybrid ANFIS-PSO Approach for Predicting Mobile Malware.
- Author
- Afifi, Firdaus, Anuar, Nor Badrul, Shamshirband, Shahaboddin, and Choo, Kim-Kwang Raymond
- Subjects
- MALWARE; MOBILE apps; FUZZY systems; PARTICLE swarm optimization; PROGRAM transformation
- Abstract
To deal with the large number of malicious mobile applications (e.g. mobile malware), a number of malware detection systems have been proposed in the literature. In this paper, we propose a hybrid method to find the optimum parameters that can be used to facilitate mobile malware identification. We also present a multi-agent system architecture comprising three system agents (i.e. sniffer, extraction and selection agents) to capture and manage the pcap file for the data preparation phase. In our hybrid approach, we combine an adaptive neuro-fuzzy inference system (ANFIS) and particle swarm optimization (PSO). Evaluations using data captured on a real-world Android device and the MalGenome dataset demonstrate the effectiveness of our approach, in comparison to two other hybrid optimization methods: differential evolution (ANFIS-DE) and ant colony optimization (ANFIS-ACO). [ABSTRACT FROM AUTHOR]
- Published
- 2016
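A minimal particle swarm optimizer of the kind the ANFIS-PSO hybrid relies on can be sketched as follows; the inertia and acceleration constants are generic textbook values, not the paper's settings, and the sphere objective is a toy stand-in for the ANFIS parameter-fitting error:

```python
import random

def pso(f, dim, n=20, iters=100, lo=-5.0, hi=5.0, seed=1):
    """Minimal particle swarm optimizer: particles track personal
    bests and are pulled toward the global best each step."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]
    pcost = [f(p) for p in pos]
    g = min(range(n), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = f(pos[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = pos[i][:], c
                if c < gcost:
                    gbest, gcost = pos[i][:], c
    return gbest, gcost

# Minimize a 2-D sphere function; the swarm converges near zero.
best, cost = pso(lambda p: sum(x * x for x in p), dim=2)
print(cost < 0.05)
```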
207. The LET Procedure for Prosthetic Myocontrol: Towards Multi-DOF Control Using Single-DOF Activations.
- Author
- Nowak, Markus and Castellini, Claudio
- Subjects
- ARTIFICIAL hands; DEGREES of freedom; ELECTROMYOGRAPHY; ARTIFICIAL intelligence; HUMAN-machine systems; SIGNAL processing
- Abstract
Simultaneous and proportional myocontrol of dexterous hand prostheses is to a large extent still an open problem. With the advent of commercially and clinically available multi-fingered hand prostheses there are now more independent degrees of freedom (DOFs) in prostheses than can be effectively controlled using surface electromyography (sEMG), the current standard human-machine interface for hand amputees. In particular, it is uncertain whether several DOFs can be controlled simultaneously and proportionally by exclusively calibrating the intended activation of single DOFs. The problem is currently solved by training on all required combinations. However, as the number of available DOFs grows, this approach becomes overly long and poses a high cognitive burden on the subject. In this paper we present a novel approach to overcome this problem. Multi-DOF activations are artificially modelled from single-DOF ones using a simple linear combination of sEMG signals, which are then added to the training set. This procedure, which we named LET (Linearly Enhanced Training), provides an augmented data set to any machine-learning-based intent detection system. In two experiments involving intact subjects, one offline and one online, we trained a standard machine learning approach using the full data set containing single- and multi-DOF activations as well as using the LET-augmented data set in order to evaluate the performance of the LET procedure. The results indicate that the machine trained on the latter data set obtains worse results in the offline experiment compared to the full data set. However, the online implementation enables the user to perform multi-DOF tasks with almost the same precision as single-DOF tasks without the need to explicitly train multi-DOF activations. Moreover, the parameters involved in the system are statistically uniform across subjects. [ABSTRACT FROM AUTHOR]
- Published
- 2016
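The LET construction, synthesizing multi-DOF training samples as linear combinations of single-DOF sEMG activations, reduces to a simple vector sum in the most basic sketch below; the channel values and DOF names are invented:

```python
def let_augment(single_dof):
    """LET sketch: model each two-DOF activation as the sum of the
    corresponding single-DOF sEMG vectors and add it to the
    training set under the combined label."""
    augmented = dict(single_dof)
    dofs = list(single_dof)
    for i in range(len(dofs)):
        for j in range(i + 1, len(dofs)):
            a, b = single_dof[dofs[i]], single_dof[dofs[j]]
            augmented[dofs[i] + "+" + dofs[j]] = tuple(
                x + y for x, y in zip(a, b))
    return augmented

# Hypothetical 3-channel mean sEMG activations for two single DOFs.
train = let_augment({"flex": (0.5, 0.25, 0.0), "pinch": (0.0, 0.25, 0.75)})
print(train["flex+pinch"])  # (0.5, 0.5, 0.75)
```

The augmented set then feeds any intent-detection learner without the subject ever performing the combined movement during calibration.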
208. Integrating Domain Specific Knowledge and Network Analysis to Predict Drug Sensitivity of Cancer Cell Lines.
- Author
- Kim, Sebo, Sundaresan, Varsha, Zhou, Lei, and Kahveci, Tamer
- Subjects
- CELL lines; CANCER cells; DRUG analysis; ANTINEOPLASTIC agents; MACHINE learning; ARTIFICIAL intelligence
- Abstract
One of the fundamental challenges in cancer studies is that the varying molecular characteristics of different tumor types may lead to resistance to certain drugs. As a result, the same drug can lead to significantly different results in different types of cancer, thus emphasizing the need for individualized medicine. Individual prediction of drug response has great potential to aid in improving the clinical outcome and reduce the financial costs associated with prescribing chemotherapy drugs to which the patient’s tumor might be resistant. In this paper we develop a network based classifier (NBC) method for predicting sensitivity of cell lines to anticancer drugs from transcriptome data. In the literature, this strategy has been used for predicting cancer types. Here, we extend it to estimate sensitivity of cells from different tumor types to various anticancer drugs. Furthermore, we incorporate domain specific knowledge such as the use of an apoptotic gene list and clinical dose information in our method to impart biological significance to the prediction. Our experimental results suggest that our NBC method outperforms existing classifiers in estimating sensitivity of cell lines for different drugs. [ABSTRACT FROM AUTHOR]
- Published
- 2016
209. A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions.
- Author
- Abnousi, Armen, Broschat, Shira L., and Kalyanaraman, Ananth
- Subjects
- AMINO acid sequence; CONSERVED sequences (Genetics); MACHINE learning; PROTEIN domains; SEQUENCE alignment; MOLECULAR biology
- Abstract
Background: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. Methods: In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. Results: We have compared NADDA with the Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set, an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground truth. On average NADDA shows comparable accuracy and more balanced sensitivity and specificity, and, being alignment-free, it is significantly faster. 
Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2016
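The k-mer signal NADDA builds on, exact short matches recurring across sequences marking likely conserved residues, can be sketched as follows; the sequences are toy examples, and a plain count threshold stands in for the paper's machine-learning step:

```python
from collections import Counter

def conserved_mask(seqs, k=3, min_count=2):
    """Flag residues of the first sequence that sit inside a k-mer
    occurring at least `min_count` times across all sequences, a
    crude stand-in for NADDA's exact-match k-mer signal."""
    counts = Counter(s[i:i + k] for s in seqs
                     for i in range(len(s) - k + 1))
    target = seqs[0]
    mask = [False] * len(target)
    for i in range(len(target) - k + 1):
        if counts[target[i:i + k]] >= min_count:
            for j in range(i, i + k):
                mask[j] = True
    return mask

# The shared "MKVL" motif is flagged; the variable tail is not.
seqs = ["MKVLDEAA", "GGMKVLTT", "PPMKVLQQ"]
m = conserved_mask(seqs)
print("".join("C" if b else "." for b in m))  # CCCC....
```

Because only exact k-mer counting is involved, the whole computation parallelizes naturally, which is what the MapReduce implementation exploits.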
210. Research on B Cell Algorithm for Learning to Rank Method Based on Parallel Strategy.
- Author
- Tian, Yuling and Zhang, Hongxian
- Subjects
- B cells; INFORMATION retrieval; MACHINE learning; CLONAL selection algorithms; IMMUNITY
- Abstract
For the purposes of information retrieval, users must find highly relevant documents from within a system (often quite a large one comprising many individual documents) based on an input query. Ranking the documents according to their relevance within the system to meet user needs is a challenging endeavor and a hot research topic: there already exist several rank-learning methods based on machine learning techniques which can generate ranking functions automatically. This paper proposes a parallel B cell algorithm, RankBCA, for rank learning which utilizes a clonal selection mechanism based on biological immunity. The novel algorithm is compared with traditional rank-learning algorithms through experimentation and shown to outperform the others with respect to accuracy, learning time, and convergence rate; taken together, the experimental results show that the proposed algorithm indeed effectively and rapidly identifies optimal ranking functions. [ABSTRACT FROM AUTHOR]
- Published
- 2016
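A minimal clonal-selection loop of the kind B cell algorithms build on (clone the best candidate, mutate the clones, keep improvements) might look like this; the fitness surface is a toy stand-in for ranking quality, not the paper's objective:

```python
import random

def clonal_selection(fitness, init, iters=60, clones=5, step=0.5, seed=2):
    """Minimal clonal-selection loop: repeatedly clone the current
    best candidate, mutate the clones, and keep any improvement."""
    rng = random.Random(seed)
    best = init
    best_fit = fitness(best)
    for _ in range(iters):
        for _ in range(clones):
            cand = [x + rng.uniform(-step, step) for x in best]
            f = fitness(cand)
            if f > best_fit:
                best, best_fit = cand, f
    return best, best_fit

# Toy "ranking quality" surface peaking at (1, 1), illustrative only.
best, fit = clonal_selection(lambda w: -((w[0] - 1) ** 2 + (w[1] - 1) ** 2),
                             init=[0.0, 0.0])
print(fit > -0.1)
```

The clone loop over the population is what the paper parallelizes, since each clone's mutation and evaluation is independent.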
211. Surface-Based fMRI-Driven Diffusion Tractography in the Presence of Significant Brain Pathology: A Study Linking Structure and Function in Cerebral Palsy.
- Author
- Reid, Lee B., Cunnington, Ross, Boyd, Roslyn N., and Rose, Stephen E.
- Subjects
- FUNCTIONAL magnetic resonance imaging; CEREBRAL palsy; WHITE matter (Nerve tissue); THALAMOCORTICAL system; MACHINE learning; DIAGNOSIS
- Abstract
Diffusion MRI (dMRI) tractography analyses are difficult to perform in the presence of brain pathology. Automated methods that rely on cortical parcellation for structural connectivity studies often fail, while manually defining regions is extremely time consuming and can introduce human error. Both methods also make assumptions about structure-function relationships that may not hold after cortical reorganisation. Seeding tractography with functional-MRI (fMRI) activation is an emerging method that reduces these confounds, but inherent smoothing of the fMRI signal may result in the inclusion of irrelevant pathways. This paper describes a novel fMRI-seeded dMRI-analysis pipeline based on surface meshes that reduces these issues and utilises machine learning to generate task-specific white matter pathways, minimising the requirement for manually-drawn ROIs. We directly compared this new strategy to a standard voxelwise fMRI-dMRI approach, by investigating correlations between clinical scores and dMRI metrics of thalamocortical and corticomotor tracts in 31 children with unilateral cerebral palsy. The surface-based approach successfully processed more participants (87%) than the voxel-based approach (65%), and provided significantly more coherent tractography. Significant correlations between dMRI metrics and five clinical scores of function were found for the more superior regions of these tracts. These significant correlations were stronger and more frequently found with the surface-based method (15/20 investigated were significant; R2 = 0.43–0.73) than with the voxelwise analysis (two significant correlations; R2 = 0.38 and 0.49). More restricted fMRI signal, better-constrained tractography, and the novel track-classification method all appeared to contribute toward these differences. [ABSTRACT FROM AUTHOR]
- Published
- 2016
212. Machine Learning of Protein Interactions in Fungal Secretory Pathways.
- Author
- Kludas, Jana, Arvas, Mikko, Castillo, Sandra, Pakula, Tiina, Oja, Merja, Brouard, Céline, Jäntti, Jussi, Penttilä, Merja, and Rousu, Juho
- Subjects
- MACHINE learning; PROTEIN-protein interactions; FUNGI; SECRETION; KERNEL functions
- Abstract
In this paper we apply machine learning methods for predicting protein interactions in fungal secretion pathways. We assume an inter-species transfer setting, where training data is obtained from a single species and the objective is to predict protein interactions in other, related species. In our methodology, we combine several state-of-the-art machine learning approaches, namely, multiple kernel learning (MKL), pairwise kernels and kernelized structured output prediction in the supervised graph inference framework. For MKL, we apply recently proposed centered kernel alignment and p-norm path following approaches to integrate several feature sets describing the proteins, demonstrating improved performance. For graph inference, we apply input-output kernel regression (IOKR) in supervised and semi-supervised modes as well as output kernel trees (OK3). In our experiments simulating increasing genetic distance, Input-Output Kernel Regression proved to be the most robust prediction approach. We also show that the MKL approaches improve the predictions compared to a uniform combination of the kernels. We evaluate the methods on the task of predicting protein-protein interactions in the secretion pathways of fungi, with S. cerevisiae (baker’s yeast) as the source species and T. reesei as the target of the inter-species transfer learning. We identify completely novel candidate secretion proteins conserved in filamentous fungi. These proteins could contribute to their unique secretion capabilities. [ABSTRACT FROM AUTHOR]
- Published
- 2016
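Kernel alignment, one of the quantities MKL methods use to score candidate kernels, is simple to compute. The sketch below implements the plain (uncentered) alignment on tiny hand-written Gram matrices; note that the paper uses the centered variant:

```python
def frobenius(K1, K2):
    """Frobenius inner product of two Gram matrices."""
    return sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))

def alignment(K1, K2):
    """Kernel alignment <K1,K2>_F / (|K1|_F |K2|_F), a cosine-like
    similarity between kernels used to weight them in MKL."""
    return frobenius(K1, K2) / (frobenius(K1, K1) ** 0.5
                                * frobenius(K2, K2) ** 0.5)

I2 = [[1.0, 0.0], [0.0, 1.0]]
print(round(alignment(I2, I2), 3))                        # 1.0
print(round(alignment(I2, [[1.0, 1.0], [1.0, 1.0]]), 3))  # 0.707
```

A kernel well aligned with the ideal (label-derived) kernel receives a larger weight in the combination.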
213. Detection and Classification of Measurement Errors in Bioimpedance Spectroscopy.
- Author
- Ayllón, David, Gil-Pita, Roberto, and Seoane, Fernando
- Subjects
- IMPEDANCE spectroscopy; MEASUREMENT errors; SPECTROMETERS; ROBUST control; MACHINE learning; ELECTRIC capacity
- Abstract
Bioimpedance spectroscopy (BIS) measurement errors may be caused by parasitic stray capacitance, impedance mismatch, cross-talk or, very likely, their combination. Accurate detection and identification are of extreme importance for further analysis because, in some cases and for some applications, certain measurement artifacts can be corrected, minimized or even avoided. In this paper we present a robust method to detect the presence of measurement artifacts and identify what kind of measurement error is present in BIS measurements. The method is based on supervised machine learning and uses a novel set of generalist features for measurement characterization in different immittance planes. Experimental validation has been carried out using a database of complex-spectra BIS measurements obtained from different BIS applications and containing six different types of errors, as well as error-free measurements. The method obtained a low classification error (0.33%) and has shown good generalization. Since both the features and the classification schema are relatively simple, the implementation of this pre-processing task in the current hardware of bioimpedance spectrometers is possible. [ABSTRACT FROM AUTHOR]
- Published
- 2016
214. Ensemble Feature Learning of Genomic Data Using Support Vector Machine.
- Author
- Anaissi, Ali, Goyal, Madhu, Catchpoole, Daniel R., Braytee, Ali, and Kennedy, Paul J.
- Subjects
- GENOMES; LEARNING; GENES; GENE expression; RECURSIVE functions
- Abstract
The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) method for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy that is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models from randomly drawn bootstrap samples of the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate a feature is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over the random forest-based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters within the selected data. [ABSTRACT FROM AUTHOR]
- Published
- 2016
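The ensemble step of ESVM-RFE, ranking features on several bootstrap samples and aggregating the rankings before eliminating the worst feature, can be sketched as follows; a variance-based ranker stands in for the paper's SVM-weight ranking, and the data is invented:

```python
import random

def ensemble_rfe_step(samples, rank_features, n_boot=10, seed=3):
    """One backward-elimination step in the spirit of ESVM-RFE:
    rank features on several bootstrap samples, sum the ranks,
    and report the feature with the worst aggregate rank."""
    rng = random.Random(seed)
    n_feat = len(samples[0])
    totals = [0.0] * n_feat
    for _ in range(n_boot):
        boot = [rng.choice(samples) for _ in samples]
        ranking = rank_features(boot)      # best-to-worst feature indices
        for rank, feat in enumerate(ranking):
            totals[feat] += rank
    return max(range(n_feat), key=lambda f: totals[f])

# Stand-in ranker (the paper uses SVM weights): order by variance.
def by_variance(boot):
    n = len(boot)
    var = []
    for f in range(len(boot[0])):
        col = [row[f] for row in boot]
        m = sum(col) / n
        var.append(sum((x - m) ** 2 for x in col))
    return sorted(range(len(var)), key=lambda f: -var[f])

data = [(i % 7, 0.0, i) for i in range(20)]   # feature 1 is constant
print(ensemble_rfe_step(data, by_variance))   # 1
```

Aggregating ranks over bootstraps makes the elimination decision depend on many models rather than one, which is the paper's stated motivation.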
215. FSM-F: Finite State Machine Based Framework for Denial of Service and Intrusion Detection in MANET.
- Author
- N. Ahmed, Malik, Abdullah, Abdul Hanan, and Kaiwartya, Omprakash
- Subjects
- AD hoc computer networks; FINITE state machines; DENIAL of service attacks; INTRUSION detection systems (Computer security); WIRELESS communications; EMERGENCY management
- Abstract
Due to continuous advancements in wireless communication in terms of quality of communication and affordability of the technology, the application area of Mobile Adhoc Networks (MANETs) is growing significantly, particularly in military and disaster management. Considering the sensitivity of these application areas, security, in terms of detection of Denial of Service (DoS) attacks and intrusions, has become a prime concern in research and development in the area. Security systems suggested in the past have a state-recognition problem: the system cannot accurately identify the actual state of network nodes because the states of the nodes are not clearly defined. In this context, this paper proposes a framework based on a Finite State Machine (FSM) for denial of service and intrusion detection in MANETs. In particular, an Intrusion Detection system for the Adhoc On-demand Distance Vector protocol (ID-AODV) is presented based on a finite state machine. Packet dropping and sequence number attacks are closely investigated, and detection systems for both types of attacks are designed. The major functional modules of ID-AODV include a network monitoring system, a finite state machine and an attack detection model. Simulations are carried out in the network simulator NS-2 to evaluate the performance of the proposed framework. A comparative evaluation of the performance is also performed with the state-of-the-art techniques RIDAN and AODV. The performance evaluations attest to the benefits of the proposed framework in terms of providing better security against denial of service attacks and intrusions. [ABSTRACT FROM AUTHOR]
- Published
- 2016
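A finite-state view of node behavior, the core of the proposed framework, can be sketched as a small transition table; the states, events, and transitions below are invented for illustration, not taken from ID-AODV:

```python
class NodeFSM:
    """Finite state machine sketch for FSM-based monitoring: a node
    moves between explicit states as suspicious events (dropped
    packets, bad sequence numbers) accumulate."""
    TRANSITIONS = {
        ("normal", "packet_drop"): "suspect",
        ("suspect", "packet_drop"): "malicious",
        ("suspect", "normal_forward"): "normal",
        ("normal", "seq_anomaly"): "suspect",
        ("suspect", "seq_anomaly"): "malicious",
    }

    def __init__(self):
        self.state = "normal"

    def observe(self, event):
        # Unknown (state, event) pairs leave the state unchanged.
        self.state = self.TRANSITIONS.get((self.state, event), self.state)
        return self.state

fsm = NodeFSM()
for ev in ["packet_drop", "normal_forward", "packet_drop", "seq_anomaly"]:
    fsm.observe(ev)
print(fsm.state)  # malicious
```

Making the states explicit is exactly what addresses the state-recognition problem the abstract describes: every observed event maps to a well-defined node state.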
216. Computational Visual Stress Level Analysis of Calcareous Algae Exposed to Sedimentation.
- Author
- Osterloff, Jonas, Nilssen, Ingunn, Eide, Ingvar, de Oliveira Figueiredo, Marcia Abreu, de Souza Tâmega, Frederico Tapajós, and Nattkemper, Tim W.
- Subjects
- CORALLINE algae; EFFECT of light on algae; PHOTOSYNTHESIS; SEDIMENTATION & deposition; MACHINE learning; ALGAE
- Abstract
This paper presents a machine learning based approach for analyses of photos collected from laboratory experiments conducted to assess the potential impact of water-based drill cuttings on deep-water rhodolith-forming calcareous algae. This pilot study uses imaging technology to quantify and monitor the stress levels of the calcareous algae Mesophyllum engelhartii (Foslie) Adey caused by various degrees of light exposure, flow intensity and amount of sediment. A machine learning based algorithm was applied to automatically assess the temporal variation of the calcareous algae size (∼ mass) and color. Measured size and color were correlated to the photosynthetic efficiency (maximum quantum yield of charge separation in photosystem II) and degree of sediment coverage using multivariate regression. The multivariate regression showed correlations between time and calcareous algae sizes, as well as correlations between fluorescence and calcareous algae colors. [ABSTRACT FROM AUTHOR]
- Published
- 2016
217. Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values.
- Author
- Razzaghi, Talayeh, Roderick, Oleg, Safro, Ilya, and Marko, Nicholas
- Subjects
- SUPPORT vector machines; ELECTRONIC health records; PREDICTION models; DATA mining; ELECTRONIC data processing
- Abstract
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques of data-preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces faster, more accurate and more robust classification results. [ABSTRACT FROM AUTHOR]
- Published
- 2016
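The iterated-regression flavor of the EM-style imputation step can be sketched for the two-feature case: initialize missing values at the mean, then alternately refit a regression on all rows and refill the missing entries from the fit. The data is a toy example with one missing value; the paper's full method operates on many features at once:

```python
def impute_regression(pairs, iters=20):
    """EM-flavored imputation sketch for two correlated features:
    start missing y's at the mean, then repeatedly refit
    y ~ a*x + b over all rows and refill the missing y's."""
    known = [(x, y) for x, y in pairs if y is not None]
    mean_y = sum(y for _, y in known) / len(known)
    filled = [(x, y if y is not None else mean_y) for x, y in pairs]
    for _ in range(iters):
        n = len(filled)
        mx = sum(x for x, _ in filled) / n
        my = sum(y for _, y in filled) / n
        a = (sum((x - mx) * (y - my) for x, y in filled)
             / sum((x - mx) ** 2 for x, _ in filled))
        b = my - a * mx
        filled = [(x, y if orig is not None else a * x + b)
                  for (x, orig), (_, y) in zip(pairs, filled)]
    return filled

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None)]
print(round(impute_regression(data)[-1][1], 1))  # 8.0
```

Each pass tightens the fit, so the imputed value converges to the one consistent with the observed linear trend.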
218. A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.
- Author
- Goldstein, Markus and Uchida, Seiichi
- Subjects
- ANOMALY detection (Computer security); TASK performance; INTRUSION detection systems (Computer security); ALGORITHMS; FRAUD investigation; COMPARATIVE studies
- Abstract
Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, the computational effort, the impact of parameter settings and the global/local anomaly detection behavior are outlined. As a conclusion, we give advice on algorithm selection for typical real-world tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2016
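One of the classic detectors such comparative evaluations cover, the global k-nearest-neighbor anomaly score, fits in a few lines; the point set is a toy example:

```python
import math

def knn_anomaly_scores(points, k=2):
    """Global unsupervised anomaly score: mean distance to the k
    nearest neighbors. Larger scores mean more anomalous."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = knn_anomaly_scores(pts)
print(scores.index(max(scores)))  # 4  (the outlier)
```

Detectors of this "global" family flag points far from everything, while "local" detectors such as LOF normalize by neighborhood density, one of the distinctions the study examines.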
219. Accurate Prediction of Transposon-Derived piRNAs by Integrating Various Sequential and Physicochemical Features.
- Author
- Luo, Longqiang, Li, Dingfang, Zhang, Wen, Tu, Shikui, Zhu, Xiaopeng, and Tian, Gang
- Subjects
- TRANSPOSONS; PIWI genes; NON-coding RNA; PREDICTION theory; MACHINE learning
- Abstract
Background: Piwi-interacting RNA (piRNA) is the largest class of small non-coding RNA molecules. Transposon-derived piRNA prediction can enrich the research content of small ncRNAs as well as help to further understand the generation mechanism of gametes. Methods: In this paper, we attempt to differentiate transposon-derived piRNAs from non-piRNAs based on their sequential and physicochemical features by using machine learning methods. We explore six sequence-derived features, i.e. spectrum profile, mismatch profile, subsequence profile, position-specific scoring matrix, pseudo dinucleotide composition and local structure-sequence triplet elements, and systematically evaluate their performances for transposon-derived piRNA prediction. Finally, we consider two approaches: direct combination and ensemble learning to integrate useful features and achieve high-accuracy prediction models. Results: We construct three datasets, covering three species: Human, Mouse and Drosophila, and evaluate the performances of prediction models by 10-fold cross validation. In the computational experiments, direct combination models achieve AUC of 0.917, 0.922 and 0.992 on Human, Mouse and Drosophila, respectively; ensemble learning models achieve AUC of 0.922, 0.926 and 0.994 on the three datasets. Conclusions: Compared with other state-of-the-art methods, our methods can lead to better performances. In conclusion, the proposed methods are promising for transposon-derived piRNA prediction. The source codes and datasets are available in . [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
220. Behavior Based Social Dimensions Extraction for Multi-Label Classification.
- Author
-
Li, Le, Xu, Junyi, Xiao, Weidong, and Ge, Bin
- Subjects
- *
SOCIAL classes , *SOCIAL networks , *ALGORITHMS , *DATA extraction , *SOCIAL sciences - Abstract
Classification based on social dimensions is commonly used to handle the multi-label classification task in heterogeneous networks. However, traditional methods, which mostly rely on the community detection algorithms to extract the latent social dimensions, produce unsatisfactory performance when community detection algorithms fail. In this paper, we propose a novel behavior based social dimensions extraction method to improve the classification performance in multi-label heterogeneous networks. In our method, nodes’ behavior features, instead of community memberships, are used to extract social dimensions. By introducing Latent Dirichlet Allocation (LDA) to model the network generation process, nodes’ connection behaviors with different communities can be extracted accurately, which are applied as latent social dimensions for classification. Experiments on various public datasets reveal that the proposed method can obtain satisfactory classification results in comparison to other state-of-the-art methods on smaller social dimensions. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
221. A Directed Acyclic Graph-Large Margin Distribution Machine Model for Music Symbol Classification.
- Author
-
Wen, Cuihong, Zhang, Jing, Rebelo, Ana, and Cheng, Fanyong
- Subjects
- *
MUSIC psychology , *MACHINE learning , *RECOGNITION (Psychology) , *ATTENTION , *DIRECTED acyclic graphs , *PROBLEM solving - Abstract
Optical Music Recognition (OMR) has received increasing attention in recent years. In this paper, we propose a classifier based on a new method named Directed Acyclic Graph-Large margin Distribution Machine (DAG-LDM). The DAG-LDM is an improvement of the Large margin Distribution Machine (LDM), which is a binary classifier that optimizes the margin distribution by maximizing the margin mean and minimizing the margin variance simultaneously. We modify the LDM to the DAG-LDM to solve the multi-class music symbol classification problem. Tests are conducted on more than 10000 music symbol images, obtained from handwritten and printed images of music scores. The proposed method provides superior classification capability and achieves much higher classification accuracy than the state-of-the-art algorithms such as Support Vector Machines (SVMs) and Neural Networks (NNs). [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
222. Computational Effective Fault Detection by Means of Signature Functions.
- Author
-
Baranski, Przemyslaw and Pietrzak, Piotr
- Subjects
- *
FAULT diagnosis , *COMPUTATIONAL physics , *MARKOV processes , *ARTIFICIAL neural networks , *COMPUTER algorithms - Abstract
The paper presents a computationally effective method for fault detection. A system’s responses are measured under healthy and ill conditions. These signals are used to calculate so-called signature functions that create a signal space. The current system’s response is projected into this space. The signal location in this space easily allows to determine the fault. No classifier such as a neural network, hidden Markov models, etc. is required. The advantage of this proposed method is its efficiency, as computing projections amount to calculating dot products. Therefore, this method is suitable for real-time embedded systems due to its simplicity and undemanding processing capabilities which permit the use of low-cost hardware and allow rapid implementation. The approach performs well for systems that can be considered linear and stationary. The communication presents an application, whereby an industrial process of moulding is supervised. The machine is composed of forms (dies) whose alignment must be precisely set and maintained during the work. Typically, the process is stopped periodically to manually control the alignment. The applied algorithm allows on-line monitoring of the device by analysing the acceleration signal from a sensor mounted on a die. This enables to detect failures at an early stage thus prolonging the machine’s life. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
223. Predicting Market Impact Costs Using Nonparametric Machine Learning Models.
- Author
-
Park, Saerom, Lee, Jaewook, and Son, Youngdoo
- Subjects
- *
NONPARAMETRIC estimation , *MACHINE learning , *TRANSACTION costs , *GAUSSIAN processes , *SUPPORT vector machines , *BAYESIAN analysis - Abstract
Market impact cost is the most significant portion of implicit transaction costs that can reduce the overall transaction cost, although it cannot be measured directly. In this paper, we employed the state-of-the-art nonparametric machine learning models: neural networks, Bayesian neural network, Gaussian process, and support vector regression, to predict market impact cost accurately and to provide the predictive model that is versatile in the number of variables. We collected a large amount of real single transaction data of US stock market from Bloomberg Terminal and generated three independent input variables. As a result, most nonparametric machine learning models outperformed a-state-of-the-art benchmark parametric model such as I-star model in four error measures. Although these models encounter certain difficulties in separating the permanent and temporary cost directly, nonparametric machine learning models can be good alternatives in reducing transaction costs by considerably improving in prediction performance. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
224. A Non-Destructive Method for Distinguishing Reindeer Antler (Rangifer tarandus) from Red Deer Antler (Cervus elaphus) Using X-Ray Micro-Tomography Coupled with SVM Classifiers.
- Author
-
Lefebvre, Alexandre, Rochefort, Gael Y., Santos, Frédéric, Le Denmat, Dominique, Salmon, Benjamin, and Pétillon, Jean-Marc
- Subjects
- *
REINDEER , *RED deer , *X-ray computed microtomography , *SUPPORT vector machines , *THREE-dimensional imaging , *MEDICAL artifacts - Abstract
Over the last decade, biomedical 3D-imaging tools have gained widespread use in the analysis of prehistoric bone artefacts. While initial attempts to characterise the major categories used in osseous industry (i.e. bone, antler, and dentine/ivory) have been successful, the taxonomic determination of prehistoric artefacts remains to be investigated. The distinction between reindeer and red deer antler can be challenging, particularly in cases of anthropic and/or taphonomic modifications. In addition to the range of destructive physicochemical identification methods available (mass spectrometry, isotopic ratio, and DNA analysis), X-ray micro-tomography (micro-CT) provides convincing non-destructive 3D images and analyses. This paper presents the experimental protocol (sample scans, image processing, and statistical analysis) we have developed in order to identify modern and archaeological antler collections (from Isturitz, France). This original method is based on bone microstructure analysis combined with advanced statistical support vector machine (SVM) classifiers. A combination of six microarchitecture biomarkers (bone volume fraction, trabecular number, trabecular separation, trabecular thickness, trabecular bone pattern factor, and structure model index) were screened using micro-CT in order to characterise internal alveolar structure. Overall, reindeer alveoli presented a tighter mesh than red deer alveoli, and statistical analysis allowed us to distinguish archaeological antler by species with an accuracy of 96%, regardless of anatomical location on the antler. In conclusion, micro-CT combined with SVM classifiers proves to be a promising additional non-destructive method for antler identification, suitable for archaeological artefacts whose degree of human modification and cultural heritage or scientific value has previously made it impossible (tools, ornaments, etc.). [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
225. AUI&GIV: Recommendation with Asymmetric User Influence and Global Importance Value.
- Author
-
Zhao, Zhi-Lin, Wang, Chang-Dong, and Lai, Jian-Huang
- Subjects
- *
INFORMATION filtering , *COMPUTER algorithms , *APPLIED mathematics , *INFORMATION science , *SOCIAL networks - Abstract
The user-based collaborative filtering (CF) algorithm is one of the most popular approaches for making recommendation. Despite its success, the traditional user-based CF algorithm suffers one serious problem that it only measures the influence between two users based on their symmetric similarities calculated by their consumption histories. It means that, for a pair of users, the influences on each other are the same, which however may not be true. Intuitively, an expert may have an impact on a novice user but a novice user may not affect an expert at all. Besides, each user may possess a global importance factor that affects his/her influence to the remaining users. To this end, in this paper, we propose an asymmetric user influence model to measure the directed influence between two users and adopt the PageRank algorithm to calculate the global importance value of each user. And then the directed influence values and the global importance values are integrated to deduce the final influence values between two users. Finally, we use the final influence values to improve the performance of the traditional user-based CF algorithm. Extensive experiments have been conducted, the results of which have confirmed that both the asymmetric user influence model and global importance value play key roles in improving recommendation accuracy, and hence the proposed method significantly outperforms the existing recommendation algorithms, in particular the user-based CF algorithm on the datasets of high rating density. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
226. Evaluation of Electroencephalography Source Localization Algorithms with Multiple Cortical Sources.
- Author
-
Bradley, Allison, Yao, Jun, Dewald, Jules, and Richter, Claus-Peter
- Subjects
- *
ELECTROENCEPHALOGRAPHY , *CEREBRAL cortex , *ACOUSTIC localization , *IMAGE reconstruction , *ELECTROPHYSIOLOGY , *BRAIN mapping - Abstract
Background: Source localization algorithms often show multiple active cortical areas as the source of electroencephalography (EEG). Yet, there is little data quantifying the accuracy of these results. In this paper, the performance of current source density source localization algorithms for the detection of multiple cortical sources of EEG data has been characterized. Methods: EEG data were generated by simulating multiple cortical sources (2–4) with the same strength or two sources with relative strength ratios of 1:1 to 4:1, and adding noise. These data were used to reconstruct the cortical sources using current source density (CSD) algorithms: sLORETA, MNLS, and LORETA using a p-norm with p equal to 1, 1.5 and 2. Precision (percentage of the reconstructed activity corresponding to simulated activity) and Recall (percentage of the simulated sources reconstructed) of each of the CSD algorithms were calculated. Results: While sLORETA has the best performance when only one source is present, when two or more sources are present LORETA with p equal to 1.5 performs better. When the relative strength of one of the sources is decreased, all algorithms have more difficulty reconstructing that source. However, LORETA 1.5 continues to outperform other algorithms. If only the strongest source is of interest sLORETA is recommended, while LORETA with p equal to 1.5 is recommended if two or more of the cortical sources are of interest. These results provide guidance for choosing a CSD algorithm to locate multiple cortical sources of EEG and for interpreting the results of these algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
227. Active Semi-Supervised Community Detection Based on Must-Link and Cannot-Link Constraints.
- Author
-
Cheng, Jianjun, Leng, Mingwei, Li, Longjie, Zhou, Hanhai, and Chen, Xiaoyun
- Subjects
- *
PROTEIN-protein interactions , *ALGORITHMS , *ACTIVE learning , *INQUIRY-based learning , *EXPERIENTIAL learning - Abstract
Community structure detection is of great importance because it can help in discovering the relationship between the function and the topology structure of a network. Many community detection algorithms have been proposed, but how to incorporate the prior knowledge in the detection process remains a challenging problem. In this paper, we propose a semi-supervised community detection algorithm, which makes full utilization of the must-link and cannot-link constraints to guide the process of community detection and thereby extracts high-quality community structures from networks. To acquire the high-quality must-link and cannot-link constraints, we also propose a semi-supervised component generation algorithm based on active learning, which actively selects nodes with maximum utility for the proposed semi-supervised community detection algorithm step by step, and then generates the must-link and cannot-link constraints by accessing a noiseless oracle. Extensive experiments were carried out, and the experimental results show that the introduction of active learning into the problem of community detection makes a success. Our proposed method can extract high-quality community structures from networks, and significantly outperforms other comparison methods. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
228. Detecting Gunshots Using Wearable Accelerometers.
- Author
-
Loeffler, Charles E.
- Subjects
- *
GUNSHOT residues , *ACCELEROMETERS , *WEARABLE technology , *COMMUNITY supervision , *VIOLENCE , *FIREARMS , *REMOTE sensing - Abstract
Gun violence continues to be a staggering and seemingly intractable issue in many communities. The prevalence of gun violence among the sub-population of individuals under court-ordered community supervision provides an opportunity for intervention using remote monitoring technology. Existing monitoring systems rely heavily on location-based monitoring methods, which have incomplete geographic coverage and do not provide information on illegal firearm use. This paper presents the first results demonstrating the feasibility of using wearable inertial sensors to recognize wrist movements and other signals corresponding to firearm usage. Data were collected from accelerometers worn on the wrists of subjects shooting a number of different firearms, conducting routine daily activities, and participating in activities and tasks that could be potentially confused with firearm discharges. A training sample was used to construct a combined detector and classifier for individual gunshots, which achieved a classification accuracy of 99.4 percent when tested against a hold-out sample of observations. These results suggest the feasibility of using inexpensive wearable sensors to detect firearm discharges. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
229. Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks.
- Author
-
Mall, Raghvendra, Langone, Rocco, and Suykens, Johan A. K.
- Subjects
- *
KERNEL functions , *MACHINE learning , *STATISTICAL mechanics , *COGNITIVE science , *ARTIFICIAL intelligence , *EMPIRICAL research - Abstract
Kernel spectral clustering corresponds to a weighted kernel principal component analysis problem in a constrained optimization framework. The primal formulation leads to an eigen-decomposition of a centered Laplacian matrix at the dual level. The dual formulation allows to build a model on a representative subgraph of the large scale network in the training phase and the model parameters are estimated in the validation stage. The KSC model has a powerful out-of-sample extension property which allows cluster affiliation for the unseen nodes of the big data network. In this paper we exploit the structure of the projections in the eigenspace during the validation stage to automatically determine a set of increasing distance thresholds. We use these distance thresholds in the test phase to obtain multiple levels of hierarchy for the large scale network. The hierarchical structure in the network is determined in a bottom-up fashion. We empirically showcase that real-world networks have multilevel hierarchical organization which cannot be detected efficiently by several state-of-the-art large scale hierarchical community detection techniques like the Louvain, OSLOM and Infomap methods. We show that a major advantage of our proposed approach is the ability to locate good quality clusters at both the finer and coarser levels of hierarchy using internal cluster quality metrics on 7 real-life networks. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
230. Reliable Multi-Label Learning via Conformal Predictor and Random Forest for Syndrome Differentiation of Chronic Fatigue in Traditional Chinese Medicine.
- Author
-
Wang, Huazhen, Liu, Xin, Lv, Bing, Yang, Fan, and Hong, Yanzhu
- Subjects
- *
CHRONIC fatigue syndrome , *RANDOM forest algorithms , *CHINESE medicine , *ETIOLOGY of diseases , *PATHOLOGICAL physiology , *MACHINE learning - Abstract
Objective: Chronic Fatigue (CF) still remains unclear about its etiology, pathophysiology, nomenclature and diagnostic criteria in the medical community. Traditional Chinese medicine (TCM) adopts a unique diagnostic method, namely ‘bian zheng lun zhi’ or syndrome differentiation, to diagnose the CF with a set of syndrome factors, which can be regarded as the Multi-Label Learning (MLL) problem in the machine learning literature. To obtain an effective and reliable diagnostic tool, we use Conformal Predictor (CP), Random Forest (RF) and Problem Transformation method (PT) for the syndrome differentiation of CF. Methods and Materials: In this work, using PT method, CP-RF is extended to handle MLL problem. CP-RF applies RF to measure the confidence level (p-value) of each label being the true label, and then selects multiple labels whose p-values are larger than the pre-defined significance level as the region prediction. In this paper, we compare the proposed CP-RF with typical CP-NBC(Naïve Bayes Classifier), CP-KNN(K-Nearest Neighbors) and ML-KNN on CF dataset, which consists of 736 cases. Specifically, 95 symptoms are used to identify CF, and four syndrome factors are employed in the syndrome differentiation, including ‘spleen deficiency’, ‘heart deficiency’, ‘liver stagnation’ and ‘qi deficiency’. The Results: CP-RF demonstrates an outstanding performance beyond CP-NBC, CP-KNN and ML-KNN under the general metrics of subset accuracy, hamming loss, one-error, coverage, ranking loss and average precision. Furthermore, the performance of CP-RF remains steady at the large scale of confidence levels from 80% to 100%, which indicates its robustness to the threshold determination. In addition, the confidence evaluation provided by CP is valid and well-calibrated. Conclusion: CP-RF not only offers outstanding performance but also provides valid confidence evaluation for the CF syndrome differentiation. 
It would be well applicable to TCM practitioners and facilitate the utilities of objective, effective and reliable computer-based diagnosis tool. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
231. Ensemble Positive Unlabeled Learning for Disease Gene Identification.
- Author
-
Yang, Peng, Li, Xiaoli, Chua, Hon-Nian, Kwoh, Chee-Keong, and Ng, See-Kiong
- Subjects
- *
MACHINE learning , *GENOMICS , *MEDICAL genetics , *COMPUTER algorithms , *COMPUTER simulation - Abstract
An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data and a single machine learning predictor prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions. 
[ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
232. Question Popularity Analysis and Prediction in Community Question Answering Services.
- Author
-
Liu, Ting, Zhang, Wei-Nan, Cao, Liujuan, and Zhang, Yu
- Subjects
- *
ONLINE social networks , *POPULARITY , *QUESTION answering systems , *BIOMETRY , *TEXT mining , *MACHINE learning - Abstract
With the blooming of online social media applications, Community Question Answering (CQA) services have become one of the most important online resources for information and knowledge seekers. A large number of high quality question and answer pairs have been accumulated, which allow users to not only share their knowledge with others, but also interact with each other. Accordingly, volumes of efforts have been taken to explore the questions and answers retrieval in CQA services so as to help users to finding the similar questions or the right answers. However, to our knowledge, less attention has been paid so far to question popularity in CQA. Question popularity can reflect the attention and interest of users. Hence, predicting question popularity can better capture the users’ interest so as to improve the users’ experience. Meanwhile, it can also promote the development of the community. In this paper, we investigate the problem of predicting question popularity in CQA. We first explore the factors that have impact on question popularity by employing statistical analysis. We then propose a supervised machine learning approach to model these factors for question popularity prediction. The experimental results show that our proposed approach can effectively distinguish the popular questions from unpopular ones in the Yahoo! Answers question and answer repository. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
233. A Systematic Comparison of Supervised Classifiers.
- Author
-
Amancio, Diego Raphael, Comin, Cesar Henrique, Casanova, Dalcimar, Travieso, Gonzalo, Bruno, Odemir Martinez, Rodrigues, Francisco Aparecido, and da Fontoura Costa, Luciano
- Subjects
- *
PATTERN recognition systems , *INDUSTRIAL applications , *ACCURACY , *CLASSIFICATION algorithms , *PARAMETER estimation , *SUPPORT vector machines , *MACHINE learning - Abstract
Pattern recognition has been employed in a myriad of industrial, commercial and academic applications. Many techniques have been devised to tackle such a diversity of applications. Despite the long tradition of pattern recognition research, there is no technique that yields the best classification in all scenarios. Therefore, as many techniques as possible should be considered in high accuracy applications. Typical related works either focus on the performance of a given algorithm or compare various classification methods. In many occasions, however, researchers who are not experts in the field of machine learning have to deal with practical classification tasks without an in-depth knowledge about the underlying parameters. Actually, the adequate choice of classifiers and parameters in such practical circumstances constitutes a long-standing problem and is one of the subjects of the current paper. We carried out a performance study of nine well-known classifiers implemented in the Weka framework and compared the influence of the parameter configurations on the accuracy. The default configuration of parameters in Weka was found to provide near optimal performance for most cases, not including methods such as the support vector machine (SVM). In addition, the k-nearest neighbor method frequently allowed the best accuracy. In certain conditions, it was possible to improve the quality of SVM by more than 20% with respect to their default parameter configuration. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
234. Rare Variants Detection with Kernel Machine Learning Based on Likelihood Ratio Test.
- Author
-
Zeng, Ping, Zhao, Yang, Zhang, Liwei, Huang, Shuiping, and Chen, Feng
- Subjects
- *
MACHINE learning , *LIKELIHOOD ratio tests , *PHENOTYPES , *EIGENVALUES , *NUMERICAL analysis , *SINGLE nucleotide polymorphisms , *SAMPLE size (Statistics) - Abstract
This paper mainly utilizes likelihood-based tests to detect rare variants associated with a continuous phenotype under the framework of kernel machine learning. Both the likelihood ratio test (LRT) and the restricted likelihood ratio test (ReLRT) are investigated. The relationship between the kernel machine learning and the mixed effects model is discussed. By using the eigenvalue representation of LRT and ReLRT, their exact finite sample distributions are obtained in a simulation manner. Numerical studies are performed to evaluate the performance of the proposed approaches under the contexts of standard mixed effects model and kernel machine learning. The results have shown that the LRT and ReLRT can control the type I error correctly at the given α level. The LRT and ReLRT consistently outperform the SKAT, regardless of the sample size and the proportion of the negative causal rare variants, and suffer from fewer power reductions compared to the SKAT when both positive and negative effects of rare variants are present. The LRT and ReLRT performed under the context of kernel machine learning have slightly higher powers than those performed under the context of standard mixed effects model. We use the Genetic Analysis Workshop 17 exome sequencing SNP data as an illustrative example. Some interesting results are observed from the analysis. Finally, we give the discussion. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
235. How Good Is Crude MDL for Solving the Bias-Variance Dilemma? An Empirical Investigation Based on Bayesian Networks.
- Author
-
Cruz-Ramírez, Nicandro, Acosta-Mesa, Héctor Gabriel, Mezura-Montes, Efrén, Guerra-Hernández, Alejandro, Hoyos-Rivera, Guillermo de Jesús, Barrientos-Martínez, Rocío Erandi, Gutiérrez-Fragoso, Karina, Nava-Fernández, Luis Alonso, González-Gaspar, Patricia, Novoa-del-Toro, Elva María, Aguilera-Rueda, Vicente Josué, and Ameca-Alducin, María Yaneli
- Subjects
- *
ANALYSIS of variance , *EMPIRICAL research , *BAYESIAN analysis , *MACHINE learning , *COMPUTATIONAL complexity , *INFORMATION theory - Abstract
The bias-variance dilemma is a well-known and important problem in Machine Learning. It basically relates the generalization capability (goodness of fit) of a learning method to its corresponding complexity. When we have enough data at hand, it is possible to use these data in such a way so as to minimize overfitting (the risk of selecting a complex model that generalizes poorly). Unfortunately, there are many situations where we simply do not have this required amount of data. Thus, we need to find methods capable of efficiently exploiting the available data while avoiding overfitting. Different metrics have been proposed to achieve this goal: the Minimum Description Length principle (MDL), Akaike’s Information Criterion (AIC) and Bayesian Information Criterion (BIC), among others. In this paper, we focus on crude MDL and empirically evaluate its performance in selecting models with a good balance between goodness of fit and complexity: the so-called bias-variance dilemma, decomposition or tradeoff. Although the graphical interaction between these dimensions (bias and variance) is ubiquitous in the Machine Learning literature, few works present experimental evidence to recover such interaction. In our experiments, we argue that the resulting graphs allow us to gain insights that are difficult to unveil otherwise: that crude MDL naturally selects balanced models in terms of bias-variance, which not necessarily need be the gold-standard ones. We carry out these experiments using a specific model: a Bayesian network. In spite of these motivating results, we also should not overlook three other components that may significantly affect the final model selection: the search procedure, the noise rate and the sample size. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.