Database: Academic Search Index / Journal: journal of machine learning research / Topic: algorithms and machine learning - Searchworks@Jio Institute Digital Library Search Results

1. A Bayesian Bradley-Terry model to compare multiple ML algorithms on multiple data sets.

Author: Wainer, Jacques
Subjects: *MACHINE learning, *PYTHON programming language, *ALGORITHMS, *POLYSEMY
Abstract: This paper presents a Bayesian model, called the Bayesian Bradley Terry (BBT) model, for comparing multiple algorithms on multiple data sets based on any metric. The model is an extension of the Bradley Terry model, which tracks the number of wins each algorithm has on different data sets. Unlike frequentist methods such as Demsar tests on mean rank or multiple pairwise Wilcoxon tests, the Bayesian approach provides a more nuanced understanding of the algorithms' performance and allows for the definition of the "region of practical equivalence" (ROPE) for two algorithms. Additionally, the paper introduces the concept of "local ROPE," which assesses the significance of the difference in mean measure between two algorithms using effect sizes, and can be applied in frequentist approaches as well. Both an R package and a Python program implementing the BBT are available for use. [ABSTRACT FROM AUTHOR]
Published: 2023
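
Illustrative note: the classical Bradley-Terry model that the abstract above extends can be fitted from pairwise win counts. The sketch below is a plain maximum-likelihood fit using the standard minorization-maximization update, not the Bayesian BBT model and not the author's R or Python code. The function name `fit_bradley_terry` and the win matrix are hypothetical.

```python
def fit_bradley_terry(wins, iters=200):
    """Maximum-likelihood Bradley-Terry strengths via the MM update.

    wins[i][j] is the number of data sets on which algorithm i beat
    algorithm j (hypothetical input format, chosen for illustration).
    Assumes every algorithm has at least one win and one comparison.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [v / scale for v in new_p]  # normalize so strengths sum to 1
    return p


# Made-up win counts for three algorithms, ten data sets per pair.
example_wins = [[0, 8, 9],
                [2, 0, 7],
                [1, 3, 0]]
strengths = fit_bradley_terry(example_wins)
```

Under the model, the probability that algorithm i beats algorithm j is strengths[i] / (strengths[i] + strengths[j]).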

2. Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity.

Author: Kara, Ali Devran, Saldi, Naci, and Yüksel, Serdar
Subjects: *MACHINE learning, *REINFORCEMENT learning, *MARKOV processes, *GEOMETRIC quantization, *ALGORITHMS
Abstract: Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs) (also called controlled Markov chains) and various efforts have been made in the literature towards the applicability of such algorithms for continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality with either explicit performance bounds or which are guaranteed to be asymptotically optimal. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) utilizing near optimality and convergence results of Q-learning for POMDPs, and (iii) finally, near-optimality of finite state model approximations for MDPs with weakly continuous kernels which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs. [ABSTRACT FROM AUTHOR]
Published: 2023

3. Multi-Consensus Decentralized Accelerated Gradient Descent.

Author: Haishan Ye, Luo Luo, Ziang Zhou, and Tong Zhang
Subjects: *CONVEX functions, *SENSOR networks, *MACHINE learning, *ALGORITHMS
Abstract: This paper considers the decentralized convex optimization problem, which has a wide range of applications in large-scale machine learning, sensor networks, and control theory. We propose novel algorithms that achieve optimal computation complexity and near optimal communication complexity. Our theoretical results give affirmative answers to the open problem on whether there exists an algorithm that can achieve a communication complexity (nearly) matching the lower bound depending on the global condition number instead of the local one. Furthermore, the linear convergence of our algorithms only depends on the strong convexity of global objective and it does not require the local functions to be convex. The design of our methods relies on a novel integration of well-known techniques including Nesterov's acceleration, multi-consensus and gradient-tracking. Empirical studies show the outperformance of our methods for machine learning applications. [ABSTRACT FROM AUTHOR]
Published: 2023

4. Attacks against Federated Learning Defense Systems and their Mitigation.

Author: Lewis, Cody, Varadharajan, Vijay, and Noman, Nasimul
Subjects: *FEDERATED learning, *OPTIMIZATION algorithms, *INSTRUCTIONAL systems, *MACHINE learning, *ALGORITHMS, *HAZARD mitigation
Abstract: The susceptibility of federated learning (FL) to attacks from untrustworthy endpoints has led to the design of several defense systems. FL defense systems enhance the federated optimization algorithm using anomaly detection, scaling the updates from endpoints depending on their anomalous behavior. However, the defense systems themselves may be exploited by the endpoints with more sophisticated attacks. First, this paper proposes three categories of attacks and shows that they can effectively deceive some well-known FL defense systems. In the first two categories, referred to as on-off attacks, the adversary toggles between being honest and engaging in attacks. We analyse two such on-off attacks, label flipping and free riding, and show their impact against existing FL defense systems. As a third category, we propose attacks based on "good mouthing" and "bad mouthing", to boost or diminish influence of the victim endpoints on the global model. Secondly, we propose a new federated optimization algorithm, Viceroy, that can successfully mitigate all the proposed attacks. The proposed attacks and the mitigation strategy have been tested on a number of different experiments establishing their effectiveness in comparison with other contemporary methods. The proposed algorithm has also been made available as open source. Finally, in the appendices, we provide an induction proof for the on-off model poisoning attack, and the proof of convergence and adversarial tolerance for the new federated optimization algorithm. [ABSTRACT FROM AUTHOR]
Published: 2023

5. Decentralized Learning: Theoretical Optimality and Practical Improvements.

Author: Yucheng Lu and De Sa, Christopher
Subjects: *MACHINE learning, *INSTRUCTIONAL systems, *ALGORITHMS, *LOGARITHMS
Abstract: Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. While a simple version of DeTAG with plain SGD and constant step size suffices for achieving theoretical limits, we additionally provide a convergence bound for DeTAG under general non-increasing step size and momentum. Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. We substantiate our theory and show DeTAG converges faster on unshuffled data and in sparse networks. Furthermore, we study a DeTAG variant, DeTAG*, that practically speeds up data-center-scale model training. This manuscript is the extended version of (Lu and De Sa, 2021). [ABSTRACT FROM AUTHOR]
Published: 2023

6. Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels.

Author: Hao Wang, Rui Gao, and Calmon, Flavio P.
Subjects: *MACHINE learning, *OPTIMIZATION algorithms, *GENERALIZATION, *INFORMATION theory, *ALGORITHMS
Abstract: Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks. [ABSTRACT FROM AUTHOR]
Published: 2023

7. Lower Bounds and Accelerated Algorithms for Bilevel Optimization.

Author: Kaiyi Ji and Yingbin Liang
Subjects: *BILEVEL programming, *LEARNING problems, *MACHINE learning, *QUADRATIC forms, *ALGORITHMS
Abstract: Bilevel optimization has recently attracted growing interests due to its wide applications in modern machine learning problems. Although recent studies have characterized the convergence rate for several such popular algorithms, it is still unclear how much further these convergence rates can be improved. In this paper, we address this fundamental question from two perspectives. First, we provide the first-known lower complexity bounds of ... and ... respectively for strongly-convex-strongly convex and convex-strongly-convex bilevel optimizations. Second, we propose an accelerated bilevel optimizer named AccBiO, for which we provide the first-known complexity bounds without the gradient boundedness assumption (which was made in existing analyses) under the two aforementioned geometries. We also provide significantly tighter upper bounds than the existing complexity when the bounded gradient assumption does hold. We show that AccBiO achieves the optimal results (i.e., the upper and lower bounds match up to logarithmic factors) when the inner-level problem takes a quadratic form with a constant-level condition number. Interestingly, our lower bounds under both geometries are larger than the corresponding optimal complexities of minimax optimization, establishing that bilevel optimization is provably more challenging than minimax optimization. Our theoretical results are validated by numerical experiments. [ABSTRACT FROM AUTHOR]
Published: 2023

8. Comprehensive Algorithm Portfolio Evaluation using Item Response Theory.

Author: Kandanaarachchi, Sevvandi and Smith-Miles, Kate
Subjects: *ITEM response theory, *CLASSIFICATION algorithms, *MACHINE learning, *ALGORITHMS
Abstract: Item Response Theory (IRT) has been proposed within the field of Educational Psychometrics to assess student ability as well as test question difficulty and discrimination power. More recently, IRT has been applied to evaluate machine learning algorithm performance on a single classification dataset, where the student is now an algorithm, and the test question is an observation to be classified by the algorithm. In this paper we present a modified IRT-based framework for evaluating a portfolio of algorithms across a repository of datasets, while simultaneously eliciting a richer suite of characteristics - such as algorithm consistency and anomalousness - that describe important aspects of algorithm performance. These characteristics arise from a novel inversion and reinterpretation of the traditional IRT model without requiring additional dataset feature computations. We test this framework on algorithm portfolios for a wide range of applications, demonstrating the broad applicability of this method as an insightful algorithm evaluation tool. Furthermore, the explainable nature of IRT parameters yield an increased understanding of algorithm portfolios. [ABSTRACT FROM AUTHOR]
Published: 2023

9. Towards An Efficient Approach for the Nonconvex ℓp Ball Projection: Algorithm and Analysis.

Author: Xiangyu Yang, Jiashan Wang, and Hao Wang
Subjects: *STATISTICAL learning, *ALGORITHMS, *SIGNAL processing, *MACHINE learning, *NONSMOOTH optimization
Abstract: This paper primarily focuses on computing the Euclidean projection of a vector onto the ℓp ball in which p∈(0,1). Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote the sparsity of the desired solution. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem. Based on this characterization, we develop a novel numerical approach for computing the stationary point by solving a sequence of projections onto the reweighted ℓ1-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case O(1/√k) convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm. [ABSTRACT FROM AUTHOR]
Published: 2022

10. Asymptotic Study of Stochastic Adaptive Algorithms in Non-convex Landscape.

Author: Gadat, Sébastien and Gavra, Ioana
Subjects: *DEEP learning, *MACHINE learning, *ALGORITHMS, *SET functions
Abstract: This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning, and among them Adagrad and Rmsprop, which are involved in most of the blackbox deep learning algorithms. Our setup is the non-convex landscape optimization point of view, we consider a one time scale parametrization and the situation where these algorithms may or may not be used with mini-batches. We adopt the point of view of stochastic algorithms and establish the almost sure convergence of these methods when using a decreasing step-size towards the set of critical points of the target function. With a mild extra assumption on the noise, we also obtain the convergence towards the set of minimizers of the function. Along our study, we also obtain a "convergence rate" of the methods, in the vein of the works of Ghadimi and Lan (2013). [ABSTRACT FROM AUTHOR]
Published: 2022

11. First-order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems.

Author: Mingrui Liu, Rafique, Hassan, Qihang Lin, and Tianbao Yang
Subjects: *MACHINE learning, *GENERATIVE adversarial networks, *PROBLEM solving, *ALGORITHMS
Abstract: In this paper, we consider first-order convergence theory and algorithms for solving a class of non-convex non-concave min-max saddle-point problems, whose objective function is weakly convex in the variables of minimization and weakly concave in the variables of maximization. It has many important applications in machine learning including training Generative Adversarial Nets (GANs). We propose an algorithmic framework motivated by the inexact proximal point method, where the weakly monotone variational inequality (VI) corresponding to the original min-max problem is solved through approximately solving a sequence of strongly monotone VIs constructed by adding a strongly monotone mapping to the original gradient mapping. We prove first-order convergence to a nearly stationary solution of the original min-max problem of the generic algorithmic framework and establish different rates by employing different algorithms for solving each strongly monotone VI. Experiments verify the convergence theory and also demonstrate the effectiveness of the proposed methods on training GANs. [ABSTRACT FROM AUTHOR]
Published: 2021

12. Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms.

Author: Jianyu Wang and Joshi, Gauri
Subjects: *STOCHASTIC learning models, *ELASTIC analysis (Engineering), *ALGORITHMS, *BOTTLENECKS (Manufacturing), *PARALLEL algorithms, *MACHINE learning
Abstract: When training machine learning models using stochastic gradient descent (SGD) with a large number of nodes or massive edge devices, the communication cost of synchronizing gradients at every iteration is a key bottleneck that limits the scalability of the system and hinders the benefit of parallel computation. Local-update SGD algorithms, where worker nodes perform local iterations of SGD and periodically synchronize their local models, can effectively reduce the communication frequency and save the communication delay. In this paper, we propose a powerful framework, named Cooperative SGD, that subsumes a variety of local-update SGD algorithms (such as local SGD, elastic averaging SGD, and decentralized parallel SGD) and provides a unified convergence analysis. Notably, special cases of the unified convergence analysis provided by the cooperative SGD framework yield 1) the first convergence analysis of elastic averaging SGD for general non-convex objectives, and 2) improvements upon previous analyses of local SGD and decentralized parallel SGD. Moreover, we design new algorithms such as elastic averaging SGD with overlapped computation and communication, and decentralized periodic averaging which are shown to be 4x or more faster than the baseline in reaching the same training loss. [ABSTRACT FROM AUTHOR]
Published: 2021

13. Learning Sums of Independent Random Variables with Sparse Collective Support.

Author: De, Anindya, Long, Philip M., and Servedio, Rocco A.
Subjects: *RANDOM variables, *INDEPENDENT variables, *ALGORITHMS, *ARBITRARY constants, *CENTRAL limit theorem, *MACHINE learning
Abstract: We study the learnability of sums of independent integer random variables given a bound on the size of the union of their supports. For A⊂Z+, a "sum of independent random variables with collective support A" (called an A-sum in this paper) is a distribution S=X1+⋯+XN where the Xi's are mutually independent (but not necessarily identically distributed) integer random variables with ∪i supp(Xi)⊆A. We give two main algorithmic results for learning such distributions. First, for the case |A|=3, we give an algorithm for learning an unknown A-sum to accuracy ϵ using poly(1/ϵ) samples and running in time poly(1/ϵ), independent of N and of the elements of A. Second, for an arbitrary constant k≥4, if A={a1,...,ak} with 0≤a1<...0. [ABSTRACT FROM AUTHOR]
Published: 2020

14. A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning.

Author: Mokhtari, Aryan, Koppel, Alec, Takáč, Martin, and Ribeiro, Alejandro
Subjects: *HESSIAN matrices, *MACHINE learning, *ALGORITHMS, *PARALLEL algorithms, *SET functions, *RANDOM sets, *LEARNING problems
Abstract: We consider learning problems over training sets in which both the number of training examples and the dimension of the feature vectors are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple parallel processors to operate on a randomly chosen subset of blocks of the feature vector. RAPSA is doubly stochastic since each processor utilizes a random set of functions to compute the stochastic gradient associated with a randomly chosen set of variable coordinates. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is strongly convex. Moreover, we present an accelerated version of RAPSA (ARAPSA) that incorporates the objective function curvature information by premultiplying the descent direction by a Hessian approximation matrix. We further extend the results for asynchronous settings and show that if the processors perform their updates without any coordination the algorithms are still convergent to the optimal argument. RAPSA and its extensions are then numerically evaluated on a linear estimation problem and a binary image classification task using the MNIST handwritten digit dataset. [ABSTRACT FROM AUTHOR]
Published: 2020

15. Generalized Hierarchical Kernel Learning.

Author: Jawanpuria, Pratik, Saketha Nath, Jagarlapudi, and Ramakrishnan, Ganesh
Subjects: *HIERARCHICAL clustering (Cluster analysis), *KERNEL (Mathematics), *MACHINE learning, *GRAPH theory, *ALGORITHMS
Abstract: This paper generalizes the framework of Hierarchical Kernel Learning (HKL) and illustrates its utility in the domain of rule learning. HKL involves Multiple Kernel Learning over a set of given base kernels assumed to be embedded on a directed acyclic graph. This paper proposes a two-fold generalization of HKL: the first is employing a generic ℓ1/ℓρ block-norm regularizer (ρ ∈ (1, 2]) that alleviates a key limitation of the HKL formulation. The second is a generalization to the case of multi-class, multi-label and more generally, multi-task applications. The main technical contribution of this work is the derivation of a highly specialized partial dual of the proposed generalized HKL formulation and an efficient mirror descent based active set algorithm for solving it. Importantly, the generic regularizer enables the proposed formulation to be employed in the Rule Ensemble Learning (REL) where the goal is to construct an ensemble of conjunctive propositional rules. Experiments on benchmark REL data sets illustrate the efficacy of the proposed generalizations. [ABSTRACT FROM AUTHOR]
Published: 2015

16. Following the Leader and Fast Rates in Online Linear Prediction: Curved Constraint Sets and Other Regularities.

Author: Ruitong Huang, Lattimore, Tor, György, András, and Szepesvári, Csaba
Subjects: *ALGORITHMS, *MACHINE learning, *CONVEX domains, *ELLIPSOIDS, *STOCHASTIC processes
Abstract: Follow the leader (FTL) is a simple online learning algorithm that is known to perform well when the loss functions are convex and positively curved. In this paper we ask whether there are other settings when FTL achieves low regret. In particular, we study the fundamental problem of linear prediction over a convex, compact domain with non-empty interior. Amongst other results, we prove that the curvature of the boundary of the domain can act as if the losses were curved: In this case, we prove that as long as the mean of the loss vectors have positive lengths bounded away from zero, FTL enjoys logarithmic regret, while for polytope domains and stochastic data it enjoys finite expected regret. The former result is also extended to strongly convex domains by establishing an equivalence between the strong convexity of sets and the minimum curvature of their boundary, which may be of independent interest. Building on a previously known meta-algorithm, we also get an algorithm that simultaneously enjoys the worst-case guarantees and the smaller regret of FTL when the data is 'easy'. Finally, we show that such guarantees are achievable directly (e.g., by the follow the regularized leader algorithm or by a shrinkage-based variant of FTL) when the constraint set is an ellipsoid. [ABSTRACT FROM AUTHOR]
Published: 2017
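
Illustrative note: for linear losses on the Euclidean unit ball, the follow-the-leader rule described in the abstract above has a closed form, since the minimizer of the cumulative linear loss is the unit vector pointing against the running sum of loss vectors. The sketch below assumes that domain. The function names and the choice to play the origin in the first round (when no losses have been seen) are assumptions made for illustration, not details from the paper.

```python
import math


def ftl_unit_ball(loss_vectors):
    """Follow the leader for linear losses <x, f_t> over the unit ball.

    In round t the learner plays the minimizer of the cumulative loss
    seen so far, which is -Theta / ||Theta|| for Theta = f_1 + ... + f_{t-1}.
    """
    total = [0.0] * len(loss_vectors[0])
    plays = []
    for f in loss_vectors:
        norm = math.sqrt(sum(c * c for c in total))
        if norm == 0.0:
            x = [0.0] * len(total)  # arbitrary choice before any data
        else:
            x = [-c / norm for c in total]
        plays.append(x)
        total = [t + c for t, c in zip(total, f)]  # reveal the loss vector
    return plays


def regret(loss_vectors, plays):
    """Learner's total loss minus that of the best fixed point in the ball."""
    incurred = sum(sum(a * b for a, b in zip(x, f))
                   for x, f in zip(plays, loss_vectors))
    total = [sum(col) for col in zip(*loss_vectors)]
    best = -math.sqrt(sum(c * c for c in total))
    return incurred - best


# Made-up loss sequence whose mean stays away from zero.
losses = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
plays = ftl_unit_ball(losses)
```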

17. Active-set Methods for Submodular Minimization Problems.

Author: Kumar, K. S. Sesh and Bach, Francis
Subjects: *MATHEMATICAL functions, *MATHEMATICAL optimization, *ALGORITHMS, *STOCHASTIC convergence, *MACHINE learning
Abstract: We consider the submodular function minimization (SFM) and the quadratic minimization problems regularized by the Lovász extension of the submodular function. These optimization problems are intimately related; for example, min-cut problems and total variation denoising problems, where the cut function is submodular and its Lovász extension is given by the associated total variation. When a quadratic loss is regularized by the total variation of a cut function, it thus becomes a total variation denoising problem and we use the same terminology in this paper for "general" submodular functions. We propose a new active-set algorithm for total variation denoising with the assumption of an oracle that solves the corresponding SFM problem. This can be seen as local descent algorithm over ordered partitions with explicit convergence guarantees. It is more flexible than the existing algorithms with the ability for warm-restarts using the solution of a closely related problem. Further, we also consider the case when a submodular function can be decomposed into the sum of two submodular functions F1 and F2 and assume SFM oracles for these two functions. We propose a new active-set algorithm for total variation denoising (and hence SFM by thresholding the solution at zero). This algorithm also performs local descent over ordered partitions and its ability to warm start considerably improves the performance of the algorithm. In the experiments, we compare the performance of the proposed algorithms with state-of-the-art algorithms, showing that it reduces the calls to SFM oracles. [ABSTRACT FROM AUTHOR]
Published: 2017

18. Accelerating t-SNE using Tree-Based Algorithms.

Author: van der Maaten, Laurens
Subjects: *EMBEDDINGS (Mathematics), *STOCHASTIC analysis, *ALGORITHMS, *MACHINE learning, *SET theory
Abstract: The paper investigates the acceleration of t-SNE--an embedding technique that is commonly used for the visualization of high-dimensional data in scatter plots--using two tree-based algorithms. In particular, the paper develops variants of the Barnes-Hut algorithm and of the dual-tree algorithm that approximate the gradient used for learning t-SNE embeddings in O(N log N). Our experiments show that the resulting algorithms substantially accelerate t-SNE, and that they make it possible to learn embeddings of data sets with millions of objects. Somewhat counterintuitively, the Barnes-Hut variant of t-SNE appears to outperform the dual-tree variant. [ABSTRACT FROM AUTHOR]
Published: 2014

19. Lens Depth Function and k-Relative Neighborhood Graph: Versatile Tools for Ordinal Data Analysis.

Author: Kleindessner, Matthäus and von Luxburg, Ulrike
Subjects: *GRAPH theory, *MACHINE learning, *PATHS & cycles in graph theory, *ALGORITHMS, *CLUSTER analysis (Statistics)
Abstract: In recent years it has become popular to study machine learning problems in a setting of ordinal distance information rather than numerical distance measurements. By ordinal distance information we refer to binary answers to distance comparisons such as d(A, B) < d(C, D). For many problems in machine learning and statistics it is unclear how to solve them in such a scenario. Up to now, the main approach is to explicitly construct an ordinal embedding of the data points in the Euclidean space, an approach that has a number of drawbacks. In this paper, we propose algorithms for the problems of medoid estimation, outlier identification, classification, and clustering when given only ordinal data. They are based on estimating the lens depth function and the k-relative neighborhood graph on a data set. Our algorithms are simple, are much faster than an ordinal embedding approach and avoid some of its drawbacks, and can easily be parallelized. [ABSTRACT FROM AUTHOR]
Published: 2017
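
Illustrative note: the lens depth of a point can be estimated from ordinal comparisons alone, because a point x lies in the lens of a pair (a, b) exactly when d(x, a) < d(a, b) and d(x, b) < d(a, b). The sketch below uses that definition and takes the deepest point as a medoid estimate. The oracle `closer`, the function names, and the one-dimensional data are hypothetical and are not the authors' implementation.

```python
from itertools import combinations


def lens_depth(x, points, closer):
    """Fraction of pairs (a, b) from `points` whose lens contains x.

    closer(a, b, c, d) is an ordinal oracle that answers whether
    d(a, b) < d(c, d); no numerical distances are needed here.
    """
    pairs = list(combinations(points, 2))
    inside = sum(1 for a, b in pairs
                 if closer(x, a, a, b) and closer(x, b, a, b))
    return inside / len(pairs)


def estimate_medoid(points, closer):
    """Return the point with the largest lens depth."""
    return max(points, key=lambda x: lens_depth(x, points, closer))


# Simulated oracle over made-up 1-D data; in an ordinal setting the
# comparison answers would be given rather than computed from distances.
def closer(a, b, c, d):
    return abs(a - b) < abs(c - d)


data = [0, 1, 2, 3, 10]
depths = {x: lens_depth(x, data, closer) for x in data}
```

With this data the extreme points 0 and 10 receive depth 0, which is the idea behind using low lens depth to flag outliers.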

20. A Unified Formulation and Fast Accelerated Proximal Gradient Method for Classification.

Author: Naoki Ito, Akiko Takeda, and Kim-Chuan Toh
Subjects: *SUPPORT vector machines, *MATHEMATICAL optimization, *ALGORITHMS, *MACHINE learning, *ASSISTIVE computer technology
Abstract: Binary classification is the problem of predicting the class a given sample belongs to. To achieve a good prediction performance, it is important to find a suitable model for a given dataset. However, it is often time consuming and impractical for practitioners to try various classification models because each model employs a different formulation and algorithm. The difficulty can be mitigated if we have a unified formulation and an efficient universal algorithmic framework for various classification models to expedite the comparison of performance of different models for a given dataset. In this paper, we present a unified formulation of various classification models (including C-SVM, ℓ2-SVM, ν-SVM, MM-FDA, MM-MPM, logistic regression, distance weighted discrimination) and develop a general optimization algorithm based on an accelerated proximal gradient (APG) method for the formulation. We design various techniques such as backtracking line search and adaptive restarting strategy in order to speed up the practical convergence of our method. We also give a theoretical convergence guarantee for the proposed fast APG algorithm. Numerical experiments show that our algorithm is stable and highly competitive to specialized algorithms designed for specific models (e.g., sequential minimal optimization (SMO) for SVM). [ABSTRACT FROM AUTHOR]
Published: 2017

21. Classification Methods with Reject Option Based on Convex Risk Minimization.

Author: Ming Yuan and Wegkamp, Marten
Subjects: *CLASSIFICATION, *SURROGATE-based optimization, *MATHEMATICAL optimization, *MACHINE learning, *ALGORITHMS
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. [ABSTRACT FROM AUTHOR]
Published: 2010
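
Illustrative note: the reject-option setting in the abstract above can be made concrete with the standard plug-in decision rule. Given an estimate eta of P(Y = 1 | x) and a rejection cost d no larger than 1/2 (misclassification costs 1), the rule predicts only when the estimated error probability is below d. The sketch shows that threshold rule, not the convex surrogate procedure the paper analyzes. The function name and the encoding of "reject" as 0 are assumptions.

```python
def classify_with_reject(eta, d):
    """Plug-in classification rule with a reject option.

    eta: estimated probability that the label is +1.
    d:   cost of rejecting, with 0 <= d <= 0.5 (misclassification costs 1).
    Returns +1 or -1 for a prediction, or 0 to withhold the decision.
    """
    if not 0.0 <= d <= 0.5:
        raise ValueError("rejection cost d must lie in [0, 0.5]")
    if eta > 1.0 - d:
        return 1    # error probability 1 - eta is below d
    if eta < d:
        return -1   # error probability eta is below d
    return 0        # too uncertain: rejecting is cheaper in expectation


# Made-up probability estimates with rejection cost 0.2.
decisions = [classify_with_reject(eta, 0.2) for eta in (0.95, 0.6, 0.05)]
```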

22. An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data.

Author: Yufeng Ding and Simonoff, Jeffrey S.
Subjects: *MISSING data (Statistics), *CLASSIFICATION, *MONTE Carlo method, *BINARY number system, *ALGORITHMS, *MACHINE learning
Abstract: There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, as well as the existence or non-existence of missing values in the testing data, are the most helpful criteria to distinguish different missing data methods. In particular, separate class is clearly the best method to use when the testing set has missing values and the missingness is related to the response variable. A real data set related to modeling bankruptcy of a firm is then analyzed. The paper concludes with discussion of adaptation of these results to logistic regression, and other potential generalizations. [ABSTRACT FROM AUTHOR]
Published: 2010

23. Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining.

Author: Novak, Petra Kralj, Lavrač, Nada, and Webb, Geoffrey I.
Subjects: *SUPERVISED learning, *SET theory, *SUBGROUP growth, *MACHINE learning, *DATA mining, *ALGORITHMS
Abstract: This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods. [ABSTRACT FROM AUTHOR]
Published: 2009

24. Multi-class Discriminant Kernel Learning via Convex Programming.

Author: Jieping Ye, Shuiwang Ji, and Jianhui Chen
Subjects: *KERNEL functions, *CONVEX programming, *MATHEMATICAL programming, *MACHINE learning, *ALGORITHMS
Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. [ABSTRACT FROM AUTHOR]
Published: 2008

25. Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets".

Author: Loosli, Gaëlle and Canu, Stéphane
Subjects: *ALGORITHMS, *MACHINE learning, *TECHNICAL specifications, *SAMPLE size (Statistics), *PARAMETERS (Statistics)
Abstract: In a recently published paper in JMLR, Tsang et al. (2005) present an algorithm for SVM called Core Vector Machines (CVM) and illustrate its performance through comparisons with other SVM solvers. After reading the CVM paper we were surprised by some of the reported results. In order to clarify the matter, we decided to reproduce some of the experiments. It turns out that to some extent, our results contradict those reported. Reasons for these different behaviors are given through the analysis of the stopping criterion. [ABSTRACT FROM AUTHOR]
Published: 2007

26. Universal Algorithms for Learning Theory Part I: Piecewise Constant Functions.

Author: Binev, Peter, Cohen, Albert, Dahmen, Wolfgang, DeVore, Ronald, Temlyakov, Vladimir, and Bartlett, Peter
Subjects: *REGRESSION analysis, *MACHINE learning, *ARTIFICIAL intelligence, *ALGORITHMS, *ESTIMATION theory
Abstract: This paper is concerned with the construction and analysis of a universal estimator for the regression problem in supervised learning. Universal means that the estimator does not depend on any a priori assumptions about the regression function to be estimated. The universal estimator studied in this paper consists of a least-square fitting procedure using piecewise constant functions on a partition which depends adaptively on the data. The partition is generated by a splitting procedure which differs from those used in CART algorithms. It is proven that this estimator performs at the optimal convergence rate for a wide class of priors on the regression function. Namely, as will be made precise in the text, if the regression function is in any one of a certain class of approximation spaces (or smoothness spaces of order not exceeding one -- a limitation resulting because the estimator uses piecewise constants) measured relative to the marginal measure, then the estimator converges to the regression function (in the least squares sense) with an optimal rate of convergence in terms of the number of samples. The estimator is also numerically feasible and can be implemented on-line. [ABSTRACT FROM AUTHOR]
Published: 2005

27. Ultraconservative Online Algorithms for Multiclass Problems.

Author: Crammer, Koby, Singer, Yoram, and Warmuth, Manfred K.
Subjects: *COMPUTER algorithms, *ALGORITHMS, *MACHINE learning, *HYPOTHESIS, *SOCIAL problems, *SENSORY perception
Abstract: In this paper we study a paradigm to generalize online classification algorithms for binary classification problems to multiclass problems. The particular hypotheses we investigate maintain one prototype vector per class. Given an input instance, a multiclass hypothesis computes a similarity-score between each prototype and the input instance and sets the predicted label to be the index of the prototype achieving the highest similarity. To design and analyze the learning algorithms in this paper we introduce the notion of ultraconservativeness. Ultraconservative algorithms are algorithms that update only the prototypes attaining similarity-scores which are higher than the score of the correct label's prototype. We start by describing a family of additive ultraconservative algorithms where each algorithm in the family updates its prototypes by finding a feasible solution for a set of linear constraints that depend on the instantaneous similarity-scores. We then discuss a specific online algorithm that seeks a set of prototypes which have a small norm. The resulting algorithm, which we term MIRA (for Margin Infused Relaxed Algorithm) is ultraconservative as well. We derive mistake bounds for all the algorithms and provide further analysis of MIRA using a generalized notion of the margin for multiclass problems. We discuss the form the algorithms take in the binary case and show that all the algorithms from the first family reduce to the Perceptron algorithm while MIRA provides a new Perceptron-like algorithm with a margin-dependent learning rate. We then return to multiclass problems and describe an analogous multiplicative family of algorithms with corresponding mistake bounds. We end the formal part by deriving and analyzing a multiclass version of Li and Long's ROMMA algorithm. We conclude with a discussion of experimental results that demonstrate the merits of our algorithms. [ABSTRACT FROM AUTHOR]
Published: 2003
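
Note on entry 27: the prototype-per-class hypothesis and the ultraconservative update described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The function names, the inner product as similarity-score, the uniform split of the update across the offending prototypes, and the treatment of ties as errors (so that all-zero prototypes still get updated) are assumptions made for the example.

```python
def scores_for(prototypes, x):
    """Similarity-score of x against each class prototype (inner product, assumed)."""
    return [sum(w * v for w, v in zip(proto, x)) for proto in prototypes]


def predict(prototypes, x):
    """Predicted label = index of the prototype with the highest similarity-score."""
    scores = scores_for(prototypes, x)
    return max(range(len(scores)), key=scores.__getitem__)


def ultraconservative_update(prototypes, x, y):
    """One online round. Mutates prototypes in place; returns True if an update happened."""
    scores = scores_for(prototypes, x)
    # Error set: wrong labels scoring at least as high as the correct label.
    # Including ties is an assumption made here, not a detail from the abstract.
    error_set = [r for r in range(len(prototypes)) if r != y and scores[r] >= scores[y]]
    if not error_set:
        return False  # only prototypes in the error set (and the correct one) ever move
    step = 1.0 / len(error_set)  # uniform weighting: one simple choice among many
    for i, v in enumerate(x):
        prototypes[y][i] += v
        for r in error_set:
            prototypes[r][i] -= step * v
    return True


def train(examples, num_classes, dim, epochs=5):
    """Run several passes of the online update over (x, y) pairs."""
    prototypes = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in examples:
            ultraconservative_update(prototypes, x, y)
    return prototypes


if __name__ == "__main__":
    data = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 2)]
    model = train(data, num_classes=3, dim=3)
    print([predict(model, x) for x, _ in data])
```

With two classes, this style of update reduces to a Perceptron-like rule, which matches the abstract's remark about the binary case.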

28. A General Framework for Fast Stagewise Algorithms.

Author: Tibshirani, Ryan J.
Subjects: *ALGORITHMS, *REGRESSION analysis, *MULTIVARIATE analysis, *COMPUTER algorithms, *MACHINE learning
Abstract: Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: it starts with all coefficients equal to zero, and iteratively updates the coefficient (by a small amount ϵ) of the variable that achieves the maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it is known that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size ϵ goes to zero. Furthermore, essentially the same equivalence holds outside of least squares regression, with the minimization of a differentiable convex loss function subject to an l1 norm constraint (the stagewise algorithm now updates the coefficient corresponding to the maximal absolute component of the gradient). Even when they do not match their l1-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the l1 norm and sparsity? The current paper is an attempt to do just this. We present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more. [ABSTRACT FROM AUTHOR]
Published: 2015
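
Note on entry 28: the forward stagewise update rule stated in the abstract (start at zero, then nudge by ϵ the coefficient of the variable with the largest absolute inner product with the residual) can be sketched as below. This is an illustrative sketch for the least squares case only. The function name, the fixed step budget `n_steps`, and the early exit when all inner products are zero are assumptions, not details from the paper.

```python
def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise regression for least squares.

    X is a list of rows (each a list of floats), y a list of responses.
    Returns the coefficient vector after at most n_steps updates of size eps.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p          # all coefficients start at zero
    residual = list(y)        # residual equals y while beta is zero
    for _ in range(n_steps):
        # Inner product of every column with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if corr[j] == 0.0:
            break             # residual is orthogonal to every column
        delta = eps if corr[j] > 0 else -eps
        beta[j] += delta      # small step on the selected coefficient only
        for i in range(n):
            residual[i] -= delta * X[i][j]
    return beta


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    y = [2.0, 0.0, 2.0, 0.0]  # y depends on the first column only
    print(forward_stagewise(X, y))
```

Smaller values of `eps` trace a finer path at the cost of more iterations; the abstract's connection to the lasso path concerns the limit as ϵ goes to zero.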

29. Plug-and-Play Dual-Tree Algorithm Runtime Analysis.

Author: Curtin, Ryan R., Dongryeol Lee, March, William B., and Ram, Parikshit
Subjects: *MACHINE learning, *REINFORCEMENT learning, *ALGORITHMS, *MACHINE theory, *PLUG & play (Computer architecture), *COMPUTER architecture
Abstract: Numerous machine learning algorithms contain pairwise statistical problems at their core; that is, tasks that require computations over all pairs of input points if implemented naively. Often, tree structures are used to solve these problems efficiently. Dual-tree algorithms can efficiently solve or approximate many of these problems. Using cover trees, rigorous worst-case runtime guarantees have been proven for some of these algorithms. In this paper, we present a problem-independent runtime guarantee for any dual-tree algorithm using the cover tree, separating out the problem-dependent and the problem-independent elements. This allows us to just plug in bounds for the problem-dependent elements to get runtime guarantees for dual-tree algorithms for any pairwise statistical problem without re-deriving the entire proof. We demonstrate this plug-and-play procedure for nearest-neighbor search and approximate kernel density estimation to get improved runtime guarantees. Under mild assumptions, we also present the first linear runtime guarantee for dual-tree based range search. [ABSTRACT FROM AUTHOR]
Published: 2015

30. Conditional Random Field with High-order Dependencies for Sequence Labeling and Segmentation.

Author: Nguyen Viet Cuong, Nan Ye, Wee Sun Lee, and Hai Leong Chieu
Subjects: *CONDITIONAL random fields, *COMPUTATIONAL complexity, *MATHEMATICAL sequences, *ALGORITHMS, *MACHINE learning, *MARKOV processes
Abstract: Dependencies among neighboring labels in a sequence are important sources of information for sequence labeling and segmentation. However, only first-order dependencies, which are dependencies between adjacent labels or segments, are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we give efficient inference algorithms to handle high-order dependencies between labels or segments in conditional random fields, under the assumption that the number of distinct label patterns used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting high-order dependencies can lead to substantial performance improvements for some problems, and we discuss conditions under which high-order features can be effective. [ABSTRACT FROM AUTHOR]
Published: 2014

31. Stationary-Sparse Causality Network Learning.

Author: Yuejia He, Yiyuan She, and Dapeng Wu
Subjects: *STATIONARY processes, *SPARSE approximations, *MACHINE learning, *MAXIMUM likelihood statistics, *COMPUTER networks, *ESTIMATION theory, *ALGORITHMS, *DYNAMICAL systems
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S² learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. [ABSTRACT FROM AUTHOR]
Published: 2013

32. Multi-Stage Multi-Task Feature Learning.

Author: Pinghua Gong, Jieping Ye, and Changshui Zhang
Subjects: *MACHINE learning, *COMPUTER multitasking, *GENERALIZATION, *PERFORMANCE evaluation, *MEDICAL informatics, *COMPUTER vision, *ALGORITHMS
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. [ABSTRACT FROM AUTHOR]
Published: 2013

33. Algorithms for Discovery of Multiple Markov Boundaries.

Author: Statnikov, Alexander, Lytkin, Nikita I., Lemeire, Jan, and Aliferis, Constantin F.
Subjects: *ALGORITHMS, *MARKOV processes, *MACHINE learning, *DISTRIBUTION (Probability theory), *FEATURE selection
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. [ABSTRACT FROM AUTHOR]
Published: 2013

34. Bayesian Co-Training.

Author: Shipeng Yu, Krishnapuram, Balaji, Rosales, Rómer, and Rao, R. Bharat
Subjects: *BAYESIAN analysis, *MACHINE learning, *ALGORITHMS, *SUPERVISED learning, *FEATURE extraction, *UNDIRECTED graphs
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. [ABSTRACT FROM AUTHOR]
Published: 2012

35. Metric and Kernel Learning Using a Linear Transformation.

Author: Jain, Prateek, Kulis, Brian, Davis, Jason V., Dhillon, Inderjit S., Sonnenburg, Sören, Bach, Francis, and Cheng Soon Ong
Subjects: *MACHINE learning, *SOFTWARE measurement, *ALGORITHMS, *KERNEL functions, *TEXT mining, *DIMENSION reduction (Statistics), *MATHEMATICAL transformations
Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework: that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. [ABSTRACT FROM AUTHOR]
Published: 2012

36. Algorithms for Learning Kernels Based on Centered Alignment.

Author: Cortes, Corinna, Mohri, Mehryar, Rostamizadeh, Afshin, and Bach, Francis
Subjects: *ALGORITHMS, *MACHINE learning, *KERNEL functions, *CLASSIFICATION, *REGRESSION analysis software, *MATRICES (Mathematics), *MATHEMATICAL regularization, *FEATURE selection
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. [ABSTRACT FROM AUTHOR]
Published: 2012
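
Note on entry 36: the abstract uses centered alignment as a similarity measure between kernel matrices without spelling it out. The sketch below uses the usual definition, the Frobenius inner product of the two centered matrices divided by the product of their Frobenius norms, where centering subtracts row and column means. The function names are hypothetical, and this covers only the similarity measure, not the learning algorithms from the paper.

```python
import math


def center(K):
    """Center a square kernel matrix: K - row means - column means + grand mean."""
    m = len(K)
    row_means = [sum(row) / m for row in K]
    col_means = [sum(K[i][j] for i in range(m)) / m for j in range(m)]
    grand_mean = sum(row_means) / m
    return [
        [K[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(m)]
        for i in range(m)
    ]


def frobenius_inner(A, B):
    """Sum of elementwise products of two equally sized matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def centered_alignment(K1, K2):
    """Cosine-style similarity between two centered kernel matrices."""
    K1c, K2c = center(K1), center(K2)
    denom = math.sqrt(frobenius_inner(K1c, K1c) * frobenius_inner(K2c, K2c))
    if denom == 0.0:
        raise ValueError("a centered kernel matrix is identically zero")
    return frobenius_inner(K1c, K2c) / denom


if __name__ == "__main__":
    K = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
    scaled = [[3.0 * v for v in row] for row in K]
    print(centered_alignment(K, scaled))  # scaling a kernel leaves the alignment unchanged
```

Because of the normalization, the value is invariant to positive rescaling of either kernel, which is what makes it usable for comparing base kernels of different magnitudes.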

37. Efficient and Effective Visual Codebook Generation Using Additive Kernels.

Author: Jianxin Wu, Wei-Chian Tan, and Rehg, James M.
Subjects: *KERNEL functions, *ELECTRONIC dictionaries, *GAUSSIAN processes, *CLUSTER analysis (Statistics), *MACHINE learning, *ALGORITHMS, *PATTERN perception
Abstract: Common visual codebook generation methods used in a bag of visual words model, for example, k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. In this paper, we demonstrate that HIK can be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as the standard k-means. The HIK codebooks have consistently higher recognition accuracy over k-means codebooks by 2-4% in several benchmark object and scene recognition data sets. The algorithm is also generalized to arbitrary additive kernels. Its speed is thousands of times faster than a naive implementation of the kernel k-means algorithm. In addition, we propose a one-class SVM formulation to create more effective visual code words. Finally, we show that the standard k-median clustering method can be used for visual codebook generation and can act as a compromise between the HIK / additive kernel and the k-means approaches. [ABSTRACT FROM AUTHOR]
Published: 2011
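
Note on entry 37: the Histogram Intersection Kernel sums the elementwise minima of two histograms, and kernel k-means assigns a point to the cluster whose feature-space mean is nearest. The sketch below shows only the naive assignment step that the abstract contrasts with the authors' fast algorithm. It is not their method, and the function names are hypothetical.

```python
def hik(h1, h2):
    """Histogram Intersection Kernel: sum of elementwise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def kernel_distance_sq(x, cluster, kernel=hik):
    """Squared feature-space distance from x to the mean of a non-empty cluster.

    Expands ||phi(x) - mean||^2 using kernel evaluations only (naive, quadratic cost).
    """
    n = len(cluster)
    cross = sum(kernel(x, c) for c in cluster) / n
    within = sum(kernel(a, b) for a in cluster for b in cluster) / (n * n)
    return kernel(x, x) - 2.0 * cross + within


def assign(x, clusters, kernel=hik):
    """Index of the cluster whose feature-space mean is closest to x."""
    return min(range(len(clusters)), key=lambda k: kernel_distance_sq(x, clusters[k], kernel))


if __name__ == "__main__":
    clusters = [
        [[4, 0], [3, 1]],  # histograms with mass in the first bin
        [[0, 4], [1, 3]],  # histograms with mass in the second bin
    ]
    print(hik([1, 2, 3], [2, 1, 3]))  # 1 + 1 + 3 = 5
    print(assign([4, 0], clusters), assign([0, 4], clusters))
```

The `within` term costs a number of kernel evaluations quadratic in the cluster size, which is the expense a fast method has to avoid.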

38. Learning with Structured Sparsity.

Author: Junzhou Huang, Tong Zhang, and Metaxas, Dimitris
Subjects: *MACHINE learning, *SPARSE matrices, *STATISTICS, *GENERALIZATION, *ALGORITHMS, *MATHEMATICAL optimization, *FEATURE extraction
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. [ABSTRACT FROM AUTHOR]
Published: 2011

39. Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes.

Author: Zavitsanos, Elias, Paliouras, Georgios, and Vouros, George A.
Subjects: *NONPARAMETRIC statistics, *HIERARCHICAL Bayes model, *DIRICHLET problem, *ALGORITHMS, *BRANCHING processes, *MACHINE learning, *PROBABILITY theory
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. [ABSTRACT FROM AUTHOR]
Published: 2011

40. Bayesian Co-Training.

Author: Yu, Shipeng, Krishnapuram, Balaji, Rosales, Rómer, and Rao, R. Bharat
Subjects: *BAYESIAN analysis, *MATHEMATICAL regularization, *ALGORITHMS, *MACHINE learning, *KERNEL functions, *GAUSSIAN processes, *MATHEMATICAL models
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. [ABSTRACT FROM AUTHOR]
Published: 2011

41. Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning.

Author: Liwei Wang
Subjects: *MACHINE learning, *SMOOTHNESS of functions, *ELECTRONIC noise, *STATISTICAL hypothesis testing, *ALGORITHMS, *DATA distribution, *CLASSIFICATION
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. [ABSTRACT FROM AUTHOR]
Published: 2011

42. Universality, Characteristic Kernels and RKHS Embedding of Measures.

Author: Sriperumbudur, Bharath K., Fukumizu, Kenji, and Lanckriet, Gert R. G.
Subjects: *HILBERT space, *EMBEDDINGS (Mathematics), *KERNEL functions, *MACHINE learning, *PROBABILITY theory, *ALGORITHMS, *BINARY number system
Abstract: Over the last few years, two different notions of positive definite (pd) kernels--universal and characteristic--have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on R^d, all these notions are shown to be equivalent. [ABSTRACT FROM AUTHOR]
Published: 2011

43. Double Updating Online Learning.

Author: Zhao, Peilin, Hoi, Steven C. H., and Jin, Rong
Subjects: *DISTANCE education, *KERNEL functions, *ALGORITHMS, *SUPPORT vector machines, *MACHINE learning, *INFORMATION retrieval, *SET theory, *INFORMATION technology
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/~chhoi/DUOL/. [ABSTRACT FROM AUTHOR]
Published: 2011

44. Laplacian Support Vector Machines Trained in the Primal.

Author: Melacci, Stefano and Belkin, Mikhail
Subjects: *SUPPORT vector machines, *LAPLACIAN operator, *PREDICTION models, *ALGORITHMS, *COMPUTATIONAL complexity, *MATHEMATICAL optimization, *MACHINE learning
Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n³) to O(kn²), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. [ABSTRACT FROM AUTHOR]
Published: 2011

45. Learning Transformation Models for Ranking and Survival Analysis.

Author: Van Belle, Vanya, Pelckmans, Kristiaan, Suykens, Johan A. K., and Van Huffel, Sabine
Subjects: *RANKING, *SURVIVAL analysis (Biometry), *MACHINE learning, *MATHEMATICAL transformations, *SMOOTHNESS of functions, *SUPPORT vector machines, *ALGORITHMS, *PROPORTIONAL hazards models
Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the 'margin' is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program consisting of O(n) constraints and O(n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O(n²) constraints or unknowns. We specify the MINLIP method for three different cases: the first one concerns the preference learning problem. Secondly it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets. [ABSTRACT FROM AUTHOR]
Published: 2011

46. Efficient Structure Learning of Bayesian Networks using Constraints.

Author: de Campos, Cassio P. and Ji, Qiang
Subjects: *BAYESIAN analysis, *CONSTRAINT satisfaction, *MINIMUM description length (Information theory), *ALGORITHMS, *OPTIMAL designs (Statistics), *INFORMATION storage & retrieval systems, *MACHINE learning
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. [ABSTRACT FROM AUTHOR]
Published: 2011

47. On the Foundations of Noise-free Selective Classification.

Author: El-Yaniv, Ran and Wiener, Yair
Subjects: *CLASSIFICATION, *ALGORITHMS, *LEARNING classifier systems, *MACHINE learning, *MACHINE theory
Abstract: We consider selective classification, a term we adopt here to refer to 'classification with a reject option.' The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. [ABSTRACT FROM AUTHOR]
Published: 2010

48. Generalized Power Method for Sparse Principal Component Analysis.

Author: Journée, Michel, Nesterov, Yurii, Richtárik, Peter, and Sepulchre, Rodolphe
Subjects: *PRINCIPAL components analysis, *CONVEX sets, *ALGORITHMS, *COMPUTER science, *MACHINE learning
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. [ABSTRACT FROM AUTHOR]
Published: 2010
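
The "simple gradient method" for maximizing a convex function on a compact set can be illustrated with a power-style iteration. The sketch below is not taken from the abstract, which does not state the objective: it assumes an l1-penalized single-unit objective, the sum over variables of max(|a_i·x| - gamma, 0)² on the unit sphere, and the names `sparse_component`, `columns` and `gamma` are hypothetical.

```python
import math


def sparse_component(columns, gamma, iters=100):
    """Power-style gradient iteration for one sparse loading vector.

    columns: list of data-matrix columns a_i (one per variable), each of length p.
    gamma:   sparsity threshold (assumed l1-style penalty, not from the abstract).
    Returns a unit-norm loading vector with one entry per variable.
    """
    p = len(columns[0])

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    def soft(t):
        # Soft-threshold: shrink |t| by gamma, keep the sign.
        return max(abs(t) - gamma, 0.0) * (1.0 if t >= 0 else -1.0)

    # Start from the largest-norm column, scaled to the unit sphere.
    start = max(columns, key=lambda a: dot(a, a))
    norm = math.sqrt(dot(start, start))
    x = [v / norm for v in start]

    for _ in range(iters):
        # Gradient of the convex objective (up to a factor of 2).
        g = [0.0] * p
        for a in columns:
            w = soft(dot(a, x))
            for k in range(p):
                g[k] += w * a[k]
        g_norm = math.sqrt(dot(g, g))
        if g_norm == 0.0:
            break  # gamma is too large: every variable is thresholded out
        # Project back onto the unit sphere.
        x = [v / g_norm for v in g]

    # Recover the loadings; variables with |a_i . x| <= gamma get exactly 0.
    z = [soft(dot(a, x)) for a in columns]
    z_norm = math.sqrt(dot(z, z))
    return [v / z_norm for v in z] if z_norm > 0 else z
```

With `gamma = 0` the update reduces to the ordinary power method on the data matrix; increasing `gamma` zeroes out more loadings.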

49. Stability Bounds for Stationary φ-mixing and β-mixing Processes.

Author: Mohri, Mehryar and Rostamizadeh, Afshin
Subjects: *MACHINE learning, *DISTRIBUTION (Probability theory), *ALGORITHMS, *KERNEL functions, *GENERALIZATION
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary φ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary φ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our φ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. [ABSTRACT FROM AUTHOR]
Published: 2010

50. An Efficient Explanation of Individual Classifications using Game Theory.

Author: Štrumbelj, Erik and Kononenko, Igor
Subjects: *INFORMATION storage & retrieval systems, *CLASSIFICATION, *ALGORITHMS, *VISUALIZATION, *MACHINE learning
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method's initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. [ABSTRACT FROM AUTHOR]
Published: 2010
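
A sampling-based approximation of a feature's contribution, in the spirit of the coalitional-game (Shapley value) idea described in the abstract above, might look like the following. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sampled_contribution`, the `background` data set used to fill in "absent" features, and the sample count are all hypothetical choices.

```python
import random


def sampled_contribution(model, x, background, feature, n_samples=1000, seed=0):
    """Monte Carlo estimate of one feature's contribution to model(x).

    model:      callable taking a list of feature values, returning a number.
    x:          the instance being explained (list of feature values).
    background: list of reference instances used to replace "absent" features
                (an assumption of this sketch).
    feature:    index of the feature whose contribution is estimated.
    """
    rng = random.Random(seed)
    n = len(x)
    total = 0.0
    for _ in range(n_samples):
        # Random feature ordering and random reference instance.
        order = list(range(n))
        rng.shuffle(order)
        z = rng.choice(background)

        # Features that come before `feature` in the ordering keep x's values.
        preceding = set(order[:order.index(feature)])

        # Two hybrid instances that differ only in the value of `feature`.
        with_f = [x[j] if (j in preceding or j == feature) else z[j] for j in range(n)]
        without_f = [x[j] if j in preceding else z[j] for j in range(n)]

        total += model(with_f) - model(without_f)
    return total / n_samples
```

For a linear model and a single all-zero reference instance, each sample returns exactly the weight times the feature value, so the estimate can be checked by hand; for non-additive models the estimate depends on the sampled orderings and improves as `n_samples` grows.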
