1,729 results for "Statistical learning theory"
Search Results
2. Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory.
- Author
-
Xu, Yunbei and Zeevi, Assaf
- Subjects
STATISTICAL learning ,MACHINE learning ,MISSING data (Statistics) ,ASYMPTOTIC analysis ,LEARNING problems - Abstract
We study problem-dependent rates, that is, generalization errors that scale near-optimally with the variance, effective loss, or gradient norms evaluated at the "best hypothesis." We introduce a principled framework dubbed "uniform localized convergence" and characterize sharp problem-dependent rates for central statistical learning problems. From a methodological viewpoint, our framework resolves several fundamental limitations of existing uniform convergence and localization analysis approaches. It also provides improvements and some level of unification in the study of localized complexities, one-sided uniform inequalities, and sample-based iterative algorithms. In the so-called "slow rate" regime, we provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general "rich" classes; we also establish an improved loss-dependent rate for standard empirical risk minimization. In the "fast rate" regime, we establish finite-sample, problem-dependent bounds that are comparable to precise asymptotics. In addition, we show that iterative algorithms such as gradient descent and first order expectation maximization can achieve optimal generalization error in several representative problems across the areas of nonconvex learning, stochastic optimization, and learning with missing data. Supplemental Material: The online appendix is available at https://doi.org/10.1287/moor.2021.0076. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
3. ϵ-Confidence Approximately Correct (ϵ-CoAC) Learnability and Hyperparameter Selection in Linear Regression Modeling
- Author
-
Soosan Beheshti and Mahdi Shamsi
- Subjects
Statistical learning theory ,sample complexity ,hypothesis class complexity ,Kullback-Leibler divergence ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
In a data-based learning process, a training data set is utilized to provide a hypothesis that can be generalized to explain all data points from a domain set. The hypothesis is chosen from classes with potentially different complexities. Linear regression modeling is an important category of learning algorithms. The practical uncertainty of the label samples in the training data set has a major effect on the generalization ability of the learned model. Failing to choose a proper model or hypothesis class can lead to serious issues such as underfitting or overfitting. These issues have been addressed mostly by altering modeling cost functions or by utilizing cross-validation methods. Drawbacks of these methods include introducing new hyperparameters with their own new challenges and uncertainties, a potential increase in computational complexity, or requiring large training data sets. On the other hand, the theory of probably approximately correct (PAC) learning aims at defining learnability based on probabilistic settings. Despite its theoretical value, PAC bounds cannot be utilized in practical regression learning applications where only the training data set is available. This work is motivated by practical issues in regression learning generalization and is inspired by the foundations of the theory of statistical learning. The proposed approach, denoted by $\epsilon$-Confidence Approximately Correct ($\epsilon$-CoAC), utilizes the conventional Kullback-Leibler divergence (relative entropy) and defines new related typical sets to develop a unique method of probabilistic statistical learning for practical regression learning and generalization. $\epsilon$-CoAC learnability is able to validate the learning process as a function of the training data sample size, as well as a function of the hypothesis class complexity order. Consequently, it enables the learner to automatically compare hypothesis classes of different complexity orders and to choose among them the optimum class with the minimum $\epsilon$ in the $\epsilon$-CoAC framework. $\epsilon$-CoAC learnability overcomes the issues of overfitting and underfitting. In addition, it shows advantages over the well-known cross-validation method in terms of accuracy and the data length required for convergence. Simulation results, for both synthetic and real data, confirm not only the strength and capability of $\epsilon$-CoAC in providing learning measurements as a function of data length and/or hypothesis complexity, but also the superiority of the method over existing approaches to hypothesis complexity and model selection.
- Published
- 2025
- Full Text
- View/download PDF
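The $\epsilon$-CoAC framework in the entry above is built on the Kullback-Leibler divergence. Purely as background (this is a standard identity, not the paper's estimator or its typical sets), the closed form of the KL divergence between two univariate Gaussians, the usual building block in Gaussian-noise regression settings, is:

\[
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\bigr)
= \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.
\]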
4. Statistical learning to identify salient factors influencing FEMA public assistance outlays.
- Author
-
Ghaedi, Hamed, Best, Kelsea, Reilly, Allison, and Niemeier, Deb
- Subjects
STATISTICAL learning ,FLOOD damage ,INFRASTRUCTURE (Economics) ,HUMAN behavior ,STATE governments - Abstract
Both the number of disasters in the U.S. and federal outlays following disasters are rising. FEMA's Public Assistance (PA) is a key program for rebuilding damaged public infrastructure and aiding local and state governments in recovery. It is the primary post-disaster source of recovery funds. Between 2000 and 2019, more than $125B (adjusted, 2020 dollars) was awarded through PA. While all who qualify for PA should have equal opportunity to receive aid, not all do, and the factors influencing how the program has been administered are complex and multifaceted. Lacking an understanding of the factors positively associated with historical receipt of aid, there is little way to objectively evaluate the efficacy of the PA program. In this work, we evaluate the salient features that contribute to the number of county-level PA applicants and projects following disasters. We use statistical learning theory applied to repetitive flooding events in the upper Midwest between 2003 and 2018 as a case study. The results suggest that many non-disaster related indicators are key predictors of PA outlays, including the state in which the disaster occurred, the county's prior experience with disasters, the county's median income, and the length of time between the end of the disaster and the date when a disaster is declared. Our work suggests that indicators of PA aid are tied to exposure, bureaucratic attributes, and human behavior. For equitable distribution of aid, policymakers should explore more disaster-relevant indicators for PA distribution. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. Towards Understanding the Mechanism of Contrastive Learning via Similarity Structure: A Theoretical Analysis
- Author
-
Waida, Hiroki, Wada, Yuichiro, Andéol, Léo, Nakagawa, Takumi, Zhang, Yuhui, Kanamori, Takafumi, Koutra, Danai, editor, Plant, Claudia, editor, Gomez Rodriguez, Manuel, editor, Baralis, Elena, editor, and Bonchi, Francesco, editor
- Published
- 2023
- Full Text
- View/download PDF
6. A remark about a learning risk lower bound.
- Author
-
Leung, Man Fung, Lin, Yiqi, and Wicker, Nicolas
- Subjects
- *
STATISTICAL learning - Abstract
In this paper, we correct the learning risk lower bound in the non-realizable case provided by Anthony and Bartlett in 1999. Our contribution is mainly technical: we follow the proof of Anthony and Bartlett closely, first correcting the lemma they use and then adapting the lower bound proof itself. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
7. Set-Valued Support Vector Machine with Bounded Error Rates.
- Author
-
Wang, Wenbo and Qiao, Xingye
- Subjects
- *
SUPPORT vector machines , *ERROR rates , *STATISTICAL learning , *CONSTRAINED optimization - Abstract
This article concerns cautious classification models that are allowed to predict a set of class labels or to reject making a prediction when the uncertainty in the prediction is high. This set-valued classification approach is equivalent to the task of acceptance region learning, which aims to identify subsets of the input space, each of which is guaranteed to cover observations in a class with at least a predetermined probability. We propose to directly learn the acceptance regions through risk minimization, by making use of a truncated hinge loss and a constrained optimization framework. Collectively, our theoretical analyses show that these acceptance regions, with high probability, satisfy two properties simultaneously: (a) they guarantee to cover each class with a noncoverage rate bounded from above; (b) they give the least ambiguous predictions among all the acceptance regions satisfying (a). An efficient algorithm is developed and numerical studies are conducted using both simulated and real data. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
8. Benign overfitting in linear regression
- Author
-
Bartlett, Peter L, Long, Philip M, Lugosi, Gábor, and Tsigler, Alexander
- Subjects
Information and Computing Sciences ,Machine Learning ,statistical learning theory ,overfitting ,linear regression ,interpolation - Abstract
The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.
- Published
- 2020
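The entry above characterizes benign overfitting through the minimum-norm interpolating rule and two notions of effective rank of the data covariance. As a rough illustration (not the authors' code; the definitions of r_k and R_k below follow the standard usage in this literature and should be checked against the paper), the following sketch computes a minimum-norm interpolator via the pseudoinverse and the two effective-rank quantities:

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum L2-norm solution of X w = y (exact fit when n < d and rows are independent)."""
    return np.linalg.pinv(X) @ y

def effective_ranks(cov, k):
    """Two effective-rank notions of a covariance matrix beyond index k:
        r_k = (sum_{i>k} lambda_i) / lambda_{k+1}
        R_k = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    """
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, descending
    tail = lam[k:]
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

# toy overparameterized example: n = 50 samples in d = 500 dimensions
rng = np.random.default_rng(0)
n, d = 50, 500
lam = 1.0 / (1.0 + np.arange(d))            # slowly decaying spectrum
X = rng.normal(size=(n, d)) * np.sqrt(lam)
y = X[:, 0] + 0.1 * rng.normal(size=n)
w = min_norm_interpolator(X, y)
print("max training residual:", np.max(np.abs(X @ w - y)))   # ~0: perfect fit
print("effective ranks at k=10:", effective_ranks(np.diag(lam), 10))
```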
9. Information Losses in Neural Classifiers From Sampling
- Author
-
Foggo, Brandon, Yu, Nanpeng, Shi, Jie, and Gao, Yuanqi
- Subjects
Information and Computing Sciences ,Machine Learning ,Neural networks ,Machine learning ,Training ,Random variables ,Training data ,Probability distribution ,Learning systems ,Deep learning ,information theory ,large deviations theory ,mutual information ,statistical learning theory ,cs.LG ,stat.ML ,Artificial Intelligence & Image Processing ,Artificial intelligence - Abstract
This article considers the subject of information losses arising from the finite data sets used in the training of neural classifiers. It proves a relationship between such losses and the product of the expected total variation of the estimated neural model with the information about the feature space contained in the hidden representation of that model. It then bounds this expected total variation as a function of the size of randomly sampled data sets in a fairly general setting, and without bringing in any additional dependence on model complexity. It ultimately obtains bounds on information losses that are less sensitive to input compression and in general much smaller than existing bounds. This article then uses these bounds to explain some recent experimental findings of information compression in neural networks that cannot be explained by previous work. Finally, this article shows that not only are these bounds much smaller than existing ones, but they also correspond well with experiments.
- Published
- 2020
10. Bavarian: Betweenness Centrality Approximation with Variance-aware Rademacher Averages.
- Author
-
COUSINS, CYRUS, WOHLGEMUTH, CHLOE, and RIONDATO, MATTEO
- Subjects
STATISTICAL learning ,CENTRALITY ,DEVIATION (Statistics) ,PERCOLATION theory - Abstract
We present Bavarian, a collection of sampling-based algorithms for approximating the Betweenness Centrality (BC) of all vertices in a graph. Our algorithms use Monte-Carlo Empirical Rademacher Averages (MCERAs), a concept from statistical learning theory, to efficiently compute tight bounds on the maximum deviation of the estimates from the exact values. The MCERAs provide a sample-dependent approximation guarantee much stronger than the state-of-the-art, thanks to their use of variance-aware probabilistic tail bounds. The flexibility of the MCERAs allows us to introduce a unifying framework that can be instantiated with existing sampling-based estimators of BC, thus allowing a fair comparison between them, decoupled from the sample-complexity results with which they were originally introduced. Additionally, we prove novel sample-complexity results showing that, for all estimators, the sample size sufficient to achieve a desired approximation guarantee depends on the vertex-diameter of the graph, an easy-to-bound characteristic quantity. We also show progressive-sampling algorithms and extensions to other centrality measures, such as percolation centrality. Our extensive experimental evaluation of Bavarian shows the improvement over the state-of-the-art made possible by the MCERAs (a 2–4× reduction in the error bound), and it allows us to assess the different trade-offs between sample size and accuracy guarantees offered by the different estimators. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
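To make the central quantity of the entry above concrete, here is a minimal sketch of a Monte-Carlo Empirical Rademacher Average for a finite family of bounded functions evaluated on a sample. It illustrates the generic MCERA formula only, not the Bavarian BC estimators; the family size and number of trials are illustrative.

```python
import numpy as np

def mcera(values, n_trials=100, rng=None):
    """Monte-Carlo Empirical Rademacher Average for a finite function family.

    values: array of shape (num_functions, num_samples), values[f, i] = f(x_i).
    Returns the average over trials of sup_f (1/m) * sum_i sigma_i * f(x_i),
    with sigma_i i.i.d. Rademacher (+1/-1) signs drawn fresh in each trial.
    """
    rng = np.random.default_rng(rng)
    m = values.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(values @ sigma) / m
    return total / n_trials

# toy example: 20 bounded "estimators" evaluated on 1000 sample points
rng = np.random.default_rng(1)
vals = rng.uniform(0.0, 1.0, size=(20, 1000))
print("MCERA estimate:", mcera(vals, n_trials=50, rng=2))
```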
11. A Statistical Learning Theory Approach for the Analysis of the Trade-off Between Sample Size and Precision in Truncated Ordinary Least Squares
- Author
-
Gnecco, Giorgio, Raciti, Fabio, Selvi, Daniela, Nikeghbali, Ashkan, editor, Raigorodskii, Andrei M., editor, and Rassias, Michael Th., editor
- Published
- 2022
- Full Text
- View/download PDF
12. Hold-out estimates of prediction models for Markov processes.
- Author
-
Garnier, Remy, Langhendries, Raphaël, and Rynkiewicz, Joseph
- Subjects
- *
MARKOV processes , *PREDICTION models , *TIME series analysis , *STATISTICAL learning , *INDEPENDENT sets , *BOUND states , *GENERALIZATION - Abstract
We consider the selection of prediction models for Markovian time series. For this purpose, we study the theoretical properties of the hold-out method. In the econometrics literature, the hold-out method is called 'out-of-sample' and is the main method used to select a suitable time series model. This method consists of estimating models on a learning set and picking the model with minimal empirical error on a validation set of future observations. Hold-out estimates are well studied in the independent case, but, as far as we know, this is not the case when the validation set is not independent of the learning set. In this paper, assuming uniform ergodicity of the Markov chain, we state generalization bounds and oracle inequalities for this method; in particular, we show that the 'out-of-sample' selection method is adaptive to the noise condition. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
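A minimal sketch of the hold-out ('out-of-sample') selection procedure described in the entry above, applied to autoregressive models of different orders fitted by least squares. The candidate model family, split point, and error metric are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model x_t = c + a_1 x_{t-1} + ... + a_p x_{t-p}."""
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coef

def ar_mse(x, coef, p):
    """One-step-ahead mean squared prediction error of a fitted AR(p) model."""
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((X @ coef - x[p:]) ** 2))

rng = np.random.default_rng(0)
x = np.zeros(600)                       # simulate an AR(2) series
for t in range(2, 600):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

train, valid = x[:400], x[400:]         # validation set = future observations, order preserved
errors = {p: ar_mse(valid, fit_ar(train, p), p) for p in (1, 2, 3, 5)}
best = min(errors, key=errors.get)      # pick the model with minimal hold-out error
print(errors, "-> selected order", best)
```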
13. Distributions-free Martingales Test Distributions-shift.
- Author
-
Xi, Zepu, Chen, Hongbo, Chen, Xiaoqian, and Yao, Wen
- Subjects
MACHINE learning ,MARTINGALES (Mathematics) ,STATISTICAL learning ,MACHINE theory ,DISTRIBUTION (Probability theory) ,PROBABILITY theory ,SUPPORT vector machines - Abstract
A standard assumption of the theory of machine learning is that the data are generated from a fixed but unknown probability distribution. Although this assumption is grounded in the foundations of probability theory, for most learning problems we usually shuffle the original datasets at random, for example by randomly splitting them into training and test datasets before training, in order to satisfy the assumption, and we then use the shuffled training dataset to train a machine learning model. In real-life learning applications, however, the data pairs are observed batch by batch in their original order, and it is not necessary to randomly shuffle that order in advance. From a mathematical point of view, we test whether random shuffling has a non-negligible influence on the generalization of learning machines. We reduce the problem of random shuffling to the problem of distribution-shift detection. This paper is devoted to testing the null hypothesis that random shuffling does not affect the generalization of learning machines, and it introduces a distribution-free martingale method against the null hypothesis. We report experimental performance on five real-life benchmarks with the help of Support Vector Machines and a multi-layer perceptron model. The results show that distribution shift in the data is an inescapable reality when we build machine learning algorithms on data in their original order. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
14. Revisiting generalization for deep learning : PAC-Bayes, flat minima, and generative models
- Author
-
Dziugaite, Gintare Karolina and Ghahramani, Zoubin
- Subjects
Deep learning ,statistical learning theory ,Generalization in neural networks ,adversarial learning ,PAC-Bayesian bounds ,generative models - Abstract
In this work, we construct generalization bounds to understand existing learning algorithms and propose new ones. Generalization bounds relate empirical performance to future expected performance. The tightness of these bounds varies widely, and depends on the complexity of the learning task and the amount of data available, but also on how much information the bounds take into consideration. We are particularly concerned with data- and algorithm-dependent bounds that are quantitatively nonvacuous. We begin with an analysis of stochastic gradient descent (SGD) in supervised learning. By formalizing the notion of flat minima using PAC-Bayes generalization bounds, we obtain nonvacuous generalization bounds for stochastic classifiers based on SGD solutions. Despite strong empirical performance in many settings, SGD rapidly overfits in others. By combining nonvacuous generalization bounds and structural risk minimization, we arrive at an algorithm that trades off accuracy and generalization guarantees. We also study generalization in the context of unsupervised learning. We propose to use a two-sample test statistic for training neural network generator models and bound the gap between the population and the empirical estimate of the statistic.
- Published
- 2019
- Full Text
- View/download PDF
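For readers unfamiliar with the PAC-Bayes bounds referenced in the entry above, one commonly cited classical form (a McAllester/Maurer-style bound for losses in [0, 1], stated here only as background and not as the bound derived in the thesis) is:

\[
\mathbb{P}\!\left[\;\forall Q:\;\;
L(Q) \;\le\; \widehat{L}(Q) \;+\;
\sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}\;\right] \;\ge\; 1-\delta,
\]

where $P$ is a prior over hypotheses fixed before seeing the $n$ training examples, $Q$ is any posterior, and $\widehat{L}(Q)$ and $L(Q)$ are the empirical and population risks of the corresponding Gibbs classifier.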
15. Investigating the ability of PINNs to solve Burgers’ PDE near finite-time blowup
- Author
-
Dibyakanti Kumar and Anirbit Mukherjee
- Subjects
physics informed neural nets ,deep learning ,neural networks ,PDE ,statistical learning theory ,Burgers’ PDE ,Computer engineering. Computer hardware ,TK7885-7895 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Physics Informed Neural Networks (PINNs) have been achieving ever newer feats of solving complicated Partial Differential Equations (PDEs) numerically while offering an attractive trade-off between accuracy and speed of inference. A particularly challenging aspect of PDEs is that there exist simple PDEs which can evolve into singular solutions in finite time starting from smooth initial conditions. In recent times some striking experiments have suggested that PINNs might be good at even detecting such finite-time blow-ups. In this work, we embark on a program to investigate this stability of PINNs from a rigorous theoretical viewpoint. Firstly, we derive error bounds for PINNs for Burgers’ PDE, in arbitrary dimensions, under conditions that allow for a finite-time blow-up. Our bounds give a theoretical justification for the functional regularization terms that have been reported to be useful for training PINNs near finite-time blow-up. Then we demonstrate via experiments that our bounds are significantly correlated to the $\ell_2$-distance of the neurally found surrogate from the true blow-up solution, when computed on sequences of PDEs that are getting increasingly close to a blow-up.
- Published
- 2024
- Full Text
- View/download PDF
16. Machine learning advances for time series forecasting.
- Author
-
Masini, Ricardo P., Medeiros, Marcelo C., and Mendes, Eduardo F.
- Subjects
ARTIFICIAL neural networks ,MACHINE learning ,SUPERVISED learning ,TIME series analysis ,RANDOM forest algorithms ,ECONOMICS education ,CREDIT scoring systems - Abstract
In this paper, we survey the most recent advances in supervised machine learning (ML) and high‐dimensional models for time‐series forecasting. We consider both linear and nonlinear alternatives. Among the linear methods, we pay special attention to penalized regressions and ensemble of models. The nonlinear methods considered in the paper include shallow and deep neural networks, in their feedforward and recurrent versions, and tree‐based methods, such as random forests and boosted trees. We also consider ensemble and hybrid models by combining ingredients from different alternatives. Tests for superior predictive ability are briefly reviewed. Finally, we discuss application of ML in economics and finance and provide an illustration with high‐frequency financial data. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
17. Set-valued Classification with Out-of-distribution Detection for Many Classes.
- Author
-
Zhou Wang and Xingye Qiao
- Subjects
- *
STATISTICAL learning , *FEATURE selection , *NUMERICAL analysis - Abstract
Set-valued classification, a new classification paradigm that aims to identify all the plausible classes that an observation belongs to, improves over the traditional classification paradigms in multiple aspects. Existing set-valued classification methods do not consider the possibility that the test set may contain out-of-distribution data, that is, the emergence of a new class that never appeared in the training data. Moreover, they are computationally expensive when the number of classes is large. We propose a Generalized Prediction Set (GPS) approach to set-valued classification while considering the possibility of a new class in the test data. The proposed classifier uses kernel learning and empirical risk minimization to encourage a small expected size of the prediction set while guaranteeing that the class-specific accuracy is at least some value specified by the user. For high-dimensional data, further improvement is obtained through kernel feature selection. Unlike previous methods, the proposed method achieves a good balance between accuracy, efficiency, and out-of-distribution detection rate. Moreover, our method can be applied in parallel to all the classes to alleviate the computational burden. Both theoretical analysis and numerical experiments are conducted to illustrate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2023
18. A Unified Recipe for Deriving (Time-Uniform) PAC-Bayes Bounds.
- Author
-
Ben Chugg, Hongjian Wang, and Aaditya Ramdas
- Subjects
- *
MARTINGALES (Mathematics) , *STOCHASTIC processes , *STATISTICAL learning , *SAMPLE size (Statistics) - Abstract
We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound. [ABSTRACT FROM AUTHOR]
- Published
- 2023
19. Risk Bounds for Positive-Unlabeled Learning Under the Selected At Random Assumption.
- Author
-
Coudray, Olivier, Keribin, Christine, Massart, Pascal, and Pamphile, Patrick
- Subjects
- *
STATISTICAL learning - Abstract
Positive-Unlabeled learning (PU learning) is a special case of semi-supervised binary classification where only a fraction of positive examples is labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to the standard classification setting. Finally, we provide a lower bound on the minimax risk proving that the upper bound is almost optimal. [ABSTRACT FROM AUTHOR]
- Published
- 2023
20. Compression, Generalization and Learning.
- Author
-
Campi, Marco C. and Garatti, Simone
- Subjects
- *
DISTRIBUTION (Probability theory) , *STATISTICAL learning , *GENERALIZATION - Abstract
A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the condition that one new observation makes the compressed set change is interpreted as meaning that this observation brings in extra information and, in learning theory, this corresponds to misclassification, or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control on the probability of change of compression (which maps into the statistical "risk" in learning applications). Under suitable conditions, the cardinality of the compressed set is shown to be a consistent estimator of the probability of change of compression (without any upper limit on the size of the compressed set); moreover, unprecedentedly tight finite-sample bounds to evaluate the probability of change of compression are obtained under a generally applicable condition of preference. All results are usable in a fully agnostic setup, i.e., without requiring any a priori knowledge on the probability distribution of the observations. Not only do these results offer valid support for developing trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning. [ABSTRACT FROM AUTHOR]
- Published
- 2023
21. Learning an Explicit Hyper-parameter Prediction Function Conditioned on Tasks.
- Author
-
Jun Shu, Deyu Meng, and Zongben Xu
- Subjects
- *
STATISTICAL learning , *MACHINE learning , *SET functions - Abstract
Meta learning has attracted much attention recently in the machine learning community. Contrary to conventional machine learning, which aims to learn inherent prediction rules to predict labels for new query data, meta learning aims to learn the learning methodology for machine learning from observed tasks, so as to generalize to new query tasks by leveraging the meta-learned learning methodology. In this study, we achieve such a learning methodology by learning an explicit hyper-parameter prediction function shared by all training tasks, and we call this learning process Simulating Learning Methodology (SLeM). Specifically, this function is represented as a parameterized function called the meta-learner, mapping from a training/test task to its suitable hyper-parameter setting, extracted from a pre-specified function set called the meta learning machine. This setting guarantees that the meta-learned learning methodology is able to flexibly fit diverse query tasks, instead of the fixed hyper-parameters obtained by many current meta learning methods, which have less adaptability to query task variations. Such an understanding of meta learning also makes it easy to inherit tools from traditional learning theory for analyzing its generalization bounds with general losses/tasks/models. The theory naturally leads to some feasible controlling strategies for ameliorating the quality of the extracted meta-learner, verified to be able to finely ameliorate its generalization capability in some typical meta learning applications, including few-shot regression, few-shot classification and domain generalization. The source code of our method is released at https://github.com/xjtushujun/SLeM-Theory. [ABSTRACT FROM AUTHOR]
- Published
- 2023
22. Design and Testing Novel One-Class Classifier Based on Polynomial Interpolation With Application to Networking Security
- Author
-
Pierpaolo Dini, Andrea Begni, Stefano Ciavarella, Emiliano De Paoli, Giuseppe Fiorelli, Carmelo Silvestro, and Sergio Saponara
- Subjects
Anomaly and intrusion detection ,machine learning ,statistical learning theory ,classification ,data management ,networking ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
This work exploits the concept of a one-class classifier applied to the problem of anomaly detection in communication networks. The article presents the design of an innovative anomaly detection algorithm based on a polynomial interpolation technique and statistical analysis. The innovative method is applied to datasets widely used in the scientific community for benchmarking, such as KDD99, UNSW-NB15 and CSE-CIC-IDS-2018, and further evaluated with application to the recently available EDGE-IIOTSET 2022 dataset. The paper also reports experimental results showing that the proposed methodology outperforms classic one-class classifiers, such as Extreme Learning Machine and Support Vector Machine models, and rule-based intrusion detection systems like SNORT. With respect to binary classifiers, this work has the advantage of not requiring any a priori knowledge about attacks and is based on the collection of only normal data traffic.
- Published
- 2022
- Full Text
- View/download PDF
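The entry above builds a one-class anomaly detector from polynomial interpolation plus simple statistics. The sketch below is only a generic stand-in for that idea (fit a low-degree polynomial to a traffic feature observed on normal data only, then flag points whose residual exceeds a threshold); the feature, degree and threshold are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def fit_normal_profile(t, feature, degree=3):
    """Fit a low-degree polynomial to a feature observed on normal traffic only,
    and record the residual spread later used as an anomaly threshold."""
    coeffs = np.polyfit(t, feature, deg=degree)
    sigma = (feature - np.polyval(coeffs, t)).std()
    return coeffs, sigma

def flag_anomalies(t, feature, coeffs, sigma, k=3.0):
    """Flag points whose residual from the fitted profile exceeds k standard deviations."""
    residuals = np.abs(feature - np.polyval(coeffs, t))
    return residuals > k * sigma

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
normal = 5.0 + 0.5 * t - 0.03 * t**2 + rng.normal(0.0, 0.2, size=t.size)  # training: normal only
coeffs, sigma = fit_normal_profile(t, normal)

test = normal.copy()
test[100:110] += 3.0                                   # injected anomalous burst
print("flagged indices:", np.where(flag_anomalies(t, test, coeffs, sigma))[0])
```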
23. Improving the Interpretation of Data-Driven Water Consumption Models via the Use of Social Norms.
- Author
-
Obringer, Renee, Nateghi, Roshanak, Ma, Zhao, and Kumar, Rohini
- Subjects
- *
WATER consumption , *RESIDENTIAL water consumption , *CONSUMPTION (Economics) , *SOCIAL norms , *WATER utilities , *STATISTICAL learning - Abstract
Water is essential to improving social equity, promoting just economic development and protecting the function of the Earth system. It is therefore important to have access to credible models of water consumption, so as to ensure that water utilities can adequately supply water to meet the growing demand. Within the literature, there are a variety of models, but often these models evaluate the water consumption at aggregate scales (e.g., city or regional), thus overlooking intra-city differences. Conversely, the models that evaluate intra-city differences tend to rely heavily on one or two sources of quantitative data (e.g., climate variables or demographics), potentially missing key cultural aspects that may act as confounding factors in quantitative models. Here, we present a novel mixed-methods approach to predict intra-city residential water consumption patterns by integrating climate and demographic data, and by incorporating social norm data to aid the interpretation of model results. Using Indianapolis, Indiana as a test case, we show the value in adopting a more integrative approach to modeling residential water consumption. In particular, we leverage qualitative interview data to interpret the results from a predictive model based on a state-of-the-art machine learning algorithm. This integrative approach provides community-specific interpretations of model results that would otherwise not be observed by considering demographics alone. Ultimately, the results demonstrate the value and importance of such approaches when working on complex problems. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. Statistical modeling and inference in the era of Data Science and Graphical Causal modeling.
- Subjects
DATA science ,CAUSAL models ,INFERENTIAL statistics ,STATISTICAL models ,STATISTICAL learning ,ARTIFICIAL intelligence - Abstract
The paper discusses four paradigm shifts in statistics since the 1920s with a view to comparing their similarities and differences, and evaluating their effectiveness in giving rise to 'learning from data' about phenomena of interest. The first is Fisher's 1922 recasting of Karl Pearson's descriptive statistics into a model-based [$\mathcal{M}_{\theta}(\mathbf{x})$] statistical induction that dominates current statistics (frequentist and Bayesian). A crucial departure was Fisher's replacing the curve-fitting perspective guided by goodness-of-fit measures with a model-based perspective guided by statistical adequacy: the validity of the probabilistic assumptions comprising $\mathcal{M}_{\theta}(\mathbf{x})$. Statistical adequacy is pivotal in securing trustworthy evidence since it underwrites the reliability of inference. The second is the nonparametric turn in the 1970s aiming to broaden $\mathcal{M}_{\theta}(\mathbf{x})$ by replacing its distribution assumption with weaker mathematical conditions relating to the unknown density function underlying $\mathcal{M}_{\theta}(\mathbf{x})$. The third is a two-pronged development initiated in Artificial Intelligence (AI) in the 1990s that gave rise to Data Science (DS) and Graphical Causal (GC) modeling. The primary objective of the paper is to compare and evaluate the other competing approaches with a refined/enhanced version of Fisher's model-based approach in terms of their effectiveness in giving rise to genuine "learning from data"; excellent goodness-of-fit/prediction is neither necessary nor sufficient for statistical adequacy, or so it is argued. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
25. MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining.
- Author
-
PELLEGRINA, LEONARDO, COUSINS, CYRUS, VANDIN, FABIO, and RIONDATO, MATTEO
- Subjects
PARTIALLY ordered sets ,STATISTICAL learning ,STATISTICAL power analysis - Abstract
We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both (1) statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility offered by MCRapper is a big advantage over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
26. Should Simplicity Be Always Preferred to Complexity in Supervised Machine Learning?
- Author
-
Bargagli-Stoffi, Falco, Cevolani, Gustavo, Gnecco, Giorgio, Nicosia, Giuseppe, editor, Ojha, Varun, editor, La Malfa, Emanuele, editor, Jansen, Giorgio, editor, Sciacca, Vincenzo, editor, Pardalos, Panos, editor, Giuffrida, Giovanni, editor, and Umeton, Renato, editor
- Published
- 2020
- Full Text
- View/download PDF
27. Data Streams Are Time Series: Challenging Assumptions
- Author
-
Read, Jesse, Rios, Ricardo A., Nogueira, Tatiane, de Mello, Rodrigo F., Cerri, Ricardo, editor, and Prati, Ronaldo C., editor
- Published
- 2020
- Full Text
- View/download PDF
28. The Support Vector Machine
- Author
-
Land, Walker H., Jr. and Schaffer, J. David
- Published
- 2020
- Full Text
- View/download PDF
29. The benefits of adversarial defense in generalization.
- Author
-
Oneto, Luca, Ridella, Sandro, and Anguita, Davide
- Subjects
- *
STATISTICAL learning , *DEEP learning , *GENERALIZATION , *MACHINE learning , *MACHINE theory - Abstract
Recent research has shown that models induced by machine learning, and in particular by deep learning, can be easily fooled by an adversary who carefully crafts imperceptible (at least from the human perspective) or physically plausible modifications of the input data. This discovery gave birth to a new field of research, adversarial machine learning, where new methods of attack and defense are developed continuously, mimicking what has long been happening in cybersecurity. In this paper we show that the drawbacks of inducing models that are less prone to being misled can actually come with some benefits when it comes to assessing their generalization abilities. We show these benefits both from a theoretical perspective, using state-of-the-art statistical learning theory, and through practical examples. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
30. Detection of outliers in high-dimensional data using nu-support vector regression.
- Author
-
Mohammed Rashid, Abdullah, Midi, Habshah, Dhhan, Waleed, and Arasan, Jayanthi
- Subjects
- *
OUTLIER detection , *MONTE Carlo method , *STATISTICAL learning , *APPLIED sciences , *INDEPENDENT variables - Abstract
Support Vector Regression (SVR) is gaining in popularity for the detection of outliers and for classification problems in high-dimensional data (HDD), as this technique does not require the data to be of full rank. In real applications, most data are high dimensional. Classification of high-dimensional data is needed in the applied sciences, in particular because it is important to discriminate cancerous cells from non-cancerous cells. It is also imperative that outliers are identified before constructing a model of the relationship between the dependent and independent variables, to avoid misleading interpretations about the fit of the model. The standard SVR and the μ-ε-SVR are able to detect outliers; however, they are computationally expensive. The fixed parameters support vector regression (FP-ε-SVR) was put forward to remedy this issue. However, the FP-ε-SVR using ε-SVR is not very successful in identifying outliers. In this article, we propose an alternative method to detect outliers, i.e., by employing nu-SVR. The merit of our proposed method is confirmed by three real examples and a Monte Carlo simulation. The results show that our proposed nu-SVR method is very successful in identifying outliers under a variety of situations, and with less computational running time. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
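A minimal sketch of the general idea in the entry above (fit a nu-SVR on the data and flag observations with unusually large residuals). The hyperparameters and the MAD-based residual threshold are illustrative assumptions, not the authors' exact procedure for high-dimensional data.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
y[:5] += 8.0                                   # plant a few vertical outliers

model = NuSVR(nu=0.5, C=10.0, kernel="linear").fit(X, y)
residuals = np.abs(y - model.predict(X))

# flag points whose residual exceeds 2.5 robust standard deviations (MAD-based)
mad = np.median(np.abs(residuals - np.median(residuals)))
flagged = np.where(residuals > np.median(residuals) + 2.5 * 1.4826 * mad)[0]
print("flagged observations:", flagged)
```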
31. Nonlinear optimization and support vector machines.
- Author
-
Piccialli, Veronica and Sciandrone, Marco
- Subjects
- *
SUPPORT vector machines , *CONVEX programming , *STATISTICAL learning , *MACHINE learning , *KERNEL functions - Abstract
Support vector machine (SVM) is one of the most important classes of machine learning models and algorithms, and has been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM, focusing on supervised binary classification. We analyze the most important and widely used optimization methods for SVM training problems, and we discuss how the properties of these problems can be incorporated in designing useful algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
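For reference, the two convex programs at the heart of the soft-margin SVM for binary classification discussed in the entry above, in their standard textbook form (primal, and kernelized dual):

\[
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
\]
\[
\max_{\alpha}\;\; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, K(x_i,x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{n}\alpha_i y_i = 0.
\]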
32. Learning from fuzzy labels: Theoretical issues and algorithmic solutions
- Author
-
Campagner, A, Campagner A., Campagner, A, and Campagner A.
- Abstract
In this article we study the problem of learning from fuzzy labels (LFL), a form of weakly supervised learning in which the supervision target is not precisely specified but is instead given in the form of possibility distributions that express the imprecise knowledge of the annotating agent. While several approaches for LFL have been proposed in the literature, including generalized risk minimization (GRM), instance-based methods and pseudo label-based learning, both their theoretical properties and their empirical performance have scarcely been studied. We address this gap by: first, presenting a review of previous results on the sample complexity and generalization bounds for GRM and instance-based methods; second, studying their computational complexity, proving in particular the impossibility of efficiently solving LFL using GRM, as well as related impossibility theorems. We then propose a novel pseudo label-based learning method, called Random Resampling-based Learning (RRL), which directly draws from ensemble learning and possibility theory, and we study its learning- and complexity-theoretic properties, showing that it achieves guarantees similar to those for GRM while being computationally efficient. Finally, we study the empirical performance of several state-of-the-art LFL algorithms on a wide set of synthetic and real-world benchmark datasets, by which we confirm the effectiveness of the proposed RRL method. Additionally, we describe directions for future research, and highlight opportunities for further interaction between machine learning and uncertainty representation theories.
- Published
- 2024
33. Disagreement-Based Active Learning in Online Settings.
- Author
-
Huang, Boshuang, Salgia, Sudeep, and Zhao, Qing
- Subjects
- *
ACTIVE learning , *ONLINE education , *STATISTICAL learning , *ONLINE algorithms , *PICTURE archiving & communication systems - Abstract
We study online active learning for classifying streaming instances within the framework of statistical learning theory. At each time, the learner either queries the label of the current instance or predicts the label based on past seen examples. The objective is to minimize the number of queries while constraining the number of prediction errors over a horizon of length $T$. We develop a disagreement-based online learning algorithm for a general hypothesis space under the Tsybakov noise condition and establish its label complexity under a constraint of bounded regret in terms of classification errors. We further establish a matching (up to a poly-logarithmic factor) lower bound, demonstrating the order optimality of the proposed algorithm. We address the tradeoff between label complexity and regret and show that the algorithm can be modified to operate at a different point on the tradeoff curve. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
34. Stability selection enables robust learning of differential equations from limited noisy data.
- Author
-
Maddu, Suryanarayana, Cheeseman, Bevan L., Sbalzarini, Ivo F., and Müller, Christian L.
- Subjects
- *
DIFFERENTIAL equations , *THRESHOLDING algorithms , *STATISTICAL learning , *MOLECULAR interactions , *PROTEIN-protein interactions , *COMPRESSED sensing , *MATHEMATICAL regularization - Abstract
We present a statistical learning framework for robust identification of differential equations from noisy spatio-temporal data. We address two issues that have so far limited the application of such methods, namely their robustness against noise and the need for manual parameter tuning, by proposing stability-based model selection to determine the level of regularization required for reproducible inference. This avoids manual parameter tuning and improves robustness against noise in the data. Our stability selection approach, termed PDE-STRIDE, can be combined with any sparsity-promoting regression method and provides an interpretable criterion for model component importance. We show that the particular combination of stability selection with the iterative hard-thresholding algorithm from compressed sensing provides a fast and robust framework for equation inference that outperforms previous approaches with respect to accuracy, amount of data required, and robustness. We illustrate the performance of PDE-STRIDE on a range of simulated benchmark problems, and we demonstrate the applicability of PDE-STRIDE on real-world data by considering purely data-driven inference of the protein interaction network for embryonic polarization in Caenorhabditis elegans. Using fluorescence microscopy images of C. elegans zygotes as input data, PDE-STRIDE is able to learn the molecular interactions of the proteins. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
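A minimal sketch of the generic combination the entry above describes: repeatedly subsample the data, run a sparsity-promoting regression with hard thresholding on each subsample, and keep the candidate terms selected in a large fraction of the runs. The thresholding rule, subsample fraction and stability cutoff are illustrative assumptions, not the PDE-STRIDE defaults.

```python
import numpy as np

def hard_threshold_regression(Theta, y, thresh=0.1, iters=10):
    """Sequentially thresholded least squares: fit, zero out small coefficients, refit."""
    coef = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(coef) < thresh
        coef[small] = 0.0
        big = ~small
        if big.any():
            coef[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]
    return coef

def stability_selection(Theta, y, n_resamples=50, frac=0.5, rng=None):
    """Selection frequency of each candidate term over random subsamples."""
    rng = np.random.default_rng(rng)
    n = Theta.shape[0]
    counts = np.zeros(Theta.shape[1])
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += (hard_threshold_regression(Theta[idx], y[idx]) != 0.0)
    return counts / n_resamples

# toy library of candidate terms; only columns 1 and 3 are truly active
rng = np.random.default_rng(0)
Theta = rng.normal(size=(300, 8))
y = 2.0 * Theta[:, 1] - 1.5 * Theta[:, 3] + 0.05 * rng.normal(size=300)
freq = stability_selection(Theta, y, rng=1)
print("selection frequencies:", np.round(freq, 2))
print("stable support (freq >= 0.8):", np.where(freq >= 0.8)[0])
```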
35. Developmental and evolutionary constraints on olfactory circuit selection.
- Author
-
Hiratani, Naoki and Latham, Peter E.
- Subjects
- *
NEURAL circuitry , *NUMBERS of species , *STATISTICAL learning - Abstract
Across species, neural circuits show remarkable regularity, suggesting that their structure has been driven by underlying optimality principles. Here we ask whether we can predict the neural circuitry of diverse species by optimizing the neural architecture to make learning as efficient as possible. We focus on the olfactory system, primarily because it has a relatively simple evolutionarily conserved structure and because its input- and intermediate-layer sizes exhibit a tight allometric scaling. In mammals, it has been shown that the number of neurons in layer 2 of piriform cortex scales as the number of glomeruli (the input units) to the 3/2 power; in invertebrates, we show that the number of mushroom body Kenyon cells scales as the number of glomeruli to the 7/2 power. To understand these scaling laws, we model the olfactory system as a three-layer nonlinear neural network and analytically optimize the intermediate-layer size for efficient learning from limited samples. We find, as observed, a power-law scaling, with the exponent depending strongly on the number of samples and thus on longevity. The 3/2 scaling seen in mammals is consistent with observed longevity, but the 7/2 scaling in invertebrates is not. However, when a fraction of the olfactory circuit is genetically specified, not learned, scaling becomes steeper for species with a small number of glomeruli and recovers consistency with the invertebrate scaling. This study provides analytic insight into the principles underlying both allometric scaling across species and optimal architectures in artificial networks. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. Simple Models in Complex Worlds: Occam's Razor and Statistical Learning Theory.
- Author
-
Bargagli Stoffi, Falco J., Cevolani, Gustavo, and Gnecco, Giorgio
- Subjects
- *
STATISTICAL learning , *SUPERVISED learning , *PHILOSOPHY of science , *ARTIFICIAL intelligence , *MACHINE learning - Abstract
The idea that "simplicity is a sign of truth", and the related "Occam's razor" principle, stating that, all other things being equal, simpler models should be preferred to more complex ones, have been long discussed in philosophy and science. We explore these ideas in the context of supervised machine learning, namely the branch of artificial intelligence that studies algorithms which balance simplicity and accuracy in order to effectively learn about the features of the underlying domain. Focusing on statistical learning theory, we show that situations exist for which a preference for simpler models (as modeled through the addition of a regularization term in the learning problem) provably slows down, instead of favoring, the supervised learning process. Our results shed new light on the relations between simplicity and truth approximation, which are briefly discussed in the context of both machine learning and the philosophy of science. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
37. Vector-Valued Least-Squares Regression under Output Regularity Assumptions.
- Author
-
Brogat-Motte, Luc, Rudi, Alessandro, Brouard, Céline, Rousu, Juho, and d'Alché-Buc, Florence
- Subjects
- *
STATISTICAL learning , *IMAGE reconstruction - Abstract
We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output. We derive learning bounds for our method, and study under which setting statistical performance is improved in comparison to full-rank method. Our analysis extends the interest of reduced-rank regression beyond the standard low-rank setting to more general output regularity assumptions. We illustrate our theoretical insights on synthetic least-squares problems. Then, we propose a surrogate structured prediction method derived from this reduced-rank method. We assess its benefits on three different problems: image reconstruction, multi-label classification, and metabolite identification. [ABSTRACT FROM AUTHOR]
- Published
- 2022
38. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression.
- Author
-
Mourtada, Jaouad and Gaïffas, Stéphane
- Subjects
- *
LOGISTIC regression analysis , *MAXIMUM likelihood statistics , *STATISTICAL learning , *APPROXIMATION error , *DENSITY - Abstract
We introduce a procedure for conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for statistical learning. On standard examples, this bound scales as $d/n$ with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches reducing to the sequential problem, our bounds remove suboptimal $\log n$ factors and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by leverage scores of covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d+B^2R^2)/n)$, where $R$ bounds the norm of features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n},\, e^{BR}/n)$ in general (Hazan et al., 2014). This provides a more practical alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly addressing a question raised by Foster et al. (2018). [ABSTRACT FROM AUTHOR]
- Published
- 2022
39. Empirical Risk Minimization under Random Censorship.
- Author
-
Ausset, Guillaume, Clémençon, Stephan, and Portier, François
- Subjects
- *
CREDIT analysis , *KAPLAN-Meier estimator , *STATISTICAL learning , *CREDIT risk , *ACCELERATED life testing , *CENSORSHIP , *SUPERVISED learning , *CENSORING (Statistics) - Abstract
We consider the classic supervised learning problem where a continuous non-negative random label $Y$ (e.g. a random duration) is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d \ge 1$ by means of a regression rule with minimum least square error. In various applications, ranging from industrial quality control to public health through credit risk analysis for instance, training observations can be right censored, meaning that, rather than on independent copies of $(X, Y)$, statistical learning relies on a collection of $n \ge 1$ independent realizations of the triplet $(X, \min\{Y, C\}, \delta)$, where $C$ is a nonnegative random variable with unknown distribution, modelling censoring, and $\delta = \mathbb{I}\{Y \le C\}$ indicates whether the duration is right censored or not. As ignoring censoring in the risk computation may clearly lead to a severe underestimation of the target duration and jeopardize prediction, we consider a plug-in estimate of the true risk based on a Kaplan-Meier estimator of the conditional survival function of the censoring $C$ given $X$, referred to as the Beran risk, in order to perform empirical risk minimization. It is established, under mild conditions, that the learning rate of minimizers of this biased/weighted empirical risk functional is of order $O_{\mathbb{P}}(\sqrt{\log(n)/n})$ when ignoring model bias issues inherent to plug-in estimation, as can be attained in absence of censoring. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed. [ABSTRACT FROM AUTHOR]
- Published
- 2022
40. Bohnenblust–Hille inequality for cyclic groups.
- Author
-
Slote, Joseph, Volberg, Alexander, and Zhang, Haonan
- Subjects
- *
CYCLIC groups , *STATISTICAL learning , *MATRICES (Mathematics) , *TENSOR products , *HARMONIC analysis (Mathematics) , *COMPLEX matrices - Abstract
For any $K > 2$ and the multiplicative cyclic group $\Omega_K$ of order $K$, consider any function $f : \Omega_K^n \to \mathbb{C}$ and its Fourier expansion $f(z) = \sum_{\alpha \in \{0,1,\ldots,K-1\}^n} a_\alpha z^\alpha$, with $d := \deg(f)$ denoting its degree as a multivariate polynomial. We prove a Bohnenblust–Hille (BH) inequality in this setting: the $\ell_{2d/(d+1)}$ norm of the Fourier coefficients of $f$ is bounded by $C(d, K)\,\|f\|_\infty$ with $C(d, K)$ independent of $n$. This is the interpolating case between the now well-understood BH inequalities for functions on the poly-torus ($K = \infty$) and the hypercube ($K = 2$), but those extreme cases of $K$ have special properties whose absence for intermediate $K$ prevents a proof by the standard BH framework. New techniques are developed exploiting the group structure of $\Omega_K^n$. By known reductions, the cyclic group BH inequality also entails a noncommutative BH inequality for tensor products of the $K \times K$ complex matrix algebra (or, in the language of quantum mechanics, systems of $K$-level qudits). These new BH inequalities generalize several applications in harmonic analysis and statistical learning theory to broader classes of functions and operators. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. Exponential inequalities for nonstationary Markov chains
- Author
-
Alquier Pierre, Doukhan Paul, and Fan Xiequan
- Subjects
nonstationary markov chains ,martingales ,exponential inequalities ,time series forecasting ,statistical learning theory ,oracle inequalities ,model selection ,60j05 ,60e15 ,62m20 ,62m05 ,62m10 ,68q32 ,Science (General) ,Q1-390 ,Mathematics ,QA1-939 - Abstract
Exponential inequalities are among the main tools in machine learning theory. Proving exponential inequalities for non-i.i.d. random variables makes it possible to extend many learning techniques to such variables. Indeed, much work has been done over the past 15 years on both inequalities and learning theory for time series. However, in the non-independent case, almost all results concern stationary time series. This excludes many important applications: for example, any series with periodic behaviour is nonstationary. In this paper, we extend the basic tools of [19] to nonstationary Markov chains. As an application, we provide a Bernstein-type inequality, and we deduce risk bounds for the prediction of periodic autoregressive processes with an unknown period.
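As a rough illustration of the application mentioned at the end of the abstract, the sketch below fits a per-phase autoregressive model for each candidate period and selects the period by a penalized empirical risk, in the spirit of model selection via oracle inequalities. The penalty constant and the restriction to AR(1)-per-phase models are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def fit_periodic_ar1(x, period):
    """Least-squares AR(1) coefficient per phase t mod period; returns coefficients and MSE."""
    coefs, resid = np.zeros(period), []
    t = np.arange(1, len(x))
    for phase in range(period):
        idx = t[t % period == phase]
        if len(idx) == 0:
            continue
        a = (x[idx - 1] @ x[idx]) / (x[idx - 1] @ x[idx - 1] + 1e-12)
        coefs[phase] = a
        resid.append(x[idx] - a * x[idx - 1])
    return coefs, np.mean(np.concatenate(resid) ** 2)

def select_period(x, candidate_periods, penalty=0.05):
    """Pick the period minimizing MSE + penalty * period / n (illustrative penalty)."""
    n = len(x)
    scores = {p: fit_periodic_ar1(x, p)[1] + penalty * p / n for p in candidate_periods}
    return min(scores, key=scores.get)

# toy usage: a period-7 autoregressive-like signal
rng = np.random.default_rng(2)
true_coefs = 0.9 * np.sin(np.linspace(0, np.pi, 7))
x = np.zeros(600)
for t in range(1, 600):
    x[t] = true_coefs[t % 7] * x[t - 1] + rng.normal(scale=0.5)
print(select_period(x, candidate_periods=range(2, 15)))
```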
- Published
- 2019
- Full Text
- View/download PDF
42. Deep neural networks for choice analysis: A statistical learning theory perspective.
- Author
-
Wang, Shenhao, Wang, Qingyi, Bailey, Nate, and Zhao, Jinhua
- Subjects
- *
STATISTICAL learning , *STATISTICS , *DISCRETE choice models , *LOGITS , *LOGISTIC regression analysis , *STATISTICAL models , *DEMAND forecasting , *MIXED dentition - Abstract
• Used statistical learning theory to evaluate DNNs in choice analysis. • Operationalized DNN interpretability by using the choice probability functions. • Provided a tight upper bound on the estimation error of DNNs. • Conducted experiments to identify when DNNs outperform classical models. • DNNs can be more predictive and interpretable than BNL and MNL models. Although researchers increasingly use deep neural networks (DNN) to analyze individual choices, overfitting and interpretability issues remain obstacles in theory and practice. This study presents a statistical learning theoretical framework to examine the tradeoff between estimation and approximation errors, and between the quality of prediction and of interpretation. It provides an upper bound on the estimation error of the prediction quality in DNN, measured by zero-one and log losses, shedding light on why DNN models do not overfit. It proposes a metric for interpretation quality by formulating a function approximation loss that measures the difference between true and estimated choice probability functions. It argues that the binary logit (BNL) and multinomial logit (MNL) models are specific cases of DNNs, since the latter always have smaller approximation errors. We explore the relative performance of DNN and classical choice models through three simulation scenarios comparing DNN, BNL, and binary mixed logit models (BXL), as well as one experiment comparing DNN to BNL, BXL, MNL, and mixed logit (MXL) in analyzing the choice of trip purposes based on the National Household Travel Survey 2017. The results indicate that DNN can be used for choice analysis beyond the current practice of demand forecasting because it has an inherent utility interpretation and the power of automatically learning the utility specification. Our results suggest DNN outperforms BNL, BXL, MNL, and MXL models in both prediction and interpretation when the sample size is large (≥ O(10^4)), the input dimension is high, or the true data generating process is complex, while performing worse when the opposite is true. DNN outperforms BNL and BXL in zero-one, log, and approximation losses for most of the experiments, and a larger sample size leads to greater incremental value of using DNN over classical discrete choice models. Overall, this study introduces statistical learning theory as a new foundation for high-dimensional data, complex statistical models, and non-asymptotic data regimes in choice analysis, and the experiments show the effective prediction and interpretation of DNN for its applications to policy and behavioral analysis. [ABSTRACT FROM AUTHOR]
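The comparison logic described above (zero-one and log losses for a DNN versus a binary logit baseline on simulated choices) can be mimicked with standard tools; a minimal sketch follows, with the data-generating process, network size, and hyperparameters chosen purely for illustration rather than taken from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# simulate a binary choice with a nonlinear true utility
rng = np.random.default_rng(3)
X = rng.normal(size=(20000, 10))
utility = X[:, 0] - 0.5 * X[:, 1] ** 2 + np.sin(X[:, 2]) + 0.3 * X[:, 3] * X[:, 4]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-utility)))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "BNL (logit)": LogisticRegression(max_iter=1000),
    "DNN (MLP)": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    zero_one = 1.0 - accuracy_score(y_te, model.predict(X_te))
    ll = log_loss(y_te, model.predict_proba(X_te))
    print(f"{name}: zero-one loss = {zero_one:.3f}, log loss = {ll:.3f}")
```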
- Published
- 2021
- Full Text
- View/download PDF
43. Estimating optimal treatment rules with an instrumental variable: A partial identification learning approach.
- Author
-
Pu, Hongming and Zhang, Bo
- Subjects
INSTRUMENTAL variables (Statistics) ,STATISTICAL learning ,INTENSIVE care units ,NEONATAL intensive care ,TREATMENT effectiveness ,FAILURE time data analysis - Abstract
Individualized treatment rules (ITRs) are considered a promising recipe to deliver better policy interventions. One key ingredient in optimal ITR estimation problems is to estimate the average treatment effect conditional on a subject's covariate information, which is often challenging in observational studies due to the universal concern of unmeasured confounding. Instrumental variables (IVs) are widely used tools to infer the treatment effect when there is unmeasured confounding between the treatment and outcome. In this work, we propose a general framework of approaching the optimal ITR estimation problem when a valid IV is allowed to only partially identify the treatment effect. We introduce a novel notion of optimality called 'IV‐optimality'. A treatment rule is said to be IV‐optimal if it minimizes the maximum risk with respect to the putative IV and the set of IV identification assumptions. We derive a bound on the risk of an IV‐optimal rule that illuminates when an IV‐optimal rule has favourable generalization performance. We propose a classification‐based statistical learning method that estimates such an IV‐optimal rule, design computationally efficient algorithms, and prove theoretical guarantees. We contrast our proposed method to the popular outcome weighted learning (OWL) approach via extensive simulations, and apply our method to study which mothers would benefit from travelling to deliver their premature babies at hospitals with high‐level neonatal intensive care units. R package ivitr implements the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
44. Theoretical learning guarantees applied to acoustic modeling
- Author
-
Christopher D. Shulby, Martha D. Ferreira, Rodrigo F. de Mello, and Sandra M. Aluisio
- Subjects
Acoustic modeling ,Convolutional neural networks ,Shallow learning ,Speech recognition ,Statistical learning theory ,Support vector machines ,Computer engineering. Computer hardware ,TK7885-7895 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Abstract In low-resource scenarios, for example with small datasets or limited available computational resources, state-of-the-art deep learning methods for speech recognition have been known to fail. It is possible to achieve more robust models if care is taken to ensure the learning guarantees provided by statistical learning theory. This work presents a shallow and hybrid approach using a convolutional neural network feature extractor fed into a hierarchical tree of support vector machines for classification. Here, we show that gross errors present even in state-of-the-art systems can be avoided and that an accurate acoustic model can be built in a hierarchical fashion. Furthermore, we present proof that our algorithm does adhere to the learning guarantees provided by statistical learning theory. The acoustic model produced in this work outperforms traditional hidden Markov models, and the hierarchical support vector machine tree outperforms a multi-class multilayer perceptron classifier using the same features. More importantly, we isolate the performance of the acoustic model and provide results at both the frame and phoneme level, reflecting the true robustness of the model. We show that even with a small amount of data, accurate and robust recognition rates can be obtained.
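A toy rendering of the hierarchical-SVM idea (a coarse classifier routing frames to per-group classifiers) is given below, using random feature vectors in place of the CNN activations; the two-level grouping and all names are illustrative assumptions rather than the authors' phoneme hierarchy.

```python
import numpy as np
from sklearn.svm import SVC

class TwoLevelSVMTree:
    """Coarse SVM routes each sample to a group; a per-group SVM assigns the fine label."""
    def __init__(self):
        self.coarse = SVC(kernel="rbf")
        self.fine = {}

    def fit(self, X, fine_labels, group_of):
        groups = np.array([group_of[l] for l in fine_labels])
        self.coarse.fit(X, groups)
        for g in np.unique(groups):
            self.fine[g] = SVC(kernel="rbf").fit(X[groups == g], fine_labels[groups == g])
        return self

    def predict(self, X):
        groups = self.coarse.predict(X)
        out = np.empty(len(X), dtype=object)
        for g in np.unique(groups):
            mask = groups == g
            out[mask] = self.fine[g].predict(X[mask])
        return out

# toy usage: stand-in "frame features" with a vowel/consonant grouping
rng = np.random.default_rng(4)
labels = np.array(["a", "e", "t", "k"] * 100)
shift = {"a": 0.0, "e": 2.0, "t": 4.0, "k": 6.0}
X = rng.normal(size=(400, 16)) + np.array([shift[c] for c in labels])[:, None]
group_of = {"a": "vowel", "e": "vowel", "t": "consonant", "k": "consonant"}
tree = TwoLevelSVMTree().fit(X, labels, group_of)
print((tree.predict(X) == labels).mean())
```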
- Published
- 2019
- Full Text
- View/download PDF
45. A cost-effective approach to portfolio construction with range-based risk measures.
- Author
-
Pun, Chi Seng and Wang, Lei
- Subjects
- *
PORTFOLIO management (Investments) , *STATISTICAL learning , *PORTFOLIO performance , *MACHINE learning - Abstract
In this paper, we introduce a new class of risk measures and the corresponding risk minimizing portfolio optimization problem. Instead of measuring the expected deviation of a daily return from a single target value, we propose to measure its deviation from a range of values centered on the single target value. By relaxing the definition of deviation, the proposed risk measure is robust to the variation of data input and thus the resulting risk-minimizing portfolio has a lower turnover rate and is resilient to outliers. To construct a practical portfolio, we propose to impose an ℓ2-norm constraint on the portfolio weights to stabilize the portfolio's out-of-sample performance. We show that for some cases of our proposed range-based risk measures, the corresponding portfolio optimization can be recast as a support vector regression problem. This allows us to tap into the machine learning literature on support vector regression and effectively solve the portfolio optimization problem even in high dimensions. Moreover, we present theoretical results on the robustness of our range-based risk minimizing portfolios. Simulation and empirical studies are conducted to examine the out-of-sample performance of the proposed portfolios. [ABSTRACT FROM AUTHOR]
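The "deviation from a range of values" idea is essentially an ε-insensitive loss, which is what makes the support vector regression reformulation mentioned above possible. A small convex-optimization sketch of one such range-based, ℓ2-regularized portfolio follows; the target, range width ε, and regularization weight are illustrative assumptions, not the paper's calibration.

```python
import numpy as np
import cvxpy as cp

def range_based_portfolio(returns, target, eps, lam):
    """Minimize the average epsilon-insensitive deviation of daily portfolio returns
    from `target`, plus an l2 penalty on the weights, under full investment."""
    T, n = returns.shape
    w = cp.Variable(n)
    deviation = cp.pos(cp.abs(returns @ w - target) - eps)   # zero inside the range
    objective = cp.Minimize(cp.sum(deviation) / T + lam * cp.sum_squares(w))
    problem = cp.Problem(objective, [cp.sum(w) == 1])
    problem.solve()
    return w.value

# toy usage with simulated daily returns for 5 assets
rng = np.random.default_rng(5)
R = 0.0005 + 0.01 * rng.standard_normal((250, 5))
print(range_based_portfolio(R, target=0.0005, eps=0.002, lam=1.0))
```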
- Published
- 2021
- Full Text
- View/download PDF
46. The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks.
- Author
-
Takuo Matsubara, Oates, Chris J., and Briol, François-Xavier
- Subjects
- *
BAYESIAN analysis , *GAUSSIAN processes , *STATISTICAL learning - Abstract
Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided. [ABSTRACT FROM AUTHOR]
- Published
- 2021
47. Failures of Model-dependent Generalization Bounds for Least-norm Interpolation.
- Author
-
Bartlett, Peter L. and Long, Philip M.
- Subjects
- *
STATISTICAL learning , *GENERALIZATION , *INTERPOLATION , *MACHINE learning - Abstract
We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose: it can be bounded below by a constant even when the true excess risk goes to zero. [ABSTRACT FROM AUTHOR]
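For concreteness, the least-norm interpolant discussed above is the minimum-ℓ2-norm solution of the underdetermined system Xθ = y; a short sketch of computing it and checking interpolation in the over-parameterized regime is given below (the data-generating choices are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 500                                   # over-parameterized: d >> n
theta_star = np.zeros(d); theta_star[:5] = 1.0
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.1 * rng.normal(size=n)

# minimum-norm interpolant: theta = X^T (X X^T)^{-1} y  (equivalently pinv(X) @ y)
theta_hat = X.T @ np.linalg.solve(X @ X.T, y)

print("training residual norm:", np.linalg.norm(X @ theta_hat - y))   # ~0: interpolates
X_test = rng.normal(size=(1000, d))
y_test = X_test @ theta_star + 0.1 * rng.normal(size=1000)
print("test MSE:", np.mean((X_test @ theta_hat - y_test) ** 2))
```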
- Published
- 2021
48. Finite Time LTI System Identification.
- Author
-
Sarkar, Tuhin, Rakhlin, Alexander, and Dahleh, Munther A.
- Subjects
- *
SYSTEM identification , *STATISTICAL learning , *LINEAR dynamical systems , *NONPARAMETRIC statistics , *SYSTEMS theory - Abstract
We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from a single time-series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy. [ABSTRACT FROM AUTHOR]
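The Hankel-based realization step referenced above follows the familiar Ho-Kalman pattern: stack Markov parameters into a Hankel matrix, truncate its SVD at the chosen order, and read off a state-space realization. A minimal single-input single-output sketch is shown below; it starts from given Markov parameters rather than from the paper's least-squares estimation step, and the names and toy system are illustrative assumptions.

```python
import numpy as np

def ho_kalman_siso(markov, order):
    """Recover (A, B, C) of the given order from scalar Markov parameters G_1, G_2, ..."""
    g = np.asarray(markov, dtype=float)
    m = (len(g) - 1) // 2
    H  = np.array([[g[i + j]     for j in range(m)] for i in range(m)])  # Hankel of G_{i+j+1}
    Hp = np.array([[g[i + j + 1] for j in range(m)] for i in range(m)])  # shifted Hankel
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Uk, sk, Vk = U[:, :order], s[:order], Vt[:order, :]
    O    = Uk * np.sqrt(sk)                 # observability factor  (m x order)
    Ctrl = np.sqrt(sk)[:, None] * Vk        # controllability factor (order x m)
    A = np.linalg.pinv(O) @ Hp @ np.linalg.pinv(Ctrl)
    B = Ctrl[:, :1]
    C = O[:1, :]
    return A, B, C

# toy usage: Markov parameters of a known 2nd-order system, then re-identify it
A0 = np.array([[0.8, 0.2], [0.0, 0.5]])
B0 = np.array([[1.0], [0.5]])
C0 = np.array([[1.0, 0.0]])
markov = [(C0 @ np.linalg.matrix_power(A0, k) @ B0).item() for k in range(21)]  # G_1, G_2, ...
A, B, C = ho_kalman_siso(markov, order=2)
print([round((C @ np.linalg.matrix_power(A, k) @ B).item(), 4) for k in range(5)])
```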
- Published
- 2021
49. Learning Whenever Learning is Possible: Universal Learning under General Stochastic Processes.
- Author
-
Hanneke, Steve
- Subjects
- *
STATISTICAL learning , *IMPLICIT learning , *PROBABILITY theory , *ONLINE education , *MACHINE learning , *DEMAND function - Abstract
This work initiates a general study of learning and generalization without the i.i.d. assumption, starting from first principles. While the traditional approach to statistical learning theory typically relies on standard assumptions from probability theory (e.g., i.i.d. or stationary ergodic), in this work we are interested in developing a theory of learning based only on the most fundamental and necessary assumptions implicit in the requirements of the learning problem itself. We specifically study universally consistent function learning, where the objective is to obtain low long-run average loss for any target function, when the data follow a given stochastic process. We are then interested in the question of whether there exist learning rules guaranteed to be universally consistent given only the assumption that universally consistent learning is possible for the given data process. The reasoning that motivates this criterion emanates from a kind of optimist's decision theory, and so we refer to such learning rules as being optimistically universal. We study this question in three natural learning settings: inductive, self-adaptive, and online. Remarkably, as our strongest positive result, we find that optimistically universal learning rules do indeed exist in the self-adaptive learning setting. Establishing this fact requires us to develop new approaches to the design of learning algorithms. Along the way, we also identify concise characterizations of the family of processes under which universally consistent learning is possible in the inductive and self-adaptive settings. We additionally pose a number of enticing open problems, particularly for the online learning setting. [ABSTRACT FROM AUTHOR]
- Published
- 2021
50. Learning from fuzzy labels: Theoretical issues and algorithmic solutions.
- Author
-
Campagner, Andrea
- Subjects
- *
STATISTICAL learning , *SUPERVISED learning , *MACHINE learning , *COMPUTATIONAL complexity - Abstract
In this article we study the problem of learning from fuzzy labels (LFL), a form of weakly supervised learning in which the supervision target is not precisely specified but is instead given in the form of possibility distributions that express the imprecise knowledge of the annotating agent. While several approaches for LFL have been proposed in the literature, including generalized risk minimization (GRM), instance-based methods, and pseudo label-based learning, both their theoretical properties and their empirical performance have scarcely been studied. We address this gap by: first, presenting a review of previous results on the sample complexity and generalization bounds for GRM and instance-based methods; second, studying their computational complexity, proving in particular the impossibility of efficiently solving LFL using GRM, along with further impossibility theorems. We then propose a novel pseudo label-based learning method, called Random Resampling-based Learning (RRL), which draws directly from ensemble learning and possibility theory, and study its learning- and complexity-theoretic properties, showing that it achieves guarantees similar to those for GRM while being computationally efficient. Finally, we study the empirical performance of several state-of-the-art LFL algorithms on a wide set of synthetic and real-world benchmark datasets, which confirms the effectiveness of the proposed RRL method. Additionally, we describe directions for future research, and highlight opportunities for further interaction between machine learning and uncertainty representation theories. [ABSTRACT FROM AUTHOR]
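One plausible reading of the resampling-plus-ensemble idea behind RRL is sketched below: for each training instance, hard pseudo-labels are repeatedly sampled in proportion to its possibility distribution, a base learner is fit on each resampled dataset, and predictions are aggregated by majority vote. The sampling rule, base learner, and names are illustrative assumptions, not the article's exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rrl_fit_predict(X, possibility, X_test, n_resamples=25, seed=0):
    """possibility[i, c] in [0, 1] is the possibility that instance i has class c."""
    rng = np.random.default_rng(seed)
    probs = possibility / possibility.sum(axis=1, keepdims=True)  # normalize for sampling
    votes = np.zeros((len(X_test), possibility.shape[1]))
    for _ in range(n_resamples):
        y_pseudo = np.array([rng.choice(possibility.shape[1], p=p) for p in probs])
        clf = DecisionTreeClassifier(max_depth=5).fit(X, y_pseudo)
        votes[np.arange(len(X_test)), clf.predict(X_test)] += 1
    return votes.argmax(axis=1)

# toy usage: two classes, fuzzy labels around a linear boundary
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
true = (X[:, 0] + X[:, 1] > 0).astype(int)
possibility = np.full((300, 2), 0.3)
possibility[np.arange(300), true] = 1.0          # the true class is fully possible
X_test = rng.normal(size=(100, 2))
pred = rrl_fit_predict(X, possibility, X_test)
print((pred == (X_test[:, 0] + X_test[:, 1] > 0).astype(int)).mean())
```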
- Published
- 2024
- Full Text
- View/download PDF