A Hybrid Stochastic-Deterministic Minibatch Proximal Gradient Method for Efficient Optimization and Generalization
- Author
Pan Zhou, Zhouchen Lin, Steven C. H. Hoi, and Xiao-Tong Yuan
- Subjects
Computational complexity theory, Generalization, Applied Mathematics, Combinatorics, Quadratic equation, Computational Theory and Mathematics, Artificial Intelligence, Softmax function, Convex optimization, Proximal Gradient Methods, Computer Vision and Pattern Recognition, Convex function, Condition number, Software, Mathematics
- Abstract
Despite the success of stochastic variance-reduced gradient (SVRG) algorithms in solving large-scale problems, their stochastic gradient complexity often scales linearly with the data size and becomes expensive for huge datasets. Accordingly, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly convex problems with a linear prediction structure, e.g., least squares and logistic/softmax regression. HSDMPG enjoys an improved computational complexity that is data-size-independent for large-scale problems. It iteratively samples an evolving minibatch of individual losses to estimate the original problem, and efficiently minimizes the sampled smaller-sized subproblems. For a strongly convex loss of $n$ components, HSDMPG attains an $\epsilon$-optimization error within $\mathcal{O}\left(\kappa\log^{\zeta+1}\left(\frac{1}{\epsilon}\right)\frac{1}{\epsilon}\wedge n\log^{\zeta}\left(\frac{1}{\epsilon}\right)\right)$ stochastic gradient evaluations, where $\kappa$ is the condition number, $\zeta=1$ for quadratic loss, and $\zeta=2$ for generic loss. For large-scale problems, our complexity outperforms those of SVRG-type algorithms with and without dependence on the data size. In particular, when $\epsilon=\mathcal{O}(1/\sqrt{n})$, which matches the intrinsic excess error of a learning model and is sufficient for generalization, our complexity for quadratic and generic losses is $\mathcal{O}(n^{0.5}\log^{2}(n))$ and $\mathcal{O}(n^{0.5}\log^{3}(n))$, respectively, which for the first time achieves optimal generalization in less than a single pass over the data. Moreover, we extend HSDMPG to online strongly convex problems and prove its higher efficiency over prior algorithms. Numerical results demonstrate the computational advantages of HSDMPG.
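To make the evolving-minibatch idea in the abstract concrete, below is a minimal NumPy sketch, not the authors' HSDMPG procedure itself: it tackles a ridge-regularized least-squares problem (a strongly convex loss with linear prediction structure) by sampling a geometrically growing minibatch at each round and approximately minimizing the sampled subproblem with a few proximal gradient steps. The function names, step size, growth factor, and inner-loop budget are illustrative assumptions rather than the paper's exact schedule.

```python
# Illustrative sketch of an evolving-minibatch proximal gradient loop
# (hypothetical parameters; not the authors' exact HSDMPG algorithm).
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1 (stand-in for a generic prox)."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def evolving_minibatch_prox_grad(A, b, lam=1e-2, l1=0.0, rounds=10,
                                 inner_steps=20, batch0=32, growth=2.0,
                                 seed=0):
    """Sketch: geometrically growing minibatches + inner proximal gradient."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    batch = batch0
    for _ in range(rounds):
        m = min(n, int(batch))
        idx = rng.choice(n, size=m, replace=False)      # evolving minibatch
        A_t, b_t = A[idx], b[idx]
        # Lipschitz constant of the sampled smooth part (least squares + ridge)
        L = np.linalg.norm(A_t, 2) ** 2 / m + lam
        for _ in range(inner_steps):                     # approximate subproblem solve
            grad = A_t.T @ (A_t @ w - b_t) / m + lam * w
            w = soft_threshold(w - grad / L, l1 / L)
        batch *= growth                                  # minibatch grows toward full data
    return w
```

As the sampled minibatch grows toward the full dataset, the inner subproblems interpolate between cheap stochastic estimates and the deterministic full-batch problem, which is the hybrid stochastic-deterministic behavior the abstract describes.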
- Published
2022