Deep Reactive Policies for Planning in Stochastic Nonlinear Domains
- Author
Leliane Nunes de Barros, Scott Sanner, Thiago Pereira Bueno, and Denis Deratani Mauá
- Subjects
Mathematical optimization, Speedup, Artificial neural network, Computer science, Deep learning, Supervised learning, Probabilistic logic, Inference, Estimator, Nonlinear system, Differentiable function, Artificial intelligence, Gradient descent, Benchmark (computing), Parametric statistics
- Abstract
Recent advances in applying deep learning to planning have shown that Deep Reactive Policies (DRPs) can be powerful for fast decision-making in complex environments. However, current DRP-based approaches are limited either by the need for optimal planners to provide ground truth in a supervised learning setting or by the sample complexity of high-variance policy gradient estimators, which is particularly troublesome in continuous state-action domains. To overcome these limitations, we introduce a framework for training DRPs in continuous stochastic spaces via gradient-based policy search. The general approach is to encode a parametric policy explicitly as a deep neural network and to formulate the probabilistic planning problem as an optimization task over a stochastic computation graph by exploiting the re-parameterization of the transition probability densities; the optimization is then solved with gradient descent algorithms that can handle non-convex objective functions. We benchmark our approach on stochastic planning domains exhibiting arbitrary differentiable nonlinear transition and cost functions (e.g., Reservoir Control, HVAC, and Navigation). Results show that our approach can optimize DRPs with more than 125,000 continuous action parameters for problems with 30 state fluents and 30 action fluents on inexpensive hardware in under 6 minutes. Moreover, when the same level of solution quality is required, we observe a speedup of 5 orders of magnitude in the average inference time per decision step of DRPs compared to other state-of-the-art online gradient-based planners.
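As a rough illustration of the reparameterization-based training described in the abstract, the sketch below (not the authors' implementation) unrolls a stochastic transition model through a deep reactive policy network and minimizes the expected total cost end-to-end by gradient descent. The dynamics, cost function, dimensions, horizon, and hyperparameters are all placeholder assumptions; PyTorch is used only as one possible choice of autodiff framework.

```python
# Minimal sketch of reparameterized DRP training (placeholder dynamics and cost).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HORIZON, BATCH = 30, 30, 20, 128

# Deep Reactive Policy: a feedforward network mapping the current state to an action.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)

def transition(state, action, noise):
    # Reparameterized stochastic transition: the next state is a deterministic,
    # differentiable function of (state, action) plus scaled standard-normal noise,
    # so gradients flow through the sampling step (hypothetical dynamics).
    mean = state + 0.1 * torch.tanh(action) - 0.05 * state
    return mean + 0.01 * noise

def cost(state, action):
    # Hypothetical quadratic cost: deviation from a target level plus control effort.
    return (state - 1.0).pow(2).sum(-1) + 0.01 * action.pow(2).sum(-1)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(200):
    state = torch.zeros(BATCH, STATE_DIM)        # batch of initial states
    total_cost = torch.zeros(BATCH)
    for t in range(HORIZON):
        action = policy(state)
        noise = torch.randn_like(state)          # epsilon ~ N(0, I)
        total_cost = total_cost + cost(state, action)
        state = transition(state, action, noise)
    loss = total_cost.mean()                     # Monte Carlo estimate of expected total cost
    optimizer.zero_grad()
    loss.backward()                              # backprop through the unrolled trajectory
    optimizer.step()
```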
- Published
- 2019