Improving sample efficiency in deep reinforcement learning
- Author
Dai, Tianhong and Bharath, Anil
- Abstract
Deep reinforcement learning (DRL) has made great progress on complex control problems in a variety of test scenarios, such as playing video games, playing board games, and dexterous robotic manipulation, with the promise of critical real-world applications such as controlling plasmas for nuclear fusion. However, DRL requires a large number of interactions with an environment to find an optimal policy, which limits its application to real-world problems. In this thesis, we focus on two aspects of improving sample efficiency in DRL: 1) solving sparse-reward tasks and 2) improving general exploration strategies.

First, we analyse agents trained with and without domain randomisation (DR), a technique that can reduce the reality gap between a simulator and real-world scenarios. By evaluating their robustness to previously unseen environments and applying both qualitative and quantitative interpretability methods, we provide insight into the behaviour of trained agents. We also offer suggestions to researchers who intend to adopt interpretability methods to analyse DRL agents.

Second, we propose two methods to overcome exploration difficulties and improve learning efficiency in goal-oriented RL under the sparse-reward setting, where an agent rarely receives positive feedback. In the first method, to provide sufficient positive samples for training, hindsight goal relabelling replaces the goals in original samples with intermediate goals, and these augmented positive samples are leveraged to accelerate training via a self-imitation learning paradigm; an additional selection module removes undesirable modified samples and stabilises training. In the second method, to alleviate the inefficiency of hindsight experience replay (HER) caused by its uniform sampling strategy, a diversity-based sampling method selects valuable and diverse experiences for efficient training. Furthermore, diversity-augmented intrinsic motivation encourages the agent to explore novel states in environments with sparse or delayed rewards: during training, the diversity of adjacent state sequences is measured under the framework of determinantal point processes (DPPs), and this measurement is used as an auxiliary reward to facilitate exploration, thus improving final performance.
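The hindsight relabelling and selection step summarised above can be illustrated with a minimal sketch. The function names (`relabel_with_hindsight`, `sparse_reward`, `keep_sample`), the dictionary-based transition format, and the distance threshold are illustrative assumptions, not the thesis's actual implementation:

```python
import numpy as np

def sparse_reward(achieved, goal, threshold=0.05):
    # Assumed sparse-reward convention: positive reward only when the
    # achieved goal is close enough to the target goal.
    return 1.0 if np.linalg.norm(np.asarray(achieved) - np.asarray(goal)) < threshold else 0.0

def relabel_with_hindsight(trajectory, reward_fn=sparse_reward):
    """Relabel a failed goal-conditioned trajectory with an achieved goal.

    trajectory: list of dicts with keys 'state', 'achieved_goal', 'goal'.
    Returns a copy whose goal is the final achieved state, so the episode
    becomes a synthetic success usable by a self-imitation learner.
    """
    new_goal = trajectory[-1]['achieved_goal']
    relabelled = []
    for step in trajectory:
        new_step = dict(step)
        new_step['goal'] = new_goal
        new_step['reward'] = reward_fn(step['achieved_goal'], new_goal)
        relabelled.append(new_step)
    return relabelled

def keep_sample(relabelled, min_return=1.0):
    # Crude stand-in for the thesis's selection module: discard relabelled
    # trajectories whose total return is still too low to be useful.
    return sum(step['reward'] for step in relabelled) >= min_return
```

Relabelling turns a failed episode into a demonstration of reaching the goal it actually achieved, which is what lets self-imitation learning exploit it; the filter above only hints at the role of the selection module that removes undesirable modified samples.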
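Similarly, the DPP-based diversity bonus can be sketched as the log-determinant of a similarity kernel over a short window of consecutive state features: under an L-ensemble DPP, the unnormalised probability of a subset is det(L_S), which is large when items are mutually dissimilar. The function names, the cosine kernel, the window size, and the coefficient `beta` below are assumptions for illustration, not the thesis's exact formulation:

```python
import numpy as np

def dpp_diversity(states, eps=1e-6):
    """Diversity of a window of consecutive state features under a DPP.

    Row-normalises the features, builds a cosine-similarity Gram kernel L,
    and returns log det(L): near its maximum for mutually dissimilar states
    and strongly negative when the states are redundant.
    """
    phi = np.asarray(states, dtype=np.float64)
    phi = phi / (np.linalg.norm(phi, axis=1, keepdims=True) + eps)
    L = phi @ phi.T                              # cosine-similarity kernel
    # slogdet for numerical stability; eps*I keeps L positive definite.
    sign, logdet = np.linalg.slogdet(L + eps * np.eye(len(phi)))
    return logdet

def augmented_reward(extrinsic, window, beta=0.01):
    # Total reward = task reward + scaled diversity bonus over recent states,
    # encouraging the agent to visit novel, non-redundant states.
    return extrinsic + beta * dpp_diversity(window)
```

For example, `augmented_reward(r_t, recent_states[-5:])` adds a bonus that grows when the last five state embeddings are mutually dissimilar and shrinks sharply when they are redundant, giving the agent a learning signal even when the extrinsic reward is sparse or delayed.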
- Published
2022