421 results on '"policy gradient"'
Search Results
102. Conditional GANs for Image Captioning with Sentiments
- Author
-
Karayil, Tushar, Irfan, Asif, Raue, Federico, Hees, Jörn, Dengel, Andreas, Tetko, Igor V., editor, Kůrková, Věra, editor, Karpov, Pavel, editor, and Theis, Fabian, editor
- Published
- 2019
- Full Text
- View/download PDF
103. A Reinforcement Learning Approach for Sequential Spatial Transformer Networks
- Author
-
Azimi, Fatemeh, Raue, Federico, Hees, Jörn, Dengel, Andreas, Tetko, Igor V., editor, Kůrková, Věra, editor, Karpov, Pavel, editor, and Theis, Fabian, editor
- Published
- 2019
- Full Text
- View/download PDF
104. A Deep Reinforcement Learning Approach for Autonomous Car Racing
- Author
-
Guo, Fenggen, Wu, Zizhao, El Rhalibi, Abdennour, editor, Pan, Zhigeng, editor, Jin, Haiyan, editor, Ding, Dandan, editor, Navarro-Newball, Andres A., editor, and Wang, Yinghui, editor
- Published
- 2019
- Full Text
- View/download PDF
105. Learning Agents with Prioritization and Parameter Noise in Continuous State and Action Space
- Author
-
Mangannavar, Rajesh, Srinivasaraghavan, Gopalakrishnan, Lu, Huchuan, editor, Tang, Huajin, editor, and Wang, Zhanshan, editor
- Published
- 2019
- Full Text
- View/download PDF
106. Safe Policy Learning with Constrained Return Variance
- Author
-
Jain, Arushi, Meurs, Marie-Jean, editor, and Rudzicz, Frank, editor
- Published
- 2019
- Full Text
- View/download PDF
107. Trajectory Based Prioritized Double Experience Buffer for Sample-Efficient Policy Optimization
- Author
-
Shengxiang Li, Ou Li, Guangyi Liu, Siyuan Ding, and Yijie Bai
- Subjects
Reinforcement learning, policy gradient, replay buffer, distributed RL, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
Reinforcement learning has recently made great progress in various challenging domains such as the board game of Go and the MOBA game StarCraft II. Policy-gradient-based reinforcement learning methods have become mainstream due to their effectiveness and simplicity in both discrete and continuous scenarios. However, policy gradient methods commonly involve function approximation and work in an on-policy fashion, which leads to high variance and low sample efficiency. This paper introduces a novel policy gradient method that improves sample efficiency via a pair of trajectory-based prioritized replay buffers and reduces the variance in training with a target network whose weights are updated in a “soft” manner. We evaluate our method on the reinforcement learning suite of OpenAI Gym tasks, and the results show that the proposed method can learn more steadily and achieve higher performance than existing methods.
- Published
- 2021
- Full Text
- View/download PDF
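To make the “soft” target-network update mentioned in the abstract of entry 107 concrete, here is a minimal sketch of Polyak averaging as it is commonly used in actor-critic training; the function and parameter names (soft_update, tau) are illustrative and not taken from the paper.

    # Hedged sketch: "soft" (Polyak-averaged) target-network update, often used
    # to stabilize training and reduce variance in policy-gradient methods.
    def soft_update(target_params, online_params, tau=0.005):
        """Move each target parameter a small step toward its online counterpart."""
        return [tau * w + (1.0 - tau) * w_t
                for w_t, w in zip(target_params, online_params)]

    # Typical usage after every gradient step on the online network:
    # target_params = soft_update(target_params, online_params)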
108. An overview on algorithms and applications of deep reinforcement learning
- Author
-
Zhaoyang LIU, Chaoxu MU, and Changyin SUN
- Subjects
artificial intelligence, deep reinforcement learning, value function, policy gradient, navigation, cooperation, complex environment, generalization, robustness, Electronic computers. Computer science, QA75.5-76.95 - Abstract
Deep reinforcement learning (DRL) is mainly applied to solve the perception-decision problem, and has become an important research branch in the field of artificial intelligence. Two kinds of DRL algorithms, based on the value function and the policy gradient, are summarized, including deep Q networks, policy gradient methods, and related derived algorithms. In addition, the applications of DRL in video games, navigation, multi-agent cooperation, and recommendation are intensively reviewed. Finally, prospects for future DRL research are discussed and some research suggestions are given.
- Published
- 2020
- Full Text
- View/download PDF
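Since the overview in entry 108 centers on the value-function and policy-gradient families, a minimal REINFORCE-style update for a tabular softmax policy is sketched below for orientation; it is a generic textbook form, not code from the paper, and all names are illustrative.

    import numpy as np

    # Tabular softmax policy: theta has shape (n_states, n_actions).
    def softmax_policy(theta, s):
        z = theta[s] - theta[s].max()          # stabilized softmax
        p = np.exp(z)
        return p / p.sum()

    def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
        """One REINFORCE update from a single rollout.

        episode: list of (state, action, reward) tuples.
        """
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G                  # Monte Carlo return from step t
            p = softmax_policy(theta, s)
            grad_log_pi = -p                   # d log pi(a|s) / d theta[s, :]
            grad_log_pi[a] += 1.0
            theta[s] += alpha * G * grad_log_pi
        return theta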
109. Employing reinforcement learning to enhance particle swarm optimization methods.
- Author
-
Wu, Di and Wang, G. Gary
- Subjects
- *
PARTICLE swarm optimization, *REINFORCEMENT learning, *NUMERICAL functions, *PETRI nets, *GAUSSIAN distribution, *RANDOM numbers, *LEARNING strategies, *STOCHASTIC convergence - Abstract
Particle swarm optimization (PSO) is a well-known optimization algorithm that shows good performance in solving different optimization problems. However, PSO usually suffers from slow convergence. In this article, a reinforcement learning strategy is developed to enhance PSO in convergence by replacing the uniformly distributed random number in the updating function with a random number generated from a selected normal distribution. In the proposed method, the mean and standard deviation of the normal distribution are estimated from the current state of each individual through a policy net. The historic behaviour of the swarm group is used to update the policy net and guide the selection of parameters of the normal distribution. The proposed method is integrated into the original PSO and a state-of-the-art PSO, called the self-adaptive dynamic multi-swarm PSO (sDMS-PSO), and tested with numerical functions and engineering problems. The test results show that the convergence rate of PSO methods can be improved with the proposed reinforcement learning strategy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
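A hedged illustration of the mechanism described in entry 109: the uniform random factors in the PSO velocity update are replaced by samples from a normal distribution whose mean and standard deviation a policy network predicts from the particle's state. The policy_net interface and the coefficient values are assumptions made for this sketch, not the authors' code.

    import numpy as np

    def velocity_update(v, x, pbest, gbest, policy_net, w=0.7, c1=1.5, c2=1.5):
        """PSO velocity update with learned normal sampling instead of U(0, 1).

        policy_net(x) is assumed to return (mu, sigma) for the particle state x.
        """
        mu, sigma = policy_net(x)
        r1 = np.random.normal(mu, sigma, size=x.shape)
        r2 = np.random.normal(mu, sigma, size=x.shape)
        return w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)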
110. On Diversity in Image Captioning: Metrics and Methods.
- Author
-
Wang, Qingzhong, Wan, Jia, and Chan, Antoni B.
- Subjects
- *
LATENT semantic analysis, *REWARD (Psychology), *REINFORCEMENT learning, *POINT processes - Abstract
Diversity is one of the most important properties in image captioning, as it reflects various expressions of important concepts presented in an image. However, the most popular metrics cannot well evaluate the diversity of multiple captions. In this paper, we first propose a metric to measure the diversity of a set of captions, which is derived from latent semantic analysis (LSA), and then kernelize LSA using CIDEr (R. Vedantam et al., 2015) similarity. Compared with mBLEU (R. Shetty et al., 2017), our proposed diversity metrics show a relatively strong correlation to human evaluation. We conduct extensive experiments, finding there is a large gap between the performance of the current state-of-the-art models and human annotations considering both diversity and accuracy; the models that aim to generate captions with higher CIDEr scores normally obtain lower diversity scores, which generally learn to describe images using common words. To bridge this “diversity” gap, we consider several methods for training caption models to generate diverse captions. First, we show that balancing the cross-entropy loss and CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy of the generated captions. Second, we develop approaches that directly optimize our diversity metric and CIDEr score using reinforcement learning. These proposed approaches using reinforcement learning (RL) can be unified into a self-critical (S. J. Rennie et al., 2017) framework with new RL baselines. Third, we combine accuracy and diversity into a single measure using an ensemble matrix, and then maximize the determinant of the ensemble matrix via reinforcement learning to boost diversity and accuracy, which outperforms its counterparts on the oracle test. Finally, inspired by determinantal point processes (DPP), we develop a DPP selection algorithm to select a subset of captions from a large number of candidate captions. The experimental results show that maximizing the determinant of the ensemble matrix outperforms other methods considerably improving diversity and accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
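The self-critical reinforcement learning framework that entry 110 builds on can be sketched as follows; the reward function (e.g. CIDEr, possibly mixed with a diversity score) is assumed to exist as a helper, and the tensor shapes and names are illustrative rather than the authors' implementation.

    import torch

    def self_critical_loss(sum_log_probs, sampled_reward, greedy_reward):
        """Self-critical sequence training (SCST)-style policy-gradient loss.

        sum_log_probs:  (batch,) summed log-probabilities of each sampled caption
        sampled_reward: (batch,) metric score (e.g. CIDEr and/or diversity) of
                        the sampled captions -- scorer assumed to exist
        greedy_reward:  (batch,) same metric for greedy-decoded captions, used
                        as the variance-reducing baseline
        """
        advantage = (sampled_reward - greedy_reward).detach()
        return -(advantage * sum_log_probs).mean()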
111. Energy-Efficient and QoS Guaranteed BBU Aggregation in CRAN Based on Heuristic- Assisted Deep Reinforcement Learning.
- Author
-
Zhu, Min, Gu, Jiahua, Shen, Tianyu, Shi, Chenglin, and Ren, Xueqi
- Abstract
Surging mobile traffic poses serious challenges for mobile operators, one of which is the unsustainable growth caused by the high energy consumption of massively deployed base stations (BSs). The cloud radio access network (CRAN), as a new architecture, has been proposed to confront this challenge. By isolating the baseband unit (BBU) from its remote radio head (RRH) in the BS, the BBUs are consolidated into a common place (i.e., a BBU pool). Since an “any-to-any” connection between BBUs and RRHs is realized in CRAN, underutilized BBUs can be switched to sleep mode to save energy during traffic valleys, which can effectively reduce the energy consumption of CRAN. However, when a BBU enters sleep mode, RRHs connected to this BBU must be switched to another BBU, which would degrade the quality of service (QoS) for those RRHs. In this paper, to simultaneously guarantee low BBU energy consumption and low RRH traffic migration, which are interrelated and mutually constraining, we propose a deep reinforcement learning (DRL) based BBU aggregation scheme. Furthermore, to train the DRL agent quickly and well, we introduce several heuristic algorithms to assist the DRL training. Extensive numerical evaluations show that our proposed heuristic-assisted DRL (HA-DRL) can guarantee both low power consumption and less traffic migration. Compared with the baselines, HA-DRL can achieve up to 18.3% cost reduction and 32.8% migrated traffic reduction with at most 8.4% higher energy consumption, and attains the lowest cost for all cases considered. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
112. On the Convergence Rates of Policy Gradient Methods.
- Author
-
Lin Xiao
- Subjects
- *
MARKOV processes - Abstract
We consider infinite-horizon discounted Markov decision problems with finite state and action spaces and study the convergence rates of the projected policy gradient method and a general class of policy mirror descent methods, all with direct parametrization in the policy space. First, we develop a theory of weak gradient-mapping dominance and use it to prove sharp sublinear convergence rate of the projected policy gradient method. Then we show that with geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model. [ABSTRACT FROM AUTHOR]
- Published
- 2022
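For readers skimming entry 112, the projected policy gradient iteration analyzed in this line of work is usually written, with direct parametrization over the simplex, roughly as follows (a standard textbook form, not a quotation from the paper):

    \pi_{k+1} \;=\; \operatorname{Proj}_{\Delta(\mathcal{A})^{|\mathcal{S}|}}\!\left( \pi_k + \eta_k \, \nabla_{\pi} V^{\pi_k}(\rho) \right)

where \Delta(\mathcal{A}) is the probability simplex over actions, \eta_k the step size, and V^{\pi}(\rho) the discounted value under an initial state distribution \rho.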
113. Policy Gradient and Actor--Critic Learning in Continuous Time and Space: Theory and Algorithms.
- Author
-
Yanwei Jia and Xun Yu Zhou
- Subjects
- *
REINFORCEMENT learning, *ALGORITHMS, *STOCHASTIC approximation, *ONLINE education, *MARTINGALES (Mathematics), *CRITICS - Abstract
We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integration of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated using stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples. [ABSTRACT FROM AUTHOR]
- Published
- 2022
114. Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch.
- Author
-
Shangtong Zhang, des Combes, Remi Tachet, and Laroche, Romain
- Subjects
- *
MARKOV operators, *STOCHASTIC analysis, *MARKOV processes, *CRITICS, *APPROXIMATION algorithms, *MARKOV chain Monte Carlo, *STOCHASTIC approximation - Abstract
In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which aligns well with what practitioners do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because at each step the update is done for only the current state action pair. Moreover, we remove several restrictive assumptions from existing works in our analysis. Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains, based on its uniform contraction properties. [ABSTRACT FROM AUTHOR]
- Published
- 2022
115. Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences.
- Author
-
Chan, Alan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, Mahmood, A. Rupam, and White, Martha
- Subjects
- *
OPTIMIZATION algorithms, *BENCHMARK problems (Computer science), *REINFORCEMENT learning, *MATHEMATICAL regularization - Abstract
Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization; these are chosen because they underlie many existing policy optimization approaches, as we highlight in this work. We show that the reverse KL has stronger policy improvement guarantees, and that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. This work provides novel theoretical and empirical insights about the forward KL and reverse KL for greedification, and clear next steps for understanding and improving our policy optimization algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
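A hedged sketch of the two greedification objectives contrasted in entry 115: reducing the reverse or the forward KL divergence between a parameterized policy and the Boltzmann distribution over action values. The temperature, tensor shapes, and discrete-action setting are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def boltzmann(q_values, tau=1.0):
        """Boltzmann (softmax) distribution over action values."""
        return F.softmax(q_values / tau, dim=-1)

    def reverse_kl(policy_logits, q_values, tau=1.0):
        """KL(pi || Boltzmann(Q)): the mode-seeking direction."""
        log_pi = F.log_softmax(policy_logits, dim=-1)
        log_bq = torch.log(boltzmann(q_values, tau) + 1e-8)
        return (log_pi.exp() * (log_pi - log_bq)).sum(-1).mean()

    def forward_kl(policy_logits, q_values, tau=1.0):
        """KL(Boltzmann(Q) || pi): the mass-covering direction
        (equal to a cross-entropy up to a constant independent of pi)."""
        log_pi = F.log_softmax(policy_logits, dim=-1)
        return -(boltzmann(q_values, tau) * log_pi).sum(-1).mean()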
116. Segment boundary detection directed attention for online end-to-end speech recognition
- Author
-
Junfeng Hou, Wu Guo, Yan Song, and Li-Rong Dai
- Subjects
Encoder-decoder, Online recognition, Boundary detection, Attention mechanism, Reinforcement learning, Policy gradient, Acoustics. Sound, QC221-246, Electronic computers. Computer science, QA75.5-76.95 - Abstract
Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Different from conventional attention models, wherein the soft alignment is obtained by a pass over the entire input sequence, attention models for online recognition must learn online alignment to attend to part of the input sequence monotonically when generating output symbols. Based on the fact that every output symbol corresponds to a segment of the input sequence, we propose a new attention mechanism for learning online alignment by decomposing the conventional alignment into two parts: segmentation—segment boundary detection with hard decisions—and segment-directed attention—information aggregation within the segment with soft attention. The boundary detection is conducted along the time axis from left to right, and a decision is made for each input frame about whether it is a segment boundary or not. When a boundary is detected, the decoder generates an output symbol by attending to the inputs within the corresponding segment. With the proposed attention mechanism, online speech recognition can be realized. The experimental results on the TIMIT and WSJ datasets show that our proposed attention mechanism achieves online performance comparable with state-of-the-art models.
- Published
- 2020
- Full Text
- View/download PDF
117. Adaptive Laser Welding Control: A Reinforcement Learning Approach
- Author
-
Giulio Masinelli, Tri Le-Quang, Silvio Zanoli, Kilian Wasmer, and Sergey A. Shevchik
- Subjects
Laser welding, laser material processing, reinforcement learning, policy gradient, Q-learning, closed-loop control, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
Despite extensive research efforts in the field of laser welding, the imperfect repeatability of the weld quality still represents an open topic. Indeed, the inherent complexity of the underlying physical phenomena prevents the implementation of an effective controller using conventional regulators. To close this gap, we propose the application of Reinforcement Learning for closed-loop adaptive control of welding processes. The presented system is able to autonomously learn a control law that achieves a predefined weld quality independently of the starting conditions and without prior knowledge of the process dynamics. Specifically, our control unit influences the welding process by modulating the laser power and uses optical and acoustic emission signals as sensory input. The algorithm consists of three elements: a smart agent interacting with the process, a feedback network for quality monitoring, and an encoder that retains only the quality-critical events from the sensory input. Based on the data representation provided by the encoder, the smart agent decides the output laser power accordingly. The corresponding input signals are then analyzed by the feedback network to determine the resulting process quality. Depending on the distance to the targeted quality, a reward is given to the agent. The latter is designed to learn from its experience by taking the actions that maximize not just its immediate reward, but the sum of all the rewards that it will receive from that moment on. Two learning schemes were tested for the agent, namely Q-Learning and Policy Gradient. The required training time to reach the targeted quality was 20 min for the former technique and 33 min for the latter.
- Published
- 2020
- Full Text
- View/download PDF
118. Performance Improvement of Linux CPU Scheduler Using Policy Gradient Reinforcement Learning for Android Smartphones
- Author
-
Junyeong Han and Sungyoung Lee
- Subjects
ARM big.LITTLE processing architecture, energy aware scheduler, process scheduler, CPU frequency governor, reinforcement learning, policy gradient, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
The Energy Aware Scheduler (EAS) was developed and applied to the Linux kernel of recent Android smartphones in order to exploit the ARM big.LITTLE processing architecture efficiently. EAS organizes CPU hardware information into an Energy Model, which is used to improve CPU scheduling performance. In particular, it reduces power consumption and improves process scheduling performance. However, EAS has limitations in improving CPU scheduling performance, because the Energy Model configures the CPU hardware information to fixed values, which does not reflect the characteristics of running tasks, such as workload changes and the transition between the running state and the sleep state. To solve this problem, this paper introduces the Learning Energy Aware Scheduler (Learning EAS). Learning EAS adjusts the TARGET_LOAD used to set the CPU frequency and the sched_migration_cost used as the task migration criterion according to the characteristics of the running task through policy gradient reinforcement learning. On the LG G8 ThinQ, Learning EAS improved power consumption by 2.3% - 5.7%, hackbench results for process scheduling performance by 2.8% - 25.5%, application entry time by 4.4% - 6.1%, and application entry time under high CPU workload by 9.6% - 12.5%, respectively, compared with EAS. This paper also showed that Learning EAS is scalable by applying it to high-end and low-end chipset platforms from Qualcomm Inc. and MediaTek Inc., improving power consumption by 2.8% - 7.8% and application entry time by 2.2% - 7.2%, respectively, compared with EAS. Finally, this paper showed that the performance of CPU scheduling improves gradually with repeated reinforcement learning.
- Published
- 2020
- Full Text
- View/download PDF
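Not from the paper: a generic sketch of how two scheduler tunables named in the abstract of entry 118 could be adjusted with a fixed-variance Gaussian policy and a REINFORCE-style update. The initial values, reward definition, and learning rate are purely illustrative assumptions.

    import numpy as np

    # Gaussian policy over the two tunables mentioned in the abstract:
    # TARGET_LOAD and sched_migration_cost (illustrative values and scales).
    mu = np.array([80.0, 500000.0])
    sigma = np.array([5.0, 50000.0])       # kept fixed in this sketch

    def reinforce_step(mu, sampled_params, reward, baseline, lr=0.05):
        """grad log N(x; mu, sigma) with respect to mu is (x - mu) / sigma^2."""
        grad_log = (sampled_params - mu) / sigma ** 2
        return mu + lr * (reward - baseline) * grad_log

    # Loop sketch: sample parameters around mu, apply them, measure a reward
    # (e.g. combining energy use and scheduling latency), then update mu.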
119. A Knowledge Driven Dialogue Model With Reinforcement Learning
- Author
-
Yongnan Jia, Gaochen Min, Cong Xu, Xisheng Li, and Dezheng Zhang
- Subjects
Dialogue model, policy gradient, knowledge graph, transformer network, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
In recent decades, many researchers have paid considerable attention to generating informative responses in end-to-end neural dialogue systems. In order to output responses grounded in knowledge and facts, many works leverage external knowledge to guide the process of response generation. However, human dialogue is not a simple sequence-to-sequence task but a process that relies heavily on the participants' background knowledge about the topic. Thus, the key to generating informative responses is leveraging the appropriate knowledge associated with the current topic. This paper focuses on incorporating the appropriate knowledge into response generation. We adopt reinforcement learning to select the most appropriate knowledge as the input to the response generation part. We then design an end-to-end dialogue model consisting of a knowledge decision part and a response generation part. The proposed model is able to effectively complete the knowledge-driven dialogue task for a specific topic. Our experiments clearly demonstrate the superior performance of our model over other baselines.
- Published
- 2020
- Full Text
- View/download PDF
120. On the Design of Tailored Neural Networks for Energy Harvesting Broadcast Channels: A Reinforcement Learning Approach
- Author
-
Heasung Kim, Jungtai Kim, Wonjae Shin, Heecheol Yang, Nayoung Lee, Seong Jin Kim, and Jungwoo Lee
- Subjects
Energy harvesting communications, power allocation, broadcast channel, reinforcement learning, machine learning, policy gradient, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
In this paper, we consider a power allocation optimization technique for a time-varying fading broadcast channel in energy harvesting communication systems, in which a transmitter with a rechargeable battery transmits messages to receivers using the harvested energy. We first prove that the optimal online power allocation policy for the sum-rate maximization of the transmitter is a monotonically increasing function of harvested energy, remaining battery, and the channel gain of each user. We then construct a lightweight neural network architecture to take advantage of the monotonicity of the optimal policy. This two-step approach, which relies on effective function approximation to provide a mathematical guideline for neural network design, can prevent us from wasting the representational capacity of neural networks. The tailored neural network architectures enable stable learning and eliminate the heuristic network design. For performance assessment, the proposed approach is compared with the closed-form optimal policy for a partially observable Markov problem. Through additional experiments, it is observed that our online solution achieves a performance close to the theoretical upper bound of the performance in a time-varying fading broadcast channel.
- Published
- 2020
- Full Text
- View/download PDF
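Entry 120 exploits the fact that the optimal power-allocation policy is monotonically increasing in its inputs. One standard way to hard-wire such monotonicity into a small network is sketched below (non-negative weights plus monotone activations); this is an illustrative construction under that assumption, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class MonotoneMLP(nn.Module):
        """Output is non-decreasing in every input: absolute-valued weights
        combined with monotone activations (tanh) preserve monotonicity."""

        def __init__(self, in_dim, hidden=32):
            super().__init__()
            self.w1 = nn.Parameter(0.1 * torch.randn(in_dim, hidden))
            self.b1 = nn.Parameter(torch.zeros(hidden))
            self.w2 = nn.Parameter(0.1 * torch.randn(hidden, 1))
            self.b2 = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            h = torch.tanh(x @ self.w1.abs() + self.b1)
            return h @ self.w2.abs() + self.b2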
121. Reinforced knowledge distillation: Multi-class imbalanced classifier based on policy gradient reinforcement learning.
- Author
-
Fan, Saite, Zhang, Xinmin, and Song, Zhihuan
- Subjects
- *
MACHINE learning, *REWARD (Psychology), *MANUFACTURING processes, *REINFORCEMENT learning, *CLASSIFICATION algorithms, *SIGNAL sampling - Abstract
Real-world datasets often exhibit imbalanced class distributions, which is a common challenge for multi-class classification algorithms. To address the multi-class imbalanced classification problem in class imbalance learning, a novel reinforced knowledge distillation method is proposed in this paper. In reinforced knowledge distillation, an improved fine-grained classification architecture based on a knowledge distillation strategy and policy gradient reinforcement learning is proposed. In addition, reinforced knowledge distillation uses a newly designed reward signal and a novel sample-weight update strategy to train the policies to find the optimal student network, which makes reinforced knowledge distillation more powerful in handling the multi-class imbalanced classification problem. The effectiveness and practicability of the proposed reinforced knowledge distillation method are verified through its application to a simulated industrial process benchmark and extensive real-world datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
122. Parameter tuning of manipulator motion tracking controller based on Policy Gradient.
- Author
-
HAN Linxiao, HU Jianbo, SONG Shiyuan, WANG Yingyang, HE Zihou, and ZHANG Peng
- Subjects
MOTION control devices, SELF-tuning controllers, MANIPULATORS (Machinery), REINFORCEMENT learning, LYAPUNOV stability, DYNAMIC models - Abstract
To address the parameter self-tuning problem of the manipulator motion tracking controller, a parameter tuning device based on the reinforcement learning Policy Gradient method is designed. Firstly, a hybrid dynamic model of the manipulator is introduced. According to the system model, a proportional-derivative (PD) controller is designed, its Lyapunov stability is analyzed, and the range of the parameter matrix is given. Secondly, the Policy Gradient based parameter tuning device is designed, and an integrator is introduced to improve the gradient-based parameter tuning, so as to further improve the control effect of the PD controller. Finally, a second-order manipulator system is taken as an example to verify the effectiveness and feasibility of the parameter tuning device, which effectively improves the dynamic performance of the system. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
123. GLEU-Guided Multi-resolution Network for Short Text Conversation
- Author
-
Liu, Xuan, Yu, Kai, Tao, Jianhua, editor, Zheng, Thomas Fang, editor, Bao, Changchun, editor, Wang, Dong, editor, and Li, Ya, editor
- Published
- 2018
- Full Text
- View/download PDF
124. From Plots to Endings: A Reinforced Pointer Generator for Story Ending Generation
- Author
-
Zhao, Yan, Liu, Lu, Liu, Chunhua, Yang, Ruoyao, Yu, Dong, Zhang, Min, editor, Ng, Vincent, editor, Zhao, Dongyan, editor, Li, Sujian, editor, and Zan, Hongying, editor
- Published
- 2018
- Full Text
- View/download PDF
125. Learning Heuristics for the TSP by Policy Gradient
- Author
-
Deudon, Michel, Cournut, Pierre, Lacoste, Alexandre, Adulyasak, Yossiri, Rousseau, Louis-Martin, and van Hoeve, Willem-Jan, editor
- Published
- 2018
- Full Text
- View/download PDF
126. Interactive Area Topics Extraction with Policy Gradient
- Author
-
Han, Jingfei, Rong, Wenge, Zhang, Fang, Zhang, Yutao, Tang, Jie, Xiong, Zhang, Kůrková, Věra, editor, Manolopoulos, Yannis, editor, Hammer, Barbara, editor, Iliadis, Lazaros, editor, and Maglogiannis, Ilias, editor
- Published
- 2018
- Full Text
- View/download PDF
127. Relaxation-Free Deep Hashing via Policy Gradient
- Author
-
Yuan, Xin, Ren, Liangliang, Lu, Jiwen, Zhou, Jie, Ferrari, Vittorio, editor, Hebert, Martial, editor, Sminchisescu, Cristian, editor, and Weiss, Yair, editor
- Published
- 2018
- Full Text
- View/download PDF
128. Automatically Designing CNN Architectures for Medical Image Segmentation
- Author
-
Mortazi, Aliasghar, Bagci, Ulas, Shi, Yinghuan, editor, Suk, Heung-Il, editor, and Liu, Mingxia, editor
- Published
- 2018
- Full Text
- View/download PDF
129. Deep Reinforcement Learning
- Author
-
Aggarwal, Charu C.
- Published
- 2018
- Full Text
- View/download PDF
130. Joint Communication and Action Learning in Multi-Target Tracking of UAV Swarms with Deep Reinforcement Learning
- Author
-
Wenhong Zhou, Jie Li, and Qingjie Zhang
- Subjects
UAV swarms, reinforcement learning, cooperation, communication, policy gradient, Motor vehicles. Aeronautics. Astronautics, TL1-4050 - Abstract
Communication is the cornerstone of UAV swarms to transmit information and achieve cooperation. However, artificially designed communication protocols usually rely on prior expert knowledge and lack flexibility and adaptability, which may limit the communication ability between UAVs and is not conducive to swarm cooperation. This paper adopts a new data-driven approach to study how reinforcement learning can be utilized to jointly learn the cooperative communication and action policies for UAV swarms. Firstly, the communication policy of a UAV is defined, so that the UAV can autonomously decide the content of the message sent out according to its real-time status. Secondly, neural networks are designed to approximate the communication and action policies of the UAV, and their policy gradient optimization procedures are deduced, respectively. Then, a reinforcement learning algorithm is proposed to jointly learn the communication and action policies of UAV swarms. Numerical simulation results verify that the policies learned by the proposed algorithm are superior to the existing benchmark algorithms in terms of multi-target tracking performance, scalability in different scenarios, and robustness under communication failures.
- Published
- 2022
- Full Text
- View/download PDF
131. Vehicle Safety Planning Control Method Based on Variable Gauss Safety Field
- Author
-
Zixuan Zhu, Chenglong Teng, Yingfeng Cai, Long Chen, Yubo Lian, and Hai Wang
- Subjects
autonomous driving, planning algorithm, variable Gaussian safety field, reinforcement learning, policy gradient, Electrical engineering. Electronics. Nuclear engineering, TK1-9971, Transportation engineering, TA1001-1280 - Abstract
The existing intelligent vehicle trajectory-planning methods have limitations in terms of efficiency and safety. To overcome these limitations, this paper proposes an automatic driving trajectory-planning method based on a variable Gaussian safety field. Firstly, the time series bird’s-eye view is used as the input state quantity of the network, which improves the effectiveness of the trajectory planning policy network in extracting the features of the surrounding traffic environment. Then, the policy gradient algorithm is used to generate the planned trajectory of the autonomous vehicle, which improves the planning efficiency. The variable Gaussian safety field is used as the reward function of the trajectory planning part and the evaluation index of the control part, which improves the safety of the reinforcement learning vehicle tracking algorithm. The proposed algorithm is verified using the simulator. The obtained results show that the proposed algorithm has excellent trajectory planning ability in the highway scene and can achieve high safety and high precision tracking control.
- Published
- 2022
- Full Text
- View/download PDF
132. Dynamic Navigation and Area Assignment of Multiple USVs Based on Multi-Agent Deep Reinforcement Learning
- Author
-
Jiayi Wen, Shaoman Liu, and Yejin Lin
- Subjects
USV, trajectory design, policy gradient, multi-agent deep reinforcement learning, multi-object optimization, Chemical technology, TP1-1185 - Abstract
The unmanned surface vehicle (USV) has attracted more and more attention because of its basic ability to perform complex maritime tasks autonomously in constrained environments. However, the level of autonomy of one single USV is still limited, especially when deployed in a dynamic environment to perform multiple tasks simultaneously. Thus, a multi-USV cooperative approach can be adopted to obtain the desired success rate in the presence of multi-mission objectives. In this paper, we propose a cooperative navigating approach by enabling multiple USVs to automatically avoid dynamic obstacles and allocate target areas. To be specific, we propose a multi-agent deep reinforcement learning (MADRL) approach, i.e., a multi-agent deep deterministic policy gradient (MADDPG), to maximize the autonomy level by jointly optimizing the trajectory of USVs, as well as obstacle avoidance and coordination, which is a complex optimization problem usually solved separately. In contrast to other works, we combined dynamic navigation and area assignment to design a task management system based on the MADDPG learning framework. Finally, the experiments were carried out on the Gym platform to verify the effectiveness of the proposed method.
- Published
- 2022
- Full Text
- View/download PDF
133. Reinforcement Learning: Theory and Applications in HEMS
- Author
-
Omar Al-Ani and Sanjoy Das
- Subjects
home energy management systems (HEMS), reinforcement learning (RL), deep neural network (DNN), Q-value, policy gradient, natural gradient, Technology - Abstract
The steep rise of reinforcement learning (RL) in various energy applications, as well as the penetration of home automation in recent years, is the motivation for this article. It surveys the use of RL in various home energy management system (HEMS) applications, with a focus on deep neural network (DNN) models in RL. The article provides an overview of reinforcement learning, followed by discussions of state-of-the-art value, policy, and actor–critic methods in deep reinforcement learning (DRL). In order to make the published reinforcement learning literature more accessible to the HEMS community, verbal descriptions are accompanied by explanatory figures as well as mathematical expressions using standard machine learning terminology. Next, a detailed survey of how reinforcement learning is used in different HEMS domains is presented. The survey also considers what kind of reinforcement learning algorithms are used in each HEMS application, and it suggests that research in this direction is still in its infancy. Lastly, the article proposes four performance metrics to evaluate RL methods.
- Published
- 2022
- Full Text
- View/download PDF
134. AEVRNet: Adaptive exploration network with variance reduced optimization for visual tracking.
- Author
-
Yang, Yuxiang, Xing, Weiwei, Wang, Dongdong, Zhang, Shunli, Yu, Qi, and Wang, Liqiang
- Subjects
- *
REINFORCEMENT learning, *SPACE exploration, *ARTIFICIAL satellite tracking, *CONVOLUTIONAL neural networks - Abstract
For visual tracking methods based on reinforcement learning, the action space determines the ability to explore, which is crucial to model robustness. However, most trackers adopt simple action-space strategies, which suffer from the local optima problem. To address this issue, a novel reinforcement learning based tracker called AEVRNet is proposed with non-convex optimization and effective action space exploration. Firstly, inspired by the combinatorial upper confidence bound, we design an adaptive exploration strategy leveraging temporal and spatial knowledge to enhance effective action exploration and jump out of local optima. Secondly, we define tracking as a non-convex problem and incorporate non-convex stochastic variance reduced gradient optimization into the backward propagation of our model, which can converge faster with lower loss. Thirdly, different from existing reinforcement learning based trackers that use a classification method to train the model, we define a regression based action-reward loss function, which is more sensitive to aspects of the target state, e.g., the width and height of the target, to further improve robustness. Extensive experiments on six benchmark datasets demonstrate that our proposed AEVRNet achieves favorable performance against state-of-the-art reinforcement learning based methods. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
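A compact, hedged sketch of the stochastic variance-reduced gradient (SVRG) step that the AEVRNet abstract (entry 134) refers to: the mini-batch gradient at the current weights is corrected by the same mini-batch gradient at a periodic snapshot plus the full-batch gradient at that snapshot. Function names and the learning rate are illustrative, not the paper's code.

    def svrg_step(w, w_snapshot, full_grad_at_snapshot, grad_fn, batch, lr=0.01):
        """One SVRG update; grad_fn(params, batch) returns a mini-batch gradient.

        full_grad_at_snapshot is the full-dataset gradient at w_snapshot,
        recomputed once per epoch when the snapshot is refreshed.
        """
        g = grad_fn(w, batch) - grad_fn(w_snapshot, batch) + full_grad_at_snapshot
        return w - lr * g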
135. Reprint of: Automated stem cell production by bio-inspired control.
- Author
-
Monostori, László, Csáji, Balázs Cs., Egri, Péter, Kis, Krisztián B., Váncza, József, Ochs, Jelena, Jung, Sven, König, Niels, Pieske, Simon, Wein, Stephan, Schmitt, Robert, and Brecher, Christian
- Subjects
PRODUCTION control, STEM cells, REINFORCEMENT learning, STEM cell treatment, HUMAN error, INDUSTRIAL costs - Abstract
The potential of stem cell therapies for treating chronic and life-threatening diseases can be greatly exploited via the efficient automation of stem cell production. Working with living material, though, poses severe challenges to automation. Recently, production platforms have been developed and tested worldwide with the aim of increasing the reproducibility, quality and throughput of the process, minimizing human errors, and reducing production costs. A distinctive feature of this domain is the symbiotic co-existence and co-evolution of the technical, information and communication, as well as biological ingredients in production structures. A challenging way to overcome the issues of automated production is the use of biologically inspired control algorithms. In the paper an approach is described which combines digital, agent-based simulation and reinforcement learning for this purpose. The modelling of the cell growth behaviour, which is an important prerequisite of the simulation, is also introduced, together with an appropriate model fitting procedure. The applicability of the proposed approach is demonstrated by the results of a comprehensive investigation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
136. PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning.
- Author
-
Li, Shilei, Li, Meng, Su, Jiongming, Chen, Shaofei, Yuan, Zhimin, and Ye, Qing
- Subjects
- *
DEEP reinforcement learning, *REINFORCEMENT learning, *MACHINE learning, *SPACE exploration, *EVOLUTIONARY algorithms - Abstract
Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a more promising approach that combines exploration in the action space with exploration in the parameter space has been proposed to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with an actor-critic deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) can evolve in a guided manner by utilizing the gradient information provided by the DDPG, and the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve sample efficiency. In particular, we propose a criterion to determine the training steps required for the DDPG to ensure that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts can work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL actor and fine-tuning a new one generated by the EA according to different situations, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
137. Deep reinforcement learning algorithm based on multi-agent parallelism and its application in game environment.
- Author
-
Liu, Chao and Liu, Di
- Abstract
• Recommends utilizing deep reinforcement learning algorithms grounded in multi-agent parallelism. • Examines the potential for implementing multi-agent parallel-based deep reinforcement learning algorithms within the gaming industry. • Provides valuable insights into the utilization of deep reinforcement learning in game AI and multi-agent systems. Deep reinforcement learning has become a prominent area of research in artificial intelligence in recent years. Its application in solving complex tasks and game environments has garnered significant attention. This study aims to develop a deep reinforcement learning algorithm based on multi-agent parallelism to enhance intelligent decision-making in game environments. The algorithm combines a deep Q-network with a multi-agent cooperation strategy. Through parallel training of multiple agents, the learning process is accelerated, and decision accuracy is improved. The experimental results indicated that, when evaluated in terms of precision, recall, and average fitness, the multi-agent parallel Actor-Critic algorithm achieves a relatively high precision, which stabilizes above 0.95; the recall is also above 0.8, and the average fitness lies in a relatively high range. The research shows that the deep reinforcement learning algorithm based on multi-agent parallelism performs better and is more effective in game environments. It can learn the optimal strategy faster and obtain higher rewards. This not only provides a new technical means for game development but also offers a useful reference for the application of multi-agent systems in complex environments. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
138. ECG Generation With Sequence Generative Adversarial Nets Optimized by Policy Gradient
- Author
-
Fei Ye, Fei Zhu, Yuchen Fu, and Bairong Shen
- Subjects
Deep learning, generative adversarial networks, policy gradient, electrocardiogram, time series, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
Electrocardiogram (ECG) is a method used by physicians to detect cardiac disease. Requirements for batch processing and accurate recognition of clinical data have led to the applications of deep-learning methods for feature extraction, classification, and denoising of ECGs; however, deep learning requires large amounts of data and multi-feature integration of datasets, with most available methods used for ECGs incapable of extracting global features or resulting in unstable, low quality training. To address these deficiencies, we proposed a novel generative adversarial architecture called RPSeqGAN using a training process reliant upon a sequence generative adversarial network (SeqGAN) algorithm that adopts the policy gradient (PG) in reinforcement learning. Based on clinical records collected from the MIT-BIH arrhythmia database, we compared our proposed model with three deep generative models to evaluate its stability by observing the variance of their loss curves. Additionally, we generated ECGs with five periods and evaluated them according to six metrics suitable for time series. The results indicate that the proposed model showed the highest stability and data quality.
- Published
- 2019
- Full Text
- View/download PDF
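A minimal sketch (not RPSeqGAN itself) of the SeqGAN-style generator objective that entry 138 builds on: the discriminator's score on a generated sequence is used as the reward in a policy-gradient update of the generator. Per-token rollout rewards, as in full SeqGAN, are omitted here for brevity, and all names are illustrative.

    import torch

    def seqgan_generator_loss(token_log_probs, disc_scores):
        """Policy-gradient loss for a sequence generator.

        token_log_probs: (batch, seq_len) log-probs of the generated tokens
        disc_scores:     (batch,) discriminator probability that each generated
                         sequence is real, used as the (sequence-level) reward
        """
        reward = disc_scores.detach()
        return -(reward.unsqueeze(1) * token_log_probs).sum(dim=1).mean()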
139. ULMR: An Unsupervised Learning Framework for Mismatch Removal
- Author
-
Cailong Deng, Shiyu Chen, Yong Zhang, Qixin Zhang, and Feiyan Chen
- Subjects
unsupervised learning, mismatch removal, reinforcement learning, policy gradient, expected reward, Chemical technology, TP1-1185 - Abstract
Due to radiometric and geometric distortions between images, mismatches are inevitable. Thus, a mismatch removal process is required to improve matching accuracy. Although deep learning methods have been proven to outperform handcrafted methods in specific scenarios, including image identification and point cloud classification, most learning methods are supervised and are susceptible to incorrect labeling, and labeling data is a time-consuming task. This paper takes advantage of deep reinforcement learning (DRL) and proposes a framework named unsupervised learning for mismatch removal (ULMR). Resorting to DRL, ULMR first scores each state–action pair guided by the output of the classification network; then, it calculates the policy gradient of the expected reward; finally, by maximizing the expected reward of the state–action pairs, the optimal network can be obtained. Compared to supervised learning methods (e.g., NM-Net and LFGC), unsupervised learning methods (e.g., ULCM), and handcrafted methods (e.g., RANSAC, GMS), ULMR can obtain higher precision, more remaining correct matches, and fewer remaining false matches in testing experiments. Moreover, ULMR shows greater stability, better accuracy, and higher quality in application experiments, demonstrating reduced sampling times and higher compatibility with other classification networks in ablation experiments, indicating its great potential for further use.
- Published
- 2022
- Full Text
- View/download PDF
140. Learning-Based Online QoE Optimization in Multi-Agent Video Streaming
- Author
-
Yimeng Wang, Mridul Agarwal, Tian Lan, and Vaneet Aggarwal
- Subjects
video streaming, resource allocation, reinforcement learning, policy gradient, Industrial engineering. Management engineering, T55.4-60.8, Electronic computers. Computer science, QA75.5-76.95 - Abstract
Video streaming has become a major usage scenario for the Internet. The growing popularity of new applications, such as 4K and 360-degree videos, mandates that network resources must be carefully apportioned among different users in order to achieve the optimal Quality of Experience (QoE) and fairness objectives. This results in a challenging online optimization problem, as networks grow increasingly complex and the relevant QoE objectives are often nonlinear functions. Recently, data-driven approaches, deep Reinforcement Learning (RL) in particular, have been successfully applied to network optimization problems by modeling them as Markov decision processes. However, existing RL algorithms involving multiple agents fail to address nonlinear objective functions on different agents’ rewards. To this end, we leverage MAPG-finite, a policy gradient algorithm designed for multi-agent learning problems with nonlinear objectives. It allows us to optimize bandwidth distributions among multiple agents and to maximize QoE and fairness objectives on video streaming rewards. Implementing the proposed algorithm, we compare the MAPG-finite strategy with a number of baselines, including static, adaptive, and single-agent learning policies. The numerical results show that MAPG-finite significantly outperforms the baseline strategies with respect to different objective functions and in various settings, including both constant and adaptive bitrate videos. Specifically, our MAPG-finite algorithm maximizes QoE by 15.27% and maximizes fairness by 22.47% compared to the standard SARSA algorithm for a 2000 KB/s link.
- Published
- 2022
- Full Text
- View/download PDF
141. Policy Gradient Reinforcement Learning for I/O Reordering on Storage Servers
- Author
-
Dheenadayalan, Kumar, Srinivasaraghavan, Gopalakrishnan, Muralidhara, V. N., Liu, Derong, editor, Xie, Shengli, editor, Li, Yuanqing, editor, Zhao, Dongbin, editor, and El-Alfy, El-Sayed M., editor
- Published
- 2017
- Full Text
- View/download PDF
142. Automated stem cell production by bio-inspired control.
- Author
-
Monostori, László, Csáji, Balázs Cs., Egri, Péter, Kis, Krisztián B., Váncza, József, Ochs, Jelena, Jung, Sven, König, Niels, Pieske, Simon, Wein, Stephan, Schmitt, Robert, and Brecher, Christian
- Subjects
PRODUCTION control, STEM cells, REINFORCEMENT learning, STEM cell treatment, HUMAN error, INDUSTRIAL costs - Abstract
The potential of stem cell therapies for treating chronic and life-threatening diseases can be greatly exploited via the efficient automation of stem cell production. Working with living material, though, poses severe challenges to automation. Recently, production platforms have been developed and tested worldwide with the aim of increasing the reproducibility, quality and throughput of the process, minimizing human errors, and reducing production costs. A distinctive feature of this domain is the symbiotic co-existence and co-evolution of the technical, information and communication, as well as biological ingredients in production structures. A challenging way to overcome the issues of automated production is the use of biologically inspired control algorithms. In the paper an approach is described which combines digital, agent-based simulation and reinforcement learning for this purpose. The modelling of the cell growth behaviour, which is an important prerequisite of the simulation, is also introduced, together with an appropriate model fitting procedure. The applicability of the proposed approach is demonstrated by the results of a comprehensive investigation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
143. Positioning of the Robotic Arm Using Different Reinforcement Learning Algorithms.
- Author
-
Lindner, Tymoteusz, Milecki, Andrzej, and Wyrwał, Daniel
- Abstract
Robots are programmed using either the on-line mode, in which the robot programmer manually controls the movement of the robot by indicating individual trajectory points, or the off-line mode, in which the programmer enters the program code with predefined trajectory points. Neither method is easy to implement successfully in practice, which is why research on the development of self-learning methods can be useful. In this paper, four Reinforcement Learning (RL) algorithms in six combinations are investigated for the robot's positioning task. First, the basics of these algorithms are described. Then they are used for positioning control of a robot arm model, and the evaluation takes into account positioning accuracy, motion trajectory, and the number of steps required to achieve the goal. The simulation results are recorded. The same tests were repeated in laboratory conditions, in which a Mitsubishi robot was controlled, and the simulation results are compared with the results obtained in reality. The positive results obtained indicate that RL algorithms can be successfully applied for learning positioning control of a robot arm. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
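Record 143 scores the learned controllers by positioning accuracy and the number of steps needed to reach the goal. A minimal sketch of that kind of task as a distance-based reward loop, with a hypothetical planar two-joint kinematic model standing in for the real arm; the concrete environments and RL algorithms compared in the paper are not reproduced here.

```python
import numpy as np

class TwoJointArmEnv:
    """Toy planar 2-DOF arm: the agent nudges joint angles to move the
    end-effector toward a target point; reward is negative distance."""
    def __init__(self, link_lengths=(1.0, 1.0), step_size=0.05):
        self.l1, self.l2 = link_lengths
        self.step_size = step_size

    def reset(self, target=(1.2, 0.8)):
        self.q = np.zeros(2)                       # joint angles
        self.target = np.asarray(target, dtype=float)
        return self._obs()

    def _forward(self, q):
        x = self.l1 * np.cos(q[0]) + self.l2 * np.cos(q[0] + q[1])
        y = self.l1 * np.sin(q[0]) + self.l2 * np.sin(q[0] + q[1])
        return np.array([x, y])

    def _obs(self):
        return np.concatenate([self.q, self.target - self._forward(self.q)])

    def step(self, action):
        # action: increment for each joint angle, clipped to the step size
        self.q += np.clip(action, -self.step_size, self.step_size)
        dist = np.linalg.norm(self._forward(self.q) - self.target)
        done = dist < 0.05
        return self._obs(), -dist, done, {}

# Random-policy rollout, just to show the interaction loop an RL agent would use.
env = TwoJointArmEnv()
obs = env.reset()
for t in range(200):
    obs, reward, done, _ = env.step(np.random.uniform(-0.05, 0.05, size=2))
    if done:
        break
print("final distance:", -reward, "steps:", t + 1)
```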
144. Novel First Order Bayesian Optimization with an Application to Reinforcement Learning.
- Author
-
J., Prabuchandran K., Penubothula, Santosh, Kamanchi, Chandramouli, and Bhatnagar, Shalabh
- Subjects
REINFORCEMENT learning ,STATISTICAL models - Abstract
Zeroth Order Bayesian Optimization (ZOBO) methods optimize an unknown function based on its black-box evaluations at the query locations. Unlike most optimization procedures, ZOBO methods fail to utilize gradient information even when it is available. On the other hand, First Order Bayesian Optimization (FOBO) methods exploit the available gradient information to arrive at better solutions faster. However, existing FOBO methods do not exploit a crucial piece of information: the gradient is zero at the optima. Further, the inherently sequential nature of FOBO methods incurs a high computational cost, limiting their wide applicability. To alleviate these difficulties, we propose a relaxed statistical model that leverages the gradient information and directly searches for points where the gradient vanishes. To accomplish this, we develop novel acquisition algorithms that search for global optima effectively. Unlike existing FOBO methods, the proposed methods are parallelizable. Through extensive experimentation on standard test functions, we compare the performance of our methods against existing methods. Furthermore, we explore an application of the proposed FOBO methods in the context of policy gradient reinforcement learning. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
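Record 144 searches directly for points where the gradient vanishes. A minimal sketch of that idea under strong simplifying assumptions: fit an independent Gaussian-process surrogate to each observed partial derivative and query the candidate whose predicted gradient norm is smallest. This illustrates the "search where the gradient is zero" principle only; it is not the paper's statistical model or acquisition algorithms.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):            # toy objective (2-D quadratic bowl) with a known gradient
    return np.sum((x - 0.3) ** 2)

def grad_f(x):
    return 2.0 * (x - 0.3)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 2))          # initial query locations
G = np.array([grad_f(x) for x in X])          # gradient observations at those points

for it in range(15):
    # One GP surrogate per gradient component.
    gps = [GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X, G[:, d])
           for d in range(X.shape[1])]

    # Acquisition: among random candidates, take the one whose predicted
    # gradient norm is smallest (most likely to be a stationary point).
    cand = rng.uniform(-1, 1, size=(500, 2))
    pred_grad = np.stack([gp.predict(cand) for gp in gps], axis=1)
    x_next = cand[np.argmin(np.linalg.norm(pred_grad, axis=1))]

    X = np.vstack([X, x_next])
    G = np.vstack([G, grad_f(x_next)])

best = X[np.argmin([f(x) for x in X])]
print("best point found:", best, "f(best) =", f(best))
```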
145. AGAN: ATTRIBUTE GENERATIVE ADVERSARIAL NETWORK.
- Author
-
Zhang, M. and Xu, P.
- Subjects
GENERATIVE adversarial networks ,REPRESENTATIONS of graphs ,LONG short-term memory ,SHORT-term memory ,SOCIAL networks - Abstract
Graph generative adversarial networks have achieved remarkable effectiveness in recent years on tasks such as link prediction, node classification, user recommendation and node visualization. Most existing methods focus on how to represent the proximity between nodes according to the structure of the graph. However, graph nodes in social networks also carry rich attribute information; traditional methods treat node attributes as auxiliary information incorporated into the embedding representation of the graph to improve the accuracy of node classification and link prediction. In practice, these node attributes are often sparse, and due to privacy and other reasons the attributes of many nodes are difficult to obtain. Inspired by the application of generative adversarial networks in the image domain, we propose an innovative framework to discover latent node attributes. Through experiments, we demonstrate the effectiveness of the proposed methods. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
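Record 145 generates missing node attributes adversarially. A minimal sketch of a conditional generator/discriminator pair over attribute vectors, conditioned on node embeddings, in PyTorch; the dimensions, architecture and losses below are illustrative assumptions and do not reproduce the AGAN model.

```python
import torch
import torch.nn as nn

EMB_DIM, NOISE_DIM, ATTR_DIM = 32, 16, 10   # hypothetical sizes

class AttrGenerator(nn.Module):
    """Maps (node embedding, noise) to a synthetic attribute vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM + NOISE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ATTR_DIM), nn.Sigmoid())
    def forward(self, emb, z):
        return self.net(torch.cat([emb, z], dim=-1))

class AttrDiscriminator(nn.Module):
    """Scores whether an (embedding, attribute) pair looks real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM + ATTR_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, emb, attr):
        return self.net(torch.cat([emb, attr], dim=-1))

gen, disc = AttrGenerator(), AttrDiscriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

emb = torch.randn(64, EMB_DIM)          # node embeddings (stand-in data)
real_attr = torch.rand(64, ATTR_DIM)    # observed attributes (stand-in data)

for step in range(100):
    # Discriminator update: real pairs vs. generated pairs.
    z = torch.randn(64, NOISE_DIM)
    fake_attr = gen(emb, z).detach()
    d_loss = bce(disc(emb, real_attr), torch.ones(64, 1)) + \
             bce(disc(emb, fake_attr), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator.
    z = torch.randn(64, NOISE_DIM)
    g_loss = bce(disc(emb, gen(emb, z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```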
146. Multi-Agent Safe Policy Learning for Power Management of Networked Microgrids.
- Author
-
Zhang, Qianzhi, Dehghanpour, Kaveh, Wang, Zhaoyu, Qiu, Feng, and Zhao, Dongbo
- Abstract
This article presents a supervised multi-agent safe policy learning (SMAS-PL) method for optimal power management of networked microgrids (MGs) in distribution systems. While unconstrained reinforcement learning (RL) algorithms are black-box decision models that could fail to satisfy grid operational constraints, our proposed method considers AC power flow equations and other operational limits. Accordingly, the training process employs the gradient information of operational constraints to ensure that the optimal control policy functions generate safe and feasible decisions. Furthermore, we have developed a distributed consensus-based optimization approach to train the agents’ policy functions while maintaining MGs’ privacy and data ownership boundaries. After training, the learned optimal policy functions can be safely used by the MGs to dispatch their local resources, without the need to solve a complex optimization problem from scratch. Numerical experiments have been devised to verify the performance of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
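Record 146 uses the gradient information of operational constraints so that policy updates stay feasible. A minimal sketch of one generic way to use a constraint gradient in an update, projecting the objective gradient so the linearized constraint is not violated; this is a first-order safe-update illustration, not the SMAS-PL training procedure or its AC power flow constraints.

```python
import numpy as np

def projected_update(theta, grad_obj, grad_con, con_value, lr=0.1):
    """Gradient ascent step on the objective; if the linearized constraint
    c(theta) <= 0 would be violated, project the step back onto the
    constraint boundary (first-order feasibility correction)."""
    step = lr * grad_obj
    # Predicted constraint value after the step (first-order approximation).
    predicted = con_value + grad_con @ step
    if predicted > 0.0:
        # Remove the component of the step that increases the constraint.
        correction = (predicted / (grad_con @ grad_con)) * grad_con
        step = step - correction
    return theta + step

# Toy example: maximize -||theta||^2 subject to theta[0] >= 0.5,
# i.e. constraint c(theta) = 0.5 - theta[0] <= 0.
theta = np.array([1.0, 1.0])
for _ in range(50):
    grad_obj = -2.0 * theta               # gradient of the objective
    grad_con = np.array([-1.0, 0.0])      # gradient of c(theta)
    theta = projected_update(theta, grad_obj, grad_con, 0.5 - theta[0])
print(theta)   # converges near [0.5, 0.0]: best feasible point
```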
147. Relative Entropy of Correct Proximal Policy Optimization Algorithms with Modified Penalty Factor in Complex Environment
- Author
-
Weimin Chen, Kelvin Kian Loong Wong, Sifan Long, and Zhili Sun
- Subjects
correct proximal policy optimization ,approximation theory ,reinforcement learning ,optimization ,policy gradient ,entropy ,Science ,Astrophysics ,QB460-466 ,Physics ,QC1-999 - Abstract
In the field of reinforcement learning, we propose a Correct Proximal Policy Optimization (CPPO) algorithm based on a modified penalty factor β and relative entropy, in order to address the robustness and stationarity issues of traditional algorithms. First, in the reinforcement learning process, a policy evaluation mechanism is established through the policy distribution function. Second, the state space function is quantified by introducing entropy, whereby an approximation policy is used to approximate the real policy distribution, and kernel function estimation together with the calculation of relative entropy is used to fit the reward function for complex problems. Finally, through comparative analysis on classic test cases, we demonstrate that the proposed algorithm is effective, converges faster and performs better than the traditional PPO algorithm, and that the relative entropy measure can reveal the differences. In addition, it can use information from complex environments more efficiently to learn policies. At the same time, the paper not only explains the rationale of the policy distribution theory; the proposed framework also balances iteration steps, computational complexity and convergence speed, and we introduce an effective measure of performance based on the relative entropy concept.
- Published
- 2022
- Full Text
- View/download PDF
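Record 147 modifies the penalty factor β of the KL-penalized (relative-entropy-penalized) form of PPO. For context, a minimal sketch of the standard KL-penalty surrogate and the classic adaptive-β rule it builds on, written for a discrete-action policy in PyTorch; the specific β modification and relative-entropy machinery of CPPO are not reproduced here.

```python
import torch

def ppo_kl_penalty_loss(logp_new, logp_old, advantages, kl, beta):
    """KL-penalized PPO surrogate: maximize the ratio-weighted advantage
    minus beta times the KL divergence to the old policy."""
    ratio = torch.exp(logp_new - logp_old)
    return -(ratio * advantages).mean() + beta * kl

def update_beta(beta, kl, kl_target=0.01):
    """Classic adaptive rule: grow beta when the policy moved too far,
    shrink it when the policy barely moved."""
    if kl > 1.5 * kl_target:
        beta *= 2.0
    elif kl < kl_target / 1.5:
        beta /= 2.0
    return beta

# Toy usage with fake batch statistics, just to show the shape of the update loop.
beta = 1.0
logp_old = torch.log(torch.full((8,), 0.25))
logp_new = torch.log(torch.tensor([0.3, 0.2, 0.25, 0.35, 0.15, 0.3, 0.25, 0.2]))
advantages = torch.randn(8)
kl = (logp_old - logp_new).mean()   # empirical KL(old || new) estimate over sampled actions

loss = ppo_kl_penalty_loss(logp_new, logp_old, advantages, kl, beta)
beta = update_beta(beta, kl.item())
print(float(loss), beta)
```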
148. Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization.
- Author
-
Hou, Liwei, Wang, Hengsheng, Zou, Haoran, and Wang, Qun
- Subjects
SUPERVISED learning ,LEARNING ,MACHINE learning ,ABILITY ,ROBOTS ,WALKABILITY ,REINFORCEMENT learning ,HUMAN-robot interaction - Abstract
Autonomous learning of robotic skills seems more natural and more practical than hand-engineered skills, analogous to the learning process of human individuals. Policy gradient methods are a type of reinforcement learning technique with great potential for solving robot skill learning problems. However, policy gradient methods require many instances of online robot interaction with the environment to learn a good policy, which lowers the efficiency of the learning process and raises the likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase (imitation phase and practice phase) framework for efficient learning of robot walking skills, which attends to both the quality of skill learning and sample efficiency. Training starts with the first stage, the imitation phase, in which the parameters of the policy network are updated in a supervised learning manner. The training set used for policy network learning is composed of experienced trajectories output by an iterative linear Gaussian controller; this paper also refers to these trajectories as near-optimal experiences. In the second stage, the practice phase, the experiences for policy network learning are collected directly from online interactions, and the policy network parameters are updated with model-free reinforcement learning. The experiences from both stages are stored in a weighted replay buffer and are ordered according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than that of ordinary policy gradient algorithms. The algorithm proposed in this paper achieved the highest cumulative reward, and the robot learned better walking skills autonomously. In addition, the weighted replay buffer can serve as a general module for other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based and model-free reinforcement learning to efficiently update the policy network parameters in robot skill learning. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
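Record 148 stores experiences from both the imitation and practice phases in a weighted replay buffer ordered by an experience-scoring rule. A minimal sketch of a score-weighted buffer with proportional sampling; the scoring rule below (a single trajectory score supplied by the caller) is only a placeholder for the paper's experience scoring algorithm.

```python
import random
from collections import namedtuple

Experience = namedtuple("Experience", ["trajectory", "score"])

class WeightedReplayBuffer:
    """Keeps the highest-scoring trajectories and samples them with
    probability proportional to their score."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.buffer = []

    def add(self, trajectory, score):
        self.buffer.append(Experience(trajectory, score))
        # Keep the buffer ordered by score and drop the worst when full.
        self.buffer.sort(key=lambda e: e.score, reverse=True)
        self.buffer = self.buffer[: self.capacity]

    def sample(self, batch_size):
        weights = [max(e.score, 1e-6) for e in self.buffer]  # assumes non-negative scores
        return random.choices(self.buffer, weights=weights,
                              k=min(batch_size, len(self.buffer)))

# Usage: trajectories from the imitation phase (near-optimal experiences)
# and the practice phase (online rollouts) share the same buffer.
buf = WeightedReplayBuffer(capacity=50)
buf.add(trajectory=["imitation rollout"], score=12.4)   # placeholder data
buf.add(trajectory=["online rollout"], score=7.9)
batch = buf.sample(2)
print([e.score for e in batch])
```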
149. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift.
- Author
-
Agarwal, Alekh, Kakade, Sham M., Lee, Jason D., and Mahajan, Gaurav
- Subjects
SUPERVISED learning ,REINFORCEMENT learning ,APPROXIMATION error ,MARKOV processes ,LEARNING problems - Abstract
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including whether and how fast they converge to a globally optimal solution and how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and for which we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case, avoiding explicit worst-case dependencies on the size of the state space, by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number). [ABSTRACT FROM AUTHOR]
- Published
- 2021
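Record 149 analyzes, among other settings, the tabular softmax parameterization, where each state-action pair has its own parameter and the exact policy gradient takes the form dV(mu)/dtheta_{s,a} = (1/(1-gamma)) * d_mu(s) * pi(a|s) * A(s,a), with d_mu the discounted state visitation distribution. A minimal worked sketch of that identity on a hypothetical two-state MDP, using plain gradient ascent; the paper's convergence results themselves are not restated here.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: transitions P[s, a, s'] and rewards R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, mu = 0.9, np.array([0.5, 0.5])     # discount factor and start-state distribution
S, A = R.shape

theta = np.zeros((S, A))                   # tabular softmax parameters

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def exact_policy_gradient(theta):
    pi = policy(theta)
    P_pi = np.einsum("sa,sap->sp", pi, P)              # state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                               # Q[s, a]
    Adv = Q - V[:, None]
    # Discounted visitation: d = (1 - gamma) * (I - gamma * P_pi^T)^{-1} mu
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    # Policy gradient for the softmax tabular parameterization.
    return (1.0 / (1 - gamma)) * d[:, None] * pi * Adv, V

for _ in range(500):                        # plain gradient ascent on V(mu)
    grad, V = exact_policy_gradient(theta)
    theta += 1.0 * grad
print("pi =", np.round(policy(theta), 3), "V(mu) =", float(mu @ V))
```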
150. Modeling on virtual network embedding using reinforcement learning.
- Author
-
Wang, Cong, Zheng, Fanghui, Zheng, Guangcong, Peng, Sancheng, Tian, Zejie, Guo, Yujia, Li, Guorui, and Yuan, Ying
- Subjects
REINFORCEMENT learning ,REWARD (Psychology) ,VIRTUAL networks ,NP-hard problems ,INTEGER programming ,PHYSICAL mobility - Abstract
Summary: Virtual network (VN) embedding (VNE) aims to determine how to efficiently allocate physical resources to a VN; this problem has been proven to be NP-hard. Moreover, most existing approaches are based on heuristic algorithms, which easily fall into local optima. To address this challenge, we formalize the problem as a mixed integer programming problem and propose a novel VNE method based on reinforcement learning in this article. To solve the problem, we introduce a pointer network that generates virtual node mapping strategies through an attention mechanism, and design a reward function related to link resource consumption to connect the node mapping and link mapping stages of VNE. In addition, we present a policy gradient optimization mechanism to leverage the reward information obtained from the sampled solutions, and design an active-search-based process to automatically update the parameters of the neural network and obtain near-optimal embedding solutions. The experimental results show that the proposed method improves average physical node utilization and the long-term revenue-to-cost ratio compared with existing models. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
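Record 150 trains its pointer network with a policy gradient mechanism driven by rewards from sampled embedding solutions. A minimal sketch of the REINFORCE-style update with a moving-average baseline that such a setup typically relies on; the pointer network, the VNE reward and the active search procedure are stubbed out below as labeled assumptions, not the paper's components.

```python
import torch
import torch.nn as nn

class TinySequencePolicy(nn.Module):
    """Stand-in for a pointer network: emits a distribution over physical
    nodes for each virtual node, one decision at a time."""
    def __init__(self, n_physical=8, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_physical, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_physical))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def vne_reward(mapping):
    """Placeholder reward: in the paper this would reflect the link
    resource consumption of the sampled node mapping."""
    return -float(len(set(mapping)) < len(mapping))   # penalize reusing a physical node

policy = TinySequencePolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline, n_virtual = 0.0, 4

for episode in range(200):
    state = torch.zeros(8)
    log_probs, mapping = [], []
    for _ in range(n_virtual):                 # one mapping decision per virtual node
        dist = policy(state)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        mapping.append(int(a))
        state = state.clone()
        state[a] = 1.0                         # mark the chosen physical node
    R = vne_reward(mapping)
    baseline = 0.9 * baseline + 0.1 * R        # moving-average baseline
    loss = -(R - baseline) * torch.stack(log_probs).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```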