Conference Topic

Episode-Experience Replay Based Tree-Backup Method for Off-Policy Actor-Critic Algorithm

Off-policy algorithms play an important role in deep reinforcement learning. Since the off-policy policy gradient is a biased estimate, previous works employed importance sampling to obtain an unbiased estimate, assuming the behavior policy is known in advance. However, it is difficult to choose a reasonable behavior policy for complex agents. Moreover, importance sampling usually produces large variance. To address these problems, this paper presents a novel actor-critic policy gradient algorithm. Specifically, we employ the tree-backup method in the off-policy setting to achieve an unbiased estimate of the target policy gradient without using importance sampling. Meanwhile, we combine a naive episode-experience replay with experience replay to obtain trajectory samples and to reduce the strong correlations between these samples. The experimental results demonstrate the advantages of the proposed method over the competing methods.
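For background: the tree-backup update mentioned in the abstract sidesteps importance sampling by weighting backed-up values with the target policy's own action probabilities. A minimal sketch of the standard n-step tree-backup return (Precup, Sutton & Singh, 2000), which the paper employs in the actor-critic setting and which is not necessarily the paper's exact formulation, is

G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n},

where \pi is the target policy and Q the current action-value estimate. Because every term is weighted by \pi alone, no behavior-policy probabilities and hence no importance-sampling ratios appear, which is how such updates avoid the large variance noted above.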

Off-policy actor-critic policy gradient; Tree-backup algorithm; All-action method; Episode-experience replay

Haobo Jiang, Jianjun Qian, Jin Xie, Jian Yang

Key Laboratory of Intelligent Perception and Systems for High Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

International Conference

Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2018)

Guangzhou

English

562-573

2018-11-23 (date the record first went online on the Wanfang platform; does not represent the paper's publication date)