Conference Topic

Episode-Experience Replay Based Tree-Backup Method for Off-Policy Actor-Critic Algorithm

Off-policy algorithms play an important role in deep reinforcement learning. Since the off-policy policy gradient is a biased estimate, previous works employed importance sampling to obtain an unbiased estimate, assuming the behavior policy is known in advance. However, it is difficult to choose a reasonable behavior policy for complex agents. Moreover, importance sampling usually produces large variance. To address these problems, this paper presents a novel actor-critic policy gradient algorithm. Specifically, we employ the tree-backup method in the off-policy setting to achieve an unbiased estimate of the target policy gradient without using importance sampling. Meanwhile, we combine a naive episode-experience replay with experience replay to obtain trajectory samples and to reduce the strong correlations between these samples. The experimental results demonstrate the advantages of the proposed method over the competing methods.
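For background: the tree-backup update mentioned in the abstract sidesteps importance sampling by weighting backed-up values with the target policy's own action probabilities. A minimal sketch of the standard n-step tree-backup return (Precup, Sutton & Singh, 2000), which the paper employs in the actor-critic setting and which is not necessarily the paper's exact formulation, is

G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n},

where \pi is the target policy and Q the current action-value estimate. Because every term is weighted by \pi alone, no behavior-policy probabilities and hence no importance-sampling ratios appear, which is how such updates avoid the large variance noted above.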

Off-policy actor-critic policy gradient; Tree-backup algorithm; All-action method; Episode-experience replay

Haobo Jiang, Jianjun Qian, Jin Xie, Jian Yang

Key Laboratory of Intelligent Perception and Systems for High Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

International Conference

Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2018)

Guangzhou

English

562-573

2018-11-23 (date the record first went online on the Wanfang platform; does not represent the paper's publication date)