Conference Paper

Second-Order Multi-Armed Bandit Learning for Online Optimization in Communication and Networks

  Multi-armed bandit (MAB) based reinforcement learning, which can learn in dynamic and uncertain environments with analytic performance bounds, provides a robust optimization framework for resource optimization/scheduling problems in communication and networks. The goal of the MAB problem is to learn the best arms, i.e., the arms that provide the largest mean reward when played. In actual communication systems, not only the mean (i.e., the first-order statistic) but also the second-order dynamics of the reward are important, since a larger dynamic range may result in more frequent reconfiguration or adaptation of systems and degradation of user quality of experience (QoE). However, traditional MAB models do not consider the second-order dynamics of the reward and thus fail to provide a tailored characterization when applied in communications. Motivated by this issue, this paper first proposes a second-order MAB problem. Specifically, a new best-arm metric and an associated regret that explicitly take the second-order dynamics of the reward into account are defined. Then, a second-order learning algorithm is designed. We further prove that the proposed algorithm is order-optimal. Finally, simulation results are presented to validate the proposed algorithm. The second-order MAB model and algorithm enable a more fine-grained characterization of resource optimization/scheduling problems in communication and networks.
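To make the idea concrete, below is a minimal illustrative sketch of a UCB-style bandit whose arm index penalizes reward variance as a stand-in for "second-order dynamics." The penalized metric (mean minus rho times variance, plus a UCB exploration bonus), the weight `rho`, and the class name are all assumptions for illustration; they are not the metric, regret, or algorithm actually defined in the paper.

```python
import math
import random


class SecondOrderUCB:
    """Illustrative variance-penalized UCB bandit (NOT the paper's algorithm).

    Arm index: mean - rho * variance + sqrt(2 ln t / n), where the
    variance penalty stands in for the second-order dynamics of reward.
    """

    def __init__(self, n_arms, rho=1.0):
        self.n_arms = n_arms
        self.rho = rho                 # assumed weight on the variance term
        self.counts = [0] * n_arms     # pulls per arm
        self.means = [0.0] * n_arms    # running mean reward per arm
        self.m2 = [0.0] * n_arms       # running sum of squared deviations (Welford)
        self.t = 0                     # total pulls so far

    def select(self):
        # Play each arm once first, then maximize the penalized UCB index.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a

        def index(a):
            var = self.m2[a] / self.counts[a]
            bonus = math.sqrt(2.0 * math.log(self.t) / self.counts[a])
            return self.means[a] - self.rho * var + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        # Welford's online update keeps mean and variance in O(1) per step.
        self.t += 1
        self.counts[arm] += 1
        d = reward - self.means[arm]
        self.means[arm] += d / self.counts[arm]
        self.m2[arm] += d * (reward - self.means[arm])


# Two arms with the same mean but different variance: a mean-only
# bandit is indifferent, while the penalized index avoids the noisy arm.
rng = random.Random(0)
bandit = SecondOrderUCB(n_arms=2, rho=2.0)
for _ in range(5000):
    a = bandit.select()
    r = rng.gauss(0.5, 0.1) if a == 0 else rng.gauss(0.5, 1.0)
    bandit.update(a, r)
```

After the loop, `bandit.counts[0]` exceeds `bandit.counts[1]`: the low-variance arm is preferred even though both arms have identical mean reward, which is the kind of distinction a mean-only MAB cannot make.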

multi-armed bandit; reinforcement learning; QoE; second-order learning

Zhiyong Du, Bin Jiang, Kun Xu, Shengyun Wei, Shengqing Wang, Huatao Zhu

National University of Defense Technology,China

International Conference

ACM Turing Celebration Conference - China 2019

Chengdu

English

333-338

2019-05-17 (date first posted on the Wanfang platform; not necessarily the paper's publication date)