Conference Paper

Second-Order Multi-Armed Bandit Learning for Online Optimization in Communication and Networks

  Multi-armed bandit (MAB) based reinforcement learning, which can learn in dynamic and uncertain environments with analytic performance bounds, provides a robust optimization framework for resource optimization/scheduling problems in communication and networks. The goal of the MAB problem is to learn the best arms, i.e., the arms that provide the largest mean reward when played. In actual communication systems, not only the mean (i.e., the first-order statistic) but also the second-order dynamics of the reward are important, since a larger dynamic range may result in more frequent reconfiguration or adaptation of systems and degradation of user quality of experience (QoE). However, traditional MAB models do not consider the second-order dynamics of the reward and thus fail to provide a tailored characterization when applied in communications. Motivated by this issue, this paper first proposes a second-order MAB problem. Specifically, a new best-arm metric and an associated regret that explicitly take the second-order dynamics of the reward into account are defined. Then, a second-order learning algorithm is designed. We further prove that the proposed algorithm is order-optimal. Finally, simulation results are presented to validate the proposed algorithm. The second-order MAB model and algorithm enable a more fine-grained characterization of resource optimization/scheduling problems in communication and networks.
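To make the idea concrete, below is a minimal illustrative sketch of a UCB-style bandit whose arm index penalizes reward variance as a stand-in for "second-order dynamics." The penalized metric (mean minus rho times variance, plus a UCB exploration bonus), the weight `rho`, and the class name are all assumptions for illustration; they are not the metric, regret, or algorithm actually defined in the paper.

```python
import math
import random


class SecondOrderUCB:
    """Illustrative variance-penalized UCB bandit (NOT the paper's algorithm).

    Arm index: mean - rho * variance + sqrt(2 ln t / n), where the
    variance penalty stands in for the second-order dynamics of reward.
    """

    def __init__(self, n_arms, rho=1.0):
        self.n_arms = n_arms
        self.rho = rho                 # assumed weight on the variance term
        self.counts = [0] * n_arms     # pulls per arm
        self.means = [0.0] * n_arms    # running mean reward per arm
        self.m2 = [0.0] * n_arms       # running sum of squared deviations (Welford)
        self.t = 0                     # total pulls so far

    def select(self):
        # Play each arm once first, then maximize the penalized UCB index.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a

        def index(a):
            var = self.m2[a] / self.counts[a]
            bonus = math.sqrt(2.0 * math.log(self.t) / self.counts[a])
            return self.means[a] - self.rho * var + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        # Welford's online update keeps mean and variance in O(1) per step.
        self.t += 1
        self.counts[arm] += 1
        d = reward - self.means[arm]
        self.means[arm] += d / self.counts[arm]
        self.m2[arm] += d * (reward - self.means[arm])


# Two arms with the same mean but different variance: a mean-only
# bandit is indifferent, while the penalized index avoids the noisy arm.
rng = random.Random(0)
bandit = SecondOrderUCB(n_arms=2, rho=2.0)
for _ in range(5000):
    a = bandit.select()
    r = rng.gauss(0.5, 0.1) if a == 0 else rng.gauss(0.5, 1.0)
    bandit.update(a, r)
```

After the loop, `bandit.counts[0]` exceeds `bandit.counts[1]`: the low-variance arm is preferred even though both arms have identical mean reward, which is the kind of distinction a mean-only MAB cannot make.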

multi-armed bandit; reinforcement learning; QoE; second-order learning

Zhiyong Du, Bin Jiang, Kun Xu, Shengyun Wei, Shengqing Wang, Huatao Zhu

National University of Defense Technology,China

International Conference

ACM Turing Celebration Conference - China 2019

Chengdu

English

333-338

2019-05-17 (date first posted on the Wanfang platform; not necessarily the paper's publication date)