
Search-based Reinforcement Learning through Bandit Linear Optimization

Milan Peelman (UGent) , Antoon Bronselaer (UGent) and Guy De Tré (UGent)
Abstract
The development of AlphaZero was a breakthrough in search-based reinforcement learning, by employing a given world model in a Monte-Carlo tree search (MCTS) algorithm to incrementally learn both an action policy and a value estimation. When extending this paradigm to the setting of simultaneous move games we find that the selection strategy of AlphaZero has theoretical shortcomings, including that convergence to a Nash equilibrium is not guaranteed. By analyzing these shortcomings, we find that the selection strategy corresponds to an approximated version of bandit linear optimization using Tsallis entropy regularization with α parameter set to zero, which is equivalent to log-barrier regularization. This observation allows us to refine the search method used by AlphaZero to obtain an algorithm that has theoretically optimal regret as well as superior empirical performance on our evaluation benchmark.
Keywords
machine learning, reinforcement learning, algorithmic game theory
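The abstract's observation that Tsallis entropy regularization with α set to zero is equivalent to log-barrier regularization can be sketched with a short limit argument. The normalization below is one common convention for the Tsallis regularizer over a probability vector x; the paper itself may use a different but equivalent scaling.

```latex
\[
\psi_\alpha(x) \;=\; \sum_{i=1}^{K} \frac{1 - x_i^{\alpha}}{\alpha},
\qquad
\lim_{\alpha \to 0} \frac{1 - x_i^{\alpha}}{\alpha} \;=\; -\ln x_i,
\]
since \(x_i^{\alpha} = e^{\alpha \ln x_i}\) and, by L'H\^opital's rule,
\(\lim_{\alpha \to 0} \frac{1 - e^{\alpha \ln x_i}}{\alpha} = -\ln x_i\).
Hence \(\psi_0(x) = -\sum_{i} \ln x_i\), which is exactly the log-barrier
regularizer, consistent with the abstract's identification of the
\(\alpha = 0\) Tsallis case with log-barrier regularization.
```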


Citation


MLA
Peelman, Milan, et al. “Search-Based Reinforcement Learning through Bandit Linear Optimization.” Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, edited by Luc De Raedt, International Joint Conferences on Artificial Intelligence (IJCAI), 2022, pp. 3380–86, doi:10.24963/ijcai.2022/469.
APA
Peelman, M., Bronselaer, A., & De Tré, G. (2022). Search-based Reinforcement Learning through Bandit Linear Optimization. In L. De Raedt (Ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (pp. 3380–3386). https://doi.org/10.24963/ijcai.2022/469
Chicago author-date
Peelman, Milan, Antoon Bronselaer, and Guy De Tré. 2022. “Search-Based Reinforcement Learning through Bandit Linear Optimization.” In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, edited by Luc De Raedt, 3380–86. International Joint Conferences on Artificial Intelligence (IJCAI). https://doi.org/10.24963/ijcai.2022/469.
Chicago author-date (all authors)
Peelman, Milan, Antoon Bronselaer, and Guy De Tré. 2022. “Search-Based Reinforcement Learning through Bandit Linear Optimization.” In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, edited by Luc De Raedt, 3380–3386. International Joint Conferences on Artificial Intelligence (IJCAI). doi:10.24963/ijcai.2022/469.
Vancouver
1. Peelman M, Bronselaer A, De Tré G. Search-based Reinforcement Learning through Bandit Linear Optimization. In: De Raedt L, editor. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence (IJCAI); 2022. p. 3380–6.
IEEE
[1] M. Peelman, A. Bronselaer, and G. De Tré, “Search-based Reinforcement Learning through Bandit Linear Optimization,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 2022, pp. 3380–3386.
@inproceedings{8766249,
  abstract     = {{The development of AlphaZero was a breakthrough in search-based reinforcement learning, by employing a given world model in a Monte-Carlo tree search (MCTS) algorithm to incrementally learn both an action policy and a value estimation. When extending this paradigm to the setting of simultaneous move games we find that the selection strategy of AlphaZero has theoretical shortcomings, including that convergence to a Nash equilibrium is not guaranteed. By analyzing these shortcomings, we find that the selection strategy corresponds to an approximated version of bandit linear optimization using Tsallis entropy regularization with α parameter set to zero, which is equivalent to log-barrier regularization. This observation allows us to refine the search method used by AlphaZero to obtain an algorithm that has theoretically optimal regret as well as superior empirical performance on our evaluation benchmark.}},
  author       = {{Peelman, Milan and Bronselaer, Antoon and De Tré, Guy}},
  booktitle    = {{Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence}},
  editor       = {{De Raedt, Luc}},
  isbn         = {{9781956792003}},
  keywords     = {{machine learning,reinforcement learning,algorithmic game theory}},
  language     = {{eng}},
  location     = {{Vienna, Austria}},
  pages        = {{3380--3386}},
  publisher    = {{International Joint Conferences on Artificial Intelligence (IJCAI)}},
  title        = {{Search-based Reinforcement Learning through Bandit Linear Optimization}},
  url          = {{http://dx.doi.org/10.24963/ijcai.2022/469}},
  year         = {{2022}},
}
