
No more hand-tuning rewards : masked constrained policy optimization for safe reinforcement learning

Stef Van Havermaet (UGent), Yara Khaluf (UGent) and Pieter Simoens (UGent)
Abstract
In safe Reinforcement Learning (RL), the agent attempts to find policies that maximize the expectation of accumulated rewards while guaranteeing that its safety remains above a given threshold. Hence, it is straightforward to formalize safe RL problems with both a reward function and a safety constraint. We define safety as the probability of survival in environments where taking risky actions could lead to early termination of the task. Although the optimization problem is already constrained by a safety threshold, reward signals related to unsafe terminal states influence the original maximization objective of the task. Selecting the appropriate value of these signals is often a time-consuming and challenging reward engineering task that requires expert knowledge of the domain. This paper presents a safe RL algorithm, called Masked Constrained Policy Optimization (MCPO), in which the learning process is constrained by safety and excludes the risk reward signals. We develop MCPO as an extension of gradient-based policy search methods, in which the updates of the policy and the expected reward models are masked. Our method benefits from having a high probability of satisfying the given constraints for every policy in the learning process. We validate the proposed algorithm in two continuous tasks. Our findings show that the proposed algorithm is able to neglect risk reward signals, thereby resolving the desired safety-performance trade-off without the need for hand-tuning rewards.
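
The full text is available only through the UGent-only download below, so the code that follows is not the authors' implementation. It is a minimal, hypothetical Python sketch of the two ideas the abstract describes: masking the reward attached to an unsafe terminal state out of the return, and enforcing the survival-probability constraint, here approximated with a simple Lagrangian penalty rather than the paper's masked constrained update. All names (masked_returns, constrained_policy_update, safety_threshold) are assumptions of this sketch.

import numpy as np

def masked_returns(rewards, unsafe_terminal, gamma=0.99):
    # Discounted returns in which the reward emitted at an unsafe terminal state
    # is masked to zero, so that the hand-tuned "risk" reward no longer enters
    # the reward-maximization objective (the masking idea from the abstract).
    rewards = np.array(rewards, dtype=float)
    if unsafe_terminal:
        rewards[-1] = 0.0
    returns = np.zeros_like(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def constrained_policy_update(theta, grad_log_pi, returns, failure, lam,
                              safety_threshold=0.95, lr=1e-2, lr_lam=1e-2):
    # One REINFORCE-style update with a Lagrangian safety term. 'failure' is a
    # per-step indicator (1 at every step of a trajectory that ended in an
    # unsafe terminal state, 0 otherwise); lam is the dual variable that pushes
    # the survival probability above 'safety_threshold'.
    advantage = returns - lam * failure
    theta = theta + lr * np.mean(grad_log_pi * advantage[:, None], axis=0)
    lam = max(0.0, lam + lr_lam * (np.mean(failure) - (1.0 - safety_threshold)))
    return theta, lam

# Toy usage with random gradients, just to show the shapes involved: a 5-step
# trajectory that terminates unsafely with a hand-tuned -100 risk reward,
# which masked_returns removes from the objective.
T, d = 5, 3
theta, lam = np.zeros(d), 0.0
R = masked_returns([1.0, 1.0, 1.0, 1.0, -100.0], unsafe_terminal=True)
theta, lam = constrained_policy_update(theta, np.random.randn(T, d), R,
                                       failure=np.ones(T), lam=lam)

In this toy form, the -100 terminal reward, the kind of signal that would otherwise need hand-tuning, never reaches the objective; safety is handled entirely by the constraint term, which is the trade-off the abstract claims MCPO resolves.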

Downloads

  • (...).pdf | full text (Published version) | UGent only | PDF | 1.44 MB

Citation

Please use this URL to cite or link to this publication:

MLA
Van Havermaet, Stef, et al. “No More Hand-Tuning Rewards : Masked Constrained Policy Optimization for Safe Reinforcement Learning.” AAMAS ’21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, IFAAMAS, 2021, pp. 1344–52.
APA
Van Havermaet, S., Khaluf, Y., & Simoens, P. (2021). No more hand-tuning rewards : masked constrained policy optimization for safe reinforcement learning. AAMAS ’21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 1344–1352. IFAAMAS.
Chicago author-date
Van Havermaet, Stef, Yara Khaluf, and Pieter Simoens. 2021. “No More Hand-Tuning Rewards : Masked Constrained Policy Optimization for Safe Reinforcement Learning.” In AAMAS ’21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 1344–52. IFAAMAS.
Chicago author-date (all authors)
Van Havermaet, Stef, Yara Khaluf, and Pieter Simoens. 2021. “No More Hand-Tuning Rewards : Masked Constrained Policy Optimization for Safe Reinforcement Learning.” In AAMAS ’21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 1344–1352. IFAAMAS.
Vancouver
1. Van Havermaet S, Khaluf Y, Simoens P. No more hand-tuning rewards : masked constrained policy optimization for safe reinforcement learning. In: AAMAS ’21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems. IFAAMAS; 2021. p. 1344–52.
IEEE
[1] S. Van Havermaet, Y. Khaluf, and P. Simoens, “No more hand-tuning rewards : masked constrained policy optimization for safe reinforcement learning,” in AAMAS ’21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, Online, 2021, pp. 1344–1352.
@inproceedings{8705397,
  author       = {{Van Havermaet, Stef and Khaluf, Yara and Simoens, Pieter}},
  booktitle    = {{AAMAS '21, Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems}},
  isbn         = {{9781450383073}},
  issn         = {{2523-5699}},
  language     = {{eng}},
  location     = {{Online}},
  pages        = {{1344--1352}},
  publisher    = {{IFAAMAS}},
  title        = {{No more hand-tuning rewards : masked constrained policy optimization for safe reinforcement learning}},
  url          = {{https://dl.acm.org/doi/10.5555/3463952.3464107}},
  year         = {{2021}},
}