Author
Tom Lefebvre and Guillaume Crevecoeur
Abstract
Path Integral Policy Improvement with Covariance Matrix Adaptation (PI2-CMA) is a step-based, model-free reinforcement learning approach that combines statistical estimation techniques with fundamental results from Stochastic Optimal Control. In essence, a policy distribution is improved iteratively using reward-weighted averaging of the corresponding rollouts. It was assumed that PI2-CMA somehow exploited gradient information contained in the reward-weighted statistics. To our knowledge, we are the first to rigorously expose the principle behind this gradient extraction. Our findings reveal that PI2-CMA essentially obtains gradient information similar to the forward and backward passes of the Differential Dynamic Programming (DDP) method. It is then straightforward to extend the analogy with DDP by introducing a feedback term in the policy update. This suggests a novel algorithm, which we coin Path Integral Policy Improvement with Differential Dynamic Programming (PI2-DDP). The resulting algorithm is similar to the previously proposed Sampled Differential Dynamic Programming (SaDDP), but we derive the method independently as a generalization of the PI2-CMA framework. Our derivations suggest small variations to SaDDP that increase performance. We validated our claims on a robot trajectory learning task.
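
The core step described in the abstract (reward-weighted averaging of rollouts, with a CMA-style covariance update) can be sketched in a few lines of NumPy. This is only an illustrative sketch of a generic PI2-CMA-style update, not the algorithm derived in the paper; the softmax-style weighting, the temperature parameter, and all variable names are assumptions made for illustration.

import numpy as np

def reward_weighted_update(samples, costs, temperature=10.0):
    # samples: (K, d) sampled policy parameters (current mean plus exploration noise)
    # costs:   (K,)   accumulated cost of the rollout generated by each sample
    costs = np.asarray(costs, dtype=float)
    # Map costs to probability-like weights: cheaper rollouts get larger weights
    # (exponentiated, min-max normalised costs; temperature sets the sharpness).
    z = -temperature * (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)
    w = np.exp(z)
    w /= w.sum()
    # Reward-weighted average of the sampled parameters gives the new mean.
    mean_new = w @ samples
    # Reward-weighted spread around the new mean gives a CMA-style covariance.
    diff = samples - mean_new
    cov_new = (w[:, None] * diff).T @ diff
    return mean_new, cov_new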

Downloads

  • AIM2 final 1905.pdf: full text (Accepted manuscript) | open access | PDF | 1.51 MB
  • (...).pdf: full text (Published version) | UGent only | PDF | 1.21 MB

Citation


MLA
Lefebvre, Tom, and Guillaume Crevecoeur. “Path Integral Policy Improvement with Differential Dynamic Programming.” 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), IEEE, 2019, pp. 739–45, doi:10.1109/AIM.2019.8868359.
APA
Lefebvre, T., & Crevecoeur, G. (2019). Path integral policy improvement with differential dynamic programming. 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 739–745. https://doi.org/10.1109/AIM.2019.8868359
Chicago author-date
Lefebvre, Tom, and Guillaume Crevecoeur. 2019. “Path Integral Policy Improvement with Differential Dynamic Programming.” In 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 739–45. New York: IEEE. https://doi.org/10.1109/AIM.2019.8868359.
Chicago author-date (all authors)
Lefebvre, Tom, and Guillaume Crevecoeur. 2019. “Path Integral Policy Improvement with Differential Dynamic Programming.” In 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 739–745. New York: IEEE. doi:10.1109/AIM.2019.8868359.
Vancouver
1. Lefebvre T, Crevecoeur G. Path integral policy improvement with differential dynamic programming. In: 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM). New York: IEEE; 2019. p. 739–45.
IEEE
[1] T. Lefebvre and G. Crevecoeur, “Path integral policy improvement with differential dynamic programming,” in 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Hong Kong, China, 2019, pp. 739–745.
@inproceedings{8623968,
  author       = {{Lefebvre, Tom and Crevecoeur, Guillaume}},
  booktitle    = {{2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)}},
  isbn         = {{9781728124933}},
  issn         = {{2159-6255}},
  language     = {{eng}},
  location     = {{Hong Kong, China}},
  pages        = {{739--745}},
  publisher    = {{IEEE}},
  title        = {{Path integral policy improvement with differential dynamic programming}},
  url          = {{http://doi.org/10.1109/AIM.2019.8868359}},
  year         = {{2019}},
}
