ESAIM: COCV, Volume 29, 2023
Open Access
Article Number: 20
Number of page(s): 26
DOI: https://doi.org/10.1051/cocv/2023007
Published online: 07 March 2023
- A. Agarwal, S.M. Kakade, J.D. Lee and G. Mahajan, Optimality and approximation with policy gradient methods in Markov decision processes, in Conference on Learning Theory. PMLR (2020) 64–66.
- J. An, L. Ying and Y. Zhu, Why resampling outperforms reweighting for correcting sampling bias, in International Conference on Learning Representations (ICLR) (2021).
- J. Baxter and P.L. Bartlett, Infinite-horizon policy-gradient estimation. J. Artific. Intell. Res. 15 (2001) 319–350.
- S. Cen, C. Cheng, Y. Chen, Y. Wei and Y. Chi, Fast global convergence of natural policy gradient methods with entropy regularization. Preprint arXiv:2007.06558 (2020).
- T. Degris, P.M. Pilarski and R.S. Sutton, Model-free reinforcement learning with continuous action in practice, in 2012 American Control Conference (ACC). IEEE (2012) 2177–2182.
- S. Fujimoto, H. van Hoof and D. Meger, Addressing function approximation error in actor-critic methods, in International Conference on Machine Learning. PMLR (2018) 1587–1596.
- T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in International Conference on Machine Learning. PMLR (2018) 1861–1870.
- R. Islam, P. Henderson, M. Gomrokchi and D. Precup, Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. Preprint arXiv:1708.04133 (2017).
- S.M. Kakade, A natural policy gradient. Adv. Neural Inf. Process. Syst. 14 (2001).
- V.R. Konda and J.N. Tsitsiklis, Actor-critic algorithms, in Advances in Neural Information Processing Systems. Citeseer (2000) 1008–1014.
- T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, Continuous control with deep reinforcement learning. Preprint arXiv:1509.02971 (2015).
- B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan and M. Petrik, Finite-sample analysis of proximal gradient TD algorithms. Preprint arXiv:2006.14364 (2020).
- J. Mei, C. Xiao, C. Szepesvari and D. Schuurmans, On the global convergence rates of softmax policy gradient methods, in International Conference on Machine Learning. PMLR (2020) 6820–6829.
- V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in International Conference on Machine Learning. PMLR (2016) 1928–1937.
- R. Munos and C. Szepesvári, Finite-time bounds for fitted value iteration. J. Mach. Learn. Res. 9 (2008).
- O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li and D. Schuurmans, AlgaeDICE: Policy gradient from arbitrary experience. Preprint arXiv:1912.02074 (2019).
- J. Peters, K. Mulling and Y. Altun, Relative entropy policy search, in Twenty-Fourth AAAI Conference on Artificial Intelligence (2010).
- J. Peters and S. Schaal, Policy gradient methods for robotics, in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE (2006) 2219–2225.
- J. Peters and S. Schaal, Natural actor-critic. Neurocomputing 71 (2008) 1180–1190.
- M. Schlegel, W. Chung, D. Graves, J. Qian and M. White, Importance resampling for off-policy prediction. Preprint arXiv:1906.04328 (2019).
- J. Schulman, S. Levine, P. Abbeel, M. Jordan and P. Moritz, Trust region policy optimization, in International Conference on Machine Learning. PMLR (2015) 1889–1897.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, Proximal policy optimization algorithms. Preprint arXiv:1707.06347 (2017).
- D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, Deterministic policy gradient algorithms, in International Conference on Machine Learning. PMLR (2014) 387–395.
- R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. MIT Press (2018).
- R.S. Sutton, D.A. McAllester, S.P. Singh, Y. Mansour et al., Policy gradient methods for reinforcement learning with function approximation, in Advances in Neural Information Processing Systems. Citeseer (1999) 1057–1063.
- Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu and N. de Freitas, Sample efficient actor-critic with experience replay. Preprint arXiv:1611.01224 (2016).
- R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8 (1992) 229–256.
- R.J. Williams and J. Peng, Function optimization using connectionist reinforcement learning algorithms. Connect. Sci. 3 (1991) 241–268.
- Y. Wu, E. Mansimov, R.B. Grosse, S. Liao and J. Ba, Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Adv. Neural Inf. Process. Syst. 30 (2017) 5279–5288.
- Z. Yang, Y. Chen, M. Hong and Z. Wang, Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. Adv. Neural Inf. Process. Syst. (2019).
- Y. Zhu, Z. Izzo and L. Ying, Borrowing from the future: Addressing double sampling in model-free control. Math. Sci. Mach. Learn. (2022) 1099–1136.