On the theory of policy gradient methods: Optimality, approximation, and distribution shift

Agarwal, A; Kakade, SM; Lee, JD; Mahajan, G

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1sb3wz6z

Full metadata record

DC Field	Value	Language
dc.contributor.author	Agarwal, A	-
dc.contributor.author	Kakade, SM	-
dc.contributor.author	Lee, JD	-
dc.contributor.author	Mahajan, G	-
dc.date.accessioned	2024-01-21T19:38:05Z	-
dc.date.available	2024-01-21T19:38:05Z	-
dc.date.issued	2021-02	en_US
dc.identifier.citation	Agarwal, A, Kakade, SM, Lee, JD, Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22	en_US
dc.identifier.issn	1532-4435	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/pr1sb3wz6z	-
dc.description.abstract	Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution, or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case (which avoid explicit worst-case dependencies on the size of the state space) by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartof	Journal of Machine Learning Research	en_US
dc.rights	Final published version. This is an open access article.	en_US
dc.title	On the theory of policy gradient methods: Optimality, approximation, and distribution shift	en_US
dc.type	Journal Article	en_US
dc.identifier.eissn	1533-7928	-
pu.type.symplectic	http://www.symplectic.co.uk/publications/atom-terms/1.0/journal-article	en_US

Files in This Item:

File	Description	Size	Format
19-736.pdf		638.89 kB	Adobe PDF