Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Author(s): Li, Zhiyuan; Lyu, Kaifeng; Arora, Sanjeev

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr19p0d
DC Field | Value | Language
dc.contributor.author | Li, Zhiyuan | -
dc.contributor.author | Lyu, Kaifeng | -
dc.contributor.author | Arora, Sanjeev | -
dc.date.accessioned | 2021-10-08T19:50:45Z | -
dc.date.available | 2021-10-08T19:50:45Z | -
dc.date.issued | 2020 | en_US
dc.identifier.citation | Li, Zhiyuan, Kaifeng Lyu, and Sanjeev Arora. "Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate." Advances in Neural Information Processing Systems 33 (2020): pp. 14544-14555. | en_US
dc.identifier.issn | 1049-5258 | -
dc.identifier.uri | https://proceedings.neurips.cc/paper/2020/file/a7453a5f026fb6831d68bdc9cb0edcae-Paper.pdf | -
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/pr19p0d | -
dc.description.abstract | Recent works (e.g., Li & Arora, 2020) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., the use of exponentially increasing learning rates. The current paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and then initiates a formal framework for studying their mathematics via a suitable adaptation of the conventional framework, namely, modeling the SGD-induced training trajectory via a suitable stochastic differential equation (SDE) with a noise term that captures gradient noise. This yields: (a) a new "intrinsic learning rate" parameter that is the product of the normal learning rate η and the weight decay factor λ. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of the intrinsic LR. (b) A challenge, via theory and experiments, to the popular belief that good generalization requires large learning rates at the start of training. (c) New experiments, backed by mathematical intuition, suggesting that the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, as opposed to the exponential time convergence bound implied by SDE analysis. We name this the "Fast Equilibrium Conjecture" and suggest it holds the key to why Batch Normalization is effective. | en_US
dc.format.extent | 14544 - 14555 | en_US
dc.language.iso | en_US | en_US
dc.relation.ispartof | Advances in Neural Information Processing Systems | en_US
dc.rights | Final published version. Article is made available in OAR by the publisher's permission or policy. | en_US
dc.title | Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate | en_US
dc.type | Conference Article | en_US
pu.type.symplectic | http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding | en_US
