Three Factors Influencing Minima in SGD
Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
Published November 2017
Read on Aug 12, 2020
TL;DR
Batch size, learning rate, and gradient covariance influence which minima SGD finds. The LR/BS ratio is key to the width of the minima, which impacts generalization. SGD is treated as a discretization of an SDE. The theory is validated experimentally.
Review
Summary
- SGD performs similarly across different batch sizes as long as the LR/BS ratio is kept constant (see the toy simulation after this list).
- The authors note that SGD runs with the same LR/BS ratio are different discretizations of the same Stochastic Differential Equation (SDE).
- LR schedules and BS schedules are interchangeable; what matters, again, is how the LR/BS ratio evolves over training.
- Width of a minimum is defined in terms of the trace of the Hessian, $\mathrm{Tr}(H)$, at the minimum: lower trace = wider minimum (a sketch for estimating $\mathrm{Tr}(H)$ follows below).
- Assumption 1: at a local minimum, the loss surface is approximated by a quadratic bowl. This lets the training process be approximated by an Ornstein-Uhlenbeck process.
- Assumption 2: $H$ is approximated by the covariance matrix of the stochastic gradients ($H = C$; this relies on $C$ being anisotropic).
- A larger LR/BS ratio correlates with wider minima, giving better generalization.
- However, scaling LR and BS jointly by a large factor $\beta$, keeping $\frac{LR}{BS}=\frac{\beta \eta}{\beta S}$ constant, causes the approximation to the SDE to break down, leading to lower performance.
- Discretization errors become apparent at large learning rates.
- Central Limit Theorem assumptions break down for small datasets and large batch sizes.
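Putting Assumptions 1 and 2 together (my re-derivation in the notation above, with $\eta$ the learning rate, $S$ the batch size, and $\Sigma$ the stationary covariance of the iterates around the minimum): the Ornstein-Uhlenbeck process has a stationary covariance solving a Lyapunov equation, and $H = C$ collapses it to a multiple of the identity set by the LR/BS ratio:

$$
H\Sigma + \Sigma H = \frac{\eta}{S}\,C, \qquad H = C \;\Rightarrow\; \Sigma = \frac{\eta}{2S} I, \qquad \mathbb{E}[L] \approx \tfrac{1}{2}\,\mathrm{Tr}(H\Sigma) = \frac{\eta}{4S}\,\mathrm{Tr}(H).
$$

So, at a fixed LR/BS ratio, the expected loss around a minimum scales with $\mathrm{Tr}(H)$, which is why the trace is the natural width measure here.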
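To make the LR/BS equivalence (and its breakdown at large $\beta$) concrete for myself, here is a minimal NumPy sketch, not from the paper: SGD on a 1-D quadratic with Gaussian gradient noise. Runs with the same LR/BS ratio settle to roughly the same stationary variance of the iterates, while scaling both by 100x keeps the ratio but roughly doubles the variance relative to the SDE prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationary_variance(lr, batch_size, h=1.0, noise_var=1.0, steps=100_000):
    """SGD on L(w) = h/2 * w**2 with per-example gradient noise of variance
    `noise_var`; mini-batch averaging divides that variance by `batch_size`.
    Returns the empirical variance of the iterates after burn-in."""
    w, history = 1.0, []
    for _ in range(steps):
        grad = h * w + rng.normal(0.0, np.sqrt(noise_var / batch_size))
        w -= lr * grad
        history.append(w)
    return np.var(history[steps // 2:])

# Same LR/BS ratio (0.001): the SDE predicts variance lr * noise_var / (2 * h * batch_size) = 5e-4.
print(stationary_variance(lr=0.01, batch_size=10))    # ~5e-4
print(stationary_variance(lr=0.02, batch_size=20))    # ~5e-4
# Same ratio, but lr and batch size scaled 100x: discretization error shows up.
print(stationary_variance(lr=1.0, batch_size=1000))   # ~1e-3, about 2x the SDE prediction
```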
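And since the width measure is $\mathrm{Tr}(H)$, here is a sketch of how I would estimate it in PyTorch, using a Hutchinson estimator with Rademacher probes and Hessian-vector products via double backprop (my own illustration, not the paper's measurement code):

```python
import torch

def hessian_trace(loss_fn, params, n_samples=100):
    """Hutchinson estimate of Tr(H): E_v[v^T H v] over Rademacher vectors v,
    where H v is computed as a Hessian-vector product by differentiating
    the gradient a second time."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # entries in {-1, +1}
        grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
        estimate += sum((hv * v).sum().item() for hv, v in zip(hvs, vs))
    return estimate / n_samples

# Sanity check on a toy loss: sum(a_i * w_i^2) has Hessian diag(2 * a_i), so Tr(H) = 2 * (1+2+3+4+5) = 30.
w = torch.randn(5, requires_grad=True)
a = torch.arange(1.0, 6.0)
print(hessian_trace(lambda: (a * w ** 2).sum(), [w]))
```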