Three Factors Influencing Minima in SGD
Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
Published November 2017
Read on Aug 12, 2020
TL;DR
Batch size, learning rate, and gradient covariance influence which minima SGD finds. The LR/BS ratio is key to the width of the minima, which impacts generalization. SGD is treated as a discretization of an SDE. The theory is validated experimentally.
Review
Summary
- SGD performs similarly across different batch sizes as long as the LR/BS ratio is kept constant (see the toy simulation after this list).
- The authors note that SGD runs with the same LR/BS ratio are different discretizations of the same Stochastic Differential Equation (SDE).
- LR schedules and BS schedules are interchangeable; what matters, again, is how the LR/BS ratio evolves over training.
- Width of a minimum is defined in terms of the trace of the Hessian, $\mathrm{Tr}(H)$, at the minimum: lower trace = wider minimum (a sketch for estimating $\mathrm{Tr}(H)$ follows below).
- Assumption 1: at a local minimum, the loss surface is approximated by a quadratic bowl. This lets the training process be approximated by an Ornstein-Uhlenbeck process.
- Assumption 2: $H$ is approximated by the covariance matrix of the stochastic gradients ($H = C$; this relies on $C$ being anisotropic).
- A larger LR/BS ratio correlates with wider minima, giving better generalization.
- However, scaling LR and BS jointly by a large factor $\beta$, keeping $\frac{LR}{BS}=\frac{\beta \eta}{\beta S}$ constant, causes the approximation to the SDE to break down, leading to lower performance.
- Discretization errors become apparent at large learning rates.
- Central Limit Theorem assumptions break down for small datasets and large batch sizes.
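Putting Assumptions 1 and 2 together (my re-derivation in the notation above, with $\eta$ the learning rate, $S$ the batch size, and $\Sigma$ the stationary covariance of the iterates around the minimum): the Ornstein-Uhlenbeck process has a stationary covariance solving a Lyapunov equation, and $H = C$ collapses it to a multiple of the identity set by the LR/BS ratio:

$$
H\Sigma + \Sigma H = \frac{\eta}{S}\,C, \qquad H = C \;\Rightarrow\; \Sigma = \frac{\eta}{2S} I, \qquad \mathbb{E}[L] \approx \tfrac{1}{2}\,\mathrm{Tr}(H\Sigma) = \frac{\eta}{4S}\,\mathrm{Tr}(H).
$$

So, at a fixed LR/BS ratio, the expected loss around a minimum scales with $\mathrm{Tr}(H)$, which is why the trace is the natural width measure here.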
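To make the LR/BS equivalence (and its breakdown at large $\beta$) concrete for myself, here is a minimal NumPy sketch, not from the paper: SGD on a 1-D quadratic with Gaussian gradient noise. Runs with the same LR/BS ratio settle to roughly the same stationary variance of the iterates, while scaling both by 100x keeps the ratio but roughly doubles the variance relative to the SDE prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationary_variance(lr, batch_size, h=1.0, noise_var=1.0, steps=100_000):
    """SGD on L(w) = h/2 * w**2 with per-example gradient noise of variance
    `noise_var`; mini-batch averaging divides that variance by `batch_size`.
    Returns the empirical variance of the iterates after burn-in."""
    w, history = 1.0, []
    for _ in range(steps):
        grad = h * w + rng.normal(0.0, np.sqrt(noise_var / batch_size))
        w -= lr * grad
        history.append(w)
    return np.var(history[steps // 2:])

# Same LR/BS ratio (0.001): the SDE predicts variance lr * noise_var / (2 * h * batch_size) = 5e-4.
print(stationary_variance(lr=0.01, batch_size=10))    # ~5e-4
print(stationary_variance(lr=0.02, batch_size=20))    # ~5e-4
# Same ratio, but lr and batch size scaled 100x: discretization error shows up.
print(stationary_variance(lr=1.0, batch_size=1000))   # ~1e-3, about 2x the SDE prediction
```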
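And since the width measure is $\mathrm{Tr}(H)$, here is a sketch of how I would estimate it in PyTorch, using a Hutchinson estimator with Rademacher probes and Hessian-vector products via double backprop (my own illustration, not the paper's measurement code):

```python
import torch

def hessian_trace(loss_fn, params, n_samples=100):
    """Hutchinson estimate of Tr(H): E_v[v^T H v] over Rademacher vectors v,
    where H v is computed as a Hessian-vector product by differentiating
    the gradient a second time."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # entries in {-1, +1}
        grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
        estimate += sum((hv * v).sum().item() for hv, v in zip(hvs, vs))
    return estimate / n_samples

# Sanity check on a toy loss: sum(a_i * w_i^2) has Hessian diag(2 * a_i), so Tr(H) = 2 * (1+2+3+4+5) = 30.
w = torch.randn(5, requires_grad=True)
a = torch.arange(1.0, 6.0)
print(hessian_trace(lambda: (a * w ** 2).sum(), [w]))
```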