Review Summary Variational inference: proof that SGD minimizes a potential along with an entropic regularization term. However, this potential differs from the loss used to compute backpropagation gradients; the two are equal only if the gradient noise is isotropic (i.e., its covariance is proportional to the identity).
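A schematic of the claimed objective (notation and constants here are ours, not necessarily the authors'): the stationary distribution of SGD can be viewed as the minimizer of a free-energy-like functional
$$ F(q) \;=\; \mathbb{E}_{\theta \sim q}\big[\Phi(\theta)\big] \;-\; \lambda\, H(q), $$
where $\Phi$ is the induced potential, $H(q)$ is the entropy of the distribution $q$, and $\lambda$ is a temperature set by the LR/BS ratio. The point of the summary is that $\Phi$ coincides with the training loss $L$ only when the gradient-noise covariance $\Sigma(\theta)$ is isotropic.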
Review Summary The authors expand on their previous work on the continuous-time limit of SGD. They show how SGD with a constant LR can be modelled as an SDE that reaches a stationary distribution.
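A minimal sketch of such a continuous-time model (the particular noise scaling written here is an assumption on our part):
$$ d\theta_t \;=\; -\nabla L(\theta_t)\, dt \;+\; \sqrt{\tfrac{\eta}{B}}\, \Sigma(\theta_t)^{1/2}\, dW_t, $$
with learning rate $\eta$, batch size $B$, gradient-noise covariance $\Sigma(\theta)$, and a Wiener process $W_t$. With a constant $\eta$ the process does not converge to a point but to a stationary distribution over parameters.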
Review Summary SGD performs similarly across different batch sizes as long as the LR/BS ratio is held constant. The authors note that SGD runs with the same LR/BS ratio are different discretizations of the same stochastic differential equation.
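A sketch of why only the ratio enters (our notation): one SGD step is $\theta_{k+1} = \theta_k - \eta\, \hat g_B(\theta_k)$ with $\mathbb{E}[\hat g_B] = \nabla L$ and $\mathrm{Cov}[\hat g_B] \approx \Sigma/B$. Identifying one step with a time increment $\Delta t = \eta$, the drift is $-\nabla L\, \Delta t$ and the injected noise has covariance $\eta^2 \Sigma / B = (\eta/B)\, \Sigma\, \Delta t$, so the limiting SDE depends on $\eta$ and $B$ only through $\eta/B$; runs with the same ratio are then discretizations of that SDE with different step sizes.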
Review Summary Batch normalization induces scale invariance of the loss with respect to the weights (i.e., $L(x; \theta) = L(x; k \theta)$ for parameters $\theta$ preceding a BN layer; this expression is not mathematically precise).
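A short sketch of the invariance for weights $W$ that feed a BN layer (biases and $k \le 0$ ignored): because BN subtracts the batch mean and divides by the batch standard deviation,
$$ \mathrm{BN}(kWx) \;=\; \frac{kWx - \mu(kWx)}{\sigma(kWx)} \;=\; \frac{k\,(Wx - \mu(Wx))}{k\,\sigma(Wx)} \;=\; \mathrm{BN}(Wx), \qquad k > 0, $$
so the loss does not change when those weights are rescaled. This also makes clear why $L(x;\theta) = L(x;k\theta)$ is imprecise as written: the invariance holds only for the pre-BN weights and only for positive $k$.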