Alex Williams, Erin Kunz, Simon Kornblith, Scott Linderman
Published October 2021
Read on Feb 7, 2021
TL;DR This paper highlights issues that can arise when representation similarity measures are not *metrics* in the mathematical sense. The authors formulate novel metrics based on previous approaches by making them satisfy the triangle inequality, including one geared specifically for CNNs. They demonstrate that their methods are effective and scale to large numbers of specimens.
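As a concrete illustration of a representation distance that does satisfy the triangle inequality, here is a minimal sketch of an orthogonal-Procrustes-style distance (my own simplified version, not necessarily the paper's exact recipe or preprocessing): it is the Frobenius distance minimized over rotations/reflections, so it inherits the triangle inequality on the resulting equivalence classes.

```python
# Hedged sketch: an orthogonal-Procrustes-style distance between two
# representation matrices. It satisfies the triangle inequality because it is
# the Frobenius distance minimized over an isometry (rotation/reflection) group.
# Names and preprocessing here are illustrative, not the paper's exact recipe.
import numpy as np

def procrustes_distance(X, Y):
    """X, Y: (n_stimuli, n_neurons) arrays of equal shape."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # min over orthogonal Q of ||X - Y Q||_F^2
    #   = ||X||_F^2 + ||Y||_F^2 - 2 * nuclear_norm(Y^T X)
    nuc = np.linalg.svd(Y.T @ X, compute_uv=False).sum()
    sq = np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2 - 2.0 * nuc
    return np.sqrt(max(sq, 0.0))

# Usage: pairwise distances across many networks/animals ("specimens") can then
# be fed to standard metric-space tools (embedding, clustering, etc.).
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 50)), rng.normal(size=(100, 50))
print(procrustes_distance(X, Y))
```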
TL;DR The authors present two main results: a thorough mathematical analysis of how SGD performs variational inference and what its steady-state behavior looks like (limit cycles), and empirical quantities similar to the ones we have measured, which they analyze against a Brownian-motion null.
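For intuition about what a Brownian-motion null comparison might look like (a toy sketch with made-up quantities, not the paper's experiments): for a pure random walk the mean-squared displacement grows linearly in time, whereas a mean-reverting or cyclic trajectory saturates, so the empirical curve can be compared against the linear null.

```python
# Hedged toy sketch: compare mean-squared displacement (MSD) of a trajectory
# against a Brownian-motion null (whose MSD grows linearly in time).
import numpy as np

def msd(traj):
    """Mean-squared displacement from the starting point, per time step."""
    return ((traj - traj[0]) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
T, d, sigma = 2000, 10, 0.1

# Brownian-motion null: accumulate isotropic Gaussian steps.
brownian = np.cumsum(sigma * rng.normal(size=(T, d)), axis=0)

# Toy mean-reverting (OU-like) trajectory: its MSD saturates instead of growing.
ou = np.zeros((T, d))
for t in range(1, T):
    ou[t] = ou[t - 1] - 0.05 * ou[t - 1] + sigma * rng.normal(size=d)

print("late-time MSD, Brownian  :", msd(brownian)[-1])
print("late-time MSD, OU-like   :", msd(ou)[-1])
```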
TL;DR Rethinking SGD in the continuous-time limit yields valuable insight, particularly for hyperparameter tuning. This paper introduces the SDE derivation used in the previously reviewed 'Three Factors' paper, and elaborates on minimizing the KL divergence between the stationary distribution of the underlying OU process and the target posterior (as such it relies on the Bayesian view of ML algorithms, rather than the optimization view).
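In rough outline (a hedged reconstruction under the usual quadratic-loss, constant-noise assumptions; the notation is mine, not the paper's): SGD is modeled as an Ornstein–Uhlenbeck process, its stationary Gaussian is computed, and the hyperparameters are then tuned to minimize the KL divergence to the posterior.

```latex
% Hedged sketch of the OU / KL argument (quadratic loss with Hessian A,
% constant gradient-noise covariance C; my notation, not the paper's).
\begin{align}
  d\theta_t &= -A\,\theta_t\, dt + \sqrt{\tfrac{\eta}{S}}\, C^{1/2}\, dW_t
  && \text{(OU approximation of SGD, lr } \eta \text{, batch size } S\text{)} \\
  \theta_\infty &\sim \mathcal{N}(0, \Sigma), \qquad
  A\Sigma + \Sigma A^\top = \tfrac{\eta}{S}\, C
  && \text{(stationary covariance from the Lyapunov equation)} \\
  \min_{\eta,\, S,\, \text{preconditioner}} &\;
  \mathrm{KL}\!\left(\mathcal{N}(0,\Sigma)\,\big\|\, p(\theta \mid \mathcal{D})\right)
  && \text{(make the iterates approximately sample the posterior)}
\end{align}
```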
Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
Published November 2017
Read on Aug 12, 2020
TL;DR Batch size, learning rate, and gradient covariance influence the minima SGD finds. The LR/BS ratio is key to the width of those minima, which impacts generalization. SGD is treated as the discretization of an SDE, and the theory is validated experimentally.
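The core step, sketched here in my own notation under the Gaussian gradient-noise assumption: the minibatch update is read as an Euler–Maruyama discretization of an SDE whose noise scale depends on learning rate and batch size only through their ratio, which is why η/S controls the width of the minima reached.

```latex
% Hedged sketch: minibatch SGD as an Euler--Maruyama step of an SDE.
% Gradient-noise covariance C, learning rate \eta, batch size S.
\begin{align}
  \theta_{k+1} &= \theta_k - \eta\, \hat{g}_S(\theta_k), \qquad
  \hat{g}_S(\theta_k) \approx \nabla L(\theta_k) + \tfrac{1}{\sqrt{S}}\,\epsilon_k,
  \quad \epsilon_k \sim \mathcal{N}(0, C) \\
  \theta_{k+1} &= \theta_k - \eta\, \nabla L(\theta_k)
  + \sqrt{\eta}\,\sqrt{\tfrac{\eta}{S}}\, C^{1/2}\, \xi_k,
  \quad \xi_k \sim \mathcal{N}(0, I) \\
  d\theta_t &= -\nabla L(\theta_t)\, dt + \sqrt{\tfrac{\eta}{S}}\, C^{1/2}\, dW_t
  && \text{(noise scale set by the ratio } \eta / S\text{)}
\end{align}
```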
TL;DR DNNs trained with weight decay and batch normalization reach a learning equilibrium on the surface of a sphere in parameter space, and their limiting angular update can be computed a priori.
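A hedged back-of-the-envelope version of the a priori angular update (assuming scale-invariant, batch-normalized weights so the gradient is orthogonal to the weight vector, with learning rate η, weight decay λ, and momentum ignored):

```latex
% Hedged sketch: equilibrium angular update for SGD + weight decay on
% scale-invariant (batch-normalized) weights, ignoring momentum.
\begin{align}
  w_{k+1} &= (1 - \eta\lambda)\, w_k - \eta\, g_k,
  \qquad g_k \perp w_k \quad \text{(scale invariance)} \\
  \|w_{k+1}\|^2 &= (1 - \eta\lambda)^2 \|w_k\|^2 + \eta^2 \|g_k\|^2
  \;\overset{\text{equilibrium}}{=}\; \|w_k\|^2
  \;\Rightarrow\; \frac{\eta \|g_k\|}{\|w_k\|} \approx \sqrt{2\eta\lambda} \\
  \tan(\Delta\phi_k) &= \frac{\eta \|g_k\|}{(1-\eta\lambda)\,\|w_k\|}
  \;\Rightarrow\; \Delta\phi_k \approx \sqrt{2\eta\lambda}
  \quad \text{(set by the hyperparameters alone)}
\end{align}
```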