Alex Williams, Erin Kunz, Simon Kornblith, Scott Linderman
Published October 2021
Read on Feb 7, 2021
TL;DR This paper highlights issues that can arise when representation similarity measures are not *metrics* in the mathematical sense. The authors formulate novel metrics based on previous approaches by making them satisfy the triangle inequality, including one geared specifically for CNNs. They demonstrate that their methods are effective and scale to large numbers of specimens.
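As a concrete illustration of a representation distance that does satisfy the triangle inequality, here is a minimal sketch of an orthogonal-Procrustes-style distance (my own simplified version, not necessarily the paper's exact recipe or preprocessing): it is the Frobenius distance minimized over rotations/reflections, so it inherits the triangle inequality on the resulting equivalence classes.

```python
# Hedged sketch: an orthogonal-Procrustes-style distance between two
# representation matrices. It satisfies the triangle inequality because it is
# the Frobenius distance minimized over an isometry (rotation/reflection) group.
# Names and preprocessing here are illustrative, not the paper's exact recipe.
import numpy as np

def procrustes_distance(X, Y):
    """X, Y: (n_stimuli, n_neurons) arrays of equal shape."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # min over orthogonal Q of ||X - Y Q||_F^2
    #   = ||X||_F^2 + ||Y||_F^2 - 2 * nuclear_norm(Y^T X)
    nuc = np.linalg.svd(Y.T @ X, compute_uv=False).sum()
    sq = np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2 - 2.0 * nuc
    return np.sqrt(max(sq, 0.0))

# Usage: pairwise distances across many networks/animals ("specimens") can then
# be fed to standard metric-space tools (embedding, clustering, etc.).
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 50)), rng.normal(size=(100, 50))
print(procrustes_distance(X, Y))
```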
TL;DR The authors present two main results: a thorough mathematical analysis of how SGD performs variational inference and what its steady-state behavior looks like (limit cycles), and empirical quantities similar to the ones we have measured, which they analyze against a Brownian-motion null.
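For intuition about what a Brownian-motion null comparison might look like (a toy sketch with made-up quantities, not the paper's experiments): for a pure random walk the mean-squared displacement grows linearly in time, whereas a mean-reverting or cyclic trajectory saturates, so the empirical curve can be compared against the linear null.

```python
# Hedged toy sketch: compare mean-squared displacement (MSD) of a trajectory
# against a Brownian-motion null (whose MSD grows linearly in time).
import numpy as np

def msd(traj):
    """Mean-squared displacement from the starting point, per time step."""
    return ((traj - traj[0]) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
T, d, sigma = 2000, 10, 0.1

# Brownian-motion null: accumulate isotropic Gaussian steps.
brownian = np.cumsum(sigma * rng.normal(size=(T, d)), axis=0)

# Toy mean-reverting (OU-like) trajectory: its MSD saturates instead of growing.
ou = np.zeros((T, d))
for t in range(1, T):
    ou[t] = ou[t - 1] - 0.05 * ou[t - 1] + sigma * rng.normal(size=d)

print("late-time MSD, Brownian  :", msd(brownian)[-1])
print("late-time MSD, OU-like   :", msd(ou)[-1])
```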
TL;DR Rethinking SGD in the continuous-time limit yields valuable insight, particularly for hyperparameter tuning. This paper introduces the SDE derivation used in the previously reviewed 'Three Factors' paper, and elaborates on minimizing the KL divergence between the stationary distribution of the underlying OU process and the target posterior (as such it relies on the Bayesian view of ML algorithms, rather than the optimization view).
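In rough outline (a hedged reconstruction under the usual quadratic-loss, constant-noise assumptions; the notation is mine, not the paper's): SGD is modeled as an Ornstein–Uhlenbeck process, its stationary Gaussian is computed, and the hyperparameters are then tuned to minimize the KL divergence to the posterior.

```latex
% Hedged sketch of the OU / KL argument (quadratic loss with Hessian A,
% constant gradient-noise covariance C; my notation, not the paper's).
\begin{align}
  d\theta_t &= -A\,\theta_t\, dt + \sqrt{\tfrac{\eta}{S}}\, C^{1/2}\, dW_t
  && \text{(OU approximation of SGD, lr } \eta \text{, batch size } S\text{)} \\
  \theta_\infty &\sim \mathcal{N}(0, \Sigma), \qquad
  A\Sigma + \Sigma A^\top = \tfrac{\eta}{S}\, C
  && \text{(stationary covariance from the Lyapunov equation)} \\
  \min_{\eta,\, S,\, \text{preconditioner}} &\;
  \mathrm{KL}\!\left(\mathcal{N}(0,\Sigma)\,\big\|\, p(\theta \mid \mathcal{D})\right)
  && \text{(make the iterates approximately sample the posterior)}
\end{align}
```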
Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
Published November 2017
Read on Aug 12, 2020
TL;DR Batch size, learning rate, and gradient covariance influence the minima SGD finds. The LR/BS ratio is key to the width of those minima, which impacts generalization. SGD is treated as the discretization of an SDE, and the theory is validated experimentally.
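The core step, sketched here in my own notation under the Gaussian gradient-noise assumption: the minibatch update is read as an Euler–Maruyama discretization of an SDE whose noise scale depends on learning rate and batch size only through their ratio, which is why η/S controls the width of the minima reached.

```latex
% Hedged sketch: minibatch SGD as an Euler--Maruyama step of an SDE.
% Gradient-noise covariance C, learning rate \eta, batch size S.
\begin{align}
  \theta_{k+1} &= \theta_k - \eta\, \hat{g}_S(\theta_k), \qquad
  \hat{g}_S(\theta_k) \approx \nabla L(\theta_k) + \tfrac{1}{\sqrt{S}}\,\epsilon_k,
  \quad \epsilon_k \sim \mathcal{N}(0, C) \\
  \theta_{k+1} &= \theta_k - \eta\, \nabla L(\theta_k)
  + \sqrt{\eta}\,\sqrt{\tfrac{\eta}{S}}\, C^{1/2}\, \xi_k,
  \quad \xi_k \sim \mathcal{N}(0, I) \\
  d\theta_t &= -\nabla L(\theta_t)\, dt + \sqrt{\tfrac{\eta}{S}}\, C^{1/2}\, dW_t
  && \text{(noise scale set by the ratio } \eta / S\text{)}
\end{align}
```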
TL;DR DNNs trained with weight decay and batch normalization reach a learning equilibrium on the surface of a sphere in parameter space, and their limiting angular update can be computed a priori.
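A hedged back-of-the-envelope version of the a priori angular update (assuming scale-invariant, batch-normalized weights so the gradient is orthogonal to the weight vector, with learning rate η, weight decay λ, and momentum ignored):

```latex
% Hedged sketch: equilibrium angular update for SGD + weight decay on
% scale-invariant (batch-normalized) weights, ignoring momentum.
\begin{align}
  w_{k+1} &= (1 - \eta\lambda)\, w_k - \eta\, g_k,
  \qquad g_k \perp w_k \quad \text{(scale invariance)} \\
  \|w_{k+1}\|^2 &= (1 - \eta\lambda)^2 \|w_k\|^2 + \eta^2 \|g_k\|^2
  \;\overset{\text{equilibrium}}{=}\; \|w_k\|^2
  \;\Rightarrow\; \frac{\eta \|g_k\|}{\|w_k\|} \approx \sqrt{2\eta\lambda} \\
  \tan(\Delta\phi_k) &= \frac{\eta \|g_k\|}{(1-\eta\lambda)\,\|w_k\|}
  \;\Rightarrow\; \Delta\phi_k \approx \sqrt{2\eta\lambda}
  \quad \text{(set by the hyperparameters alone)}
\end{align}
```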