Deep Learning ultimately is about finding a minimum that generalizes well -- with bonus points for finding one fast and reliably. Our workhorse, stochastic gradient descent (SGD), is a 60-year-old algorithm (Robbins and Monro, 1951) [1] that is as essential to the current generation of Deep Learning algorithms as back-propagation.
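For readers who want the update rule itself in front of them, here is a minimal sketch of a single vanilla SGD step in NumPy; the function name, toy objective, and hyperparameter values are mine, chosen purely for illustration:

```python
import numpy as np

def sgd_update(param, grad, lr=0.1):
    """One vanilla SGD step: move against the gradient of the loss."""
    return param - lr * grad

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = sgd_update(theta, grad=2 * theta, lr=0.1)
print(theta)  # approaches the minimum at [0, 0]
```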
Different optimization algorithms have been proposed in recent years, which use different equations to update a model's parameters. Adam (Kingma and Ba, 2015) [18] was introduced in 2015 and is arguably still the most commonly used of these algorithms today. This indicates that from the Machine Learning practitioner's perspective, best practices for optimization for Deep Learning have largely remained the same.
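As a reminder of what Adam actually computes, the sketch below writes out one Adam step following the update equations in Kingma and Ba (2015); the function and variable names are my own, and the default hyperparameters are the ones suggested in the paper:

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter tensor.

    m, v: running first and second moment estimates (same shape as param);
    t: 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return param, m, v
```

The division by the square root of the second moment is what makes the effective learning rate adaptive: parameters with consistently large gradients take smaller steps, and vice versa.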
New ideas, however, have been developed over the course of this year, which may shape the way we optimize our models in the future. In this blog post, I will touch on what are in my opinion the most exciting highlights and most promising directions in optimization for Deep Learning. Note that this blog post assumes familiarity with SGD and with adaptive learning rate methods such as Adam. To get up to speed, refer to this blog post for an overview of existing gradient descent optimization algorithms.
Improving Adam
Despite the apparent supremacy of adaptive learning rate methods such as Adam, state-of-the-art results for many tasks in computer vision and NLP such as object recognition (Huang et al., 2017) [17] or machine translation (Wu et al., 2016) [3] have still been achieved by plain old SGD with momentum. Recent theory (Wilson et al., 2017) [15] provides some justification for this, suggesting that adaptive learning rate methods converge to different (and less optimal) minima than SGD with momentum. It has been shown empirically that the minima found by adaptive learning rate methods generally perform worse than those found by SGD with momentum on object recognition, character-level language modeling, and constituency parsing. This seems counter-intuitive given that Adam comes with nice convergence guarantees and that its adaptive learning rate should give it an edge over regular SGD. However, Adam and other adaptive learning rate methods are not without their own flaws.
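For comparison with the Adam sketch above, this is roughly what an SGD-with-momentum step looks like, using a common formulation in which a velocity term accumulates an exponentially decaying sum of past gradients; again, the names and default values are chosen for illustration:

```python
import numpy as np

def sgd_momentum_update(param, grad, velocity, lr=0.01, gamma=0.9):
    """One SGD-with-momentum step for a single parameter tensor."""
    velocity = gamma * velocity + lr * grad   # exponentially decaying sum of past gradients
    param = param - velocity                  # step in the accumulated direction
    return param, velocity
```

Unlike Adam, every coordinate is scaled by the same global learning rate; the momentum term only smooths the direction of the updates, it does not adapt their magnitude per parameter.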