Deep Learning ultimately is about finding a minimum that generalizes well -- with bonus points for finding one fast and reliably. Our workhorse, stochastic gradient descent (SGD), is a 60-year-old algorithm (Robbins and Monro, 1951) [1] that is as essential to the current generation of Deep Learning algorithms as back-propagation.
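For readers who want the update rule itself in front of them, here is a minimal sketch of a single vanilla SGD step in NumPy; the function name, toy objective, and hyperparameter values are mine, chosen purely for illustration:

```python
import numpy as np

def sgd_update(param, grad, lr=0.1):
    """One vanilla SGD step: move against the gradient of the loss."""
    return param - lr * grad

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = sgd_update(theta, grad=2 * theta, lr=0.1)
print(theta)  # approaches the minimum at [0, 0]
```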
Different optimization algorithms have been proposed in recent years, which use different equations to update a model's parameters. Adam (Kingma and Ba, 2015) [18] was introduced in 2015 and is arguably still the most commonly used of these algorithms today. This indicates that from the Machine Learning practitioner's perspective, best practices for optimization for Deep Learning have largely remained the same.
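As a reminder of what Adam actually computes, the sketch below writes out one Adam step following the update equations in Kingma and Ba (2015); the function and variable names are my own, and the default hyperparameters are the ones suggested in the paper:

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter tensor.

    m, v: running first and second moment estimates (same shape as param);
    t: 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return param, m, v
```

The division by the square root of the second moment is what makes the effective learning rate adaptive: parameters with consistently large gradients take smaller steps, and vice versa.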
New ideas, however, have been developed over the course of this year, which may shape the way we optimize our models in the future. In this blog post, I will touch on what are in my opinion the most exciting highlights and most promising directions in optimization for Deep Learning. Note that this blog post assumes familiarity with SGD and with adaptive learning rate methods such as Adam. To get up to speed, refer to this blog post for an overview of existing gradient descent optimization algorithms.
Improving Adam
Despite the apparent supremacy of adaptive learning rate methods such as Adam, state-of-the-art results for many tasks in computer vision and NLP such as object recognition (Huang et al., 2017) [17] or machine translation (Wu et al., 2016) [3] have still been achieved by plain old SGD with momentum. Recent theory (Wilson et al., 2017) [15] provides some justification for this, suggesting that adaptive learning rate methods converge to different (and less optimal) minima than SGD with momentum. It has been shown empirically that the minima found by adaptive learning rate methods generally perform worse than those found by SGD with momentum on object recognition, character-level language modeling, and constituency parsing. This seems counter-intuitive given that Adam comes with nice convergence guarantees and that its adaptive learning rate should give it an edge over regular SGD. However, Adam and other adaptive learning rate methods are not without their own flaws.
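For comparison with the Adam sketch above, this is roughly what an SGD-with-momentum step looks like, using a common formulation in which a velocity term accumulates an exponentially decaying sum of past gradients; again, the names and default values are chosen for illustration:

```python
import numpy as np

def sgd_momentum_update(param, grad, velocity, lr=0.01, gamma=0.9):
    """One SGD-with-momentum step for a single parameter tensor."""
    velocity = gamma * velocity + lr * grad   # exponentially decaying sum of past gradients
    param = param - velocity                  # step in the accumulated direction
    return param, velocity
```

Unlike Adam, every coordinate is scaled by the same global learning rate; the momentum term only smooths the direction of the updates, it does not adapt their magnitude per parameter.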