Column name: 机器学习研究会
机器学习研究会 is a student organization under the Big Data and Machine Learning Innovation Center at Peking University, aiming to build a platform where machine learning practitioners can exchange ideas. Besides sharing timely news from the field, the society also organizes activities such as talks by industry leaders and academic experts, salon-style sharing sessions with senior researchers, and real-data innovation competitions.

[Recommended] Direct Future Prediction: Reinforcement Learning as Supervised Learning

机器学习研究会 · WeChat official account · AI · 2017-11-24 22:31

Main text



Summary

Reposted from: 爱可可-爱生活

In this post, I am going to go over a novel algorithm in reinforcement learning called Direct Future Prediction (DFP). In my opinion, it has a few really interesting properties that make it stand out from well-known methods such as Actor-Critic and Deep Q-Learning. I will try my best to share my insights and walk you through my implementation on a scenario (VizDoom Health Gathering) that I think is well suited to showcasing DFP's benefits.

Why should we care about DFP?

DFP first caught my attention when it won the 'Full Deathmatch' track of the VizDoom AI Competition in 2016. The competition took place in an unseen, partially observable 3D environment. Participating agents had to fight against each other, and the one with the highest number of frags (kills minus deaths) was declared the winner. Not only did DFP win the competition, it did so in an utterly dominating fashion, outperforming the rest of the field (including A3C and variants of DQN) by more than 50%. All it took was a simple architecture with no additional supervisory signals! You might wonder: how did DFP manage to perform so well compared to these well-known methods?

Reformulate RL as SL

The trick, it turns out, is to reformulate the reinforcement learning (RL) problem as a supervised learning (SL) problem. This is not a new idea. As the authors of the paper point out, the supervised learning perspective on reinforcement learning dates back decades. Jordan & Rumelhart (1992) argue that the choice of SL versus RL should be "guided by the characteristics of the environment". Their analysis suggests that RL may be more efficient when the environment provides only a sparse scalar reward signal, whereas SL can be advantageous when dense multidimensional feedback is available. What does that mean exactly?

Recall that in the RL setting, learning is guided by a stream of scalar reward signals. In complex environments the scalar reward can be sparse and delayed, so it is not easy to tell which action, or sequence of actions, is responsible for a positive reward that arrives many time steps later. This is known as the credit assignment problem.

But what if the environment also provides, in addition to the rewards, some kind of rich and temporally dense multidimensional feedback, for example measurements like kills, health, and ammunition level in a first-person shooter? Then we can train the agent to predict this rich, temporally dense measurement stream instead. All the agent has to do at inference time is observe the predicted effects of different actions on the measurement stream and choose the action that maximizes an "objective" (let's call it U), which can be expressed as a function of the predicted future measurements relative to the current measurements m_t (in the DFP paper, U is simply a goal-weighted linear combination of the predicted changes in measurements).
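To make this concrete, here is a minimal sketch (not the author's actual Keras implementation) of the two pieces described above: turning a logged measurement stream into dense supervised regression targets, and selecting actions by scoring per-action predictions with a goal vector g. The temporal offsets, measurement names, and goal weights below are illustrative assumptions, not values taken from the post.

```python
import numpy as np

# Illustrative settings (assumptions): two measurements per step,
# e.g. [health, medkits_collected], and predictions of measurement
# changes at several future horizons.
TEMPORAL_OFFSETS = [1, 2, 4, 8, 16, 32]

def build_targets(measurements):
    """Turn a (T, num_measurements) log into dense regression targets.

    The target for step t concatenates (m_{t+tau} - m_t) for each temporal
    offset tau -- a multidimensional supervision signal instead of a single
    scalar reward.
    """
    T, num_meas = measurements.shape
    max_offset = max(TEMPORAL_OFFSETS)
    targets = []
    for t in range(T - max_offset):
        diffs = [measurements[t + tau] - measurements[t] for tau in TEMPORAL_OFFSETS]
        targets.append(np.concatenate(diffs))
    return np.asarray(targets)

def select_action(predicted_changes, goal):
    """Pick the action maximizing U = g . f.

    predicted_changes: (num_actions, num_offsets * num_measurements) array of
    predicted future measurement changes, one row per candidate action.
    goal: (num_offsets * num_measurements,) weights, e.g. favouring health.
    """
    scores = predicted_changes @ goal   # one scalar objective per action
    return int(np.argmax(scores))

# Toy usage: a fake 100-step measurement log and fake network predictions.
rng = np.random.default_rng(0)
log = np.cumsum(rng.integers(0, 2, size=(100, 2)), axis=0)
y = build_targets(log)                          # shape (68, 12)
preds = rng.normal(size=(3, 12))                # 3 candidate actions
g = np.tile([1.0, 0.5], len(TEMPORAL_OFFSETS))  # hypothetical goal weights
print(y.shape, select_action(preds, g))
```

In the full method, a neural network produces the predicted measurement changes for every action from the current observation, measurements, and goal, and is trained with a plain regression loss against targets like those returned by build_targets above.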






