Column: 机器学习研究会 (Machine Learning Research Society)

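机器学习研究会 (Machine Learning Research Society) is a student organization under Peking University's Center for Big Data and Machine Learning Innovation. It aims to build a platform where machine learning practitioners can exchange ideas. Besides sharing timely news from the field, the society hosts talks by industry leaders and prominent academics, salon-style sharing sessions with leading researchers, real-data innovation competitions, and other events.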

[Paper] WSDM 2017 paper: How to bring reinforcement learning into Real-Time Bidding for advertising

机器学习研究会 · WeChat official account · AI · 2017-02-24 20:08

Main text


Summary

Reposted from: 洪亮劼 (Liangjie Hong)

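The authors of the paper "Real-Time Bidding by Reinforcement Learning in Display Advertising" are from Shanghai Jiao Tong University and University College London. After reinforcement learning was applied to search and recommendation, this paper is another attempt to apply it to an important domain. Unlike recommendation and search, RTB is real-time by nature, so it places more weight on adjusting a decision process dynamically in order to reach the best solution.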

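The core problem with most current bidding algorithms or strategies is that they treat bidding as a static decision process. The main idea of this paper is therefore to model RTB as a Markov Decision Process (MDP). A standard MDP formulation needs three essential elements: State, Action, and Reward. Here, the State is a triple (current time, remaining budget, current feature vector). The Action takes the State as input and outputs a bid that stays within the remaining budget. The Reward is defined in this paper as the click-through rate (CTR) given the current feature vector, or 0 when the auction is not won. Beyond these three elements, an MDP generally also needs the transition probabilities from each state to the others. In this paper, the transition probability is the product of a probability distribution over feature vectors and a market price distribution. The market price distribution depends on the current feature vector and the current bid price.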
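To make the State, Action, and Reward described above concrete, here is a minimal sketch of one auction step. It is an illustration, not the paper's code. The names `State` and `step` are hypothetical, the time component is encoded as a count of remaining auctions, prices are integers, and the winner is assumed to pay the market price (a second-price auction); none of these details are stated in the text above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    t: int                        # auctions remaining (assumed encoding of time)
    budget: int                   # remaining budget, in integer price units
    features: Tuple[float, ...]   # feature vector of the current impression


def step(state: State, bid: int, market_price: int, ctr: float,
         next_features: Tuple[float, ...]) -> Tuple[State, float]:
    """Apply one bid and return (next state, reward)."""
    # The Action must stay within the remaining budget.
    if not 0 <= bid <= state.budget:
        raise ValueError("bid must be between 0 and the remaining budget")
    won = bid >= market_price
    # Assumption: the winner pays the market price (second-price auction).
    cost = market_price if won else 0
    # Reward is the predicted CTR on a win, and 0 when the auction is lost.
    reward = ctr if won else 0.0
    next_state = State(state.t - 1, state.budget - cost, next_features)
    return next_state, reward
```

In this sketch the randomness of the transition comes from two inputs supplied by the caller: `next_features` (drawn from the feature vector distribution) and `market_price` (drawn from the market price distribution), which mirrors the product form of the transition probability.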

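With the MDP set up, the RTB problem becomes the decision problem of finding the optimal Action in the MDP. As with a conventional MDP, the paper uses Value Iteration to find the optimal Value function, and then derives the optimal bidding strategy from it. However, this approach only suits relatively small datasets, because the first stage, computing the optimal Value function, is too time-consuming. For large-scale data, the paper proposes learning a representation of the Value function on small data and then applying it to the large-scale data.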
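The two stages described above (compute the Value function, then read the bid off it) can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not the paper's algorithm: the market price distribution `market_pmf` is taken to be independent of the feature vector, the Value function is indexed only by (auctions remaining, budget) using an average CTR `avg_ctr`, prices are integers, and the winner pays the market price. All function and parameter names are hypothetical.

```python
from typing import List


def action_value(v_prev: List[float], b: int, a: int,
                 market_pmf: List[float], ctr: float) -> float:
    """Expected clicks from bidding `a` with budget `b`, then following v_prev."""
    # Win when market price d <= a: collect ctr, pay d (second-price assumption).
    win = sum(market_pmf[d] * (ctr + v_prev[b - d]) for d in range(a + 1))
    # Lose otherwise: no reward, budget unchanged.
    lose = sum(market_pmf[a + 1:]) * v_prev[b]
    return win + lose


def value_iteration(T: int, B: int, market_pmf: List[float],
                    avg_ctr: float) -> List[List[float]]:
    """Return V where V[t][b] is the expected clicks with t auctions left and budget b."""
    max_price = len(market_pmf) - 1
    V = [[0.0] * (B + 1) for _ in range(T + 1)]  # V[0][b] = 0: no auctions left
    for t in range(1, T + 1):
        for b in range(B + 1):
            V[t][b] = max(
                action_value(V[t - 1], b, a, market_pmf, avg_ctr)
                for a in range(min(b, max_price) + 1)
            )
    return V


def best_bid(V: List[List[float]], t: int, b: int,
             market_pmf: List[float], ctr: float) -> int:
    """Pick the bid that maximizes expected clicks for this impression's CTR."""
    max_price = len(market_pmf) - 1
    return max(
        range(min(b, max_price) + 1),
        key=lambda a: action_value(V[t - 1], b, a, market_pmf, ctr),
    )
```

Even in this reduced form the table has (T + 1) × (B + 1) entries and each entry scans every feasible bid, which gives a feel for why the exact computation becomes too slow at real campaign volumes and budgets.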

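The paper runs experiments on two datasets, one from PinYou and one from YOYI, both fairly large as RTB datasets currently go. In the experimental results, the MDP approach substantially improves CTR and the other metrics compared with the other methods. Beyond these two datasets, the paper also evaluates the method on the live system of Vlion DSP, where CTR stays roughly on par with the previous method while CPM and eCPC both improve.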
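For readers less familiar with the metrics named above, the following sketch computes them from raw campaign totals using their standard definitions. The function name `campaign_metrics` and its parameters are hypothetical, and the numbers in the usage line are made up for illustration; they are not results from the paper.

```python
from typing import Dict


def campaign_metrics(impressions: int, clicks: int, cost: float) -> Dict[str, float]:
    """Compute CTR, CPM, and eCPC from campaign totals."""
    # CTR: clicks per won impression.
    ctr = clicks / impressions if impressions else 0.0
    # CPM: cost per thousand won impressions.
    cpm = 1000.0 * cost / impressions if impressions else 0.0
    # eCPC: effective cost per click; undefined (infinite) with zero clicks.
    ecpc = cost / clicks if clicks else float("inf")
    return {"ctr": ctr, "cpm": cpm, "ecpc": ecpc}


# Illustrative totals only: 1000 impressions, 5 clicks, total cost 20.0.
example = campaign_metrics(impressions=1000, clicks=5, cost=20.0)
```

Lower CPM and eCPC at a similar CTR mean the same click performance was bought for less money, which is the sense in which the online result is an improvement.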

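In short, this paper is a useful reference for anyone who wants to explore applying reinforcement learning to advertising, recommendation, search, and similar areas. As things stand, the algorithm is still fairly complex, and approximating the Value function may cost a non-trivial amount of performance. In addition, the paper's reference list is very thorough. For readers who want to learn about RTB, it is a rare concise and to-the-point introduction.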

Abstract:

The majority of online display ads are served through real-time bidding (RTB) --- each ad display impression is auctioned off in real-time when it is just being generated from a user visit. To place an ad automatically and optimally, it is critical for advertisers to devise a learning algorithm to cleverly bid an ad impression in real-time. Most previous works consider the bid decision as a static optimization problem of either treating the value of each impression independently or setting a bid price to each segment of ad volume. However, the bidding for a given ad campaign would repeatedly happen during its life span before the budget runs out. As such, each bid is strategically correlated by the constrained budget and the overall effectiveness of the campaign (e.g., the rewards from generated clicks), which is only observed after the campaign has completed. Thus, it is of great interest to devise an optimal bidding strategy sequentially so that the campaign budget can be dynamically allocated across all the available impressions on the basis of both the immediate and future rewards. In this paper, we formulate the bid decision process as a reinforcement learning problem, where the state space is represented by the auction information and the campaign's real-time parameters, while an action is the bid price to set. By modeling the state transition via auction competition, we build a Markov Decision Process framework for learning the optimal bidding policy to optimize the advertising performance in the dynamic real-time bidding environment. Furthermore, the scalability problem from the large real-world auction volume and campaign budget is well handled by state value approximation using neural networks.
