Revisiting a Cornerstone of OpenAI's RLHF Work: Exploring Scaling Laws in the RL and RM Stages

吃果冻不吐果冻皮 · WeChat Official Account · 2024-11-14 08:39

Original: https://zhuanlan.zhihu.com/p/3654680219

This time we revisit a very well-known paper:

Paper: Scaling Laws for Reward Model Overoptimization
Abs: https://arxiv.org/abs/2210.10760

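This paper is arguably a cornerstone of RLHF. A team that can fully understand and reproduce it has essentially reached the RLHF level of ChatGPT or InstructGPT. The reality in China is harsher, though: very few companies can truly reproduce this paper in full. Although many companies claim their models match GPT-4 or even GPT-4o, domestic technical capability may in fact not yet have reached what OpenAI had at the time of ChatGPT. (That said, apart from the top three North American labs, it is not clear that anyone else has reproduced it either.)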

Motivation

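This paper explores scaling laws in the RL and RM stages:

RLHF (both BoN and PPO) uses the RM as a proxy objective, which leads to overoptimization, i.e. reward hacking. The paper's main questions are whether more data or more model parameters can mitigate this problem, and whether that improvement follows a scaling law.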
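To make the proxy-objective problem concrete, here is a toy best-of-n simulation. It is only a sketch under stated assumptions, not the paper's setup: the gold score is standard normal, the proxy is the gold score plus independent Gaussian noise, and the names `simulate_bon`, `bon_kl` and `noise_std` are hypothetical. The KL expression log n - (n - 1)/n is the analytic BoN formula the paper relies on.

```python
import numpy as np

def bon_kl(n: int) -> float:
    """KL between the best-of-n policy and the base policy: log n - (n - 1) / n."""
    return float(np.log(n) - (n - 1) / n)

def simulate_bon(n: int, noise_std: float = 1.0, trials: int = 20000, seed: int = 0):
    """Pick the best of n samples by PROXY score; return mean proxy and mean gold score.

    Toy assumption: gold ~ N(0, 1) and proxy = gold + N(0, noise_std^2).
    """
    rng = np.random.default_rng(seed)
    gold = rng.standard_normal((trials, n))
    proxy = gold + noise_std * rng.standard_normal((trials, n))
    pick = proxy.argmax(axis=1)          # selection only ever sees the proxy
    rows = np.arange(trials)
    return float(proxy[rows, pick].mean()), float(gold[rows, pick].mean())

for n in (1, 4, 16, 64):
    proxy_score, gold_score = simulate_bon(n)
    print(f"n={n:3d}  KL={bon_kl(n):.3f}  proxy={proxy_score:.3f}  gold={gold_score:.3f}")
```

In this toy model the gold score never actually declines; it only falls further and further behind the proxy score as n grows. The rise-then-fall shape discussed in the paper needs a proxy whose errors get exploited more strongly than independent Gaussian noise allows.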

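Main Conclusions

With $d := \sqrt{D_{\text{KL}}(\pi \,\|\, \pi_{\text{init}})}$, the gold reward is fitted by two functional forms:

$R_{\text{bon}}(d) = d\,(\alpha_{\text{bon}} - \beta_{\text{bon}}\, d)$
$R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)$

What these forms tell us:

• RL optimization is paid for in KL distance: as KL grows in the early phase, both $R_{\text{bon}}$ and $R_{\text{RL}}$ first rise with KL and then fall.
• But $R_{\text{bon}}$ falls faster, since its penalty grows as $d^2$ while RL's grows only as $d \log d$.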

Let's plot it

Hyperparameters:

alpha_bon = 2.5, beta_bon = 0.05, alpha_rl = 2.5, beta_rl = 0.5

import numpy as np
import matplotlib.pyplot as plt

# Coefficients of the two functional forms
alpha_bon = 2.5
beta_bon = 0.05
alpha_rl = 2.5
beta_rl = 0.5

# Range of d (d >= 1 keeps log(d) non-negative)
d_values = np.linspace(1, 100, 400)

# Evaluate both formulas; np.log is the natural logarithm
R_bon_values = d_values * (alpha_bon - beta_bon * d_values)
R_rl_values = d_values * (alpha_rl - beta_rl * np.log(d_values))

# Create the figure
plt.figure(figsize=(12, 6))

# Best-of-n (BoN) sampling panel
# \mathrm is used instead of \text because it parses in every matplotlib mathtext version
plt.subplot(1, 2, 1)
plt.plot(d_values, R_bon_values, label=r'$R_{\mathrm{bon}}(d) = d (\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}} d)$')
plt.title('Best-of-n (BoN) Sampling')
plt.xlabel('d')
plt.ylabel(r'$R_{\mathrm{bon}}(d)$')
plt.legend()
plt.grid(True)

# Reinforcement learning (RL) panel
plt.subplot(1, 2, 2)
plt.plot(d_values, R_rl_values, label=r'$R_{\mathrm{RL}}(d) = d (\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)$')
plt.title('Reinforcement Learning')
plt.xlabel('d')
plt.ylabel(r'$R_{\mathrm{RL}}(d)$')
plt.legend()
plt.grid(True)

# Show the figure
plt.tight_layout()
plt.show()
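The peaks of the two curves can also be located in closed form by setting the derivative to zero. The sketch below does this for R_bon(d) = d (alpha - beta d) and R_RL(d) = d (alpha - beta log d), using the same coefficient values as the plotting code. The helper names `bon_peak` and `rl_peak` are hypothetical, and log is taken as the natural logarithm, matching `np.log`.

```python
import math

def bon_peak(alpha: float, beta: float):
    """Peak of R(d) = d * (alpha - beta * d).

    dR/dd = alpha - 2 * beta * d = 0  =>  d* = alpha / (2 * beta)
    """
    d_star = alpha / (2 * beta)
    return d_star, d_star * (alpha - beta * d_star)

def rl_peak(alpha: float, beta: float):
    """Peak of R(d) = d * (alpha - beta * ln d).

    dR/dd = alpha - beta * ln d - beta = 0  =>  d* = exp(alpha / beta - 1)
    At the peak, alpha - beta * ln d* = beta, so R(d*) = beta * d*.
    """
    d_star = math.exp(alpha / beta - 1)
    return d_star, d_star * (alpha - beta * math.log(d_star))

d_bon, r_bon = bon_peak(2.5, 0.05)   # peak near d = 25
d_rl, r_rl = rl_peak(2.5, 0.5)       # peak near d = e^4
print(f"BoN peak: d = {d_bon:.2f}, R = {r_bon:.2f}")
print(f"RL  peak: d = {d_rl:.2f}, R = {r_rl:.2f}")
```

With these particular coefficients the BoN curve peaks earlier and then drops below zero at d = alpha / beta = 50, while the RL curve is still positive at d = 100.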

The actual figure:

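Other conclusions:

• 1. As KL grows, BoN both optimizes and over-optimizes more readily than RL.
• 2. As the model's parameter count grows, the α and β coefficients scale smoothly along with it.
• 3. Policy size does not affect the final gold reward. (Somewhat questionable.)
• 4. The KL penalty does not affect these results. (Somewhat questionable.)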

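The problems are:

• 1. The score distribution of a synthetic reward does not necessarily match the distribution of a real reward, and the noise in real rewards is not taken into account.
• 2. Because the study targets over-optimization, what gets measured is the gold reward on the training reward setup; generalization and OOD behavior are not considered.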

Setting

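• 1. The same setting as InstructGPT is used.
• 2. All RMs output the RM score through an added scalar head.
• 3. RL uses PPO, with the KL penalty set to 0.
• 4. A 6B model serves as the gold reward for the 3B reward model. This setting is quite problematic: labels produced by a model are easier for the 3B model to learn and carry little noise.
• 5. A validation set is used to renormalize the RM scores and recalibrate the proxy RMs. Many repos do not seem to implement this detail.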

The RM scores are translation-invariant, so to ensure comparability across different reward models, we recenter each RM such that the average reward of the initial policy is 0. We also unit normalize the variance of the gold RM scores. Because our hard thresholding synthetic data setup produces labels that are miscalibrated (since they do not incorporate the gold RM’s confidence), we recalibrate the proxy RMs by rescaling the logits to minimize cross-entropy loss using a validation set of soft labels. All renormalization and recalibration is applied after the experiments; this does not affect BoN at all, and likely has no impact on RL because Adam is loss scale invariant, though it is possible that there are slight differences due to algorithmic details.
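The three steps in the quoted passage (recentering, unit-variance normalization, and rescaling logits against soft labels) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: it assumes a pairwise RM whose preference probability is sigmoid(r_a - r_b), replaces whatever optimizer the authors used with a simple grid search, and simulates an over-confident proxy as 3x the gold logit difference. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recenter(scores, init_policy_scores):
    """Shift scores so that the initial policy's average reward is 0."""
    return scores - np.mean(init_policy_scores)

def unit_normalize(scores, reference_scores):
    """Divide by the standard deviation of a reference set of scores."""
    return scores / np.std(reference_scores)

def cross_entropy(logit_diff, soft_labels, scale):
    """Cross-entropy between sigmoid(scale * logit_diff) and the soft labels."""
    p = np.clip(sigmoid(scale * logit_diff), 1e-7, 1 - 1e-7)
    return float(-np.mean(soft_labels * np.log(p) + (1 - soft_labels) * np.log(1 - p)))

def recalibrate_scale(logit_diff, soft_labels, candidates=None):
    """Grid-search the single scalar that minimizes validation cross-entropy."""
    if candidates is None:
        candidates = np.linspace(0.05, 5.0, 400)
    losses = [cross_entropy(logit_diff, soft_labels, s) for s in candidates]
    return float(candidates[int(np.argmin(losses))])

# Simulated validation set (assumption): the gold RM gives soft labels, and the
# proxy trained on hard-thresholded labels is over-confident by a factor of 3.
rng = np.random.default_rng(0)
gold_diff = rng.standard_normal(5000)     # gold r_a - r_b for each pair
soft_labels = sigmoid(gold_diff)          # gold RM's confidence that a beats b
proxy_diff = 3.0 * gold_diff              # over-confident proxy logits

scale = recalibrate_scale(proxy_diff, soft_labels)
print(f"fitted scale: {scale:.3f}")       # close to 1/3, undoing the over-confidence
```

Because a single scalar rescales every logit, the ranking of samples is unchanged, which is consistent with the quote's remark that the procedure does not affect BoN at all.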

Detailed Results

Scaling Laws Obtained by Scaling RM Parameters

Scaling with RM Data Size

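• Below 2000 pairs, the results do not scale noticeably.
• Above 2000 pairs, the results do scale.
• One interesting point: although larger reward models (RMs) score better overall, they show no significant advantage over smaller models in reaching that critical threshold earlier. This suggests the gold reward is still highly model-specific: even a large model has to fit that optimization direction.
• To argue that optimization = generalization, the authors plot BoN performance on the training reward against the reward model's validation loss, but this conclusion is also questionable.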

Scaling with Policy Size

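Conclusion: increasing the policy size does not improve the model's performance. This is a strange conclusion, because it would mean the RM completely dominates the policy, which should not be possible.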