To better understand the improvements that GRPO brings, let us first briefly review the mainstream RL algorithms that preceded it, including GRPO's predecessor PPO (Proximal Policy Optimization) and DSAC-T (Distributional Soft Actor-Critic with Three Refinements), which currently performs best in the embodied-intelligence domain. These algorithms were originally designed for relatively small models (e.g., under 1B parameters) and for tasks such as autonomous driving, robotics, and games. They adopt an actor-critic architecture, in which a value-function model provides the basis for policy improvement.
For example, PPO needs to compute the advantage function $A^\pi(s, a)$, which measures how much better it is to take action $a$ in state $s$ than the average case:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),$$

where $Q^\pi(s, a)$ is the action-value function and $V^\pi(s)$ is the state-value function, i.e., the expected future return from the current state under the current policy.
We will not expand on the algorithmic details here; see Section 10.4.5 of Reinforcement Learning for Sequential Decision and Optimal Control (hereafter RLBook).
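As a side note, PPO implementations rarely estimate $Q^\pi$ explicitly; a common choice is Generalized Advantage Estimation (GAE), which assembles the advantage from one-step TD errors produced by the learned critic. Below is a minimal NumPy sketch of this idea for a single finished trajectory; the function name `estimate_advantages`, the hyperparameter defaults, and the toy numbers are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def estimate_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE) for a single trajectory.

    rewards: per-step rewards r_t, shape (T,)
    values:  critic estimates V(s_t), shape (T + 1,); the last entry
             bootstraps the final state (0 if the episode terminated)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE accumulates an exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy usage with made-up numbers (episode terminates, so V(s_T) = 0)
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.0])
print(estimate_advantages(rewards, values))
```

Setting `lam = 1` recovers the Monte Carlo estimate (empirical return minus $V$), while `lam = 0` reduces to the one-step TD error; intermediate values trade off bias against variance.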
[1] Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models[J]. arXiv preprint arXiv:2402.03300, 2024.
[2] Guo D, Yang D, Zhang H, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint arXiv:2501.12948, 2025.
[3] Li S E. Reinforcement Learning for Sequential Decision and Optimal Control[M]. Springer, 2023. https://link.springer.com/book/10.1007/978-981-19-7784-8
[4] Sutton R S, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation[J]. Advances in Neural Information Processing Systems, 1999, 12.
[5] Guan Y, Li S E, Duan J, et al. Direct and indirect reinforcement learning[J]. International Journal of Intelligent Systems, 2021, 36(8): 4439-4467.
[6] Duan J, Guan Y, Li S E, et al. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 33(11): 6584-6598.
[7] Duan J, Wang W, Xiao L, et al. DSAC-T: Distributional soft actor-critic with three refinements[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[8] Lightman H, Kosaraju V, Burda Y, et al. Let's verify step by step[J]. arXiv preprint arXiv:2305.20050, 2023.