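On a test set built by the author, the model compares with several mainstream models (o1, 4o, Deepseek Math 7B) as follows (N denotes difficulty; a larger N means a harder puzzle):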
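| Test set difficulty N | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Openai-o1-1217 | 0.83 | 0.51 | 0.38 | 0.38 | 0.35 | 0.30 | 0.20 |
| GPT-4o | 0.68 | 0.57 | 0.49 | 0.32 | 0.23 | 0.21 | 0.11 |
| Deepseek-Math-7B | 0.35 | 0.21 | 0.08 | 0.06 | 0.02 | 0.00 | 0.00 |
| Ours(7B) | 0.68 | 0.59 | 0.44 | 0.34 | 0.22 | 0.16 | 0.15 |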
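The table shows clear differences between the models as test-set difficulty rises from N=2 to N=8:
1. Overall trend: accuracy falls for every model as difficulty increases, but by different amounts. Openai-o1-1217 and GPT-4o do best at low difficulty (0.83 and 0.68 at N=2), yet drop to 0.20 and 0.11 at N=8, relative declines of 76% and 84%.
2. Model comparison:
a. The reproduced model is close to GPT-4o at low difficulty (N=2-3: 0.68 vs 0.68, 0.59 vs 0.57) and stays in the same range at medium and high difficulty (N=4-8). At N=5 its accuracy of 0.34 is slightly above GPT-4o's 0.32 but below Openai-o1-1217's 0.38; at N=8 its accuracy (0.15) is higher than GPT-4o's (0.11). GPT-4o remains ahead at N=4, 6 and 7 (0.49 vs 0.44, 0.23 vs 0.22, 0.21 vs 0.16).
b. Deepseek-Math-7B trails at every difficulty level and approaches 0 at high difficulty (N≥6), which points to insufficient optimization for this kind of reasoning.
3. Parameter efficiency: the reproduced model and Deepseek-Math-7B are both 7B-parameter models, yet the former is far more accurate (for example, 0.59 vs 0.21 at N=3), which suggests that the multi-stage RL training strategy is effective.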
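At the same 7B parameter count, this training strategy brings the model to accuracy comparable with a larger model such as GPT-4o across difficulty levels, and ahead of it at N=3, 5 and 8, while even Openai-o1-1217 reaches only 0.20 at N=8. The result highlights the logical reasoning ability of the reproduced model.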
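The comparisons above can be checked directly against the table. The following is a minimal sketch, not part of the original project: the accuracy values are copied from the table, while `relative_drop` and `levels_where_ahead` are hypothetical helper names introduced only for illustration.

```python
# Accuracy per difficulty N = 2..8, copied from the results table.
ACCURACY = {
    "Openai-o1-1217":   [0.83, 0.51, 0.38, 0.38, 0.35, 0.30, 0.20],
    "GPT-4o":           [0.68, 0.57, 0.49, 0.32, 0.23, 0.21, 0.11],
    "Deepseek-Math-7B": [0.35, 0.21, 0.08, 0.06, 0.02, 0.00, 0.00],
    "Ours(7B)":         [0.68, 0.59, 0.44, 0.34, 0.22, 0.16, 0.15],
}
DIFFICULTIES = list(range(2, 9))


def relative_drop(scores):
    """Relative accuracy loss from the easiest (N=2) to the hardest (N=8) level."""
    return (scores[0] - scores[-1]) / scores[0]


def levels_where_ahead(model, baseline):
    """Difficulty levels at which `model` is strictly more accurate than `baseline`."""
    pairs = zip(DIFFICULTIES, ACCURACY[model], ACCURACY[baseline])
    return [n for n, a, b in pairs if a > b]


for name, scores in ACCURACY.items():
    print(f"{name}: {relative_drop(scores):.0%} drop from N=2 to N=8")
print("Ours(7B) ahead of GPT-4o at N =", levels_where_ahead("Ours(7B)", "GPT-4o"))
```

Comparing per level rather than per difficulty band makes it explicit at which values of N one model actually leads the other.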
1. **William's statement**: "William is a knight and Chloe is a knight."
- If William is telling the truth (i.e., he is a knight), then both parts of his statement must be true. However, if William is a knight, the statement "William is a knight" is true, but the statement "Chloe is a knight" cannot determine the truth about Chloe at this point since we haven't analyzed Chloe's statement yet. Therefore, William cannot be a knight because it would lead to a contradiction.
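For example, the phrase "at this point since we haven't analyzed Chloe's statement yet" in the output above reflects a cautious reflection mechanism: when the model meets uncertainty, it flags the current reasoning step and pauses (much like a person saying "wait, let me think this through first"). This learned pausing behavior significantly improves its ability to solve complex problems.
■ 1.2.2 Multi-path exploration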
2. **Now that we know William is a knave, let's analyze Chloe's statement**: "Lily is a knight."
- Since William is a knave, Chloe's statement can be either true or false. If Chloe is a knight, her statement mu
- Let's test both possibilities.
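"Let's test both possibilities" shows the dual-track thinking the model has learned: faced with a binary choice, it tests both possibilities side by side, the way a scientist would. This multi-path exploration can greatly improve the accuracy of the model's reasoning.
■ 1.2.3 Backtracking analysis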
4. **Analyzing Lily's statement**: "William is a knave and Jack is a knight."
- Since William is a knave and Lily is a knight, her statement is true. This means Lily is indeed telling the truth.
5. **Analyzing Jack's statement**: "William is a knave."
- Since William is a knave, Jack's statement is true, so Jack is a knight.
6. **Analyzing Logan's statement**: "Lily is a knave if and only if William is a knave."
- Since Lily is a knight and William is a knave, this statement is false. Therefore, Logan is a knave.
7. **Analyzing Chloe's statement again**: We already determined that if Chloe is a knight, her statement "Lily is a knight" is true, so she is a knight.
Now, we have determined:
- William is a knave.
- Chloe is a knight.
- Lily is a knight.
- Jack is a knight.
- Logan is a knave.
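The example above shows the model's ability to summarize at intermediate stages: it collects its earlier conclusions so that the subsequent verification step is easier to carry out.
■ 1.2.5 Final verification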
Let's verify all statements:
- William: "William is a knight and Chloe is a knight" (False, since William is a knave).
- Chloe: "Lily is a knight" (True).
- Logan: "Lily is a knave if and only if William is a knave" (False, since Lily is a knight and William is a knave).
- Jack: "William is a knave" (True).
- Lily: "William is a knave and Jack is a knight" (True).
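The same verification can be done mechanically. The sketch below assumes the standard K&K rule (a knight's statement is true, a knave's is false) and encodes the five statements as they are quoted in the model's trace; the original puzzle text is not shown here, so this encoding is a reconstruction. `is_consistent` and `solve` are hypothetical helper names, not part of the project's code.

```python
from itertools import product

NAMES = ["William", "Chloe", "Lily", "Jack", "Logan"]

# Each statement maps an assignment (name -> True if knight) to its truth value.
# Statements are reconstructed from the quotes in the model's reasoning trace.
STATEMENTS = {
    "William": lambda k: k["William"] and k["Chloe"],
    "Chloe":   lambda k: k["Lily"],
    "Lily":    lambda k: (not k["William"]) and k["Jack"],
    "Jack":    lambda k: not k["William"],
    "Logan":   lambda k: (not k["Lily"]) == (not k["William"]),
}


def is_consistent(assignment):
    """A knight's statement must be true and a knave's must be false."""
    return all(
        assignment[name] == bool(statement(assignment))
        for name, statement in STATEMENTS.items()
    )


def solve():
    """Enumerate all 2^5 knight/knave assignments and keep the consistent ones."""
    solutions = []
    for values in product([True, False], repeat=len(NAMES)):
        assignment = dict(zip(NAMES, values))
        if is_consistent(assignment):
            solutions.append(assignment)
    return solutions


# The model's final answer: True means knight, False means knave.
model_answer = {"William": False, "Chloe": True, "Lily": True, "Jack": True, "Logan": False}
print("model answer is consistent:", is_consistent(model_answer))
print("model answer is the only solution:", solve() == [model_answer])
```

Exhaustive enumeration is cheap at this size, so it also confirms that the answer is unique rather than merely consistent.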
Since both assuming Sofia is a knight and a knave lead to contradictions, we need to re-evaluate our initial assumption. However, the only consistent path is to recheck the initial assumption that William is a knight. Let's assume William is a knight again and proceed with the same logic to ensure no contradictions arise. The consistent path will be威廉不是一个有效的输入,让我继续使用正常的推理方式来解决这个问题,意味着威廉的陈述是正确的。再次从威廉开始:威廉说:"威廉是骑士或者索菲亚是骑士"。假设威廉是骑士,那么他的陈述必须为真,因此"威廉是骑士或者索菲亚是骑士"为真,这接下来检查索菲亚的陈述:"诺亚不是骑士",假设索菲亚是骑士,那么她的陈述必须为真,意味着诺亚确实不是骑士(即诺亚是骗子)。现在检查诺亚的陈述:"佐伊不是骗子"。如果诺亚是骗子,那么这个陈述为假,这意味着"佐伊不是骗子"为假,即佐伊是骗子。如果佐伊是骗子,那么她所做的陈述"如果诺亚是骑士那么梅森是骗子"将为假。因为如果诺亚是骑士,那么如果诺亚是骑士那么梅森是骗子"将为真,但由于佐伊是骗子,她应该说这导致矛盾。因此,索菲亚不能是骑士。既然索菲亚不能是骑士,那么她必须是骗子。因此,她的陈述"诺亚不是骑士"为假,这意味着诺亚是骑士。现在我们检查诺亚的陈述"佐伊不是骗子"。既然诺亚是骑士,这个陈述必须为真,所以佐伊确实不是骗子,佐伊是骑士。接下来检查佐伊的陈述:"如果诺亚是骑士那么梅森是骗子因为诺亚是骑士,该条件为真,而佐伊的陈述为真,因此佐伊是骑士。然后我们检查梅森的陈述:"威廉是骗子或者佐伊是骗子"。威廉是骑士,佐伊是骑士因此梅森的陈述为假,因此梅森是骗子。最后检查威廉的陈述:"威廉是骑士或者索菲亚是骑士"。威廉是骑士,索菲亚是骗子,因此威廉的陈述为真,这与我们的推理一致。综上所述,我们得到了以下结果:1.威廉是骑士2.索菲亚是骗子3.诺亚是骑士4.佐伊是骑士5.梅森是骗子
(1) William is a knight, (2) Sofia is a knave, (3) Noah is a knight, (4) Zoey is a knight, (5) Mason is a knave<|im_end|>
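The example above shows the model switching from English to a Chinese reasoning framework while analyzing the problem, and then switching back to English at the end.
■ 2.1. Preliminary setup
■ 2.1.1 Training data
Knights and Knaves (K&K) puzzles
This project uses K&K puzzles as the training data for the model's logical reasoning ability. Knights and Knaves (K&K) is a classic family of logic puzzles whose core setting comes from the book "What Is the Name of This Book?" by the American logician Raymond Smullyan.
The puzzles are set on a fictional island whose inhabitants are of two kinds: knights, who always tell the truth, and knaves, who always lie.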
<|im_start|>system\nYou are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Now the user asks you to solve a logical reasoning problem. After thinking, when you finally reach a conclusion, clearly state the identity of each character within <answer> </answer> tags. i.e., <answer> (1) Zoey is a knight\n(2) ... </answer>.\n<|im_end|>\n<|im_start|>user\n{quiz}\n<|im_end|>\n<|im_start|>assistant\n<think>
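Because the prompt asks for identities in a fixed "(1) Name is a knight" form inside answer tags, a completion can be parsed mechanically. The following is a minimal sketch under the assumption that the completion closes its answer with `</answer>` as the template requests; `parse_answer` and both regular expressions are hypothetical and are not the project's actual reward or parsing code.

```python
import re

# Non-greedy match so that each <answer>...</answer> block is captured separately.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Matches items such as "(1) Zoey is a knight".
IDENTITY_RE = re.compile(r"\(\d+\)\s*(\w+) is a (knight|knave)")


def parse_answer(completion):
    """Return {name: role} from the last <answer> block, or None if there is none."""
    blocks = ANSWER_RE.findall(completion)
    if not blocks:
        return None
    return {name: role for name, role in IDENTITY_RE.findall(blocks[-1])}


sample = (
    "reasoning process here </think>\n"
    "<answer> (1) William is a knight\n(2) Sofia is a knave\n(3) Noah is a knight\n"
    "(4) Zoey is a knight\n(5) Mason is a knave </answer>"
)
print(parse_answer(sample))
```

Taking the last answer block is a design choice for this sketch: it ignores any draft answers the model may emit earlier in its output.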