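Also, the model's official name is OpenAI o1, not GPT-o1, which suggests that its technical approach very likely differs somewhat from that of the GPT-4 series. After JS's departure, there is a certain heroic air to it, in the spirit of the verse "the mighty pass is said to be hard as iron, yet today we stride across it from the start." Had the model taken any longer to ship, the code-name joke would probably have been worn out.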
We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
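In this formulation, we define nodes as states, corresponding to s in reinforcement learning, and edges as actions, corresponding to a. The large language model governs the transition from a state to an action, a ~ π (· | s). After each transition, s' = s ⊕ a means that the next state is generated conditionally from the previous state s and the action a; the simplest form of conditional generation is direct concatenation.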
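The transition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: states and actions are plain strings, ⊕ is taken to be concatenation, and `toy_policy` is a hypothetical scripted stand-in for the language model, not how o1 actually samples.

```python
from typing import Callable

State = str
Action = str
Policy = Callable[[State], Action]  # stands in for sampling a ~ pi(. | s)


def transition(state: State, action: Action) -> State:
    """s' = s (+) a, with (+) assumed to be plain concatenation."""
    return state + action


def rollout(policy: Policy, state: State, max_steps: int) -> State:
    """Repeatedly sample an action and append it to the state."""
    for _ in range(max_steps):
        action = policy(state)
        if not action:  # empty action is a hypothetical stop signal
            break
        state = transition(state, action)
    return state


def toy_policy(state: State) -> Action:
    """Hypothetical scripted policy: emits three fixed steps, then stops."""
    script = ["step1;", "step2;", "answer;"]
    taken = state.count(";")
    return script[taken] if taken < len(script) else ""


print(rollout(toy_policy, "Q:", max_steps=10))  # Q:step1;step2;answer;
```

Each call to `policy` sees the whole history, because the state is the concatenation of everything generated so far.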
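Beyond that, under our definition the verifier step in the upper left unifies the behavior of generative and discriminative reward models. A discriminative reward model is the one used in the traditional RLHF pipeline: a scalar-output model built on a BT (Bradley-Terry) model trained on human-collected preference pairs. Given a question-answer pair (s , a), it outputs a numeric score, and a higher score indicates a better response. o1, however, most likely does not rely on a discriminative reward model alone; it probably also has a generative reward model similar to GPT-4 catch bugs [5], one that can output not only a score but also a written judgment. The dashed line therefore denotes the verifier step, modeled as πr ~ π (· | s , a), meaning that the reward model is itself a probabilistic generator.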
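This formulation makes it easy to express several scaling modes for test-time inference. The first is Best-of-N (BoN) search, a very plain parallel search: for a state s, generate N candidates at once, use the reward model as the final verifier, and return the candidate with the highest reward as the answer. BoN is extremely simple, and it scales along the width dimension.
The advantage of this approach is that it is very direct, equivalent to an exhaustive-search strategy in dynamic programming (DP). But because exploration has no heuristic guidance, compute is easily wasted on width. In addition, traditional BoN relies on the reward value of a discriminative reward model for final verification, which is itself a hard task: deciding by a threshold or by max reward is not very reliable. BoN can be understood as inference scaling that is broad in width (space) but shallow in depth (time).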
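A minimal sketch of Best-of-N selection, assuming a sampler that stands in for a ~ π (· | s) and a discriminative reward function over (s , a). The names `best_of_n`, `toy_sample`, and `toy_reward` are hypothetical, and the toy reward is a hand-written stand-in rather than a trained model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    state: str,
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Draw n candidates for one state and keep the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(state) for _ in range(n)]  # width: n independent samples
    scored = [(reward(state, action), action) for action in candidates]
    best_reward, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_reward


rng = random.Random(0)


def toy_sample(state: str) -> str:
    """Hypothetical sampler: guesses a digit uniformly at random."""
    return str(rng.randint(0, 9))


def toy_reward(state: str, action: str) -> float:
    """Hypothetical reward: closer to 4 is better for the prompt '2+2='."""
    return -abs(int(action) - 4)


answer, score = best_of_n("2+2=", toy_sample, toy_reward, n=8)
print(answer, score)
```

Nothing here uses one candidate to guide the next, which is the lack of heuristic guidance noted above: raising `n` only widens the search, and the result is only as reliable as the reward function.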
First, the cipher seems connected to the plaintext.
Given the time constraints, perhaps the easiest way is to try to see patterns.
Option 1: Try to find mapping from letters to letters.
Do any letters match?
First, let's write down the ciphertext and plaintext letters on top of each other.
Ciphertext: o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x
Plaintext: T h i n k s t e p b y s t e p
Wait a minute.
I think maybe there is an anagram or substitution cipher here.
Alternatively, I think that we can notice that each group of ciphertext corresponds to a plaintext word.
Check the number of letters.
First ciphertext word: oyfjdnisdr (10 letters)
First plaintext word: Think (5 letters)
Hmm.
Second ciphertext word: rtqwainr (8 letters)
Second plaintext word: step (4 letters)
Third ciphertext word: acxz (4 letters)
Third plaintext word: by (2 letters)
Fourth ciphertext word: mynzbhhx (8 letters)
Fourth plaintext word: step (4 letters)
Interesting.
It seems that the ciphertext words are exactly twice as long as the plaintext words.
(10 vs 5, 8 vs 4, 4 vs 2, 8 vs 4)
Idea: Maybe we need to take every other letter or rebuild the plaintext from the ciphertext accordingly.
Let's test this theory.
If we take every other letter from the ciphertext words, maybe we'll get the plaintext.
Let's try this for the first word:
Ciphertext word: o y f j d n i s d r (10 letters)
Alternate letters:
First letters (positions 1,3,5,7,9): o f d i d
Second letters (positions 2,4,6,8,10): y j n s r
Wait: Let's try taking every second letter or first letter.
Option 1: Take the odd positions:
Positions: 1 3 5 7 9
Letters: o f d i d
Plaintext word is "Think", letters are T h i n k
Hmm.
Perhaps we can map letters:
o f d i d --> Think
Not directly
Option 2: Try mapping as per an assigned code: perhaps columns of letters?
Alternatively, perhaps the cipher is more complex.
Alternatively, notice that "oyfjdnisdr" has 10 letters and "Think" has 5 letters.
What if each plaintext letter is being represented by two ciphertext letters?
Similarly for other words.
Suppose that each plaintext letter corresponds to pair of letters in the ciphertext.
Meaning that the first two letters of ciphertext map to the first plaintext letter, and so on.
So for the first word, partition the ciphertext letters into pairs:
oy fj dn is dr
Corresponding to plaintext letters T h i n k
So we can attempt to map:
oy -> T
fj -> h
dn -> i
is -> n
dr -> k
If this is the mapping, perhaps each pair of letters maps to one plaintext letter.
Similarly for other words.
Let's test this mapping.
Is there a consistent mapping from pairs to letters?
Let's make a table of pairs and corresponding letters.
First, make the pairs from the first word and their corresponding plaintext letters:
Ciphertext Pairs -> Plaintext Letter
oy -> T
fj -> h
dn -> i
is -> n
dr -> k
Now, what's the mapping from "oy" to 'T'?
Is there a way to map 'oy' to 'T'?
Similarly for others.
One possible way is that 'oy' represents 'T'.
But how?
Option 1: Sum of the letters.
Convert letters to numbers (A=1, B=2, ..., Z=26)
So let's try to get to 'T' from 'o' and 'y'
o = 15
y = 25
Sum: 15 +25 = 40
But 'T' is 20.
Alternatively, perhaps subtract: 25 -15 = 10.
No.
Alternatively, perhaps combine the numbers in some way.
Alternatively, think about their positions in the alphabet.
Alternatively, perhaps the letters are encrypted via a code.
Alternatively, perhaps if we overlay the word 'Think' over the cipher pairs 'oy', 'fj', etc., the cipher is formed by substituting each plaintext letter with two letters.
Alternatively, perhaps consider the 'original' letters.
Let's look at the letters between the word "Think" and "oyfjdnisdr"
Perhaps we can spot that "oyfjdnisdr" is an anagram for "Thinkfiles"?
But that seems unlikely.
Alternatively, think of the pairs.
Suppose we assign numbers to letters: a=1, b=2, ..., z=26
Consider the numbers in 'oy':
o = 15
y = 25
Add them: 15 + 25 = 40
Now, for 'T' we have:
'T' is number 20
What if we take the sum of the cipher pair mod 26.
15 +25 = 40
Then 40 mod 26 = 14
14 corresponds to 'N'
Doesn't match.
Alternatively, see if the cipher pairs correspond to something else.
Alternatively, perhaps the pair 'oy' corresponds to 'T' via key.
Alternatively, perhaps it's as simple as 'oy' is 'T'.
But 'o' corresponds to 'T', subtracting 'o' ASCII code from 'T'.
Now, let's think in terms of symmetric letters.
Alternatively, perhaps we can spot that the letters correspond to the first letters of words.
But perhaps an easier way is to look at the word positions.
Alternatively, perhaps we can see that the ciphertext letters are offsets from the plaintext letters.
So let's try to map the pairs to the plaintext letters.
First pair 'oy' and plaintext letter 'T'
Letter 'o' is 15, 'T' is 20
Letter 'o' + 5 = 'T' (15 +5=20)
But 'y' is 25, which is 5 letters after 'T' (20 +5=25)
Interesting, seems that 'o' +5 = 'T'
Similarly, 'y' +1 = 'Z', which is not 'T'.
Alternatively, maybe the average of the two letters corresponds to 'T'
(15 +25)/2 =20, which is 'T'
Wait a minute, that seems promising.
First pair: 'o' (15) + 'y' (25) = 40
40 /2 =20
20 corresponds to 'T'
So perhaps the average of the letters in the pair corresponds to the plaintext letter.
Let's check this with the second pair.
Second pair: 'fj' corresponding to 'h'
Letters 'f'=6, 'j'=10
Sum: 6+10=16
Average:16/2=8
8 corresponds to 'h' (since 'h' is 8)
Success!
Third pair: 'dn' to 'i'
'd'=4, 'n'=14
Sum:4+14=18
Average:18/2=9
9 corresponds to 'i'(9='i')
But 'i' is 9, so that seems off by 1.
So perhaps we need to think carefully about letters.
Wait, 18/2=9, 9 corresponds to 'I'
So this works.
Fourth pair: 'is' corresponding to 'n'
'i'=9, 's'=19
Sum:9+19=28
Average:28/2=14
14 corresponds to 'n'(14='N')
Yes!
Fifth pair: 'dr' corresponds to 'k'
'd'=4, 'r'=18
Sum:4+18=22
Average:22/2=11
11 corresponds to 'k'(11='K')
Perfect!
So our code is: For each pair, sum their numeric values, divide by 2, get the corresponding letter.
Alternatively, the average of the numeric values of the letters corresponds to the plaintext letter.
This seems to be a valid method.
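As an aside to the trace, the rule it has just stated (average the alphabet positions of each ciphertext pair) can be checked with a short Python sketch. The names `decode_word` and `decode` are hypothetical and are not part of the quoted output; the sketch assumes lowercase a-z input with words separated by spaces.

```python
def decode_word(cipher_word: str) -> str:
    """Map each pair of ciphertext letters to the letter at their average position."""
    if len(cipher_word) % 2 != 0:
        raise ValueError("ciphertext word must have an even number of letters")
    plain = []
    for i in range(0, len(cipher_word), 2):
        first = ord(cipher_word[i]) - ord("a") + 1       # a=1, ..., z=26
        second = ord(cipher_word[i + 1]) - ord("a") + 1
        total = first + second
        if total % 2 != 0:
            raise ValueError("pair does not average to a whole letter position")
        plain.append(chr(total // 2 + ord("a") - 1))
    return "".join(plain)


def decode(ciphertext: str) -> str:
    """Decode a space-separated ciphertext word by word."""
    return " ".join(decode_word(word) for word in ciphertext.split())


print(decode("oyfjdnisdr rtqwainr acxz mynzbhhx"))  # think step by step
```

Applied to the example ciphertext, the rule reproduces the known plaintext for all four words, not just the first.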
Now let's test this with the other words.
Second word ciphertext: rtqwainr
Ciphertext pairs:
rt -> s
qw -> ?
ai -> ?
nr -> ?
Corresponding to plaintext letters 's', 't', 'e', 'p'