RM scores are translation-invariant, so to ensure comparability across different reward models, we recenter each RM so that the average reward of the initial policy is 0. We also normalize the gold RM scores to have unit variance.
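Concretely, this renormalization is a simple affine transformation. A minimal sketch (the function names, and the choice of which sample the variance is estimated from, are illustrative assumptions, not specified in the text):

```python
import numpy as np

def recenter(scores: np.ndarray, init_policy_scores: np.ndarray) -> np.ndarray:
    """Shift an RM's scores so the initial policy's mean reward is 0.

    RM scores are translation-invariant, so this only affects comparability
    across reward models, not any substantive quantity.
    """
    return scores - np.mean(init_policy_scores)

def normalize_gold(gold_scores: np.ndarray) -> np.ndarray:
    """Rescale gold RM scores to unit variance.

    Assumption: the variance is estimated from the same scores being
    normalized; the source does not specify the exact sample used.
    """
    return gold_scores / np.std(gold_scores)
```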
Because our hard-thresholding synthetic-data setup produces miscalibrated labels (they do not incorporate the gold RM's confidence), we recalibrate the proxy RMs by rescaling their logits to minimize cross-entropy loss on a validation set of soft labels.
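This logit rescaling amounts to temperature scaling with a single scalar. A sketch of one way to fit it, assuming (per a Bradley–Terry preference model) that the proxy RM's logit for a comparison pair is the reward difference between the two completions, and that the soft labels are the gold RM's preference probabilities on the same pairs; all names here are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_logit_scale(proxy_logits: np.ndarray, soft_labels: np.ndarray) -> float:
    """Fit a scalar multiplier on proxy-RM preference logits that minimizes
    cross-entropy against soft labels from the gold RM (temperature scaling)."""
    def xent(alpha: float) -> float:
        # Rescaled preference probability under the proxy RM.
        p = 1.0 / (1.0 + np.exp(-alpha * proxy_logits))
        eps = 1e-12  # numerical safety for the logs
        return -np.mean(soft_labels * np.log(p + eps)
                        + (1 - soft_labels) * np.log(1 - p + eps))

    # A single bounded scalar optimization suffices for one parameter.
    return minimize_scalar(xent, bounds=(1e-3, 1e3), method="bounded").x
```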
All renormalization and recalibration are applied after the experiments. This does not affect BoN at all, since best-of-n selection depends only on the ranking of rewards, which these monotone transformations preserve; it likely has no impact on RL either, because Adam's updates are invariant to rescaling of the loss, though slight differences due to algorithmic details are possible.
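The Adam point follows because scaling the loss by a constant c scales the gradient by c, Adam's first moment by c, and its second moment by c², so the update m̂/(√v̂ + ε) is unchanged up to the ε term. A self-contained demonstration of this invariance (illustrative, not from the source):

```python
import torch

# Two identical parameter vectors: one trained on the loss, one on 10x the loss.
torch.manual_seed(0)
w1 = torch.nn.Parameter(torch.randn(5))
w2 = torch.nn.Parameter(w1.detach().clone())
opt1 = torch.optim.Adam([w1], lr=1e-2)
opt2 = torch.optim.Adam([w2], lr=1e-2)

for _ in range(100):
    for w, opt, scale in ((w1, opt1, 1.0), (w2, opt2, 10.0)):
        opt.zero_grad()
        loss = scale * (w ** 2).sum()  # same objective, rescaled
        loss.backward()
        opt.step()

# The two trajectories coincide up to Adam's epsilon term.
print(torch.max(torch.abs(w1 - w2)).item())  # tiny, eps-driven difference
```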