“Scaling laws in the context of machine learning, particularly in the field of deep learning and AI, refer to empirical relationships that describe how certain performance metrics scale with various resources, such as the amount of data, the size of the model (number of parameters), or the amount of computational power. These laws are often derived from experimental results and are used to predict and guide the design of large-scale AI systems.”
Returning to how the papers define scaling laws: the most widely cited work here is OpenAI's Scaling Laws for Neural Language Models (there is an earlier scaling-law paper from Baidu Research, but the OpenAI paper is more widely recognized).
“We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.”
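For reference, the power-law relationships described in this abstract are usually written in the following form, where $N$ is the model size (non-embedding parameters), $D$ the dataset size in tokens, and $C$ the training compute; the constants $N_c$, $D_c$, $C_c$ and the exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ are fitted empirically (the paper reports its own fitted values, which are not reproduced here):

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
$$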
Because scaling laws are predictive, they also enable many comparisons when exploring model architectures and training methods. For example, for the widely discussed Mamba, RWKV, and Transformer architectures, one only needs to compare the fitted coefficients: whichever fit predicts the lower loss at scale is the better architecture. Similarly, when we vary the architecture, different normalization schemes or different optimization methods can all be compared by fitting scaling laws. This removes the worry that conclusions drawn on small models will not carry over to large ones, because a scaling law compares trends: many methods that look better on small models have a lower slope, so they stop helping once the model grows large.
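As a minimal sketch of this kind of comparison (the architecture names, sizes, and loss numbers below are synthetic, purely for illustration), one can fit $L(N) = a \cdot N^{-\alpha}$ in log-log space to a few small-scale runs of each candidate and extrapolate to a larger target size; the hypothetical `arch_b` looks better at small sizes but has a smaller exponent $\alpha$, so the fit predicts it falls behind at scale:

```python
import numpy as np

# Minimal sketch: compare two hypothetical architectures by fitting the
# power law L(N) = a * N^(-alpha) to small-scale runs and extrapolating.
# All sizes and losses below are synthetic, for illustration only.

def fit_power_law(sizes, losses):
    """Fit log L = log a - alpha * log N by least squares; return (a, alpha)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return np.exp(intercept), -slope

# (model size in parameters, final training loss) pairs per architecture
runs = {
    "arch_a": ([1e7, 3e7, 1e8, 3e8], [4.10, 3.72, 3.35, 3.02]),
    "arch_b": ([1e7, 3e7, 1e8, 3e8], [4.00, 3.70, 3.42, 3.17]),
}

target_size = 1e10  # extrapolate to a hypothetical 10B-parameter model
for name, (sizes, losses) in runs.items():
    a, alpha = fit_power_law(np.array(sizes), np.array(losses))
    predicted_loss = a * target_size ** (-alpha)
    print(f"{name}: alpha = {alpha:.3f}, predicted loss at 10B params = {predicted_loss:.2f}")
```

A straight-line fit in log space keeps the example short; in practice one would typically fit a saturating form with an irreducible-loss term and account for noise across multiple runs before trusting the extrapolation.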