Scaling Law refers to the asymptotic relationship between model capability and model scale. Specifically, model capability can be simply understood as the model’s loss function, while model scale can refer to the number of parameters, the amount of training data, the number of training steps, and so on. The study of Scaling Laws investigates the approximate relationship between the loss function and variables such as the parameter count, data volume, and training steps. Experimental results from works such as "Scaling Laws for Neural Language Models" and "Training Compute-Optimal Large Language Models" indicate that the scaling laws of neural networks mostly take the form of a power law.
Why is it a power law? Can it be explained theoretically? The paper "The Quantization Model of Neural Scaling" provides an interesting derivation based on a "quantization" hypothesis. Let us explore it together in this article.
Derivation Hypotheses
First, we assume that for a specific task, there exists a "perfect model," and the models we train are all approximations of this "perfect model." Furthermore, we assume that the "perfect model" is composed of "Quanta," where each quantum represents a specific capability (note that "quanta" here primarily refers to virtual units of capability, not necessarily specific capabilities we can name).
To complete a task, multiple capabilities are often required. Therefore, without loss of generality, we assume the "perfect model" contains an infinite number of capability quanta. Different quanta are responsible for solving samples of different difficulty levels. Generally, simple samples account for the majority, while difficult samples are the minority. Thus, these capability quanta can be sorted by their frequency of occurrence from high to low, labeled as 1, 2, \dots, k, \dots, with corresponding frequencies p_1, p_2, \dots, p_k, \dots.
Finally, we assume that the frequencies of these capability quanta follow "Zipf’s law", namely: \begin{equation} p_k = \frac{k^{-\gamma - 1}}{Z_{\gamma}} \end{equation} where \gamma > 0, and Z_{\gamma} is the normalization factor \sum_{k=1}^{\infty} k^{-\gamma - 1}.
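As a small numerical sketch (my own illustration, not from the original paper), the normalization factor Z_{\gamma} can be approximated by truncating the infinite sum; for \gamma = 1 it should approach \zeta(2) = \pi^2/6:

```python
import math

def zipf_normalizer(gamma: float, K: int = 1_000_000) -> float:
    """Approximate Z_gamma = sum_{k>=1} k^{-(gamma+1)} by truncating at K.

    The truncation error is of order K^{-gamma}, so for gamma = 1 and
    K = 10^6 it is about 1e-6.
    """
    return sum(k ** (-(gamma + 1.0)) for k in range(1, K + 1))

def zipf_prob(k: int, gamma: float, Z: float) -> float:
    """p_k = k^{-(gamma+1)} / Z_gamma."""
    return k ** (-(gamma + 1.0)) / Z

gamma = 1.0
Z = zipf_normalizer(gamma)
# For gamma = 1, Z_gamma = zeta(2) = pi^2 / 6 ≈ 1.6449
```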
Zipf’s Law
Readers might ask: why Zipf’s law? Zipf’s law is an empirical law published by the linguist George Kingsley Zipf in 1949. His original discovery was that the frequency of a word is roughly inversely proportional to its rank in the frequency table. Later, it was generalized to being inversely proportional to a power of the rank, and Zipf’s law has since been observed in many fields.
Zipf himself and several successors have attempted to derive Zipf’s law under more fundamental assumptions; related work can be found on Wikipedia and will not be expanded upon here. For the author, the most important reason for choosing Zipf’s law is actually—there were simply no other options.
Don’t forget that the p_k have already been sorted from high to low, so p_k is a monotonically decreasing function of k. What non-negative, monotonically decreasing functions come to mind? Essentially just two families: exponential functions and power functions. Exponential functions decay very quickly and thus lack a long tail, while power functions decay more slowly and are relatively long-tailed. Which one to choose depends on our prior belief about the importance of the tail. Since the capability quanta hypothesis treats every capability, however rare, as important, the only viable choice is a power function, which results in Zipf’s law.
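The difference in tail behavior is easy to see numerically. This sketch (my own, using an arbitrary geometric decay rate of 1/2 for the exponential case) compares the probability mass beyond rank n = 100 under a power-law versus an exponential rank distribution:

```python
def tail_mass(weight, n, K=100_000):
    """Fraction of total probability mass beyond rank n,
    for an unnormalized rank-weight function, truncated at K."""
    total = sum(weight(k) for k in range(1, K + 1))
    return sum(weight(k) for k in range(n + 1, K + 1)) / total

power_tail = tail_mass(lambda k: k ** -2.0, 100)  # Zipf's law with gamma = 1
expo_tail = tail_mass(lambda k: 0.5 ** k, 100)    # geometric (exponential) decay

# The power-law tail remains non-negligible (of order n^{-gamma}),
# while the exponential tail (~2^-100) is utterly negligible.
```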
Basic Results
Returning to the main topic. We previously assumed the ideal model has infinite capability quanta, while a real-world model with finite capacity can only learn n quanta. To cover as many samples as possible, the model should learn the first n quanta. Assuming each quantum can reduce the loss of its corresponding samples from b to a, we can estimate the average loss of the model as: \begin{equation} L = a \sum_{k=1}^n p_k + b \sum_{k=n+1}^{\infty} p_k \end{equation} The first n quanta have been learned, so the loss for that portion of samples is a; the subsequent quanta have not been learned, so the loss remains b. This assumption is somewhat strong (letting a and b vary with k would be more reasonable), but the result is already representative (refer to the appendix of the original paper). For the above equation, we can perform an asymptotic estimation: \begin{equation} \begin{aligned} L =&\, a \sum_{k=1}^{\infty} p_k + (b - a) \sum_{k=n+1}^{\infty} p_k \\ =&\, a + (b - a) \sum_{k=n+1}^{\infty} \frac{k^{-\gamma-1}}{Z_{\gamma}} \\ \sim&\, a + (b - a) \int_n^{\infty} \frac{k^{-\gamma-1}}{Z_{\gamma}} dk \\ =&\, a + \frac{b - a}{\gamma Z_{\gamma}} n^{-\gamma} \\ \end{aligned} \end{equation} This shows that the model’s capability (loss function) and the number of capability quanta n follow a power law of the form n^{-\gamma}. Clearly, a represents the minimum value of the loss function. If a = 0, then L \sim \mathcal{O}(n^{-\gamma}). In the following, we assume a = 0.
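As a sanity check (my own sketch, not from the paper), we can compare the exact tail sum against the asymptotic estimate a + \frac{b - a}{\gamma Z_{\gamma}} n^{-\gamma}; with \gamma = 1 and n = 100 the two agree to within about 1%:

```python
gamma = 1.0
K = 1_000_000  # truncation point for the "infinite" sums
Z = sum(k ** (-(gamma + 1.0)) for k in range(1, K + 1))  # Z_gamma ≈ zeta(2)

a, b = 0.0, 1.0  # per-sample loss after / before a quantum is learned
n = 100          # number of quanta the finite model has learned

# Exact loss: L = a * sum_{k<=n} p_k + b * sum_{k>n} p_k
tail = sum(k ** (-(gamma + 1.0)) for k in range(n + 1, K + 1)) / Z
L_exact = a * (1.0 - tail) + b * tail

# Asymptotic estimate: L ≈ a + (b - a) / (gamma * Z_gamma) * n^{-gamma}
L_approx = a + (b - a) / (gamma * Z) * n ** (-gamma)
# The two should agree to within O(1/n), i.e. well under 1% here.
```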
Scaling Laws
The n in the basic result is the number of capability quanta learned by the model, which so far is just a virtual concept. Next, we relate it to common model variables.
Parameters: Assume the model’s parameter count is N, and assume that on average, it takes C parameters to learn one capability quantum (assuming C is a constant). Then clearly n \propto N, and: \begin{equation} L \sim \mathcal{O}(N^{-\gamma}) \end{equation}
Data Volume: Assume the total number of samples in the training set is D. Since we assume different quanta solve samples of different difficulties, we can naturally assume that the number of samples solved by quantum 1 is Dp_1, by quantum 2 is Dp_2, by quantum 3 is Dp_3, and so on. If we assume that learning a quantum requires at least \tau samples, then quanta where Dp_k < \tau cannot be learned. Thus, from \tau = D p_n, we can solve for n \propto D^{1/(\gamma + 1)}. Substituting this back, we get: \begin{equation} L \sim \mathcal{O}(D^{-\gamma/(\gamma + 1)}) \end{equation}
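In more detail, the condition \tau = D p_n reads \begin{equation} \tau = \frac{D\, n^{-\gamma - 1}}{Z_{\gamma}} \quad\Rightarrow\quad n = \left(\frac{D}{\tau Z_{\gamma}}\right)^{1/(\gamma + 1)} \propto D^{1/(\gamma + 1)} \end{equation} which is the relation used above; the exponent -\gamma/(\gamma + 1) then follows from L \sim \mathcal{O}(n^{-\gamma}).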
Training Amount: Assume the model’s parameters and the training data are unlimited; then the number of quanta n learned by the model depends on the number of training steps S. Assuming a batch size of B, then on average, the number of samples for learning quantum 1 is Bp_1, for quantum 2 is Bp_2, and so on. Similarly, assuming learning a quantum requires at least \tau samples, then after S steps of training, quantum n has been trained on a total of SBp_n samples. From \tau = SB p_n, we can solve for n \propto S^{1/(\gamma + 1)}. Substituting this back, we get: \begin{equation} L \sim \mathcal{O}(S^{-\gamma/(\gamma + 1)}) \end{equation}
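Both the D and S results can be checked numerically. The sketch below (my own illustration, taking \gamma = 1, a = 0, b = 1, and an arbitrary threshold \tau = 10) computes, for each data volume D, the largest learnable n and the resulting loss, then fits the slope of \log L against \log D; the fitted slope should be close to -\gamma/(\gamma + 1) = -1/2. The same code applies to the training amount with D replaced by SB.

```python
import math

gamma, tau = 1.0, 10.0
K = 1_000_000  # truncation point for the "infinite" sums
Z = sum(k ** (-(gamma + 1.0)) for k in range(1, K + 1))  # Z_gamma

def loss(n: int) -> float:
    # L(n) = sum_{k>n} p_k, taking a = 0 and b = 1
    return sum(k ** (-(gamma + 1.0)) for k in range(n + 1, K + 1)) / Z

xs, ys = [], []
for log10D in range(4, 8):  # D = 1e4 .. 1e7
    D = 10.0 ** log10D
    # largest n with D * p_n >= tau, i.e. n <= (D / (tau * Z_gamma))^{1/(gamma+1)}
    n = int((D / (tau * Z)) ** (1.0 / (gamma + 1.0)))
    xs.append(math.log(D))
    ys.append(math.log(loss(n)))

# least-squares slope of log L versus log D
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
# slope ≈ -gamma / (gamma + 1) = -0.5
```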
As we can see, although the results are all power laws, since \gamma/(\gamma + 1) \in (0, 1) and \gamma > \gamma/(\gamma + 1) for every \gamma > 0, the loss decays faster in the parameter count, so the number of parameters has a greater impact on model capability.
Emergence Phenomenon
Some readers might ask: can the capability quantization hypothesis be used to explain the "Emergence" phenomenon in large models?
To some extent, yes. We previously assumed the perfect model should have infinite capability quanta. If we change this infinity to a finite number, then by increasing the parameter count, the model will eventually have the chance to cover all capability quanta, reaching the theoretically optimal perfect model—this is emergence. Alternatively, the perfect model might still have infinite capability quanta, but human "resolution" of intelligence is limited to a finite number of quanta (humans themselves are not necessarily perfect). Thus, when a large model learns a certain number of capability quanta, it appears as "emergence" from a human perspective.
Summary
This article introduced the process of deriving model Scaling Laws from the quantization hypothesis. Specifically, it covered the asymptotic relationship between the model’s loss function and its parameters, data volume, and training amount, and briefly analyzed its possible connection to the emergence phenomenon.
When reposting, please include the original link: https://kexue.fm/archives/9607
For more detailed reposting matters, please refer to: "Scientific Space FAQ"