LLM is an abbreviation for “Large Language Model,” which nowadays generally refers to language models with more than 10 billion parameters, oriented primarily toward text generation tasks. Unlike the “hundred flowers blooming” diversity among small-scale models (on the order of 1 billion parameters or fewer), the current LLM landscape is dominated by Decoder-only architectures. Setting aside OpenAI’s GPT series, which has always insisted on Decoder-only, even companies like Google, which do not bet entirely on Decoder-only, have invested significant effort into Decoder-only models such as PaLM. So why has the Decoder-only architecture become the mainstream choice for LLMs?
There is a similar question on Zhihu: “Why are current LLMs all Decoder-only architectures?” Most of the answers there focus on the advantages of Decoder-only in terms of training efficiency and engineering implementation. Does it have any theoretical advantages? This article attempts to provide a simple analysis from that perspective.
Unified Perspective
It should be pointed out that the largest model the author has trained is only at the 1-billion-parameter level, so by the usual definition of LLMs I am not strictly qualified to answer this question; what follows is merely an attempt at a more theoretical answer based on some research experience. Most inferences in this article are based on my own experimental results, and some may conflict with results in the literature, so readers should judge for themselves.
We know that general NLP tasks involve predicting an output based on a given input; completely unconditional random generation is rare. In other words, any NLP task can be decomposed into an “input” part and an “output” part. We can call the model that processes the “input” the Encoder and the model that generates the “output” the Decoder. Thus, all tasks can be understood from an “Encoder-Decoder” perspective. The differences between different models lie in the attention patterns of the Encoder and Decoder, and whether they share parameters:
| | Encoder Attention | Decoder Attention | Shared Parameters? |
|---|---|---|---|
| GPT | Unidirectional | Unidirectional | Yes |
| UniLM | Bidirectional | Unidirectional | Yes |
| T5 | Bidirectional | Unidirectional | No |
Here, GPT is the representative of Decoder-only; UniLM is a Decoder architecture similar to GPT but with a mixed attention pattern; T5 is the representative of the Encoder-Decoder architecture, which Google is particularly interested in.
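To make the attention patterns in the table concrete, here is a small NumPy sketch (the function names are my own, purely illustrative): a causal mask of the kind GPT uses, and a UniLM-style mixed mask whose input block is fully bidirectional while the output part remains causal.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask: position i attends to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def unilm_mask(n_in, n_out):
    # UniLM-style mixed pattern: input tokens attend bidirectionally among
    # themselves; output tokens attend to all inputs plus earlier outputs.
    n = n_in + n_out
    m = causal_mask(n)
    m[:n_in, :n_in] = True  # make the input block fully bidirectional
    return m

mask = unilm_mask(3, 2)  # 3 input tokens, 2 output tokens
```

T5’s Encoder-Decoder pattern would correspond to a fully `True` mask on the Encoder side plus a causal mask (and cross-attention) on the Decoder side, with the two stacks not sharing parameters.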
Google conducted extensive comparative experiments in the T5 and UL2 papers. The results consistently showed the advantages of the Encoder-Decoder architecture compared to Decoder-only. However, since the model scales in these two papers are not large from an LLM perspective, and most LLMs are indeed Decoder-only, it remains unanswered whether this advantage extends to larger-scale LLMs and what the reason for this advantage is.
Comparative Experiments
From the table above, we can see that comparing GPT with UniLM is a strictly controlled experiment: only one variable differs. Comparing GPT directly with T5 actually introduces two variables: the attention over the input part becomes bidirectional, and the parameter count doubles. The reason they are nevertheless compared is that their inference costs are roughly the same.
Since T5 has two variables compared to GPT, we cannot determine whether the advantage of the Encoder-Decoder architecture is caused by changing the input attention to bidirectional or by doubling the parameters. To this end, the author conducted comparative experiments between GPT and UniLM on a 1-billion-parameter scale. The results showed that for training from scratch on the same input and output (where the Loss is only calculated for the output part, and the only difference is the attention pattern of the input part), UniLM showed no advantage over GPT and was even worse in some tasks.
Assuming this conclusion is representative, we can reach a preliminary conclusion:
Changing the attention of the input part to bidirectional does not bring gains; the advantage of the Encoder-Decoder architecture is very likely just due to doubling the parameters.
In other words, under the same parameter count and inference cost, the Decoder-only architecture is likely the optimal choice. Of course, to fully verify this hypothesis, further experiments are needed, such as keeping the Encoder and Decoder parameters unshared but changing the Encoder to unidirectional attention, or to the forward-backward mixed attention introduced in the next section, and then comparing it with the conventional Encoder-Decoder architecture. However, due to limited computing power, these experiments are left to interested readers.
Low-Rank Problem
Why does “changing the attention of the input part to bidirectional not bring gains”? Since the input part does not need to consider autoregressive generation, shouldn’t a complete attention matrix be better intuitively? The author guesses that this is likely due to the performance degradation caused by the low-rank problem of bidirectional attention.
As is well known, an Attention matrix is generally produced by applying softmax to a low-rank factorization: an $n\times d$ matrix multiplied by a $d\times n$ matrix, where $n \gg d$. An Attention matrix of this form suffers reduced expressive power due to the low-rank problem; for a detailed analysis, refer to “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.” In contrast, the Attention matrix of the Decoder-only architecture is lower triangular. The determinant of a triangular matrix equals the product of its diagonal elements, and because of the softmax those diagonal elements are all positive, so the determinant is positive. In other words, the Attention matrix of the Decoder-only architecture is necessarily full rank, which implies theoretically stronger expressive power; changing it to bidirectional attention may actually leave the expressive power insufficient.
There is also a phenomenon that indirectly supports this view: in language modeling tasks (unidirectional attention), the gap between Linear Attention and standard Attention is smaller than in MLM tasks (bidirectional attention); that is, Linear Attention performs relatively worse under bidirectional attention. The reason is that in language modeling, the Attention matrix of Linear Attention is a full-rank lower triangular matrix, just like that of standard Attention. In MLM, however, the Linear Attention matrix is an $n\times d$ matrix multiplied by a $d\times n$ matrix, so its rank is at most $d$, whereas standard Attention applies a softmax on top, which has a certain rank-increasing effect (refer to the “Low-Rank Problem” section and comments in “Transformer Upgrade Road: 3. From Performer to Linear Attention”).
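This contrast can be illustrated with a toy Linear Attention matrix (a sketch under my own assumptions, using the common $\phi(x)=\mathrm{elu}(x)+1$ feature map): without a causal mask its rank is capped at $d$, while the masked version is triangular with a positive diagonal and therefore full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 4

# Linear Attention replaces softmax(QK^T) with phi(Q) phi(K)^T;
# phi(x) = elu(x) + 1 is one common positive feature map.
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

Q = phi(rng.normal(size=(n, d)))
K = phi(rng.normal(size=(n, d)))

A_bi = Q @ K.T         # bidirectional: rank <= d by construction
A_uni = np.tril(A_bi)  # causal: lower triangular, positive diagonal
```

Without the softmax there is nothing to lift the rank of the bidirectional matrix above $d$, which matches the larger Linear-Attention gap observed on MLM tasks.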
Conversely, can this conclusion be used to improve bidirectional attention models like BERT? The idea is not hard to come up with. For example, in Multi-Head Attention, half of the heads’ Attention matrices could be truncated to lower triangular matrices (forward attention) and the other half to upper triangular matrices (backward attention); alternatively, the Attention matrices of odd layers could be truncated to lower triangular and those of even layers to upper triangular. Both designs preserve the bidirectionality of the model’s overall interaction (unlike GPT, where an earlier token can never attend to a later token) while incorporating the full-rank advantage of unidirectional attention.
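The per-head variant of this mixed design could be sketched as follows (`mixed_head_masks` is a hypothetical helper of mine, not code from the post): half the heads get forward (lower-triangular) masks and half get backward (upper-triangular) masks.

```python
import numpy as np

def mixed_head_masks(n, num_heads):
    """Boolean attention masks: the first half of the heads attend forward
    (causally), the second half backward (anti-causally). Each mask is
    triangular, so each head's softmax attention matrix stays full rank."""
    fwd = np.tril(np.ones((n, n), dtype=bool))  # self and earlier tokens
    bwd = np.triu(np.ones((n, n), dtype=bool))  # self and later tokens
    return [fwd if h < num_heads // 2 else bwd for h in range(num_heads)]

masks = mixed_head_masks(4, 8)
```

Stacking such heads (or alternating triangular directions across layers) restores two-way information flow through the model while every individual Attention matrix keeps the triangular, full-rank form.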
The author also ran a simple comparative experiment and found that forward-backward mixed attention performs slightly better on MLM tasks than fully bidirectional attention models like BERT.
The good news is that a slight advantage is visible, indirectly supporting the previous hypothesis. The bad news is that the experiment was only on a base version (100 million parameters) model; the effect on larger models is not yet clear.
Article Summary
Therefore, the answer provided by the author is: the reason LLMs mainly use the Decoder-only architecture, besides the advantages in training efficiency and engineering implementation, is theoretically because the bidirectional attention of the Encoder suffers from a low-rank problem, which may weaken the model’s expressive power. As far as generation tasks are concerned, introducing bidirectional attention offers no substantial benefit. The reason the Encoder-Decoder architecture can perform better in certain scenarios is likely just because it has twice the parameters. Thus, under the same parameter count and inference cost, the Decoder-only architecture is the optimal choice.
Please include the original link when reprinting: https://kexue.fm/archives/9529
For more details on reprinting, please refer to: Scientific Space FAQ