Last week, I wrote "Why are current LLMs all Decoder-only architectures?", summarizing some of my experimental conclusions and conjectures on this issue. As expected of a hot topic, the traffic was significant; shortly after PaperWeekly forwarded it, the view count exceeded 10,000, and it received many upvotes on Zhihu. Across various platforms, I have received several comments and questions from readers. I have summarized some of the most representative questions into this FAQ, hoping to further help resolve any confusion.
Review
In "Why are current LLMs all Decoder-only architectures?", I conducted comparative experiments between GPT and UniLM architectures. Combining these with my previous research experience, I conjectured the following:
Changing the attention in the input section to bidirectional does not bring gains; the advantage of the Encoder-Decoder architecture likely stems solely from doubling the parameters.
The reason bidirectional attention fails to bring gains might be due to a performance degradation caused by the low-rank problem of bidirectional attention.
Based on these two conjectures, we reached the conclusion:
Under the same number of parameters and the same inference cost, the Decoder-only architecture is the optimal choice.
For details regarding the experiments and reasoning, please refer to the original article; I will not repeat them here.
Q&A
Here are my answers to some of the questions readers have raised.
Question 1: It seems that n \gg d does not hold?
Answer: n is the sequence length, and d is the head_size, not the hidden_size. In multi-head attention, \text{head\_size} = \text{hidden\_size} / \text{heads}. For example, in BERT-base, \text{head\_size} = 768 / 12 = 64, while the pre-training length n is typically 512. Therefore, n \gg d generally holds true.
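As a quick numeric check (the BERT-base numbers are the ones quoted above; the helper function itself is just illustrative):

```python
# Quick arithmetic check of the n >> d claim; the numbers are from the text,
# the helper function is just illustrative.
def head_size(hidden_size: int, heads: int) -> int:
    """Per-head dimension in multi-head attention."""
    assert hidden_size % heads == 0
    return hidden_size // heads

d = head_size(hidden_size=768, heads=12)  # BERT-base: 768 / 12 = 64
n = 512                                   # typical pre-training length
print(d, n / d)  # 64 8.0 -> n is 8x the head_size, so n >> d holds
```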
Question 2: BERT and the original GPT have the same number of parameters. Why does BERT perform better on understanding tasks?
Answer: BERT and GPT differ not only in architecture but also in their pre-training tasks, making a fair comparison impossible. At the end of the original article, I provided an idea for improving BERT using GPT’s philosophy, and preliminary experiments suggest it is likely to outperform BERT. That experiment is the one where variables were strictly controlled.
Question 3: "Performance degradation caused by the low-rank problem of bidirectional attention" seems like a major bug. Since the vast majority of models in the industry currently use bidirectional attention, wouldn’t the impact be too widespread?
Answer: We did not conclude that "bidirectional attention is terrible for all tasks." The phenomenon that "the vast majority of models in the industry use bidirectional attention" does not actually conflict with the conclusions of the original article. Our experimental conclusion in the original text was "introducing bidirectional attention into the Encoder for generation tasks does not seem to bring gains." The condition for this conclusion is very specific—"in the Encoder of generation tasks."
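The rank claim behind this answer can be illustrated numerically. Below is a minimal sketch (my own illustration, not the original experiments): with n \gg d, the score matrix QK^{\top} is an n \times n matrix of rank at most d, i.e. severely low-rank, whereas a causal (lower-triangular) attention matrix has strictly positive diagonal entries and is therefore guaranteed to be full rank.

```python
import numpy as np

# Illustrative sketch of the rank argument, not the original experiments.
rng = np.random.default_rng(0)
n, d = 512, 64                        # sequence length and head_size (BERT-base scale)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
scores = Q @ K.T / np.sqrt(d)         # n x n score matrix, rank at most d

print(np.linalg.matrix_rank(scores))  # 64, far below n = 512

# Causal attention: softmax restricted to the lower triangle.
mask = np.tril(np.ones((n, n), dtype=bool))
masked = np.where(mask, scores, -np.inf)
attn = np.exp(masked - masked.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Lower-triangular with strictly positive diagonal -> positive determinant.
sign, logdet = np.linalg.slogdet(attn)
print(sign)                           # 1.0: nonzero determinant, hence full rank
```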
Question 4: Not necessarily... Decoder models are just more suitable for dialogue models. Inside Google, there are LLMs based on Encoder-only, Decoder-only, and Encoder-Decoder architectures. They apply to different scenarios, and the other two perform better on other tasks.
Answer: The answer to this is similar to the previous one. The existence of both "Decoder models and Encoder-Decoder models" does not contradict the original conclusion. We only tentatively conjectured that "introducing bidirectional attention into the Encoder for generation tasks does not seem to bring gains"; we did not say that the doubling of parameters brought by an Encoder would not bring gains.
Question 5: Does your conclusion seem to contradict the conclusions of T5 and UL2?
Answer: First, the original conclusion does not contradict UL2. The original text conjectures that "under the same number of parameters and inference cost, Decoder-only is the optimal choice." UL2’s conclusion is that Encoder-Decoder performs better, but Encoder-Decoder and Decoder-only were not compared with the same number of parameters. Secondly, the original conclusion does indeed conflict with some experimental results in T5 (Table 2). However, I have doubts about the T5 experimental results:
In that table, were the variables strictly controlled between Decoder-only and UniLM? The difference between the two is so large that it feels unreasonable; even if Decoder-only were inferior to UniLM, the gap shouldn’t be that significant.
In my article, I compared UniLM and Decoder-only trained from scratch on the same tasks and data (directly comparing pre-training results without fine-tuning on other tasks). The T5 paper compares results after pre-training on various tasks and then fine-tuning on downstream tasks. Since the processes are different, could this lead to the discrepancy in results?
Question 6: Does a faster drop in the final experimental loss prove that the model is better?
Answer: Based on the number of training steps I have performed so far, the mixed forward-backward attention has consistently performed better. I can only conjecture that this trend will continue. This is the limit of the experiments I can currently conduct. I look forward to interested readers with the necessary resources conducting further experiments to confirm or refute this conclusion.
Question 7: Regarding your statement that "comparing GPT with UniLM counts as a strict control of variables," I think it’s not quite accurate. The Google UL2 paper points out that for pre-trained language models, both the model architecture and the pre-training tasks play a key role in model quality.
Answer: In this article, UniLM and GPT refer to two model architectures where only the Attention Mask is inconsistent. When conducting the comparative experiments, all other details were aligned except for the Attention Mask.
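For concreteness, here is a sketch of that single differing component (the sequence length and prefix split below are illustrative): a GPT-style causal mask versus a UniLM-style mask whose input prefix is fully bidirectional while the generated part stays causal.

```python
import numpy as np

# Sketch of the one difference between the two compared architectures:
# the Attention Mask. Lengths and the prefix split are illustrative.
def gpt_mask(n: int) -> np.ndarray:
    """Causal mask: each token attends only to itself and earlier tokens."""
    return np.tril(np.ones((n, n), dtype=bool))

def unilm_mask(n: int, prefix_len: int) -> np.ndarray:
    """UniLM-style mask: bidirectional within the input prefix,
    causal for the generated part."""
    mask = np.tril(np.ones((n, n), dtype=bool))
    mask[:prefix_len, :prefix_len] = True   # prefix tokens see each other
    return mask

print(gpt_mask(5).astype(int))
print(unilm_mask(5, prefix_len=3).astype(int))
```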
Question 8: Could there be another reason: that lower-triangular or upper-triangular masks are better at processing position encoding information?
Answer: This is indeed a very novel perspective that I hadn’t considered. In fact, besides increasing the rank, the triangular mask does indeed bring advantages in position recognition. It breaks the permutation invariance of the Transformer and directly introduces a left-to-right order, so much so that it can even work without position encodings. Perhaps both factors play a role.
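The permutation point can be checked numerically. The following is my own sketch, under the simplifying assumptions of tied Q = K = V = X and no position encodings: with a full bidirectional mask, attention is permutation-equivariant (shuffling the inputs just shuffles the outputs), while a causal mask breaks this symmetry and thereby encodes left-to-right order on its own.

```python
import numpy as np

# Sketch (my illustration): without position encodings, bidirectional
# self-attention is permutation-equivariant, but a causal mask is not.
def attention(X, mask):
    scores = X @ X.T / np.sqrt(X.shape[1])    # tied Q = K = V = X for simplicity
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))
perm = np.arange(n)[::-1]                     # reverse the token order

full = np.ones((n, n), dtype=bool)
causal = np.tril(full)

# Bidirectional: permuting inputs permutes outputs identically.
print(np.allclose(attention(X, full)[perm], attention(X[perm], full)))    # True

# Causal: the same permutation changes the outputs -- order matters.
print(np.allclose(attention(X, causal)[perm], attention(X[perm], causal)))  # False
```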
Summary
This article has addressed some of the questions raised by readers regarding the previous post.
When reposting, please include the original address: https://kexue.fm/archives/9547
For more detailed reposting matters, please refer to: "Scientific Space FAQ"