GAU-\alpha: A First Experience with the Fast, Effective, and Efficient Next-Generation Attention · English (unofficial) translations of posts at kexue.fm

In "FLASH: Probably the Most Interesting Efficient Transformer Design Lately", we introduced the GAU (Gated Attention Unit). The author is willing to call it the "most promising next-generation Attention design" because it genuinely is "faster (speed), better (quality), and more efficient (memory)."

However, some readers obtained opposite results in their own tests, such as slower convergence or poorer performance, which differs significantly from the author’s test results. This article shares the author’s own training experience and releases a preview version, "GAU-\alpha," for everyone to test.

Open Source Address: https://github.com/ZhuiyiTechnology/GAU-alpha

GAU-\alpha

First, let’s introduce the performance of the open-sourced "GAU-\alpha" on CLUE tasks:

All models are Base-sized. The table reports results on the validation sets of the CLUE tasks; every model was run and evaluated in the same way, so the relative comparison is reasonable. Note that the RoFormerV2^* here is not the multi-task version from "RoFormerV2: Exploring the Limits of Natural Language Understanding", but a version that only underwent MLM pre-training (this version was not open-sourced). It is used for comparison because GAU-\alpha also only underwent MLM pre-training.

As can be seen from the table, except for the "outlier" WSC, which has very little data, GAU-\alpha has an advantage on most tasks, and its average score (excluding WSC) is the best. The comparison between RoFormerV2^* and GAU-\alpha is the fairest, because their training scripts, training data, and overall structures are identical; the only difference is that GAU-\alpha replaces the Attention+FFN combination in RoFormerV2^* with two layers of GAU. This comparison directly demonstrates the "better" quality of the GAU design.

Furthermore, as described in "RoFormerV2: Exploring the Limits of Natural Language Understanding", RoFormerV2 simplified its structure to run faster, and GAU-\alpha, which shares the same overall structure, does the same. GAU-\alpha is therefore faster than the BERT, RoBERTa, and RoFormer models in the table, while its average score is higher. Further testing shows that once the sequence length exceeds 512, GAU-\alpha becomes faster than the similarly streamlined RoFormerV2 and uses less memory; the longer the sequence, the larger GAU-\alpha’s advantage.

Training

Now, let’s introduce the training details of the model. The complete code has been open-sourced on GitHub; if you have any doubts, you can read it alongside the code.

Model Architecture: GAU-\alpha replaces the Attention+FFN of RoFormerV2 with two layers of GAU. In a previous article, we showed that the computational cost and parameter count of two GAU layers are roughly equal to those of one Attention+FFN combination, so this replacement is reasonable. RoFormerV2 retains the Post Norm structure, removes all Bias terms, and replaces Layer Norm with the simplest variant of RMS Norm; the same applies to GAU-\alpha.
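To make the structural choices concrete, here is a minimal pure-Python sketch of a bias-free Post Norm block that uses the simplest form of RMS Norm. It rests on assumptions: "simplest variant" is taken to mean no learnable gain, the name post_norm_block is made up for illustration, and toy_sublayer is only a hypothetical stand-in for a real GAU layer. It is not the code from the repository.

```python
import math

def rms_norm(x, eps=1e-12):
    """Simplest RMS Norm: divide by the root-mean-square.
    Assumption: no learnable gain and no bias."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

def post_norm_block(x, sublayer):
    """Post Norm: add the residual first, then normalize: Norm(x + f(x))."""
    y = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    """Hypothetical stand-in for one GAU layer (a real layer mixes tokens)."""
    return [0.5 * v for v in x]

h = [1.0, -2.0, 3.0, 0.0]
for _ in range(2):  # two stacked layers take the place of one Attention+FFN pair
    h = post_norm_block(h, toy_sublayer)
```

With Post Norm, the output of every block has unit root-mean-square regardless of the scale of the sublayer output.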

Normalization: In "It seems Attention and Softmax are a better match", we discussed the normalization issue of Attention. For GAU-\alpha’s Attention normalization, we selected the Entropy-invariant Softmax (temporarily called softmax_plus in bert4keras), which was proposed by the author and extrapolates well to longer sequences.
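As a rough illustration of the idea, the sketch below rescales the attention logits by log(n) / log(512) before the softmax, where n is the number of keys. This is an assumption based on the usual description of entropy-invariant attention with 512 as the training length; the function name softmax_plus_sketch is made up, and the actual bert4keras implementation may differ in its details (for example, in how masked positions are counted).

```python
import math

def softmax_plus_sketch(scores, train_len=512):
    """Softmax over one row of attention logits, with the logits rescaled
    by log(n) / log(train_len), where n is the number of keys.
    Assumption: train_len=512 is the reference length."""
    n = len(scores)
    scale = math.log(n) / math.log(train_len)
    scaled = [s * scale for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# At n == train_len the scale is 1, so this reduces to the ordinary softmax.
row_512 = [0.3 * (i % 7) for i in range(512)]
probs_512 = softmax_plus_sketch(row_512)

# For longer rows the scale exceeds 1, which sharpens the distribution and
# counteracts the dilution of attention over more keys.
row_1024 = [0.3 * (i % 7) for i in range(1024)]
probs_1024 = softmax_plus_sketch(row_1024)
```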

Training Method: Regarding initialization, the author made adjustments according to "What are the difficulties in training a 1000-layer Transformer?". Consequently, training can proceed directly without Warmup. The optimizer used is LAMB, with piecewise linear learning rate decay. The pre-training task is Whole Word MLM, and the tokenization tool is Baidu’s LAC; these are all aligned with RoFormerV2.
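A piecewise linear learning rate schedule without Warmup can be sketched as follows. The breakpoints and the base learning rate are hypothetical values chosen only for illustration; the actual values are in the open-sourced training script.

```python
def piecewise_linear(step, breakpoints):
    """Linearly interpolate a learning-rate multiplier between breakpoints.
    breakpoints: list of (step, multiplier) pairs sorted by step; the last
    multiplier is held constant afterwards."""
    if step <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (s0, m0), (s1, m1) in zip(breakpoints, breakpoints[1:]):
        if step <= s1:
            return m0 + (m1 - m0) * (step - s0) / (s1 - s0)
    return breakpoints[-1][1]

# Hypothetical schedule: start at the full rate (no Warmup), then decay.
SCHEDULE = [(0, 1.0), (100_000, 0.5), (200_000, 0.1)]
BASE_LR = 1e-3  # hypothetical base learning rate

def lr_at(step):
    """Learning rate at a given training step under the schedule above."""
    return BASE_LR * piecewise_linear(step, SCHEDULE)
```

Because the first breakpoint already has multiplier 1.0, training starts at the full learning rate, which matches the statement that no Warmup is needed.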

There doesn’t seem to be much else worth mentioning; indeed, not many changes were made. Aside from spending some time testing normalization methods, other aspects didn’t require much time, and direct training yielded good results.

Summary

GAU is what the author considers to be the "most promising next-generation Attention design at present." This article shared some training experiences with GAU and open-sourced a preview version, "GAU-\alpha."

When reprinting, please include the original address of this article: https://kexue.fm/archives/9052

For more detailed reprinting matters, please refer to: "Scientific Space FAQ"