Until recently, I focused primarily
on the conception and implementation of models, rarely paying attention
to model training acceleration. Although I had heard of technologies
like mixed precision and XLA, I had never truly put them into practice.
Over the past two days, after some experimentation, I successfully
utilized mixed precision and XLA to accelerate training in
bert4keras. Here is a brief summary for your reference.
Most of the empirical conclusions in this article are not limited to
use within bert4keras. The reason bert4keras
is emphasized in the title is simply that the model implementations in
bert4keras are relatively structured, making the
modifications required to enable these acceleration techniques
minimal.
Experimental Environment
The GPU used for the experiments in this article is an RTX 3090. The
Docker image used is nvcr.io/nvidia/tensorflow:21.09-tf1-py3,
which comes with TensorFlow version 1.15.5. Additionally, the version of
bert4keras used is 0.11.3. Other environments can be set up
along similar lines, but please keep an experimental mindset
and do not expect copy-and-paste calls to work out of the
box.
As a side note, cards like the 3090 and A100 require CUDA 11 or later, and Google's official TensorFlow 1.15 does not support CUDA 11. If you still want to use TensorFlow 1.x, you must use nvidia-tensorflow, which NVIDIA maintains itself, or the Docker images NVIDIA builds. Using NVIDIA's TensorFlow instead of Google's not only lets you run version 1.x on the latest GPUs but also brings additional optimizations made by NVIDIA. Detailed documentation can be found here.
Do not say things like "TensorFlow is already at version 2.8, why are you still using 1.15?" Your graphics card is manufactured by NVIDIA, so which version of TensorFlow is best is not up to you or me, or even Google; it’s up to NVIDIA. Since NVIDIA is still maintaining 1.15, it means 1.15 is the GOAT (Greatest of All Time).
Mixed Precision
First, let’s look at mixed precision training. Simply put, model calculations use FP16, while parameter updates and storage use FP32. The representation range of FP16 is approximately $6 \times 10^{-8} \sim 65504$. Both its upper and lower bounds are limits we might encounter when implementing models. Therefore, the biggest problems introduced by FP16 are overflow and precision loss. For a more detailed introduction to the principles, you can search for them yourself; this article focuses on how to use it.
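To make these limits concrete, here is a minimal NumPy sketch (NumPy is not used in the original article; it only serves to emulate FP16 arithmetic on the CPU). The constant 1e12 is an illustrative stand-in for the kind of "infinity" value often used for masking, and the other inputs are likewise chosen for illustration.

```python
import numpy as np

fp16 = np.finfo(np.float16)
largest = float(fp16.max)                  # upper bound of FP16
smallest = float(np.float16(2.0 ** -24))   # smallest positive (subnormal) FP16 value

with np.errstate(over="ignore", invalid="ignore"):
    # A large masking constant overflows to inf in FP16 ...
    overflowed = np.float16(1e12)
    # ... and inf - inf is NaN, one way NaN can appear at the start of training.
    masked_diff = overflowed - overflowed

# Values far below the lower bound are flushed to exactly zero.
underflowed = np.float16(1e-8)

# Precision loss: near 1.0 the FP16 spacing is 2**-10, so adding 1e-4 changes nothing.
precision_lost = np.float16(1.0) + np.float16(1e-4)

print(largest, smallest)
print(overflowed, masked_diff, underflowed, precision_lost)
```

Both failure modes come up again below: overflow is why the infinity value has to be lowered, and underflow is why the loss has to be scaled.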
The introduction to mixed precision training in the
nvidia-tensorflow help documentation can be found here.
The simplest way to enable mixed precision training is to add
environment variables at the beginning of the script:
import os
os.environ['TF_KERAS'] = '1' # Must use tf.keras
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE'] = '1' # Mixed precision training
Readers might notice that most tutorials introduce TF_ENABLE_AUTO_MIXED_PRECISION,
whereas I use TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE.
The difference is that the former automatically adds "Dynamic Loss
Scaling," while the latter does not. However, in my tests
"Dynamic Loss Scaling" was no substitute for manually scaling the loss, so I
decided to forgo this feature entirely.
After adding the environment variables, you can restart the training
script to check the situation. If NaN appears as soon as
training starts, you can adjust the infinity and epsilon values:
from bert4keras.backend import K
K.set_infinity(1e4)
K.set_epsilon(1e-5)
After these adjustments, NaN usually won’t appear right
at the start (if it does, check whether other parts of the model use their own
infinity or epsilon values that are not controlled
by these two functions, and modify them). However, it is still possible for the
loss to decrease first, then increase, and finally become
NaN. This happens when, owing to poor initialization or to intentional
design as in DeepNet, the gradients of some parameters are extremely small
(less than $10^{-8}$). At FP16 precision, these
gradients become exactly 0, so those parameters are not updated;
equivalently, the gradients are inaccurate. Updating with inaccurate
gradients over a long period easily leads to non-convergence.
The solution here is "Loss Scaling." We can directly multiply the
loss function by a scaling factor (e.g., 1000; you can tune this
yourself—the larger the better, as long as NaN does not
occur). This allows originally tiny gradients to be scaled into the FP16
range, preventing them from being zeroed out and avoiding precision
loss. For optimizers we commonly use, such as Adam and LAMB, multiplying the loss
function by a constant leaves the parameter updates essentially unchanged
(the update is a ratio of gradient statistics, so the constant cancels, up to
the small epsilon term), meaning they are fully compatible with "Loss Scaling."
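Both claims can be checked with a small NumPy sketch. It is an illustration under stated assumptions, not the optimizer code used by bert4keras: `adam_updates` is a hypothetical scalar re-implementation of the standard Adam rule, and the gradient values and the factor 1000 are made up for the demonstration.

```python
import numpy as np

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    """Return the Adam parameter updates for a stream of scalar gradients."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

scale = 1000.0

# 1) Tiny gradients vanish in FP16 unless the loss (hence the gradient) is scaled.
tiny_grads = np.array([3e-9, -1e-9, 2e-9, 4e-9])
fp16_plain = tiny_grads.astype(np.float16)             # every entry becomes 0
fp16_scaled = (tiny_grads * scale).astype(np.float16)  # entries survive

# 2) Adam's updates are (almost) unchanged when every gradient is scaled.
grads = np.array([0.5, -0.2, 0.1, 0.3])
plain_updates = adam_updates(grads)
scaled_updates = adam_updates(grads * scale)

print(fp16_plain, fp16_scaled)
print(np.max(np.abs(plain_updates - scaled_updates)))
```

The residual difference between the two update sequences comes only from `eps`, which is why the scaling factor can be treated as a free knob for these optimizers. For plain SGD, by contrast, scaling the loss is equivalent to scaling the learning rate.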
In fact, I have found that the "Loss Scaling" technique is effective not only in mixed precision training scenarios but also provides some benefit in full FP32 precision training. In full FP32 training, if loss scaling is not performed, the model may stay at a certain loss value for a while at the beginning before slowly decreasing; if loss scaling is performed, the model maintains a slow downward trend from the start, resulting in relatively faster convergence.
Algebraic Acceleration
Now let’s look at XLA, which stands for "Accelerated Linear Algebra," a compiler designed specifically to speed up linear algebra operations. Simply put, XLA compiles and optimizes the computation graph before it is executed, merging operators that can be merged (reducing intermediate variables to save memory) and parallelizing operators that can be parallelized (increasing computation speed).
In nvidia-tensorflow, the simplest way to enable XLA is
still by adding environment variables:
import os
os.environ['TF_KERAS'] = '1' # Must use tf.keras
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=1' # Enable XLA
However, note that XLA does not guarantee an improvement. As
mentioned, XLA tries to parallelize operators as much as possible;
obviously, this is a strategy of trading space for time. Therefore,
enabling XLA may consume more VRAM, leading to OOM (Out of Memory), or
even degrade performance if the compiled clusters are too large. The
official documentation provides a detailed analysis of possible anomalies and
offers corresponding suggestions. Among them, the solution I recommend
is to add the --tf_xla_enable_lazy_compilation=false
parameter:
import os
os.environ['TF_KERAS'] = '1' # Must use tf.keras
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=1' # Enable XLA
os.environ['TF_XLA_FLAGS'] += ' --tf_xla_enable_lazy_compilation=false' # Optimize XLA
If this still doesn’t solve the problem, switch to XLA Lite:
import os
os.environ['TF_KERAS'] = '1' # Must use tf.keras
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=fusible' # Enable XLA Lite
If even XLA Lite cannot solve the issue, it basically means XLA is not suitable for your model.
Performance Comparison
On a 3090, enabling mixed precision training brings a speedup of a bit more than 10%. This gain may be smaller than one might expect. I speculate this is because, on newer cards like the 3090 and A100, FP32 computation by default actually uses a format called TF32 (refer to here). TF32 is, in a sense, a "half-precision format" itself, and is faster than standard FP32. In other words, FP32 on the 3090 has effectively already undergone some half-precision optimization and is inherently faster, so the additional improvement from switching to mixed precision is relatively small.
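To illustrate why TF32 can be seen as a kind of half precision, the following NumPy sketch emulates it on the CPU. It relies on the published TF32 layout (8 exponent bits as in FP32, 10 mantissa bits as in FP16) and, as a simplification, truncates the extra mantissa bits instead of rounding them as real hardware does. The helper `truncate_mantissa` is hypothetical and not part of any library.

```python
import numpy as np

def truncate_mantissa(x, kept_bits):
    """Zero out the low (23 - kept_bits) mantissa bits of float32 values."""
    bits = np.asarray(x, dtype=np.float32).reshape(-1).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - kept_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

values = np.array([1 + 2.0 ** -10, 1 + 2.0 ** -11, 1e10], dtype=np.float32)

tf32_like = truncate_mantissa(values, kept_bits=10)  # FP32 range, FP16-like precision
with np.errstate(over="ignore"):
    fp16 = values.astype(np.float16)                 # FP16 range and precision

print(tf32_like)  # 1 + 2**-11 collapses to 1.0, while 1e10 stays finite
print(fp16)       # same precision near 1.0, but 1e10 overflows to inf
```

In this emulation the TF32-like values lose the same low-order bits as FP16 but keep the full FP32 range, which is consistent with the speculation above that the FP32 baseline on the 3090 already runs at reduced precision.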
As for XLA, the improvement is roughly 15%. In my
training script, directly setting the environment variable TF_XLA_FLAGS to
--tf_xla_auto_jit=1 caused an OOM; adding
--tf_xla_enable_lazy_compilation=false still caused an OOM,
while changing it to --tf_xla_auto_jit=fusible allowed for
normal training.
Finally, and most importantly, mixed precision and XLA can be used together! Using both together brings a speedup of about 30%, and the addition of mixed precision training basically offsets the increase in VRAM consumption caused by XLA. The two truly complement each other.
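A combined setup might look like the sketch below. It uses only the environment variables shown earlier in this article. The helper `configure_acceleration` and its parameters are hypothetical conveniences, and, as with the snippets above, it is assumed to run at the very top of the script, before bert4keras or TensorFlow is imported.

```python
import os

def configure_acceleration(mixed_precision=True, xla_mode="fusible",
                           lazy_compilation=True):
    """Set the acceleration environment variables and return what was set.

    xla_mode: "1" for full XLA, "fusible" for XLA Lite, None to leave XLA off.
    """
    os.environ['TF_KERAS'] = '1'  # Must use tf.keras
    if mixed_precision:
        # Graph rewrite only; loss scaling is applied manually to the loss.
        os.environ['TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE'] = '1'
    if xla_mode is not None:
        flags = ['--tf_xla_auto_jit=' + str(xla_mode)]
        if not lazy_compilation:
            flags.append('--tf_xla_enable_lazy_compilation=false')
        os.environ['TF_XLA_FLAGS'] = ' '.join(flags)
    keys = ('TF_KERAS', 'TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE',
            'TF_XLA_FLAGS')
    return {k: os.environ.get(k) for k in keys}

# The combination that trained normally in the experiments above:
# mixed precision plus XLA Lite.
settings = configure_acceleration(mixed_precision=True, xla_mode="fusible")
print(settings)
```

If full XLA fits in memory for your model, passing `xla_mode="1"` (optionally with `lazy_compilation=False`) reproduces the other two configurations discussed above.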
Summary
This article introduced attempts to use mixed precision and XLA to
accelerate training in bert4keras. Enabling both
simultaneously can achieve a speedup of about 30% on a 3090.
When reposting, please include the original address of this article: https://kexue.fm/archives/9059