Accelerating Training with Mixed Precision and XLA in bert4keras · English (unofficial) translations of posts at kexue.fm

Until now, I have focused mainly on designing and implementing models and have rarely paid attention to accelerating training. Although I had heard of techniques such as mixed precision and XLA, I had never actually put them into practice. Over the past two days, after some experimentation, I successfully used mixed precision and XLA to accelerate training in bert4keras. Here is a brief summary for your reference.

Most of the empirical conclusions in this article are not limited to use within bert4keras. The reason bert4keras is emphasized in the title is simply that the model implementations in bert4keras are relatively structured, making the modifications required to enable these acceleration techniques minimal.

Experimental Environment

The GPU used for the experiments in this article is an RTX 3090. The Docker image is nvcr.io/nvidia/tensorflow:21.09-tf1-py3, which ships with TensorFlow 1.15.5, and the bert4keras version is 0.11.3. Other environments can be set up along the same lines, but please keep an experimental mindset and do not expect copy-and-paste calls to work without adjustment.

As a side note, cards like the 3090 and A100 require CUDA 11 or later, and Google's official TensorFlow 1.15 does not support CUDA 11. If you still want to use TensorFlow 1.x, you must use nvidia-tensorflow, which NVIDIA maintains itself, or the Docker images NVIDIA builds. Using the NVIDIA-maintained TensorFlow instead of Google's not only lets you run version 1.x on the latest GPUs but also brings additional optimizations made by NVIDIA. Detailed documentation can be found here.

Do not say things like "TensorFlow is already at version 2.8, why are you still using 1.15?" Your graphics card is manufactured by NVIDIA, so which version of TensorFlow is best is not up to you or me, or even Google; it’s up to NVIDIA. Since NVIDIA is still maintaining 1.15, it means 1.15 is the GOAT (Greatest of All Time).

Mixed Precision

First, let’s look at mixed precision training. Simply put, model computations use FP16, while parameters are stored and updated in FP32. The magnitudes FP16 can represent range from approximately $6 \times 10^{-8}$ to $65504$, and both ends are limits we can realistically hit when implementing models. The biggest problems introduced by FP16 are therefore overflow and precision loss. For a more detailed introduction to the principles, you can search for them yourself; this article focuses on how to use it.
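
As a rough illustration of these limits, the sketch below rounds ordinary Python floats to half precision with the standard library's struct module (format code 'e'). The helper name to_fp16 is made up for this example and is not part of TensorFlow or bert4keras; mapping out-of-range values to infinity is an assumption that mimics what floating-point hardware does.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; mimic hardware and return +/-inf
        return math.copysign(math.inf, x)

fp16_max = to_fp16(65504.0)      # largest finite FP16 value
fp16_tiny = to_fp16(2.0 ** -24)  # smallest positive FP16 value, about 6e-8

overflowed = to_fp16(70000.0)    # above the upper bound: becomes inf
underflowed = to_fp16(1e-8)      # below the lower bound: becomes 0.0
rounded = to_fp16(0.1)           # inside the range, but only about 3 decimal digits survive

print(fp16_max, fp16_tiny, overflowed, underflowed, rounded)
```

The first two failure modes correspond to overflow, the third to precision loss.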

The introduction to mixed precision training in the nvidia-tensorflow help documentation can be found here. The simplest way to enable mixed precision training is to add environment variables at the beginning of the script:

import os
os.environ['TF_KERAS'] = '1'  # Must use tf.keras
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE'] = '1'  # Mixed precision training

Readers might notice that most tutorials introduce TF_ENABLE_AUTO_MIXED_PRECISION, whereas I use TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE. The difference is that the former automatically adds "Dynamic Loss Scaling," while the latter does not. However, my tests found that "Dynamic Loss Scaling" cannot replace manual loss adjustment, so I decided to forgo this feature entirely.

After adding the environment variables, rerun the training script and see what happens. If NaN appears as soon as training starts, you can adjust the infinity and epsilon values:

from bert4keras.backend import K
K.set_infinity(1e4)
K.set_epsilon(1e-5)
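
To see why smaller constants help, here is a minimal sketch in plain Python. It assumes the defaults are around 1e12 for infinity and 1e-7 for epsilon (check your installed version), and it uses a generic additive masking pattern, score - (1 - keep) * infinity, purely as an illustration; to_fp16 and masked_score are hypothetical helpers, not bert4keras functions.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest half-precision value (inf on overflow)."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def masked_score(score, keep, infinity):
    """Additive masking where the 'infinity' constant has been rounded to FP16."""
    return score - (1.0 - keep) * to_fp16(infinity)

# Assumed default of 1e12: it overflows to inf, and 0 * inf is NaN even for kept positions
bad = masked_score(0.5, keep=1.0, infinity=1e12)

# With 1e4 the constant stays finite: kept positions are untouched, masked ones pushed far down
good_kept = masked_score(0.5, keep=1.0, infinity=1e4)
good_masked = masked_score(0.5, keep=0.0, infinity=1e4)

# Epsilon: 1e-7 lands on the coarse bottom of the FP16 range, 1e-5 is represented far better
eps_default = to_fp16(1e-7)
eps_new = to_fp16(1e-5)

print(bad, good_kept, good_masked, eps_default, eps_new)
```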

After these adjustments, NaN usually won’t appear right at the start (if it does, check whether other parts of the model use infinity and epsilon values that are not controlled by these two functions, and modify them). However, the loss may still decrease first, then increase, and finally become NaN. This happens when poor initialization, or an intentional design such as DeepNet, makes the gradients of some parameters extremely small (less than $10^{-8}$). Such values are below the smallest number FP16 can represent, so these gradients become exactly 0 and the corresponding parameters are not updated; put differently, the gradients are inaccurate. Updating with inaccurate gradients over a long period easily leads to non-convergence.

The solution here is "Loss Scaling." We can directly multiply the loss function by a scaling factor (e.g., 1000; you can tune this yourself, and the larger the better as long as NaN does not occur). This scales originally tiny gradients into the FP16 range, preventing them from being zeroed out and avoiding precision loss. Optimizers we commonly use, such as Adam and LAMB, normalize each update by the magnitude of the gradient itself, so multiplying the loss function by a constant leaves their training process unchanged (up to the optimizer's small epsilon term), meaning they are fully compatible with "Loss Scaling."
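
Both halves of this argument can be checked with a small sketch in plain Python. It is not the bert4keras or TensorFlow implementation: to_fp16 and adam_steps are hypothetical helpers, the gradients are made-up scalars, and the Adam epsilon is set to 0 so that the invariance is exact (with a small nonzero epsilon it holds only approximately).

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest half-precision value (inf on overflow)."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the Adam parameter updates for a sequence of scalar gradients."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

scale = 1000.0
tiny_grad = 1e-9

lost = to_fp16(tiny_grad)          # underflows to 0.0: the parameter would never move
kept = to_fp16(tiny_grad * scale)  # after scaling, a nonzero FP16 value close to 1e-6

grads = [0.3, -0.1, 0.25, 0.05]
plain = adam_steps(grads)
scaled = adam_steps([scale * g for g in grads])
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(plain, scaled))

print(lost, kept, same)
```

The scale factor cancels between the numerator and the square root in the denominator, which is why the two update sequences agree.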

In fact, I have found that the "Loss Scaling" technique is effective not only in mixed precision training scenarios but also provides some benefit in full FP32 precision training. In full FP32 training, if loss scaling is not performed, the model may stay at a certain loss value for a while at the beginning before slowly decreasing; if loss scaling is performed, the model maintains a slow downward trend from the start, resulting in relatively faster convergence.

Algebraic Acceleration

Now let’s look at XLA, which stands for "Accelerated Linear Algebra" and is designed specifically to speed up linear algebra operations. Simply put, XLA compiles and optimizes the computation graph before it is run, merging operators that can be merged (reducing intermediate variables to save memory) and parallelizing operators that can be parallelized (increasing computation speed).

In nvidia-tensorflow, the simplest way to enable XLA is still by adding environment variables:

import os
os.environ['TF_KERAS'] = '1'  # Must use tf.keras
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=1'  # Enable XLA

However, note that XLA does not guarantee an improvement. As mentioned, XLA tries to parallelize operators as much as possible; obviously, this is a strategy of trading space for time. Therefore, enabling XLA may consume more VRAM, leading to OOM (Out of Memory), or even performance degradation if the parallel clusters are too large. The official documentation provides a detailed analysis of possible anomalies and offers corresponding suggestions. Among them, the solution I recommend is to add the --tf_xla_enable_lazy_compilation=false parameter:

import os
os.environ['TF_KERAS'] = '1'  # Must use tf.keras
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=1'  # Enable XLA
os.environ['TF_XLA_FLAGS'] += ' --tf_xla_enable_lazy_compilation=false'  # Optimize XLA

If this still doesn’t solve the problem, switch to XLA Lite:

import os
os.environ['TF_KERAS'] = '1'  # Must use tf.keras
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=fusible'  # Enable XLA Lite

If even XLA Lite cannot solve the issue, it basically means XLA is not suitable for your model.
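
The three configurations above can be gathered into one small helper so that stepping down from one to the next is a one-word change. This is only a sketch: the names XLA_FLAG_LEVELS and configure_xla are hypothetical, and the flag strings are exactly the ones shown above. Like the snippets above, it is meant to run at the very beginning of the training script.

```python
import os

# Flag strings taken from the text above, ordered from most to least aggressive
XLA_FLAG_LEVELS = {
    'full': '--tf_xla_auto_jit=1',
    'full_no_lazy': '--tf_xla_auto_jit=1 --tf_xla_enable_lazy_compilation=false',
    'lite': '--tf_xla_auto_jit=fusible',
}

def configure_xla(level):
    """Set TF_XLA_FLAGS for the given level; 'off' removes the variable again."""
    if level == 'off':
        os.environ.pop('TF_XLA_FLAGS', None)
        return None
    flags = XLA_FLAG_LEVELS[level]
    os.environ['TF_KERAS'] = '1'  # Must use tf.keras
    os.environ['TF_XLA_FLAGS'] = flags
    return flags

# Try 'full' first; on OOM or slowdown fall back to 'full_no_lazy', then 'lite', then 'off'
active_flags = configure_xla('full')
print(active_flags)
```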

Performance Comparison

On a 3090, enabling mixed precision training brings a speedup of a bit more than 10%. This gain may be smaller than one might expect. I speculate this is because, on newer cards like the 3090 and A100, FP32 computation by default actually uses a format called TF32 (refer to here). TF32 is, in a sense, already a "half-precision format" and is faster than standard FP32. In other words, FP32 on the 3090 has effectively already received some half-precision optimization and is inherently faster, so the additional improvement from switching to mixed precision is relatively small.

As for the improvement brought by XLA, it is roughly 15%. In my training script, directly setting the environment variable TF_XLA_FLAGS to --tf_xla_auto_jit=1 caused an OOM; adding --tf_xla_enable_lazy_compilation=false resulted in the same, while changing it to --tf_xla_auto_jit=fusible allowed for normal training.

Finally, and most importantly, mixed precision and XLA can be used together! Using both together brings a speedup of about 30%, and the addition of mixed precision training basically offsets the increase in VRAM consumption caused by XLA. The two truly complement each other.
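
Put together, a combined setup might look like the sketch below. It assumes the XLA Lite setting, since that is the one that trained normally in the script described above; other models may tolerate full XLA. The last three lines are only back-of-the-envelope arithmetic with the approximate gains quoted above as lower-bound assumptions, not a measurement.

```python
import os

os.environ['TF_KERAS'] = '1'  # Must use tf.keras
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE'] = '1'  # Mixed precision training
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=fusible'  # Enable XLA Lite

# If the two gains simply compounded, the combined speedup would be about 1.27x,
# in the same ballpark as the roughly 30% observed
mixed_precision_speedup = 1.10  # "a bit more than 10%"
xla_speedup = 1.15              # "roughly 15%"
combined_if_independent = mixed_precision_speedup * xla_speedup
print(round(combined_if_independent, 3))
```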

Summary

This article introduced attempts to use mixed precision and XLA to accelerate training in bert4keras. Enabling both simultaneously can achieve a speedup of about 30% on a 3090.

When reposting, please include the original address of this article: https://kexue.fm/archives/9059
