Large language models (LLMs) are AI systems trained on vast amounts of text data, giving them advanced and flexible capabilities to understand, generate, and reason about natural language. LLM training has advanced significantly in recent years, with organizations pushing the boundaries of model size, performance, and efficiency. In this post, we explore how FP8 optimization can significantly speed up training of large models on Amazon SageMaker P5 instances.
LLM training with SageMaker P5
In 2023, SageMaker announced P5 instances, which support up to eight of the latest NVIDIA H100 Tensor Core GPUs. Equipped with high-bandwidth networking technologies such as Elastic Fabric Adapter (EFA), P5 instances provide a powerful platform for distributed training, allowing large models to be trained in parallel across multiple nodes. With Amazon SageMaker model training, organizations can achieve higher training speeds and efficiency on P5 instances, demonstrating the potential of SageMaker Training to train models at different scales faster and more efficiently.
LLM training with FP8
P5 instances, built on NVIDIA H100 GPUs, also support training models with FP8 precision. The FP8 data type has emerged as a game changer in LLM training: it reduces the precision of model weights and activations, allowing for more efficient memory usage and faster computation without significantly impacting model quality. The throughput of matrix operations such as multiplication and convolution is significantly higher with 8-bit floating-point tensors than with 32-bit ones. FP8's reduced precision shrinks the data footprint and computational requirements, making it ideal for large models where memory and speed are critical. This allows researchers to train larger models with the same hardware resources, or to train models faster while maintaining comparable quality. To make models FP8-compatible, NVIDIA released the Transformer Engine (TE) library, which provides replacements for several layers, including Linear, LayerNorm, and DotProductAttention. To enable FP8 training, you must use these TE layers when your model is cast to FP8. For example, the following Python code shows how to integrate the FP8-compatible layers, falling back to standard PyTorch layers when TE is unavailable.
try:
    import transformer_engine.pytorch as te
    using_te = True
except ImportError:
    using_te = False

...

# Use Transformer Engine layers when available; fall back to standard PyTorch
linear_type: nn.Module = te.Linear if using_te else nn.Linear

...

in_proj = linear_type(dim, 3 * n_heads * head_dim, bias=False, device="cuda")
out_proj = linear_type(n_heads * head_dim, dim, bias=False, device="cuda")
Results
We performed tests on LLMs with 1B and 7B parameters, running training with and without FP8. Each test ran for one epoch over 24 billion tokens, comparing throughput (tokens per second per GPU) and model performance (loss). For the 1B parameter model, we compared performance with and without FP8 using different numbers of instances for distributed training. The following table summarizes the results.
Number of P5 nodes | Without FP8: tokens/sec/GPU | Without FP8: % decrease | Without FP8: loss after 1 epoch | With FP8: tokens/sec/GPU | With FP8: % decrease | With FP8: loss after 1 epoch | % faster with FP8 | % higher loss with FP8
1 | 40200 | – | 6.205 | 40800 | – | 6.395 | 1.49 | 3.06
2 | 38500 | 4.2288 | 6.211 | 41600 | -3.4825 | 6.338 | 8.05 | 2.04
4 | 39500 | 1.7412 | 6.244 | 42000 | -4.4776 | 6.402 | 6.32 | 2.53
8 | 38200 | 4.9751 | 6.156 | 41800 | -3.98 | 6.365 | 9.42 | 3.39
16 | 35500 | 11.6915 | 6.024 | 39500 | 1.7412 | 6.223 | 11.26 | 3.3
32 | 33500 | 16.6667 | 6.112 | 38000 | 5.4726 | 6.264 | 13.43 | 2.48
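As a quick sanity check, the "% faster with FP8" column can be recomputed from the raw throughput numbers. A small Python snippet using the 32-node row of the table above:

```python
# Recompute "% faster with FP8" from the raw 1B-model throughput numbers
# (32-node row: 33500 tokens/sec/GPU without FP8, 38000 with FP8).
def pct_faster(fp8_tps: float, base_tps: float) -> float:
    return (fp8_tps - base_tps) / base_tps * 100

print(f"{pct_faster(38000, 33500):.2f}")  # -> 13.43, matching the table
```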
The following graph shows the throughput performance of the 1B parameter model in terms of tokens/second/GPU on different numbers of P5 instances.
For the 7B parameter model, we similarly compared performance with and without FP8 across different numbers of instances for distributed training. The following table summarizes the results.
Number of P5 nodes | Without FP8: tokens/sec/GPU | Without FP8: % decrease | Without FP8: loss after 1 epoch | With FP8: tokens/sec/GPU | With FP8: % decrease | With FP8: loss after 1 epoch | % faster with FP8 | % higher loss with FP8
1 | 9350 | – | 6.595 | 11000 | – | 6.602 | 15 | 0.11
2 | 9400 | -0.5347 | 6.688 | 10750 | 2.2935 | 6.695 | 12.56 | 0.1
4 | 9300 | 0.5347 | 6.642 | 10600 | 3.6697 | 6.634 | 12.26 | -0.12
8 | 9250 | 1.0695 | 6.612 | 10400 | 4.9541 | 6.652 | 11.06 | 0.6
16 | 8700 | 6.9518 | 6.594 | 10100 | 8.7155 | 6.644 | 13.86 | 0.76
32 | 7900 | 15.508 | 6.523 | 9700 | 11.8182 | 6.649 | 18.56 | 1.93
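The loss penalty can be recomputed the same way. Using the 32-node row of the 7B table above:

```python
# Recompute "% higher loss with FP8" from the raw 7B-model loss values
# (32-node row: 6.523 without FP8, 6.649 with FP8).
def pct_higher_loss(fp8_loss: float, base_loss: float) -> float:
    return (fp8_loss - base_loss) / base_loss * 100

print(f"{pct_higher_loss(6.649, 6.523):.2f}")  # -> 1.93, matching the table
```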
The following graph shows the throughput performance of the 7B parameter model in terms of tokens/second/GPU on different numbers of P5 instances.
The tables above show that with FP8, training the 1B model is up to 13% faster and training the 7B model is up to 18% faster. Because FP8 reduces precision, the tradeoff is generally a somewhat slower loss reduction. However, the impact on model performance after one epoch is minimal, with FP8 increasing the loss by approximately 3% for the 1B model and 2% for the 7B model compared to training without FP8. The following graph shows the loss performance.
As we discussed in Scalable Multi-Node Training with TensorFlow, due to inter-node communication, we observe a slight decrease in overall throughput as the number of nodes increases.
Implications for LLM training and beyond
Using FP8 precision in combination with SageMaker P5 instances has a significant impact on the field of LLM training. By demonstrating the feasibility and effectiveness of this approach, it paves the way for other researchers and organizations to adopt similar techniques, accelerating progress in large-scale model training. The benefits of FP8 and advanced hardware also extend beyond LLM training: these advances enable the training of larger, more complex models in less time and with fewer resources, ultimately saving time and money and accelerating research in areas such as computer vision and reinforcement learning. For inference, models with FP8 activations have been shown to deliver up to a twofold throughput improvement over BF16 models.
Conclusion
The adoption of FP8 precision and SageMaker P5 instances marks an important milestone in the evolution of LLM training. These advances push the limits of model size, training speed, and efficiency, opening new possibilities for research and innovation in large-scale models. As the AI community builds on these technological advances, we can expect even greater progress in the future. Ongoing research is exploring further improvements through techniques such as PyTorch 2.0 Fully Sharded Data Parallel (FSDP) and torch.compile; combining these with FP8 training could lead to even faster and more efficient LLM training. If you are interested in the potential impact of FP8, experimenting with 1B or 7B models such as GPT-Neo or Meta Llama 2 on a SageMaker P5 instance can provide valuable insight into the performance differences compared to FP16 and FP32.
About the authors
Romil Shah is a Senior Data Scientist at AWS Professional Services. He has over eight years of experience in computer vision, machine learning, generative AI, and IoT edge devices, and works with customers to train, optimize, and deploy foundation models on edge devices and in the cloud.
Mike Garrison is a Global Solutions Architect based in Ypsilanti, Michigan. With 20 years of experience, he helps automotive companies accelerate innovation. In his free time, he enjoys playing video games and traveling.