Large language models (LLMs) are AI systems trained on vast amounts of text data, giving them advanced and flexible capabilities to understand, generate, and reason about natural language. LLM training has advanced significantly in recent years, with organizations pushing the boundaries of model size, performance, and efficiency. In this post, we explore how FP8 optimization can significantly speed up training of large models on Amazon SageMaker P5 instances.
LLM training with SageMaker P5
In 2023, SageMaker announced P5 instances, which support up to eight of the latest NVIDIA H100 Tensor Core GPUs. Equipped with high-bandwidth networking technologies such as Elastic Fabric Adapter (EFA), P5 instances provide a powerful platform for distributed training, allowing large models to be trained in parallel across multiple nodes. With Amazon SageMaker model training, organizations can achieve higher training speeds and efficiency on P5 instances, demonstrating the potential of SageMaker Training to train models at different scales faster and more efficiently.
LLM training with FP8
P5 instances, built on NVIDIA H100 GPUs, also support training models with FP8 precision. The FP8 data type has emerged as a game changer in LLM training: it reduces the precision of model weights and activations, allowing for more efficient memory usage and faster computation without significantly impacting model quality. The throughput of matrix operations such as multiplication and convolution is significantly higher with 8-bit floating-point tensors than with 32-bit ones. FP8's reduced precision shrinks the data footprint and computational requirements, making it ideal for large models where memory and speed are critical. This allows researchers to train larger models with the same hardware resources, or to train models faster while maintaining comparable quality. To make models FP8-compatible, NVIDIA released the Transformer Engine (TE) library, which provides replacements for several layers, including Linear, LayerNorm, and DotProductAttention. To enable FP8 training, you must use these TE layers when your model is cast to FP8. For example, the following Python code shows how to integrate the FP8-compatible layers, falling back to standard PyTorch layers when TE is unavailable.
try:
    import transformer_engine.pytorch as te
    using_te = True
except ImportError:
    using_te = False

...

# Use Transformer Engine layers when available; fall back to standard PyTorch
linear_type: nn.Module = te.Linear if using_te else nn.Linear

...

in_proj = linear_type(dim, 3 * n_heads * head_dim, bias=False, device="cuda")
out_proj = linear_type(n_heads * head_dim, dim, bias=False, device="cuda")
Results
We performed tests on LLMs with 1B and 7B parameters, running training with and without FP8. Each test ran for one epoch over 24 billion tokens, comparing throughput (tokens per second per GPU) and model performance (loss). For the 1B parameter model, we compared performance with and without FP8 using different numbers of instances for distributed training. The following table summarizes the results.
Number of P5 nodes | Without FP8: tokens/sec/GPU | Without FP8: % decrease | Without FP8: loss after 1 epoch | With FP8: tokens/sec/GPU | With FP8: % decrease | With FP8: loss after 1 epoch | % faster with FP8 | % higher loss with FP8
1 | 40200 | – | 6.205 | 40800 | – | 6.395 | 1.49 | 3.06
2 | 38500 | 4.2288 | 6.211 | 41600 | -3.4825 | 6.338 | 8.05 | 2.04
4 | 39500 | 1.7412 | 6.244 | 42000 | -4.4776 | 6.402 | 6.32 | 2.53
8 | 38200 | 4.9751 | 6.156 | 41800 | -3.98 | 6.365 | 9.42 | 3.39
16 | 35500 | 11.6915 | 6.024 | 39500 | 1.7412 | 6.223 | 11.26 | 3.3
32 | 33500 | 16.6667 | 6.112 | 38000 | 5.4726 | 6.264 | 13.43 | 2.48
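As a quick sanity check, the "% faster with FP8" column can be recomputed from the raw throughput numbers. A small Python snippet using the 32-node row of the table above:

```python
# Recompute "% faster with FP8" from the raw 1B-model throughput numbers
# (32-node row: 33500 tokens/sec/GPU without FP8, 38000 with FP8).
def pct_faster(fp8_tps: float, base_tps: float) -> float:
    return (fp8_tps - base_tps) / base_tps * 100

print(f"{pct_faster(38000, 33500):.2f}")  # -> 13.43, matching the table
```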
The following graph shows the throughput performance of the 1B parameter model in terms of tokens/second/GPU on different numbers of P5 instances.
For the 7B parameter model, we similarly compared performance with and without FP8 across different numbers of instances for distributed training. The following table summarizes the results.
Number of P5 nodes | Without FP8: tokens/sec/GPU | Without FP8: % decrease | Without FP8: loss after 1 epoch | With FP8: tokens/sec/GPU | With FP8: % decrease | With FP8: loss after 1 epoch | % faster with FP8 | % higher loss with FP8
1 | 9350 | – | 6.595 | 11000 | – | 6.602 | 15 | 0.11
2 | 9400 | -0.5347 | 6.688 | 10750 | 2.2935 | 6.695 | 12.56 | 0.1
4 | 9300 | 0.5347 | 6.642 | 10600 | 3.6697 | 6.634 | 12.26 | -0.12
8 | 9250 | 1.0695 | 6.612 | 10400 | 4.9541 | 6.652 | 11.06 | 0.6
16 | 8700 | 6.9518 | 6.594 | 10100 | 8.7155 | 6.644 | 13.86 | 0.76
32 | 7900 | 15.508 | 6.523 | 9700 | 11.8182 | 6.649 | 18.56 | 1.93
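The loss penalty can be recomputed the same way. Using the 32-node row of the 7B table above:

```python
# Recompute "% higher loss with FP8" from the raw 7B-model loss values
# (32-node row: 6.523 without FP8, 6.649 with FP8).
def pct_higher_loss(fp8_loss: float, base_loss: float) -> float:
    return (fp8_loss - base_loss) / base_loss * 100

print(f"{pct_higher_loss(6.649, 6.523):.2f}")  # -> 1.93, matching the table
```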
The following graph shows the throughput performance of the 7B parameter model in terms of tokens/second/GPU on different numbers of P5 instances.
The tables above show that with FP8, training the 1B model is up to 13% faster and training the 7B model is up to 18% faster. Because FP8 reduces precision, the tradeoff is generally a somewhat slower loss reduction. However, the impact on model performance after one epoch is minimal, with FP8 increasing the loss by approximately 3% for the 1B model and 2% for the 7B model compared to training without FP8. The following graph shows the loss performance.
As we discussed in Scalable Multi-Node Training with TensorFlow, due to inter-node communication, we observe a slight decrease in overall throughput as the number of nodes increases.
Implications for LLM training and beyond
Using FP8 precision in combination with SageMaker P5 instances has a significant impact on the field of LLM training. By demonstrating the feasibility and effectiveness of this approach, it paves the way for other researchers and organizations to adopt similar techniques, accelerating progress in large-scale model training. The benefits of FP8 and advanced hardware also extend beyond LLM training: these advances enable the training of larger, more complex models in less time and with fewer resources, ultimately saving time and money and accelerating research in areas such as computer vision and reinforcement learning. For inference, models with FP8 activations have been shown to deliver up to a twofold throughput improvement over BF16 models.
Conclusion
The adoption of FP8 precision and SageMaker P5 instances marks an important milestone in the evolution of LLM training. These advances push the limits of model size, training speed, and efficiency, opening new possibilities for research and innovation in large-scale models. As the AI community builds on these technological advances, we can expect even greater progress in the future. Ongoing research is exploring further improvements through techniques such as PyTorch 2.0 Fully Sharded Data Parallel (FSDP) and torch.compile; combining these with FP8 training could lead to even faster and more efficient LLM training. If you are interested in the potential impact of FP8, experimenting with 1B or 7B models such as GPT-Neo or Meta Llama 2 on a SageMaker P5 instance can provide valuable insight into the performance differences compared to FP16 and FP32.
About the authors
Romil Shah is a Senior Data Scientist at AWS Professional Services. He has over eight years of experience in computer vision, machine learning, generative AI, and IoT edge devices, and works with customers to train, optimize, and deploy foundation models on edge devices and in the cloud.
Mike Garrison is a Global Solutions Architect based in Ypsilanti, Michigan. With 20 years of experience, he helps automotive companies accelerate innovation. In his free time, he enjoys playing video games and traveling.