Today, we are happy to announce support for AWS Trainium and AWS Inferentia for fine-tuning and inference of Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained, instruction-tuned generative models in sizes of 8B, 70B, and 405B. In a previous post, we described how to deploy Llama 3 models on AWS Trainium and Inferentia-based instances on Amazon SageMaker JumpStart. In this post, we show you how to fine-tune and deploy the Llama 3.1 family of models on AWS AI chips and realize their price-performance benefits.
Llama 3.1 Model Overview
The Llama 3.1 multilingual LLM family is a collection of pre-trained, instruction-tuned generative models of 8B, 70B, and 405B sizes (text input/text and code output). All models support long context lengths (128k) and are optimized for inference with support for Grouped Query Attention (GQA).
Llama 3.1 instruction-tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the published chat models on popular industry benchmarks. They are trained to generate tool calls for a few specific tools, covering capabilities such as search, image generation, code execution, and mathematical reasoning. Additionally, they support zero-shot tool use.
Llama 3.1 405B is the largest publicly available LLM in the world, according to Meta. The model sets a new standard in artificial intelligence (AI) and is ideal for enterprise-level applications and research and development. It is well suited to tasks such as synthetic data generation, where its outputs can be used to improve smaller Llama models after fine-tuning, and model distillation, which transfers knowledge from the 405B model to smaller models. The model excels at general knowledge, long-form text generation, multilingual machine translation, coding, mathematics, tool usage, enhanced contextual understanding, and advanced reasoning and decision-making.
Architecturally, the core LLMs in Llama 3 and Llama 3.1 have the same dense architecture: they are autoregressive language models that use an optimized transformer architecture, with the tuned version using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for usefulness and safety.
Meta’s Responsible Use Guide can help you take appropriate safeguards and implement any additional tweaks needed to customize and optimize your model.
Trainium powers Llama 3.1 with Amazon Bedrock and Amazon SageMaker
The fastest way to get started with Llama 3.1 on AWS is with Amazon Bedrock, powered by our purpose-built AI infrastructure, including AWS Trainium. Amazon Bedrock gives you the benefits of our purpose-built AI infrastructure through fully managed APIs, simplifying access to these powerful models so you can focus on building differentiated AI applications.
If you need more control over the underlying resources, you can fine-tune and deploy your Llama 3.1 models using SageMaker. Trainium support for Llama 3.1 on SageMaker JumpStart is coming soon.
AWS Trainium and AWS Inferentia2 deliver high performance and low cost for Llama 3.1 models
If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let’s see how you can get started with the new Llama 3.1 8B and 70B models on Trainium using the AWS Neuron SDK.
Fine-tuning Llama 3.1 with Trainium
To fine-tune Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library, which provides implementations of some of the most common distributed training and inference techniques. To get started, you can use the following samples:
Both samples are built on AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is an example Slurm command to start training Llama 3.1 70B:
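(A minimal sketch; the launcher script name and node count below are assumptions, not the exact command from the samples.)

```bash
# Submit the fine-tuning job to the Slurm cluster. The launcher script name
# (run_llama3.1_70B_tp_pp.sh) is a placeholder for the script in your sample.
sbatch --exclusive --nodes 8 \
    --wrap "srun bash $(pwd)/run_llama3.1_70B_tp_pp.sh"
```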
Your Slurm script launches a distributed training process across your cluster. The runner script it invokes loads the pre-trained weights and configuration provided by Meta and starts training.
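As a rough illustration of what such a runner script does (the script names, paths, and flags below are hypothetical, not taken from the Neuron samples), it typically converts Meta’s checkpoint into the sharded layout NeuronX Distributed expects and then launches one training process per NeuronCore on each node:

```bash
#!/usr/bin/env bash
# Hypothetical runner sketch: script names, paths, and flags are placeholders.

# Convert the Meta-released weights into the sharded checkpoint layout that
# NeuronX Distributed expects (done once, before training).
python convert_checkpoints.py \
    --input_dir /fsx/Meta-Llama-3.1-70B \
    --output_dir /fsx/pretrained_weights

# Launch one training process per NeuronCore on each node in the Slurm job
# (32 NeuronCores per trn1.32xlarge instance).
torchrun --nnodes "$SLURM_JOB_NUM_NODES" \
    --nproc_per_node 32 \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    run_llama_training.py \
    --pretrained_weight_dir /fsx/pretrained_weights \
    --tensor_parallel_size 32 \
    --pipeline_parallel_size 8
```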
Deploying Llama 3.1 on Trainium
When you’re ready to deploy your model, you can do so by updating the model ID in the Llama 3 8B Neuron example code from the previous post.
You can use the same example inference code:
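(A minimal sketch using the transformers-neuronx `LlamaForSampling` API; the model ID, `tp_degree`, and sampling settings are assumptions you should size to your instance.)

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling

# Model ID and parallelism settings are illustrative; set tp_degree to the
# number of NeuronCores you want to shard the model across.
model_id = "meta-llama/Meta-Llama-3.1-8B"
neuron_model = LlamaForSampling.from_pretrained(
    model_id, batch_size=1, tp_degree=8, amp="f16"
)
neuron_model.to_neuron()  # compile the model and load it onto the NeuronCores

tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer("Hello, I am a language model and", return_tensors="pt").input_ids

with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256, top_k=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```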
For detailed instructions, see the new Llama 3.1 example.
You can also use Hugging Face’s Optimum Neuron library to quickly deploy models to SageMaker directly from the Hugging Face Model Hub. On the Llama 3.1 model card in the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then run it.
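The Hub-generated code follows a pattern similar to the following sketch (the instance type and serving limits shown here, such as batch size and token counts, are illustrative assumptions):

```python
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Execution role for the endpoint; adjust the fallback role name to your account.
try:
    role = sagemaker.get_execution_role()
except ValueError:
    role = boto3.client("iam").get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Illustrative serving configuration; tune cores, batch size, and token
# limits to your instance type and workload.
hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Use the Hugging Face LLM serving container built for Neuron devices.
model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=role,
)

# Deploy to an Inf2 endpoint; first startup includes model compilation.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

print(predictor.predict({"inputs": "What is machine learning?"}))
```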
Additionally, if you want to deploy your model using vLLM, you can refer to the continuous batching guide to create your environment. After you create your environment, you can use vLLM to deploy Llama 3.1 8B or 70B models on AWS Trainium or Inferentia. The following is an example of deploying Llama 3.1 8B:
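(A minimal sketch using vLLM’s Neuron device support; the parallelism and length settings are assumptions to size for your instance.)

```python
from vllm import LLM, SamplingParams

# Sample prompts and sampling parameters.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM on the Neuron device; tensor_parallel_size and the length
# limits here are illustrative and should be sized to your instance.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=8,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```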
Conclusion
AWS Trainium and Inferentia provide high performance and low cost for fine-tuning and deploying Llama 3.1 models. We are excited to see how you use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the model samples and tutorials in the AWS Neuron documentation.
About the Authors
John Gray is a Senior Solutions Architect for Annapurna Labs at AWS, based in Seattle. In this role, John works with customers on their AI and machine learning use cases, designs solutions that cost-effectively solve their business problems, and helps them build scalable prototypes using AWS AI chips.
Pinak Panigrahi works with customers to build ML-driven solutions that solve strategic business problems on AWS. In his current role, he works on optimizing the training and inference of generative AI models on AWS AI chips.
Kamran Khan is the Head of Business Development for AWS Inferentia/Trainium at AWS, with over 10 years of experience helping customers deploy and optimize their deep learning training and inference workloads using AWS Inferentia and AWS Trainium.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.