Today, we are happy to announce support for AWS Trainium and AWS Inferentia for fine-tuning and inference of Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained, instruction-tuned generative models in sizes of 8B, 70B, and 405B. In a previous post, we described how to deploy Llama 3 models on AWS Trainium and Inferentia-based instances on Amazon SageMaker JumpStart. In this post, we show you how to fine-tune and deploy the Llama 3.1 family of models on AWS AI chips and realize their price-performance benefits.
Llama 3.1 Model Overview
The Llama 3.1 multilingual LLM family is a collection of pre-trained, instruction-tuned generative models of 8B, 70B, and 405B sizes (text input/text and code output). All models support long context lengths (128k) and are optimized for inference with support for Grouped Query Attention (GQA).
Llama 3.1 instruction-tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the published chat models on popular industry benchmarks. They are trained to generate tool calls for a few specific tools, covering capabilities such as search, image generation, code execution, and mathematical reasoning. Additionally, they support zero-shot tool use.
Llama 3.1 405B is the largest publicly available LLM in the world, according to Meta. The model sets a new standard in artificial intelligence (AI) and is ideal for enterprise-level applications and research and development. It is well suited to tasks such as synthetic data generation, where its outputs can be used to improve smaller Llama models after fine-tuning, and model distillation, which transfers knowledge from the 405B model to smaller models. The model excels at general knowledge, long-form text generation, multilingual machine translation, coding, mathematics, tool usage, enhanced contextual understanding, and advanced reasoning and decision-making.
Architecturally, the core LLMs in Llama 3 and Llama 3.1 have the same dense architecture: they are autoregressive language models that use an optimized transformer architecture, with the tuned version using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for usefulness and safety.
Meta’s Responsible Use Guide can help you take appropriate safeguards and implement any additional tweaks needed to customize and optimize your model.
Trainium powers Llama 3.1 with Amazon Bedrock and Amazon SageMaker
The fastest way to get started with Llama 3.1 on AWS is with Amazon Bedrock, powered by our purpose-built AI infrastructure, including AWS Trainium. Amazon Bedrock gives you the benefits of our purpose-built AI infrastructure through fully managed APIs, simplifying access to these powerful models so you can focus on building differentiated AI applications.
If you need more control over the underlying resources, you can fine-tune and deploy your Llama 3.1 models using SageMaker. Trainium support for Llama 3.1 on SageMaker JumpStart is coming soon.
AWS Trainium and AWS Inferentia2 deliver high performance and low cost for Llama 3.1 models
If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let’s see how you can get started with the new Llama 3.1 8B and 70B models on Trainium using the AWS Neuron SDK.
Fine-tuning Llama 3.1 with Trainium
To fine-tune Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library, which provides implementations of some of the most common distributed training and inference techniques. To get started, you can use the following samples:
Both samples are built on AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is an example Slurm command to start training Llama 3.1 70B:
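(A minimal sketch; the launcher script name and node count below are assumptions, not the exact command from the samples.)

```bash
# Submit the fine-tuning job to the Slurm cluster. The launcher script name
# (run_llama3.1_70B_tp_pp.sh) is a placeholder for the script in your sample.
sbatch --exclusive --nodes 8 \
    --wrap "srun bash $(pwd)/run_llama3.1_70B_tp_pp.sh"
```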
Your Slurm script launches a distributed training process across your cluster. The runner script it invokes loads the pre-trained weights and configuration provided by Meta and starts training.
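As a rough illustration of what such a runner script does (the script names, paths, and flags below are hypothetical, not taken from the Neuron samples), it typically converts Meta’s checkpoint into the sharded layout NeuronX Distributed expects and then launches one training process per NeuronCore on each node:

```bash
#!/usr/bin/env bash
# Hypothetical runner sketch: script names, paths, and flags are placeholders.

# Convert the Meta-released weights into the sharded checkpoint layout that
# NeuronX Distributed expects (done once, before training).
python convert_checkpoints.py \
    --input_dir /fsx/Meta-Llama-3.1-70B \
    --output_dir /fsx/pretrained_weights

# Launch one training process per NeuronCore on each node in the Slurm job
# (32 NeuronCores per trn1.32xlarge instance).
torchrun --nnodes "$SLURM_JOB_NUM_NODES" \
    --nproc_per_node 32 \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    run_llama_training.py \
    --pretrained_weight_dir /fsx/pretrained_weights \
    --tensor_parallel_size 32 \
    --pipeline_parallel_size 8
```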
Deploying Llama 3.1 on Trainium
When you’re ready to deploy your model, you can do so by updating the model ID in the Llama 3 8B Neuron example code from the previous post.
You can use the same example inference code:
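(A minimal sketch using the transformers-neuronx `LlamaForSampling` API; the model ID, `tp_degree`, and sampling settings are assumptions you should size to your instance.)

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling

# Model ID and parallelism settings are illustrative; set tp_degree to the
# number of NeuronCores you want to shard the model across.
model_id = "meta-llama/Meta-Llama-3.1-8B"
neuron_model = LlamaForSampling.from_pretrained(
    model_id, batch_size=1, tp_degree=8, amp="f16"
)
neuron_model.to_neuron()  # compile the model and load it onto the NeuronCores

tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer("Hello, I am a language model and", return_tensors="pt").input_ids

with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256, top_k=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```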
For detailed instructions, see the new Llama 3.1 example.
You can also use Hugging Face’s Optimum Neuron library to quickly deploy models to SageMaker directly from the Hugging Face Model Hub. On the Llama 3.1 model card in the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then run it.
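The Hub-generated code follows a pattern similar to the following sketch (the instance type and serving limits shown here, such as batch size and token counts, are illustrative assumptions):

```python
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Execution role for the endpoint; adjust the fallback role name to your account.
try:
    role = sagemaker.get_execution_role()
except ValueError:
    role = boto3.client("iam").get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Illustrative serving configuration; tune cores, batch size, and token
# limits to your instance type and workload.
hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Use the Hugging Face LLM serving container built for Neuron devices.
model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=role,
)

# Deploy to an Inf2 endpoint; first startup includes model compilation.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

print(predictor.predict({"inputs": "What is machine learning?"}))
```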
Additionally, if you want to deploy your model using vLLM, you can refer to the continuous batching guide to create your environment. After you create your environment, you can use vLLM to deploy Llama 3.1 8B or 70B models on AWS Trainium or Inferentia. The following is an example of deploying Llama 3.1 8B:
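(A minimal sketch using vLLM’s Neuron device support; the parallelism and length settings are assumptions to size for your instance.)

```python
from vllm import LLM, SamplingParams

# Sample prompts and sampling parameters.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM on the Neuron device; tensor_parallel_size and the length
# limits here are illustrative and should be sized to your instance.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=8,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```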
Conclusion
AWS Trainium and Inferentia provide high performance and low cost for fine-tuning and deploying Llama 3.1 models. We are excited to see how you use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the model samples and tutorials in the AWS Neuron documentation.
About the Authors
John Gray is a Senior Solutions Architect for Annapurna Labs at AWS, based in Seattle. In this role, John works with customers on their AI and machine learning use cases, designs solutions that cost-effectively solve their business problems, and helps them build scalable prototypes using AWS AI chips.
Pinak Panigrahi works with customers to build ML-driven solutions that solve strategic business problems on AWS. In his current role, he works on optimizing the training and inference of generative AI models on AWS AI chips.
Kamran Khan is the Head of Business Development for AWS Inferentia/Trainium at AWS, with over 10 years of experience helping customers deploy and optimize their deep learning training and inference workloads using AWS Inferentia and AWS Trainium.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.