In recent years, there has been a significant increase in the size of large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models with more parameters, on the order of hundreds of billions at the time of writing, tend to produce better results. For example, Llama-3-70B scores better on metrics such as reading comprehension than the smaller 8B parameter version (SQuAD 85.6 vs. 76.4). As such, customers often try larger and newer models to build ML-based products that deliver value.
However, larger models are more computationally intensive and more expensive to deploy. For example, on AWS Trainium, Llama-3-70B has a median per-token latency of 21.4 ms, while Llama-3-8B takes 4.7 ms. Similarly, Llama-2-70B has a median per-token latency of 20.6 ms, while Llama-2-7B takes 3.7 ms. Customers must therefore weigh model size against the performance their users need. In this blog post, we explain how AWS Inferentia and Trainium make inference for large language models more computationally efficient and cost-effective through speculative sampling, a technique that improves both throughput and time per output token (TPOT) for LLM inference.
Introduction
Modern language models are based on the transformer architecture. An input prompt is first processed using a technique called context encoding, which is highly parallelizable and therefore runs fast. Then, autoregressive token generation is performed, where output tokens are generated one at a time. Note that the next token cannot be generated until the previous one is known, as shown in Figure 1. Therefore, to generate N output tokens, the decoder must be run N times in succession. Each run takes longer for larger models, such as Llama-3-70B, than for smaller models, such as Llama-3-8B.
From a computational perspective, token generation in LLMs is a memory-bandwidth-bound process: the larger the model, the more time is spent waiting for weight transfers from memory, leaving the compute units underutilized and unable to take full advantage of the available floating-point operations (FLOPS).
Speculative Sampling
Speculative sampling is a technique that improves the computational efficiency of LLM inference while maintaining accuracy. It uses a smaller, faster draft model to generate multiple candidate tokens, which are then validated by the larger, slower target model. The validation step processes multiple tokens in a single forward pass rather than one at a time, so more tokens are multiplied by the same weight tensors once those weights have been loaded from memory. This raises arithmetic intensity, improves utilization of the hardware's compute resources, and improves performance compared with non-speculative execution, which is typically memory-bandwidth limited.
The speculation process uses an adjustable window k, in which the target model provides one guaranteed correct token and the draft model speculates the next k-1 tokens. If the draft model's tokens are accepted, the process is sped up; if not, the target model takes over, so accuracy is preserved.
Figure 2 shows the case where all speculated tokens are accepted, speeding up the process: the target model provides a guaranteed output token, the draft model runs multiple times to produce a sequence of candidate output tokens, and the target model verifies them in a single pass, accepting them through a probabilistic method.
However, Figure 3 shows the case where some tokens are rejected. This speculative sampling loop takes the same amount of time to execute as the one in Figure 2, but it produces fewer output tokens, so the loop must be repeated more times to complete the response, slowing down overall processing.
By tuning the window size k and understanding when the draft and target models are likely to produce similar results, we can maximize the benefits of speculative sampling.
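To make the accept and reject mechanics concrete, the following is a minimal, self-contained Python sketch of the speculative sampling loop described in the DeepMind paper, using toy NumPy distributions in place of real draft and target models. All function names and the toy vocabulary are illustrative only and are not part of transformers-neuronx.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size

def toy_model(context, temperature):
    """Stand-in for a language model: returns a next-token distribution."""
    logits = np.cos(np.arange(VOCAB) * (len(context) + 1) / temperature)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def draft_p(context):
    return toy_model(context, temperature=3.0)   # small, fast model

def target_p(context):
    return toy_model(context, temperature=2.5)   # large, slow model

def speculative_step(context, k):
    """One speculation window: the draft proposes k - 1 tokens, the target verifies."""
    draft_tokens, draft_probs = [], []
    ctx = list(context)
    for _ in range(k - 1):                        # draft model runs sequentially
        p = draft_p(ctx)
        token = int(rng.choice(VOCAB, p=p))
        draft_tokens.append(token)
        draft_probs.append(p)
        ctx.append(token)
    # One (parallelizable) target pass scores every position in the window.
    target_probs = [target_p(list(context) + draft_tokens[:i]) for i in range(k)]
    accepted = []
    for i, token in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, q(token) / p(token)).
        if rng.random() < min(1.0, target_probs[i][token] / draft_probs[i][token]):
            accepted.append(token)
        else:
            # On rejection, resample from the adjusted target distribution and stop.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted
    # All draft tokens accepted: the target pass also yields one bonus token.
    accepted.append(int(rng.choice(VOCAB, p=target_probs[-1])))
    return accepted

context, k = [1, 2, 3], 4
while len(context) < 20:
    context += speculative_step(context, k)
print(context)
```

Each call to speculative_step costs roughly one target pass plus k-1 cheap draft passes, but can return up to k tokens, which is where the speedup comes from.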
Llama-2-70B/7B demonstration
We demonstrate speculative sampling on Amazon EC2 Inf2 instances (powered by Inferentia2) and EC2 Trn1 instances (powered by Trainium), using Llama-2-7B as the draft model and Llama-2-70B as the target model to generate text faster. This walkthrough is based on the Llama-2 models, but you can follow a similar process with the Llama-3 models.
Loading the models
Both Llama-2 models can be loaded using the bfloat16 data type. The draft model is loaded in the standard way, with a few parameters worth noting:
- n_positions: adjustable; the maximum sequence length allowed for generation.
- batch_size: at the time of writing, speculative sampling supports a batch size of 1.
- tp_degree: the tensor parallelism degree, which we explain later in this section.
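As a reference point, loading the draft model with transformers-neuronx might look like the following sketch. The checkpoint path and parameter values are placeholders, and the exact import path and keyword arguments may differ between transformers-neuronx versions, so verify them against the library's documentation.

```python
from transformers_neuronx import LlamaForSampling

# Draft model (Llama-2-7B), loaded the standard way in bfloat16.
draft_model = LlamaForSampling.from_pretrained(
    'Llama-2-7b',       # placeholder path to the draft model checkpoint
    n_positions=1024,   # maximum sequence length allowed for generation
    batch_size=1,       # speculative sampling supports batch size 1 at the time of writing
    tp_degree=32,       # tensor parallelism degree, discussed below
    amp='bf16',         # load the weights in bfloat16
)
draft_model.to_neuron()  # compile the model and load it onto the NeuronCores
```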
The target model must be loaded in a similar way, but with the speculative decoding feature enabled for a window of k tokens, where k is the speculation window described earlier.
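Continuing the sketch above, the target model is loaded with the same arguments and the speculative decoder is enabled before compilation; the values and the enable_speculative_decoder call are again assumptions to verify against your transformers-neuronx version.

```python
# Target model (Llama-2-70B) with speculative decoding enabled.
k = 4  # speculation window; tune this for your workload
target_model = LlamaForSampling.from_pretrained(
    'Llama-2-70b',      # placeholder path to the target model checkpoint
    n_positions=1024,
    batch_size=1,
    tp_degree=32,
    amp='bf16',
)
target_model.enable_speculative_decoder(k)  # validate k tokens per forward pass
target_model.to_neuron()
```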
Together, the two models require approximately 200 GB of device memory for weights, plus additional memory for the key-value (KV) cache. If you use the models with float32 parameters, they require approximately 360 GB of device memory. Note that the KV cache grows linearly with sequence length (input tokens plus generated tokens). Use neuron-top to view memory utilization live. To accommodate these memory requirements, you need either the largest Inf2 instance (inf2.48xlarge) or the largest Trn1 instance (trn1.32xlarge).
Because of the models' size, we must distribute the model weights across NeuronCores using a technique called tensor parallelism. In the provided examples, tp_degree is specified per model to set how many NeuronCores that model uses. A higher tp_degree increases memory bandwidth utilization, which is important for token generation performance and therefore improves throughput. On the Trn1 topology, tp_degree can be set to 1, 2, 8, 16, or a multiple of 32; on Inf2, it must be 1 or a multiple of 2.
The order in which you load the models also matters: once a set of NeuronCores has been initialized and assigned to one model, you cannot use those NeuronCores for another model unless it is the exact same set. If you try to use only some of the previously initialized NeuronCores, you get an nrt_load_collectives - global nec_comm is already init'd error.
To understand this better, let's look at two examples for trn1.32xlarge (32 NeuronCores). First, calculate the number of NeuronCores each model requires by dividing the model's memory footprint, as observed with neuron-top, by the 16 GB of device memory available per NeuronCore (a small helper illustrating this calculation follows the examples below).
- To run the models in bfloat16, Llama-2-70B requires 10 or more NeuronCores and Llama-2-7B requires 2 or more. Due to topology constraints, Llama-2-70B needs at least tp_degree=16, which would leave the remaining 16 NeuronCores for Llama-2-7B. However, both models fit in memory across all 32 NeuronCores, so setting tp_degree=32 for both speeds up inference for both models.
- To run the models in float32, Llama-2-70B requires 18 or more NeuronCores and Llama-2-7B requires 3 or more. Due to topology constraints, Llama-2-70B must use tp_degree=32. Because Llama-2-7B then has to reuse the exact same NeuronCore set, tp_degree=32 also applies to Llama-2-7B.
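To illustrate the calculation used in the examples above, here is a small helper; the memory footprints passed in are illustrative placeholders rather than measurements, so substitute the values you observe in neuron-top.

```python
import math

GB_PER_NEURONCORE = 16  # device memory per NeuronCore on trn1.32xlarge

def neuroncores_needed(footprint_gb):
    """Minimum NeuronCores for a model, given its footprint from neuron-top."""
    return math.ceil(footprint_gb / GB_PER_NEURONCORE)

# Illustrative bfloat16 footprints (replace with your observed values).
print(neuroncores_needed(150))  # e.g. ~150 GB for the 70B model -> 10 NeuronCores
print(neuroncores_needed(20))   # e.g. ~20 GB for the 7B model   -> 2 NeuronCores
```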
Walkthrough
The decoder we use from transformers-neuronx is LlamaForSampling, which is suitable for loading and running Llama models. Alternatively, we could use NeuronAutoModelForCausalLM, which auto-detects which decoder to use. To perform speculative sampling, we first create a speculative generator that takes the two models and the value of k explained earlier.
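A sketch of wiring the two models together follows; the import path for SpeculativeGenerator is an assumption about the transformers-neuronx package layout, so check it against the version you have installed.

```python
from transformers_neuronx.speculation import SpeculativeGenerator

# Combine the draft and target models with the speculation window k from earlier.
spec_gen = SpeculativeGenerator(draft_model, target_model, k)
```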
We then invoke the inference process by calling the generator's sample function.
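A minimal sketch of that call, assuming a Hugging Face tokenizer produces the input IDs; the tokenizer path, prompt, and sequence length are placeholders, and the exact sample signature should be verified against your transformers-neuronx version.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Llama-2-7b')  # placeholder path
prompt = "Explain speculative sampling in one paragraph."
input_ids = tokenizer(prompt, return_tensors='pt').input_ids

with torch.inference_mode():
    # sequence_length is the total length: input tokens plus generated tokens.
    output_ids = spec_gen.sample(input_ids, sequence_length=256)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```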
During sampling, several hyperparameters (for example, temperature, top_p, and top_k) affect whether the output is deterministic across multiple runs. At the time of writing, the speculative sampling implementation sets default values for these hyperparameters. With these values, expect results to vary when running the model multiple times, even for the same prompt. This is normal and intended behavior for LLMs, because the sampling randomness tends to improve the quality of their responses.
When you run the example, it uses a default token acceptor based on the DeepMind paper that introduced speculative sampling, which accepts tokens using a probabilistic method. However, you can also implement a custom token acceptor by passing the acceptor parameter when initializing the SpeculativeGenerator, for example if you want more deterministic responses. See the implementation of the DefaultTokenAcceptor class in transformers-neuronx for how to write your own.
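For illustration, a custom acceptor could accept a draft token only when it matches the target model's most likely token, which makes acceptance deterministic. The interface below, a callable receiving the draft token IDs plus the draft and target probability distributions, is an assumption; mirror the DefaultTokenAcceptor implementation in transformers-neuronx for the real signature.

```python
import torch

class GreedyTokenAcceptor:
    """Illustrative acceptor: keep draft tokens only while they match the
    target model's argmax. The call signature is an assumed interface;
    adapt it to DefaultTokenAcceptor in your transformers-neuronx version."""

    def __call__(self, draft_ids, draft_probs, target_probs):
        accepted = []
        for position, token in enumerate(draft_ids):
            best = int(torch.argmax(target_probs[position]))
            if best == int(token):
                accepted.append(int(token))  # target agrees: accept the draft token
            else:
                accepted.append(best)        # target disagrees: emit its token and stop
                break
        return torch.tensor(accepted)

# Hypothetical usage when building the generator:
# spec_gen = SpeculativeGenerator(draft_model, target_model, k,
#                                 acceptor=GreedyTokenAcceptor())
```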
Conclusion
As more developers consider incorporating LLMs into their applications, they face a choice: use a larger, more costly, and slower model that delivers higher-quality results, or a smaller, cheaper, and faster model that may produce lower-quality answers. Now, with AWS artificial intelligence (AI) chips and speculative sampling, developers no longer have to make that choice. They can take advantage of the high-quality output of larger models with the speed and responsiveness of smaller models.
In this blog post, we showed that a new feature called speculative sampling can be used to speed up inference on large-scale models such as Llama-2-70B.
To try it yourself, check out the speculative sampling example and tweak the input prompt and k parameter to see what results you get. For more advanced use cases, you can develop your own token acceptor implementation. For more information about running models on Inferentia and Trainium instances, see the AWS Neuron documentation. You can also visit the repost.aws AWS Neuron channel to discuss your experiments and share your ideas with the AWS Neuron community.
About the Authors
Syl Taylor is a Specialist Solutions Architect for Efficient Compute, advising customers across EMEA on Amazon EC2 cost optimization and on improving application performance using AWS-designed chips. Previously, he worked in AWS Professional Services on software development and AI/ML, designing and implementing cloud-native solutions. Based in the UK, he loves spending time in nature.
Emile Ayar is a Senior Tech Lead Solutions Architect on the AWS Prototyping team. He specializes in helping customers build ML and generative AI solutions and implement architectural best practices. He values agile innovation and rapid prototyping, helping customers experiment with solution architectures to achieve their business goals. He lives in Luxembourg and enjoys playing synthesizers.