The use of large language models (LLMs) and generative AI has exploded over the past year. With the release of powerful, publicly available foundation models, the tools to train, fine-tune, and host your own LLM have also become widely accessible. Using vLLM on AWS Trainium and Inferentia, you can host LLMs for high-performance inference and scalability.
This post shows you how to quickly deploy Meta’s latest Llama models using vLLM on Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances. This example uses the 1B version, but you can use these steps to deploy other sizes along with other popular LLMs.
Deploy vLLM on AWS Trainium and Inferentia EC2 instances
These sections explain how to deploy Meta’s latest Llama 3.2 model using vLLM on AWS Inferentia EC2 instances. Learn how to request access to a model, create a Docker container to deploy the model using vLLM, and perform online and offline inference on the model. We also discuss performance tuning of inference graphs.
Prerequisite: Access to a Hugging Face account and model
To use meta-llama/Llama-3.2-1B, you need a Hugging Face account and access to the model. Visit the model card to sign up and accept the model license. You will also need a Hugging Face access token, which you can create from your account settings. When you reach Save your Access Token, make sure to copy the token, because it will not be shown again, as shown in the following image.
Create an EC2 instance
You can follow the guide to create an EC2 instance. There are a few things to note:
- If you are using an inf/trn instance for the first time, you must request a quota increase.
- Use inf2.xlarge as the instance type. inf2.xlarge instances are available only in these AWS Regions.
- Increase the gp3 volume to 100 GiB.
- Use Deep Learning AMI Neuron (Ubuntu 22.04) as the AMI, as shown in the following image.
After your instance is launched, you can connect to it and access the command line. The next step is to build and run the Neuron vLLM container image using Docker (which is preinstalled on this AMI).
Starting the vLLM server
Use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:
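The Dockerfile itself is not reproduced in this post. A minimal sketch of what it might contain is shown below, assuming the AWS Neuron PyTorch inference deep learning container as the base image (the tag shown is an example and may need to be updated to a current Neuron SDK release) and the Neuron installation steps documented by vLLM:

cat > Dockerfile <<'EOF'
# Example base image: AWS Neuron PyTorch inference container (tag is an assumption)
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04

# Build vLLM from source with the Neuron backend
RUN git clone https://github.com/vllm-project/vllm.git /opt/vllm
WORKDIR /opt/vllm
RUN pip install -U -r requirements-neuron.txt && \
    VLLM_TARGET_DEVICE=neuron pip install .

# Port used by the OpenAI-compatible API server
EXPOSE 8000
CMD ["/bin/bash"]
EOF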
Then run:
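A typical build command, assuming the image is tagged vllm-neuron (an arbitrary name used in the examples throughout this post), looks like the following:

docker build -t vllm-neuron .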
Building the image takes approximately 10 minutes. Once the build is done, run the new Docker image (replace YOUR_TOKEN_HERE with your Hugging Face token):
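The exact run command is not shown in this post; a plausible sketch, assuming the vllm-neuron image name from the build step, is shown below. It exposes the Neuron device to the container, forwards port 8000 (used later for inference requests), and passes your Hugging Face token as an environment variable:

docker run -it --rm \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -e HF_TOKEN=YOUR_TOKEN_HERE \
  vllm-neuron bash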
You can now start the vLLM server using the following command:
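This is the same serve command that appears again in the performance tuning section later in this post:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32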
This command runs vLLM with the following parameters:
- serve meta-llama/Llama-3.2-1B: the Hugging Face modelID of the model being deployed for inference.
- --device neuron: configures vLLM to run on Neuron devices.
- --tensor-parallel-size 2: sets the number of partitions for tensor parallelism. An inf2.xlarge has one Neuron device, and each Neuron device has two NeuronCores.
- --max-model-len 4096: the maximum sequence length (input and output tokens) the model is compiled for.
- --block-size 8: for Neuron devices, this is set internally to max-model-len.
- --max-num-seqs 32: the desired hardware batch size, or the level of concurrency the model server must handle.
When you load a model for the first time, it must be compiled if it was not previously compiled. The compiled model can optionally be saved, so the compilation step is not needed when you recreate the container. Once everything is complete and the model server is running, you should see the following logs:
This means that the model server is running but has not yet received any requests, so it is not processing anything. You can now detach from the container by pressing ctrl + p and ctrl + q.
Online inference
When starting the Docker container, we used the flag -p 8000:8000, which told Docker to forward port 8000 from the container to port 8000 on the local machine. The following command verifies that the model server is running and serving meta-llama/Llama-3.2-1B.
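The original command is not shown here; one way to check, using vLLM's OpenAI-compatible API, is to list the models being served:

curl localhost:8000/v1/models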
This should return something like this:
Then send a prompt:
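This is the same completion request used later in the performance tuning section of this post:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'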
You should receive a response from vLLM similar to the following:
Offline inference using vLLM
Another way to use vLLM on Inferentia is to send several requests from a script all at once. This is useful for automation or when you have multiple prompts to process together.
You can reconnect to the Docker container and stop the online inference server using the following commands:
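The exact commands are not shown in this post; a sketch using standard Docker commands (the container ID depends on your environment) is:

docker ps        # find the running container's ID or name
docker attach <container-id>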
At this point you should see a blank cursor. Press ctrl + c to stop the server.
After you stop the server, you should be back at the bash prompt inside the container. Create a file named offline_inference.py to use the offline inference engine:
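The original script is not reproduced here. The following is a minimal sketch modeled on vLLM's offline inference example, reusing the same engine settings as the serve command above; the four prompts are illustrative:

from vllm import LLM, SamplingParams

# Four example prompts (illustrative only)
prompts = [
    "What is Gen AI?",
    "Explain the difference between training and inference.",
    "Write a haiku about cloud computing.",
    "What is AWS Inferentia?",
]

# Sampling settings for generation
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Same engine settings as the online serve command
llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    device="neuron",
    tensor_parallel_size=2,
    max_model_len=4096,
    block_size=8,
    max_num_seqs=32,
)

# Generate completions for all prompts in one call
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}\n")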
Then run the script with python offline_inference.py.
You should receive responses to the four prompts. This may take a minute or so, because the model must be loaded again.
You can now type exit and press return, then press ctrl + c to shut down the Docker container and return to the inf2 instance.
Clean up
Now that you have completed testing Llama 3.2 1B LLM, you need to terminate your EC2 instance to avoid additional charges.
Performance tuning for variable sequence length
You may need to handle variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of each bucket. To fine-tune performance based on the length of input and output tokens in your inference requests, you can set the following environment variables, each as a list of integers, to configure two types of buckets corresponding to the two phases of LLM inference:
- NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated lengths of prompts during inference.
- NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of 2 within your generation length.
You can set these environment variables in the docker run command used to start the container (make sure to replace YOUR_TOKEN_HERE with your Hugging Face token):
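A sketch is shown below, again assuming the vllm-neuron image name from earlier; the bucket values are illustrative examples, and the comma-separated list format is an assumption to be checked against the bucketing developer guide:

docker run -it --rm \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -e HF_TOKEN=YOUR_TOKEN_HERE \
  -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,2048,4096" \
  -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
  vllm-neuron bash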
You can then start the server using the same command as before:
vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32
Because the model graph has changed, the model must be recompiled. If the container was exited, the model will be downloaded again. You can then detach from the container by pressing ctrl + p and ctrl + q, and send your request using the same command:
curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
For more information on how to configure buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate in the documentation, and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.
Conclusion
In this post, we walked through how to deploy meta-llama/Llama-3.2-1B using vLLM on Amazon EC2 Inf2 instances. If you are interested in deploying other popular LLMs from Hugging Face, you can change the modelID in the vllm serve command. For more information about integrating the Neuron SDK with vLLM, see the Neuron user guide for continuous batching and the vLLM guide for Neuron.
After you identify a model to use in production, you need to deploy the model with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post in this series, we will learn how to deploy vLLM with autoscaling and observability into production using Amazon EKS and Ray Serve.
About the authors
Omri Shiv is an open source machine learning engineer focused on helping customers through their AI/ML initiatives. In his free time, he likes to cook, tinker with open source and open hardware, and listen to and play music.
Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on training and optimizing inference for generative AI models on AWS AI chips.