The use of large language models (LLMs) and generative AI has exploded over the past year. With the release of powerful, publicly available foundation models, the tools to train, fine-tune, and host your own LLM have also become widely accessible. Using vLLM on AWS Trainium and Inferentia, you can host LLMs for high-performance inference and scalability.
This post shows you how to quickly deploy Meta’s latest Llama models using vLLM on Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances. This example uses the 1B version, but you can use these steps to deploy other sizes along with other popular LLMs.
Deploy vLLM on AWS Trainium and Inferentia EC2 instances
These sections explain how to deploy Meta’s latest Llama 3.2 model using vLLM on AWS Inferentia EC2 instances. Learn how to request access to a model, create a Docker container to deploy the model using vLLM, and perform online and offline inference on the model. We also discuss performance tuning of inference graphs.
Prerequisite: Access to a Hugging Face account and model
To use meta-llama/Llama-3.2-1B, you need a Hugging Face account and access to the model. Visit the model card to sign up and accept the model license. You will also need a Hugging Face access token, which you can create from your account settings. When you reach Save your Access Token, make sure to copy the token, because it will not be shown again, as shown in the following image.
Create an EC2 instance
You can follow the guide to create an EC2 instance. There are a few things to note:
- If you are using an inf/trn instance for the first time, you must request a quota increase.
- Use inf2.xlarge as the instance type. inf2.xlarge instances are available only in these AWS Regions.
- Increase the gp3 volume to 100 GiB.
- Use Deep Learning AMI Neuron (Ubuntu 22.04) as the AMI, as shown in the following image.
After your instance is launched, you can connect to it and access the command line. The next step is to build and run the Neuron vLLM container image using Docker (which is preinstalled on this AMI).
Starting the vLLM server
Use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:
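The Dockerfile itself is not reproduced in this post. A minimal sketch of what it might contain is shown below, assuming the AWS Neuron PyTorch inference deep learning container as the base image (the tag shown is an example and may need to be updated to a current Neuron SDK release) and the Neuron installation steps documented by vLLM:

cat > Dockerfile <<'EOF'
# Example base image: AWS Neuron PyTorch inference container (tag is an assumption)
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04

# Build vLLM from source with the Neuron backend
RUN git clone https://github.com/vllm-project/vllm.git /opt/vllm
WORKDIR /opt/vllm
RUN pip install -U -r requirements-neuron.txt && \
    VLLM_TARGET_DEVICE=neuron pip install .

# Port used by the OpenAI-compatible API server
EXPOSE 8000
CMD ["/bin/bash"]
EOF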
Then run:
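A typical build command, assuming the image is tagged vllm-neuron (an arbitrary name used in the examples throughout this post), looks like the following:

docker build -t vllm-neuron .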
Building the image takes approximately 10 minutes. Once the build is done, run the new Docker image (replace YOUR_TOKEN_HERE with your Hugging Face token):
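The exact run command is not shown in this post; a plausible sketch, assuming the vllm-neuron image name from the build step, is shown below. It exposes the Neuron device to the container, forwards port 8000 (used later for inference requests), and passes your Hugging Face token as an environment variable:

docker run -it --rm \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -e HF_TOKEN=YOUR_TOKEN_HERE \
  vllm-neuron bash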
You can now start the vLLM server using the following command:
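This is the same serve command that appears again in the performance tuning section later in this post:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32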
This command runs vLLM with the following parameters:
- serve meta-llama/Llama-3.2-1B: the Hugging Face modelID of the model being deployed for inference.
- --device neuron: configures vLLM to run on Neuron devices.
- --tensor-parallel-size 2: sets the number of partitions for tensor parallelism. An inf2.xlarge has one Neuron device, and each Neuron device has two NeuronCores.
- --max-model-len 4096: the maximum sequence length (input and output tokens) the model is compiled for.
- --block-size 8: for Neuron devices, this is set internally to max-model-len.
- --max-num-seqs 32: the desired hardware batch size, or the level of concurrency the model server must handle.
When you load a model for the first time, it must be compiled if it was not previously compiled. The compiled model can optionally be saved, so the compilation step is not needed when you recreate the container. Once everything is complete and the model server is running, you should see the following logs:
This means that the model server is running but has not yet received any requests, so it is not processing anything. You can now detach from the container by pressing ctrl + p and ctrl + q.
Online inference
When starting the Docker container, we used the flag -p 8000:8000, which told Docker to forward port 8000 from the container to port 8000 on the local machine. The following command verifies that the model server is running and serving meta-llama/Llama-3.2-1B.
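The original command is not shown here; one way to check, using vLLM's OpenAI-compatible API, is to list the models being served:

curl localhost:8000/v1/models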
This should return something like this:
Then send a prompt:
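This is the same completion request used later in the performance tuning section of this post:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'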
You should receive a response from vLLM similar to the following:
Offline inference using vLLM
Another way to use vLLM on Inferentia is to send several requests from a script all at once. This is useful for automation or when you have multiple prompts to process together.
You can reconnect to the Docker container and stop the online inference server using the following commands:
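The exact commands are not shown in this post; a sketch using standard Docker commands (the container ID depends on your environment) is:

docker ps        # find the running container's ID or name
docker attach <container-id>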
At this point you should see a blank cursor. Press ctrl + c to stop the server.
After you stop the server, you should be back at the bash prompt inside the container. Create a file named offline_inference.py to use the offline inference engine:
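The original script is not reproduced here. The following is a minimal sketch modeled on vLLM's offline inference example, reusing the same engine settings as the serve command above; the four prompts are illustrative:

from vllm import LLM, SamplingParams

# Four example prompts (illustrative only)
prompts = [
    "What is Gen AI?",
    "Explain the difference between training and inference.",
    "Write a haiku about cloud computing.",
    "What is AWS Inferentia?",
]

# Sampling settings for generation
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Same engine settings as the online serve command
llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    device="neuron",
    tensor_parallel_size=2,
    max_model_len=4096,
    block_size=8,
    max_num_seqs=32,
)

# Generate completions for all prompts in one call
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}\n")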
Then run the script with python offline_inference.py.
You should receive responses to the four prompts. This may take a minute or so, because the model must be loaded again.
You can now type exit and press return, then press ctrl + c to shut down the Docker container and return to the inf2 instance.
Clean up
Now that you have completed testing Llama 3.2 1B LLM, you need to terminate your EC2 instance to avoid additional charges.
Performance tuning for variable sequence length
You may need to handle variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of each bucket. To fine-tune performance based on the length of input and output tokens in your inference requests, you can set the following environment variables, each as a list of integers, to configure two types of buckets corresponding to the two phases of LLM inference:
- NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated lengths of prompts during inference.
- NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of 2 within your generation length.
You can set these environment variables in the docker run command used to start the container (make sure to replace YOUR_TOKEN_HERE with your Hugging Face token):
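A sketch is shown below, again assuming the vllm-neuron image name from earlier; the bucket values are illustrative examples, and the comma-separated list format is an assumption to be checked against the bucketing developer guide:

docker run -it --rm \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -e HF_TOKEN=YOUR_TOKEN_HERE \
  -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,2048,4096" \
  -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
  vllm-neuron bash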
You can then start the server using the same command as before:
vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32
Because the model graph has changed, the model must be recompiled. If the container was exited, the model will be downloaded again. You can then detach from the container by pressing ctrl + p and ctrl + q, and send your request using the same command:
curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
For more information on how to configure buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate in the documentation, and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.
Conclusion
In this post, we walked through how to deploy meta-llama/Llama-3.2-1B using vLLM on Amazon EC2 Inf2 instances. If you are interested in deploying other popular LLMs from Hugging Face, you can change the modelID in the vllm serve command. For more information about integrating the Neuron SDK with vLLM, see the Neuron user guide for continuous batching and the vLLM guide for Neuron.
After you identify a model to use in production, you need to deploy the model with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post in this series, we will learn how to deploy vLLM with autoscaling and observability into production using Amazon EKS and Ray Serve.
About the authors
Omri Shiv is an open source machine learning engineer focused on helping customers through their AI/ML initiatives. In his free time, he likes to cook, tinker with open source and open hardware, and listen to and play music.
Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on training and optimizing inference for generative AI models on AWS AI chips.