With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant and low cost framework to run LLMs efficiently in a containerized environment.
In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.
Solution overview
The steps to implement the solution are as follows:
- Create the EKS cluster.
- Set up the Inferentia 2 node group.
- Install the Neuron device plugin and scheduling extension.
- Prepare the Docker image.
- Deploy the Meta Llama 3.18B model.
We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.
Prerequisites
Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.
In this post, the examples use an inf2.48xlarge
instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.
Create the EKS cluster
If you don’t have an existing EKS cluster, you can create one using eksctl
. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you authenticate towards AWS:
Then complete the following steps:
- Create a new file named
eks_cluster.yaml
with the following command:
This configuration file contains the following parameters:
- metadata.name – Specifies the name of your EKS cluster, which is set to
my-cluster
in this example. You can change it to a name of your choice. - metadata.region – Specifies the Region where you want to create the cluster. In this example, it’s set to
us-east-2
. Change this to your desired Region. Because we’re using Inf2 instances, you should choose a Region where those instances are presented. - metadata.version – Specifies the Kubernetes version to use for the cluster. In this example, it’s set to 1.30. You can change this to a different version if needed, but make sure to use a version that is supported by Amazon EKS. For a list of supported versions, see Review release notes for Kubernetes versions on standard support.
- addons.vpc-cni – Specifies the version of the Amazon VPC CNI (Container Network Interface) add-on to use. Setting it to
latest
will install the latest available version. - cloudWatch.clusterLogging – Enables cluster logging, which sends logs from the control plane to Amazon CloudWatch Logs.
- iam.withOIDC – Enables the OpenID Connect (OIDC) provider for the cluster, which is required for certain AWS services to interact with the cluster.
- After you create the
eks_cluster.yaml
file, you can create the EKS cluster by running the following command:
This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml
file. The process will take approximately 15–20 minutes to complete.
During the cluster creation process, eksctl
will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.
- To complete the setup of
kubectl
, run the following code:
Set up the Inferentia 2 node group
To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:
- First, retrieve the latest Amazon EKS optimized accelerated AMI ID:
- Create the Inferentia 2 node group using
eksctl
:
- Run
eksctl create nodegroup --config-file eks_nodegroup.yaml
to create the node group.
This will take approximately 5 minutes.
Install the Neuron device plugin and scheduling extension
To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.
The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.
For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require worker nodes, such as the Meta Llama 3.1-8B model.
Prepare the Docker image
To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).
- Set up environment variables:
- Create an ECR repository:
Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.
- Create the Dockerfile:
- Use the following commands to create an ECR repository, build your Docker image, and push it to the newly created repository. The account ID and Region are dynamically set using AWS CLI commands, making the process more flexible and avoiding hard-coded values.
Deploy the Meta Llama 3.1-8B model
With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:
Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml
.
This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.
The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.
To modify the setup, update the neuronx-vllm-deployment.yaml
file, adjusting the replicas
field in the deployment specification and the NUM_NEURON_CORES
environment variable in the container specification. Always verify that the total number of cores used (replicas * cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args
section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.
Test the deployment
After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:
- Check the deployment status:
This will show you the desired, current, and up-to-date number of replicas.
- Monitor the pods:
The -w
flag will watch for changes. You’ll see the pods transitioning from "Pending"
to "ContainerCreating"
to "Running"
.
- Check the logs of a specific pod:
The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.
To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.
- Add the following probe configurations to your container spec in the deployment YAML:
The configuration is comprised of three probes:
- Readiness probe – Checks if the pod is ready to serve traffic. It starts checking after 60 seconds and repeats every 10 seconds.
- Liveness probe – Verifies if the pod is still running correctly. It begins after 120 seconds and checks every 15 seconds.
- Startup probe – Gives the application time to start up. It allows up to 25 minutes for the application to start before considering it failed.
These probes assume that your vLLM application exposes a /health
endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.
With these probes in place, Kubernetes will do the following:
- Only send traffic to pods that are ready
- Restart pods that are no longer alive
- Allow sufficient time for initial startup and compilation
This configuration helps facilitate high availability and proper functioning of your vLLM deployment.
Now you’re ready to access the pods.
- Identify the pod that is running your inference server. You can use the following command to list the pods with the
neuronx-vllm
label:
This command will output a list of pods, and you’ll need the name of the pod you want to forward.
- Use
kubectl port-forward
to forward the port from the Kubernetes pod to your local machine. Use the name of your pod from the previous step:
This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000
.
Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.
In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.
- You can test the inference server by making a request from your local machine. The following code is an example of how to make an inference call using curl:
This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.
For more information about routing, see Route application and HTTP traffic with Application Load Balancers.
Monitor performance
AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.
These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.
Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.
To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.
Scaling and multi-tenancy
As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:
- Increasing the number of nodes in your EKS node group to provide additional Neuron cores
- Increasing the number of replicas in your deployment to utilize these new resources
You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:
Alternatively, you can set up auto scaling:
- Configure auto scaling for your EKS node group to automatically add nodes based on resource demands
- Use Horizontal Pod Autoscaling (HPA) to automatically adjust the number of replicas in your deployment
You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.
Example scaling solutions include:
- Cluster Autoscaler with Karpenter – Though not currently installed in this setup, Karpenter offers more flexible and efficient auto scaling for future consideration. It can dynamically provision the right number of nodes needed for your Neuron workloads based on pending pods and custom scheduling constraints. For more details, see Scale cluster compute with Karpenter and Cluster Autoscaler.
- Multi-cluster federation – For even larger scale, you could set up multiple EKS clusters, each with its own Neuron-equipped nodes, and use a multi-cluster federation tool to distribute traffic among them.
You should consider the following when scaling:
- Alignment of resources – Make sure that your scaling strategy for both nodes and pods aligns with the Neuron core requirements (multiples of 8 for optimal performance). This is model dependent and unique for the Meta Llama 3.1 model.
- Compilation time – Remember the 15-minute compilation time for new pods when planning your scaling strategy. Consider pre-warming pods during off-peak hours.
- Cost management – Monitor costs closely as you scale, because Neuron-equipped instances can be expensive.
- Performance testing – Conduct thorough performance testing as you scale to verify that increased capacity translates to improved throughput and reduced latency.
By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.
Clean up
Use the following code to delete the cluster created in this solution:
Conclusion
Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.
This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.
The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes pod. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.
For the complete code and detailed implementation steps, visit the GitHub repository.
About the Authors
Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.
Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.
Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.
Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.