As the demand for generative AI increases, developers and enterprises are looking for more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we’re excited to announce that G6e instances powered by NVIDIA’s L40S Tensor Core GPUs are now available on Amazon SageMaker. You can provision instances with 1, 4, or 8 L40S GPUs, with each GPU providing 48 GB of memory. With this release, organizations can host large models on single-node GPU instances, reducing cost while keeping performance high. This makes G6e a great choice for anyone looking to optimize costs while maintaining high performance for their inference workloads.
Key highlights of G6e instances include:
- With twice the GPU memory of G5 and G6 instances, you can deploy large language models in FP16, including the following (see the memory estimate sketch after this list):
  - 14B parameter model on a single-GPU node (G6e.xlarge)
  - 72B parameter model on a single node with 4 GPUs (G6e.12xlarge)
  - 90B parameter model on a single node with 8 GPUs (G6e.48xlarge)
- Up to 400 Gbps network throughput
- Up to 384 GB GPU memory
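As a rough back-of-the-envelope check (an illustrative sketch, not part of the original benchmarks), FP16 stores each parameter in 2 bytes, so the model weights alone fit within the aggregate 48 GB-per-GPU memory of these instance sizes:

```python
# Rough FP16 weight-memory estimate: 2 bytes per parameter.
# Illustrative only; ignores KV cache, activations, and framework overhead.
BYTES_PER_PARAM_FP16 = 2
GPU_MEMORY_GB = 48

configs = {
    "G6e.xlarge (1 GPU)": (14e9, 1),
    "G6e.12xlarge (4 GPUs)": (72e9, 4),
    "G6e.48xlarge (8 GPUs)": (90e9, 8),
}

for name, (params, gpus) in configs.items():
    weights_gb = params * BYTES_PER_PARAM_FP16 / 1e9
    total_gb = GPU_MEMORY_GB * gpus
    print(f"{name}: ~{weights_gb:.0f} GB of weights vs {total_gb} GB of GPU memory")
```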
Use cases
G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e offers higher performance and better cost efficiency than G5 instances, making it a great fit for low-latency, real-time use cases such as:
- Chatbots and conversational AI
- Text generation and summarization
- Image generation and visual models
We also observed that G6e delivers better inference performance at high concurrency and with longer context lengths. The next section provides detailed benchmarks.
Performance
The following two figures show that for the Llama 3.1 8B model, G6e.2xlarge achieves up to 37% better latency and 60% better throughput than G5.2xlarge at context lengths of 512 and 1024.
The following two figures show that G5.2xlarge throws a CUDA out-of-memory (OOM) error when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge deploys and serves it successfully.
The following two figures compare G5.48xlarge (8 GPUs) with G6e.12xlarge (4 GPUs), which costs 35% less while delivering better performance. At high concurrency, G6e.12xlarge shows 60% lower latency and 2.5 times higher throughput.
The following figure compares the cost per 1,000 tokens when deploying Llama 3.1 70B, further highlighting the cost-performance benefits of using G6e instances compared to G5.
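Cost per 1,000 tokens can be derived from an instance’s hourly price and its sustained generation throughput. The sketch below shows the arithmetic only; the hourly prices and throughput values are placeholders, not actual AWS pricing or measured results:

```python
def cost_per_1k_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly instance price and sustained throughput into cost per 1,000 generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1000

# Placeholder numbers for illustration only; substitute your Region's on-demand
# prices and your own measured throughput.
print(cost_per_1k_tokens(hourly_price_usd=10.0, tokens_per_second=500))  # hypothetical G6e.12xlarge
print(cost_per_1k_tokens(hourly_price_usd=16.0, tokens_per_second=200))  # hypothetical G5.48xlarge
```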
Deployment walkthrough
Prerequisites
To try this solution using SageMaker, you need the following prerequisites:
Deployment
You can clone the repository and use the notebook provided here.
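If you want a sense of what the deployment looks like before opening the notebook, the following is a minimal sketch using the SageMaker Python SDK. It assumes the Hugging Face LLM (TGI) container, the meta-llama/Llama-3.1-8B-Instruct model ID, and an ml.g6e.2xlarge endpoint; the notebook in the repository may use a different container, model, or configuration.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Latest Hugging Face LLM (TGI) inference container available in your Region
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model; gated on Hugging Face
        "HF_TOKEN": "<your-hugging-face-token>",            # required for gated models
        "SM_NUM_GPUS": "1",                                  # ml.g6e.2xlarge has a single L40S GPU
    },
)

# Deploy to a G6e-backed SageMaker endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
)

# Quick smoke test of the endpoint
response = predictor.predict({
    "inputs": "List three benefits of NVIDIA L40S GPUs for LLM inference.",
    "parameters": {"max_new_tokens": 128},
})
print(response)
```

For larger models such as Llama 3.1 70B, you would instead set SM_NUM_GPUS to match the GPU count of a bigger instance size (for example, 4 on ml.g6e.12xlarge).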
Clean up
To avoid unnecessary charges, we recommend that you clean up your deployed resources when you are finished using them. You can delete a deployed model using the following code.
predictor.delete_predictor()
Conclusion
SageMaker’s G6e instances allow you to cost-effectively deploy a variety of open source models. With superior memory capacity, enhanced performance, and cost efficiency, these instances are an attractive solution for organizations looking to deploy and scale AI applications. G6e instances are especially valuable for modern AI applications because they can handle larger models, support longer context lengths, and maintain high throughput. To get started, try out the provided code to deploy your models on G6e instances.
About the authors
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated computing. His current focus is on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Alan Tan is a Senior Product Manager with SageMaker, where he leads efforts on large model inference. He is passionate about applying machine learning to analytics. Outside of work, he enjoys the outdoors.
Pavan Kumar Madhuri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in generative AI and is passionate about helping customers harness the power of the cloud. He earned a master’s degree in information technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.
Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and BS/MS degrees in Electrical/Computer Engineering and an MBA from Pennsylvania State University, Binghamton University, and the University of Delaware.