As the demand for generative AI increases, developers and enterprises are looking for more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we’re excited to announce that G6e instances powered by NVIDIA’s L40S Tensor Core GPUs are now available on Amazon SageMaker. You can provision instances with 1, 4, or 8 L40S GPUs, with each GPU providing 48 GB of memory. With this release, organizations can host large models on single-node GPU instances, reducing cost while keeping performance high. This makes G6e a great choice for anyone looking to optimize costs while maintaining high performance for their inference workloads.
Key highlights of G6e instances include:
- With twice the GPU memory of G5 and G6 instances, you can deploy large language models in FP16, including the following (see the memory estimate sketch after this list):
  - 14B parameter model on a single-GPU node (G6e.xlarge)
  - 72B parameter model on a single node with 4 GPUs (G6e.12xlarge)
  - 90B parameter model on a single node with 8 GPUs (G6e.48xlarge)
- Up to 400 Gbps network throughput
- Up to 384 GB GPU memory
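As a rough back-of-the-envelope check (an illustrative sketch, not part of the original benchmarks), FP16 stores each parameter in 2 bytes, so the model weights alone fit within the aggregate 48 GB-per-GPU memory of these instance sizes:

```python
# Rough FP16 weight-memory estimate: 2 bytes per parameter.
# Illustrative only; ignores KV cache, activations, and framework overhead.
BYTES_PER_PARAM_FP16 = 2
GPU_MEMORY_GB = 48

configs = {
    "G6e.xlarge (1 GPU)": (14e9, 1),
    "G6e.12xlarge (4 GPUs)": (72e9, 4),
    "G6e.48xlarge (8 GPUs)": (90e9, 8),
}

for name, (params, gpus) in configs.items():
    weights_gb = params * BYTES_PER_PARAM_FP16 / 1e9
    total_gb = GPU_MEMORY_GB * gpus
    print(f"{name}: ~{weights_gb:.0f} GB of weights vs {total_gb} GB of GPU memory")
```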
Use cases
G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e offers higher performance and better cost efficiency than G5 instances, making it a great fit for low-latency, real-time use cases such as:
- Chatbots and conversational AI
- Text generation and summarization
- Image generation and visual models
We also observed that G6e delivers better inference performance at high concurrency and with longer context lengths. The next section provides detailed benchmarks.
Performance
The following two figures show that for the Llama 3.1 8B model, G6e.2xlarge achieves up to 37% better latency and 60% better throughput than G5.2xlarge at context lengths of 512 and 1024.
The following two figures show that G5.2xlarge throws a CUDA out-of-memory (OOM) error when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge deploys and serves it successfully.
The following two figures compare G5.48xlarge (8 GPUs) with G6e.12xlarge (4 GPUs), which costs 35% less while delivering better performance. At high concurrency, G6e.12xlarge shows 60% lower latency and 2.5 times higher throughput.
The following figure compares the cost per 1,000 tokens when deploying Llama 3.1 70B, further highlighting the cost-performance benefits of using G6e instances compared to G5.
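Cost per 1,000 tokens can be derived from an instance’s hourly price and its sustained generation throughput. The sketch below shows the arithmetic only; the hourly prices and throughput values are placeholders, not actual AWS pricing or measured results:

```python
def cost_per_1k_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly instance price and sustained throughput into cost per 1,000 generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1000

# Placeholder numbers for illustration only; substitute your Region's on-demand
# prices and your own measured throughput.
print(cost_per_1k_tokens(hourly_price_usd=10.0, tokens_per_second=500))  # hypothetical G6e.12xlarge
print(cost_per_1k_tokens(hourly_price_usd=16.0, tokens_per_second=200))  # hypothetical G5.48xlarge
```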
Deployment walkthrough
Prerequisites
To try this solution using SageMaker, you need the following prerequisites:
Deployment
You can clone the repository and use the notebook provided here.
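If you want a sense of what the deployment looks like before opening the notebook, the following is a minimal sketch using the SageMaker Python SDK. It assumes the Hugging Face LLM (TGI) container, the meta-llama/Llama-3.1-8B-Instruct model ID, and an ml.g6e.2xlarge endpoint; the notebook in the repository may use a different container, model, or configuration.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Latest Hugging Face LLM (TGI) inference container available in your Region
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model; gated on Hugging Face
        "HF_TOKEN": "<your-hugging-face-token>",            # required for gated models
        "SM_NUM_GPUS": "1",                                  # ml.g6e.2xlarge has a single L40S GPU
    },
)

# Deploy to a G6e-backed SageMaker endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
)

# Quick smoke test of the endpoint
response = predictor.predict({
    "inputs": "List three benefits of NVIDIA L40S GPUs for LLM inference.",
    "parameters": {"max_new_tokens": 128},
})
print(response)
```

For larger models such as Llama 3.1 70B, you would instead set SM_NUM_GPUS to match the GPU count of a bigger instance size (for example, 4 on ml.g6e.12xlarge).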
Clean up
To avoid unnecessary charges, we recommend that you clean up your deployed resources when you are finished using them. You can delete a deployed model using the following code.
predictor.delete_predictor()
Conclusion
SageMaker’s G6e instances allow you to cost-effectively deploy a variety of open source models. With superior memory capacity, enhanced performance, and cost efficiency, these instances are an attractive solution for organizations looking to deploy and scale AI applications. G6e instances are especially valuable for modern AI applications because they can handle larger models, support longer context lengths, and maintain high throughput. To get started, try out the provided code to deploy your models on G6e instances.
About the authors
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated computing. His current focus is on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Alan Tan is a Senior Product Manager with SageMaker, where he leads efforts on large model inference. He is passionate about applying machine learning to analytics. Outside of work, he enjoys the outdoors.
Pavan Kumar Madhuri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in generative AI and is passionate about helping customers harness the power of the cloud. He earned a master’s degree in information technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.
Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and BS/MS degrees in Electrical/Computer Engineering and an MBA from Pennsylvania State University, Binghamton University, and the University of Delaware.