This post was co-authored by NVIDIA’s Eliuth Triana, Abhishek Sawarkar, Jiahong Liu, Kshitiz Gupta, JR Morgan, and Deepika Padmanabhan.
At the 2024 NVIDIA GTC conference, we announced support for NVIDIA NIM Inference Microservices in Amazon SageMaker Inference. This integration lets you deploy industry-leading large language models (LLMs) on SageMaker and optimize their performance and cost. With optimized pre-built containers, you can deploy state-of-the-art LLMs in minutes instead of days, facilitating seamless integration into enterprise-grade AI applications.
NIM is built on technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and vLLM, and is engineered for easy, secure, and high-performance AI inference on NVIDIA GPU-accelerated instances hosted by SageMaker. With the SageMaker APIs and just a few lines of code, developers can harness these advanced models and accelerate the deployment of cutting-edge AI capabilities in their applications.
Part of the NVIDIA AI Enterprise software platform listed on AWS Marketplace, NIM is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, whether you’re developing chatbots, summarizing documents, or implementing other NLP-powered workloads. Pre-built NVIDIA containers host popular LLMs optimized for specific NVIDIA GPUs and make them quick to deploy. Companies such as Amgen, A-Alpha Bio, Agilent, and Hippocratic AI are already using NVIDIA AI on AWS to accelerate computational biology, genomics analysis, and conversational AI.
In this post, we walk through how the integration of NVIDIA NIM with SageMaker lets you use generative artificial intelligence (AI) models and LLMs. We explain how the integration works and how you can deploy these state-of-the-art models on SageMaker to optimize their performance and cost.
You can deploy LLMs using optimized, pre-built NIM containers and integrate them into your enterprise-grade AI applications built with SageMaker in minutes instead of days. We also share example notebooks to get you started, showcasing the simple APIs and few lines of code needed to harness the capabilities of these advanced models.
Solution overview
Getting started with NIM is straightforward. Within the NVIDIA API catalog, developers have access to a wide range of NIM-optimized AI models that they can use to build and deploy their own AI applications. You can start prototyping directly in the catalog using the GUI, as shown in the following screenshot, or interact directly with the APIs for free.
To deploy NIM on SageMaker, download the NIM and deploy it. To initiate this process, choose Run Anywhere with NIM for your chosen model, as shown in the following screenshot.
You can sign up for a 90-day free evaluation license in the API catalog by registering with your organizational email address. Doing so gives you a personal NGC API key for pulling assets from NGC and running them on SageMaker. For details on SageMaker pricing, see Amazon SageMaker Pricing.
Prerequisites
As a prerequisite, set up your Amazon SageMaker Studio environment:
- Verify that your existing SageMaker domain has Docker access enabled. If it doesn’t, update the domain to enable it (see the sketch after this list).
- Once Docker access is enabled for the domain, create a user profile (also covered in the sketch after this list).
- Create a JupyterLab space for the user profile you created.
- After you create the JupyterLab space, run a bash script in it to install the Docker CLI.
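The commands themselves aren’t reproduced above, so here is a minimal boto3 sketch of the first two steps. The domain ID (d-xxxxxxxxxxxx) and user profile name (nim-user) are placeholders you replace with your own values, and the bash script that installs the Docker CLI is not shown here.

```python
import boto3

sm_client = boto3.client("sagemaker")

# Enable Docker access for the existing SageMaker Studio domain
sm_client.update_domain(
    DomainId="d-xxxxxxxxxxxx",  # placeholder: your Studio domain ID
    DomainSettingsForUpdate={
        "DockerSettings": {"EnableDockerAccess": "ENABLED"},
    },
)

# Create a user profile in that domain
sm_client.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="nim-user",  # placeholder profile name
)
```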
Set up your Jupyter notebook environment
This set of steps uses a SageMaker Studio JupyterLab notebook. You also need to attach an Amazon Elastic Block Store (Amazon EBS) volume of at least 300 GB, which you can configure in the domain settings in SageMaker Studio. For this example, we use an ml.g5.4xlarge instance, which has an NVIDIA A10G GPU.
First, open the sample notebook provided in your JupyterLab instance, import the required packages, and configure your SageMaker session, role, and account information.
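A minimal sketch of that setup cell, assuming the standard SageMaker Python SDK is installed in the JupyterLab environment:

```python
import boto3
import sagemaker

# SageMaker session, execution role, and account/region information
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = sess.boto_region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]

print(f"role={role}, account={account_id}, region={region}")
```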
Pull the NIM container from the public gallery and push it to your private registry
NIM containers with built-in SageMaker integration are available in the Amazon ECR Public Gallery. To deploy securely into your own SageMaker account, pull the Docker container from the public Amazon Elastic Container Registry (Amazon ECR) repository managed by NVIDIA and push it to a private repository in your account.
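A sketch of that pull/tag/push flow from the notebook; the public image URI and repository name below are illustrative placeholders, not the actual NIM image coordinates:

```python
import subprocess
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name

public_image = "public.ecr.aws/nvidia/nim-example:latest"  # placeholder public image
repo_name = "nim-example"
private_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"

# Create the private repository if it doesn't exist yet
ecr = boto3.client("ecr")
try:
    ecr.create_repository(repositoryName=repo_name)
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass

# Log in to the private registry, then pull, re-tag, and push the container
subprocess.run(
    f"aws ecr get-login-password --region {region} | docker login "
    f"--username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com",
    shell=True, check=True,
)
for cmd in (f"docker pull {public_image}",
            f"docker tag {public_image} {private_image}",
            f"docker push {private_image}"):
    subprocess.run(cmd, shell=True, check=True)
```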
Set up your NVIDIA API key
NIM can be accessed through the NVIDIA API catalog. To generate an NGC API key from the NGC catalog, choose Generate Personal Key.
When you create an NGC API key, include at least NGC Catalog on the Services Included dropdown menu. You can include more services if you plan to reuse the key for other purposes.
For this post, we store the key in an environment variable:
NGC_API_KEY = "YOUR_KEY"
This key is used to download pre-optimized model weights when running NIM.
Create a SageMaker endpoint
Now we have all the resources needed to deploy to a SageMaker endpoint. Continuing in the notebook, after setting up the Boto3 environment, first make sure the model definition references the container you pushed to Amazon ECR in the previous step.
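A sketch of what that model definition can look like with boto3; the model name is illustrative, the image and role variables come from the earlier steps, and passing the NGC key through the container environment is an assumption about how NIM receives it:

```python
import boto3

sm_client = boto3.client("sagemaker")
model_name = "nim-llama3-8b-instruct"  # illustrative name

sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,  # the SageMaker execution role from earlier
    PrimaryContainer={
        "Image": private_image,  # the private Amazon ECR image pushed earlier
        "Environment": {
            "NGC_API_KEY": NGC_API_KEY,  # lets NIM pull optimized model weights
        },
    },
)
```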
When the model definition is set correctly, the next step is to define the endpoint configuration for the deployment. In this example, we deploy NIM to a single ml.g5.4xlarge instance.
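A corresponding endpoint configuration sketch, continuing from the previous block (sm_client and model_name as defined there); the generous container startup timeout is an assumption to leave room for the model weights to download on first start:

```python
endpoint_config_name = "nim-llama3-8b-config"  # illustrative name

sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g5.4xlarge",
            "InitialInstanceCount": 1,
            # Allow time for the container to download model weights at startup
            "ContainerStartupHealthCheckTimeoutInSeconds": 850,
        }
    ],
)
```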
Finally, create a SageMaker endpoint.
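Continuing the sketch above, we create the endpoint and block until it is in service:

```python
endpoint_name = "nim-llama3-8b-endpoint"  # illustrative name

sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Block until the endpoint reaches InService (typically several minutes)
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
```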
Use NIM to run inference against a SageMaker endpoint
Once the endpoint is successfully deployed, you can use the REST API to make requests to the NIM-powered SageMaker endpoint and try out different questions and prompts to interact with the generative AI model.
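A sketch of such a request, assuming a chat-style NIM (for example, Llama 3 8B Instruct) that accepts an OpenAI-compatible messages payload; the model identifier shown is illustrative:

```python
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

payload = {
    "model": "meta/llama3-8b-instruct",  # illustrative model identifier
    "messages": [
        {"role": "user", "content": "What is NVIDIA NIM and why would I use it?"}
    ],
    "max_tokens": 256,
}

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,  # the endpoint created earlier
    ContentType="application/json",
    Body=json.dumps(payload),
)

completion = json.loads(response["Body"].read())
print(completion["choices"][0]["message"]["content"])
```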
That’s it! You now have a running endpoint that uses NIM on SageMaker.
NIM license
NIM is part of the NVIDIA AI Enterprise license and initially comes with a 90-day evaluation license. To use NIM with SageMaker beyond the 90 days, contact NVIDIA about private pricing on AWS Marketplace. NIM is also available as a paid offering as part of the NVIDIA AI Enterprise software subscription on AWS Marketplace.
Conclusion
This post showed you how to get started with NIM for pre-built models in SageMaker. Feel free to try it out by following the sample notebooks.
We encourage you to explore NIM and adopt it for your own use cases and applications.
About the Authors
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges such as deploying complex ML applications, multi-tenant ML models, cost optimization, and making the deployment of deep learning models more accessible. In his spare time, he enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS and is particularly interested in AI and machine learning. In his spare time, he enjoys exploring new cultures, new experiences, and keeping up with the latest technology trends. You can find him on LinkedIn.
Qing Lan is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and high-performance logging systems. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising, which required extremely low latency. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.
Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker service team. He focuses on helping customers build, deploy, and migrate large-scale ML production workloads to SageMaker. He specializes in machine learning, AI, and computer vision, and holds an MS in Computer Science from the University of Texas at Dallas. In his spare time, he enjoys traveling and photography.
Eliuth Triana is a Developer Relations Manager at NVIDIA, helping Amazon’s AI MLOps and DevOps teams, scientists, and AWS technical experts master the NVIDIA compute stack to accelerate and optimize generative AI foundation models, from data curation and GPU training to model inference and production deployment on AWS GPU instances. Eliuth is also an enthusiast of mountain biking, skiing, tennis, and poker.
Abhishek Sawarkar is a Product Manager on the NVIDIA AI Enterprise team, working on integrating NVIDIA AI software into cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and improving the user experience on accelerated computing.
Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA, helping customers deploy machine learning and AI solutions that leverage NVIDIA-accelerated computing to address their training and inference challenges. In his spare time, he enjoys origami, DIY projects, and playing basketball.
Kshitiz Gupta is a Solutions Architect at NVIDIA. He is passionate about educating cloud customers on the GPU AI technologies NVIDIA offers and helping them accelerate their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.
J.R. Morgan is a Principal Technical Product Manager in the Enterprise Products Group at NVIDIA, working at the intersection of partner services, APIs, and open source. After work, he can be found riding his Gixxer, going to the beach, or spending time with his amazing family.
Deepika Padmanabhan is a Solutions Architect at NVIDIA, working on building and deploying NVIDIA software solutions in the cloud. Outside of work, she enjoys solving puzzles and playing video games like Age of Empires.