Today, we’re excited to announce that Meta’s Llama 3.3 70B is now available on Amazon SageMaker JumpStart. Llama 3.3 70B represents an exciting advance in large language model (LLM) development, delivering performance comparable to larger Llama versions with fewer computational resources.
In this post, we explore how to efficiently deploy this model to Amazon SageMaker AI using advanced SageMaker AI features for optimal performance and cost control.
Llama 3.3 70B Model Overview
Llama 3.3 70B represents a significant advancement in model efficiency and performance optimization. The model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency improvement makes inference operations nearly five times more cost-effective, making it an attractive option for production deployments.
The model’s architecture is built on an optimized version of Meta’s transformer design and features enhanced attention mechanisms that help significantly reduce inference costs. During development, Meta’s engineering team trained the model on an extensive dataset of approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic samples created specifically for LLM development. This comprehensive training approach gives the model robust understanding and generation capabilities across a variety of tasks.
Llama 3.3 70B is also notable for its training method. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align model output with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B showed remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in 3 categories. This performance profile makes it an ideal candidate for organizations seeking a balance between model capability and operational efficiency.
The following figure summarizes the benchmark results (source).
Get started with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub that helps you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using the UI or SDK.
There are two convenient ways to deploy Llama 3.3 70B via SageMaker JumpStart: through the intuitive SageMaker JumpStart UI or programmatically with the SageMaker Python SDK. Let’s look at both methods so you can choose the approach that best suits your needs.
Deploy Llama 3.3 70B via SageMaker JumpStart UI
You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, follow these steps:
- In SageMaker Unified Studio, on the Build menu, choose JumpStart models.
Alternatively, in the SageMaker Studio console, choose JumpStart in the navigation pane.
- Search for Meta Llama 3.3 70B.
- Select the Meta Llama 3.3 70B model.
- Choose Deploy.
- Accept the End User License Agreement (EULA).
- For Instance type, select your instance (ml.g5.48xlarge or ml.p4d.24xlarge).
- Choose Deploy.
Wait until the endpoint status shows as InService. You can then use the model to run inference, as shown in the sketch that follows.
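Once the endpoint is InService, you can invoke it like any other SageMaker real-time endpoint. The following is a minimal sketch using boto3; the endpoint name is hypothetical, and the exact payload schema depends on the serving container, so adapt both to your deployment.

```python
import json

import boto3

# Hypothetical endpoint name -- use the name shown on the endpoint details page.
ENDPOINT_NAME = "meta-llama-3-3-70b-instruct-endpoint"

runtime = boto3.client("sagemaker-runtime")

# JumpStart text-generation containers typically accept an
# {"inputs": ..., "parameters": ...} JSON payload.
payload = {
    "inputs": "Explain the difference between supervised fine-tuning and RLHF.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.6},
}

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```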
Deploy Llama 3.3 70B using SageMaker Python SDK
For teams looking to automate deployment or integrate with an existing MLOps pipeline, you can deploy the model with the SageMaker Python SDK, as shown in the sketch below.
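The following is a minimal sketch; the JumpStart model ID and instance type shown are assumptions, so verify both against the model card in your Region.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Assumed JumpStart model ID for Llama 3.3 70B Instruct -- check the model card.
model = JumpStartModel(model_id="meta-textgeneration-llama-3-3-70b-instruct")

# Deployment requires accepting Meta's EULA. The instance type should be one
# of the supported options (ml.g5.48xlarge or ml.p4d.24xlarge).
predictor = model.deploy(
    accept_eula=True,
    instance_type="ml.p4d.24xlarge",
)

# Quick smoke test against the new endpoint.
print(predictor.predict({"inputs": "What makes Llama 3.3 70B efficient?"}))
```

When you’re done experimenting, call predictor.delete_endpoint() so the instance doesn’t keep accruing charges.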
Configure autoscaling and scale down to zero
Optionally, you can configure autoscaling to scale down to zero after deployment; a rough sketch follows. For more information, see Reduce costs with SageMaker Inference’s new scale-down to zero feature.
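As a sketch of what that configuration looks like, assuming the endpoint uses inference components (scale to zero applies at the inference-component level) and a hypothetical component name:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical inference component backing the Llama 3.3 70B endpoint.
resource_id = "inference-component/llama-3-3-70b-ic"

# Registering the scalable target with MinCapacity=0 lets the component's
# copies scale all the way down during idle periods; a scaling policy (see
# the linked post) then brings capacity back when requests arrive.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=2,
)
```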
Optimize your deployment with SageMaker AI
SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With these advanced capabilities, organizations can take full advantage of Llama 3.3 70B’s efficiency while benefiting from SageMaker AI’s streamlined deployment process and optimization tools to deploy and manage LLMs in production. The default deployment through SageMaker JumpStart uses accelerated deployment, which leverages speculative decoding to improve throughput. For more information about how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.
First, Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight-streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before starting operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, reducing startup and scaling times. A hedged sketch of enabling it follows.
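For illustration, here is a sketch of preparing pre-sharded artifacts for Fast Model Loader with the SDK’s optimization workflow; the ModelBuilder arguments and the sharding_config shape are assumptions based on the inference optimization toolkit, so consult its documentation for the exact interface.

```python
from sagemaker.serve import ModelBuilder, SchemaBuilder

# Assumed JumpStart model ID; the sample input/output only drive payload marshalling.
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-3-70b-instruct",
    schema_builder=SchemaBuilder(sample_input="Hello", sample_output="Hi there!"),
)

# Assumed toolkit interface: optimize() writes pre-sharded weights to S3 so
# Fast Model Loader can stream them straight to the accelerators at startup.
optimized_model = model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    sharding_config={
        "OverrideEnvironment": {"OPTION_TENSOR_PARALLEL_DEGREE": "8"}
    },
    output_path="s3://amzn-s3-demo-bucket/llama-3-3-sharded/",  # hypothetical bucket
)
predictor = optimized_model.deploy()
```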
Another SageMaker inference feature is Container Caching, which transforms how model containers are managed during scaling operations. By pre-caching container images, this feature eliminates the need for time-consuming downloads when adding new instances, removing one of the major bottlenecks in scaling deployments. For large models like Llama 3.3 70B, where container images can be sizable, this optimization significantly reduces scaling latency and improves overall system responsiveness.
Another important feature is Scale to Zero, which introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This represents a paradigm shift in cost optimization for model deployments: endpoints can scale down completely during periods of inactivity while retaining the ability to scale up quickly when demand returns. This capability is especially valuable for organizations running multiple models or dealing with fluctuating workload patterns.
Together, these features create a powerful deployment environment that takes full advantage of Llama 3.3 70B’s efficient architecture and provides robust tools to manage operational costs and performance.
Conclusion
Llama 3.3 70B combined with the advanced inference capabilities of SageMaker AI provides an ideal solution for production deployments. Fast Model Loader, Container Caching, and Scale to Zero capabilities enable organizations to achieve both high performance and cost efficiency in LLM deployments.
We encourage you to try this implementation and share your experience.
About the authors
Marc Karp is an ML Architect on the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on key challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimization, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where she focuses on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of large language models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Adriana Simmons is a Senior Product Marketing Manager at AWS.
Lokeswaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Yotam Moss is a software development manager for AWS AI inference.