To stay competitive, companies across industries are using foundation models (FMs) to transform their applications. While FMs offer impressive out-of-the-box capabilities, achieving a true competitive edge often requires deep model customization through pre-training and fine-tuning. However, these approaches demand advanced AI expertise, high-performance compute, and fast storage access, which can be cost-prohibitive for many organizations.
In this post, we explore how organizations can address these challenges and cost-effectively customize and adapt FMs using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these tools help organizations optimize compute resources and reduce the complexity of training and fine-tuning models, and how to make an informed decision about which SageMaker service best fits your business needs and requirements.
Business challenges
Today’s enterprises face numerous challenges in effectively implementing and managing machine learning (ML) initiatives. These include scaling operations to handle rapidly growing data and models, accelerating the development of ML solutions, and managing complex infrastructure without diverting focus from core business objectives. Additionally, organizations need to optimize costs, maintain data security and compliance, and democratize both ease of use and access to ML tools across teams.
Some customers build their own ML architectures on bare-metal machines using open source solutions such as Kubernetes and Slurm. Although this approach provides control over the infrastructure, the effort required to manage and maintain it over time (for example, handling hardware failures) can be significant. Organizations often underestimate the complexity involved in integrating the various components, maintaining security and compliance, keeping systems up to date, and optimizing performance.
As a result, many companies struggle to leverage the full potential of ML while maintaining efficiency and innovation in a competitive environment.
How Amazon SageMaker can help
Amazon SageMaker addresses these challenges by providing fully managed services that streamline and accelerate the entire ML lifecycle. You can use a comprehensive set of SageMaker tools to build and train models at scale while offloading the management and maintenance of the underlying infrastructure to SageMaker.
With SageMaker, you can scale your training cluster to thousands of accelerators with the compute of your choice and optimize workload performance with the SageMaker distributed training libraries. To make clusters more resilient, SageMaker provides self-healing capabilities that automatically detect and recover from failures, enabling continuous FM training for months with little or no interruption and saving up to 40% of training time. SageMaker also supports popular ML frameworks such as TensorFlow and PyTorch through managed pre-built containers. For further customization, SageMaker also allows users to bring their own libraries and containers.
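To make this concrete, the following minimal sketch (using the SageMaker Python SDK) shows one way to enable the SageMaker distributed data parallel library through the estimator's distribution parameter. The script name, IAM role, S3 paths, and framework versions are illustrative placeholders, not values prescribed by this post.

```python
from sagemaker.pytorch import PyTorch

# Minimal sketch: a PyTorch training job with the SageMaker distributed
# data parallel library enabled. All names, versions, and paths below are
# placeholders for illustration only.
estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
    framework_version="2.2",  # use a version available as a SageMaker container
    py_version="py310",
    instance_count=4,  # scale out by increasing the node count
    instance_type="ml.p4d.24xlarge",
    # Enable the SageMaker distributed data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"training": "s3://amzn-s3-demo-bucket/training-data"})  # placeholder URI
```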
To address a variety of business and technical use cases, Amazon SageMaker offers two options for distributed pre-training and fine-tuning: SageMaker training jobs and SageMaker HyperPod.
SageMaker training jobs
SageMaker training jobs provide a managed user experience for distributed FM training at scale, removing the undifferentiated heavy lifting of infrastructure management and cluster resiliency while offering a pay-as-you-go option. Training jobs automatically launch resilient distributed training clusters, provide managed orchestration, monitor the infrastructure, and automatically recover from failures for a smooth training experience. After training completes, SageMaker spins down the cluster and bills customers for net training time on a per-second basis. FM builders can further optimize this experience with SageMaker managed warm pools, which retain and reuse provisioned infrastructure after a training job completes, reducing latency and iteration time between ML experiments.
SageMaker training jobs also give FM builders the flexibility to choose the instance type that best fits their workload, further optimizing the training budget. For example, you can pre-train a large language model (LLM) on a P5 cluster or fine-tune an open source LLM on P4d instances. This lets companies provide a consistent training experience across ML teams with varying levels of technical expertise and different workload types.
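Building on the warm pool and instance selection points above, here is a hedged sketch of a fine-tuning job that asks SageMaker to keep the provisioned cluster warm after completion, so the next experiment can skip cluster startup. All values are illustrative, and warm pools are subject to an account-level keep-alive quota.

```python
from sagemaker.pytorch import PyTorch

# Illustrative fine-tuning job on P4d that retains its cluster in a managed
# warm pool for 30 minutes after completion. Names and paths are placeholders.
estimator = PyTorch(
    entry_point="finetune.py",  # hypothetical fine-tuning script
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="2.2",
    py_version="py310",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",  # choose the accelerator that fits your budget
    keep_alive_period_in_seconds=1800,  # SageMaker managed warm pool for fast reuse
)

estimator.fit({"training": "s3://amzn-s3-demo-bucket/finetuning-data"})
```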
Additionally, SageMaker training jobs integrate with tools such as SageMaker Profiler for training job profiling, Amazon SageMaker with MLflow for ML experiment management, Amazon CloudWatch for monitoring and alerting, and TensorBoard for debugging and analyzing training jobs. Together, these tools enhance model development by providing performance insights, tracking experiments, and facilitating proactive management of the training process.
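As one illustration of how these integrations surface in the SageMaker Python SDK, the following sketch builds a TensorBoard output configuration and a basic profiler configuration. The S3 path and sampling interval are assumptions for the example; both objects would be passed to an estimator such as the ones sketched earlier.

```python
from sagemaker.debugger import ProfilerConfig, TensorBoardOutputConfig

# Illustrative monitoring setup; the S3 path is a placeholder. Pass these
# objects to an estimator via its tensorboard_output_config and
# profiler_config parameters.
tensorboard_config = TensorBoardOutputConfig(
    s3_output_path="s3://amzn-s3-demo-bucket/tensorboard-logs",
)

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,  # sample system metrics twice per second
)
```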
AI21 Labs, Technology Innovation Institute, Upstage, and Bria AI selected SageMaker training jobs to train and fine-tune FMs while reducing total cost of ownership by offloading workload orchestration and underlying compute management to SageMaker. SageMaker handled the provisioning, creation, and termination of their compute clusters, letting these teams focus their resources on model development and experimentation and deliver results faster.
The following demo provides a high-level, step-by-step guide to using Amazon SageMaker training jobs.
SageMaker HyperPod
SageMaker HyperPod provides persistent clusters with granular infrastructure control. Builders can connect to Amazon Elastic Compute Cloud (Amazon EC2) instances over Secure Shell (SSH) for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost to customers), minimizing downtime during critical node replacements. Customers can use familiar orchestration tools such as Slurm and Amazon Elastic Kubernetes Service (Amazon EKS), along with libraries built on top of them, for flexible job scheduling and compute sharing. Additionally, when you orchestrate a SageMaker HyperPod cluster with Slurm, you can use NVIDIA’s Enroot and Pyxis integration to quickly schedule containers as performant, unprivileged sandboxes. The operating system and software stack are based on a Deep Learning AMI preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. HyperPod also includes the SageMaker distributed training libraries, which are optimized for AWS infrastructure and automatically split training workloads across thousands of accelerators for efficient parallel training.
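For a sense of what provisioning a persistent cluster looks like, here is a rough boto3 sketch that creates a small HyperPod cluster with a controller group and a worker group. It assumes lifecycle scripts (such as an on_create.sh that bootstraps Slurm) are already staged in S3; the cluster name, instance types and counts, S3 URI, and role ARN are all placeholders.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Rough sketch: create a persistent HyperPod cluster with one controller
# group and one accelerated worker group. All field values are placeholders.
sagemaker.create_cluster(
    ClusterName="my-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",  # assumed bootstrap script
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        },
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        },
    ],
)
```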
FM builders can use HyperPod’s built-in ML tools to enhance model performance. For example, you can use Amazon SageMaker with TensorBoard to visualize a model’s architecture and address convergence issues, while SageMaker Debugger captures training metrics and profiles in real time. Additionally, integrations with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provide deeper insight into cluster performance, health, and utilization, saving valuable development time.
This self-healing, high-performance environment is trusted by customers including Articul8, IBM, Perplexity AI, Hugging Face, Luma, and Thomson Reuters to support advanced ML workflows and internal optimizations.
The following demo provides a high-level step-by-step guide to using Amazon SageMaker HyperPod.
Choosing the right option
SageMaker HyperPod is an ideal choice for organizations that require granular control over their training infrastructure and extensive customization options. HyperPod provides support for custom network configurations, flexible parallelism strategies, and custom orchestration techniques. It integrates seamlessly with tools such as Slurm, Amazon EKS, NVIDIA’s Enroot, and Pyxis, and provides SSH access for deep debugging and custom configuration.
SageMaker training jobs are tailored for organizations that focus on model development rather than infrastructure management and prefer the ease of use of a managed experience. Training jobs offer a user-friendly interface, simplified setup and scaling, automatic handling of distributed training tasks, built-in synchronization, checkpointing, and fault tolerance, and abstraction of infrastructure complexity.
When choosing between SageMaker HyperPod and SageMaker training jobs, organizations should base their decision on their specific training needs, workflow preferences, and desired level of control over the training infrastructure. HyperPod is the recommended option for businesses seeking advanced technical control and extensive customization, while training jobs are ideal for organizations that prefer a streamlined, fully managed solution.
Conclusion
To learn more about Amazon SageMaker and large-scale distributed training on AWS, visit Getting Started with Amazon SageMaker, watch the Amazon SageMaker Deep Dive series on generative AI, and explore the awsome-distributed-training and amazon-sagemaker-examples GitHub repositories.
About the authors
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. He works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Kanwaljit Khurmi is a Principal Generative AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions on AWS. Kanwaljit specializes in helping customers with containerized machine learning applications.
Miron Perel is a Principal Machine Learning Business Development Manager at Amazon Web Services. Miron advises generative AI companies on building their next-generation models.
Guillaume Mangeot is a Senior WW GenAI Specialist Solutions Architect at Amazon Web Services with over 10 years of experience in high performance computing (HPC). With an interdisciplinary background in applied mathematics, he leads the design of highly scalable architectures in cutting-edge fields such as generative AI, ML, HPC, and storage, across industries including oil and gas, research, life sciences, and insurance.