Scaling machine learning (ML) workflows from early prototypes to large-scale production deployments is a difficult task, but the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod provides a streamlined solution to this challenge. As teams progress from proof of concept to production-ready models, they often struggle to manage their growing infrastructure and storage needs effectively. This integration addresses these obstacles by providing data scientists and ML engineers with a comprehensive environment that supports the entire ML lifecycle, from development to large-scale deployment.
This post describes the process of scaling ML workloads using SageMaker Studio and SageMaker HyperPod.
Solution overview
Implementing the solution consists of the following high-level steps:
- Set up the environment and permissions to access your SageMaker HyperPod cluster in SageMaker Studio.
- Create a JupyterLab space and mount an Amazon FSx for Lustre file system into the space. This eliminates the need for data migration or code changes as you scale, and reduces potential reproducibility issues that often arise from data inconsistencies at different stages of model development.
- You can now discover SageMaker HyperPod clusters and view cluster details and metrics using SageMaker Studio. If you have access to multiple clusters, this information can help you compare each cluster’s specifications, current utilization, and cluster queue status to determine which cluster meets the requirements of a particular ML task.
- Use a sample notebook to demonstrate how to connect to a cluster and run a Meta Llama 2 training job using PyTorch FSDP on a Slurm cluster.
- After you submit a long-running job to your cluster, you can monitor the task directly through the SageMaker Studio UI. This provides real-time insights into distributed workflows, allowing you to quickly identify bottlenecks, optimize resource utilization, and improve overall workflow efficiency.
This unified approach not only streamlines the transition from prototype to large-scale training, but also increases overall productivity by maintaining a familiar development experience as you scale up to production-level workloads.
Prerequisites
Complete the following prerequisite steps:
- Create a SageMaker HyperPod Slurm cluster. For instructions, see the Amazon SageMaker HyperPod Workshop or the Getting Started with SageMaker HyperPod tutorial.
- Make sure you’re using the latest version of the AWS Command Line Interface (AWS CLI).
- Create a user with a UID greater than 10000 on the Slurm head node or login node. For instructions on creating users, see Multi-User.
- Add a tag to your SageMaker HyperPod cluster with the key hyperpod-cluster-filesystem and, as the value, the ID of the FSx for Lustre file system associated with the cluster. SageMaker Studio requires this tag to mount FSx for Lustre into your JupyterLab and Code Editor spaces.
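The tag can be added with the AWS CLI sagemaker add-tags command. The following sketch builds that command; the cluster ARN and file system ID are placeholders to replace with your own values.

```python
# Placeholders -- replace with your own cluster ARN and FSx for Lustre file system ID.
CLUSTER_ARN = "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcd1234efgh"
FSX_ID = "fs-0123456789abcdef0"

def build_add_tags_command(cluster_arn: str, fsx_id: str) -> list[str]:
    # sagemaker add-tags attaches the hyperpod-cluster-filesystem tag that
    # SageMaker Studio reads when mounting the file system into a space.
    return [
        "aws", "sagemaker", "add-tags",
        "--resource-arn", cluster_arn,
        "--tags", f"Key=hyperpod-cluster-filesystem,Value={fsx_id}",
    ]

print(" ".join(build_add_tags_command(CLUSTER_ARN, FSX_ID)))
```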
Set permissions
The following sections outline the steps to create an Amazon SageMaker domain, create a user, set up a SageMaker Studio space, and connect to a SageMaker HyperPod cluster. After completing these steps, you can connect to your SageMaker HyperPod Slurm cluster and run the sample training workload. You must have administrator privileges to follow the setup instructions. Follow these steps:
- Create a new AWS Identity and Access Management (IAM) execution role and attach AmazonSageMakerFullAccess to it. Also attach a JSON policy that allows SageMaker Studio to access your SageMaker HyperPod cluster, and make sure the role's trust policy allows the sagemaker.amazonaws.com service principal to assume the role.
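The exact policy from the original post is not reproduced here; the sketch below shows a plausible trust policy and a minimal access policy, assuming the listed SageMaker and Systems Manager actions cover cluster discovery and node access. Verify the action list against your own security requirements.

```python
import json

# Sketch of an access policy for SageMaker Studio to work with HyperPod clusters.
# The action list is an assumption -- scope it to your security requirements.
hyperpod_access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "sagemaker:ListClusters",        # discover clusters in the account
            "sagemaker:DescribeCluster",     # view cluster details and metrics
            "sagemaker:ListClusterNodes",
            "sagemaker:DescribeClusterNode",
            "ssm:StartSession",              # connect to nodes through Systems Manager
            "ssm:TerminateSession",
        ],
        "Resource": "*",
    }],
}

# Trust policy allowing the SageMaker service to assume the execution role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

print(json.dumps(hyperpod_access_policy, indent=2))
print(json.dumps(trust_policy, indent=2))
```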
- To use the role you created to access your SageMaker HyperPod cluster's head or login node through AWS Systems Manager, add a tag to this IAM role with key SSMSessionRunAs and value set to your POSIX user name. The POSIX user is the user set up on the Slurm head node; Session Manager uses this user to run sessions on the head node.
- Enabling Run As support makes Session Manager start sessions as the tagged user instead of the default ssm-user account on the managed node. To enable Run As support in Session Manager, follow these steps:
- On the Session Manager console, choose Settings, then choose Edit.
- Don't specify a user name; it is taken from the SSMSessionRunAs role tag you created earlier.
- In the Linux shell profile section, enter /bin/bash.
- Choose Save.
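The SSMSessionRunAs tag can be added to the IAM role with the AWS CLI iam tag-role command. In this sketch, the role name and POSIX user are hypothetical; substitute the role and Slurm user you created.

```python
ROLE_NAME = "SageMakerStudioHyperPodRole"  # hypothetical role name
POSIX_USER = "hyperpod-user"               # user set up on the Slurm head node

def build_tag_role_command(role_name: str, posix_user: str) -> list[str]:
    # iam tag-role adds the tag that Session Manager's Run As support reads
    # to decide which OS user to start the session as.
    return [
        "aws", "iam", "tag-role",
        "--role-name", role_name,
        "--tags", f"Key=SSMSessionRunAs,Value={posix_user}",
    ]

print(" ".join(build_tag_role_command(ROLE_NAME, POSIX_USER)))
```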
- Create a new SageMaker Studio domain using the execution role you created, along with the other parameters required to access the SageMaker HyperPod cluster. Use the same VPC and subnets as the SageMaker HyperPod cluster for VPC_ID and SUBNET_ID, and set EXECUTION_ROLE_ARN to the role you created earlier.
The UID and GID in this configuration default to 10000 and 1001; you can override them to match the user you created in Slurm. This UID/GID is used to grant permissions on the FSx for Lustre file system. Setting these values at the domain level gives every user the same UID; to set an individual UID per user, consider setting CustomPosixUserConfig when creating each user profile.
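The domain-creation script is not reproduced here; the sketch below shows the shape of a CreateDomain request with the UID/GID defaults described above. The VPC, subnet, role, and domain name values are placeholders, and parameter names should be checked against the current SageMaker API reference.

```python
import json

# Placeholders -- use the same VPC and subnet as your SageMaker HyperPod cluster.
VPC_ID = "vpc-0123456789abcdef0"
SUBNET_ID = "subnet-0123456789abcdef0"
EXECUTION_ROLE_ARN = "arn:aws:iam::111122223333:role/SageMakerStudioHyperPodRole"

create_domain_request = {
    "DomainName": "hyperpod-studio-domain",  # hypothetical name
    "AuthMode": "IAM",
    "VpcId": VPC_ID,
    "SubnetIds": [SUBNET_ID],
    "AppNetworkAccessType": "VpcOnly",       # keep traffic inside the cluster's VPC
    "DefaultUserSettings": {
        "ExecutionRole": EXECUTION_ROLE_ARN,
        # Domain-level defaults; override per user with CustomPosixUserConfig
        # on the user profile if each user needs a distinct UID.
        "CustomPosixUserConfig": {"Uid": 10000, "Gid": 1001},
    },
}

# Pass this to `aws sagemaker create-domain --cli-input-json ...` or boto3's create_domain.
print(json.dumps(create_domain_request, indent=2))
```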
- After you create the domain, attach the SecurityGroupIdForInboundNfs security group created as part of domain creation to all ENIs of the FSx for Lustre volume:
- Find the Amazon Elastic File System (Amazon EFS) file system associated with your domain and the security group attached to it. You can find the file system on the Amazon EFS console; it is tagged with the domain ID, as shown in the following screenshot.
- The corresponding security group is named inbound-nfs-<domain-id> and can be found on the file system's Network tab.
- On the FSx for Lustre console, choose your file system, then choose to see all ENIs on the Amazon EC2 console. This displays all ENIs attached to the FSx for Lustre file system. Alternatively, you can find the ENIs using the AWS CLI or the fsx:DescribeFileSystems API.
- For each ENI, attach the domain's SecurityGroupIdForInboundNfs security group.
Alternatively, you can script finding and attaching the security group to the ENIs associated with the FSx for Lustre volume, replacing the REGION, DOMAIN_ID, and FSX_ID attributes accordingly.
Without this step, application creation will fail with an error.
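The script itself is not reproduced here; the sketch below lists the equivalent AWS CLI sequence, assuming the security group follows the inbound-nfs-<domain-id> naming described above.

```python
REGION = "us-west-2"           # placeholders -- replace accordingly
DOMAIN_ID = "d-abcd1234efgh"
FSX_ID = "fs-0123456789abcdef0"

def build_attach_steps(region: str, domain_id: str, fsx_id: str) -> list[str]:
    """Return the AWS CLI commands, in order, to attach the domain's inbound-NFS
    security group to every ENI of the FSx for Lustre file system."""
    return [
        # 1. List the file system's ENIs (NetworkInterfaceIds in the response).
        f"aws fsx describe-file-systems --file-system-ids {fsx_id} --region {region}",
        # 2. Look up the domain's inbound-NFS security group by name.
        f"aws ec2 describe-security-groups --region {region} "
        f"--filters Name=group-name,Values=inbound-nfs-{domain_id}",
        # 3. For each ENI, set its security groups. Note: --groups REPLACES the
        #    existing set, so include the ENI's current groups plus the new one.
        f"aws ec2 modify-network-interface-attribute --region {region} "
        "--network-interface-id <eni-id> --groups <existing-group-ids> <inbound-nfs-group-id>",
    ]

for step in build_attach_steps(REGION, DOMAIN_ID, FSX_ID):
    print(step)
```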
- After you create a domain, you can use it to create user profiles. Replace the DOMAIN_ID value with the value created in the previous step.
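The user-profile script is likewise not reproduced; the following is a sketch of the request shape, with a hypothetical profile name and an optional per-user UID override through CustomPosixUserConfig.

```python
import json

DOMAIN_ID = "d-abcd1234efgh"  # replace with the domain ID from the previous step

create_user_profile_request = {
    "DomainId": DOMAIN_ID,
    "UserProfileName": "hyperpod-user",  # hypothetical name
    "UserSettings": {
        # Optional per-user override; should match a user that exists on the
        # Slurm head node with a UID greater than 10000.
        "CustomPosixUserConfig": {"Uid": 10001, "Gid": 1001},
    },
}

# Pass to `aws sagemaker create-user-profile --cli-input-json ...` or boto3.
print(json.dumps(create_user_profile_request, indent=2))
```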
Create a JupyterLab space and mount the FSx for Lustre file system
Create a space that uses the FSx for Lustre file system, then create an application in the space.
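The original snippets are not reproduced; the sketch below shows plausible CreateSpace and CreateApp request shapes. The space name, profile name, and instance type are hypothetical, and the FSx for Lustre file system parameter names should be verified against the current SageMaker API reference.

```python
import json

DOMAIN_ID = "d-abcd1234efgh"        # placeholders -- replace with your own values
USER_PROFILE_NAME = "hyperpod-user"
FSX_ID = "fs-0123456789abcdef0"

create_space_request = {
    "DomainId": DOMAIN_ID,
    "SpaceName": "hyperpod-space",  # hypothetical name
    "OwnershipSettings": {"OwnerUserProfileName": USER_PROFILE_NAME},
    "SpaceSharingSettings": {"SharingType": "Private"},
    "SpaceSettings": {
        "AppType": "JupyterLab",
        # Mounts the cluster's FSx for Lustre file system into the space.
        "CustomFileSystems": [{"FSxLustreFileSystem": {"FileSystemId": FSX_ID}}],
    },
}

create_app_request = {
    "DomainId": DOMAIN_ID,
    "SpaceName": "hyperpod-space",
    "AppType": "JupyterLab",
    "AppName": "default",
    "ResourceSpec": {"InstanceType": "ml.t3.medium"},  # hypothetical instance type
}

print(json.dumps(create_space_request, indent=2))
print(json.dumps(create_app_request, indent=2))
```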
Discover clusters in SageMaker Studio
You should now be all set to access your SageMaker HyperPod cluster using SageMaker Studio. Follow these steps:
- On the SageMaker console, choose Admin configurations, then choose Domains.
- Locate the user profile you created and launch SageMaker Studio.
- In the navigation pane, under Compute, choose HyperPod clusters.
Here you can view the SageMaker HyperPod clusters available in your account.
- Review cluster details and cluster hardware metrics to identify the right cluster for your training workload.
You can also preview your cluster by selecting the arrow icon.
You can also use the Settings and Details tabs to see detailed information about your cluster.
Work in SageMaker Studio and connect to your cluster
You can also launch JupyterLab or a code editor to mount cluster FSx for Luster volumes for development and debugging.
- In SageMaker Studio, choose JupyterLab from the list of applications.
- Choose a space where your FSx for Luster file system is mounted to ensure a consistent and reproducible environment.
The Cluster file system column indicates which spaces have the cluster file system mounted.
This launches JupyterLab with the FSx for Lustre volume mounted. By default, the getting started notebook appears in your home folder. This notebook provides step-by-step instructions for running a Meta Llama 2 training job using PyTorch FSDP on the Slurm cluster, and shows how to use SageMaker Studio notebooks to move from prototyping training scripts to scaling up workloads across multiple instances in a cluster environment. You should also see the FSx for Lustre file system you mounted in your JupyterLab space at /home/sagemaker-user/custom-file-systems/fsx_lustre.
Monitor tasks in SageMaker Studio
You can view the list of tasks currently in the Slurm queue by going to SageMaker Studio and selecting your cluster.
Select a task to see additional details, such as the schedule and job state, resource usage, and job submission and limit details.
You can also release, requeue, suspend, and hold these Slurm tasks from the UI.
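These UI actions correspond to standard Slurm scontrol subcommands, which you can also run from the cluster's head node. A small sketch of the mapping (the job ID is hypothetical):

```python
def scontrol_command(action: str, job_id: int) -> str:
    # release/requeue/suspend/hold are the scontrol equivalents of the
    # Studio UI actions on a queued or running Slurm job.
    allowed = {"release", "requeue", "suspend", "hold"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return f"scontrol {action} {job_id}"

print(scontrol_command("hold", 42))     # -> scontrol hold 42
print(scontrol_command("release", 42))  # -> scontrol release 42
```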
Clean up
To clean up your resources, follow these steps:
- Delete the spaces.
- Delete the user profile.
- Delete the domain. To keep the EFS volume, specify HomeEfsFileSystem=Retain.
- Delete the SageMaker HyperPod cluster.
- Delete the IAM role you created.
Conclusion
In this post, we explored an approach to streamlining ML workflows using SageMaker Studio. We showed how you can seamlessly move from prototyping training scripts within SageMaker Studio to scaling up your workload across multiple instances in a cluster environment. We also covered how to mount cluster FSx for Lustre volumes into a SageMaker Studio space for a consistent and reproducible environment.
This approach not only streamlines your development process, but also makes it easy to start long-running jobs on your cluster and monitor their progress directly from SageMaker Studio.
We encourage you to try it out and share your feedback in the comments section.
Special thanks to Durga Sury (Senior ML SA), Monidipa Chakraborty (Senior SDE), and Sumedha Swamy (Senior Manager PMT) for their help in launching this post.
About the author
Arun Kumar Lokanata is a Senior ML Solutions Architect on the Amazon SageMaker team. He specializes in large language model training workloads and helps customers build LLM workloads using SageMaker HyperPod, SageMaker Training Jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Pooja Karaj is a Senior Technical Product Manager at Amazon Web Services. At AWS, she is part of the Amazon SageMaker Studio team, helping build products that meet the needs of administrators and data scientists. She started her career as a software engineer before transitioning to product management. Outside of work, she enjoys creating travel planners in spreadsheets, in true MBA style. Given the amount of time she spends on these planners, it's clear she has a deep love of travel, along with a strong passion for hiking.