Scale your ML workflows using Amazon SageMaker Studio and Amazon SageMaker HyperPod

Scaling machine learning (ML) workflows from early prototypes to large-scale production deployments is a difficult task, but the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod provides a streamlined solution to this challenge. . As teams progress from proof-of-concept to production-ready models, they often struggle to effectively manage their growing infrastructure and storage needs. This integration addresses these obstacles by providing data scientists and ML engineers with a comprehensive environment that supports the entire ML lifecycle, from development to large-scale deployment.

This post describes the process of scaling ML workloads using SageMaker Studio and SageMaker HyperPod.

Solution overview

Implementing the solution consists of the following high-level steps:

Set up the environment and permissions to access your Amazon HyperPod cluster in SageMaker Studio.
Create a JupyterLab space and mount an Amazon FSx for Luster file system into the space. This eliminates the need for data migration or code changes as you scale. This also reduces potential reproducibility issues that often arise from data inconsistencies at different stages of model development.
You can now discover SageMaker HyperPod clusters and view cluster details and metrics using SageMaker Studio. If you have access to multiple clusters, this information can help you compare each cluster’s specifications, current utilization, and cluster queue status to determine which cluster meets the requirements of a particular ML task.
Use a sample notebook to demonstrate how to connect to a cluster and run a Meta Llama 2 training job using PyTorch FSDP on a Slurm cluster.
After you submit a long-running job to your cluster, you can monitor the task directly through the SageMaker Studio UI. This provides real-time insights into distributed workflows, allowing you to quickly identify bottlenecks, optimize resource utilization, and improve overall workflow efficiency.

This unified approach not only streamlines the transition from prototype to large-scale training, but also increases overall productivity by maintaining a familiar development experience as you scale up to production-level workloads. Improve.

Prerequisites

Complete the following prerequisite steps:

Create a SageMaker HyperPod Slurm cluster. For instructions, see the Amazon SageMaker HyperPod Workshop or the Getting Started with SageMaker HyperPod tutorial.
Make sure you’re using the latest version of the AWS Command Line Interface (AWS CLI).
Create a user with a UID greater than 10000 on the Slurm head node or login node. For instructions on creating users, see Multi-User.
Tag a key to your SageMaker HyperPod cluster. hyperpod-cluster-filesystem. This is the ID of the FSx for Luster file system associated with your SageMaker HyperPod cluster. This is required for Amazon SageMaker Studio to mount FSx for Luster into your Jupyter Lab and Code Editor spaces. Add tags to your existing SageMaker HyperPod cluster using the following code snippet.
```
aws sagemaker add-tags --resource-arn <cluster_ARN> \
--tags Key=hyperpod-cluster-filesystem,Value=<fsx_id>
```

Set permissions

The following sections outline the steps to create an Amazon SageMaker domain, create a user, set up a SageMaker Studio space, and connect to a SageMaker HyperPod cluster. After completing these steps, you can connect to your SageMaker HyperPod Slurm cluster and run the sample training workload. You must have administrator privileges to follow the setup instructions. Follow these steps:

Create a new AWS Identity and Access Management (IAM) execution role and attach AmazonSageMakerFullAccess to the role. Also attach the following JSON policy to the role. This allows SageMaker Studio to access your SageMaker HyperPod cluster. Make sure the role trust allows it. sagemaker.amazonaws.com A service assumes this role.

{
    "Version": "2012-10-17",            
    "Statement": (
        {
            "Effect": "Allow",
            "Action": (
                "ssm:StartSession",
                "ssm:TerminateSession"
            ),
            "Resource": "*"    
        }
{
            "Effect": "Allow",
            "Action": (
                "sagemaker:CreateCluster",
                "sagemaker:ListClusters"
            ),
            "Resource": "*"    
        },
        {
            "Effect": "Allow",
            "Action": (
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware"
            ),
            "Resource": "arn:aws:sagemaker:region:account-id:cluster/*"    
        }
    )
}

To use the role you created to access your SageMaker HyperPod cluster head or login node using AWS Systems Manager, you must add a tag to this IAM role. Tag Key = “SSMSessionRunAs” and Tag Value = “<posix user>”. The POSIX user is the user that is set up on the Slurm head node. Systems Manager uses this user to run the head node.
Enabling Run As support allows Session Manager to ssm-user An account on a managed node. To enable execution in Session Manager, follow these steps:
1. In the Session Manager Console, select: settingthen select edit.
2. Do not specify a username. Username is selected from role tag SSMSessionRunAs Previously created.
3. in Linux shell profile Enter “/bin/bash” in the section.
4. choose keep.

Create a new SageMaker Studio domain using the previously created execution role and other parameters required to access the SageMaker HyperPod cluster. Create the domain using the following script and replace the export variables accordingly. here, VPC_ID and Subnet_ID Same as VPC and subnets for SageMaker HyperPod clusters. of EXECUTION_ROLE_ARN is the role you created earlier.

export DOMAIN_NAME=<domain name>
export VPC_ID=vpc_id-for_hp_cluster
export SUBNET_ID=private_subnet_id
export EXECUTION_ROLE_ARN=execution_role_arn
export FILE_SYSTEM_ID=fsx id
export USER_UID=10000
export USER_GID=1001
export REGION=us-east-2

cat > user_settings.json << EOL
{
    "ExecutionRole": "$EXECUTION_ROLE_ARN",
    "CustomPosixUserConfig":
    {
        "Uid": $USER_UID,
        "Gid": $USER_GID
    },
    "CustomFileSystemConfigs":
    (
        {
            "FSxLustreFileSystemConfig":
            {
                "FileSystemId": "$FILE_SYSTEM_ID",
                "FileSystemPath": "$FILE_SYSTEM_PATH"
            }
        }
    )
}
EOL

aws sagemaker create-domain \
--domain-name $DOMAIN_NAME \
--vpc-id $VPC_ID \
--subnet-ids $SUBNET_ID \
--auth-mode IAM \
--default-user-settings file://user_settings.json \
--region $REGION

The UID and GID for the above configuration are set as follows: 10000 and 1001 As default; this can be overridden according to the user created in Slurm and this UID/GID is used to grant permissions to the FSx for Luster file system. Also, setting this at the domain level gives each user the same UID. To set individual UIDs for each user, consider the following settings: CustomPosixUserConfig Creating user profile.

Once you have created your domain, you need to attach it SecurityGroupIdForInboundNfs Created as part of domain creation for all ENIs of FSx Luster volumes.
1. Find the Amazon Elastic File System (Amazon EFS) file system associated with your domain and the corresponding security group attached to it. EFS file systems can be found in the Amazon EFS console. The domain ID is tagged as shown in the following screenshot.
2. Collect and name the corresponding security group. inbound-nfs-<domain-id> can be found at network tab.
3. In the FSx for Luster console, select: To see all ENIs, see the Amazon EC2 console.. This will display all ENIs attached to FSx for Luster. Alternatively, use the AWS CLI or fsx:describeFileSystems
4. For each ENI, SecurityGroupIdForInboundNfs to that of the domain.

Alternatively, you can use the following script to automatically find and attach security groups to ENIs associated with FSx for Luster volumes. Please replace REGION, DOMAIN_IDand FSX_ID Change the attributes accordingly.

#!/bin/bash

export REGION=us-east-2
export DOMAIN_ID=d-xxxxx
export FSX_ID=fs-xxx

export EFS_ID=$(aws sagemaker describe-domain --domain-id $DOMAIN_ID --region $REGION --query 'HomeEfsFileSystemId' --output text)
export MOUNT_TARGET_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID --region $REGION --query 'MountTargets(0).MountTargetId' --output text)
export EFS_SG=$(aws efs describe-mount-target-security-groups --mount-target-id $MOUNT_TARGET_ID --query 'SecurityGroups(0)' --output text)
echo "security group associated with the Domain $EFS_SG"

echo "Adding security group to FSxL file system ENI's"
# Get the network interface IDs associated with the FSx file system
NETWORK_INTERFACE_IDS=$(aws fsx describe-file-systems --file-system-ids $FILE_SYSTEM_ID --query "FileSystems(0).NetworkInterfaceIds" --output text)
# Iterate through each network interface and attach the security group
for ENI_ID in $NETWORK_INTERFACE_IDS; do
aws ec2 modify-network-interface-attribute --network-interface-id $ENI_ID --groups $EFS_SG
echo "Attached security group $EFS_SG to network interface $ENI_ID"
done

Without this step, application creation will fail with an error.

After you create a domain, you can use it to create user profiles. Replace the DOMAIN_ID value with the value created in the previous step.

export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export REGION=us-east-2

aws sagemaker create-user-profile \
--domain-id $DOMAIN_ID \
--user-profile-name$USER_PROFILE_NAME \
--region $REGION

Create a JupyterLab space and mount the FSx for Luster file system.

Create a space using the FSx for Luster file system using the following code.

export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export FILE_SYSTEM_ID=fs-xxx
export REGION=us-east-2

aws sagemaker create-space --domain-id $DOMAIN_ID \
--space-name $SPACE_NAME \
--space-settings "AppType=JupyterLab,CustomFileSystems=({FSxLustreFileSystem={FileSystemId=$FILE_SYSTEM_ID}})" \
--ownership-settings OwnerUserProfileName=$USER_PROFILE_NAME  --space-sharing-settings SharingType=Private  \
--region $REGION

Create an application with spaces in the following code.

export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export APP_NAME=test-app
export INSTANCE_TYPE=ml.t3.medium
export REGION=us-east-2
export IMAGE_ARN=arn:aws:sagemaker:us-east-2:081975978581:image/sagemaker-distribution-cpu

aws sagemaker create-app --space-name $SPACE_NAME \
--resource-spec '{"InstanceType":"$INSTANCE_TYPE","SageMakerImageArn":"$IMAGE_ARN"}' \
--domain-id  $DOMAIN_ID --app-type JupyterLab --app-name $APP_NAME --region $REGION

Discover clusters in SageMaker Studio

You should now be all set to access your SageMaker HyperPod cluster using SageMaker Studio. Follow these steps:

In the SageMaker console, select: Administrator configuration, domain.
Locate the user profile you created and launch SageMaker Studio.
under computing In the navigation pane, select HyperPod cluster.

Here you can view the SageMaker HyperPod clusters available in your account.

Review cluster details and cluster hardware metrics to identify the right cluster for your training workload.

You can also preview your cluster by selecting the arrow icon.

You can also go to setting and detail Use the tabs to see detailed information about your cluster.

Work in SageMaker Studio and connect to your cluster.

You can also launch JupyterLab or a code editor to mount cluster FSx for Luster volumes for development and debugging.

In SageMaker Studio, select Let’s start with and select jupiter lab.
Choose a space where your FSx for Luster file system is mounted to ensure a consistent and reproducible environment.

of cluster file system The column indicates the space in which the cluster file system is mounted.

This should start JupyterLab with the FSx for Luster volume mounted. By default, the Getting Started Notebook appears in your home folder. This notebook provides step-by-step instructions for running a Meta Llama 2 training job using PyTorch FSDP on a Slurm cluster. This example notebook shows how to use SageMaker Studio notebooks to move from prototyping training scripts to scaling up workloads across multiple instances in a cluster environment. Additionally, you should see the FSx for Luster file system you mounted in your JupyterLab space at: /home/sagemaker-user/custom-file-systems/fsx_lustre.

Monitor tasks in SageMaker Studio

You can view the list of tasks currently in the Slurm queue by going to SageMaker Studio and selecting your cluster.

Select a task to get additional task details such as schedule and job status, resource usage details, job submissions and limits.

You can also perform actions such as releasing, requeuing, pausing, and holding these Slurm tasks using the UI.

cleaning

To clean up your resources, follow these steps:

Remove spaces.

aws —region <REGION> sagemaker delete-space \
--domain-id <DomainId> \
--space-name <SpaceName>

Delete a user profile.

aws —region <REGION> sagemaker delete-user-profile \
--domain-id <DomainId> \
--user-profile-name <UserProfileName>

Delete a domain. To keep EFS volumes, specify HomeEfsFileSystem=Retain.

aws —region <REGION> sagemaker delete-domain \
--domain-id <DomainId> \
--retention-policy HomeEfsFileSystem=Delete

Delete the SageMaker HyperPod cluster.
Delete the IAM role you created.

conclusion

In this post, we explored an approach to streamlining ML workflows using SageMaker Studio. We showed how you can seamlessly move from prototyping training scripts within SageMaker Studio to scaling up your workload across multiple instances in a clustered environment. We also covered how to mount cluster FSx for Luster volumes into a SageMaker Studio space for a consistent and reproducible environment.

This approach not only streamlines your development process, but also makes it easy to start long-running jobs on your cluster and monitor their progress directly from SageMaker Studio.

We encourage you to try it out and share your feedback in the comments section.

Special thanks to Durga Sury (Senior ML SA), Monidipa Chakraborty (Senior SDE), and Sumedha Swamy (Senior Manager PMT) for their help in launching this post.

About the author

Arun Kumar Lokanata I am a Senior ML Solutions Architect on the Amazon SageMaker team. He specializes in large-scale language model training workloads and helps customers build LLM workloads using SageMaker HyperPod, SageMaker Training Jobs, and SageMaker Distributed Training. Outside of work, I enjoy running, hiking, and cooking.

Pooja Karaj I am a Senior Technical Product Manager at Amazon Web Services. At AWS, she is part of the Amazon SageMaker Studio team, helping build products that meet the needs of administrators and data scientists. She started her career as a software engineer and then transitioned to product management. Outside of work, I enjoy creating travel planners using spreadsheets in true MBA style. Considering the amount of time she spent creating these planners, it’s clear that she has a deep love for travel, along with a strong passion for hiking.

What's Hot

The mathematical theory that made the Internet possible

Enjoy the Enhanced UI of Windows 11 Pro for 89% off

YouTube adds AI-generated face and voice detection tools

AWS Deepracer: Closure times at AWS Re: Invent 2024 – How was that physical race?

Evaluate healthcare generative AI applications using LLM-as-a-judge on AWS

Billions of daily videos using the bytedance process AWS multimodal video understanding model

How to configure a cross-account model deployment using Amazon Bedrock custom model import

How Pattern PXM’s Content Brief is driving conversion on ecommerce marketplaces using AI

AWS Deepracer: Closure times at AWS Re: Invent 2024 – How was that physical race?

Evaluate healthcare generative AI applications using LLM-as-a-judge on AWS

Billions of daily videos using the bytedance process AWS multimodal video understanding model

New Zealand vs USA Paris 2024 live stream: watch football free

Wild bees find surprising nesting sites in cities

Buccaneers vs Jaguars live stream: How to watch NFL preseason for free

Most Popular

These newly identified cells could change the face of plastic surgery

Amandla Stenberg Wasn’t Shocked by The Acolyte’s Cancelation

Blood-clotting drug heparin may prevent snakebite victims from losing limbs

Our Picks

Does this title say it all? “Penis Damage Caused by Vacuum Cleaners”

Trellix reduces costs, increases speeds and adds delivery flexibility with Amazon Nova Micro and Amazon Nova Lite models with cost-effective performance

Beyond Meat is stalled in the U.S.; Europe may be a different story

Subscribe to our newsletter

Subscribe to Updates

What's Hot

Scale your ML workflows using Amazon SageMaker Studio and Amazon SageMaker HyperPod

Solution overview

Prerequisites

Set permissions

Create a JupyterLab space and mount the FSx for Luster file system.

Discover clusters in SageMaker Studio

Work in SageMaker Studio and connect to your cluster.

Monitor tasks in SageMaker Studio

cleaning

conclusion

About the author

Related Posts

Subscribe to our newsletter

Subscribe to our newsletter