With access to a wide range of generative AI foundation models (FMs) and the ability to build and train their own machine learning (ML) models in Amazon SageMaker, users want a seamless and secure way to experiment with and select the models that deliver the most value for their business. During the early stages of an ML project, data scientists collaborate closely to share experimental results and address business challenges. However, keeping track of large numbers of experiments, their parameters, metrics, and results can be difficult, especially when working on multiple complex projects at the same time. MLflow, a popular open source tool, helps data scientists organize, track, and analyze ML and generative AI experiments, making it easier to reproduce and compare results.
SageMaker is a comprehensive, fully managed ML service designed to provide data scientists and ML engineers with the tools they need to handle the entire ML workflow. Amazon SageMaker with MLflow is a capability of SageMaker that enables users to seamlessly create, manage, analyze, and compare ML experiments. It simplifies the complex and time-consuming tasks involved in setting up and managing an MLflow environment, allowing ML administrators to quickly establish a secure and scalable MLflow environment on AWS. For more information, see Fully managed MLflow for Amazon SageMaker.
Enhanced security: Amazon VPC and AWS PrivateLink
When using SageMaker, you can decide the level of internet access you want to provide your users. For example, you can give users permission to download popular packages or customize their development environment. However, this may also create a potential risk of unauthorized access to your data. To reduce these risks, you can further limit the traffic that can access the internet by launching your ML environment in Amazon Virtual Private Cloud (Amazon VPC). Amazon VPC also lets you control network access and internet connectivity for your SageMaker environment, or add another layer of security by removing direct internet access. To understand the implications of running SageMaker inside a VPC and the differences when using network isolation, see Connect to SageMaker through VPC interface endpoints.
SageMaker with MLflow now supports AWS PrivateLink, which lets you transfer critical data from your VPC to the MLflow tracking server through a VPC endpoint. This feature adds protection for sensitive information by keeping data sent to the MLflow tracking server within the AWS network, avoiding exposure to the public internet. This feature is available in all AWS Regions where SageMaker is currently available, except the China Regions and the AWS GovCloud (US) Regions. For more information, see Connect to your MLflow tracking server through an interface VPC endpoint.
In this blog post, we show how to use SageMaker in a private VPC with no internet access, while using MLflow capabilities to accelerate your ML experimentation.
Solution overview
The reference code for this sample can be found on GitHub. The general steps are as follows:
- Deploy the infrastructure using the AWS Cloud Development Kit (AWS CDK), including a VPC with no internet access, the required VPC endpoints, an MLflow tracking server, and a CodeArtifact domain and repository.
- Run ML experiments in MLflow using the @remote decorator from the open source SageMaker Python SDK.
The overall solution architecture is shown in the following diagram.
For reference, this blog post provides a solution for creating a VPC without an internet connection using AWS CloudFormation templates.
Prerequisites
You need an AWS account with an AWS Identity and Access Management (IAM) role that has permissions to manage the resources created as part of your solution. For more information, see Create an AWS Account.
Deploy infrastructure using AWS CDK
The first step is to create your infrastructure using this CDK stack. You can follow the deployment instructions in the README.
First, let’s take a closer look at the CDK stack itself.
We define multiple VPC endpoints, including the MLflow endpoint, as shown in the following sample.
vpc.add_interface_endpoint(
    "mlflow-experiments",
    service=ec2.InterfaceVpcEndpointAwsService.SAGEMAKER_EXPERIMENTS,
    private_dns_enabled=True,
    subnets=ec2.SubnetSelection(subnets=subnets),
    security_groups=[studio_security_group]
)
We also restrict the SageMaker execution IAM role so that SageMaker MLflow can be used only from within the appropriate VPC.
You can further restrict MLflow’s VPC endpoints by attaching a VPC endpoint policy.
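For illustration, here is a minimal sketch of such an endpoint policy in the CDK stack. It assumes the endpoint created earlier is assigned to a variable mlflow_endpoint; the statement is a hypothetical example, not the exact policy from the repository.

# Hypothetical example: allow only the Studio execution role to call
# SageMaker MLflow actions through this endpoint.
mlflow_endpoint.add_to_policy(
    iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        principals=[iam.ArnPrincipal(studio_execution_role.role_arn)],
        actions=["sagemaker-mlflow:*"],
        resources=["*"],
    )
)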
Users outside your VPC may otherwise be able to connect to SageMaker MLflow through a VPC endpoint to MLflow. You can add a restriction so that user access to SageMaker MLflow is allowed only from your VPC.
studio_execution_role.attach_inline_policy(
    iam.Policy(self, "mlflow-policy",
        statements=[
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=["sagemaker-mlflow:*"],
                resources=["*"],
                conditions={"StringEquals": {"aws:SourceVpc": vpc.vpc_id}}
            )
        ]
    )
)
If the deployment succeeds, you should see the new VPC with no internet access in the Amazon VPC console, as shown in the following screenshot.
A CodeArtifact domain and a CodeArtifact repository with an external connection to PyPI are also created, so that SageMaker can use them to download the required packages without internet access, as shown in the following image. You can verify the domain and repository creation in the CodeArtifact console: in the navigation pane, choose Repositories under Artifacts, and you will see the repository pip.
Experiment with ML using MLflow
Setup
After the CDK stack is created, a new SageMaker domain and user profile are also created. Launch Amazon SageMaker Studio and create a JupyterLab space. In the JupyterLab space, select the ml.t3.medium instance type and an image containing SageMaker Distribution 2.1.0.
To verify that your SageMaker environment has no internet connectivity, open the JupyterLab space and check for internet access by running a curl command in the terminal; in this private VPC, the request should fail.
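For example, you can run an equivalent check from a notebook cell; a minimal sketch (the URL is just an example):

import urllib.request

# In a VPC with no internet access, this request should time out or fail.
try:
    urllib.request.urlopen("https://aws.amazon.com", timeout=5)
    print("Internet access is available")
except Exception as err:
    print(f"No internet access: {err}")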
SageMaker with MLflow now supports MLflow version 2.16.2, which accelerates generative AI and ML workflows from experimentation to production. An MLflow 2.16.2 tracking server is created with the CDK stack.
You can find the MLflow tracking server Amazon Resource Name (ARN) in the CDK output, or in the SageMaker Studio UI by choosing the MLflow icon, as shown in the following image. You can copy the MLflow tracking server ARN by choosing the Copy button next to mlflow-server.
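As a quick connectivity check, here is a minimal sketch of pointing the MLflow client at the tracking server, assuming the mlflow and sagemaker-mlflow packages are installed; the ARN below is a placeholder for the value from your CDK output:

import mlflow

# Placeholder ARN; replace with the mlflow-server ARN from the CDK output.
tracking_server_arn = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<name>"
mlflow.set_tracking_uri(tracking_server_arn)

# If the private connection works, this call returns without a network error.
print(mlflow.search_experiments())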
As a sample dataset for training the model, download the reference dataset from the public UC Irvine ML repository to your local PC and name it predictive_maintenance_raw_data_header.csv.
Upload the reference dataset from your local PC to JupyterLab Space, as shown in the following image.
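Once uploaded, the dataset can be read into a pandas DataFrame. A minimal sketch, assuming the CSV sits next to the notebook; df and input_data_path are used by the preprocessing job later:

import pandas as pd

# Assumed location: the CSV uploaded into the JupyterLab space next to the notebook.
input_data_path = "predictive_maintenance_raw_data_header.csv"
df = pd.read_csv(input_data_path)
print(df.shape)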
To test your private connection to the MLflow tracking server, you can download the sample notebook that was automatically uploaded during stack creation to a bucket in your AWS account. You can find the S3 bucket name in the CDK output, as shown in the following image.
Run the following command from the JupyterLab app’s terminal.
aws s3 cp --recursive <YOUR-BUCKET-URI> ./
Now you can open the private-mlflow.ipynb notebook.
The first cell retrieves credentials for the CodeArtifact PyPI repository, so that SageMaker can use pip with the private AWS CodeArtifact repository. The credentials expire after 12 hours, so be sure to log in again once they expire.
%%bash
AWS_ACCOUNT=$(aws sts get-caller-identity --output text --query 'Account')
aws codeartifact login --tool pip --repository pip --domain code-artifact-domain --domain-owner ${AWS_ACCOUNT} --region ${AWS_DEFAULT_REGION}
Experiment
Once setup is complete, you can start experimenting. This scenario uses the XGBoost algorithm to train a binary classification model. Both the data processing and model training jobs use the @remote decorator, so they run in your private VPC, in the private subnets and security group associated with SageMaker.
In this case, the @remote decorator retrieves parameter values from the SageMaker configuration file (config.yaml). These parameters are used for the data processing and training jobs. We define the private subnets and the security group associated with SageMaker in the configuration file. For a complete list of configurations supported by the @remote decorator, see Configuration file in the SageMaker Developer Guide.
Note that we specify the aws codeartifact login command in PreExecutionCommands to point SageMaker to the private CodeArtifact repository. This is required to make sure the dependencies can be installed at runtime. Alternatively, you can pass a reference to a container in Amazon ECR through ImageUri, which includes all installed dependencies. We specify the security group and subnet information in VpcConfig.
config_yaml = f"""
SchemaVersion: '1.0'
SageMaker:
PythonSDK:
Modules:
TelemetryOptOut: true
RemoteFunction:
# role arn is not required if in SageMaker Notebook instance or SageMaker Studio
# Uncomment the following line and replace with the right execution role if in a local IDE
# RoleArn: <replace the role arn here>
# ImageUri: <replace with your image if you want to avoid installing dependencies at run time>
S3RootUri: s3://{bucket_prefix}
InstanceType: ml.m5.xlarge
Dependencies: ./requirements.txt
IncludeLocalWorkDir: true
PreExecutionCommands:
- "aws codeartifact login --tool pip --repository pip --domain code-artifact-domain --domain-owner {account_id} --region {region}"
CustomFileFilter:
IgnoreNamePatterns:
- "data/*"
- "models/*"
- "*.ipynb"
- "__pycache__"
VpcConfig:
SecurityGroupIds:
- {security_group_id}
Subnets:
- {private_subnet_id_1}
- {private_subnet_id_2}
"""
Here's how we set up the MLflow experiment.
from time import gmtime, strftime
# Mlflow (replace these values with your own, if needed)
project_prefix = project_prefix
tracking_server_arn = mlflow_arn
experiment_name = f"{project_prefix}-sm-private-experiment"
run_name=f"run-{strftime('%d-%H-%M-%S', gmtime())}"
Data preprocessing
During data processing, we use the @remote decorator to link the parameters in config.yaml to the preprocess function. MLflow tracking starts with the mlflow.start_run() API. The mlflow.autolog() API automatically logs information such as metrics, parameters, and artifacts. You can use the log_input() method to record a dataset in the MLflow artifact store.
import os

import joblib
import mlflow
from sagemaker.remote_function import remote

@remote(keep_alive_period_in_seconds=3600, job_name_prefix=f"{project_prefix}-sm-private-preprocess")
def preprocess(df, df_source: str, experiment_name: str):
    mlflow.set_tracking_uri(tracking_server_arn)
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name="Preprocessing") as run:
        mlflow.autolog()
        columns = ['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Machine failure']
        cat_columns = ['Type']
        num_columns = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
        target_column = 'Machine failure'
        df = df[columns]
        mlflow.log_input(
            mlflow.data.from_pandas(df, df_source, targets=target_column),
            context="DataPreprocessing",
        )
        ...
        model_file_path = "/opt/ml/model/sklearn_model.joblib"
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        joblib.dump(featurizer_model, model_file_path)
    return X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model
Run the preprocessing job and go to the MLflow UI (see the image below) to see the tracked preprocessing job with the input dataset.
X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model = preprocess(df=df,
df_source=input_data_path,
experiment_name=experiment_name)
You can open the MLflow UI from SageMaker Studio, as shown in the following image. Click Experiments from the navigation pane and select your experiment.
From the MLflow UI, you can see the processing job that just ran.
You can also check the security details in the corresponding training job in the SageMaker Studio console, as shown in the following image.
Training the model
Similar to the data processing job, we also use the @remote decorator with the training job. Note that the mlflow.log_metric() method sends the defined metrics to the MLflow tracking server.
import numpy as np
import pandas as pd
import xgboost
from sklearn.metrics import accuracy_score, precision_score, recall_score

@remote(keep_alive_period_in_seconds=3600, job_name_prefix=f"{project_prefix}-sm-private-train")
def train(X_train, y_train, X_val, y_val,
eta=0.1,
max_depth=2,
gamma=0.0,
min_child_weight=1,
verbosity=0,
objective="binary:logistic",
eval_metric="auc",
num_boost_round=5):
mlflow.set_tracking_uri(tracking_server_arn)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name=f"Training") as run:
mlflow.autolog()
# Creating DMatrix(es)
dtrain = xgboost.DMatrix(X_train, label=y_train)
dval = xgboost.DMatrix(X_val, label=y_val)
watchlist = [(dtrain, "train"), (dval, "validation")]
print('')
print (f'===Starting training with max_depth {max_depth}===')
param_dist = {
"max_depth": max_depth,
"eta": eta,
"gamma": gamma,
"min_child_weight": min_child_weight,
"verbosity": verbosity,
"objective": objective,
"eval_metric": eval_metric
}
xgb = xgboost.train(
params=param_dist,
dtrain=dtrain,
evals=watchlist,
num_boost_round=num_boost_round)
predictions = xgb.predict(dval)
print ("Metrics for validation set")
print('')
print (pd.crosstab(index=y_val, columns=np.round(predictions),
    rownames=['Actuals'], colnames=['Predictions'], margins=True))
rounded_predict = np.round(predictions)
val_accuracy = accuracy_score(y_val, rounded_predict)
val_precision = precision_score(y_val, rounded_predict)
val_recall = recall_score(y_val, rounded_predict)
# Log additional metrics, next to the default ones logged automatically
mlflow.log_metric("Accuracy Model A", val_accuracy * 100.0)
mlflow.log_metric("Precision Model A", val_precision)
mlflow.log_metric("Recall Model A", val_recall)
from sklearn.metrics import roc_auc_score
val_auc = roc_auc_score(y_val, predictions)
mlflow.log_metric("Validation AUC A", val_auc)
model_file_path="/opt/ml/model/xgboost_model.bin"
os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
xgb.save_model(model_file_path)
return xgb
Define hyperparameters and run the training job.
eta=0.3
max_depth=10
booster = train(X_train, y_train, X_val, y_val,
eta=eta,
max_depth=max_depth)
In the MLflow UI, you can see the tracked metrics, as shown in the following image. On the Experiments tab, navigate to the Training run of your experiment; the metrics are shown under the Overview tab.
You can also view the metrics as graphs. Under the Model metrics tab, you can see the model performance metrics that were configured as part of the training job log.
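Beyond the UI, runs can also be compared programmatically. A minimal sketch, assuming tracking_server_arn and experiment_name are defined as in the earlier cells:

import mlflow

mlflow.set_tracking_uri(tracking_server_arn)
# Returns a pandas DataFrame with one row per run, including logged metrics.
runs = mlflow.search_runs(experiment_names=[experiment_name], order_by=["start_time DESC"])
print(runs[["run_id", "status", "start_time"]].head())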
MLflow lets you log dataset information along with other key metrics, such as hyperparameters and model evaluation results. For more information, see the blog post on LLM experimentation with MLflow.
Clean up
To clean up, first delete all spaces and applications created within the SageMaker Studio domain. Then run the following command to destroy the infrastructure that was created.
cdk destroy
Conclusion
SageMaker with MLflow lets ML practitioners create, manage, analyze, and compare ML experiments on AWS. For added security, SageMaker with MLflow now supports AWS PrivateLink for all MLflow tracking server versions, including 2.16.2. This integration enables secure communication between your ML environment and AWS services without exposing data to the public internet.
As an additional layer of security, you can set up SageMaker Studio within a private VPC without internet access and run your ML experiments in this environment.
SageMaker with MLflow now supports MLflow 2.16.2. Setting up a new installation provides the best experience and full compatibility with the latest features.
About the authors
Xiaoyu Xin is a Solutions Architect at AWS. She is driven by a deep passion for artificial intelligence (AI) and machine learning (ML). She strives to bridge the gap between these cutting-edge technologies and a broader audience, making it easier for individuals from diverse backgrounds to learn and use AI and ML. She helps customers adopt AI and ML solutions on AWS in a secure and responsible manner.
Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in telecommunications engineering and has experience in software engineering. He is passionate about machine learning and is currently focused on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Tomer Shenhar is a Product Manager at AWS. He specializes in responsible AI, driven by a passion for developing ethically sound and transparent AI solutions.