Building large-scale deployment pipelines for generative artificial intelligence (AI) applications is a formidable challenge because of the complexity and unique requirements of these systems. Generative AI models are constantly evolving, with new versions and updates released frequently, so managing and deploying these updates across a large-scale deployment pipeline while maintaining consistency and minimizing downtime is a daunting task. Generative AI applications also require continuous ingestion, preprocessing, and formatting of vast amounts of data from various sources, and building a robust data pipeline that handles this workload reliably and efficiently at scale is a considerable challenge. Finally, monitoring the performance, bias, and ethical impact of generative AI models in production is a non-trivial task.
Achieving this at scale requires significant investment in resources, expertise, and cross-functional collaboration across multiple personas: data scientists and machine learning (ML) developers who focus on developing ML models, and machine learning operations (MLOps) engineers who focus on unique aspects of AI/ML projects and help improve delivery times, reduce defects, and increase data science productivity. In this post, we show how to convert Python code that fine-tunes generative AI models in Amazon Bedrock from local files into a reusable workflow using Amazon SageMaker Pipelines decorators. Amazon SageMaker model building pipelines enable collaboration across multiple AI/ML teams.
SageMaker Pipelines
SageMaker Pipelines allows you to define and orchestrate various steps involved in the ML lifecycle, including data pre-processing, model training, evaluation, and deployment. This streamlines the process and ensures consistency across various stages of the pipeline. SageMaker Pipelines can handle model versioning and lineage tracking. It automatically tracks model artifacts, hyperparameters, and metadata, and helps you reproduce and audit model versions.
The SageMaker Pipelines decorator feature helps you transform local ML code written as a Python program into one or more pipeline steps. Because Amazon Bedrock is accessible as an API, developers who are not familiar with Amazon SageMaker can write regular Python programs to implement or fine-tune Amazon Bedrock applications.
ML functions are written just as they would be for any other ML project. After testing a function locally or as a training job, a data scientist or practitioner who is a SageMaker expert can convert it into a SageMaker pipeline step by adding the @step decorator to the function.
Solution overview
SageMaker model building pipelines is a tool for building ML pipelines that takes advantage of direct SageMaker integration, so you can create orchestration pipelines using tooling that handles much of the creation and management of the steps for you.
As you move from the pilot and testing phase to deploying large-scale generative AI models, you need to apply DevOps practices to your ML workloads. SageMaker Pipelines is integrated with SageMaker, so you don’t need to interact with other AWS services. You also don’t need to manage resources because SageMaker Pipelines is a fully managed service; SageMaker Pipelines creates and manages the resources for you. Amazon SageMaker Studio provides an environment for managing the end-to-end SageMaker Pipelines experience. The solution in this post shows how to convert Python code written to preprocess, fine-tune, and test large language models (LLMs) using the Amazon Bedrock API into a SageMaker Pipeline to improve operational efficiency for ML.
The solution has three main steps:
- Write Python code to preprocess, train, and test an LLM in Amazon Bedrock.
- Add @step decorated functions to convert the Python code into a SageMaker pipeline.
- Create and run the SageMaker pipeline.
The following diagram illustrates the solution workflow:
Prerequisites
If you just want to view the notebook code, you can view the notebook on GitHub.
If you are new to AWS, you must first create and configure an AWS account. Then, configure SageMaker Studio in your AWS account. Create a JupyterLab space in SageMaker Studio and run a JupyterLab application.
Once you are in your SageMaker Studio JupyterLab space, complete the following steps:
- On the File menu, select New, then Terminal to open a new terminal.
- Enter the following code in the terminal:
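The next step shows a folder named amazon-sagemaker-examples appearing in the File Explorer, so the terminal command is presumably a clone of the public examples repository:

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
```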
- A folder called amazon-sagemaker-examples appears in the File Explorer pane of SageMaker Studio.
- Open the folder amazon-sagemaker-examples/sagemaker-pipelines/step-decorator/bedrock-examples.
- Open the notebook fine_tune_bedrock_step_decorator.ipynb.
This notebook contains all the code for this post and can be run from start to finish.
Explanation of the notebook code
The notebook uses the default Amazon Simple Storage Service (Amazon S3) bucket for the user. The default S3 bucket follows the naming pattern s3://sagemaker-{Region}-{your-account-id}. If it doesn't exist yet, it is created automatically.
The notebook uses the default AWS Identity and Access Management (IAM) role for the SageMaker Studio user. If your SageMaker Studio user role does not have administrator access, you must add the required permissions to the role.

Create a SageMaker session and get the default S3 bucket and IAM role:
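A minimal sketch of that setup (the notebook on GitHub has the exact code):

```python
import sagemaker

# Create a SageMaker session and look up the account defaults.
session = sagemaker.Session()
bucket = session.default_bucket()      # s3://sagemaker-{Region}-{account-id}
role = sagemaker.get_execution_role()  # IAM role of the Studio user
region = session.boto_region_name
```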
Preprocessing, training, and testing an LLM in Amazon Bedrock using Python
First, we need to download the data and prepare the LLM in Amazon Bedrock, which we will do using Python.
Load the data
To fine-tune our model, we use the CNN/DailyMail dataset from Hugging Face. The CNN/DailyMail dataset is an English language dataset that contains over 300,000 unique news articles written by CNN and Daily Mail journalists. The raw dataset contains articles and their summaries for training, validation, and testing. Before using the dataset, we need to format it to include prompts. See the following code:
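A sketch of the loading and prompt-formatting step, assuming the Hugging Face datasets library and a hypothetical prompt template (the notebook may word the prompt differently):

```python
from datasets import load_dataset

# Load the CNN/DailyMail dataset from Hugging Face.
dataset = load_dataset("cnn_dailymail", "3.0.0")

def add_prompt(example):
    # Wrap each article in a summarization prompt; the exact wording
    # here is illustrative.
    example["prompt"] = (
        "Summarize the following news article:\n\n"
        f"{example['article']}\n\nSummary:"
    )
    return example

dataset = dataset.map(add_prompt)
```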
Split the data
Split the dataset into training, validation, and testing. This post limits the size of each row to 3,000 words and chooses 100 rows for training, 10 rows for validation, and 5 rows for testing. For more information, see the notebook on GitHub.
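One way to implement that split, continuing the sketch above; the word limit and row counts match the post, while the filtering logic is an assumption:

```python
# Keep rows whose article is at most 3,000 words, then take small slices.
def short_enough(example):
    return len(example["article"].split()) <= 3000

train = dataset["train"].filter(short_enough).select(range(100))
validation = dataset["validation"].filter(short_enough).select(range(10))
test = dataset["test"].filter(short_enough).select(range(5))
```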
Upload data to Amazon S3
Then, convert the data into JSONL format and upload the training, validation, and test files to Amazon S3.
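A sketch of that conversion and upload; the prompt/completion key names follow the Amazon Bedrock fine-tuning data format, and the S3 prefix is hypothetical:

```python
import json
from sagemaker.s3 import S3Uploader

def to_jsonl(split, path):
    # Bedrock fine-tuning expects one JSON object per line with
    # "prompt" and "completion" fields.
    with open(path, "w") as f:
        for row in split:
            f.write(json.dumps({"prompt": row["prompt"],
                                "completion": row["highlights"]}) + "\n")

to_jsonl(train, "train.jsonl")
to_jsonl(validation, "validation.jsonl")
to_jsonl(test, "test.jsonl")

s3_prefix = f"s3://{bucket}/bedrock-finetune"  # hypothetical prefix
train_s3 = S3Uploader.upload("train.jsonl", s3_prefix)
validation_s3 = S3Uploader.upload("validation.jsonl", s3_prefix)
test_s3 = S3Uploader.upload("test.jsonl", s3_prefix)
```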
Train the model
Now that the training data is uploaded to Amazon S3, we will fine-tune the Amazon Bedrock model using the CNN/DailyMail dataset. For the summarization use case, we will fine-tune the Amazon Titan Text Lite model provided by Amazon Bedrock. We will define the hyperparameters for fine-tuning and start the training job.
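A sketch using the boto3 Bedrock API; the job and model names are hypothetical, and the hyperparameter values and the exact base model identifier should be checked against the Titan customization documentation:

```python
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="titan-lite-summarization-ft",       # hypothetical name
    customModelName="titan-lite-cnn-dailymail",  # hypothetical name
    roleArn=role,  # IAM role with Bedrock and S3 permissions
    baseModelIdentifier="amazon.titan-text-lite-v1",
    hyperParameters={
        "epochCount": "1",
        "batchSize": "1",
        "learningRate": "0.00003",
    },
    trainingDataConfig={"s3Uri": train_s3},
    validationDataConfig={"validators": [{"s3Uri": validation_s3}]},
    outputDataConfig={"s3Uri": f"{s3_prefix}/output"},
)
job_arn = response["jobArn"]
```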
Create Provisioned Throughput

Throughput refers to the number and rate of inputs and outputs that a model processes and returns. On-demand throughput can have variable performance; to provision dedicated resources instead, you purchase Provisioned Throughput. For customized models, you must purchase and use Provisioned Throughput. For more information, see Amazon Bedrock Provisioned Throughput.
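A sketch of purchasing no-commitment Provisioned Throughput for the custom model, continuing the sketch above; retrieving the custom model ARN from the completed job is assumed:

```python
# custom_model_arn would come from the finished customization job
# (outputModelArn in bedrock.get_model_customization_job).
pt = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName="titan-lite-cnn-dailymail-pt",  # hypothetical
    modelId=custom_model_arn,
)
provisioned_model_id = pt["provisionedModelArn"]
```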
Test the model
Now we invoke and test the model through the Amazon Bedrock runtime, using the prompts from the test dataset, the Provisioned Throughput ID we created in the previous step, and the inference parameters maxTokenCount, stopSequences, temperature, and topP:
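A sketch of that invocation, continuing the sketches above and using the Amazon Titan text request format; the parameter values are illustrative:

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

body = json.dumps({
    "inputText": test[0]["prompt"],
    "textGenerationConfig": {
        "maxTokenCount": 512,
        "stopSequences": [],
        "temperature": 0.1,
        "topP": 0.9,
    },
})

# Invoke the fine-tuned model through its Provisioned Throughput ARN.
response = bedrock_runtime.invoke_model(
    modelId=provisioned_model_id,
    body=body,
    accept="application/json",
    contentType="application/json",
)
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```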
Convert the Python functions into SageMaker pipeline steps

The @step decorator converts local ML code into one or more pipeline steps. You write ML functions just as you would for any other ML project, then create a pipeline by converting the Python functions into pipeline steps with @step, using the dependencies between those functions to define the pipeline graph, or directed acyclic graph (DAG), and passing the leaf nodes of that graph as a list of steps to the pipeline.

To create a step from a function, annotate it with @step. When the function is invoked, it receives the DelayedReturn output of the previous pipeline step as input. The DelayedReturn instance holds information about all the previous steps defined in the function, which together form the SageMaker pipeline DAG.
The notebook already adds the @step decorator at the beginning of each function definition in the cell where the function is defined, as shown in the following sketch. The code for these functions comes from the fine-tuning Python program that we are now converting into a SageMaker pipeline.
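An illustrative sketch with hypothetical function names and placeholder bodies; the notebook's functions take and return more arguments:

```python
from sagemaker.workflow.function_step import step

@step(name="data-load")
def data_load(ds_name: str) -> str:
    # Download the dataset, format prompts, write JSONL, upload to S3.
    train_s3_uri = f"s3://my-bucket/{ds_name}/train.jsonl"  # placeholder
    return train_s3_uri

@step(name="fine-tune")
def fine_tune(train_s3_uri: str) -> str:
    # Call bedrock.create_model_customization_job(...) and poll until done.
    custom_model_arn = "arn:aws:bedrock:..."  # placeholder
    return custom_model_arn
```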
Create and run a SageMaker pipeline
To bring it all together, we connect the defined @step functions into a multi-step pipeline and then submit the pipeline for execution.
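A sketch of that wiring, continuing the hypothetical functions above:

```python
from sagemaker.workflow.pipeline import Pipeline

# Each call returns a DelayedReturn; passing one into the next function
# creates the dependency edges of the DAG.
train_uri = data_load("cnn_dailymail")
model_arn = fine_tune(train_uri)

# Only the leaf node needs to be passed; upstream steps are inferred.
pipeline = Pipeline(name="bedrock-fine-tune-pipeline", steps=[model_arn])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # run it
```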
After the pipeline runs, you can list the pipeline's steps to retrieve the full set of results:
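For example, assuming the execution handle from the sketch above:

```python
# Wait for the run to finish, then inspect each step and its status.
execution.wait()
for s in execution.list_steps():
    print(s["StepName"], s["StepStatus"])
```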
You can trace the lineage of a SageMaker ML pipeline in SageMaker Studio. Lineage tracing in SageMaker Studio revolves around DAGs, which represent the steps in a pipeline. From the DAG, you can trace the lineage from any step to any other step. The following diagram shows the steps in an Amazon Bedrock fine-tuning pipeline. For more information, see Viewing a Pipeline Execution.
You can focus on a specific part of the graph by choosing a step on the Select a step drop-down menu. Detailed logs for each step of your pipeline are available in Amazon CloudWatch Logs.
Clean up
To clean up and avoid incurring charges, follow the detailed cleanup instructions in the GitHub repository and remove the following:
- The Amazon Bedrock Provisioned Throughput
- The custom model
- The SageMaker pipeline
- The Amazon S3 objects that store the fine-tuning datasets
Conclusion
MLOps focuses on streamlining, automating, and monitoring the entire lifecycle of ML models. Building a robust MLOps pipeline requires cross-functional collaboration. Data scientists, ML engineers, IT staff, and DevOps teams need to work together to operationalize models from research to deployment and maintenance. SageMaker Pipelines allows you to create and manage ML workflows while providing storage and reuse capabilities for workflow steps.
This post walked through an example of using the SageMaker @step decorator to convert a Python program into a SageMaker pipeline that creates a custom Amazon Bedrock model. With SageMaker Pipelines, you get the benefit of automated workflows that you can configure to run on a schedule based on your model retraining requirements. You can also use SageMaker Pipelines to add useful features such as lineage tracking and the ability to manage and visualize the entire workflow from within the SageMaker Studio environment.
AWS offers managed ML solutions, such as Amazon Bedrock and SageMaker, that can help you deploy and serve existing off-the-shelf foundation models or create and run your own custom models.
About the Authors
Neel Sendas is a Principal Technical Account Manager at Amazon Web Services. He works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on a variety of ML use cases, ranging from anomaly detection to product quality prediction for manufacturing and logistics optimization. When not supporting customers, Neel enjoys golfing and salsa dancing.
Ashish Rawat is a Senior AI/ML Specialist Solutions Architect at Amazon Web Services, based in Atlanta, GA. Ashish has extensive experience in enterprise IT architecture and software development, including AI/ML and generative AI. He is committed to guiding customers to solve complex business challenges and create competitive advantage using AWS AI/ML services.