Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning them for specific domains or tasks, while keeping control over costs and deployments. However, deployment is a significant part of the effort, often requiring 30% of project time, because engineers need to carefully optimize instance types and configure serving parameters through careful testing. This process can be complex and time-consuming, and it requires specialized knowledge and iterative testing to achieve the desired performance.
Amazon Bedrock Custom Model Import simplifies custom model deployment by providing a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment. This makes sure that deployments are performant and cost-effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: if the model receives no invocations for 5 minutes, it scales to zero, and you pay only for what you use, in 5-minute increments. It also handles scale-up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations that want to use custom models on Amazon Bedrock, providing simplicity and cost-effectiveness.
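As a quick illustration of the invocation API, the following is a minimal sketch of calling an imported model directly with boto3. The model ARN is a placeholder, and the request body assumes a Llama-style schema with a prompt field; the exact request and response schemas depend on the architecture of the imported model. The rest of this post uses LiteLLM, which handles this formatting for you.

import json
import boto3

# Placeholder ARN of an imported custom model (replace with your own)
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"

client = boto3.client("bedrock-runtime")

# Assumed Llama-style request body for a model deployed with Custom Model Import
body = {
    "prompt": "What is the capital of France?",
    "max_tokens": 256,
    "temperature": 0.6,
}

response = client.invoke_model(
    modelId=model_arn,
    body=json.dumps(body),
    contentType="application/json",
    accept="application/json",
)

# Print the raw JSON response; the exact fields depend on the imported model's architecture
print(json.loads(response["body"].read()))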
It is important to benchmark these models before moving them into production. Benchmarking tools can help you proactively detect potential production issues such as throttling and verify that the deployment can handle the expected production load.
This post begins a blog series exploring DeepSeek and open FMs with Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using the popular open source tools LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions for deploying a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.
Prerequisites
This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions in Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.
Use open source tools LLMPerf and LiteLLM for performance benchmarking
To run performance benchmarks, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is its wide support of foundation model APIs, including LiteLLM, which supports all models available on Amazon Bedrock.
Set up your custom model invocation with LiteLLM
LiteLLM is a versatile open source tool that can be used both as a Python SDK and as a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and the Converse API, and FMs available on Amazon Bedrock, including imported custom models.
To invoke a custom model with LiteLLM, you use the model parameter (see LiteLLM’s Amazon Bedrock documentation). This is a string that follows the bedrock/provider_route/model_arn format:
- provider_route: Indicates which LiteLLM implementation of the request/response specification to use. DeepSeek R1 models can be invoked using their custom chat template with the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.
- model_arn: The Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
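As a sketch of how these two pieces fit together, the following snippet looks up an imported model ARN with the boto3 ListImportedModels API and builds the LiteLLM model string. The model name filter ("DeepSeek") is a placeholder, and the response field names are assumed to follow the boto3 Bedrock client’s documented shape.

import boto3

bedrock = boto3.client("bedrock")

# List imported custom models and pick the one to benchmark
# ("DeepSeek" is a placeholder filter; adjust to your model name)
models = bedrock.list_imported_models()
model_arn = next(
    m["modelArn"]
    for m in models["modelSummaries"]
    if "DeepSeek" in m["modelName"]
)

# Build the LiteLLM model string: bedrock/<provider_route>/<model_arn>
litellm_model = f"bedrock/deepseek_r1/{model_arn}"
print(litellm_model)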
For example, the following script invokes a custom model using a DeepSeek R1 chat template:
import time
from litellm import completion

# model_id is the ARN of the imported model (placeholder; set it to your model ARN)
model_id = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"

while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M
Calculate the company's operating margin for 2023. Please reason step by step."""},
                      {"role": "assistant", "content": "<think>"}],
            max_tokens=4096,
        )
        print(response.choices[0].message.content)
        break
    except Exception:
        # Retry while the imported model scales up from zero (cold start)
        time.sleep(60)
After you have validated the invocation parameters for the imported model, you can configure LLMPerf for the benchmark.
Configure a token benchmark test in LLMPerf
To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to the model invocation API. These clients are implemented as actors that run in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray clients, allowing for the simulation of various load scenarios and concurrent request patterns. Meanwhile, each client collects performance metrics during the requests, such as latency, throughput, and error rates.
Two important performance metrics are latency and throughput:
- Latency refers to the amount of time it takes for a single request to be processed.
- Throughput measures the number of tokens generated per second.
Choosing the right configuration to serve an FM typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific workload requirements. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. While Amazon Bedrock Custom Model Import simplifies this by offering a pre-optimized serving configuration, it is still important to verify the latency and throughput of your deployment.
To get started, use token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmark test. In the script, you can define parameters such as:
- LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
- Model: Define the provider route, API, and model ARN to invoke, as shown in the previous section.
- Mean and standard deviation of input tokens: Parameters of the probability distribution from which the number of input tokens is sampled.
- Mean and standard deviation of output tokens: Parameters of the probability distribution from which the number of output tokens is sampled.
- Number of concurrent requests: The number of users that the application is likely to support when in use.
- Number of completed requests: The total number of requests to send to the LLM API in the test.
The following script shows an example of how to invoke the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmark test.
python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \\
--model "bedrock/llama/{model_id}" \\
--mean-input-tokens {mean_input_tokens} \\
--stddev-input-tokens {stddev_input_tokens} \\
--mean-output-tokens {mean_output_tokens} \\
--stddev-output-tokens {stddev_output_tokens} \\
--max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \\
--timeout 1800 \\
--num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \\
--results-dir "${{LLM_PERF_OUTPUT}}" \\
--llm-api litellm \\
--additional-sampling-params '{{}}'
When the tests are complete, LLMPerf outputs two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
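As a rough sketch of how to work with these outputs, the following snippet loads both files from the results directory. The file name suffixes (summary.json and individual_responses.json) are assumptions based on LLMPerf’s typical output naming, so adjust the glob patterns to match the files in your results directory.

import glob
import json

results_dir = "llmperf-results"  # placeholder: the directory passed via --results-dir

# Aggregate metrics (quantiles, means, error counts) for the whole test
with open(glob.glob(f"{results_dir}/*summary.json")[0]) as f:
    summary = json.load(f)

# One entry per request, with per-request latency and throughput metrics
with open(glob.glob(f"{results_dir}/*individual_responses.json")[0]) as f:
    individual = json.load(f)

print("Number of individual requests:", len(individual))
print("Aggregate metric keys:", list(summary.keys()))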
Scale down to zero and cold start latency
One thing to remember is that Amazon Bedrock Custom Model Import scales down to zero when the model is not in use, so you first need to make a request to make sure there is at least one active model copy. If you get an error indicating that the model is not ready, you must wait approximately 10 seconds and up to 1 minute for Amazon Bedrock to prepare at least one active model copy. When it is ready, run a test invocation again and proceed with the benchmark.
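A minimal sketch of such a warm-up request with boto3 is shown below. It assumes the not-ready condition surfaces as a ModelNotReadyException error code and uses a placeholder model ARN and Llama-style request body; the retry interval is illustrative.

import json
import time
import boto3
from botocore.exceptions import ClientError

client = boto3.client("bedrock-runtime")
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"  # placeholder

def warm_up(max_wait_s=120):
    """Send a small request until at least one model copy is active."""
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        try:
            client.invoke_model(
                modelId=model_arn,
                body=json.dumps({"prompt": "ping", "max_tokens": 8}),
            )
            return True
        except ClientError as err:
            # Assumed error code returned while the model scales up from zero
            if err.response["Error"]["Code"] == "ModelNotReadyException":
                time.sleep(10)
            else:
                raise
    return False

warm_up()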
Example scenario for DeepSeek-R1-Distill-Llama-8B
Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust the token count parameters for prompts and completions. For example:
- Number of clients: 2
- Average input token count: 500
- Standard deviation input token count: 25
- Average output token count: 1000
- Standard deviation output token count: 100
- Number of requests per client: 50
This example test takes approximately 8 minutes to complete. At the end of the test, you will get a summary of the results for the aggregate metrics:
inter_token_latency_s
p25 = 0.010615988283217918
p50 = 0.010694698716183695
p75 = 0.010779359342088015
p90 = 0.010945443657517748
p95 = 0.01100556307365132
p99 = 0.011071086908721675
mean = 0.010710014800224604
min = 0.010364670612635254
max = 0.011485444453299149
stddev = 0.0001658793389904756
ttft_s
p25 = 0.3356793452499005
p50 = 0.3783651359990472
p75 = 0.41098671700046907
p90 = 0.46655246950049334
p95 = 0.4846706690498647
p99 = 0.6790834719300077
mean = 0.3837810468001226
min = 0.1878921090010408
max = 0.7590946710006392
stddev = 0.0828713133225014
end_to_end_latency_s
p25 = 9.885957818500174
p50 = 10.561580732000039
p75 = 11.271923759749825
p90 = 11.87688222009965
p95 = 12.139972019549713
p99 = 12.6071144856102
mean = 10.406450886010116
min = 2.6196457750011177
max = 12.626598834998731
stddev = 1.4681851822617253
request_output_throughput_token_per_s
p25 = 104.68609252502657
p50 = 107.24619111072519
p75 = 108.62997591951486
p90 = 110.90675007239598
p95 = 113.3896235445618
p99 = 116.6688412475626
mean = 107.12082450567561
min = 97.0053466021563
max = 129.40680882698936
stddev = 3.9748004356837137
number_input_tokens
p25 = 484.0
p50 = 500.0
p75 = 514.0
p90 = 531.2
p95 = 543.1
p99 = 569.1200000000001
mean = 499.06
min = 433
max = 581
stddev = 26.549294727074212
number_output_tokens
p25 = 1050.75
p50 = 1128.5
p75 = 1214.25
p90 = 1276.1000000000001
p95 = 1323.75
p99 = 1372.2
mean = 1113.51
min = 339
max = 1392
stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034
In addition to the summary, you will also get metrics for individual requests, which can be used to prepare detailed reports, such as the following histograms of time to first token and token throughput.
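A possible sketch for producing similar histograms from the per-request output is shown below. It assumes the individual responses file follows the naming used in the earlier snippet and that each entry exposes ttft_s and request_output_throughput_token_per_s fields, mirroring the metric names in the summary above.

import glob
import json
import matplotlib.pyplot as plt

results_dir = "llmperf-results"  # placeholder: the directory passed via --results-dir
with open(glob.glob(f"{results_dir}/*individual_responses.json")[0]) as f:
    individual = json.load(f)

# Per-request metrics; field names assumed to mirror the aggregate metric names above
ttft = [r["ttft_s"] for r in individual]
throughput = [r["request_output_throughput_token_per_s"] for r in individual]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ttft, bins=20)
ax1.set_title("Time to first token (s)")
ax2.hist(throughput, bins=20)
ax2.set_title("Token throughput (tokens/s)")
plt.tight_layout()
plt.show()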
Analyze performance results from LLMPerf and estimate costs using Amazon CloudWatch
LLMPerf gives you the ability to benchmark the performance of custom models served on Amazon Bedrock without having to inspect the serving properties and configuration details of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end-user experience of your application.
In addition, the benchmarking exercise can serve as a valuable tool for estimating costs. With Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot of the ModelCopy metric is shown in the figure below. This data can help you estimate costs, because billing is based on the number of active model copies at a given time.
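If you want to pull this data programmatically, the following is a minimal sketch using the CloudWatch GetMetricStatistics API. The ModelCopy metric and AWS/Bedrock namespace come from the paragraph above; the dimension name (ModelId) and the model ARN are assumptions for illustration, so check the metric’s dimensions in the CloudWatch console for your deployment.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # window covering the load test

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ModelId", "Value": model_arn}],  # assumed dimension name
    StartTime=start,
    EndTime=end,
    Period=300,  # 5-minute granularity, matching the billing increments
    Statistics=["Maximum"],
)

# Maximum number of active model copies in each 5-minute window
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])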
Conclusion
While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.
For more information, try the sample notebook with your custom model.
About the Authors
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect at AWS. He currently focuses on model serving and MLOps on Amazon SageMaker. Prior to this role, he worked as a machine learning engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, he enjoys spending time with his family and biking around the Bay Area.
Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He is passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant holds a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.