Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning them for specific domains or tasks, while keeping control over costs and deployments. However, deployment is a significant part of the effort, often requiring 30% of project time, because engineers need to carefully optimize instance types and configure serving parameters through careful testing. This process can be complex and time-consuming, and it requires specialized knowledge and iterative testing to achieve the desired performance.
Amazon Bedrock Custom Model Import simplifies custom model deployment by providing a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment. This makes sure that deployments are performant and cost-effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: if the model receives no invocations for 5 minutes, it scales to zero, and you pay only for what you use, in 5-minute increments. It also handles scale-up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations that want to use custom models on Amazon Bedrock, providing simplicity and cost-effectiveness.
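As a quick illustration of the invocation API, the following is a minimal sketch of calling an imported model directly with boto3. The model ARN is a placeholder, and the request body assumes a Llama-style schema with a prompt field; the exact request and response schemas depend on the architecture of the imported model. The rest of this post uses LiteLLM, which handles this formatting for you.

import json
import boto3

# Placeholder ARN of an imported custom model (replace with your own)
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"

client = boto3.client("bedrock-runtime")

# Assumed Llama-style request body for a model deployed with Custom Model Import
body = {
    "prompt": "What is the capital of France?",
    "max_tokens": 256,
    "temperature": 0.6,
}

response = client.invoke_model(
    modelId=model_arn,
    body=json.dumps(body),
    contentType="application/json",
    accept="application/json",
)

# Print the raw JSON response; the exact fields depend on the imported model's architecture
print(json.loads(response["body"].read()))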
It is important to benchmark these models before moving them into production. Benchmarking tools can help you proactively detect potential production issues such as throttling and verify that the deployment can handle the expected production load.
This post begins a blog series exploring DeepSeek and open FMs with Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using the popular open source tools LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions for deploying a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.
Prerequisites
This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions in Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.
Use open source tools LLMPerf and LiteLLM for performance benchmarking
To run performance benchmarks, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is its wide support of foundation model APIs, including LiteLLM, which supports all models available on Amazon Bedrock.
Set up your custom model invocation with LiteLLM
LiteLLM is a versatile open source tool that can be used both as a Python SDK and as a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and the Converse API, and FMs available on Amazon Bedrock, including imported custom models.
To invoke a custom model with LiteLLM, you use the model parameter (see LiteLLM’s Amazon Bedrock documentation). This is a string that follows the bedrock/provider_route/model_arn format:
- provider_route: Indicates which LiteLLM implementation of the request/response specification to use. DeepSeek R1 models can be invoked using their custom chat template with the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.
- model_arn: The Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
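As a sketch of how these two pieces fit together, the following snippet looks up an imported model ARN with the boto3 ListImportedModels API and builds the LiteLLM model string. The model name filter ("DeepSeek") is a placeholder, and the response field names are assumed to follow the boto3 Bedrock client’s documented shape.

import boto3

bedrock = boto3.client("bedrock")

# List imported custom models and pick the one to benchmark
# ("DeepSeek" is a placeholder filter; adjust to your model name)
models = bedrock.list_imported_models()
model_arn = next(
    m["modelArn"]
    for m in models["modelSummaries"]
    if "DeepSeek" in m["modelName"]
)

# Build the LiteLLM model string: bedrock/<provider_route>/<model_arn>
litellm_model = f"bedrock/deepseek_r1/{model_arn}"
print(litellm_model)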
For example, the following script invokes a custom model using a DeepSeek R1 chat template:
import time
from litellm import completion

# model_id is the ARN of the imported model (placeholder; set it to your model ARN)
model_id = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"

while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M
Calculate the company's operating margin for 2023. Please reason step by step."""},
                      {"role": "assistant", "content": "<think>"}],
            max_tokens=4096,
        )
        print(response.choices[0].message.content)
        break
    except Exception:
        # Retry while the imported model scales up from zero (cold start)
        time.sleep(60)
After you have validated the invocation parameters for the imported model, you can configure LLMPerf for the benchmark.
Configure a token benchmark test in LLMPerf
To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to the model invocation API. These clients are implemented as actors that run in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray clients, allowing for the simulation of various load scenarios and concurrent request patterns. Meanwhile, each client collects performance metrics during the requests, such as latency, throughput, and error rates.
Two important performance metrics are latency and throughput:
- Latency refers to the amount of time it takes for a single request to be processed.
- Throughput measures the number of tokens generated per second.
Choosing the right configuration to serve an FM typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific workload requirements. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. While Amazon Bedrock Custom Model Import simplifies this by offering a pre-optimized serving configuration, it is still important to verify the latency and throughput of your deployment.
To get started, use token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmark test. In the script, you can define parameters such as:
- LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
- Model: Define the provider route, API, and model ARN to invoke, as shown in the previous section.
- Mean and standard deviation of input tokens: Parameters of the probability distribution from which the number of input tokens is sampled.
- Mean and standard deviation of output tokens: Parameters of the probability distribution from which the number of output tokens is sampled.
- Number of concurrent requests: The number of users that the application is likely to support when in use.
- Number of completed requests: The total number of requests to send to the LLM API in the test.
The following script shows an example of how to invoke the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmark test.
python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \\
--model "bedrock/llama/{model_id}" \\
--mean-input-tokens {mean_input_tokens} \\
--stddev-input-tokens {stddev_input_tokens} \\
--mean-output-tokens {mean_output_tokens} \\
--stddev-output-tokens {stddev_output_tokens} \\
--max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \\
--timeout 1800 \\
--num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \\
--results-dir "${{LLM_PERF_OUTPUT}}" \\
--llm-api litellm \\
--additional-sampling-params '{{}}'
When the tests are complete, LLMPerf outputs two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
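As a rough sketch of how to work with these outputs, the following snippet loads both files from the results directory. The file name suffixes (summary.json and individual_responses.json) are assumptions based on LLMPerf’s typical output naming, so adjust the glob patterns to match the files in your results directory.

import glob
import json

results_dir = "llmperf-results"  # placeholder: the directory passed via --results-dir

# Aggregate metrics (quantiles, means, error counts) for the whole test
with open(glob.glob(f"{results_dir}/*summary.json")[0]) as f:
    summary = json.load(f)

# One entry per request, with per-request latency and throughput metrics
with open(glob.glob(f"{results_dir}/*individual_responses.json")[0]) as f:
    individual = json.load(f)

print("Number of individual requests:", len(individual))
print("Aggregate metric keys:", list(summary.keys()))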
Scale down to zero and cold start latency
One thing to remember is that Amazon Bedrock Custom Model Import scales down to zero when the model is not in use, so you first need to make a request to make sure there is at least one active model copy. If you get an error indicating that the model is not ready, you must wait approximately 10 seconds and up to 1 minute for Amazon Bedrock to prepare at least one active model copy. When it is ready, run a test invocation again and proceed with the benchmark.
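A minimal sketch of such a warm-up request with boto3 is shown below. It assumes the not-ready condition surfaces as a ModelNotReadyException error code and uses a placeholder model ARN and Llama-style request body; the retry interval is illustrative.

import json
import time
import boto3
from botocore.exceptions import ClientError

client = boto3.client("bedrock-runtime")
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"  # placeholder

def warm_up(max_wait_s=120):
    """Send a small request until at least one model copy is active."""
    deadline = time.time() + max_wait_s
    while time.time() < deadline:
        try:
            client.invoke_model(
                modelId=model_arn,
                body=json.dumps({"prompt": "ping", "max_tokens": 8}),
            )
            return True
        except ClientError as err:
            # Assumed error code returned while the model scales up from zero
            if err.response["Error"]["Code"] == "ModelNotReadyException":
                time.sleep(10)
            else:
                raise
    return False

warm_up()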
Example scenario for DeepSeek-R1-Distill-Llama-8B
Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust the token count parameters for prompts and completions. For example:
- Number of clients: 2
- Average input token count: 500
- Standard deviation input token count: 25
- Average output token count: 1000
- Standard deviation output token count: 100
- Number of requests per client: 50
This example test takes approximately 8 minutes to complete. At the end of the test, you will get a summary of the results for the aggregate metrics:
inter_token_latency_s
p25 = 0.010615988283217918
p50 = 0.010694698716183695
p75 = 0.010779359342088015
p90 = 0.010945443657517748
p95 = 0.01100556307365132
p99 = 0.011071086908721675
mean = 0.010710014800224604
min = 0.010364670612635254
max = 0.011485444453299149
stddev = 0.0001658793389904756
ttft_s
p25 = 0.3356793452499005
p50 = 0.3783651359990472
p75 = 0.41098671700046907
p90 = 0.46655246950049334
p95 = 0.4846706690498647
p99 = 0.6790834719300077
mean = 0.3837810468001226
min = 0.1878921090010408
max = 0.7590946710006392
stddev = 0.0828713133225014
end_to_end_latency_s
p25 = 9.885957818500174
p50 = 10.561580732000039
p75 = 11.271923759749825
p90 = 11.87688222009965
p95 = 12.139972019549713
p99 = 12.6071144856102
mean = 10.406450886010116
min = 2.6196457750011177
max = 12.626598834998731
stddev = 1.4681851822617253
request_output_throughput_token_per_s
p25 = 104.68609252502657
p50 = 107.24619111072519
p75 = 108.62997591951486
p90 = 110.90675007239598
p95 = 113.3896235445618
p99 = 116.6688412475626
mean = 107.12082450567561
min = 97.0053466021563
max = 129.40680882698936
stddev = 3.9748004356837137
number_input_tokens
p25 = 484.0
p50 = 500.0
p75 = 514.0
p90 = 531.2
p95 = 543.1
p99 = 569.1200000000001
mean = 499.06
min = 433
max = 581
stddev = 26.549294727074212
number_output_tokens
p25 = 1050.75
p50 = 1128.5
p75 = 1214.25
p90 = 1276.1000000000001
p95 = 1323.75
p99 = 1372.2
mean = 1113.51
min = 339
max = 1392
stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034
In addition to the summary, you will also get metrics for individual requests, which can be used to prepare detailed reports, such as the following histograms of time to first token and token throughput.
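A possible sketch for producing similar histograms from the per-request output is shown below. It assumes the individual responses file follows the naming used in the earlier snippet and that each entry exposes ttft_s and request_output_throughput_token_per_s fields, mirroring the metric names in the summary above.

import glob
import json
import matplotlib.pyplot as plt

results_dir = "llmperf-results"  # placeholder: the directory passed via --results-dir
with open(glob.glob(f"{results_dir}/*individual_responses.json")[0]) as f:
    individual = json.load(f)

# Per-request metrics; field names assumed to mirror the aggregate metric names above
ttft = [r["ttft_s"] for r in individual]
throughput = [r["request_output_throughput_token_per_s"] for r in individual]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ttft, bins=20)
ax1.set_title("Time to first token (s)")
ax2.hist(throughput, bins=20)
ax2.set_title("Token throughput (tokens/s)")
plt.tight_layout()
plt.show()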
Analyze performance results from LLMPerf and estimate costs using Amazon CloudWatch
LLMPerf gives you the ability to benchmark the performance of custom models served on Amazon Bedrock without having to inspect the serving properties and configuration details of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end-user experience of your application.
In addition, the benchmarking exercise can serve as a valuable tool for estimating costs. With Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot of the ModelCopy metric is shown in the figure below. This data can help you estimate costs, because billing is based on the number of active model copies at a given time.
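If you want to pull this data programmatically, the following is a minimal sketch using the CloudWatch GetMetricStatistics API. The ModelCopy metric and AWS/Bedrock namespace come from the paragraph above; the dimension name (ModelId) and the model ARN are assumptions for illustration, so check the metric’s dimensions in the CloudWatch console for your deployment.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/example"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # window covering the load test

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ModelId", "Value": model_arn}],  # assumed dimension name
    StartTime=start,
    EndTime=end,
    Period=300,  # 5-minute granularity, matching the billing increments
    Statistics=["Maximum"],
)

# Maximum number of active model copies in each 5-minute window
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])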
Conclusion
While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.
For more information, try the sample notebook with your custom model.
About the Authors
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect at AWS. He currently focuses on model serving and MLOps on Amazon SageMaker. Prior to this role, he worked as a machine learning engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, he enjoys spending time with his family and biking around the Bay Area.
Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He is passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant holds a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.