This post was written by Claudiu Bota, Oleg Yurchenko and VladySlav Melnyk of AWS partner Automat-It.
As organizations adopt AI and machine learning (ML), they use these technologies to improve processes and enhance their products. AI use cases include video analytics, market forecasting, fraud detection, and natural language processing, all of which rely on models that analyze data efficiently. These models achieve impressive accuracy with low latency, but they often require significant computational resources, including GPUs, to perform inference. Therefore, maintaining the right balance between performance and cost is essential, especially when deploying models at scale.
One of our customers encountered this exact challenge. To address it, they engaged Automat-It, an AWS Premier Tier Partner, to design and implement their platform on AWS, specifically using Amazon Elastic Kubernetes Service (Amazon EKS). Automat-It specializes in helping startups and scale-ups grow through hands-on cloud DevOps, MLOps, and FinOps services. The collaboration aimed to achieve scalability and performance while optimizing costs. The customer's platform requires highly accurate models with low latency, and the cost of such demanding workloads escalates quickly without proper optimization.
In this post, we explain how we helped this customer achieve a 12-fold cost reduction while keeping their AI model performance within the required thresholds. This was accomplished through careful tuning of the architecture, algorithm choice, and infrastructure management.
Customer challenges
Our customer specializes in developing AI models for video intelligence solutions using YOLOv8 and the Ultralytics library. An end-to-end YOLOv8 deployment consists of three stages (sketched in the code example after this list):
- Preprocessing – Prepares raw video frames through resizing, normalization, and format conversion
- Inference – The YOLOv8 model generates predictions by detecting and classifying objects in the preprocessed video frames
- Post-processing – Refines the predictions using techniques such as non-maximum suppression (NMS), filtering, and output formatting
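To make these stages concrete, here is a minimal Python sketch. It assumes the publicly available yolov8n.pt weights and uses illustrative file names and thresholds; note that the Ultralytics API normally bundles preprocessing and NMS inside a single predict call, so the explicit split below is only for clarity.

```python
# Minimal sketch of the three-stage pipeline using Ultralytics YOLOv8n.
# File names and thresholds are illustrative, not the customer's values.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano variant, as used in the tests below

def process_frame(frame):
    # Preprocessing: resize the raw frame to the model's expected input size
    resized = cv2.resize(frame, (640, 640))

    # Inference: detect and classify objects (Ultralytics also applies NMS here)
    results = model(resized, conf=0.25, iou=0.45, verbose=False)[0]

    # Post-processing: filter and format the predictions for downstream consumers
    return [
        {"class": model.names[int(cls)], "confidence": float(conf), "box": box.tolist()}
        for box, conf, cls in zip(results.boxes.xyxy, results.boxes.conf, results.boxes.cls)
    ]

frame = cv2.imread("sample_frame.jpg")
print(process_frame(frame))
```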
The customer provides its clients with models that analyze live video streams and extract valuable insights from the captured frames, with each model customized for a specific use case. Initially, the solution required each model to run on a dedicated GPU at runtime, so each client needed its own GPU instance. This setup led to underutilized GPU resources and elevated operating costs.
Therefore, our main objective was to reduce the overall platform cost and minimize data processing time while optimizing GPU utilization. Specifically, we aimed to limit AWS infrastructure costs to $30 per camera per month while keeping the total processing time (preprocessing, inference, and postprocessing) below 500 ms. Achieving these savings without degrading model performance, particularly by maintaining low inference latency, was essential to deliver the desired level of service to each client.
Initial approach
Our first approach followed a client-server architecture, splitting the end-to-end YOLOv8 deployment into two components. The client component, running on CPU instances, handled the preprocessing and postprocessing stages, while the server component, running on GPU instances, was dedicated to inference and responded to client requests. This was implemented using a custom gRPC wrapper to provide efficient communication between the components; a simplified sketch of this split follows.
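Here is a minimal client-side sketch of such a split. It assumes a hypothetical Inference service whose inference_pb2 and inference_pb2_grpc modules were generated from a simple protobuf definition; the message fields, method name, and endpoint are illustrative and are not the customer's actual wrapper.

```python
# Client-side sketch: the CPU instance does pre/post-processing, the GPU server does inference.
# inference_pb2 / inference_pb2_grpc are assumed to be generated from a hypothetical
# proto file defining an Inference service with a Predict RPC.
import cv2
import grpc
import numpy as np

import inference_pb2
import inference_pb2_grpc

def detect(frame, stub):
    # Preprocessing on the CPU client: resize and serialize the frame
    resized = cv2.resize(frame, (640, 640))
    request = inference_pb2.FrameRequest(
        data=resized.tobytes(), height=640, width=640, channels=3
    )

    # Inference on the remote GPU server
    response = stub.Predict(request, timeout=1.0)

    # Post-processing on the CPU client: decode and reshape the raw predictions
    detections = np.frombuffer(response.predictions, dtype=np.float32)
    return detections.reshape(-1, 6)  # e.g. x1, y1, x2, y2, confidence, class

channel = grpc.insecure_channel("inference-server:50051")
stub = inference_pb2_grpc.InferenceStub(channel)
print(detect(cv2.imread("sample_frame.jpg"), stub))
```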
The goal of this approach was to reduce costs by using GPUs only for the inference stage rather than for the entire end-to-end deployment. Furthermore, we assumed that client-server communication delays would have a minimal impact on the overall inference time. To assess the effectiveness of this architecture, we ran performance tests with the following baseline parameters:
- Inference ran on g4dn.xlarge GPU-based instances, because the customer's models were optimized to run on NVIDIA T4 GPUs
- The customer used the YOLOv8n (nano) model with Ultralytics version 8.2.71
Results were evaluated based on the following key performance indicators (KPIs).
- Pre-processing time – Time required to prepare the input data for the model
- Inference time – The time the YOLOv8 model takes to process the input and generate results
- Post-processing time – Time required to finalize and format the model's output for use
- Network communication time – The duration of communication between client components running on the CPU instance and server components running on the GPU instance
- Total time – The overall period from when the image is sent to the Yolov8 model until the result is received, including all processing steps
The test results were as follows:
|  | Preprocessing (ms) | Inference (ms) | Post-processing (ms) | Network communication (ms) | Total (ms) |
| --- | --- | --- | --- | --- | --- |
| Custom gRPC | 2.7 | 7.9 | 1.1 | 10.26 | 21.96 |
The GPU-based instance completed inference in 7.9 ms. However, the network communication overhead of 10.26 ms increased the total processing time. The total processing time was acceptable, but each model still required a dedicated GPU-based instance to run, making the cost unacceptable to the customer. Specifically, the inference cost was $353.03 per camera per month, exceeding the customer's budget.
Finding a better solution
Although the performance results were promising, the cost per camera was still too high, and network communication added latency, so the solution had to be optimized further. Additionally, the custom gRPC wrapper had no autoscaling mechanism to accommodate new models and required continuous maintenance, adding operational complexity.
To address these challenges, we moved away from the client-server approach and implemented GPU time-slicing (fractionalization), which divides GPU access into discrete time intervals. This approach allows AI models to share a single GPU, with each model using a virtual GPU during its assigned slice. It is similar to CPU time-slicing between processes and optimizes resource allocation without reducing performance. This approach was inspired by several AWS blog posts listed in the references section.
We implemented GPU time-slicing in the EKS cluster using the NVIDIA Kubernetes device plugin. This let us use Kubernetes' native scaling mechanisms, simplifying the scaling process to accommodate new models and reducing operational overhead. Additionally, relying on the plugin avoided the need to maintain custom code, streamlining both the implementation and long-term maintenance.
In this configuration, the GPU instance was split into 60 time-sliced virtual GPUs (see the configuration sketch that follows). Using the same KPIs as in the previous setup, we measured efficiency and performance under these optimized conditions, confirming cost reductions consistent with the quality-of-service benchmarks.
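The following is a minimal sketch of how such a time-slicing configuration can be supplied to the NVIDIA device plugin as a Kubernetes ConfigMap, here created with the official kubernetes Python client. The namespace, ConfigMap name, and data key are illustrative and depend on how the plugin is installed (for example, through its Helm chart).

```python
# Minimal sketch: create a ConfigMap holding the NVIDIA device plugin
# time-slicing configuration using the official kubernetes Python client.
# Namespace, ConfigMap name, and the "any" data key are illustrative.
from kubernetes import client, config

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 60   # expose each physical GPU as 60 virtual GPUs
"""

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core_v1 = client.CoreV1Api()
core_v1.create_namespaced_config_map(
    namespace="kube-system",
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="nvidia-device-plugin-config"),
        data={"any": TIME_SLICING_CONFIG},
    ),
)
```

With a configuration like this applied and the device plugin pointed at it, each pod still requests a single nvidia.com/gpu resource, and the scheduler can place up to 60 such pods on one physical GPU.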
The tests were conducted in three stages, as described in the following sections.
Stage 1
At this stage, we ran one pod on a g4dn.xlarge GPU-based instance. Each pod runs all three stages of the end-to-end YOLOv8 deployment on the GPU and processes video frames from a single camera. The findings are shown in the following table.
|  | Preprocessing (ms) | Inference (ms) | Post-processing (ms) | Total (ms) |
| --- | --- | --- | --- | --- |
| 1 pod | 2 | 7.8 | 1 | 10.8 |
We achieved an inference time of 7.8 ms and a total processing time of 10.8 ms, which met the project's requirements. GPU memory usage for a single pod was 247 MiB, with GPU processor utilization at 12%. The memory usage per pod indicated that approximately 60 processes (or pods) could run on a 16 GiB GPU (see the quick estimate below).
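As a rough back-of-the-envelope check (this ignores CUDA context overhead and memory fragmentation, so it is an optimistic upper bound):

```python
# Rough capacity estimate derived from the stage 1 measurements.
gpu_memory_mib = 16 * 1024      # NVIDIA T4 on g4dn instances: 16 GiB
per_pod_memory_mib = 247        # measured memory usage of a single pod

max_pods = gpu_memory_mib // per_pod_memory_mib
print(max_pods)                 # ~66, hence the target of about 60 pods
```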
Stage 2
At this stage, we ran 20 pods on a g4dn.2xlarge GPU-based instance. We changed the instance type from g4dn.xlarge to g4dn.2xlarge because of CPU overload related to data processing and loading. The findings are shown in the following table.
|  | Preprocessing (ms) | Inference (ms) | Post-processing (ms) | Total (ms) |
| --- | --- | --- | --- | --- |
| 20 pods | 11 | 42 | 55 | 108 |
At this stage, GPU memory usage reached 7,244 MiB, with GPU processor utilization peaking between 95% and 99%. The 20 pods used roughly half of the GPU's 16 GiB of memory but fully consumed the GPU processor, leading to increased processing times. Both the inference time and the total processing time increased, but this result was expected and accepted. The next goal was to determine the maximum number of pods the GPU could support within its memory capacity.
Stage 3
At this stage, we aimed to run 60 pods on a g4dn.2xlarge GPU-based instance, later changing the instance type from g4dn.2xlarge to g4dn.4xlarge and then to g4dn.8xlarge. The goal was to maximize GPU memory usage. However, data processing and loading overloaded the instance's CPU, which led us to switch to instance types that still have a single GPU but offer more vCPUs.
The findings are shown in the following table.
|  | Preprocessing (ms) | Inference (ms) | Post-processing (ms) | Total (ms) |
| --- | --- | --- | --- | --- |
| 54 pods | 21 | 56 | 128 | 205 |
GPU memory usage reached 14,780 MiB, and GPU processor utilization was at 99-100%. Despite these adjustments, we ran into out-of-GPU-memory errors that prevented us from scheduling all 60 pods. Ultimately, 54 pods was the maximum number of AI models that could fit on a single GPU.
In this scenario, the inference cost associated with GPU usage was $27.81 per camera per month, a 12-fold reduction compared to the initial approach. By adopting this approach, we met the customer's cost requirement per camera per month while maintaining an acceptable level of performance.
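A quick sanity check of the savings ratio, using only the two figures reported in this post:

```python
# Savings ratio computed from the per-camera costs reported above.
initial_cost_per_camera = 353.03   # USD/month with a dedicated GPU instance per model
optimized_cost_per_camera = 27.81  # USD/month with 54 pods sharing one time-sliced GPU

print(f"{initial_cost_per_camera / optimized_cost_per_camera:.1f}x reduction")  # ~12.7x
```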
Conclusion
In this post, we described how we helped a customer achieve a 12-fold cost reduction while keeping the performance of their YOLOv8-based AI models within acceptable limits. The test results show that GPU time-slicing allows the maximum number of AI models to run efficiently on a single GPU, significantly reducing costs while maintaining high performance. Additionally, this method requires minimal maintenance and minimal changes to the model code, improving scalability and ease of use.
References
For more information, see the following resources:
- AWS
- Community
Disclaimer
The content and opinions in this post are those of the third-party authors, and AWS is not responsible for the content or accuracy of this post.
About the authors
Claudiu Bota is a Senior Solutions Architect at Automat-It, helping customers across the EMEA region migrate to AWS and optimize their workloads. He specializes in containers, serverless technologies, and microservices, focusing on building scalable and efficient cloud solutions. Outside of work, Claudiu enjoys reading, traveling, and playing chess.
Oleg Yurchenko is the DevOps Director at Automat-It, where he leads the company's expertise in DevOps best practices and solutions. His focus areas include containers, Kubernetes, serverless, infrastructure as code, and CI/CD. With over 20 years of hands-on experience in systems administration, DevOps, and cloud technologies, Oleg is a passionate advocate for his customers, guiding them in building modern, scalable, and cost-effective cloud solutions.
VladySlav Melnyk is a Senior MLOps Engineer at Automat-It. He is a seasoned deep learning enthusiast with a passion for artificial intelligence, caring for AI products throughout their lifecycle, from experimentation to production. With over 9 years of experience working with AI in AWS environments, he is also a huge fan of leveraging cool open source tools. Result-oriented, ambitious, and focused on MLOps, VladySlav ensures smooth transitions and efficient model deployments. He is skilled at delivering deep learning models, constantly learning and adapting to stay ahead in the field.