Amazon Rufus is a shopping assistant experience powered by generative AI. It generates answers using relevant information from across Amazon and the web to help Amazon customers make better, more informed shopping decisions. With Rufus, customers shop alongside a generative AI-powered expert that knows Amazon's selection inside and out and combines it with information from across the web to help shoppers make more informed purchase decisions.
To meet the needs of Amazon customers at scale, Rufus needed a low-cost, high-performance, and highly available inference infrastructure. The solution had to serve multi-billion-parameter large language models (LLMs) with low latency around the world to a broad customer base. Low latency means users have a pleasant experience chatting with Rufus and can start receiving responses within a second. To accomplish this, the Rufus team uses multiple AWS services together with AWS AI chips, AWS Trainium and AWS Inferentia.
Inferentia and Trainium are purpose-built chips developed by AWS to accelerate deep learning workloads with high performance and low overall cost. Using these chips, Rufus reduced costs by 4.5x over other evaluated solutions while maintaining low latency for customers. This post details the deployment of Rufus inference using AWS chips and how it enabled one of the most demanding events of the year: Amazon Prime Day.
Solution overview
At the core of Rufus is an LLM trained on information from Amazon’s product catalog and from across the web. Deploying LLMs can be challenging and requires balancing factors such as model size, model accuracy, and inference performance. Larger models generally have better knowledge and reasoning capabilities, but are more costly due to more demanding compute requirements and increased latency. Rufus needs to be deployed and scaled to meet the huge demand of peak events like Amazon Prime Day. Considerations for this scale include the required performance, environmental impact, and cost of hosting the solution. To address these challenges, Rufus used a combination of AWS solutions: Inferentia2 and Trainium, Amazon Elastic Container Service (Amazon ECS), and Application Load Balancer (ALB). Additionally, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing the ability to host models using AWS chips.
Rufus inference is a Retrieval Augmented Generation (RAG) system that enriches responses by retrieving additional information, such as product information from Amazon search results. These results are based on the customer query and help ensure the LLM produces reliable, high-quality, and accurate responses.
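As a minimal sketch of this RAG flow (the retrieval function and prompt template below are hypothetical stand-ins, not Rufus's internal components), retrieved product information is folded into the prompt before the LLM generates a response:

```python
# Minimal, illustrative RAG sketch. `search_products` and the prompt template
# are hypothetical placeholders for Rufus's internal retrieval and prompting.

def search_products(query: str, top_k: int = 5) -> list[str]:
    # Placeholder for the real Amazon search integration.
    return [f"(product snippet {i} for '{query}')" for i in range(top_k)]

def build_prompt(query: str) -> str:
    snippets = search_products(query)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Use the following product information to answer the customer.\n"
        f"Product information:\n{context}\n\n"
        f"Customer question: {query}\nAnswer:"
    )

print(build_prompt("What is a good tent for winter camping?"))
```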
To ensure Rufus is best positioned for Prime Day, the Rufus team built a heterogeneous inference system using multiple AWS Regions powered by Inferentia2 and Trainium. By building a system that spans multiple regions, Rufus was able to benefit in two key areas. Firstly, it provided additional capacity that could be used during times of high demand, and secondly, it increased the resiliency of the entire system.
The Rufus team was also able to use both Inf2 and Trn1 instance types. Because the Inf2 and Trn1 instance types use the same AWS Neuron SDK, the team could serve the same Rufus model on both. The only configuration setting that needed to be adjusted was the tensor parallelism degree (24 for Inf2 and 32 for Trn1). Using Trn1 instances, we saw an additional 20% reduction in latency and increased throughput compared to Inf2.
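As a rough illustration, the only difference between the two deployments might be the tensor parallel degree passed to the serving engine. The sketch below uses vLLM's `tensor_parallel_size` argument with a placeholder model name; the `device="neuron"` flag and exact arguments depend on the vLLM and Neuron SDK versions in use and are assumptions here, not the Rufus production configuration.

```python
from vllm import LLM

# Illustrative sketch only: the same model served with different tensor
# parallel degrees. The model name and "neuron" device flag are placeholders.
llm_on_inf2 = LLM(model="example/rufus-like-llm", device="neuron",
                  tensor_parallel_size=24)   # Inf2: TP degree 24
llm_on_trn1 = LLM(model="example/rufus-like-llm", device="neuron",
                  tensor_parallel_size=32)   # Trn1: TP degree 32
```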
The following diagram shows the solution architecture.
To support real-time traffic routing across multiple regions, the Rufus team built a new traffic orchestrator. Amazon CloudWatch supported the underlying monitoring, allowing the team to adjust the traffic ratio between regions within 15 minutes based on changing traffic patterns. With this type of orchestration, the Rufus team can send requests to other regions as needed, at the cost of a small amount of added latency to the first token. Thanks to Rufus’s streaming architecture and the high-performance cross-region AWS network, end users experienced minimal delays.
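The traffic orchestrator itself is internal to Rufus, but a minimal sketch of the monitoring side might look like the following, assuming a hypothetical per-region CloudWatch metric (the namespace and metric name below are invented for illustration); the resulting load values would then drive the orchestrator's traffic-ratio updates.

```python
from datetime import datetime, timedelta
import boto3

# Hypothetical illustration: read a per-region load metric from CloudWatch.
# The namespace and metric name are invented for this sketch; the real Rufus
# orchestrator and its metrics are internal.
def average_region_load(region: str) -> float:
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    now = datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="Example/RufusInference",   # hypothetical namespace
        MetricName="InFlightRequests",        # hypothetical metric
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

# The orchestrator would compare these values across regions and shift the
# traffic ratio toward regions with spare capacity.
loads = {region: average_region_load(region) for region in ["us-east-1", "us-west-2"]}
```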
With these choices, Rufus scaled up to over 80,000 Trainium and Inferentia chips across three regions, delivering an average of 3 million tokens per minute while keeping the P99 latency to first response under one second for Prime Day customers. Additionally, by using these purpose-built chips, Rufus achieved 54% better performance per watt than other evaluated solutions, helping the Rufus team meet its energy efficiency goals.
Optimizing inference performance and host utilization
Within each region, the Rufus inference system used Amazon ECS to manage the underlying Inferentia- and Trainium-powered instances. Because Amazon ECS manages the underlying infrastructure, the Rufus team only needed to define ECS tasks and supply containers and configuration. Inside each container, an NVIDIA Triton Inference Server with a Python backend runs vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine optimized for high throughput. The Neuron SDK makes it easy for teams to adopt AWS chips and supports a variety of libraries and frameworks, including PyTorch Lightning.
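As a sketch of how this can fit together, the Triton Python backend wraps model logic in a `TritonPythonModel` class; the version below delegates generation to vLLM. The tensor names, model path, and sampling settings are placeholders, and the production Rufus setup (Neuron device configuration, decoupled/streaming mode) is more involved than this.

```python
import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import LLM, SamplingParams

# Rough skeleton of a Triton Python-backend model that delegates generation
# to vLLM. Tensor names ("prompt", "completion") and the model path are
# placeholders, not Rufus's actual configuration.
class TritonPythonModel:
    def initialize(self, args):
        self.llm = LLM(model="/models/example-llm")          # placeholder path
        self.sampling = SamplingParams(max_tokens=256, temperature=0.7)

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            prompt = prompt_tensor.as_numpy()[0].decode("utf-8")
            output = self.llm.generate([prompt], self.sampling)[0]
            text = output.outputs[0].text
            out_tensor = pb_utils.Tensor(
                "completion", np.array([text.encode("utf-8")], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```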
The Neuron SDK provides a straightforward LLM inference solution with optimized performance that supports a wide range of transformer-based LLM architectures on Trainium and Inferentia hardware. To reduce latency, Rufus collaborated with the AWS Annapurna team to develop various optimizations such as INT8 (weights-only) quantization, continuous batching with vLLM, and resource, compute, and memory-bandwidth optimizations in the Neuron compiler and runtime. These optimizations are currently deployed in the Rufus production environment and are available in Neuron SDK 2.18 and later.
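For illustration, weight-only INT8 quantization can be enabled through the transformers-neuronx library that ships with the Neuron SDK, roughly as follows. The model path, tensor parallel degree, and dtype choices are placeholders, and exact class names and arguments may differ across Neuron SDK versions; this is not the Rufus production configuration.

```python
from transformers_neuronx.config import NeuronConfig, QuantizationConfig
from transformers_neuronx.llama.model import LlamaForSampling

# Illustrative sketch: compile a Llama-family model for Inferentia/Trainium
# with INT8 (weights-only) quantization. Paths and parameters are placeholders;
# consult the Neuron SDK documentation for your version.
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype="s8", dequant_dtype="f16")
)
model = LlamaForSampling.from_pretrained(
    "/models/example-llm",       # placeholder model directory
    neuron_config=neuron_config,
    tp_degree=24,                # tensor parallel degree (see earlier section)
    amp="f16",
)
model.to_neuron()                # compile and load onto the Neuron cores
```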
To reduce the overall latency before customers start seeing responses from Rufus, the team also developed an inference streaming architecture. Because LLM inference is compute- and memory-intensive, generating a complete response to a customer query can take several seconds. With a streaming architecture, Rufus can return tokens as soon as they are generated. This optimization allows customers to start consuming the response in less than a second. Additionally, multiple services work together over gRPC connections to intelligently aggregate and enrich the streaming response to customers in real time.
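A minimal sketch of the streaming idea, using vLLM's async engine (not Rufus's actual service code; the model path is a placeholder and the vLLM API varies by version), looks like this: partial outputs are yielded as soon as new tokens exist, so a gRPC or HTTP layer can forward them to the client immediately.

```python
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Sketch only: stream newly generated text as it is produced.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="/models/example-llm"))

async def stream_tokens(prompt: str, request_id: str):
    sampling = SamplingParams(max_tokens=256)
    emitted = ""
    async for request_output in engine.generate(prompt, sampling, request_id):
        text = request_output.outputs[0].text
        yield text[len(emitted):]   # forward only the newly generated piece
        emitted = text
```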
As shown in the following image, images and links are embedded in the response so customers can continue to engage and explore Rufus.
Scaling up
While maintaining low latency is necessary for the best customer experience, it is also important to achieve high utilization of hardware resources to scale service throughput. High hardware utilization prevents accelerators from sitting idle and increasing costs unnecessarily. To optimize the inference system’s throughput, the team improved both single-host throughput and load-balancing efficiency.
Load balancing for LLM inference requires attention because of the following challenges: First, a single host can only handle a limited number of concurrent requests. Second, the end-to-end latency to complete a single request can span several seconds depending on the length of the LLM response.
To address these challenges, the team optimized throughput by considering both single-host throughput and aggregate throughput across many hosts using load balancing.
The team used ALB’s least outstanding requests (LOR) routing algorithm, improving throughput by five times compared to the previous baseline measurement. This gives each host enough time to process in-flight requests and stream back responses over gRPC connections without being overwhelmed by multiple requests arriving at the same time. Rufus also worked with AWS and the vLLM team to improve single-host concurrency using vLLM’s integration with the Neuron SDK and NVIDIA Triton Inference Server.
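On ALB, LOR routing is a target group attribute. A minimal example of enabling it with boto3 (the target group ARN below is a placeholder) is:

```python
import boto3

# Enable least outstanding requests (LOR) routing on an ALB target group.
# The target group ARN is a placeholder.
elbv2 = boto3.client("elbv2")
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/example/placeholder",
    Attributes=[
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"},
    ],
)
```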
This integration allowed Rufus to benefit from continuous batching, an important optimization. Continuous batching significantly increases throughput on a single host. Additionally, continuous batching offers unique advantages compared to other batching techniques, such as static batching. For example, with static batching, time to first token (TTFT) increases linearly with the number of requests in a single batch. Continuous batching prioritizes the prefill stage of LLM inference to keep TTFT under control even as more requests run concurrently. This allows Rufus to provide a low-latency, pleasant experience when generating the first response, increase single-host throughput, and keep serving costs under control. A sketch illustrating the difference follows.
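To make the contrast concrete, the sketch below uses vLLM's low-level engine loop, where new requests can join the in-flight batch at every scheduling step instead of waiting for the current batch to drain. The model path and prompts are placeholders, and the exact engine API varies across vLLM versions.

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

# Sketch of continuous batching with vLLM's low-level engine API: requests are
# added while earlier requests are still generating, and each step() call
# advances all in-flight sequences together. Model path is a placeholder.
engine = LLMEngine.from_engine_args(EngineArgs(model="/models/example-llm"))
sampling = SamplingParams(max_tokens=128)

pending = [("req-1", "What are good running shoes?"),
           ("req-2", "Suggest a gift for a 5 year old.")]

while pending or engine.has_unfinished_requests():
    if pending:                      # new requests join the running batch
        request_id, prompt = pending.pop(0)
        engine.add_request(request_id, prompt, sampling)
    for output in engine.step():     # advance all in-flight requests one step
        if output.finished:
            print(output.request_id, output.outputs[0].text)
```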
Conclusion
In this post, we explained how Rufus reliably deploys and serves multi-billion-parameter LLMs using the Neuron SDK with Inferentia2 and Trainium chips and AWS services. Rufus continues to evolve with advances in generative AI and customer feedback, and we encourage you to try Inferentia and Trainium.
Learn more about how we’re innovating with generative AI across Amazon.
About the authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys exploring new cultures and new experiences and keeping up with the latest technology trends.
R.J. is an engineer within Amazon. He builds and optimizes distributed systems for training and works on optimizing deployed systems to reduce latency in ML inference. Outside of work, he is exploring using generative AI to create food recipes.
Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is improving the performance and cost efficiency of generative AI inference. Outside of work, he loves to travel and has recently developed a passion for long-distance running.
Adam (Hongsheng) Chao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam leads the Rufus Inference team in building large-scale generative AI inference optimization solutions and inference systems that deliver fast inference at low cost. Outside of work, he enjoys traveling with his wife and creating art.
Faqin Zong is a software engineer at Amazon Stores Foundational AI, working on large language model (LLM) inference infrastructure and optimizations. Passionate about generative AI technology, Faqin collaborates with key teams to drive innovation, making LLMs more accessible and impactful and ultimately enhancing customer experiences across diverse applications. Outside of work, he enjoys cardio exercise and baking with his son.
Nicholas Troun is a foundational AI engineer at Amazon Stores. His recent focus is on leveraging his systems expertise across Rufus to support the Rufus Inference team and help drive efficient utilization across the Rufus experience. Outside of work, he enjoys spending time with his wife and taking day trips to the nearby coast, Napa, and Sonoma areas.
Bing Yin is Director of Science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.