Cutting-edge generative AI models and high-performance computing (HPC) applications are driving the need for unprecedented levels of computing. Customers are pushing the limits of these technologies to bring higher fidelity products and experiences to market across industries.
The size of large language models (LLMs), measured by the number of parameters, has grown exponentially in recent years, reflecting a broader trend in the AI field: models have gone from billions to hundreds of billions of parameters in just five years. As LLMs have grown, so has their performance on a wide range of natural language processing tasks; however, their scale poses significant computational and resource challenges, because training and deploying these models requires enormous amounts of compute power, memory, and storage.
The size of the LLM has a significant impact on the choice of compute required for inference. A larger LLM needs more GPU memory to store model parameters and intermediate calculations, and more compute power to perform the matrix multiplications and other operations involved in inference. It also takes longer to run a single inference pass because of the increased computational complexity, which can result in longer inference latency, a critical factor for applications that require real-time or near real-time responses.
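To make these memory requirements concrete, the following back-of-envelope sketch in Python estimates the GPU memory needed to hold the model weights and the key-value (KV) cache for a given batch size and sequence length. The layer count, head dimensions, and other figures are illustrative assumptions, not measurements of any specific model.

```python
# Back-of-envelope estimate of GPU memory needed for LLM inference:
# model weights plus the key-value (KV) cache, which grows with batch size
# and sequence length. All model dimensions below are illustrative assumptions.

def inference_memory_gb(params_billion, layers, kv_heads, head_dim,
                        batch_size, seq_len, bytes_per_value=2):
    weights = params_billion * 1e9 * bytes_per_value
    # KV cache: 2 tensors (K and V) per layer, per KV head, per sequence position
    kv_cache = 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value
    return (weights + kv_cache) / 1e9

# Example: a 70B-parameter model in 16-bit precision, batch of 8, 4K context
estimate = inference_memory_gb(70, layers=80, kv_heads=8, head_dim=128,
                               batch_size=8, seq_len=4096)
print(f"~{estimate:.0f} GB")   # roughly 140 GB of weights plus ~11 GB of KV cache
```

The estimate ignores activations and framework overhead, but it shows why both memory capacity and the ability to move that memory quickly dominate inference performance at this scale.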
HPC customers are seeing a similar trend: as their data collection grows in fidelity and their datasets reach exabyte scale, they are looking for ways to accelerate time to solution across increasingly complex applications.
To address customer demand for high performance and scalability for deep learning, generative AI, and HPC workloads, we are announcing the general availability of Amazon Elastic Compute Cloud (Amazon EC2) P5e instances featuring NVIDIA H200 Tensor Core GPUs. AWS is the first major cloud provider to offer the H200 GPUs in production. We are also announcing the upcoming launch of P5en instances, a network-optimized variant of the P5e instances.
In this post, I discuss the core capabilities of these instances and the use cases they are suited for, and I provide an example of how you can get started using them to run inference deployments of the Meta Llama 3.1 70B and 405B models.
Overview of EC2 P5e Instances
P5e instances feature the NVIDIA H200 GPU, which provides 1.7x the GPU memory capacity and 1.5x the GPU memory bandwidth compared to the NVIDIA H100 Tensor Core GPU found in P5 instances.
P5e instances feature eight NVIDIA H200 GPUs with 1128 GB of high-bandwidth GPU memory, 3rd generation AMD EPYC processors, 2 TiB of system memory, and 30 TB of local NVMe storage. P5e instances also offer 3,200 Gbps of aggregate network bandwidth with support for GPUDirect RDMA, bypassing the CPU for inter-node communication, delivering lower latency and efficient scale-out performance.
The following table summarizes the instance details:
Instance size | Number of vCPUs | Instance memory (TiB) | GPUs | GPU memory | Network bandwidth (Gbps) | GPUDirect RDMA | GPU peer-to-peer | Instance storage (TB) | EBS bandwidth (Gbps) |
p5e.48xlarge | 192 | 2 | 8 x NVIDIA H200 | 1128 GB HBM3e | 3,200 (EFA) | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
EC2 P5en Instances Coming Soon
One of the bottlenecks in GPU-accelerated computing can be the communication between the CPU and the GPU. Data transfer between these two components can be time-consuming, especially for workloads that require large datasets or frequent data exchange. This challenge can impact a wide range of GPU-accelerated applications, including deep learning, high-performance computing, and real-time data processing. The need to move data between the CPU and GPU can introduce latency and reduce overall efficiency. Additionally, network latency can be an issue for ML workloads on distributed systems, as data needs to be transferred between multiple machines.
The EC2 P5en instances, scheduled for release in 2024, will help address these challenges. P5en instances will pair NVIDIA H200 GPUs with custom 4th Generation Intel Xeon Scalable processors, enabling PCIe Gen 5 between the CPU and GPU. These instances improve workload performance by providing up to 4x the bandwidth between the CPU and GPU and by reducing network latency.
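You can get a feel for the CPU-to-GPU transfer path on any GPU instance with a simple measurement. The following is a minimal PyTorch sketch, assuming a CUDA-capable instance with PyTorch installed (as it is on the Deep Learning AMI), that times pinned host-to-device copies to estimate effective transfer bandwidth.

```python
import time
import torch

def host_to_device_gbps(size_mb=1024, iters=20):
    """Time pinned host-to-device copies to estimate effective PCIe bandwidth."""
    src = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, pin_memory=True)
    dst = torch.empty_like(src, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return size_mb * 1024 * 1024 * iters / elapsed / 1e9

if __name__ == "__main__":
    print(f"Host-to-device bandwidth: ~{host_to_device_gbps():.1f} GB/s")
```

Running a measurement like this before and after changing instance types is a quick way to see how much headroom the CPU-to-GPU path gives your data loading and preprocessing pipeline.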
P5e Use Cases
P5e instances are ideal for training, fine-tuning, and running inference on the increasingly complex LLMs and multimodal foundation models (FMs) behind the most demanding and compute-intensive generative AI applications, including question answering, code generation, video and image generation, and speech recognition.
Customers deploying LLMs for inference can benefit from using P5e instances, which offer several key advantages that make them an ideal choice for these workloads:
First, the higher memory bandwidth of the H200 GPUs in P5e instances enables the GPU to retrieve data from memory and process it faster. This translates into lower inference latency, which is important for real-time applications such as conversational AI systems where users expect near-instantaneous responses. Higher memory bandwidth also improves throughput, enabling the GPU to process more inferences per second. Customers deploying the 70 billion parameter Meta Llama 3.1 model on P5e instances can expect up to 1.87x higher throughput and up to 40% lower cost compared to using an equivalent P5 instance (input sequence length 121, output sequence length 5000, batch size 10, vLLM framework).
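As a starting point for this kind of deployment, here is a minimal vLLM sketch that loads Llama 3.1 70B onto a single p5e.48xlarge and shards it across all eight GPUs. The Hugging Face model ID, sequence length, and batch size are assumptions to adjust for your workload, and the model requires accepting Meta's license on Hugging Face.

```python
from vllm import LLM, SamplingParams

# Load Llama 3.1 70B and shard it across the eight H200 GPUs of a p5e.48xlarge.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed Hugging Face model ID
    tensor_parallel_size=8,
    max_model_len=8192,  # cap context to your input + output lengths
)

sampling = SamplingParams(temperature=0.7, max_tokens=512)

# A batch of prompts; vLLM schedules them concurrently for higher throughput.
prompts = ["Summarize the benefits of higher GPU memory bandwidth for LLM inference."] * 10

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text[:200])
```

Wrapping the generate call in a timer and dividing the total number of generated tokens by the elapsed time gives a throughput figure you can compare across instance types and configurations.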
Second, the sheer scale of modern LLMs with hundreds of billions of parameters requires enormous amounts of memory to store the model and intermediate computations during inference. With standard P5 instances, you would likely need multiple instances to accommodate these memory requirements. However, the 1.76x GPU memory capacity of P5e instances enables you to scale up and fit the entire model on a single instance, avoiding the complexity and overhead of distributed inference, including data synchronization, communication, and load balancing. Customers deploying the 405 billion parameter Meta Llama 3.1 model on a single P5e instance can achieve up to 1.7x faster inference and up to 69% lower cost compared to using two P5 instances (input sequence length 121, output sequence length 50, batch size 10, vLLM framework).
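A quick back-of-envelope check shows why the 405B model fits on a single P5e instance: in 16-bit precision the weights alone occupy roughly 810 GB, which exceeds the 640 GB of aggregate GPU memory on a P5 instance but fits within the 1128 GB on a P5e, leaving headroom for the KV cache. This is a rough sketch that ignores activations and framework overhead.

```python
# Back-of-envelope check: can a 405B-parameter model fit on one instance?
params = 405e9
bytes_per_param = 2                        # 16-bit (FP16/BF16) weights
weights_gb = params * bytes_per_param / 1e9

p5_gpu_memory_gb = 8 * 80                  # P5: 8 x NVIDIA H100 with 80 GB each
p5e_gpu_memory_gb = 8 * 141                # P5e: 8 x NVIDIA H200 with 141 GB each

print(f"Weights alone: ~{weights_gb:.0f} GB")                    # ~810 GB
print(f"Fits on one P5?  {weights_gb < p5_gpu_memory_gb}")       # False (640 GB total)
print(f"Fits on one P5e? {weights_gb < p5e_gpu_memory_gb}")      # True (1128 GB total)
```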
Finally, the larger GPU memory on P5e instances enables better utilization of the GPU by using larger batch sizes during inference, resulting in faster inference times and higher overall throughput. This additional memory is especially beneficial for customers with large inference needs.
When optimizing inference throughput and cost, consider tuning the batch size, input/output sequence length, and quantization level, as these parameters can have a large impact. Try different configurations to find the best balance between performance and cost for your specific use case.
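One lightweight way to explore these trade-offs is to hold the model and sequence lengths fixed and sweep the number of concurrent requests, recording tokens per second at each setting. The sketch below repeats the vLLM setup from the earlier 70B example; the prompt and sweep values are placeholders to replace with data representative of your workload.

```python
import time
from vllm import LLM, SamplingParams

# Assumed setup: Llama 3.1 70B sharded across the eight GPUs of a p5e.48xlarge.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=8)
sampling = SamplingParams(max_tokens=256)

# Sweep the number of concurrent requests and record generation throughput.
for batch_size in (1, 4, 8, 16, 32):
    prompts = ["Explain KV caching in one paragraph."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size:>3}  {tokens / elapsed:,.0f} tokens/s")
```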
In summary, the combination of higher memory bandwidth, increased GPU memory capacity, and support for larger batch sizes makes P5e instances an ideal choice for customers deploying LLM inference workloads. These instances can deliver significant performance improvements, cost savings, and operational simplicity compared to other options.
P5e instances are also well suited for memory-intensive HPC applications such as simulation, drug discovery, seismic analysis, weather forecasting, and financial modeling. Customers using dynamic programming (DP) algorithms for applications such as genome sequencing and high-speed data analytics can also take advantage of the support for DPX instructions in P5e instances.
Get started with P5e instances
When you launch a P5e instance, you can get started with the AWS Deep Learning AMI (DLAMI). The DLAMI provides ML practitioners and researchers with the infrastructure and tools to quickly build scalable, secure, distributed ML applications in a preconfigured environment. You can run containerized applications on P5e instances with AWS Deep Learning Containers, using orchestration services such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).
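As a hedged illustration of the launch step, the boto3 sketch below starts a p5e.48xlarge from a Deep Learning AMI into a previously purchased Capacity Block. The AMI, key pair, subnet, and reservation IDs are placeholders, and the exact launch options are worth confirming against the EC2 documentation.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # US East (Ohio), where P5e is available

# All IDs below are placeholders: substitute your DLAMI ID, key pair, subnet,
# and the Capacity Block reservation you purchased for p5e.48xlarge.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # AWS Deep Learning AMI for your Region
    InstanceType="p5e.48xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
    SubnetId="subnet-0123456789abcdef0",
    InstanceMarketOptions={"MarketType": "capacity-block"},
    CapacityReservationSpecification={
        "CapacityReservationTarget": {"CapacityReservationId": "cr-0123456789abcdef0"}
    },
)
print(response["Instances"][0]["InstanceId"])
```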
P5e Instances Now Available
EC2 P5e instances are now available in the US East (Ohio) AWS Region in the p5e.48xlarge size through Amazon EC2 Capacity Blocks for ML. To learn more, see Amazon EC2 P5 Instances.
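If you reserve capacity programmatically, a sketch along these lines, using the EC2 Capacity Blocks APIs in boto3, can search for and purchase an offering. The duration, instance count, and date window are placeholder assumptions, and the parameters are worth double-checking against the current API reference.

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2", region_name="us-east-2")  # US East (Ohio)

# Find p5e.48xlarge Capacity Block offerings in a placeholder two-week window.
offerings = ec2.describe_capacity_block_offerings(
    InstanceType="p5e.48xlarge",
    InstanceCount=1,
    CapacityDurationHours=24,
    StartDateRange=datetime.now(timezone.utc) + timedelta(days=1),
    EndDateRange=datetime.now(timezone.utc) + timedelta(days=14),
)

# Purchase the first offering returned; this creates a Capacity Reservation
# you can target when launching the instance.
offering_id = offerings["CapacityBlockOfferings"][0]["CapacityBlockOfferingId"]
purchase = ec2.purchase_capacity_block(
    CapacityBlockOfferingId=offering_id,
    InstancePlatform="Linux/UNIX",
)
print(purchase["CapacityReservation"]["CapacityReservationId"])
```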
About the Authors
Avi Kulkarni is a Senior Specialist at AWS, focusing on global business development and go-to-market for ML and HPC workloads for both commercial and public sector customers. Previously, he managed partnerships at AWS and led product management for automotive customers at Honeywell, covering electric, autonomous, and conventional vehicles.
Kartik Vena is a Principal Product Manager at AWS, where he leads the development of EC2 instances for a variety of workloads, including deep learning and generative AI.
Khaled Rawashdeh is a Senior Product Manager at AWS, where he defines and creates Amazon EC2 accelerated computing instances for the most demanding AI/ML workloads. Prior to joining AWS, he worked for leading companies focused on creating data center software and systems for enterprise customers.
Aman Shanbagh is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, helping customers and partners adopt ML training and inference solutions at scale. Prior to joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.
Pavel Berevich is a Senior Applied Scientist on the ML Frameworks team at Amazon Web Services, where he applies his research on distributed training and inference for large-scale models to real-world customer needs. Prior to joining AWS, he worked on distributed training techniques, including FSDP and pipeline parallelism, on the PyTorch Distributed team.
Dr. Maxime Hugues is a Principal WW Specialist Solutions Architect for generative AI at AWS, having joined the company in 2020. He holds an MSc in Engineering from the French National Polytechnic School ISEN-Toulon, an MSc in Science, and a PhD in Computer Science from the University of Lille 1 (2011). His research focuses on programming paradigms, innovative hardware for extreme computing, and HPC/machine learning performance. Prior to joining AWS, he worked as an HPC Research Scientist and Technical Lead at TotalEnergies.
Shruti Koparkar is a Senior Product Marketing Manager at AWS, helping customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.