This post was co-authored by Travis Mehlinger and Karthik Raghunathan of Cisco.
Cisco’s Webex is a leading provider of cloud-based collaboration solutions, including video conferencing, calling, messaging, events, polling, asynchronous video, and customer experience solutions such as contact center and dedicated collaboration devices. Focused on delivering inclusive collaboration experiences, Webex fuels innovation with AI and machine learning to remove barriers of geography, language, personality, and familiarity with technology. Its solutions are built with security and privacy by design. Webex works with the world’s leading business and productivity apps, including AWS.
Cisco’s Webex AI (WxAI) team plays a key role in enhancing these products with AI-driven features. Over the past year, the team has increasingly focused on building capabilities that leverage large language models (LLMs) to improve user productivity and experience. In particular, the team’s work extends to Webex Contact Center, a cloud-based omnichannel contact center solution that helps organizations deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing, and sentiment analysis, helping Webex Contact Center deliver more personalized and efficient customer support. However, as these models grew to hundreds of gigabytes in size, the WxAI team faced challenges in efficiently allocating resources and launching applications with embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.
In this post, we show you how Cisco did it using the new faster autoscaling feature in Amazon SageMaker Inference. For more information about Cisco’s use case, solution, and benefits, see How Cisco Accelerated the Use of Generative AI with Amazon SageMaker Inference.
In this post, we’ll cover:
- Cisco use case and architecture overview
- Introducing faster autoscaling
- Single-model real-time endpoints
- Deploying with Amazon SageMaker inference components
- Performance improvements Cisco achieved with faster autoscaling for GenAI inference
- Next steps
Cisco Use Case: Enhancing the Contact Center Experience
Webex is applying generative AI to its contact center solutions to enable more natural, human-like conversations between customers and agents. AI can generate contextual, empathetic responses to customer inquiries and automatically create personalized emails and chat messages, helping contact center agents work more efficiently while maintaining a high level of customer service.
Architecture
Initially, WxAI embedded LLM models directly into application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as models grew larger and more complex, this approach faced significant challenges in terms of scalability and resource utilization. Operating the resource-intensive LLM through the application required provisioning large amounts of compute resources, slowing down processes like resource allocation and application launch. This inefficiency prevented WxAI from quickly developing, testing, and deploying new AI-powered capabilities for the Webex portfolio.
To address these challenges, the WxAI team turned to SageMaker Inference, a fully managed AI inference service that enables seamless deployment and scaling of models, independent of the applications that use them. By decoupling LLM hosting from Webex applications, WxAI can provision the compute resources required for their models without impacting core collaboration and communication capabilities.
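To make the decoupling concrete, the following is a minimal sketch (not Cisco’s actual deployment; the container image URI, model artifact location, endpoint name, and instance type are placeholders) of hosting a model on a SageMaker real-time endpoint with the SageMaker Python SDK and then invoking it from an application over the runtime API:

```python
import json

import boto3
import sagemaker
from sagemaker.model import Model

# Placeholder values: Cisco's actual container images, model artifacts, and
# endpoint configuration are not described in this post.
role = sagemaker.get_execution_role()
model = Model(
    image_uri="<llm-serving-container-image-uri>",              # e.g., an LLM serving image
    model_data="s3://<bucket>/<model-artifacts>/model.tar.gz",  # packaged model weights
    role=role,
)

# Host the model on its own real-time endpoint, sized independently of the application.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="webex-llm-demo",
)

# The application calls the endpoint over the SageMaker runtime API instead of
# loading the model inside its own container image.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="webex-llm-demo",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Summarize the customer's last message."}),
)
print(response["Body"].read().decode("utf-8"))
```

Because the application only needs the endpoint name and an HTTPS call, the model’s compute footprint can be sized and scaled without touching the application containers.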
“Applications and models work and scale fundamentally differently and have very different cost considerations. By isolating them rather than lumping them together, it becomes much easier to solve the problems separately.”
-Travis Mehlinger, Principal Engineer, Cisco.
This architectural shift has enabled Webex to harness the power of generative AI across its entire suite of collaboration and customer engagement solutions.
Previously, the team’s SageMaker endpoints used autoscaling based on a per-instance invocation metric, but it took about 6 minutes to detect the need to scale.
Introducing new predefined metric types for faster autoscaling
The Cisco Webex AI team wanted to improve inference autoscaling times, so it worked with the Amazon SageMaker team on the improvements described below.
Amazon SageMaker real-time inference endpoints provide a scalable, managed solution for hosting generative AI models. This versatile resource can accommodate multiple instances and serve one or more deployed models for real-time predictions. Customers have the flexibility to deploy a single model or multiple models using SageMaker inference components on the same endpoint, allowing for efficient handling of diverse workloads and cost-effective scaling.
To optimize real-time inference workloads, SageMaker employs application autoscaling (autoscaling). This feature dynamically adjusts both the number of instances in use and the number of deployed model copies (if you use inference components) in response to real-time changes in demand. When traffic to your endpoint exceeds a predefined threshold, autoscaling increases the available instances and deploys additional model copies to meet the growing demand. Conversely, when the workload decreases, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling keeps resources optimally used, balancing performance needs and cost considerations in real time.
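As an illustration of the inference-component model, here is a minimal sketch of registering a model as an inference component on an existing endpoint using boto3. The names are hypothetical, and the exact request fields should be confirmed against the current SageMaker API documentation:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names; assumes an endpoint already exists whose instances have
# spare accelerator and memory capacity for this component.
sm.create_inference_component(
    InferenceComponentName="llm-component",
    EndpointName="shared-gpu-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "my-registered-llm",  # a model already created in SageMaker
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # autoscaling can adjust the copy count later
)
```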
Amazon SageMaker, in collaboration with Cisco, has released a new sub-minute, high-resolution predefined metric type, SageMakerVariantConcurrentRequestsPerModelHighResolution, to achieve faster autoscaling and reduced detection times. This new high-resolution metric has been shown to reduce scaling detection times by up to 6x (compared to the existing SageMakerVariantInvocationsPerInstance metric), improving overall end-to-end inference latency by up to 50% for endpoints hosting generative AI models such as Llama3-8B.
With this release, SageMaker real-time endpoints also emit two new CloudWatch metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are well suited for monitoring and scaling Amazon SageMaker endpoints that host LLMs and foundation models (FMs).
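The new predefined metric type plugs into the same Application Auto Scaling target-tracking setup used today. The following is a hedged sketch (endpoint name, capacity limits, and target value are illustrative only) of attaching a target-tracking policy that uses SageMakerVariantConcurrentRequestsPerModelHighResolution to scale an endpoint variant:

```python
import boto3

aas = boto3.client("application-autoscaling")

endpoint_name = "webex-llm-demo"   # illustrative endpoint name
variant_name = "AllTraffic"
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Register the endpoint variant as a scalable target (1 to 4 instances here).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy on the new high-resolution concurrency metric.
aas.put_scaling_policy(
    PolicyName="concurrent-requests-high-resolution",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
        "TargetValue": 5.0,        # illustrative target for concurrent requests per model
        "ScaleInCooldown": 180,
        "ScaleOutCooldown": 60,
    },
)
```

Because the underlying metric is emitted at sub-minute resolution, a policy like this can react to concurrency spikes much sooner than one based on SageMakerVariantInvocationsPerInstance.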
Cisco’s evaluation of faster autoscaling for GenAI inference
Cisco evaluated the new predefined metric types in Amazon SageMaker to speed up autoscaling for its generative AI workloads. Using the new SageMakerVariantConcurrentRequestsPerModelHighResolution metric, Cisco saw up to a 50% improvement in end-to-end inference latency compared to the existing SageMakerVariantInvocationsPerInstance metric.
The setup used a generative AI model hosted on a SageMaker real-time inference endpoint, with SageMaker’s autoscaling feature dynamically adjusting both the number of instances and the number of model copies to accommodate real-time changes in demand. The new SageMakerVariantConcurrentRequestsPerModelHighResolution metric detected the need to scale up to 6x faster, resulting in faster autoscaling and reduced latency.
In addition, SageMaker now emits the new CloudWatch metrics ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are well suited for monitoring and scaling endpoints hosting LLMs and FMs. This enhanced autoscaling capability has been a game changer for Cisco, helping to improve the performance and efficiency of its critical generative AI applications.
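For monitoring, the new metrics can be pulled from CloudWatch like any other endpoint metric. Here is a small sketch, assuming the standard AWS/SageMaker namespace and EndpointName/VariantName dimensions (verify these for your endpoint):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Pull the last 15 minutes of the ConcurrentRequestsPerModel metric for one variant.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",  # assumed namespace for endpoint metrics
    MetricName="ConcurrentRequestsPerModel",
    Dimensions=[
        {"Name": "EndpointName", "Value": "webex-llm-demo"},  # placeholder endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```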
“We are really pleased with the performance improvements that the new auto scaling metrics in Amazon SageMaker have brought. High-resolution scaling metrics have significantly reduced latency during initial load and scale-out for our Gen AI workloads, and we look forward to rolling this feature out broadly across our infrastructure.”
-Travis Mehlinger, Principal Engineer, Cisco.
Going forward, Cisco also plans to work with the SageMaker Inference team to drive improvements in other variables that affect autoscaling latency, such as model download and load times.
Conclusion
Cisco’s Webex AI team continues to leverage Amazon SageMaker Inference to power generative AI experiences across the Webex portfolio. In its evaluations of SageMaker’s faster autoscaling, Cisco saw up to 50% latency improvements on its GenAI inference endpoints. As the Webex AI team continues to push the boundaries of AI-driven collaboration, its partnership with the Amazon SageMaker team will be integral to delivering further improvements and advanced GenAI inference capabilities. With this new capability, Cisco aims to further optimize AI inference performance, broaden deployments across multiple AWS Regions, and deliver even more impactful generative AI capabilities to its customers.
About the Authors
Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps his team develop and operate cloud-native AI and ML capabilities that support Webex AI features for customers around the world. In his spare time, he enjoys BBQ, playing video games, and traveling around the US and UK in go-karts.
Karthik Raghunathan is the Senior Director of Voice, Language, and Video AI for the Webex Collaboration AI group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven capabilities for the Webex collaboration portfolio. Prior to joining Cisco, he held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.
Praveen Chamarthi is a Senior AI/ML Specialist at Amazon Web Services. He is passionate about all things AI/ML and AWS, and helps customers across the Americas scale, innovate, and operate their ML workloads efficiently on AWS. In his spare time, he enjoys reading and watching sci-fi movies.
Saurabh Trikhande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is driven by the goal of democratizing AI. He focuses on key challenges related to deploying complex AI applications, multi-tenant models, cost optimization, and making generative AI models easier to deploy. In his spare time, he enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Ravi Thakur is a Senior Solutions Architect supporting strategic industries at AWS, based in Charlotte, NC. His career spans various industries, including banking, automotive, telecommunications, insurance, and energy. Ravi’s expertise is driven by a focus on solving complex business challenges for his customers using distributed, cloud-native, and well-architected design patterns. His proficiency spans microservices, containerization, AI/ML, generative AI, and more. Currently, Ravi leverages his ability to deliver proven, tangible benefits to help AWS strategic customers on their personalized digital transformation journeys.