This post was co-authored by Travis Mehlinger and Karthik Raghunathan of Cisco.
Cisco’s Webex is a leading provider of cloud-based collaboration solutions, including video conferencing, calling, messaging, events, polling, asynchronous video, and customer experience solutions such as contact center and dedicated collaboration devices. Focused on delivering inclusive collaboration experiences, Webex fuels innovation with AI and machine learning to remove barriers of geography, language, personality, and familiarity with technology. Its solutions are built with security and privacy by design. Webex works with the world’s leading business and productivity apps, including AWS.
Cisco’s Webex AI (WxAI) team plays a key role in enhancing these products with AI-driven features. Over the past year, the team has increasingly focused on building capabilities that leverage large language models (LLMs) to improve user productivity and experience. In particular, the team’s work extends to Webex Contact Center, a cloud-based omnichannel contact center solution that helps organizations deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing, and sentiment analysis, helping Webex Contact Center deliver more personalized and efficient customer support. However, as these models grew to hundreds of gigabytes in size, the WxAI team faced challenges in efficiently allocating resources and launching applications with embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.
In this post, we show you how Cisco did it using the new faster autoscaling feature in Amazon SageMaker Inference. For more information about Cisco’s use case, solution, and benefits, see How Cisco Accelerated the Use of Generative AI with Amazon SageMaker Inference.
In this post, we’ll cover:
- Cisco use case and architecture overview
- Introducing faster autoscaling
- Single-model real-time endpoints
- Deploying with Amazon SageMaker inference components
- Performance improvements Cisco achieved with faster autoscaling for GenAI inference
- Next steps
Cisco Use Case: Enhancing the Contact Center Experience
Webex is applying generative AI to its contact center solutions to enable more natural, human-like conversations between customers and agents. AI can generate contextual, empathetic responses to customer inquiries and automatically create personalized emails and chat messages, helping contact center agents work more efficiently while maintaining a high level of customer service.
Architecture
Initially, WxAI embedded LLM models directly into application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as models grew larger and more complex, this approach faced significant challenges in terms of scalability and resource utilization. Operating the resource-intensive LLM through the application required provisioning large amounts of compute resources, slowing down processes like resource allocation and application launch. This inefficiency prevented WxAI from quickly developing, testing, and deploying new AI-powered capabilities for the Webex portfolio.
To address these challenges, the WxAI team turned to SageMaker Inference, a fully managed AI inference service that enables seamless deployment and scaling of models, independent of the applications that use them. By decoupling LLM hosting from Webex applications, WxAI can provision the compute resources required for their models without impacting core collaboration and communication capabilities.
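To make the decoupling concrete, the following is a minimal sketch (not Cisco’s actual deployment; the container image URI, model artifact location, endpoint name, and instance type are placeholders) of hosting a model on a SageMaker real-time endpoint with the SageMaker Python SDK and then invoking it from an application over the runtime API:

```python
import json

import boto3
import sagemaker
from sagemaker.model import Model

# Placeholder values: Cisco's actual container images, model artifacts, and
# endpoint configuration are not described in this post.
role = sagemaker.get_execution_role()
model = Model(
    image_uri="<llm-serving-container-image-uri>",              # e.g., an LLM serving image
    model_data="s3://<bucket>/<model-artifacts>/model.tar.gz",  # packaged model weights
    role=role,
)

# Host the model on its own real-time endpoint, sized independently of the application.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="webex-llm-demo",
)

# The application calls the endpoint over the SageMaker runtime API instead of
# loading the model inside its own container image.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="webex-llm-demo",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Summarize the customer's last message."}),
)
print(response["Body"].read().decode("utf-8"))
```

Because the application only needs the endpoint name and an HTTPS call, the model’s compute footprint can be sized and scaled without touching the application containers.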
“Applications and models work and scale fundamentally differently and have very different cost considerations. By isolating them rather than lumping them together, it becomes much easier to solve the problems separately.”
-Travis Mehlinger, Principal Engineer, Cisco.
This architectural shift has enabled Webex to harness the power of generative AI across its entire suite of collaboration and customer engagement solutions.
Previously, the team’s SageMaker endpoints used autoscaling based on a per-instance invocation metric, but it took about 6 minutes to detect the need to scale.
Introducing new predefined metric types for faster autoscaling
The Cisco Webex AI team wanted to improve inference autoscaling times, so it worked with the Amazon SageMaker team on the improvements described below.
Amazon SageMaker real-time inference endpoints provide a scalable, managed solution for hosting generative AI models. This versatile resource can accommodate multiple instances and serve one or more deployed models for real-time predictions. Customers have the flexibility to deploy a single model or multiple models using SageMaker inference components on the same endpoint, allowing for efficient handling of diverse workloads and cost-effective scaling.
To optimize real-time inference workloads, SageMaker employs application autoscaling (autoscaling). This feature dynamically adjusts both the number of instances in use and the number of deployed model copies (if you use inference components) in response to real-time changes in demand. When traffic to your endpoint exceeds a predefined threshold, autoscaling increases the available instances and deploys additional model copies to meet the growing demand. Conversely, when the workload decreases, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling keeps resources optimally used, balancing performance needs and cost considerations in real time.
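As an illustration of the inference-component model, here is a minimal sketch of registering a model as an inference component on an existing endpoint using boto3. The names are hypothetical, and the exact request fields should be confirmed against the current SageMaker API documentation:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names; assumes an endpoint already exists whose instances have
# spare accelerator and memory capacity for this component.
sm.create_inference_component(
    InferenceComponentName="llm-component",
    EndpointName="shared-gpu-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "my-registered-llm",  # a model already created in SageMaker
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # autoscaling can adjust the copy count later
)
```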
Amazon SageMaker, in collaboration with Cisco, has released a new sub-minute, high-resolution predefined metric type, SageMakerVariantConcurrentRequestsPerModelHighResolution, to achieve faster autoscaling and reduced detection times. This new high-resolution metric has been shown to reduce scaling detection times by up to 6x (compared to the existing SageMakerVariantInvocationsPerInstance metric), improving overall end-to-end inference latency by up to 50% for endpoints hosting generative AI models such as Llama3-8B.
With this release, SageMaker real-time endpoints also emit two new CloudWatch metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are well suited for monitoring and scaling Amazon SageMaker endpoints that host LLMs and foundation models (FMs).
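The new predefined metric type plugs into the same Application Auto Scaling target-tracking setup used today. The following is a hedged sketch (endpoint name, capacity limits, and target value are illustrative only) of attaching a target-tracking policy that uses SageMakerVariantConcurrentRequestsPerModelHighResolution to scale an endpoint variant:

```python
import boto3

aas = boto3.client("application-autoscaling")

endpoint_name = "webex-llm-demo"   # illustrative endpoint name
variant_name = "AllTraffic"
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Register the endpoint variant as a scalable target (1 to 4 instances here).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy on the new high-resolution concurrency metric.
aas.put_scaling_policy(
    PolicyName="concurrent-requests-high-resolution",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
        "TargetValue": 5.0,        # illustrative target for concurrent requests per model
        "ScaleInCooldown": 180,
        "ScaleOutCooldown": 60,
    },
)
```

Because the underlying metric is emitted at sub-minute resolution, a policy like this can react to concurrency spikes much sooner than one based on SageMakerVariantInvocationsPerInstance.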
Cisco’s evaluation of faster autoscaling for GenAI inference
Cisco evaluated the new predefined metric types in Amazon SageMaker to speed up autoscaling for its generative AI workloads. Using the new SageMakerVariantConcurrentRequestsPerModelHighResolution metric, Cisco saw up to a 50% improvement in end-to-end inference latency compared to the existing SageMakerVariantInvocationsPerInstance metric.
The setup used a generative AI model hosted on a SageMaker real-time inference endpoint, with SageMaker’s autoscaling feature dynamically adjusting both the number of instances and the number of model copies to accommodate real-time changes in demand. The new SageMakerVariantConcurrentRequestsPerModelHighResolution metric detected the need to scale up to 6x faster, resulting in faster autoscaling and reduced latency.
In addition, SageMaker now emits the new CloudWatch metrics ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are well suited for monitoring and scaling endpoints hosting LLMs and FMs. This enhanced autoscaling capability has been a game changer for Cisco, helping to improve the performance and efficiency of its critical generative AI applications.
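For monitoring, the new metrics can be pulled from CloudWatch like any other endpoint metric. Here is a small sketch, assuming the standard AWS/SageMaker namespace and EndpointName/VariantName dimensions (verify these for your endpoint):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Pull the last 15 minutes of the ConcurrentRequestsPerModel metric for one variant.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",  # assumed namespace for endpoint metrics
    MetricName="ConcurrentRequestsPerModel",
    Dimensions=[
        {"Name": "EndpointName", "Value": "webex-llm-demo"},  # placeholder endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```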
“We are really pleased with the performance improvements that the new auto scaling metrics in Amazon SageMaker have brought. High-resolution scaling metrics have significantly reduced latency during initial load and scale-out for our Gen AI workloads, and we look forward to rolling this feature out broadly across our infrastructure.”
-Travis Mehlinger, Principal Engineer, Cisco.
Going forward, Cisco also plans to work with the SageMaker Inference team to drive improvements in other variables that affect autoscaling latency, such as model download and load times.
Conclusion
Cisco’s Webex AI team continues to leverage Amazon SageMaker Inference to power generative AI experiences across the Webex portfolio. In its evaluations of SageMaker’s faster autoscaling, Cisco saw up to 50% latency improvements on its GenAI inference endpoints. As the Webex AI team continues to push the boundaries of AI-driven collaboration, its partnership with the Amazon SageMaker team will be integral to delivering further improvements and advanced GenAI inference capabilities. With this new capability, Cisco aims to further optimize AI inference performance, broaden deployments across multiple AWS Regions, and deliver even more impactful generative AI capabilities to its customers.
About the Authors
Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps his team develop and operate cloud-native AI and ML capabilities that support Webex AI features for customers around the world. In his spare time, he enjoys BBQ, playing video games, and traveling around the US and UK in go-karts.
Karthik Raghunathan is the Senior Director of Voice, Language, and Video AI for the Webex Collaboration AI group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven capabilities for the Webex collaboration portfolio. Prior to joining Cisco, he held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.
Praveen Chamarthi is a Senior AI/ML Specialist at Amazon Web Services. He is passionate about all things AI/ML and AWS, and helps customers across the Americas scale, innovate, and operate their ML workloads efficiently on AWS. In his spare time, he enjoys reading and watching sci-fi movies.
Saurabh Trikhande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is driven by the goal of democratizing AI. He focuses on key challenges related to deploying complex AI applications, multi-tenant models, cost optimization, and making generative AI models easier to deploy. In his spare time, he enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Ravi Thakur is a Senior Solutions Architect supporting strategic industries at AWS, based in Charlotte, NC. His career spans various industries, including banking, automotive, telecommunications, insurance, and energy. Ravi’s expertise is driven by a focus on solving complex business challenges for his customers using distributed, cloud-native, and well-architected design patterns. His proficiency spans microservices, containerization, AI/ML, generative AI, and more. Currently, Ravi leverages his ability to deliver proven, tangible benefits to help AWS strategic customers on their personalized digital transformation journeys.