This post was co-written with Deepali Rajale of Karini AI.
Karini AI, the leading generative AI foundation platform built on AWS, enables customers to rapidly build secure, high-quality generative AI applications. Generative AI is more than a new technology; it is transforming how businesses operate. Pilot projects are easy to start, but most businesses struggle to move beyond that phase. According to Everest Research, more than 50% of projects face roadblocks and never progress past the pilot stage due to a lack of standardized or established generative AI operational practices.
Karini AI provides a robust and user-friendly generative AI foundation platform that enables enterprises to build, manage, and deploy generative AI applications. Beginners and seasoned practitioners alike can develop and deploy generative AI applications for a variety of use cases beyond simple chatbots, including agents, multi-agents, generative BI, and batch workflows. The no-code platform is ideal for rapid experimentation, building proofs of concept (PoCs), and moving quickly to production, with built-in guardrails for safety and observability for troubleshooting. The platform includes offline and online quality assessment frameworks to evaluate quality during experimentation and continuously monitor applications after deployment. Karini AI’s intuitive prompt playground lets you create, manage, and tune prompts and compare them across models from different providers, and it supports iterative testing of agent and multi-agent prompts. For production deployment, you can assemble data ingestion pipelines using no-code recipes, create knowledge bases, and deploy RAG or agent chains. Platform owners can monitor cost and performance in real time with deep observability, seamlessly integrate with Amazon Bedrock for LLM inference, and benefit from a wide range of enterprise connectors and data pre-processing techniques.
The following diagram illustrates how Karini AI provides a comprehensive generative AI foundation platform that spans the entire application lifecycle, offering a unified framework for development, deployment, and management that accelerates time to market and optimizes resource utilization.
In this post, we show how Karini AI migrated their vector embedding models from Kubernetes to Amazon SageMaker endpoints, resulting in a 30% increase in concurrency and over 23% reduction in infrastructure costs.
Karini AI’s data ingestion pipeline for creating vector embeddings
To build practical generative AI applications, it is important to enrich large language models (LLMs) with new data. This is where Retrieval Augmented Generation (RAG) comes in. RAG enhances the capabilities of LLMs by incorporating external data, producing state-of-the-art performance on knowledge-intensive tasks. Karini AI provides no-code solutions for creating generative AI applications using RAG. These solutions include two main components: a data ingestion pipeline for building a knowledge base, and a system for knowledge retrieval and summarization. Together, these pipelines simplify the development process and make it easier to create powerful AI applications.
Data Ingestion Pipeline
Ingesting data from diverse sources is essential for Retrieval Augmented Generation (RAG). Karini AI’s data ingestion pipeline can connect to multiple data sources, such as Amazon S3, Amazon Redshift, Amazon Relational Database Service (Amazon RDS), websites, and Confluence, and it handles both structured and unstructured data. Source data is pre-processed, chunked, and converted into vector embeddings before being stored in a vector database. Karini AI’s platform provides flexibility by offering a variety of embedding models from its model hub, simplifying the creation of vector embeddings for advanced AI applications.
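To make the chunk-and-embed step concrete, the following is a minimal sketch of how a document might be split into overlapping chunks and converted into vectors. It is not Karini AI’s actual implementation: the chunk sizes, the embedding model, and the in-memory stand-in for a vector database are all assumptions chosen for illustration.

```python
# Illustrative chunk -> embed -> store sketch, not Karini AI's implementation.
# Chunk sizes, the model ID, and the in-memory "vector store" are assumptions.
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping fixed-size character chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks


# Any embedding model from a model hub could be used here; this one is an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

document = open("example_document.txt").read()  # pre-processed source text
chunks = chunk_text(document)
embeddings = model.encode(chunks)  # one vector per chunk

# Stand-in for writing to a vector database (OpenSearch, pgvector, and so on)
vector_store = [
    {"id": i, "text": chunk, "vector": vec.tolist()}
    for i, (chunk, vec) in enumerate(zip(chunks, embeddings))
]
```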
Below is a screenshot of Karini AI’s no-code data ingestion pipeline.
Karini AI’s Model Hub streamlines adding models by integrating with major underlying model providers such as Amazon Bedrock and self-managed service platforms.
Infrastructure challenges
As customers take on more complex use cases and their datasets grow in size and complexity, Karini AI must scale the data ingestion process efficiently, providing high concurrency for creating vector embeddings. In addition, state-of-the-art embedding models, such as those featured on the MTEB leaderboard, evolve rapidly and are often not yet available on managed platforms, so Karini AI needs to host them itself.
Before migrating to Amazon SageMaker, we deployed our models on self-managed Kubernetes (K8s) running on EC2 instances. Kubernetes offered great flexibility to quickly deploy Hugging Face models, but it soon required our engineering team to manage many aspects of scaling and deployment. With our existing setup, we faced the following challenges that we had to address to improve efficiency and performance:
- Keeping up with state-of-the-art (SOTA) models: Managing different deployment manifests for each model type (classifiers, embeddings, autocomplete, and so on) was time-consuming and error-prone, and we also had to maintain the logic that determined memory allocation for each model type.
- Managing dynamic concurrency: A big challenge with Kubernetes-hosted models was achieving high dynamic concurrency. We aimed to maximize the performance of our endpoints to reach our target transactions per second (TPS) while meeting strict latency requirements.
- Rising costs: While Kubernetes (K8s) offers robust capabilities, the dynamic nature of our data ingestion pipelines left instances underutilized, driving up costs.
Our search for an inference platform led us to Amazon SageMaker, a solution that efficiently manages models to increase concurrency, meet customer SLAs, and scale down services when not needed. The reliability of SageMaker’s performance gave us confidence in its capabilities.
Choosing Amazon SageMaker was a strategic decision for Karini AI because it balances higher concurrency with low cost, providing a cost-effective solution for our needs. SageMaker’s ability to scale and maximize concurrency while ensuring sub-second latency enables us to serve a wide variety of generative AI use cases, making it a long-term investment for our platform.
Amazon SageMaker is a fully managed service that enables developers and data scientists to rapidly build, train, and deploy machine learning (ML) models. With SageMaker, you can deploy ML models to hosted endpoints and get real-time inference results. You can easily view performance metrics for your endpoints in Amazon CloudWatch, automatically scale your endpoints based on traffic, and update models in production without compromising availability.
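As a concrete illustration of this hosting model, the following sketch deploys an open-source embedding model to a SageMaker real-time endpoint using the SageMaker Python SDK. The model ID, framework versions, instance type, and endpoint name are illustrative assumptions, not the exact configuration Karini AI uses.

```python
# Sketch of deploying an open-source embedding model to a SageMaker real-time
# endpoint with the SageMaker Python SDK. The model ID, framework versions,
# instance type, and endpoint name are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

embedding_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "BAAI/bge-base-en-v1.5",  # example embedding model
        "HF_TASK": "feature-extraction",
    },
    role=role,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

predictor = embedding_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="embedding-endpoint",
)
```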
The following diagram shows the data ingestion pipeline architecture for Karini AI using an Amazon SageMaker model endpoint.
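Within such a pipeline, the embedding step amounts to invoking the endpoint for each batch of document chunks. The following is a minimal sketch of that call; the endpoint name is an assumption carried over from the deployment example above, and the exact response shape depends on the model and inference container you deploy.

```python
# Sketch of generating embeddings for document chunks by invoking a SageMaker
# endpoint during ingestion. The endpoint name is an assumption, and the exact
# response shape depends on the model and inference container you deploy.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

chunks = ["First document chunk ...", "Second document chunk ..."]

response = runtime.invoke_endpoint(
    EndpointName="embedding-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": chunks}),
)
embeddings = json.loads(response["Body"].read())

# The vectors can now be written to the vector database alongside chunk metadata.
print(len(embeddings), "embedding payloads returned")
```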
Benefits of using SageMaker hosting
Amazon SageMaker has provided many direct and indirect benefits to our generative AI ingestion pipeline:
- Reducing technical debt: As a managed service, Amazon SageMaker frees ML engineers from the burden of managing inference infrastructure, allowing them to focus on core platform features instead of accumulating technical debt.
- Meeting customer SLAs: Creating a knowledge base is a bursty task that can demand high concurrency for vector embedding generation, while query-time embedding adds only light overhead. Based on customer SLAs and data volume, you can choose batch inference, real-time hosting with autoscaling, or serverless hosting (the sketch after this list shows one way to configure autoscaling). Amazon SageMaker also provides instance type recommendations for your embedding model.
- Reducing infrastructure costs: SageMaker is a pay-as-you-go service that allows us to create batch or real-time endpoints when demand arises and tear them down when the work is complete. This approach reduced our infrastructure costs by over 23% compared to the Kubernetes (K8s) platform.
- SageMaker JumpStart: SageMaker JumpStart provides access to state-of-the-art (SOTA) models and optimized inference containers, making it ideal for quickly making new models available to customers.
- Amazon Bedrock compatibility: Karini AI integrates with Amazon Bedrock for large language model (LLM) inference. The Amazon Bedrock Custom Model Import feature lets you reuse the model weights you host on SageMaker in Amazon Bedrock, so you can maintain a common code base and interchange the two services for your workload.
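The following sketch shows one way to combine the autoscaling and pay-as-you-go points above: registering target-tracking autoscaling for an endpoint variant, then deleting the endpoint when an ingestion job completes. The endpoint name, capacity limits, and target invocations value are illustrative assumptions rather than Karini AI’s production settings.

```python
# Sketch of target-tracking autoscaling for a SageMaker endpoint variant, plus
# tearing the endpoint down when ingestion finishes. Endpoint name, capacity
# limits, and the target invocations value are illustrative assumptions.
import boto3

endpoint_name = "embedding-endpoint"
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

autoscaling = boto3.client("application-autoscaling")

# Allow the variant to scale between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance to track bursty ingestion traffic.
autoscaling.put_scaling_policy(
    PolicyName="embedding-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)

# When the ingestion job completes, delete the endpoint to stop paying for it.
sagemaker_client = boto3.client("sagemaker")
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
```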
Conclusion
Karini AI has made significant improvements by migrating to Amazon SageMaker, achieving high performance while reducing model hosting costs. Custom third-party models can be deployed on SageMaker and made immediately available in Karini’s model hub for data ingestion pipelines. Depending on the model size and expected TPS, the model hosting infrastructure configuration can be tuned as needed. By using Amazon SageMaker for model inference, Karini AI was able to efficiently handle growing data complexity and meet concurrency needs while optimizing costs. Additionally, Amazon SageMaker makes it easy to integrate and swap in new models, allowing customers to keep taking advantage of the latest advancements in AI technologies without sacrificing performance or incurring unnecessary incremental costs.
Amazon SageMaker and Karini.ai provide a powerful platform for building, training, and deploying machine learning models at scale. By leveraging these tools, you can:
- Accelerate development: Build and train models faster using pre-built algorithms and frameworks.
- Increase accuracy: Leverage advanced algorithms and techniques to improve model performance.
- Scale easily: Deploy models into production to handle growing workloads.
- Reduce costs: Optimize resource utilization and minimize operational overhead.
Don’t miss this opportunity to gain a competitive advantage.
About the Authors
Deepali Rajale is the founder of Karini AI, with a mission to democratize generative AI across the enterprise. She blogs about generative AI and coaches clients to optimize their generative AI practices. In her spare time, she enjoys traveling, exploring new experiences, and keeping up with the latest technology trends. You can find her on LinkedIn.
Rabindra Gupta is the Worldwide GTM Leader for SageMaker and is passionate about helping customers adopt SageMaker for their machine learning and generative AI workloads. Ravi loves learning new technologies and enjoys mentoring startups on machine learning practices. You can find him on LinkedIn.