This post was co-written with Pradeep Prabhakaran from Cohere.
Retrieval Augmented Generation (RAG) is a powerful technique that helps businesses integrate real-time data and build generative artificial intelligence (AI) applications that use their own data to enable rich, interactive conversations.
RAG enables these AI applications to leverage external, authoritative sources of domain-specific knowledge, enriching the context available to their language models when answering user queries. However, the reliability and accuracy of the responses depend on finding the right source material. Therefore, honing the search process in RAG is critical to increasing the reliability of the generated responses.
RAG systems are important tools for building search and retrieval applications, but they often fall short of expectations because of suboptimal retrieval. Adding a reranking step can improve retrieval quality.
RAG is an approach that combines information retrieval techniques and natural language processing (NLP) to improve the performance of text generation or language modeling tasks. The method retrieves relevant information from large amounts of text data and uses it to augment the generation process. The key idea is to incorporate external knowledge or context into the model to improve the accuracy, variety, and relevance of the generated responses.
RAG Orchestration Workflow
RAG orchestration typically consists of two steps:
- Retrieval – RAG uses the generated search query to retrieve relevant documents from external data sources. When a query is presented, the RAG-based application searches the data sources for relevant documents or passages.
- Grounded generation – Using the retrieved documents or passages, the generative model produces grounded answers that include in-line citations to the retrieved documents. A minimal code sketch of these two steps follows this list.
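Conceptually, the orchestration can be expressed in a few lines of Python. The following is an illustrative sketch only; the retrieve and generate helpers are hypothetical placeholders for your search system and your text-generation model, not functions from any SDK used later in this post.

# Illustrative sketch only: retrieve() and generate() are hypothetical
# placeholders for your search system and your text-generation model.
def answer_with_rag(query, retrieve, generate, top_k=5):
    # Step 1: Retrieval - fetch candidate documents for the query
    documents = retrieve(query, top_k=top_k)
    # Step 2: Grounded generation - answer from the retrieved documents,
    # asking the model to cite the passages it relied on
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents))
    prompt = (
        "Answer the question using only the documents below and cite them.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)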
The following diagram illustrates the RAG workflow:
Document Search in RAG Orchestration
One technique for document retrieval in RAG orchestration is dense search, an information retrieval approach that aims to understand the meaning and intent behind a user query. Dense search finds the documents whose embeddings are closest to the embedding of the user query, as shown in the following diagram:
The goal of dense search is to map both user queries and documents (or sentences) into a dense vector space, where standard distance metrics such as cosine similarity or Euclidean distance can be used to calculate the similarity between query vectors and document vectors. Based on the calculated distance metric, documents that most closely match the meaning of the user query are returned to the user.
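As a minimal sketch of this idea, the following code uses NumPy to compute cosine similarity between a query embedding and a set of document embeddings and returns the closest documents. The embeddings here are random stand-ins; in practice they would come from an embedding model.

import numpy as np

def dense_search(query_vec, doc_vecs, top_k=3):
    # Normalize so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the top_k most similar documents, highest score first
    return np.argsort(-scores)[:top_k], scores

# Random stand-in embeddings: 10 documents and 1 query, 768 dimensions each
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10, 768))
query_vec = rng.normal(size=768)
top_idx, scores = dense_search(query_vec, doc_vecs)
print(top_idx, scores[top_idx])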
The quality of the final response to a search query depends heavily on the relevance of the retrieved documents. Dense search models are very efficient and can scale to large datasets, but because of the simplicity of their methodology, they struggle with more complex data and questions. Document vectors encode the meaning of the text in a compressed representation (usually vectors with between 768 and 1,536 dimensions). Because the information is compressed into a single vector, some of it is lost, and the most relevant information does not always appear at the top of the vector search results.
Improve search accuracy with Cohere Rerank
To address this issue, search engineers use two-stage retrieval as a way to improve search quality. In this two-stage system, a first-stage model (an embedding model or retrieval function) retrieves a set of candidate documents from a large dataset. Then, a second-stage model (a reranker) is used to rerank the documents retrieved by the first-stage model.
Reranking models, such as Cohere Rerank, take a query-document pair and output a relevance score. This score can be used to order documents by how relevant they are to a search query. Among reranking approaches, the Cohere Rerank model stands out for its ability to significantly improve search accuracy. It differs from traditional embedding models by using deep learning to directly evaluate the alignment between each document and the query. Cohere Rerank outputs relevance scores by processing the query and each document together, resulting in a more nuanced document selection process.
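The two-stage pattern can be sketched as follows. This is an illustration under stated assumptions: dense_retrieve stands in for a first-stage retriever and rerank_scores for a reranking model that scores query-document pairs; neither name corresponds to a real API used later in this post.

def two_stage_search(query, dense_retrieve, rerank_scores, first_stage_k=25, final_k=3):
    # Stage 1: fast, scalable dense retrieval over the full corpus
    candidates = dense_retrieve(query, top_k=first_stage_k)
    # Stage 2: score every (query, candidate) pair with the reranker
    scores = rerank_scores(query, candidates)
    # Keep the candidates with the highest relevance scores
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]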
In the following example, the application was presented with the query "When was the Transformers paper co-authored by Aidan Gomez published?" With top-k retrieval and k = 6, the retrieved set shown in the image did contain the most relevant result, but near the bottom of the list; with k = 3, the most relevant documents would not have been included in the retrieved results at all.
Cohere Rerank reassesses and sorts the retrieved documents based on additional criteria such as semantic content, user intent, and contextual relevance, outputting a relevance score for each. This score is then used to sort the documents by their relevance to the query. The following image shows the result after reranking.
By applying Cohere Rerank after the first-stage search, RAG orchestration gains the benefits of both approaches. While the first-stage search retrieves relevant items based on proximity matches in the vector space, reranking ensures that the most contextually relevant results are surfaced at the top. The following diagram illustrates this efficiency gain.
The latest version of Cohere Rerank, Rerank 3, is built specifically to power enterprise search and RAG systems. Rerank 3 offers cutting-edge capabilities for enterprise search, including:
- A 4k context length, which significantly improves search quality for long documents
- The ability to search multi-aspect and semi-structured data (emails, invoices, JSON documents, code, tables, and so on)
- Coverage of more than 100 languages
- Improved latency and lower total cost of ownership (TCO)
The endpoint takes a query and a list of documents and produces an ordered array with each document assigned a relevance score, allowing for a powerful semantic improvement to the search quality of keyword and vector search systems without the need for an overhaul or replacement.
Developers and enterprises can access Rerank through Cohere’s hosted API and Amazon SageMaker. This post walks you through the step-by-step process of using Cohere Rerank on Amazon SageMaker.
Solution overview
The solution follows these high-level steps:
- Subscribe to a model package
- Create an endpoint and run real-time inference
Prerequisites
To complete this tutorial, you need the following prerequisites:
- The cohere-aws notebook.
This is a reference notebook and cannot be run unless you make the modifications suggested in the notebook. You must open it from an Amazon SageMaker notebook instance or Amazon SageMaker Studio because it contains elements that render correctly in the Jupyter interface.
- An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess policy attached. To successfully deploy this machine learning (ML) model, choose one of the following options:
- If your AWS account doesn’t already have a subscription to Cohere Rerank 3 Model – Multilingual, your IAM role must have the following three permissions, and it must also have permission to create an AWS Marketplace subscription in your AWS account:
  - aws-marketplace:ViewSubscriptions
  - aws-marketplace:Unsubscribe
  - aws-marketplace:Subscribe
- If your AWS account has a subscription to Cohere Rerank 3 Model – Multilingual, you can skip the step of subscribing to the model package.
Refrain from using full access in production environments; a security best practice is to adopt the principle of least privilege.
Implementing Rerank 3 on Amazon SageMaker
To use Cohere Rerank to improve RAG performance, follow the steps in the next section.
Subscribe to a model package
To subscribe to a model package, follow these steps:
- In AWS Marketplace, open the model package listing page Cohere Rerank 3 Model – Multilingual.
- Choose Continue to Subscribe.
- On the Subscribe to this software page, review the End User License Agreement (EULA), pricing, and support terms, and then choose Accept Offer.
- Choose Continue to configuration, and then choose your Region. The Product ARN is displayed, as shown in the following screenshot. This is the Amazon Resource Name (ARN) of the model package that you must specify when you create a deployable model using Boto3. Copy the ARN that corresponds to your Region and enter it in the following cell.
The code snippets included in this post are taken from the cohere-aws notebook. If you run into any issues with the code, refer to the latest version of the notebook.
!pip install --upgrade cohere-aws
# if you upgrade the package, you need to restart the kernel
from cohere_aws import Client
import boto3
The configuration page for AWS CloudFormation from the preceding step appears as shown in the following screenshot. Note the Product ARN; the last part of the product ARN is used as the value of the cohere_package variable in the following code.
cohere_package = "cohere-rerank-multilingual-v3--13dba038aab73b11b3f0b17fbdb48ea0"
model_package_map = {
"us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
"us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
"us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
"us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
"ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
"eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
"eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
"eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
"eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
"eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
"ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
"ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
"ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
"ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
"ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
"sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}
region = boto3.Session().region_name
if region not in model_package_map.keys():
    raise Exception(f"Current boto3 session region {region} is not supported.")
model_package_arn = model_package_map[region]
Create an endpoint and run real-time inference
If you want to understand how real-time inference works with Amazon SageMaker, see the Amazon SageMaker Developer Guide.
Create an endpoint
To create the endpoint, use the following code:
co = Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual-v3-0", instance_type="ml.g5.2xlarge", n_instances=1)
# If the endpoint is already created, you just need to connect to it
# co.connect_to_endpoint(endpoint_name="cohere-rerank-multilingual-v3-0")
Once the endpoint is created, you can perform real-time inference.
Create the input payload
To create the input payload, use the following code:
documents = [
{"Title":"Contraseña incorrecta","Content":"Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
{"Title":"Confirmation Email Missed","Content":"Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
{"Title":"أسئلة حول سياسة الإرجاع","Content":"مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
{"Title":"Customer Support is Busy","Content":"Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
{"Title":"Falschen Artikel erhalten","Content":"Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
{"Title":"Customer Service is Unavailable","Content":"Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
{"Title":"Return Policy for Defective Product","Content":"Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
{"Title":"收到错误物品","Content":"早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
{"Title":"Return Defective Product","Content":"Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]
Perform real-time inference
To perform real-time inference, use the following code:
response = co.rerank(documents=documents, query='What emails have been about returning items?', rank_fields=["Title","Content"], top_n=5)
Visualize the output
To visualize the output, use the following code:
print(f'Documents: {response}')
The following screenshot shows the output response.
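To relate the reranked results back to the original documents, you can iterate over the response. The following is a minimal sketch that assumes the response follows the Cohere SDK's rerank result shape, where each result exposes the index of the original document and a relevance_score; check your notebook output for the exact fields returned by your client version.

# Assumes each result carries .index and .relevance_score (Cohere SDK shape)
for result in response.results:
    doc = documents[result.index]
    print(f"{result.relevance_score:.4f}  {doc['Title']}")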
Clean up
To avoid recurring charges, clean up the resources you created in this tutorial by following these steps:
Delete the endpoint
Now that you have successfully run real-time inference, you no longer need the endpoint and can delete it to avoid being charged.
co.delete_endpoint()
co.close()
Unsubscribe from the listing (optional)
If you want to unsubscribe from a model package, follow these steps: Before canceling your subscription, make sure you do not have any deployable models created from the model package or using its algorithms. You can find this information by checking the container name associated with the model.
To unsubscribe from the product in AWS Marketplace, follow these steps:
- On the Your Software subscriptions page, choose the Machine Learning tab.
- Locate the listing that you want to cancel the subscription for, and then choose Cancel subscription.
Summary
RAG is a powerful technique for developing AI applications that integrate real-time data and use their own information to enable rich, interactive conversations. RAG leverages external, domain-specific knowledge sources to enhance AI responses, but its effectiveness depends on finding the right source material. This post focused on using Cohere Rerank to improve the search efficiency and accuracy of RAG systems. RAG orchestration typically involves two steps: retrieving relevant documents and generating grounded answers. Dense search is efficient for large datasets but can struggle with complex data and questions because of information compression. Cohere Rerank uses deep learning to evaluate the alignment between each document and the query, outputting a relevance score that enables more nuanced document selection.
Customers can find Cohere Rerank 3 and Cohere Rerank 3 Nimble on Amazon SageMaker JumpStart.
About the Authors
Shashi Raina is a Senior Partner Solutions Architect at Amazon Web Services (AWS), specializing in supporting generative AI (GenAI) startups. With nearly six years of experience at AWS, Shashi has developed deep expertise across multiple domains, including DevOps, analytics, and generative AI.
Pradeep Prabhakaran is a Senior Manager of Solutions Architecture at Cohere. In his current role at Cohere, Pradeep serves as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere's cutting-edge generative AI platform.