AI chatbots and virtual assistants have become increasingly popular in recent years due to advances in large language models (LLMs). Trained on massive datasets, these models incorporate a memory component in their architectural design and are able to understand and retain the context of text.
The most common use cases for chatbot assistants focus on a few key areas: improving customer experience, increasing employee productivity and creativity, and optimizing business processes such as customer support, troubleshooting, and searching internal and external knowledge bases.
Despite these capabilities, the main challenge for a chatbot is to generate high-quality, accurate responses. One way to address this challenge is to use Retrieval Augmented Generation (RAG). RAG is the process of optimizing the output of an LLM so that it references a trusted knowledge base outside its training data sources before generating a response. Reranking seeks to improve search relevance by using a different model to reorder the result set returned by the retriever. In this article, we explain how the two techniques, RAG and reranking, can help you improve chatbot responses using a knowledge base in Amazon Bedrock.
Solution overview
RAG is a technique that combines the strengths of knowledge base retrieval and generative models for text generation. It first retrieves relevant documents from a knowledge base, then feeds them as context to a generative model to produce the final output. There are many benefits to using the RAG approach for building chatbots. For example, retrieving relevant content before generating a response provides more relevant and consistent answers, which improves the conversation flow. Also, compared to purely generative models, RAG scales better with more data and does not require fine-tuning the model as new data is added to the knowledge base. Additionally, the retrieval component allows the model to incorporate external knowledge by retrieving relevant background information from a database. This approach helps provide factual, detailed, and knowledgeable responses.
To find answers, RAG uses vector search across documents. The advantages of vector search are speed and scalability. Instead of scanning every document to find answers, the RAG approach converts the text (the knowledge base) into embeddings and stores these embeddings in a database. An embedding is a compressed representation of a document, expressed as an array of numbers. After the embeddings are stored, vector search queries the vector database to find similarities based on the vectors associated with the documents. Typically, vector search returns the top-k documents most relevant to the user’s question. However, because the similarity algorithm in a vector database works on vectors and not documents, vector search does not necessarily return the most relevant information in the top-k results. If the most relevant context is not passed to the LLM, it directly affects the accuracy of the response.
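As a simple illustration (not the Bedrock implementation, just a toy example with made-up vectors), the following sketch shows how a top-k vector search ranks documents by cosine similarity between the query embedding and the stored document embeddings:

```python
import numpy as np

# Toy embeddings; in practice these come from an embedding model such as Amazon Titan Text Embeddings.
doc_embeddings = {
    "doc-1": np.array([0.9, 0.1, 0.0]),
    "doc-2": np.array([0.2, 0.8, 0.1]),
    "doc-3": np.array([0.7, 0.3, 0.2]),
}
query_embedding = np.array([0.8, 0.2, 0.1])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and keep the top-k most similar ones.
k = 2
scores = {doc_id: cosine_similarity(query_embedding, emb) for doc_id, emb in doc_embeddings.items()}
top_k = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
print(top_k)  # doc-1 and doc-3 score highest for this query
```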
Reranking is a technique that can further improve the response by selecting the best option among multiple candidate results. The following architecture shows how the reranking solution works.

Architectural diagram of integrating a reranking model with a Bedrock knowledge base
Let’s create a question answering solution. We’ll ingest the 1925 novel “The Great Gatsby” by American author F. Scott Fitzgerald, which is available through Project Gutenberg. We’ll implement an end-to-end RAG workflow using an Amazon Bedrock knowledge base and ingest the embeddings into an Amazon OpenSearch Serverless vector search collection. We’ll then retrieve answers using both standard RAG and two-stage RAG with reranking, and compare the results of the two methods.
Code samples are available in this GitHub repository.
The following sections provide high-level steps.
- Prepare the dataset.
- Use an Amazon Bedrock LLM to generate questions from the document.
- Create a knowledge base that includes this book.
- Use the knowledge base retrieve_and_generate API to get answers.
- Evaluate the responses using the RAGAS framework.
- Run a two-stage RAG: use the knowledge base retrieve API to fetch the context again, then apply reranking to the context before generating the answer.
- Evaluate the two-stage RAG responses using the RAGAS framework.
- Compare the results and performance of each RAG approach.
For efficiency, we have provided sample code in the notebook to generate a set of questions and answers. These Q&A pairs are used in the RAG evaluation process. We strongly recommend having a human verify the accuracy of each question and answer.
The following sections explain the main steps with code blocks.
Prerequisites
To clone the GitHub repository to your local machine, open a terminal window and run the following command:
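The repository URL is available from the link above; the clone command follows the usual pattern (the values below are placeholders for the actual URL and directory name):

```
git clone <repository-url>
cd <repository-directory>
```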
Prepare the dataset
Download the book from the Project Gutenberg website. In this post, we’ll create 10 large documents from the book and upload them to Amazon Simple Storage Service (Amazon S3).
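As a rough sketch of this step (the Project Gutenberg URL, bucket name, and chunking below are assumptions for illustration; the notebook may split the text differently), you could download, split, and upload the book like this:

```python
import boto3
import requests

# Assumed values for illustration; replace with your own.
BOOK_URL = "https://www.gutenberg.org/cache/epub/64317/pg64317.txt"  # plain-text edition of The Great Gatsby
BUCKET = "my-kb-source-bucket"
NUM_DOCS = 10

text = requests.get(BOOK_URL, timeout=30).text

# Split the book into 10 roughly equal documents.
chunk_size = len(text) // NUM_DOCS + 1
documents = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Upload each document to the S3 bucket that will back the knowledge base.
s3 = boto3.client("s3")
for idx, doc in enumerate(documents):
    s3.put_object(Bucket=BUCKET, Key=f"great-gatsby/part-{idx:02d}.txt", Body=doc.encode("utf-8"))
```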
Create a Bedrock knowledge base
If you are new to Amazon Bedrock Knowledge Bases, see Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models, which explains how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.
In this step, you create the knowledge base using a Boto3 client, convert your documents into embeddings (embeddingModelArn) with Amazon Titan Text Embeddings V2, and specify the S3 bucket you created earlier as the data source (dataSourceConfiguration).
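A condensed sketch of these calls is shown below. The role ARN, collection ARN, index name, and bucket ARN are placeholders, and the parameter shapes should be checked against the current boto3 bedrock-agent documentation:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder ARNs and names for illustration.
kb = bedrock_agent.create_knowledge_base(
    name="great-gatsby-kb",
    roleArn="arn:aws:iam::111122223333:role/BedrockKnowledgeBaseRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            # Amazon Titan Text Embeddings V2
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/example-collection-id",
            "vectorIndexName": "great-gatsby-index",
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "metadata",
            },
        },
    },
)
kb_id = kb["knowledgeBase"]["knowledgeBaseId"]

# Register the S3 bucket as the data source and start ingestion.
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="great-gatsby-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-kb-source-bucket"},
    },
)
bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds["dataSource"]["dataSourceId"],
)
```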
Generate questions from documents
We use Anthropic’s Claude on Amazon Bedrock to generate a list of 10 questions and their corresponding answers. The Q&A data serves as the foundation for evaluating each RAG approach we implement. We define the answers generated in this step as the ground truth data.
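The notebook contains the full prompt and parsing logic; the following is only a minimal sketch using the Bedrock Runtime Messages API. The model ID, prompt, and output handling are assumptions, so adjust them to a Claude model enabled in your account:

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

excerpt = documents[0]  # one of the chunks created in the dataset preparation step

prompt = (
    "You are given an excerpt from 'The Great Gatsby'. "
    "Generate 10 question-and-answer pairs that can be answered solely from the excerpt. "
    "Return them as a JSON list of objects with 'question' and 'answer' keys.\n\n"
    f"Excerpt:\n{excerpt}"
)

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # adjust to a model enabled in your account
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
qa_text = json.loads(response["body"].read())["content"][0]["text"]
qa_pairs = json.loads(qa_text)  # list of {"question": ..., "answer": ...}
```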
Use the Knowledge Base API to get answers
Use the generated questions with the knowledge base retrieve_and_generate API to get answers.
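A minimal sketch of this call is shown below; the knowledge base ID and model ARN are placeholders:

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_knowledge_base(question, kb_id, model_arn):
    """Standard RAG: retrieve context and generate an answer in one call."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    return response["output"]["text"]

# Placeholder identifiers for illustration.
answer = ask_knowledge_base(
    "Who is Jay Gatsby?",
    kb_id="XXXXXXXXXX",
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
)
print(answer)
```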
Assessing RAG responses using the RAGAS framework
Here, we evaluate the effectiveness of the RAG responses using a framework called RAGAS, which provides a set of metrics to evaluate different aspects. In this example, we evaluate the responses on the following aspects (a minimal evaluation sketch follows the list):
- Answer relevancy – This metric assesses how relevant the generated answer is to the given prompt. Incomplete answers or answers that contain redundant information receive a low score. It is calculated from the question and the answer and ranges from 0 to 1, with higher scores indicating greater relevance.
- Answer similarity – This metric evaluates the semantic similarity between the generated answer and the ground truth. It is calculated from the ground truth and the answer and ranges from 0 to 1; a higher score means better alignment between the generated answer and the ground truth.
- Context relevancy – This metric measures the relevance of the retrieved context, calculated from both the question and the contexts. Values range from 0 to 1, with higher values indicating higher relevance.
- Answer correctness – This metric gauges the accuracy of the generated answer compared to the ground truth. It is calculated from the ground truth and the answer and ranges from 0 to 1; a higher score indicates closer alignment between the generated answer and the ground truth.
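The following is a minimal evaluation sketch. It assumes the questions, generated answers, retrieved contexts, and human-verified ground truth answers from the previous steps have been collected into lists, and that the judge LLM and embeddings are LangChain wrappers around Bedrock models; column names and arguments can vary across RAGAS versions:

```python
from datasets import Dataset
from langchain_community.chat_models import BedrockChat
from langchain_community.embeddings import BedrockEmbeddings
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_relevancy,
)

# Collected from the previous steps (one entry per evaluation question); truncated placeholders shown here.
questions = ["Who is Jay Gatsby?"]
answers = ["..."]            # answers generated by the RAG pipeline
contexts = [["...", "..."]]  # retrieved passages for each question
ground_truths = ["..."]      # answers generated earlier and verified by a human

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
})

# Bedrock-backed judge model and embeddings (model IDs are assumptions).
judge_llm = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0")
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

result = evaluate(
    eval_dataset,
    metrics=[answer_relevancy, answer_similarity, answer_correctness, context_relevancy],
    llm=judge_llm,
    embeddings=embeddings,
)
print(result)
```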
Summary report of standard RAG approach based on RAGAS evaluation:
answer_relevancy: 0.9006225160334027
answer_similarity: 0.7400904157096762
answer_correctness: 0.32703043056663855
context_relevancy: 0.024797687553157175
Two-stage RAG: retrieval and reranking
Now that we have results from the retrieve_and_generate API, let’s consider a two-stage retrieval approach that extends the standard RAG approach by integrating it with a reranking model. In the RAG context, a reranking model is applied after the retriever returns the initial set of contexts. The reranking model takes the list of results and reorders each result based on the similarity between the context and the user query. In this example, we use a powerful reranking model called bge-reranker-large, which is available on the Hugging Face Hub and is free for commercial use. The contexts are fetched from the knowledge base using the retrieve API and then reranked with the reranking model deployed as an Amazon SageMaker endpoint. We provide sample code for deploying the reranking model on SageMaker in our GitHub repository.
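Below is a sketch of that two-stage flow. The knowledge base ID, SageMaker endpoint name, and the reranker’s request and response format are assumptions that depend on how bge-reranker-large was deployed, so adjust them to match your deployment:

```python
import json
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
sagemaker_runtime = boto3.client("sagemaker-runtime")

KB_ID = "XXXXXXXXXX"                      # placeholder knowledge base ID
RERANKER_ENDPOINT = "bge-reranker-large"  # placeholder SageMaker endpoint name

def two_stage_retrieve(question, top_k=10, top_n=3):
    # Stage 1: vector search against the knowledge base.
    retrieval = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    passages = [r["content"]["text"] for r in retrieval["retrievalResults"]]

    # Stage 2: score each (question, passage) pair with the reranking model.
    # The payload format here is an assumption and depends on your deployment.
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=RERANKER_ENDPOINT,
        ContentType="application/json",
        Body=json.dumps({"inputs": [{"text": question, "text_pair": p} for p in passages]}),
    )
    scores = [item["score"] for item in json.loads(response["Body"].read())]

    # Keep the top-n passages by reranker score as the final context.
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```

The reranked passages are then passed as context to an LLM, for example through the Bedrock converse or invoke_model API, to generate the final answer.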
Assessing two-stage RAG responses using the RAGAS framework
We evaluate the answers generated by the two-stage retrieval process. Below is a summary report based on the RAGAS evaluation.
answer_relevancy: 0.841581671275458
answer_similarity: 0.7961827348349313
answer_correctness: 0.43361356731293665
context_relevancy: 0.06049484724216884
Compare the results
Let’s compare the test results. As shown in the following image, two-stage retrieval with reranking improves context relevancy, answer correctness, and answer similarity, which are important for improving the accuracy of the RAG process.

Evaluation metrics for RAG and two-stage retrieval
We also measured the end-to-end RAG latency for both approaches. The results are shown in the following metrics and the corresponding graph:
Standard RAG latency: 76.59s
Two Stage Retrieval latency: 312.12s

Latency Metrics for RAG and Two-Stage Retrieval Processes
In summary, hosting the reranking model (bge-reranker-large) on an ml.m5.xlarge instance introduces roughly 4x the latency compared to the standard RAG approach. We recommend testing different reranking model variants and instance types to get the best performance for your use case.
Conclusion
In this post, we demonstrated how to implement a two-stage retrieval process by integrating a reranking model. We explored how integrating a reranking model with an Amazon Bedrock knowledge base can improve performance. Finally, we used the open source framework RAGAS to report context relevancy, answer relevancy, answer similarity, and answer correctness metrics for both approaches.
Try this retrieval process now and share your feedback in the comments section.
About the Authors
Way Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business goals using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities such as camping, fishing, and hiking with his family.
Pallavi Nargund is a Principal Solutions Architect at AWS. As a cloud technology advocate, she works with customers to understand their goals and challenges, and provides prescriptive guidance to achieve them with AWS services. She is passionate about women in technology and is a core member of Amazon’s Women in AI/ML. She has spoken at internal and external conferences including AWS re:Invent, AWS Summits, and webinars. Outside of work, she enjoys volunteering, gardening, biking, and hiking.
Lee Ching Wei is a Machine Learning Specialist at Amazon Web Services. He completed his PhD in Operations Research after bankrupting his advisor’s research grant account and missing out on a promised Nobel Prize. He currently helps clients in the financial services and insurance industries build machine learning solutions on AWS. In his spare time, he enjoys reading and teaching.
Mani Kanuja is a technical lead for generative AI specialists, author of the book “Applied Machine Learning and High Performance Computing on AWS,” and a member of the Women in Manufacturing Education Foundation board. She leads machine learning projects in a variety of areas including computer vision, natural language processing, and generative AI. She has spoken at internal and external conferences including AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she enjoys long runs along the beach.