AI chatbots and virtual assistants have become increasingly popular in recent years due to advances in large language models (LLMs). Trained on massive datasets, these models incorporate a memory component in their architectural design and are able to understand and retain the context of text.
The most common use cases for chatbot assistants focus on a few key areas: improving customer experience, increasing employee productivity and creativity, and optimizing business processes such as customer support, troubleshooting, and searching internal and external knowledge bases.
Despite these capabilities, the main challenge for a chatbot is to generate high-quality, accurate responses. One way to address this challenge is to use Retrieval Augmented Generation (RAG). RAG is the process of optimizing the output of an LLM so that it references a trusted knowledge base outside its training data sources before generating a response. Reranking seeks to improve search relevance by using a different model to reorder the result set returned by the retriever. In this article, we explain how the two techniques, RAG and reranking, can help you improve chatbot responses using a knowledge base in Amazon Bedrock.
Solution overview
RAG is a technique that combines the strengths of knowledge base retrieval and generative models for text generation. It first retrieves relevant documents from a knowledge base, then feeds them as context to a generative model to produce the final output. There are many benefits to using the RAG approach for building chatbots. For example, retrieving relevant content before generating a response provides more relevant and consistent answers, which improves the conversation flow. Also, compared to purely generative models, RAG scales better with more data and does not require fine-tuning the model as new data is added to the knowledge base. Additionally, the retrieval component allows the model to incorporate external knowledge by retrieving relevant background information from a database. This approach helps provide factual, detailed, and knowledgeable responses.
To find answers, RAG uses vector search across documents. The advantages of vector search are speed and scalability. Instead of scanning every document to find answers, the RAG approach converts the text (the knowledge base) into embeddings and stores these embeddings in a database. An embedding is a compressed representation of a document, expressed as an array of numbers. After the embeddings are stored, vector search queries the vector database to find similarities based on the vectors associated with the documents. Typically, vector search returns the top-k documents most relevant to the user’s question. However, because the similarity algorithm in a vector database works on vectors and not documents, vector search does not necessarily return the most relevant information in the top-k results. If the most relevant context is not passed to the LLM, it directly affects the accuracy of the response.
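As a simple illustration (not the Bedrock implementation, just a toy example with made-up vectors), the following sketch shows how a top-k vector search ranks documents by cosine similarity between the query embedding and the stored document embeddings:

```python
import numpy as np

# Toy embeddings; in practice these come from an embedding model such as Amazon Titan Text Embeddings.
doc_embeddings = {
    "doc-1": np.array([0.9, 0.1, 0.0]),
    "doc-2": np.array([0.2, 0.8, 0.1]),
    "doc-3": np.array([0.7, 0.3, 0.2]),
}
query_embedding = np.array([0.8, 0.2, 0.1])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and keep the top-k most similar ones.
k = 2
scores = {doc_id: cosine_similarity(query_embedding, emb) for doc_id, emb in doc_embeddings.items()}
top_k = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
print(top_k)  # doc-1 and doc-3 score highest for this query
```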
Reranking is a technique that can further improve the response by selecting the best option among multiple candidate results. The following architecture shows how the reranking solution works.

Architectural diagram of integrating a reranking model with a Bedrock knowledge base
Let’s create a question answering solution. We’ll ingest the 1925 novel “The Great Gatsby” by American author F. Scott Fitzgerald, which is available through Project Gutenberg. We’ll implement an end-to-end RAG workflow using an Amazon Bedrock knowledge base and ingest the embeddings into an Amazon OpenSearch Serverless vector search collection. We’ll then retrieve answers using both standard RAG and two-stage RAG with reranking, and compare the results of the two methods.
Code samples are available in this GitHub repository.
The following sections provide high-level steps.
- Prepare the dataset.
- Use an Amazon Bedrock LLM to generate questions from the document.
- Create a knowledge base that includes this book.
- Use the knowledge base retrieve_and_generate API to get answers.
- Evaluate the responses using the RAGAS framework.
- Run a two-stage RAG: use the knowledge base retrieve API to fetch the context again, then apply reranking to the context before generating the answer.
- Evaluate the two-stage RAG responses using the RAGAS framework.
- Compare the results and performance of each RAG approach.
For efficiency, we have provided sample code in the notebook to generate a set of questions and answers. These Q&A pairs are used in the RAG evaluation process. We strongly recommend having a human verify the accuracy of each question and answer.
The following sections explain the main steps with code blocks.
Prerequisites
To clone the GitHub repository to your local machine, open a terminal window and run the following command:
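The repository URL is available from the link above; the clone command follows the usual pattern (the values below are placeholders for the actual URL and directory name):

```
git clone <repository-url>
cd <repository-directory>
```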
Prepare the dataset
Download the book from the Project Gutenberg website. In this post, we’ll create 10 large documents from the book and upload them to Amazon Simple Storage Service (Amazon S3).
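As a rough sketch of this step (the Project Gutenberg URL, bucket name, and chunking below are assumptions for illustration; the notebook may split the text differently), you could download, split, and upload the book like this:

```python
import boto3
import requests

# Assumed values for illustration; replace with your own.
BOOK_URL = "https://www.gutenberg.org/cache/epub/64317/pg64317.txt"  # plain-text edition of The Great Gatsby
BUCKET = "my-kb-source-bucket"
NUM_DOCS = 10

text = requests.get(BOOK_URL, timeout=30).text

# Split the book into 10 roughly equal documents.
chunk_size = len(text) // NUM_DOCS + 1
documents = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Upload each document to the S3 bucket that will back the knowledge base.
s3 = boto3.client("s3")
for idx, doc in enumerate(documents):
    s3.put_object(Bucket=BUCKET, Key=f"great-gatsby/part-{idx:02d}.txt", Body=doc.encode("utf-8"))
```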
Create a Bedrock knowledge base
If you are new to Amazon Bedrock Knowledge Bases, see Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models, which explains how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.
In this step, you create the knowledge base using a Boto3 client, convert your documents into embeddings (embeddingModelArn) with Amazon Titan Text Embeddings V2, and specify the S3 bucket you created earlier as the data source (dataSourceConfiguration).
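A condensed sketch of these calls is shown below. The role ARN, collection ARN, index name, and bucket ARN are placeholders, and the parameter shapes should be checked against the current boto3 bedrock-agent documentation:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder ARNs and names for illustration.
kb = bedrock_agent.create_knowledge_base(
    name="great-gatsby-kb",
    roleArn="arn:aws:iam::111122223333:role/BedrockKnowledgeBaseRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            # Amazon Titan Text Embeddings V2
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/example-collection-id",
            "vectorIndexName": "great-gatsby-index",
            "fieldMapping": {
                "vectorField": "vector",
                "textField": "text",
                "metadataField": "metadata",
            },
        },
    },
)
kb_id = kb["knowledgeBase"]["knowledgeBaseId"]

# Register the S3 bucket as the data source and start ingestion.
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="great-gatsby-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-kb-source-bucket"},
    },
)
bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds["dataSource"]["dataSourceId"],
)
```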
Generate questions from documents
We use Anthropic’s Claude on Amazon Bedrock to generate a list of 10 questions and their corresponding answers. The Q&A data serves as the foundation for evaluating each RAG approach we implement. We define the answers generated in this step as the ground truth data.
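The notebook contains the full prompt and parsing logic; the following is only a minimal sketch using the Bedrock Runtime Messages API. The model ID, prompt, and output handling are assumptions, so adjust them to a Claude model enabled in your account:

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

excerpt = documents[0]  # one of the chunks created in the dataset preparation step

prompt = (
    "You are given an excerpt from 'The Great Gatsby'. "
    "Generate 10 question-and-answer pairs that can be answered solely from the excerpt. "
    "Return them as a JSON list of objects with 'question' and 'answer' keys.\n\n"
    f"Excerpt:\n{excerpt}"
)

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # adjust to a model enabled in your account
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
qa_text = json.loads(response["body"].read())["content"][0]["text"]
qa_pairs = json.loads(qa_text)  # list of {"question": ..., "answer": ...}
```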
Use the Knowledge Base API to get answers
Use the generated questions with the knowledge base retrieve_and_generate API to get answers.
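A minimal sketch of this call is shown below; the knowledge base ID and model ARN are placeholders:

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_knowledge_base(question, kb_id, model_arn):
    """Standard RAG: retrieve context and generate an answer in one call."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    return response["output"]["text"]

# Placeholder identifiers for illustration.
answer = ask_knowledge_base(
    "Who is Jay Gatsby?",
    kb_id="XXXXXXXXXX",
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
)
print(answer)
```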
Assessing RAG responses using the RAGAS framework
Here, we evaluate the effectiveness of the RAG responses using a framework called RAGAS, which provides a set of metrics to evaluate different aspects. In this example, we evaluate the responses on the following aspects (a minimal evaluation sketch follows the list):
- Answer relevancy – This metric assesses how relevant the generated answer is to the given prompt. Incomplete answers or answers that contain redundant information receive a low score. It is calculated from the question and the answer and ranges from 0 to 1, with higher scores indicating greater relevance.
- Answer similarity – This metric evaluates the semantic similarity between the generated answer and the ground truth. It is calculated from the ground truth and the answer and ranges from 0 to 1; a higher score means better alignment between the generated answer and the ground truth.
- Context relevancy – This metric measures the relevance of the retrieved context, calculated from both the question and the contexts. Values range from 0 to 1, with higher values indicating higher relevance.
- Answer correctness – This metric gauges the accuracy of the generated answer compared to the ground truth. It is calculated from the ground truth and the answer and ranges from 0 to 1; a higher score indicates closer alignment between the generated answer and the ground truth.
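The following is a minimal evaluation sketch. It assumes the questions, generated answers, retrieved contexts, and human-verified ground truth answers from the previous steps have been collected into lists, and that the judge LLM and embeddings are LangChain wrappers around Bedrock models; column names and arguments can vary across RAGAS versions:

```python
from datasets import Dataset
from langchain_community.chat_models import BedrockChat
from langchain_community.embeddings import BedrockEmbeddings
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_relevancy,
)

# Collected from the previous steps (one entry per evaluation question); truncated placeholders shown here.
questions = ["Who is Jay Gatsby?"]
answers = ["..."]            # answers generated by the RAG pipeline
contexts = [["...", "..."]]  # retrieved passages for each question
ground_truths = ["..."]      # answers generated earlier and verified by a human

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
})

# Bedrock-backed judge model and embeddings (model IDs are assumptions).
judge_llm = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0")
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

result = evaluate(
    eval_dataset,
    metrics=[answer_relevancy, answer_similarity, answer_correctness, context_relevancy],
    llm=judge_llm,
    embeddings=embeddings,
)
print(result)
```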
Summary report of standard RAG approach based on RAGAS evaluation:
answer_relevancy: 0.9006225160334027
answer_similarity: 0.7400904157096762
answer_correctness: 0.32703043056663855
context_relevancy: 0.024797687553157175
Two-stage RAG: retrieval and reranking
Now that we have results from the retrieve_and_generate API, let’s consider a two-stage retrieval approach that extends the standard RAG approach by integrating it with a reranking model. In the RAG context, a reranking model is applied after the retriever returns the initial set of contexts. The reranking model takes the list of results and reorders each result based on the similarity between the context and the user query. In this example, we use a powerful reranking model called bge-reranker-large, which is available on the Hugging Face Hub and is free for commercial use. The contexts are fetched from the knowledge base using the retrieve API and then reranked with the reranking model deployed as an Amazon SageMaker endpoint. We provide sample code for deploying the reranking model on SageMaker in our GitHub repository.
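Below is a sketch of that two-stage flow. The knowledge base ID, SageMaker endpoint name, and the reranker’s request and response format are assumptions that depend on how bge-reranker-large was deployed, so adjust them to match your deployment:

```python
import json
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
sagemaker_runtime = boto3.client("sagemaker-runtime")

KB_ID = "XXXXXXXXXX"                      # placeholder knowledge base ID
RERANKER_ENDPOINT = "bge-reranker-large"  # placeholder SageMaker endpoint name

def two_stage_retrieve(question, top_k=10, top_n=3):
    # Stage 1: vector search against the knowledge base.
    retrieval = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    passages = [r["content"]["text"] for r in retrieval["retrievalResults"]]

    # Stage 2: score each (question, passage) pair with the reranking model.
    # The payload format here is an assumption and depends on your deployment.
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=RERANKER_ENDPOINT,
        ContentType="application/json",
        Body=json.dumps({"inputs": [{"text": question, "text_pair": p} for p in passages]}),
    )
    scores = [item["score"] for item in json.loads(response["Body"].read())]

    # Keep the top-n passages by reranker score as the final context.
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```

The reranked passages are then passed as context to an LLM, for example through the Bedrock converse or invoke_model API, to generate the final answer.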
Assessing two-stage RAG responses using the RAGAS framework
We evaluate the answers generated by the two-stage retrieval process. Below is a summary report based on the RAGAS evaluation.
answer_relevancy: 0.841581671275458
answer_similarity: 0.7961827348349313
answer_correctness: 0.43361356731293665
context_relevancy: 0.06049484724216884
Compare the results
Let’s compare the test results. As shown in the following image, two-stage retrieval with reranking improves context relevancy, answer correctness, and answer similarity, which are important for improving the accuracy of the RAG process.

Evaluation metrics for RAG and two-stage retrieval
We also measured the end-to-end RAG latency for both approaches. The results are shown in the following metrics and the corresponding graph:
Standard RAG latency: 76.59s
Two Stage Retrieval latency: 312.12s

Latency Metrics for RAG and Two-Stage Retrieval Processes
In summary, hosting the reranking model (bge-reranker-large) on an ml.m5.xlarge instance introduces roughly 4x the latency compared to the standard RAG approach. We recommend testing different reranking model variants and instance types to get the best performance for your use case.
Conclusion
In this post, we demonstrated how to implement a two-stage retrieval process by integrating a reranking model. We explored how integrating a reranking model with an Amazon Bedrock knowledge base can improve performance. Finally, we used the open source framework RAGAS to report context relevancy, answer relevancy, answer similarity, and answer correctness metrics for both approaches.
Try this retrieval process now and share your feedback in the comments section.
About the Authors
Way Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business goals using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities such as camping, fishing, and hiking with his family.
Pallavi Nargund is a Principal Solutions Architect at AWS. As a cloud technology advocate, she works with customers to understand their goals and challenges, and provides prescriptive guidance to achieve them with AWS services. She is passionate about women in technology and is a core member of Amazon’s Women in AI/ML. She has spoken at internal and external conferences including AWS re:Invent, AWS Summits, and webinars. Outside of work, she enjoys volunteering, gardening, biking, and hiking.
Lee Ching Wei is a Machine Learning Specialist at Amazon Web Services. He completed his PhD in Operations Research after bankrupting his advisor’s research grant account and missing out on a promised Nobel Prize. He currently helps clients in the financial services and insurance industries build machine learning solutions on AWS. In his spare time, he enjoys reading and teaching.
Mani Kanuja is a technical lead for generative AI specialists, author of the book “Applied Machine Learning and High Performance Computing on AWS,” and a member of the Women in Manufacturing Education Foundation board. She leads machine learning projects in a variety of areas including computer vision, natural language processing, and generative AI. She has spoken at internal and external conferences including AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she enjoys long runs along the beach.