As generative artificial intelligence (AI) continues to revolutionize every industry, effective prompt optimization through prompt engineering techniques is key to efficiently balancing output quality, response time, and cost. Prompt engineering is the selection of appropriate words, phrases, sentences, punctuation, and delimiters to create and optimize inputs so that foundation models (FMs) or large language models (LLMs) can be used effectively for various applications. High-quality prompts maximize the chances of getting the right response from a generative AI model.
A fundamental part of the optimization process is evaluation, and evaluating a generative AI application involves multiple elements. In addition to the more common evaluation of FMs themselves, evaluating prompts is a critical, yet often challenging, aspect of developing high-quality AI-powered solutions. Many organizations struggle to consistently create and effectively evaluate prompts across different applications, resulting in inconsistent performance, an uneven user experience, and undesirable responses from the model.
In this post, we show how you can use Amazon Bedrock Prompt Management and Amazon Bedrock Prompt Flows to implement an automated prompt evaluation system, so you can systematically evaluate prompts at scale, streamline your prompt development process, and improve the overall quality of your AI-generated content.
The importance of prompt evaluation
Before we dive into the technical implementation, let’s briefly discuss why prompt evaluation is important. The key aspects to consider when building and optimizing a prompt are usually:
- Quality assurance – Evaluating prompts ensures that your AI application produces consistently high-quality and relevant output for the model you select.
- Performance optimization – Identifying and refining effective prompts can improve the overall performance of generative AI models, for example by reducing latency and ultimately increasing throughput.
- Cost optimization – Better prompts can lead to more efficient use of AI resources and potentially reduce the costs associated with model inference. Well-crafted prompts can allow the use of smaller, less expensive models, whereas poor prompts produce poor results.
- User experience – Improved prompts produce more accurate, personalized, and helpful AI-generated content, improving the end-user experience in your application.
Optimizing prompts for these aspects is an iterative process that requires evaluation to drive prompt adjustments. In other words, it is a way to understand how well a particular prompt-model combination works in getting the desired answer.
In this example, we implement the LLM-as-a-judge method: an LLM is used to evaluate prompts, according to predefined criteria, based on the answers a particular model generated for them. Evaluating prompts and their answers for a given LLM is an inherently subjective task, but systematic prompt evaluation with an LLM as the judge lets you quantify the evaluation with a numeric score. This allows for standardization and automation of the prompt lifecycle within an organization, which is one of the reasons why this method is among the most common prompt evaluation approaches in the industry.
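To make the idea concrete, the following is a minimal sketch of LLM-as-a-judge using the Amazon Bedrock Converse API with boto3. It is not the code from the sample repository; the judge criteria, scoring scale, and model ID are illustrative assumptions you should adapt to your own requirements.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative judge template; define your own criteria and scoring scale.
JUDGE_TEMPLATE = (
    "You are an expert prompt evaluator. Rate the following prompt and its "
    "generated answer on a scale of 1-10 for clarity, relevance, accuracy, and "
    'completeness. Respond in JSON with the keys "score", "justification", '
    'and "recommendations".\n\nPrompt: {prompt}\n\nAnswer: {answer}'
)

def judge(prompt: str, answer: str,
          model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> str:
    """Ask a judge model to score a prompt/answer pair and return its raw verdict."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_TEMPLATE.format(prompt=prompt, answer=answer)}],
        }],
        # Temperature 0 keeps the judge as deterministic as possible.
        inferenceConfig={"temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]
```

The rest of this post builds the same pattern without custom code, using managed prompts and flows.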
Let’s take a look at a sample solution for evaluating prompts with LLM-as-a-judge in Amazon Bedrock. The complete code example can also be found in the amazon-bedrock-samples repository.
Prerequisites
For this example, you need an AWS account with access to Amazon Bedrock and model access enabled for the models you plan to use for invocation and evaluation (in this post, Anthropic Claude Sonnet).
Configure the evaluation prompt
To create an evaluation prompt using Amazon Bedrock Prompt Management, follow these steps:
- In the Amazon Bedrock console navigation pane, choose Prompt Management, then choose Create prompt.
- Enter a name for your prompt, such as prompt-evaluator, and a description such as "Prompt template for evaluating prompt responses with LLM-as-a-judge," then choose Create.
- In the Prompt field, write your prompt evaluation template. For example, you can use a template like the one shown after these steps, or adapt it to your specific evaluation requirements.
- Under Configurations, select the model you want to use to run the evaluation. In this example, we selected Anthropic Claude Sonnet. The quality of the evaluation depends on the model you select in this step; when deciding, try to balance quality, response time, and cost.
- Set up the Inference parameters for the model; for example, set temperature to 0 to keep the evaluation factual and avoid hallucinations.
You can test your evaluation prompt with sample inputs and outputs by using the Test variables and Test window panels.
- Now that you have a draft of your prompt, you can also create a version. Versions allow you to quickly switch between different configurations of your prompt and update your application with the version that best suits your use case. To create a version, choose Create version at the top.
The following screenshot shows the Prompt builder page.
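For reference, here is an illustrative evaluation prompt template along the lines of the one referenced in the steps above. The criteria, scale, and output keys are assumptions you should adapt; only the {{input}} and {{output}} variables need to match the flow configuration described later in this post.

```
You are an expert evaluator of prompts and of the answers that large language
models generate for them.

Evaluate the prompt and the generated answer below. Rate each on a scale from
1 to 10, considering clarity, relevance, accuracy, and completeness.

Prompt: {{input}}

Answer: {{output}}

Respond only with a JSON object containing the keys "promptScore",
"answerScore", "justification", and "recommendations".
```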
Set up the evaluation flow
Next, you build an evaluation flow using Amazon Bedrock Prompt Flows. In this example, we use Prompt nodes; check the Prompt Flows node types documentation for more information on supported node types. To build the evaluation flow, follow these steps:
- In the Amazon Bedrock console, under Prompt Flows, choose Create prompt flow.
- Enter a name, such as prompt-eval-flow, and a description such as "Prompt flow for evaluating prompts with LLM-as-a-judge." Under Use an existing service role, select a role from the dropdown, then choose Create.
- In the Prompt flow builder, drag two Prompt nodes onto the canvas and configure them, along with the flow input and output nodes, according to the following parameters:
- Flow input node:
  - Output:
    - Name: document, Type: String
- First Prompt node (Invoke):
  - Node name: Invoke
  - Define in node
  - Select model: the model whose prompt responses you want to evaluate
  - Message: {{input}}
  - Inference configurations: as per your preference
  - Input:
    - Name: input, Type: String, Expression: $.data
  - Output:
    - Name: modelCompletion, Type: String
- Second Prompt node (Evaluate):
  - Node name: Evaluate
  - Use a prompt from Prompt Management
  - Prompt: prompt-evaluator
  - Version: Version 1 (or your preferred version)
  - Select model: the model you want to use to evaluate the prompt
  - Inference configurations: as configured in the prompt
  - Input:
    - Name: input, Type: String, Expression: $.data
    - Name: output, Type: String, Expression: $.data
  - Output:
    - Name: modelCompletion, Type: String
- Flow output node:
  - Node name: End
  - Input:
    - Name: document, Type: String, Expression: $.data
- To connect the nodes, drag the connection dots as shown in the following image.
To test the prompt evaluation flow, use the Test prompt flow panel. You pass an input, such as the question "What is cloud computing, in a single paragraph?", and the flow returns a JSON object with the evaluation results, similar to the example shown after this paragraph. In the code example notebook in amazon-bedrock-samples, we also include information about the models used for invocation and evaluation in the resulting JSON.
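The exact structure of the result depends on your evaluation prompt. With a template like the illustrative one shown earlier, the returned JSON might look similar to the following (field names and values are assumptions, not output from the sample repository):

```json
{
  "promptScore": 8,
  "answerScore": 7,
  "justification": "The prompt is clear and scoped to a single paragraph; the answer is accurate but omits common deployment models.",
  "recommendations": "Ask for the main service models and the intended audience to make the answer more complete."
}
```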
As the example shows, we asked the FM to rate the prompt and the answer generated from that prompt with separate scores, to provide a justification for those scores, and to give recommendations for further improving the prompt. All this information is valuable to prompt engineers, as it helps guide optimization experiments and make informed decisions during the prompt's lifecycle.
Prompt evaluation at scale
So far, we have shown how to evaluate a single prompt. Mid-to-large organizations often deal with dozens, hundreds, or even thousands of prompt variations across multiple applications, creating a great opportunity for automation at scale. For this, you can run your flow on a complete dataset of prompts stored in a file, as shown in the sample notebook.
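As a minimal sketch of what such a batch run could look like, the following code iterates over a JSONL file of prompts and calls the InvokeFlow API with boto3. The flow ID, alias, node name, and file format are assumptions for illustration; the sample notebook may structure this differently.

```python
import json
import boto3

client = boto3.client("bedrock-agent-runtime")

# Hypothetical identifiers; replace with your flow's ID and alias.
FLOW_ID = "YOUR_FLOW_ID"
FLOW_ALIAS_ID = "YOUR_FLOW_ALIAS_ID"

def evaluate_prompt(prompt_text: str) -> dict:
    """Run a single prompt through the evaluation flow and return the parsed result."""
    response = client.invoke_flow(
        flowIdentifier=FLOW_ID,
        flowAliasIdentifier=FLOW_ALIAS_ID,
        inputs=[{
            "nodeName": "FlowInputNode",   # default name of the flow input node
            "nodeOutputName": "document",
            "content": {"document": prompt_text},
        }],
    )
    # The response is an event stream; collect the flow output event.
    for event in response["responseStream"]:
        if "flowOutputEvent" in event:
            # Assumes the judge returns pure JSON in the flow's String output.
            return json.loads(event["flowOutputEvent"]["content"]["document"])
    return {}

# Evaluate every prompt in a JSONL dataset, one prompt per line: {"prompt": "..."}
with open("prompts.jsonl") as f:
    for line in f:
        prompt = json.loads(line)["prompt"]
        print(prompt, "->", evaluate_prompt(prompt))
```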
Alternatively, you can take advantage of other node types in Amazon Bedrock Prompt Flows to read from and store to Amazon Simple Storage Service (Amazon S3) and implement an iterator- and collector-based flow. The following diagram illustrates this type of flow. Once you have established a file-based mechanism for running prompt evaluation flows on large datasets, you can also connect it to your preferred continuous integration and continuous delivery (CI/CD) tools to automate the entire process. These details are beyond the scope of this post.
Best Practices and Recommendations
Based on our evaluation process, here are some best practices for improving prompts:
- Iterative Improvement – Use evaluation feedback to continually improve your prompts. Prompt optimization is ultimately an iterative process.
- Context matters – Ensure that your prompts provide enough context for your AI model to generate an accurate response. Depending on the complexity of the task or question that the prompt answers, you might need to use different prompt engineering techniques. You can review the prompt engineering guidelines in the Amazon Bedrock documentation and other resources on the topic provided by your model provider.
- Specificity matters – Be as specific as possible with your prompts and evaluation criteria. Specificity will guide the model to the desired output.
- Testing edge cases – Evaluate prompts with different inputs to validate robustness. You can also run multiple evaluations on the same prompt to compare and test for consistency of output, which may be important depending on your use case.
Conclusion and next steps
By using LLM-as-a-judge with Amazon Bedrock Prompt Management and Amazon Bedrock Prompt Flows, you can implement a systematic approach to evaluating and optimizing prompts, which not only improves the quality and consistency of your AI-generated content, but also streamlines the development process, reduces costs, and can improve the user experience.
We encourage you to explore these features in more depth and tailor the evaluation process to your specific use case. As you continue to refine your prompts, you will unlock the full potential of generative AI in your applications. To get started, check out the full code sample used in this post. We look forward to seeing how you use these tools to enhance your AI-powered solutions.
To learn more about Amazon Bedrock and its capabilities, see the Amazon Bedrock documentation.
About the Author
Antonio Rodriguez is a Senior Generative AI Specialist Solutions Architect at Amazon Web Services, helping companies of all sizes solve challenges, drive innovation, and create new business opportunities with Amazon Bedrock. Outside of work, he loves spending time with his family and playing sports with friends.