Amazon SageMaker Ground Truth enables the creation of high-quality, large-scale training datasets, which are essential for fine-tuning across a wide range of applications, including large language models (LLMs) and generative AI. SageMaker Ground Truth dramatically reduces the cost and time required to label data by combining human annotators with machine learning. Whether you annotate images, video, or text, SageMaker Ground Truth helps you build accurate datasets while maintaining human oversight and feedback at scale. This human-in-the-loop approach is important for aligning the underlying model with human preferences and enhancing its ability to perform tasks tailored to specific requirements.
To support a variety of labeling needs, SageMaker Ground Truth provides built-in workflows for common tasks such as image classification, object detection, and semantic segmentation. Additionally, you have the flexibility to create custom workflows, allowing you to design your own UI templates for specialized data labeling tasks to suit your unique requirements.
Previously, you needed to specify two AWS Lambda functions to set up a custom labeling job: a pre-annotation function that runs on each dataset object before it is sent to workers, and a post-annotation function that runs on the annotations for each dataset object and consolidates the annotations of multiple workers as needed. These functions provide valuable customization capabilities, but they also add complexity for users who don't require additional data manipulation. In those cases, you would have to create a function that simply returns its input unchanged, which increases development effort and can introduce errors when integrating the Lambda functions with UI templates and input manifest files.
Today, we’re excited to announce that you no longer need to provide pre-annotation and post-annotation Lambda functions when creating custom SageMaker Ground Truth labeling jobs. These functions are now optional in both the SageMaker console and CreateLabelingJob API. This means you can create custom labeling workflows more efficiently when no additional data processing is required.
This post shows you how to use SageMaker Ground Truth to set up a custom labeling job without using Lambda functions. We’ll walk you through setting up a workflow with a multimodal content evaluation template, explain how the workflow works without Lambda functions, and highlight the benefits of this new feature.
Solution overview
Omitting the Lambda functions from your custom labeling job simplifies your workflow in the following ways:
- No pre-annotation Lambda function – Data from the input manifest file is inserted directly into the UI template. You can reference the fields of a data object in your template without mapping them through a Lambda function.
- No post-annotation Lambda function – Each worker's annotations are stored as separate JSON files directly in the specified Amazon Simple Storage Service (Amazon S3) bucket, under the worker response key. Without a post-annotation Lambda function, the output manifest file references these worker response files instead of including all annotations directly in the manifest.
The following sections describe how to use a multimodal content evaluation template to set up a custom labeling job without Lambda functions. In this example, annotators evaluate image descriptions produced by your model: they review the image, the prompt, and the model response, and rate the response on criteria such as accuracy, relevance, and clarity. This provides important human feedback for model fine-tuning and LLM evaluation using reinforcement learning from human feedback (RLHF).
Prepare the input manifest file
To set up a labeling job, start by preparing the input manifest file that the template will use. The input manifest is a JSON Lines file in which each line represents one dataset item to be labeled. Each line contains either a source field with the data embedded directly or a source-ref field that references data stored in Amazon S3. These fields provide the data objects that annotators label. For more information about the structure of the input manifest file, see Input manifest files.
For the specific task of evaluating image descriptions produced by a model, we structure the input manifest to include the following fields:
- “source” – The prompt provided to the model
- “image” – The S3 URI of the image associated with the prompt
- “modelResponse” – The model-generated description of the image
Including these fields lets you present the prompt and its related data directly to the annotator in the UI template. This approach eliminates the need for a pre-annotation Lambda function because all the necessary information is already available in the manifest file.
The following code is an example of what the lines in the input manifest might look like.
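This is a minimal sketch; the bucket name, object keys, and prompt text are placeholders:

```json
{"source": "Describe what is happening in this image.", "image": "s3://your-bucket/images/image1.jpg", "modelResponse": "A dog is catching a frisbee in a sunny park."}
{"source": "Summarize the contents of this chart.", "image": "s3://your-bucket/images/image2.jpg", "modelResponse": "The chart shows monthly sales rising steadily through the third quarter."}
```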
Insert prompts in UI templates
In your UI template, you can insert the prompt using {{ task.input.source }} and display the image with an <img> tag whose src attribute is set to {{ task.input.image | grant_read_access }} (the grant_read_access Liquid filter gives workers access to the S3 object). You can display the model response using {{ task.input.modelResponse }}. Annotators can then evaluate the model's responses against predefined criteria such as accuracy, relevance, and clarity, using tools such as sliders and text input fields for additional comments. The complete UI template for this task is available in the GitHub repository.
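As a rough sketch, the core of such a template might look like the following; the slider names and layout are illustrative assumptions, not the exact template from the repository:

```html
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
  <h3>Prompt</h3>
  <p>{{ task.input.source }}</p>

  <h3>Image</h3>
  <!-- grant_read_access gives workers temporary access to the S3 object -->
  <img src="{{ task.input.image | grant_read_access }}" style="max-width: 40%">

  <h3>Model response</h3>
  <p>{{ task.input.modelResponse }}</p>

  <!-- One slider per evaluation criterion -->
  <p>Accuracy</p>
  <crowd-slider name="accuracy" min="1" max="5" step="1" pin required></crowd-slider>
  <p>Relevance</p>
  <crowd-slider name="relevance" min="1" max="5" step="1" pin required></crowd-slider>
  <p>Clarity</p>
  <crowd-slider name="clarity" min="1" max="5" step="1" pin required></crowd-slider>

  <!-- Free-text field for additional comments -->
  <crowd-text-area name="comments" placeholder="Optional comments"></crowd-text-area>
</crowd-form>
```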
Create a labeling job in the SageMaker console
To configure a labeling job using the AWS Management Console, follow these steps:

- On the SageMaker console, under Ground Truth in the navigation pane, choose Labeling jobs.
- Choose Create labeling job.
- Specify the input manifest location and the output path.
- Choose Custom as the task type.
- Choose Next.
- Enter a title and description for the task.
- Under Templates, upload your UI template.

The pre-annotation and post-annotation Lambda functions are now marked as optional under Additional configuration.

- Choose Preview to review the UI template.
- Choose Create to create the labeling job.
Create a labeling job using the CreateLabelingJob API
You can create custom labeling jobs programmatically using the AWS SDKs and the CreateLabelingJob API. After you upload your input manifest file to your S3 bucket and set up your work team, define the labeling job in code and omit the Lambda function parameters if you don't need them. The following example shows how to do this using Python and Boto3.
In the API, the pre-annotation Lambda function is specified by the PreHumanTaskLambdaArn parameter within the HumanTaskConfig structure, and the post-annotation Lambda function is specified by the AnnotationConsolidationLambdaArn parameter within the AnnotationConsolidationConfig structure. With this update, both PreHumanTaskLambdaArn and AnnotationConsolidationConfig are now optional, so if your labeling workflow doesn't require additional pre- or post-processing, you can omit them.
The following code is an example of how to create a labeling job without specifying a Lambda function.
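This is a minimal sketch; the bucket paths, role ARN, and work team ARN are placeholders you would replace with your own:

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_labeling_job(
    LabelingJobName="multimodal-content-evaluation",
    # Worker responses are recorded under this attribute in the output manifest
    LabelAttributeName="evaluation",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://your-bucket/input/manifest.jsonl"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://your-bucket/output/"},
    RoleArn="arn:aws:iam::111122223333:role/YourSageMakerExecutionRole",
    HumanTaskConfig={
        "TaskTitle": "Evaluate model-generated image descriptions",
        "TaskDescription": "Rate each response for accuracy, relevance, and clarity",
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/your-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://your-bucket/templates/evaluation.html"},
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 900,
        # No PreHumanTaskLambdaArn: manifest fields flow directly into the UI template
    },
    # No AnnotationConsolidationConfig: worker responses are written to S3 as-is
)
```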
When an annotator submits a rating, the response is saved directly to the specified S3 bucket. The output manifest file contains the original data fields along with a worker-response-ref that points to the worker response file in S3. This worker response file contains all annotations for that data object; if multiple annotators worked on the same data object, their individual responses appear in the answers key, an array of responses. Each response includes the annotator's input and metadata such as acceptance time, submission time, and worker ID.
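As an illustrative sketch (grounded in the fields described above; the exact metadata fields may vary), a worker response file has roughly this shape:

```json
{
  "answers": [
    {
      "answerContent": {"accuracy": 4, "relevance": 5, "clarity": 4, "comments": "Accurate and clear."},
      "acceptanceTime": "2024-10-01T12:00:00.000Z",
      "submissionTime": "2024-10-01T12:03:27.000Z",
      "workerId": "private.us-east-1.0123456789abcdef"
    }
  ]
}
```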
This means that all annotations for a given data object are collected in one place, and you can later process or analyze them according to your specific requirements without a post-annotation Lambda function. You have access to all the raw annotations and can perform any necessary consolidation or aggregation as part of your post-processing workflow.
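For example, the following minimal sketch (assuming the manifest and response shapes above, with "evaluation" as the label attribute name and a hypothetical output path) reads the output manifest, fetches each worker response file, and averages the slider scores across workers:

```python
import json
import boto3

s3 = boto3.client("s3")

def get_s3_text(uri: str) -> str:
    """Download an object from an s3:// URI and return it as text."""
    bucket, key = uri.removeprefix("s3://").split("/", 1)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

# Output manifest produced by the labeling job (placeholder path)
manifest_uri = "s3://your-bucket/output/multimodal-content-evaluation/manifests/output/output.manifest"

for line in get_s3_text(manifest_uri).splitlines():
    item = json.loads(line)
    # "evaluation" is the LabelAttributeName chosen when creating the job
    response_uri = item["evaluation"]["worker-response-ref"]
    answers = json.loads(get_s3_text(response_uri))["answers"]
    # Average each criterion across all workers for this data object
    scores = {
        criterion: sum(a["answerContent"][criterion] for a in answers) / len(answers)
        for criterion in ("accuracy", "relevance", "clarity")
    }
    print(item["source"], scores)
```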
Advantages of labeling jobs without using Lambda functions
Creating custom labeling jobs without using Lambda functions has the following benefits:
- Simplified setup – Create custom labeling jobs faster by skipping the creation and configuration of unnecessary Lambda functions.
- Time savings – Reducing the number of components in your labeling workflow saves development and debugging time.
- Reduced complexity – Fewer moving parts means less chance of configuration errors or integration issues.
- Cost reduction – Skipping Lambda functions avoids the costs associated with deploying and invoking those resources.
- Flexibility – If your project requires pre-processing or annotation consolidation, you can still use Lambda functions for those steps. This update offers simplicity for straightforward tasks and flexibility for more complex requirements.
This feature is currently available in all AWS Regions that support SageMaker Ground Truth. In the future, we plan to extend this simplification to built-in task types that don't require annotation Lambda functions, to streamline the overall SageMaker Ground Truth experience.
Conclusion
The introduction of SageMaker Ground Truth custom labeling job workflows without Lambda functions greatly simplifies the data labeling process. By making Lambda functions optional, we’ve made it easier and faster to set up custom labeling jobs, reducing potential errors and saving you valuable time.
This update removes unnecessary steps for users who don't require specialized data processing, while preserving the flexibility of custom workflows. Whether you're performing a simple labeling task or a complex multi-step annotation, SageMaker Ground Truth now provides a more streamlined path to high-quality labeled data.
We encourage you to try this new feature and see how it enhances your data labeling workflow. Check out these resources to get started:
About the authors
Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers leverage SageMaker and Bedrock to build scalable, cost-effective pipelines for computer vision applications, natural language processing, and generative AI. In his free time, Sundar loves exploring new places, trying local eateries, and enjoying the great outdoors.
Alan Ismail is a software engineer at AWS based in New York City. He focuses on building and maintaining scalable AI/ML products such as Amazon SageMaker Ground Truth and Amazon Bedrock Model Evaluation. Outside of work, Alan is learning how to play pickleball, with mixed results.
Yinan Ran is a software engineer at AWS Ground Truth. He has worked on Ground Truth, Mechanical Turk, and Bedrock infrastructure, as well as customer-facing projects for Ground Truth Plus. He has also focused on product security, remediating risks and creating security tests. In his spare time, he is an audiophile and especially loves practicing Bach's keyboard compositions.
George King is a summer 2024 intern at Amazon AI. He studies computer science and mathematics at the University of Washington, where he is currently a rising junior. George loves the outdoors, playing games (chess and all types of card games), and exploring Seattle, where he has lived his entire life.