Large language models (LLMs) have come a long way from reading only text to reading and understanding graphs, diagrams, tables, and images. In this post, you learn how to use LLMs available in Amazon Bedrock to not only extract text but also understand the information contained in images.
Amazon Bedrock is a fully managed service that lets you choose high-performance foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, through a single API. It also provides a broad set of features for building generative AI applications with security, privacy, and responsible AI.
Solution overview
This post shows you how to use models in Amazon Bedrock to retrieve information from images, tables, and scanned documents. For example:
- Perform object classification and object detection tasks
- Read and query graphs
- Read flowcharts and architecture diagrams (such as AWS architecture diagrams) and convert them to text
We also discuss specific use cases where Amazon Bedrock can help you generate information from images:
- Identify actions in images
- Name your product and extract metadata to generate taglines and descriptions
- Create a real estate listing for your property
- Generate recipes from food images
Perform object classification and object detection tasks using an LLM
As of this writing, the best algorithms for object classification and detection are based on convolutional neural networks. However, LLMs such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock can also perform these tasks using zero-shot prompting. Zero-shot prompting is a technique in which a model is given a task without any task-specific examples or training. The following example shows how to classify an image into one of the classes you pass in the prompt. The image below is a sample Australian Medicare card.
To classify the image, use the following prompt with Anthropic’s Claude 3 Sonnet.
This is one of the documents uploaded by customers seeking loans. Images can belong to one of four classes: ID, bank statement, application, or utility bill. Please identify the class in this document and tell us just the class name. No other information is required at this time.
You receive the following response from the LLM.
Based on the image provided, the class of this document appears to be an ID card or identification card. The image shows what appears to be a Medicare card or similar government-issued ID card with a name and identification number displayed.
The class was correctly identified. However, responding in JSON makes it easier to process the response. Add the following line at the end of the prompt:
Provide the response in JSON format with the key as ‘class’ and the value as the document’s class.
We get a response like this:
{
"class": "ID"
}
You can now read JSON responses using a library of your choice, including the Python JSON library.
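The following is a minimal Python sketch of how such a call might look using boto3 and the Amazon Bedrock Converse API. The helper name ask_image, the file name medicare_card.png, and the model ID are illustrative assumptions; confirm the model IDs available in your account and AWS Region before using them.

```python
import json
import boto3

# Amazon Bedrock runtime client (assumes AWS credentials and a Region are configured)
bedrock = boto3.client("bedrock-runtime")

def ask_image(model_id: str, image_path: str, prompt: str) -> str:
    """Send an image plus a text prompt to a multimodal model through the Converse API."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

# Illustrative model ID and file name -- verify what is available in your Region
claude_sonnet = "anthropic.claude-3-sonnet-20240229-v1:0"
prompt = (
    "This is one of the documents uploaded by customers seeking loans. "
    "Images can belong to one of four classes: ID, bank statement, application, or utility bill. "
    "Please identify the class in this document and tell us just the class name. "
    "Provide the response in JSON format with the key as 'class' and the value as the document's class."
)
answer = ask_image(claude_sonnet, "medicare_card.png", prompt)
print(json.loads(answer)["class"])  # assumes the model returns only the JSON object, e.g. {"class": "ID"}
```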
The following example uses Meta’s Llama 3.2 90B Vision model to identify objects in an image. The next image shows a gearbox.
Identify the gearbox in the image using the following prompt:
Check out the entire mechanical part in this image. Specify only the name of the machine part in JSON format, with the key as “name” and the value as the name of the machine part.
The following screenshot shows the prompt and the answer the model gives.
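If you follow a pattern like the ask_image helper sketched earlier, calling Meta’s Llama 3.2 90B Vision model differs only in the model ID. The ID and file name below are illustrative; in some Regions this model is invoked through a cross-Region inference profile, so check the Amazon Bedrock documentation.

```python
# Illustrative model ID for Llama 3.2 90B Vision -- may need to be an inference profile ID in your Region
llama_vision = "us.meta.llama3-2-90b-instruct-v1:0"
gear_prompt = (
    "Check out the entire mechanical part in this image. Specify only the name of the machine part "
    "in JSON format, with the key as 'name' and the value as the name of the machine part."
)
print(ask_image(llama_vision, "gearbox.png", gear_prompt))  # reuses the helper from the previous sketch
```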
Read and query graphs
The following example shows how to read a graph and interpolate and extrapolate values within the graph. The following graph shows the population growth of several countries (source).
Pass this image to Anthropic’s Claude 3 Sonnet with the following prompt:
From this graph, create a table of the population of India where the first column is the year and the second column is the population in millions. Create rows for each 5-year period from 1950 to 2025.
Because we ask for a row for every 5-year period, the model must interpolate the values. We also ask for the table to extend to 2025, but because the data is only available until 2021, the model needs to extrapolate the values. We get the following response:
Below is a table of India’s population every five years from 1950 to 2025. The first column represents the year and the second column represents the population in millions.
| Year | Population (in millions) |
|------|--------------------------|
| 1950 | 357.6 |
| 1955 | 390.5 |
| 1960 | 427.8 |
| 1965 | 468.0 |
| 1970 | 510.9 |
| 1975 | 556.9 |
| 1980 | 605.7 |
| 1985 | 658.0 |
| 1990 | 713.8 |
| 1995 | 773.0 |
| 2000 | 835.6 |
| 2005 | 901.7 |
| 2010 | 971.5 |
| 2015 | 1045.0 |
| 2020 | 1122.0 |
| 2025 | 1202.4 |
Extrapolation may not be accurate, and more traditional machine learning (ML) approaches can be useful in this case.
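If you want to post-process an answer like this, you can parse the returned markdown table into structured values. The sketch below is a simple parser that assumes the response text contains a two-column table in the format shown above; the file name is an assumption, and ask_image and claude_sonnet come from the earlier sketch.

```python
def parse_population_table(text: str) -> list[tuple[int, float]]:
    """Extract (year, population) rows from a two-column markdown table in a model response."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only data rows: exactly two cells, and the first one is a year
        if len(cells) == 2 and cells[0].isdigit():
            rows.append((int(cells[0]), float(cells[1])))
    return rows

table_text = ask_image(
    claude_sonnet, "population_graph.png",
    "From this graph, create a table of the population of India where the first column is the year "
    "and the second column is the population in millions. Create rows for each 5-year period from 1950 to 2025."
)
print(parse_population_table(table_text)[:3])  # for example [(1950, 357.6), (1955, 390.5), (1960, 427.8)]
```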
You can also ask the model to combine information from the graph with its own knowledge. The next prompt asks for the 2021 population of the countries in the chart and their capitals.
Can you create a table for 2021 with country names in the first column, population in millions in the second column, and the country’s capital in the third column?
The following screenshot shows the response.
The LLM filled out the table based on the graph and its own knowledge of each country’s capital city.
You can also try something more difficult. The next image is a rainfall map of Australia.
You can ask the LLM specific questions, such as:
This is a rainfall map of Australia. Is it raining in Sydney now?
There is no need to tell the LLM where Sydney is or how rainfall is indicated on the map.
We get the following response:
According to the displayed precipitation map, no significant rainfall appears to occur in the Sydney region, located on the east coast of the Australian continent. Although the map uses contours and colors to represent rainfall levels, the area around Sydney does not have any shaded areas or contours to show the amount of rainfall in this particular time frame shown on the map. Therefore, the available data suggests that it is unlikely to be raining in Sydney during the times represented by this precipitation map.
There are some interesting observations from this use case.
- Anthropic’s Claude 3 Sonnet can read maps
- The model can read overlays on maps
- Phrases like “areas around Sydney” indicate that the model doesn’t need to work with exact information, but can use approximations just like humans.
Read flowcharts and architecture diagrams
You can read AWS architecture diagrams using the Meta Llama 3.2 90B Vision model. The following is an example architecture diagram for modernizing applications with microservices using Amazon Elastic Kubernetes Service (Amazon EKS).
Use the following prompt to read this diagram.
The steps in this diagram are numbered from 1 to 11. The number is indicated by a blue square. Could you please walk me through the diagram using the numbers 1 to 11 and explain what happens at each of those steps?
The following screenshot shows the response obtained from the LLM (truncated for brevity).
Additionally, you can ask follow-up questions about this diagram.
Why is a network load balancer required in this architecture?
The following screenshot shows the response from the model.
As you can see, the LLM acts as an advisor for questions related to this architecture.
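To ask a follow-up question like this programmatically, keep the diagram and the previous turns in the message history so the model retains context. The sketch below reuses the bedrock client and the illustrative llama_vision model ID from the earlier examples; the file name is also an assumption.

```python
# Multi-turn conversation about the same diagram: keep prior messages in the history
with open("eks_architecture.png", "rb") as f:  # illustrative file name
    diagram_bytes = f.read()

messages = [{
    "role": "user",
    "content": [
        {"image": {"format": "png", "source": {"bytes": diagram_bytes}}},
        {"text": "The steps in this diagram are numbered from 1 to 11. "
                 "Explain what happens at each of those steps."},
    ],
}]
first = bedrock.converse(modelId=llama_vision, messages=messages)
messages.append(first["output"]["message"])  # append the assistant's reply to the history

# The follow-up question reuses the same history, so the model still has the diagram in context
messages.append({
    "role": "user",
    "content": [{"text": "Why is a Network Load Balancer required in this architecture?"}],
})
second = bedrock.converse(modelId=llama_vision, messages=messages)
print(second["output"]["message"]["content"][0]["text"])
```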
However, the use of generative AI is not limited to software engineering. You can also read engineering, architectural, and healthcare diagrams and images.
This example uses a process diagram taken from Wikipedia.
To identify the purpose of this process diagram and explain the process, you can use the following prompt:
Could you please name the process shown in the image? Also, walk me through the process from left to right using the numbered steps.
The following screenshot shows the response.
The LLM successfully figured out that this diagram shows the Haber process for producing ammonia. It also walked us through the process.
Identify actions in images
You can identify and classify actions taking place in images. The model’s ability to accurately identify actions is further enhanced by its ability to analyze contextual information such as surrounding objects, the environment, and the location of individuals or entities within the image. By combining these visual cues and contextual elements, Anthropic’s Claude 3 Sonnet can make informed judgments about the nature of the action being performed, giving you a comprehensive understanding of the scene depicted in the image.
Below is an example where the model not only categorizes a player’s actions, but also compares those actions with those of professional players and provides feedback to the player.
We provide the model with an image of a tennis player, as shown below. This image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.
The following screenshot shows the prompt and model response.
Name your product and extract metadata to generate taglines and descriptions
In marketing and product development, coming up with the perfect product name or creative promotional content can be difficult. With the image-to-text capabilities of Anthropic’s Claude 3 Sonnet, you can upload an image of your product and have the model generate a unique product name and a tagline tailored to your target audience.
In this example, we provide the model with the following image of sneakers (the image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock).
The following screenshot shows the prompt.
The following screenshot shows the model’s response.
In the retail and e-commerce domain, Anthropic’s Claude 3 Sonnet can also be used to extract detailed product information from images for inventory management.
For example, use the prompt shown in the following screenshot.
The following screenshot shows the model’s response.
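As a sketch of how this inventory use case could be automated, you can ask the model for specific JSON keys and load the result directly. The field names, file name, and the assumption that the model returns only a JSON object are all illustrative; ask_image and claude_sonnet come from the earlier sketch.

```python
import json

metadata_prompt = (
    "Extract product metadata from this image. Respond only in JSON with the keys "
    "'product_type', 'color', 'material', and 'suggested_category'."
)
# Reuses the ask_image helper and the Claude 3 Sonnet model ID from the earlier sketch
raw = ask_image(claude_sonnet, "sneakers.png", metadata_prompt)
metadata = json.loads(raw)  # assumes the response is only the JSON object
print(metadata["product_type"], metadata["color"])
```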
Create a real estate listing for your property
You can upload images of your property’s floor plan and photos of the interior and exterior of your home to get a description to use in your real estate listing. This helps real estate agents increase their creativity and productivity when promoting properties. Architects can also use this mechanism to explain floor plans to customers.
Provide the model with the following example floor plan.
The following screenshot shows the prompt.
The following screenshot shows the response.
Generate recipes from food images
You can also create recipes based on food photos using Anthropic’s Claude 3 Sonnet. However, out of the box, the model can only identify dishes that were included in the data used to train it. Factors such as ingredient substitutions, cooking techniques, and cultural differences in cuisine can pose significant challenges.
For example, provide the model with the following image of a cake to extract the recipe. This image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.
The following screenshot shows the prompt.
The model identified the dish as a Black Forest cake and created a detailed recipe. This recipe may not produce a cake exactly as shown, but it will be close to a Black Forest cake.
Conclusion
FMs such as Anthropic’s Claude 3 Sonnet and the Meta Llama 3.2 90B Vision model available on Amazon Bedrock have demonstrated great capabilities in image processing. These FMs enable a variety of powerful capabilities, including image classification, optical character recognition (OCR), and the ability to interpret complex visuals such as graphs and architectural blueprints. These innovations provide novel solutions to difficult problems, from searching scanned document archives to generating image-inspired textual content to converting visual information into structured data.
To start using these features for your specific needs, we recommend exploring the Amazon Bedrock Chat Playground, which lets you work with images and extract information from them.
About the authors
Mitchell Shah is a Principal AI/ML Solutions Architect at Amazon Web Services. He helps commercial and public sector customers achieve business outcomes using AI/ML. He currently helps customers build chatbots and search functionality using LLM agents and Retrieval Augmented Generation (RAG).
Santosh Kulkarni is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML. He is passionate about generative AI and helping customers unlock business potential and drive actionable outcomes through machine learning at scale. Outside of work, he enjoys reading and traveling.