Large language models (LLMs) have come a long way from reading only text to reading and understanding graphs, diagrams, tables, and images. In this post, you learn how to use LLMs available in Amazon Bedrock to not only extract text but also understand the information contained in images.
Amazon Bedrock is a fully managed service that lets you choose high-performance foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, through a single API. It also provides a broad set of features for building generative AI applications with security, privacy, and responsible AI.
Solution overview
This post shows you how to use models in Amazon Bedrock to retrieve information from images, tables, and scanned documents. For example:
- Perform object classification and object detection tasks
- Read and query graphs
- Read flowcharts and architecture diagrams (such as AWS architecture diagrams) and convert them to text
We also discuss specific use cases where Amazon Bedrock can help you generate information from images:
- Identify actions in images
- Name your product and extract metadata to generate taglines and descriptions
- Create a real estate listing for your property
- Generate recipes from food images
Perform object classification and object detection tasks using an LLM
As of this writing, the best algorithms for object classification and detection are based on convolutional neural networks. However, LLMs such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock can also perform these tasks using zero-shot prompting. Zero-shot prompting is a technique in which a model is given a task without any task-specific examples or training. The following example shows how to classify an image into one of the classes you pass in the prompt. The image below is a sample Australian Medicare card.
To classify the image, use the following prompt with Anthropic’s Claude 3 Sonnet.
This is one of the documents uploaded by customers seeking loans. Images can belong to one of four classes: ID, bank statement, application, or utility bill. Please identify the class in this document and tell us just the class name. No other information is required at this time.
You receive the following response from the LLM.
Based on the image provided, the class of this document appears to be an ID card or identification card. The image shows what appears to be a Medicare card or similar government-issued ID card with a name and identification number displayed.
The class was correctly identified. However, responding in JSON makes it easier to process the response. Add the following line at the end of the prompt:
Provide the response in JSON format with the key as ‘class’ and the value as the document’s class.
We get a response like this:
{
"class": "ID"
}
You can now read JSON responses using a library of your choice, including the Python JSON library.
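The following is a minimal Python sketch of how such a call might look using boto3 and the Amazon Bedrock Converse API. The helper name ask_image, the file name medicare_card.png, and the model ID are illustrative assumptions; confirm the model IDs available in your account and AWS Region before using them.

```python
import json
import boto3

# Amazon Bedrock runtime client (assumes AWS credentials and a Region are configured)
bedrock = boto3.client("bedrock-runtime")

def ask_image(model_id: str, image_path: str, prompt: str) -> str:
    """Send an image plus a text prompt to a multimodal model through the Converse API."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

# Illustrative model ID and file name -- verify what is available in your Region
claude_sonnet = "anthropic.claude-3-sonnet-20240229-v1:0"
prompt = (
    "This is one of the documents uploaded by customers seeking loans. "
    "Images can belong to one of four classes: ID, bank statement, application, or utility bill. "
    "Please identify the class in this document and tell us just the class name. "
    "Provide the response in JSON format with the key as 'class' and the value as the document's class."
)
answer = ask_image(claude_sonnet, "medicare_card.png", prompt)
print(json.loads(answer)["class"])  # assumes the model returns only the JSON object, e.g. {"class": "ID"}
```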
The following example uses Meta’s Llama 3.2 90B Vision model to identify objects in an image. The next image shows a gearbox.
Identify the gearbox in the image using the following prompt:
Check out the entire mechanical part in this image. Specify only the name of the machine part in JSON format, with the key as “name” and the value as the name of the machine part.
The following screenshot shows the prompt and the answer the model gives.
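If you follow a pattern like the ask_image helper sketched earlier, calling Meta’s Llama 3.2 90B Vision model differs only in the model ID. The ID and file name below are illustrative; in some Regions this model is invoked through a cross-Region inference profile, so check the Amazon Bedrock documentation.

```python
# Illustrative model ID for Llama 3.2 90B Vision -- may need to be an inference profile ID in your Region
llama_vision = "us.meta.llama3-2-90b-instruct-v1:0"
gear_prompt = (
    "Check out the entire mechanical part in this image. Specify only the name of the machine part "
    "in JSON format, with the key as 'name' and the value as the name of the machine part."
)
print(ask_image(llama_vision, "gearbox.png", gear_prompt))  # reuses the helper from the previous sketch
```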
Read and query graphs
The following example shows how to read a graph and interpolate and extrapolate values within the graph. The following graph shows the population growth of several countries (source).
Pass this image to Anthropic’s Claude 3 Sonnet with the following prompt:
From this graph, create a table of the population of India where the first column is the year and the second column is the population in millions. Create rows for each 5-year period from 1950 to 2025.
Because we ask for a row for every 5-year period, the model must interpolate the values. We also ask for the table to extend to 2025, but because the data is only available until 2021, the model needs to extrapolate the values. We get the following response:
Below is a table of India’s population every five years from 1950 to 2025. The first column represents the year and the second column represents the population in millions.
| Year | Population (in millions) |
|------|--------------------------|
| 1950 | 357.6 |
| 1955 | 390.5 |
| 1960 | 427.8 |
| 1965 | 468.0 |
| 1970 | 510.9 |
| 1975 | 556.9 |
| 1980 | 605.7 |
| 1985 | 658.0 |
| 1990 | 713.8 |
| 1995 | 773.0 |
| 2000 | 835.6 |
| 2005 | 901.7 |
| 2010 | 971.5 |
| 2015 | 1045.0 |
| 2020 | 1122.0 |
| 2025 | 1202.4 |
Extrapolation may not be accurate, and more traditional machine learning (ML) approaches can be useful in this case.
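If you want to post-process an answer like this, you can parse the returned markdown table into structured values. The sketch below is a simple parser that assumes the response text contains a two-column table in the format shown above; the file name is an assumption, and ask_image and claude_sonnet come from the earlier sketch.

```python
def parse_population_table(text: str) -> list[tuple[int, float]]:
    """Extract (year, population) rows from a two-column markdown table in a model response."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only data rows: exactly two cells, and the first one is a year
        if len(cells) == 2 and cells[0].isdigit():
            rows.append((int(cells[0]), float(cells[1])))
    return rows

table_text = ask_image(
    claude_sonnet, "population_graph.png",
    "From this graph, create a table of the population of India where the first column is the year "
    "and the second column is the population in millions. Create rows for each 5-year period from 1950 to 2025."
)
print(parse_population_table(table_text)[:3])  # for example [(1950, 357.6), (1955, 390.5), (1960, 427.8)]
```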
You can also ask the model to combine information from the graph with its own knowledge. The next prompt asks for the 2021 population of the countries in the chart and their capitals.
Can you create a table for 2021 with country names in the first column, population in millions in the second column, and the country’s capital in the third column?
The following screenshot shows the response.
The LLM filled out the table based on the graph and its own knowledge of each country’s capital city.
You can also try something more difficult. The next image is a rainfall map of Australia.
You can ask the LLM specific questions, such as:
This is a rainfall map of Australia. Is it raining in Sydney now?
There is no need to tell the LLM where Sydney is or how rainfall is indicated on the map.
We get the following response:
According to the displayed precipitation map, no significant rainfall appears to occur in the Sydney region, located on the east coast of the Australian continent. Although the map uses contours and colors to represent rainfall levels, the area around Sydney does not have any shaded areas or contours to show the amount of rainfall in this particular time frame shown on the map. Therefore, the available data suggests that it is unlikely to be raining in Sydney during the times represented by this precipitation map.
There are some interesting observations from this use case.
- Anthropic’s Claude 3 Sonnet can read maps
- The model can read overlays on maps
- Phrases like “areas around Sydney” indicate that the model doesn’t need to work with exact information, but can use approximations just like humans.
Read flowcharts and architecture diagrams
You can read AWS architecture diagrams using the Meta Llama 3.2 90B Vision model. The following is an example architecture diagram for modernizing applications with microservices using Amazon Elastic Kubernetes Service (Amazon EKS).
Use the following prompt to read this diagram.
The steps in this diagram are numbered from 1 to 11. The number is indicated by a blue square. Could you please walk me through the diagram using the numbers 1 to 11 and explain what happens at each of those steps?
The following screenshot shows the response obtained from the LLM (truncated for brevity).
Additionally, you can ask follow-up questions about this diagram.
Why is a network load balancer required in this architecture?
The following screenshot shows the response from the model.
As you can see, the LLM acts as an advisor for questions related to this architecture.
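To ask a follow-up question like this programmatically, keep the diagram and the previous turns in the message history so the model retains context. The sketch below reuses the bedrock client and the illustrative llama_vision model ID from the earlier examples; the file name is also an assumption.

```python
# Multi-turn conversation about the same diagram: keep prior messages in the history
with open("eks_architecture.png", "rb") as f:  # illustrative file name
    diagram_bytes = f.read()

messages = [{
    "role": "user",
    "content": [
        {"image": {"format": "png", "source": {"bytes": diagram_bytes}}},
        {"text": "The steps in this diagram are numbered from 1 to 11. "
                 "Explain what happens at each of those steps."},
    ],
}]
first = bedrock.converse(modelId=llama_vision, messages=messages)
messages.append(first["output"]["message"])  # append the assistant's reply to the history

# The follow-up question reuses the same history, so the model still has the diagram in context
messages.append({
    "role": "user",
    "content": [{"text": "Why is a Network Load Balancer required in this architecture?"}],
})
second = bedrock.converse(modelId=llama_vision, messages=messages)
print(second["output"]["message"]["content"][0]["text"])
```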
However, the use of generative AI is not limited to software engineering. You can also read engineering, architectural, and healthcare diagrams and images.
This example uses a process diagram taken from Wikipedia.
To identify the purpose of this process diagram and explain the process, you can use the following prompt:
Could you please name the process shown in the image? Also, walk me through the process from left to right using the numbered steps.
The following screenshot shows the response.
The LLM successfully figured out that this diagram shows the Haber process for producing ammonia. It also walked us through the process.
Identify actions in images
You can identify and classify actions taking place in images. The model’s ability to accurately identify actions is further enhanced by its ability to analyze contextual information such as surrounding objects, the environment, and the location of individuals or entities within the image. By combining these visual cues and contextual elements, Anthropic’s Claude 3 Sonnet can make informed judgments about the nature of the action being performed, giving you a comprehensive understanding of the scene depicted in the image.
Below is an example where the model not only categorizes a player’s actions, but also compares those actions with those of professional players and provides feedback to the player.
We provide the model with an image of a tennis player, as shown below. This image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.
The following screenshot shows the prompt and model response.
Name your product and extract metadata to generate taglines and descriptions
In marketing and product development, coming up with the perfect product name or creative promotional content can be difficult. With the image-to-text capabilities of Anthropic’s Claude 3 Sonnet, you can upload an image of your product and have the model generate a unique product name and a tagline tailored to your target audience.
In this example, we provide the model with the following image of sneakers (the image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock).
The following screenshot shows the prompt.
The following screenshot shows the model’s response.
In the retail and e-commerce domain, Anthropic’s Claude 3 Sonnet can also be used to extract detailed product information from images for inventory management.
For example, use the prompt shown in the following screenshot.
The following screenshot shows the model’s response.
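As a sketch of how this inventory use case could be automated, you can ask the model for specific JSON keys and load the result directly. The field names, file name, and the assumption that the model returns only a JSON object are all illustrative; ask_image and claude_sonnet come from the earlier sketch.

```python
import json

metadata_prompt = (
    "Extract product metadata from this image. Respond only in JSON with the keys "
    "'product_type', 'color', 'material', and 'suggested_category'."
)
# Reuses the ask_image helper and the Claude 3 Sonnet model ID from the earlier sketch
raw = ask_image(claude_sonnet, "sneakers.png", metadata_prompt)
metadata = json.loads(raw)  # assumes the response is only the JSON object
print(metadata["product_type"], metadata["color"])
```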
Create a real estate listing for your property
You can upload images of your property’s floor plan and photos of the interior and exterior of your home to get a description to use in your real estate listing. This helps real estate agents increase their creativity and productivity when promoting properties. Architects can also use this mechanism to explain floor plans to customers.
Provide the model with the following example floor plan.
The following screenshot shows the prompt.
The following screenshot shows the response.
Generate recipes from food images
You can also create recipes based on food photos using Anthropic’s Claude 3 Sonnet. However, out of the box, the model can only identify dishes that were included in the data used to train it. Factors such as ingredient substitutions, cooking techniques, and cultural differences in cuisine can pose significant challenges.
For example, provide the model with the following image of a cake to extract the recipe. This image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.
The following screenshot shows the prompt.
The model identified the dish as a Black Forest cake and created a detailed recipe. This recipe may not produce a cake exactly as shown, but it will be close to a Black Forest cake.
Conclusion
FMs such as Anthropic’s Claude 3 Sonnet and the Meta Llama 3.2 90B Vision model available on Amazon Bedrock have demonstrated great capabilities in image processing. These FMs enable a variety of powerful capabilities, including image classification, optical character recognition (OCR), and the ability to interpret complex visuals such as graphs and architectural blueprints. These innovations provide novel solutions to difficult problems, from searching scanned document archives to generating image-inspired textual content to converting visual information into structured data.
To start using these features for your specific needs, we recommend exploring the Amazon Bedrock Chat Playground, which lets you work with images and extract information from them.
About the authors
Mitchell Shah is a Principal AI/ML Solutions Architect at Amazon Web Services. He helps commercial and public sector customers achieve business outcomes using AI/ML. He currently helps customers build chatbots and search functionality using LLM agents and Retrieval Augmented Generation (RAG).
Santosh Kulkarni is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML. He is passionate about generative AI and helping customers unlock business potential and drive actionable outcomes through machine learning at scale. Outside of work, he enjoys reading and traveling.