Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies, including AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, through a single API, along with a broad set of capabilities for building generative AI applications with security, privacy, and responsible AI.
With Amazon Bedrock, you can experiment with and evaluate top FMs for a variety of use cases. You can build agents that use enterprise systems and data sources to perform tasks, and privately customize models with your enterprise data using techniques such as Retrieval Augmented Generation (RAG). Amazon Bedrock knowledge bases let you aggregate data sources into a repository of information, making it straightforward to build applications that use RAG.
For many AI applications, access to up-to-date and comprehensive information from a variety of websites is essential for accurate and relevant data. Customers using Amazon Bedrock knowledge bases want to extend this capability to crawl and index public websites. Integrating a web crawler into a knowledge base lets them collect and use this web data efficiently. In this post, we explain how to do this seamlessly.
Web crawler for knowledge bases
The web crawler data source for Amazon Bedrock knowledge bases lets you build generative AI web applications for your end users based on crawled website data, using either the AWS Management Console or the API. By default, the web crawler starts from the seed URL you provide and traverses all child links under the same or a deeper URL path within the same top primary domain (TPD).
The current considerations are that the URLs must not require authentication, must not be host IP addresses, and must use a scheme that starts with http:// or https://. In addition, the web crawler fetches supported non-HTML files referenced on crawled pages, such as PDFs, text files, Markdown files, and CSV files, regardless of their URL, unless they are explicitly excluded. If multiple seed URLs are specified, the web crawler crawls any URL that matches the TPD and path of one of the seed URLs. You can specify up to 10 source URLs that the knowledge base uses as starting points for the crawl.
The web crawler does not traverse pages across different domains by default, although it does retrieve supported non-HTML files. This keeps the crawl within the specified boundaries and focused on the data source of interest.
Understanding synchronization scope
When you configure a knowledge base with web crawling capabilities, you can choose between different synchronization scope types to control which web pages are crawled. The following table shows example paths that will be crawled given a source URL for different sync scopes (https://example.com is used for illustration purposes).
| Sync scope type | Source URL | Example of crawled domain paths | Explanation |
| --- | --- | --- | --- |
| Default | https://example.com/products | https://example.com/products/product1, https://example.com/products/category/product2 | Same host and same initial path as the source URL |
| Host only | https://example.com/sellers | https://example.com/sellers/seller1, https://example.com/products | Same host as the source URL |
| Subdomains | https://example.com | https://blog.example.com, https://blog.example.com/posts/post1 | Subdomains of the primary domain of the source URL |
You can control the maximum crawl speed by setting the maximum crawl rate throttle; a higher value reduces the sync time. However, the crawl job always respects the domain's robots.txt file if one is present, following standard robots.txt directives such as Allow, Disallow, and crawl rate.
You can further narrow the scope of URLs to crawl by using inclusion and exclusion filters. These filters are regular expression (regex) patterns applied to each URL. If a URL matches an exclusion filter, it is ignored. Conversely, if inclusion filters are set, the crawler only processes URLs that match at least one of them. For example, you can exclude all URLs that end with .pdf by using the regex ^.*\.pdf$, or include only URLs that contain the word "products" by using the regex .*products.*.
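To illustrate how these filters behave, the following Python sketch applies the same example patterns to a few hypothetical URLs. The knowledge base applies such filters for you during a crawl, so this code is purely illustrative of the matching logic.

```python
import re

# Purely illustrative: the knowledge base applies these filters for you during a crawl.
exclusion_filters = [r"^.*\.pdf$"]     # ignore URLs that end with .pdf
inclusion_filters = [r".*products.*"]  # only process URLs containing "products"


def is_crawled(url: str) -> bool:
    """Return True if a URL passes the exclusion and inclusion filters."""
    if any(re.match(pattern, url) for pattern in exclusion_filters):
        return False
    if inclusion_filters:
        return any(re.match(pattern, url) for pattern in inclusion_filters)
    return True


for url in [
    "https://example.com/products/item1",
    "https://example.com/products/catalog.pdf",
    "https://example.com/about",
]:
    print(url, "->", is_crawled(url))
# https://example.com/products/item1 -> True
# https://example.com/products/catalog.pdf -> False
# https://example.com/about -> False
```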
Solution overview
The following sections provide step-by-step instructions for creating and testing a knowledge base with the web crawler, show how to create a knowledge base programmatically with the AWS SDK using a specific embedding model and an Amazon OpenSearch Service vector collection as the vector database, and explain how to monitor the web crawler.
Prerequisites
Make sure you have permission to crawl the URLs you use, that crawling them complies with the Amazon Terms of Use, and that bot detection is turned off for those URLs. The Amazon Bedrock knowledge base web crawler uses the user agent bedrockbot when crawling web pages.
Creating a knowledge base with a web crawler
To implement a web crawler for your knowledge base, follow these steps:
- On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
- Choose Create knowledge base.
- On the Provide knowledge base details page, configure the following:
  - Specify a name for the knowledge base.
  - In the IAM permissions section, select Create and use a new service role.
  - In the Choose data source section, select Web Crawler as the data source.
  - Choose Next.
- On the Configure data source page, configure the following:
  - Under Source URLs, enter https://www.aboutamazon.com/news/amazon-offices.
  - For Sync scope, select Host only.
  - For Include patterns, enter ^https?://www.aboutamazon.com/news/amazon-offices/.*$.
  - For the exclude pattern, enter .*plants.* (we don't want posts with the word "plants" in the URL).
  - For Content chunking and parsing, choose Default.
  - Choose Next.
- On the Select embeddings model and configure vector store page, configure the following:
  - In the Embeddings model section, select Titan Text Embeddings v2.
  - For Vector dimensions, enter 1024.
  - For Vector database, choose Quick create a new vector store.
  - Choose Next.
- Review the details and choose Create knowledge base.
In the preceding walkthrough, the include pattern and the Host only sync scope are used together to demonstrate how include patterns work with the web crawler. The same result could be achieved with the default sync scope, as explained in the previous section of this post.
When you create the knowledge base, selecting the Quick create a new vector store option creates an Amazon OpenSearch Serverless vector search collection for you. This option sets up a public vector search collection and a vector index with the required fields and configuration. Additionally, Amazon Bedrock knowledge bases manage the end-to-end ingestion and query workflows for you.
Test your knowledge base
Let’s look at the steps to test your knowledge base using a web crawler as a data source.
- In the Amazon Bedrock console, navigate to the knowledge base you created.
- Under Data source, select the data source name, then choose Sync. Depending on the size of your data, the sync can take anywhere from a few minutes to a few hours.
- When the sync job is complete, in the Test knowledge base pane, choose Select model and choose the model you want to use.
- Enter any of the following prompts and see the response from the model:
- How can I tour Amazon’s Seattle offices?
- Give us some information about Amazon’s HQ2.
- What is Amazon’s New York office like?
Citations are returned with the response referencing the source web pages, as shown in the following screenshot. The x-amz-bedrock-kb-source-uri metadata attribute contains the web page link, which helps you verify the accuracy of the response.
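If you prefer to test the knowledge base programmatically, the following sketch uses the RetrieveAndGenerate API of the boto3 bedrock-agent-runtime client. The knowledge base ID and model ARN are placeholders that you would replace with your own values.

```python
import boto3

# Runtime client used to query a knowledge base
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How can I tour Amazon's Seattle offices?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KNOWLEDGE_BASE_ID",  # placeholder
            # Placeholder model ARN; use any text model you have access to
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)

# Print the generated answer
print(response["output"]["text"])

# Print the source locations cited in the response
for citation in response.get("citations", []):
    for reference in citation.get("retrievedReferences", []):
        print(reference.get("location", {}))
```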
Creating a knowledge base using the AWS SDK
The following code uses the AWS SDK for Python (Boto3) to create a knowledge base in Amazon Bedrock using a given embedding model and an OpenSearch Service vector collection as the vector database.
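A minimal sketch of such a call, using the create_knowledge_base API of the boto3 bedrock-agent client, is shown below. The role ARN, collection ARN, index name, and field mapping values are placeholders that you would replace with values from your own environment.

```python
import boto3

# Control-plane client for Amazon Bedrock knowledge bases and data sources
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

response = bedrock_agent.create_knowledge_base(
    name="web-crawler-knowledge-base",  # placeholder name
    description="Knowledge base backed by a web crawler data source",
    roleArn="arn:aws:iam::123456789012:role/BedrockKBServiceRole",  # placeholder role ARN
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            # Titan Text Embeddings v2, as used in the console walkthrough
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:123456789012:collection/your-collection-id",  # placeholder
            "vectorIndexName": "bedrock-kb-index",  # placeholder index name
            "fieldMapping": {
                # Placeholder field names; they must match the fields in your vector index
                "vectorField": "bedrock-kb-vector",
                "textField": "AMAZON_BEDROCK_TEXT_CHUNK",
                "metadataField": "AMAZON_BEDROCK_METADATA",
            },
        },
    },
)

knowledge_base_id = response["knowledgeBase"]["knowledgeBaseId"]
print(f"Created knowledge base: {knowledge_base_id}")
```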
The following Python code uses Boto3 to create a web crawler data source for an Amazon Bedrock knowledge base by specifying the URL seed, crawl limits, and inclusion and exclusion filters.
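A corresponding sketch using the create_data_source API is shown below, reusing the seed URL, Host only scope, and include and exclude filters from the console walkthrough. The knowledge base ID, data source name, and crawl rate limit are placeholder values.

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

response = bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KNOWLEDGE_BASE_ID",  # placeholder: ID returned by create_knowledge_base
    name="amazon-offices-web-crawler",         # placeholder data source name
    dataSourceConfiguration={
        "type": "WEB",
        "webConfiguration": {
            "sourceConfiguration": {
                "urlConfiguration": {
                    "seedUrls": [
                        {"url": "https://www.aboutamazon.com/news/amazon-offices"}
                    ]
                }
            },
            "crawlerConfiguration": {
                "crawlerLimits": {"rateLimit": 300},  # placeholder maximum crawl rate
                "scope": "HOST_ONLY",
                "inclusionFilters": [r"^https?://www.aboutamazon.com/news/amazon-offices/.*$"],
                "exclusionFilters": [r".*plants.*"],
            },
        },
    },
)

data_source_id = response["dataSource"]["dataSourceId"]
print(f"Created data source: {data_source_id}")
```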
Monitoring
You can track the status of an ongoing web crawl in Amazon CloudWatch Logs, which report which URLs are visited and whether they were successfully retrieved, skipped, or failed. The following screenshot shows the CloudWatch logs for a crawl job.
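As a rough sketch, assuming you have configured log delivery for your knowledge base to a CloudWatch log group (the log group name below is a placeholder), you could list recent crawl events with the CloudWatch Logs API:

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Placeholder: replace with the log group you configured for knowledge base log delivery
log_group_name = "/aws/vendedlogs/bedrock/knowledge-base/YOUR_KB_LOG_GROUP"

# Fetch recent crawl events; filter further by URL or status as needed
response = logs.filter_log_events(
    logGroupName=log_group_name,
    limit=50,
)

for event in response["events"]:
    print(event["message"])
```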
Clean up
To clean up resources, follow these steps:
- Delete the knowledge base:
  - On the Amazon Bedrock console, choose Knowledge bases under Orchestration in the navigation pane.
  - Choose the knowledge base you created.
  - Note the AWS Identity and Access Management (IAM) service role name in the knowledge base overview.
  - Note the OpenSearch Serverless collection ARN in the Vector database section.
  - Choose Delete, then enter delete to confirm.
- Delete the vector database:
  - On the OpenSearch Service console, choose Collections under Serverless in the navigation pane.
  - Enter the collection ARN you saved in the search bar.
  - Select the collection and choose Delete.
  - Enter confirm in the confirmation prompt, then choose Delete.
- Delete the IAM service role:
  - On the IAM console, choose Roles in the navigation pane.
  - Search for the role name you noted earlier.
  - Select the role and choose Delete.
  - In the confirmation prompt, enter the role name to delete the role.
Conclusion
In this post, we showed that Amazon Bedrock knowledge bases now support web data sources, enabling you to index public web pages. This feature lets you efficiently crawl and index websites so that your knowledge base includes diverse and relevant information from the web. By taking advantage of the Amazon Bedrock infrastructure, you can use up-to-date and comprehensive data to improve the accuracy and effectiveness of your generative AI applications.
For pricing information, see Amazon Bedrock Pricing. To get started with Amazon Bedrock knowledge bases, see Creating a Knowledge Base. For more technical content, see Crawling Web Pages in an Amazon Bedrock Knowledge Base. To learn how the Builder community is using Amazon Bedrock in their solutions, visit the community.aws website.
About the Authors
Hardik Vasa is a Senior Solutions Architect at AWS. He focuses on generative AI and serverless technologies, helping customers make the most of AWS services. Hardik enjoys sharing his knowledge at various conferences and workshops. In his spare time, he enjoys learning about new technologies, playing video games, and spending time with his family.
Malini Chatterjee is a Senior Solutions Architect at AWS. She advises AWS customers on their workloads across a variety of AWS technologies and has broad expertise in data analytics and machine learning. Prior to joining AWS, she designed data solutions in the finance industry. She is a passionate semi-classical dancer and performs at community events. She loves traveling and spending time with her family.