Merge pull request #2483 from rajavaid77/rajavaid-feature-eventbridge-bedrock-s3-aoss

New serverless pattern - EventBridge-Bedrock-S3-AOSS
julianwood authored Nov 15, 2024
2 parents 012d9d9 + ff0dde8 commit a131944
Showing 31 changed files with 1,117 additions and 0 deletions.
10 changes: 10 additions & 0 deletions eventbridge-bedrock-s3-aoss/.gitignore
@@ -0,0 +1,10 @@
*.swp
package-lock.json
__pycache__
.pytest_cache
.venv
*.egg-info

# CDK asset staging directory
.cdk.staging
cdk.out
196 changes: 196 additions & 0 deletions eventbridge-bedrock-s3-aoss/README.md
@@ -0,0 +1,196 @@
# Automatically sync your data with your Amazon Bedrock knowledge base using Amazon EventBridge Scheduler
![architecture](architecture/architecture.png)

This pattern demonstrates an approach to automatically sync the data sources associated with [Amazon Bedrock Knowledge Bases](https://aws.amazon.com/bedrock/knowledge-bases/). Knowledge Bases help you take advantage of [Retrieval Augmented Generation](https://aws.amazon.com/what-is/retrieval-augmented-generation/) (RAG), a popular technique that involves drawing information from a data store to augment the responses generated by Large Language Models (LLMs). When you set up a knowledge base with your data sources, your application can query the knowledge base to answer questions, either with direct quotations from sources or with natural responses generated from the query results.

After you create your knowledge base, you ingest your data sources into it so that they are indexed and can be queried. Additionally, each time you add, modify, or remove files from a data source, you must sync the data source so that it is re-indexed into the knowledge base. Syncing is incremental, so Bedrock only processes documents that have been added, modified, or deleted since the last sync.

At the time of writing, Knowledge Bases does not offer a native feature to periodically sync the data sources associated with a knowledge base. Customers who need to refresh their data sources periodically to keep their knowledge base up to date must therefore rely on bespoke solutions. This pattern shows one way of implementing such a solution, using [Amazon EventBridge Scheduler](https://docs.aws.amazon.com/scheduler/latest/UserGuide/what-is-scheduler.html).

EventBridge Scheduler simplifies scheduling tasks by providing a centralized, serverless service that reliably executes schedules and invokes targets across various AWS services. In this pattern, we configure an EventBridge schedule that runs periodically (using a schedule expression). As part of the schedule creation, we configure a target. A target is an API operation that EventBridge Scheduler invokes on your behalf whenever the schedule runs. In this case, the target is the [`StartIngestionJob`](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_StartIngestionJob.html) operation on the Bedrock Agents API.
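
To make the target configuration concrete, the sketch below builds the request parameters that EventBridge Scheduler's universal target convention expects for `StartIngestionJob` (the CDK stack in this pattern creates the equivalent resource for you). The role ARN and IDs are placeholders, and the `aws-sdk` target ARN follows the documented universal target format; treat the exact service segment as an assumption to verify against the universal targets documentation.

```python
import json

def build_sync_schedule_request(schedule_name, role_arn, knowledge_base_id,
                                data_source_id, rate="rate(5 minutes)"):
    """Build the parameters for scheduler.create_schedule() so the schedule
    invokes the Bedrock Agents StartIngestionJob API as a universal target."""
    return {
        "Name": schedule_name,
        "ScheduleExpression": rate,
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            # Universal target ARN: arn:aws:scheduler:::aws-sdk:<service>:<apiAction>
            "Arn": "arn:aws:scheduler:::aws-sdk:bedrockagent:startIngestionJob",
            "RoleArn": role_arn,
            # JSON payload passed to StartIngestionJob when the schedule fires
            "Input": json.dumps({
                "KnowledgeBaseId": knowledge_base_id,
                "DataSourceId": data_source_id,
            }),
        },
    }

# Placeholder values for illustration only
request = build_sync_schedule_request(
    "BedrockKBDataSourceSyncSchedule",
    "arn:aws:iam::123456789012:role/EventBridgeSchedulerRole",
    "KB123EXAMPLE", "DS123EXAMPLE",
)
# To create the schedule: boto3.client("scheduler").create_schedule(**request)
```

Because the target role is assumed by EventBridge Scheduler, it must grant `bedrock:StartIngestionJob` on the knowledge base, which is how the CDK stack wires its `EventBridgeSchedulerRole`.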

Learn more about this pattern at Serverless Land Patterns: https://serverlessland.com/patterns/eventbridge-bedrock-s3-aoss

> [!Important]
> This application uses various AWS services and there are costs associated with these services beyond Free Tier usage - please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.
## Requirements

* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources.
* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed and configured
* [Git Installed](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
* [Node and NPM](https://nodejs.org/en/download/) installed
* [AWS Cloud Development Kit](https://docs.aws.amazon.com/cdk/latest/guide/cli.html) (AWS CDK) installed

> [!Important]
> This pattern uses Knowledge Bases and the Titan Text Embeddings V2 model. See [Supported regions and models for Amazon Bedrock knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-supported.html) to select a region where Knowledge Bases is supported.
## Enable Model Access in Bedrock console
Knowledge bases use a foundation model to embed your data sources in a vector store. Before creating a knowledge base and selecting an embeddings model for it, you must request access to the model. If you try to use the model (through the API or console) before you have requested access, you receive an error message. For more information, see [Model access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html).

1. In the AWS console, select the region from which you want to access Bedrock.

![Region Selection](images/region-selection.png)

2. Find **Amazon Bedrock** by searching in the AWS console.

![Bedrock Search](images/bedrock-search.png)

3. Expand the side menu.

![Bedrock Expand Menu](images/bedrock-menu-expand.png)

4. From the side menu, select **Model access**.

![Model Access](images/model-access-link.png)

5. Depending on your view, select the **Enable specific models** button or the **Modify Model Access** button.

![Model Access View](images/model-access-view.png)


6. Use the checkboxes to select the models you wish to enable. Review the applicable EULAs as needed. Click **Next** to go to the Review screen, and then **Submit** to enable the required models in your account. By default, this pattern only requires Titan Text Embeddings V2 (model ID: _amazon.titan-embed-text-v2:0_).

## Deployment Instructions

1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:
```
git clone https://github.com/aws-samples/serverless-patterns
```
2. Change directory to the pattern directory:
```
cd serverless-patterns/eventbridge-bedrock-s3-aoss
```
3. Create a Python virtual environment:
```
python3 -m venv .venv
```
4. Activate the virtual environment:
```
source .venv/bin/activate
```
5. Install the required Python dependencies:
```
pip install -r requirements.txt
```
6. Install the dependencies used in the Lambda layer:
```
pip install --target layers/python -r layers/requirements.txt
```
7. Bootstrap your account; CDK requires this before it can deploy:
```
cdk bootstrap
```
8. See the list of the IDs of the stacks in the AWS CDK application:
```
cdk list
```
9. Review the CloudFormation template CDK generates for the included stacks using the following AWS CDK CLI command:
> [!NOTE]
> Substitute the stack_id with one from the list in output from the `cdk list` command
```
cdk synth <stack_id>
```
10. From the command line, use AWS CDK to deploy the AWS resources.
```
cdk deploy --all
```
Enter `y` if prompted `Do you wish to deploy these changes (y/n)?`
> [!NOTE]
> You can optionally change the `collection_name`, `index_name`, `knowledge_base_name`, and `kb_s3_datasource_name` parameters in `cdk.context.json`. These parameters name the OpenSearch Serverless collection, the index, the knowledge base, and the associated S3 data source, respectively.
## How it works
Upon deployment, the CDK stacks create a Bedrock knowledge base configured with an S3 bucket as the data source and an OpenSearch Serverless collection to store the vector data. A data source repository contains files or content with information that can be retrieved when your knowledge base is queried. The stacks also include an EventBridge schedule that is configured to run every 5 minutes and invoke the `StartIngestionJob` operation on the Bedrock Agents API. Bedrock supports a monitoring system to help you understand the execution of data ingestion jobs; the stack creates the necessary CloudWatch log groups and a CloudWatch delivery, so you can gain visibility into the ingestion of your knowledge base resources through this logging system. Additionally, Bedrock is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or AWS service in Bedrock. CloudTrail captures all API calls for Bedrock as events.
## Testing
### Verify Event Scheduler is ENABLED
The EventBridge schedule should be enabled by default when stack creation is complete. You can verify this by running the command below. The expected output is `ENABLED`, which means the schedule is enabled and ready to run at the next scheduled time.
```
aws scheduler get-schedule --name BedrockKBDataSourceSyncSchedule --group BedrockKBSyncScheduleGroup --query 'State' --output text
```
### Upload Document(s) to S3 Bucket
Upload a sample PDF document to the S3 bucket that is configured as the knowledge base data source. You can provide your own or use one of the PDFs provided in the `examples` folder. You can find the bucket name in the Outputs section of the `cdk deploy` command output of the `BedrockKBStack`.
> [!NOTE]
> Substitute the value from `BedrockKBStack.bucketname` found in the Outputs section of the `cdk deploy` command output of the `BedrockKBStack`
```
aws s3 cp examples/2022-Shareholder-Letter.pdf s3://<BedrockKBStack.bucketname>
```
> [!Important]
> Wait for the next scheduled run before running the commands below. By default, this stack configures the schedule to run every 5 minutes. You can find the schedule rate by running the command below. The expected output is `rate(5 minutes)`.
```
aws scheduler get-schedule --name BedrockKBDataSourceSyncSchedule --group BedrockKBSyncScheduleGroup --query 'ScheduleExpression' --output text
```
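
If you script the wait for the next run, the schedule expression returned above can be parsed into a concrete interval. This is a minimal helper sketch, assuming the standard `rate(value unit)` grammar for EventBridge Scheduler rate expressions; `cron(...)` expressions are not handled.

```python
import re
from datetime import timedelta

def parse_rate_expression(expr):
    """Parse an EventBridge Scheduler rate expression such as
    'rate(5 minutes)' into a timedelta, so a script can wait that long."""
    m = re.fullmatch(r"rate\((\d+) (minute|minutes|hour|hours|day|days)\)",
                     expr.strip())
    if not m:
        raise ValueError(f"unsupported rate expression: {expr}")
    value, unit = int(m.group(1)), m.group(2).rstrip("s")
    # timedelta takes plural keyword arguments (minutes=, hours=, days=)
    return timedelta(**{unit + "s": value})

print(parse_rate_expression("rate(5 minutes)"))  # 0:05:00
```

A polling script could then `time.sleep()` for the parsed interval before checking the ingestion job status.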
### View CloudTrail log for StartIngestionJob
1. In the CloudTrail console, click on Event history. Event history provides a viewable, searchable, downloadable, and immutable record of the past 90 days of management events.
![CloudTrail Event History](images/cloudtrail-eventhistory.png)
2. Filter using the Event name **StartIngestionJob**, as well as by date and time (for example, Last 20 minutes).
![StartIngestionJob Event](images/startingestionjob-event.png)
3. In the event record, notice that `sessionContext.sessionIssuer.userName` mentions `EventBridgeSchedulerRole`, the role that was created by the CDK stack and assigned to the EventBridge schedule. Also, the `userAgent` field indicates `AmazonEventBridgeScheduler` as the agent through which the request was made.
### Tail the CloudWatch Logs to look for Sync Events
The CDK stack creates resources to enable logging for the knowledge base using CloudWatch constructs.
See [Knowledge bases logging](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-bases-logging.html) for more information.
The following command tails the CloudWatch log to view KnowledgeBase events as they are logged.
> [!NOTE]
> Substitute the `BedrockKBStack.knowledgebaseid` found in the CDK Output section of the `cdk deploy` command output of the `BedrockKBStack`
```
aws logs tail --follow --since 20m BedrockKnowledgeBase-<BedrockKBStack.knowledgebaseid>
```
The command should output CloudWatch log entries for the various stages of the ingestion process (such as `INGESTION_JOB_STARTED`, `CRAWLING_COMPLETED`, `EMBEDDING_STARTED`, and so on). The final log statement for a given ingestion job ID should indicate the `COMPLETED` status of the job, as in the screenshot below. That log entry also outputs the resource stats, including the number of documents ingested into the knowledge base.
Sample Output
![cloudwatch-log](images/cloudwatch-log.png)
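
If you consume these log events programmatically rather than tailing them, a small reducer can track the latest stage seen per job. The field names (`ingestion_job_id`, `status`) and sample values below are illustrative, not the exact knowledge base log schema - adapt them to the fields you observe in your log entries.

```python
def last_status_per_job(events):
    """Given parsed log events (dicts), return the most recent status seen
    for each ingestion job ID, in the order the events arrived."""
    statuses = {}
    for event in events:
        job_id = event.get("ingestion_job_id")
        status = event.get("status")
        if job_id and status:
            statuses[job_id] = status  # later events overwrite earlier ones
    return statuses

# Hypothetical sample events mirroring the stages described above
sample = [
    {"ingestion_job_id": "job-1", "status": "INGESTION_JOB_STARTED"},
    {"ingestion_job_id": "job-1", "status": "CRAWLING_COMPLETED"},
    {"ingestion_job_id": "job-1", "status": "COMPLETED"},
]
print(last_status_per_job(sample))  # {'job-1': 'COMPLETED'}
```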
### View Ingestion Job timestamp and status
You can also use the following command to check the status of ingestion job(s). The command outputs the most recent ingestion job.
> [!NOTE]
> Substitute the BedrockKBStack.knowledgebaseid and BedrockKBStack.datasourceid found in the Output section of the `cdk deploy` command output of the `BedrockKBStack`
```
aws bedrock-agent list-ingestion-jobs --knowledge-base-id <BedrockKBStack.knowledgebaseid> --data-source-id <BedrockKBStack.datasourceid> --query 'reverse(sort_by(ingestionJobSummaries,&startedAt))[:1].{startedAt:startedAt, updatedAt:updatedAt,ingestionJobId:ingestionJobId,status:status}'
```
Sample Output
![list-ingestion-jobs-output](images/list-ingestion-jobs-output.png)
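
The same "most recent job" selection that the JMESPath query performs can be done client-side with boto3. This sketch operates on the `ingestionJobSummaries` list returned by `list_ingestion_jobs`; the sample entries use ISO-8601 strings for readability (real responses return datetime objects, which compare the same way under `max`).

```python
def latest_ingestion_job(job_summaries):
    """Return the most recently started entry from ingestionJobSummaries,
    projected to the same fields as the CLI query above."""
    if not job_summaries:
        return None
    latest = max(job_summaries, key=lambda job: job["startedAt"])
    return {key: latest[key]
            for key in ("startedAt", "updatedAt", "ingestionJobId", "status")}

# Hypothetical sample summaries for illustration
jobs = [
    {"startedAt": "2024-11-15T10:00:00Z", "updatedAt": "2024-11-15T10:01:00Z",
     "ingestionJobId": "A1", "status": "COMPLETE"},
    {"startedAt": "2024-11-15T10:05:00Z", "updatedAt": "2024-11-15T10:06:00Z",
     "ingestionJobId": "B2", "status": "STARTING"},
]
print(latest_ingestion_job(jobs)["ingestionJobId"])  # B2
```

In a live script, `job_summaries` would come from `boto3.client("bedrock-agent").list_ingestion_jobs(...)["ingestionJobSummaries"]`.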
## Cleanup
1. Run the following command in the `eventbridge-bedrock-s3-aoss` directory to delete the AWS resources created by this sample stack.
```bash
cdk destroy --all
```
## Extra Resources
* [Amazon Bedrock API Reference](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)
* [Sync to ingest your data sources into the knowledge base](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ingest.html)
* [What is Amazon EventBridge Scheduler?](https://docs.aws.amazon.com/scheduler/latest/UserGuide/what-is-scheduler.html)
* [Using universal targets with EventBridge Scheduler](https://docs.aws.amazon.com/scheduler/latest/UserGuide/managing-targets-universal.html)
----
Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0
32 changes: 32 additions & 0 deletions eventbridge-bedrock-s3-aoss/app.py
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
import os
import aws_cdk as cdk
from stacks.bedrock_knowledgebase_stack import BedrockKnowledgebaseStack
from stacks.opensearch_serverless_stack import OpenSearchServerlessStack
from stacks.ingestion_job_resources_stack import IngestionJobResourcesStack
from stacks.bedrock_service_role_stack import BedrockServiceRoleStack


app = cdk.App()

bedrock_sr_ap_stack = BedrockServiceRoleStack(app, "BedrockServiceRoleStack")

opensearch_serverless_stack = OpenSearchServerlessStack(
    app,
    "AOSSStack",
    bedrock_kb_service_role_arn=bedrock_sr_ap_stack.bedrock_kb_service_role_arn,
)

bedrock_kb_stack = BedrockKnowledgebaseStack(
    app,
    "BedrockKBStack",
    cfn_aoss_collection_arn=opensearch_serverless_stack.cfn_aoss_collection_arn,
    index_name=opensearch_serverless_stack.index_name,
    bedrock_kb_service_role_arn=bedrock_sr_ap_stack.bedrock_kb_service_role_arn,
)

ingestion_job_resources_stack = IngestionJobResourcesStack(
    app,
    "SchedulerStack",
    knowledge_base_id=bedrock_kb_stack.knowledge_base_id,
    data_source_id=bedrock_kb_stack.knowledgebase_datasource_id,
)

app.synth()
16 changes: 16 additions & 0 deletions eventbridge-bedrock-s3-aoss/cdk.context.json
@@ -0,0 +1,16 @@
{
"opensearch_serverless_params": {
"collection_name": "bedrock-kb",
"index_name": "bedrock-kb-index"
},
"bedrock_knowledgebase_params": {
"knowledge_base_name": "rag-knowledge-base",
"kb_s3_datasource_name":"kb-s3-datasource",
"embedding_model_id": "amazon.titan-embed-text-v2:0",
"vector_index_metadata_field":"text-metadata",
"vector_index_text_field":"text",
"vector_index_vector_field":"vector",
"kb_cw_log_group_name_prefix":"BedrockKnowledgeBase",
"bedrock_kb_log_delivery_source":"bedrock_kb_log_delivery_source"
}
}
52 changes: 52 additions & 0 deletions eventbridge-bedrock-s3-aoss/cdk.json
@@ -0,0 +1,52 @@
{
"app": "python3 app.py",
"watch": {
"include": [
"**"
],
"exclude": [
"README.md",
"cdk*.json",
"requirements*.txt",
"source.bat",
"**/__init__.py",
"python/__pycache__",
"tests"
]
},
"context": {
"@aws-cdk/aws-lambda:recognizeLayerVersion": true,
"@aws-cdk/core:checkSecretUsage": true,
"@aws-cdk/core:target-partitions": [
"aws",
"aws-cn"
],
"@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true,
"@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true,
"@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true,
"@aws-cdk/aws-iam:minimizePolicies": true,
"@aws-cdk/core:validateSnapshotRemovalPolicy": true,
"@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true,
"@aws-cdk/aws-s3:createDefaultLoggingPolicy": true,
"@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true,
"@aws-cdk/aws-apigateway:disableCloudWatchRole": true,
"@aws-cdk/core:enablePartitionLiterals": true,
"@aws-cdk/aws-events:eventsTargetQueueSameAccount": true,
"@aws-cdk/aws-iam:standardizedServicePrincipals": true,
"@aws-cdk/aws-ecs:disableExplicitDeploymentControllerForCircuitBreaker": true,
"@aws-cdk/aws-iam:importedRoleStackSafeDefaultPolicyName": true,
"@aws-cdk/aws-s3:serverAccessLogsUseBucketPolicy": true,
"@aws-cdk/aws-route53-patters:useCertificate": true,
"@aws-cdk/customresources:installLatestAwsSdkDefault": false,
"@aws-cdk/aws-rds:databaseProxyUniqueResourceName": true,
"@aws-cdk/aws-codedeploy:removeAlarmsFromDeploymentGroup": true,
"@aws-cdk/aws-apigateway:authorizerChangeDeploymentLogicalId": true,
"@aws-cdk/aws-ec2:launchTemplateDefaultUserData": true,
"@aws-cdk/aws-secretsmanager:useAttachedSecretResourcePolicyForSecretTargetAttachments": true,
"@aws-cdk/aws-redshift:columnId": true,
"@aws-cdk/aws-stepfunctions-tasks:enableEmrServicePolicyV2": true,
"@aws-cdk/aws-ec2:restrictDefaultSecurityGroup": true,
"@aws-cdk/aws-apigateway:requestValidatorUniqueId": true,
"@aws-cdk/aws-kms:aliasNameRef": true
}
}