
Document indexing and searching in AWS

Contextual overview

A bank would like to extend the window of online transactions its customers can review from 6 months to 5 years. In addition, the online bank statements must support full-text search across all fields in a statement.

Architecture diagram

(architecture diagram image)

Project objectives

1. To upload files from on-premises applications and store them in the cloud, we will use Amazon S3.
2. To index the bank transactions, we will deploy an Amazon OpenSearch Service domain.
3. To read the data from S3 and load it into OpenSearch for full-text search, we will use an AWS Glue ETL job.
4. To explore the data and build queries, we will use the built-in OpenSearch Dashboards.

Reproducibility guidelines

Required setup

1. Download the "glue_to_opensearch_job.py" file locally.
2. Create an ingestion bucket in S3 and upload the "elasticsearch-hadoop-7.8.0.jar" file to it.
3. In the S3 bucket, create an "input/" folder and upload the "transactions.csv.gz" file to it (steps 2 and 3 are sketched with boto3 after this list).
4. Create an IAM role for AWS Glue named "NewGlueServiceRole" with permissions to access S3 for any sources, targets, scripts, and temporary directories.
5. Create a t3.micro EC2 instance named "SearchInstance"; its public IP will serve the search page used in the final section.
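
If you prefer to script the S3 setup, here is a minimal boto3 sketch of steps 2 and 3. The bucket name is a placeholder; substitute your own globally unique name.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bank-ingestion-bucket"  # placeholder name, must be globally unique

# Outside us-east-1, create_bucket also needs a CreateBucketConfiguration
# with your region's LocationConstraint.
s3.create_bucket(Bucket=bucket)

# Connector JAR at the bucket root, transactions file under "input/".
s3.upload_file("elasticsearch-hadoop-7.8.0.jar", bucket, "elasticsearch-hadoop-7.8.0.jar")
s3.upload_file("transactions.csv.gz", bucket, "input/transactions.csv.gz")
```
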
Deploy and configure an Amazon OpenSearch domain

1. Navigate to the OpenSearch console and click "Create domain", using the following configuration:
- Domain name: bank-transactions.
- Domain creation method: standard create.
- Templates: dev/test.
- Deployment options: domain without standby.
- Availability Zones: 1 AZ.
- Engine options / version: 7.10.
- Data nodes / instance type: m5.large.search.
- Number of nodes: 1.
- Network: public access.
- Master user: create master user.
- Master username: project-user.
- Master password: ProjectUserD777!
- Access policy: only use fine-grained access control.
2. Click "Create" at the bottom of the page to finish this step (an equivalent boto3 call is sketched after this list).
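
The same domain can be created with boto3. Below is a sketch using the settings above; the EBS volume size is an assumption, and the encryption options are included because fine-grained access control requires them.

```python
import boto3

opensearch = boto3.client("opensearch")

opensearch.create_domain(
    DomainName="bank-transactions",
    EngineVersion="Elasticsearch_7.10",  # engine options / version: 7.10
    ClusterConfig={
        "InstanceType": "m5.large.search",
        "InstanceCount": 1,
        "ZoneAwarenessEnabled": False,  # 1 AZ
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 10},  # size assumed
    AdvancedSecurityOptions={  # fine-grained access control with a master user
        "Enabled": True,
        "InternalUserDatabaseEnabled": True,
        "MasterUserOptions": {
            "MasterUserName": "project-user",
            "MasterUserPassword": "ProjectUserD777!",
        },
    },
    # Required when fine-grained access control is enabled:
    NodeToNodeEncryptionOptions={"Enabled": True},
    EncryptionAtRestOptions={"Enabled": True},
    DomainEndpointOptions={"EnforceHTTPS": True},
)
```
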
Creating an ETL job using AWS Glue Studio

1. Open S3, select the checkbox next to "elasticsearch-hadoop-7.8.0.jar", click "Copy S3 URI" above it, and save the URI to a local file on your computer.
2. Select the "input/" folder and click "Copy S3 URI" at the top right of the page; save this URI as well.
3. Navigate to AWS Glue Studio and create an ETL job with the following configuration:
- Create job: Spark script editor.
- Options: upload and edit an existing script.
- File upload: click and upload the .py file in this repository.
- Click "Create" and name the script "bank-transactions-ingestion-job".
4. Switch to the "Job details" tab and select the following (an equivalent boto3 job definition is sketched after this list):
- IAM role: NewGlueServiceRole.
- Glue version: 2.0.
- Job bookmark: disable.
- Number of retries: 0.
- Under Advanced properties / Libraries / Dependent JARs path, paste the first S3 URI you copied (the connector JAR) and click "Save".
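
For scripted setups, the same job can be defined with boto3. The script location below is a placeholder and assumes you also uploaded "glue_to_opensearch_job.py" to the ingestion bucket.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="bank-transactions-ingestion-job",
    Role="NewGlueServiceRole",
    GlueVersion="2.0",
    MaxRetries=0,
    Command={
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://my-bank-ingestion-bucket/glue_to_opensearch_job.py",  # placeholder
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Dependent JARs path: the connector JAR copied in step 1.
        "--extra-jars": "s3://my-bank-ingestion-bucket/elasticsearch-hadoop-7.8.0.jar",
        "--job-bookmark-option": "job-bookmark-disable",
    },
)
```
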
Configure an ETL script to ingest Amazon S3 data into Amazon OpenSearch

1. Navigate to the OpenSearch Service console to verify that the domain is now available.
2. Click on the domain, then copy the domain endpoint and save it locally.
3. Go back to your Glue job and, under Job details / Job parameters, click "Add new parameter":
- Key: --es_endpoint.
- Value: the endpoint URL you copied.
4. Add another parameter:
- Key: --es_user.
- Value: project-user.
5. Add another parameter:
- Key: --es_pass.
- Value: ProjectUserD777!
6. Add another parameter:
- Key: --input_bucket.
- Value: the second S3 URI you copied (the "input/" folder).
7. Click "Save", then "Run".
8. Refresh the run details page to check that the run completed successfully. A sketch of how the script consumes these parameters follows this list.
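
To show how the four parameters are used, here is a simplified sketch of what a Glue script like "glue_to_opensearch_job.py" typically does. It is not the contents of the actual file; the index name "main-index" matches the one queried in the next section.

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Glue passes job parameters on the command line; keys are given without "--".
args = getResolvedOptions(sys.argv, ["es_endpoint", "es_user", "es_pass", "input_bucket"])

spark = SparkSession.builder.getOrCreate()

# Read the gzipped CSV(s) under the "input/" prefix.
df = spark.read.csv(args["input_bucket"], header=True, inferSchema=True)

# Write to OpenSearch through the elasticsearch-hadoop connector;
# es.nodes expects a bare hostname, so strip the scheme if present.
(df.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", args["es_endpoint"].replace("https://", ""))
    .option("es.port", "443")
    .option("es.net.ssl", "true")
    .option("es.nodes.wan.only", "true")
    .option("es.net.http.auth.user", args["es_user"])
    .option("es.net.http.auth.pass", args["es_pass"])
    .mode("overwrite")
    .save("main-index"))
```
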
Use OpenSearch Dashboards to call the Search API and query data from the OpenSearch domain

1. Navigate to the EC2 console and click on "Instances".
2. Copy the Public IPv4 address of the instance you created earlier and paste it into a new browser tab.
3. Type a word such as "credit" and click search to see the results.
4. Navigate to the OpenSearch console and click on the domain you created.
5. Click on the Kibana URL and enter the credentials you created earlier.
6. Select "Explore on my own".
7. Select the private tenant and click "Confirm".
8. Click "Interact with the Elasticsearch API".
9. Review the provided query example and click play; this query searches all indexes in your cluster.
10. After GET, type: /main-index/_search to target the index the Glue job loaded.
11. Test the query written in the "test-query.rtf" file in this repository.
12. Change the query's keywords to other terms related to bank transactions to see new results. An illustrative query call is sketched after this list.
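
The same search can also be issued from outside the Dashboards UI. Below is an illustrative sketch with the Python "requests" library; the project's own query lives in "test-query.rtf", and the field name here is an assumption.

```python
import requests

endpoint = "https://<your-domain-endpoint>"  # copied from the OpenSearch console

# Hypothetical full-text query; swap "description" for a real statement field.
query = {"query": {"match": {"description": "credit"}}}

resp = requests.get(
    f"{endpoint}/main-index/_search",
    json=query,
    auth=("project-user", "ProjectUserD777!"),  # the fine-grained access master user
)
print(resp.json()["hits"]["hits"])
```
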