This work shows how to create a semantic search engine over a set of Wikipedia pages and deploy it as a service on AWS.
Infrastructure as Code (IaC) is used through AWS CDK.
Here are the descriptions of some directories and files in the repo:
src/wiki.py
gets random pages from Wikipedia, enriches them with metadata, and uploads them to an S3 bucket. This can also be run using
python src/scripts.py upload-random-pages -n <NUMBER_OF_RANDOM_PAGES_TO_UPLOAD>
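As a rough illustration of what this script does, here is a minimal sketch (not the repo's actual code) that fetches random pages with the `wikipedia` package and uploads them to S3 with boto3; the bucket name and document fields are assumptions:

```python
# Minimal sketch of the upload flow; the bucket name and document fields
# are assumptions, not the repo's actual code.
import json

import boto3
import wikipedia  # pip install wikipedia

BUCKET = "semwiki-pages"  # hypothetical bucket name


def upload_random_pages(n: int) -> None:
    s3 = boto3.client("s3")
    titles = wikipedia.random(pages=n)
    titles = [titles] if isinstance(titles, str) else titles
    for title in titles:
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            continue  # skip ambiguous or missing pages
        doc = {
            "id": page.pageid,       # metadata enrichment
            "title": page.title,
            "url": page.url,
            "content": page.content,
        }
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{page.pageid}.json",
            Body=json.dumps(doc).encode("utf-8"),
        )


if __name__ == "__main__":
    upload_random_pages(5)
```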
lambda_indexer/
defines a lambda function that is attached to create events in the S3 bucket. It sends the page content to the embedding service and references the document and its embedding in the Elasticsearch cluster.
universal-sentence-encoder/
defines the docker image that is pushed to ECR and then deployed in ECS. It provides a service that, given a text, returns its embedding. It can be used as follows:
curl -XPOST -d '{"instances": ["text 1 to query", "text 2 to query"]}' http://<EMBEDDER_IP>:8501/v1/models/USE_3:predict | jq
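Putting the two pieces above together, here is a minimal sketch of the indexing flow (not the repo's actual handler). The environment variable names, index name, and document fields are assumptions, and Elasticsearch authentication is omitted:

```python
# Sketch of the indexer Lambda: S3 create event -> embed content -> index
# the document with its embedding in Elasticsearch. Names are assumptions.
import json
import os
import urllib.request

import boto3

EMBEDDER_URL = os.environ.get(
    "EMBEDDER_URL", "http://<EMBEDDER_IP>:8501/v1/models/USE_3:predict"
)
ES_URL = os.environ.get("ES_URL", "https://<ES_ENDPOINT>")
INDEX = "semwiki"

s3 = boto3.client("s3")


def embed(text: str) -> list:
    """Call the TF Serving endpoint and return the embedding vector."""
    payload = json.dumps({"instances": [text]}).encode("utf-8")
    req = urllib.request.Request(EMBEDDER_URL, data=payload, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"][0]


def handler(event, context):
    # Triggered by S3 ObjectCreated notifications on the pages bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        doc = json.loads(body)
        doc["embedding"] = embed(doc["content"])
        # Index the document under its id (auth omitted for brevity).
        req = urllib.request.Request(
            f"{ES_URL}/{INDEX}/_doc/{doc['id']}",
            data=json.dumps(doc).encode("utf-8"),
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```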
src/es
is related to the Elasticsearch cluster where the pages are referenced. It contains modules to create the index, index a document, and search the index using knn similarity.
src/api
a Sanic server listening on port 8000 with an endpoint /search
which takes a text as a query, sends it to the embedding service, takes the returned embedding, and sends a query to the Elasticsearch index with that embedding. Finally, it returns the results. To use it, run
src/api/search_server
then execute the following command, which requests the 3 most similar Wikipedia pages to the text "beautiful painting":
curl -XGET -d '{"query": "beautiful painting"}' localhost:8000/search?n=3 | jq
This server is also deployed as an AWS service which can be reached through
curl -XGET -d '{"query": "beautiful painting"}' <API_SERVICE_IP>:8000/search\?n=3 | jq
The setup is not production grade, but the Makefile is simple enough to read and follow.
Start by creating and activating the virtual environment:
make env-create
source .venv/bin/activate
- bootstrap environment
make cdk-bootstrap-environment
- create docker registries
make docker-create-ecr-embedder
make docker-create-ecr-api
- deploy Elasticsearch stack
make deploy-es
- in src/config/config.ini update es_url with the Elasticsearch endpoint. This can be obtained with
make echo-elastic-search-endpoint
- create the index in the cluster (a sketch of a possible index mapping is given after this list)
make create-es-index
- build and push embedder image
make docker-embedder-image && make docker-embedder-push
- deploy embedding stack
make deploy-embedder
- check the embedding service
- Embedder IP can be obtained with
make echo-embedder-ip
curl -XPOST -d '{"instances": ["toto", "tata"]}' http://<EMBEDDER_IP>:8501/v1/models/USE_3:predict | jq
- in src/config/config.ini update public_ip with the embedding service public IP. This is a workaround to pass the embedding service address to the indexer lambda and the API service. A robust solution would be to assign a load balancer with an elastic IP to the embedder service, but this would increase the cost, and the purpose of this repo is only to showcase the semantic search solution.
- package indexing lambda
make lambda-indexer-package
- deploy WikiReferencing stack (S3 bucket + Lambda function + S3 Notification)
make deploy-referencing
- check the referencing stack by sending a batch of Wikipedia pages; you should find JSON files added to the S3 bucket and the corresponding pages indexed in the Elasticsearch index semwiki
python src/scripts.py upload-random-pages -n 5
- list documents in Elasticsearch index
- Elasticsearch endpoint can be obtained with
make echo-elastic-search-endpoint
curl -XGET -u 'semwiki:SemWiki21!' -H 'Content-Type: application/json' \
  -d '{"_source": "title", "query": {"match_all": {}}}' https://<ES_ENDPOINT>/semwiki/_search | jq
- build and push API image
make docker-api-image && make docker-api-push
- deploy API service
make deploy-api
- test API service
- Search API IP can be obtained with
make echo-api-ip
curl -XGET -d '{"query": "entertainment"}' http://<API_SERVICE_IP>:8000/search\?n\=3 | \ jq '.[] | {"title": .title, "url": .url}'
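As mentioned in the create-the-index step above, here is a minimal sketch of what the index creation in src/es might look like. The mapping, the field names, and the vector dimension (512, the output size of the Universal Sentence Encoder) are assumptions, not the repo's actual settings; the credentials are the ones used in the curl examples.

```python
# Sketch of index creation for the semwiki index; the mapping and field
# names are assumptions.
import base64
import json
import urllib.request

ES_URL = "https://<ES_ENDPOINT>"  # from `make echo-elastic-search-endpoint`
INDEX = "semwiki"
AUTH = base64.b64encode(b"semwiki:SemWiki21!").decode()

mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "url": {"type": "keyword"},
            "content": {"type": "text"},
            # dense_vector enables vector similarity scoring; 512 is the
            # Universal Sentence Encoder output dimension.
            "embedding": {"type": "dense_vector", "dims": 512},
        }
    }
}

req = urllib.request.Request(
    f"{ES_URL}/{INDEX}",
    data=json.dumps(mapping).encode("utf-8"),
    method="PUT",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {AUTH}",
    },
)
print(urllib.request.urlopen(req).read().decode())
```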
The Elasticsearch service and the embedding service get a new IP every time their containers are re-instantiated, which makes them difficult to reach. A load balancer with a fixed public IP would overcome this problem, but since the objective of this project is only to show the main idea of how to implement a semantic search engine, we do not want unnecessary costs related to these additional resources.
Consequently, for now, when the embedder becomes unreachable from the indexing lambda or the search service (which happens when the embedder service is killed and re-created), their respective code and docker images have to be updated with the new embedder IP.
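The IP and the Elasticsearch endpoint are the values set in src/config/config.ini during the deployment steps. Here is a minimal sketch of how a component might read them; the section name is an assumption, while the keys es_url and public_ip are the ones mentioned above:

```python
# Reading the deployment values from src/config/config.ini; the section
# name ("DEFAULT") is an assumption.
from configparser import ConfigParser

config = ConfigParser()
config.read("src/config/config.ini")

es_url = config["DEFAULT"]["es_url"]          # Elasticsearch endpoint
embedder_ip = config["DEFAULT"]["public_ip"]  # embedding service public IP
embedder_url = f"http://{embedder_ip}:8501/v1/models/USE_3:predict"
```

Finally, here are some useful Elasticsearch queries (console syntax) for inspecting and cleaning the semwiki index: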
GET /_cat/indices?v=true&s=index&pretty
GET /_cat/indices/semwiki?v=true&s=index&pretty
GET /_cat/indices/semwiki?format=json
DELETE /semwiki
GET /semwiki
GET /semwiki/_settings
GET /semwiki/_mappings
GET /semwiki/_stats
GET /semwiki/_doc/17793022
DELETE /semwiki/_doc/00000000?routing=shard-1&pretty
GET /semwiki/_doc/53747466?pretty
GET /semwiki/_search
{
"size": 10,
"_source": [
"url",
"uri",
"title"
],
"query": {
"function_score": {
"functions": [
{
"random_score": {
"seed": "1518707649"
}
}
]
}
}
}
GET /semwiki/_search
{
"size": 5,
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"random_score": {}
}
]
}
}
}
GET /semwiki/_search
{
"_source": [
"title"
],
"query": {
"match_all": {}
}
}
GET /semwiki/_search
{
"size": 10,
"stored_fields": [
"_id"
],
"_source": [
"title",
"url"
],
"query": {
"match_all": {}
}
}
POST /semwiki/_delete_by_query
{
"query": {
"match_all": {}
}
}