This work shows how to create a semantic search engine over a set of Wikipedia pages and deploy it as a service on AWS.
Infrastructure as Code (IaC) is used through AWS CDK.
Here are the descriptions of some directories and files in the repo:
src/wiki.py
gets random pages from Wikipedia, enriches them with metadata, and uploads them to an S3 bucket. This can also be run using
python src/scripts.py upload-random-pages -n <NUMBER_OF_RANDOM_PAGES_TO_UPLOAD>
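As a rough illustration of what this script does, here is a minimal sketch (not the repo's actual code) that fetches random pages with the `wikipedia` package and uploads them to S3 with boto3; the bucket name and document fields are assumptions:

```python
# Minimal sketch of the upload flow; the bucket name and document fields
# are assumptions, not the repo's actual code.
import json

import boto3
import wikipedia  # pip install wikipedia

BUCKET = "semwiki-pages"  # hypothetical bucket name


def upload_random_pages(n: int) -> None:
    s3 = boto3.client("s3")
    titles = wikipedia.random(pages=n)
    titles = [titles] if isinstance(titles, str) else titles
    for title in titles:
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            continue  # skip ambiguous or missing pages
        doc = {
            "id": page.pageid,       # metadata enrichment
            "title": page.title,
            "url": page.url,
            "content": page.content,
        }
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{page.pageid}.json",
            Body=json.dumps(doc).encode("utf-8"),
        )


if __name__ == "__main__":
    upload_random_pages(5)
```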
lambda_indexer/
defines a lambda function that is attached to create events in the S3 bucket. It sends the page content to the embedding service and references the document and its embedding in the Elasticsearch cluster.
universal-sentence-encoder/
defines the docker image that is pushed to ECR and then deployed in ECS. It provides a service that, given a text, returns its embedding. It can be used as follows:
curl -XPOST -d '{"instances": ["text 1 to query", "text 2 to query"]}' http://<EMBEDDER_IP>:8501/v1/models/USE_3:predict | jq
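Putting the two pieces above together, here is a minimal sketch of the indexing flow (not the repo's actual handler). The environment variable names, index name, and document fields are assumptions, and Elasticsearch authentication is omitted:

```python
# Sketch of the indexer Lambda: S3 create event -> embed content -> index
# the document with its embedding in Elasticsearch. Names are assumptions.
import json
import os
import urllib.request

import boto3

EMBEDDER_URL = os.environ.get(
    "EMBEDDER_URL", "http://<EMBEDDER_IP>:8501/v1/models/USE_3:predict"
)
ES_URL = os.environ.get("ES_URL", "https://<ES_ENDPOINT>")
INDEX = "semwiki"

s3 = boto3.client("s3")


def embed(text: str) -> list:
    """Call the TF Serving endpoint and return the embedding vector."""
    payload = json.dumps({"instances": [text]}).encode("utf-8")
    req = urllib.request.Request(EMBEDDER_URL, data=payload, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"][0]


def handler(event, context):
    # Triggered by S3 ObjectCreated notifications on the pages bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        doc = json.loads(body)
        doc["embedding"] = embed(doc["content"])
        # Index the document under its id (auth omitted for brevity).
        req = urllib.request.Request(
            f"{ES_URL}/{INDEX}/_doc/{doc['id']}",
            data=json.dumps(doc).encode("utf-8"),
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```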
src/es
is related to the Elasticsearch cluster where the pages are referenced. It contains modules to create the index, index a document, and search the index using knn similarity.
src/api
a Sanic server listening on port 8000 with an endpoint /search
which takes a text as a query, sends it to the embedding service, takes the returned embedding, and sends a query to the Elasticsearch index with that embedding. Finally, it returns the results. To use it, run
src/api/search_server
then execute the following command, which requests the 3 most similar Wikipedia pages to the text "beautiful painting":
curl -XGET -d '{"query": "beautiful painting"}' localhost:8000/search?n=3 | jq
This server is also deployed as an AWS service which can be reached through
curl -XGET -d '{"query": "beautiful painting"}' <API_SERVICE_IP>:8000/search\?n=3 | jq
The setup is not production grade, but the Makefile is simple enough to read and follow.
Start by creating and activating the virtual environment:
make env-create
source .venv/bin/activate
- bootstrap environment
make cdk-bootstrap-environment
- create docker registries
make docker-create-ecr-embedder
make docker-create-ecr-api
- deploy Elasticsearch stack
make deploy-es
- in src/config/config.ini update es_url with the Elasticsearch endpoint. This can be obtained with
make echo-elastic-search-endpoint
- create the index in the cluster (a sketch of a possible index mapping is given after this list)
make create-es-index
- build and push embedder image
make docker-embedder-image && make docker-embedder-push
- deploy embedding stack
make deploy-embedder
- check the embedding service
- Embedder IP can be obtained with
make echo-embedder-ip
curl -XPOST -d '{"instances": ["toto", "tata"]}' http://<EMBEDDER_IP>:8501/v1/models/USE_3:predict | jq
- in src/config/config.ini update public_ip with the embedding service public IP. This is a workaround to pass the embedding service address to the indexer lambda and the API service. A robust solution would be to assign a load balancer with an elastic IP to the embedder service, but this would increase the cost, and the purpose of this repo is only to showcase the semantic search solution.
- package indexing lambda
make lambda-indexer-package
- deploy WikiReferencing stack (S3 bucket + Lambda function + S3 Notification)
make deploy-referencing
- check the referencing stack by sending a batch of Wikipedia pages; you should find JSON files added to the S3 bucket and the corresponding pages indexed in the Elasticsearch index semwiki
python src/scripts.py upload-random-pages -n 5
- list documents in Elasticsearch index
- Elasticsearch endpoint can be obtained with
make echo-elastic-search-endpoint
curl -XGET -u 'semwiki:SemWiki21!' -H 'Content-Type: application/json' \
  -d '{"_source": "title", "query": {"match_all": {}}}' https://<ES_ENDPOINT>/semwiki/_search | jq
- build and push API image
make docker-api-image && make docker-api-push
- deploy API service
make deploy-api
- test API service
- Search API IP can be obtained with
make echo-api-ip
curl -XGET -d '{"query": "entertainment"}' http://<API_SERVICE_IP>:8000/search\?n\=3 | \ jq '.[] | {"title": .title, "url": .url}'
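As mentioned in the create-the-index step above, here is a minimal sketch of what the index creation in src/es might look like. The mapping, the field names, and the vector dimension (512, the output size of the Universal Sentence Encoder) are assumptions, not the repo's actual settings; the credentials are the ones used in the curl examples.

```python
# Sketch of index creation for the semwiki index; the mapping and field
# names are assumptions.
import base64
import json
import urllib.request

ES_URL = "https://<ES_ENDPOINT>"  # from `make echo-elastic-search-endpoint`
INDEX = "semwiki"
AUTH = base64.b64encode(b"semwiki:SemWiki21!").decode()

mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "url": {"type": "keyword"},
            "content": {"type": "text"},
            # dense_vector enables vector similarity scoring; 512 is the
            # Universal Sentence Encoder output dimension.
            "embedding": {"type": "dense_vector", "dims": 512},
        }
    }
}

req = urllib.request.Request(
    f"{ES_URL}/{INDEX}",
    data=json.dumps(mapping).encode("utf-8"),
    method="PUT",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {AUTH}",
    },
)
print(urllib.request.urlopen(req).read().decode())
```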
The Elasticsearch service and the embedding service get a new IP every time their containers are re-instantiated, which makes them difficult to reach. A load balancer with a fixed public IP would overcome this problem, but since the objective of this project is only to show the main idea of how to implement a semantic search engine, we do not want unnecessary costs related to these additional resources.
Consequently, for now, when the embedder becomes unreachable from the indexing lambda or the search service (which happens when the embedder service is killed and re-created), their respective code and docker images have to be updated with the new embedder IP.
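The IP and the Elasticsearch endpoint are the values set in src/config/config.ini during the deployment steps. Here is a minimal sketch of how a component might read them; the section name is an assumption, while the keys es_url and public_ip are the ones mentioned above:

```python
# Reading the deployment values from src/config/config.ini; the section
# name ("DEFAULT") is an assumption.
from configparser import ConfigParser

config = ConfigParser()
config.read("src/config/config.ini")

es_url = config["DEFAULT"]["es_url"]          # Elasticsearch endpoint
embedder_ip = config["DEFAULT"]["public_ip"]  # embedding service public IP
embedder_url = f"http://{embedder_ip}:8501/v1/models/USE_3:predict"
```

Finally, here are some useful Elasticsearch queries (console syntax) for inspecting and cleaning the semwiki index: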
GET /_cat/indices?v=true&s=index&pretty
GET /_cat/indices/semwiki?v=true&s=index&pretty
GET /_cat/indices/semwiki?format=json
DELETE /semwiki
GET /semwiki
GET /semwiki/_settings
GET /semwiki/_mappings
GET /semwiki/_stats
GET /semwiki/_doc/17793022
DELETE /semwiki/_doc/00000000?routing=shard-1&pretty
GET /semwiki/_doc/53747466?pretty
GET /semwiki/_search
{
"size": 10,
"_source": [
"url",
"uri",
"title"
],
"query": {
"function_score": {
"functions": [
{
"random_score": {
"seed": "1518707649"
}
}
]
}
}
}
GET /semwiki/_search
{
"size": 5,
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"random_score": {}
}
]
}
}
}
GET /semwiki/_search
{
"_source": [
"title"
],
"query": {
"match_all": {}
}
}
GET /semwiki/_search
{
"size": 10,
"stored_fields": [
"_id"
],
"_source": [
"title",
"url"
],
"query": {
"match_all": {}
}
}
POST /semwiki/_delete_by_query
{
"query": {
"match_all": {}
}
}