Search for similar documents using Elasticsearch and BERT.
It assumes the input text is Japanese.
- Build and start the containers with docker-compose
$ docker-compose up --build
- Download and set up the model file
Download from here.
Rename the downloaded files so that the model directory looks like this.
$ ls bertserver/model
bert_config.json bert_model.ckpt.meta wiki-ja.model
bert_model.ckpt.data-00000-of-00001 graph.pbtxt wiki-ja.vocab
bert_model.ckpt.index vocab.txt
- Go to JupyterLab (http://0.0.0.0:8888/lab) and open a terminal
- Create Elasticsearch index
$ python create_index.py --index_file index.json --index_name vector_search
- Create Elasticsearch documents
$ python create_documents.py --data contents.csv --save contents.json --index_name vector_search
- Index Elasticsearch documents
$ python index_document.py --data contents.json
- Open main.ipynb and run it
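The contents of main.ipynb are not reproduced here. As a rough illustration of what a vector similarity search against the index could look like, the sketch below builds an Elasticsearch script_score request body that ranks documents by cosine similarity to a query vector. The function name build_similarity_query, the field name text_vector, and the short example vector are assumptions made for illustration and are not taken from this repository; in practice the query vector would be the BERT embedding of the query sentence.

```python
def build_similarity_query(query_vector, vector_field="text_vector", size=10):
    """Build a script_score search body ranking documents by cosine similarity.

    vector_field is a hypothetical dense_vector field name; replace it with
    whatever field the index mapping actually defines.
    """
    return {
        "size": size,
        "query": {
            "script_score": {
                # Score every document; a filter query could narrow this down.
                "query": {"match_all": {}},
                "script": {
                    # cosineSimilarity lies in [-1, 1]; adding 1.0 keeps
                    # the score non-negative, which Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


# Toy 3-dimensional vector for illustration only; real BERT vectors are much longer.
body = build_similarity_query([0.1, 0.2, 0.3], size=5)
```

With the official Python client, a body like this could be passed to the search call together with the index name used above (vector_search). The exact script syntax depends on the Elasticsearch version in use.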
The input data is expected to be a CSV file containing Japanese text, for example:
content |
---|
私は仕事中によく居眠りをしてしまいます。眠気を覚ます方法を教えて下さい。 |
The content column can contain multiple sentences.
Each entry is split into individual sentences during preprocessing.
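The sentence splitting described above can be sketched as follows. This is a minimal illustration that assumes sentences end with a full-width terminator (。, ！ or ？); the function name split_sentences is hypothetical, and the preprocessing actually used by the scripts may differ.

```python
import re


def split_sentences(text):
    """Split Japanese text into sentences, keeping the closing punctuation."""
    # Each match is a run of non-terminator characters followed by an
    # optional terminator, so a trailing fragment without one is kept too.
    parts = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


sample = "私は仕事中によく居眠りをしてしまいます。眠気を覚ます方法を教えて下さい。"
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Applied to the sample row above, this yields two sentences, each of which could then be embedded and indexed separately.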