Currently, we support the following tokenizers:
- `BERT`: the default uncased BERT tokenizer.
- `TOCKEN`: a Unicode tokenizer pre-trained on wiki-103-raw with `min_freq=10`.
- `UNICODE`: a Unicode tokenizer that will be trained on your data.
`BERT` and `TOCKEN` are pre-trained tokenizers. You can use them directly by calling the `tokenize` function:
```sql
SET bm25_catalog.tokenizer = 'BERT'; -- or 'TOCKEN'
SELECT tokenize('A quick brown fox jumps over the lazy dog.');
-- {2058:1, 2474:1, 2829:1, 3899:1, 4248:1, 4419:1, 5376:1, 5831:1}
```
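Since `tokenize` returns a `bm25vector`, you can also persist its output in a column. A minimal sketch (the table and column names here are hypothetical):

```sql
-- Hypothetical table: store pre-trained tokenizations next to the source text
CREATE TABLE docs (id TEXT, text TEXT, embedding bm25vector);
INSERT INTO docs (id, text) VALUES ('doc1', 'A quick brown fox jumps over the lazy dog.');
UPDATE docs SET embedding = tokenize(text);
```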
`UNICODE` will be trained on your data during document tokenization. You can use it with or without a trigger:
- with trigger (convenient but slower)

```sql
CREATE TABLE corpus (id TEXT, text TEXT, embedding bm25vector);
CREATE TRIGGER test_trigger AFTER INSERT ON corpus FOR EACH ROW EXECUTE FUNCTION unicode_tokenizer_trigger('text', 'embedding', 'test_token');
-- insert text into the table
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> bm25_query_unicode_tokenize('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
FROM corpus
ORDER BY rank
LIMIT 10;
```
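With the trigger in place, newly inserted rows are tokenized automatically and the `embedding` column is filled for you. A minimal sketch (the sample row is hypothetical):

```sql
-- The trigger fires on INSERT, so no manual tokenization call is needed
INSERT INTO corpus (id, text) VALUES ('doc1', 'PostgreSQL is an advanced open-source relational database.');
SELECT id, embedding FROM corpus; -- embedding now holds the bm25vector for the inserted text
```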
- without trigger (faster, but you need to call the `document_unicode_tokenize` function manually)
```sql
CREATE TABLE corpus (id TEXT, text TEXT);
-- insert text into the table
CREATE TABLE bm25_catalog.test_token (id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, token TEXT UNIQUE);
ALTER TABLE corpus ADD COLUMN embedding bm25vector;
UPDATE corpus SET embedding = document_unicode_tokenize(text, 'test_token');
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> bm25_query_unicode_tokenize('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
FROM corpus
ORDER BY rank
LIMIT 10;
```
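Without the trigger, rows inserted after the initial `UPDATE` are not tokenized automatically, so you have to re-run the tokenization yourself. A minimal sketch (the sample row and the `WHERE embedding IS NULL` filter are illustrative, not part of the API):

```sql
-- Hypothetical follow-up: tokenize only rows added since the last UPDATE
INSERT INTO corpus (id, text) VALUES ('doc2', 'PostgreSQL supports full-text search.');
UPDATE corpus SET embedding = document_unicode_tokenize(text, 'test_token')
WHERE embedding IS NULL;
```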
To create another tokenizer that is pre-trained on your data, you can follow the steps below: