# Tokenizer

Currently, we support the following tokenizers:

- `BERT`: the default uncased BERT tokenizer.
- `TOCKEN`: a Unicode tokenizer pre-trained on wiki-103-raw with `min_freq=10`.
- `UNICODE`: a Unicode tokenizer that is trained on your own data.

## Usage

### Pre-trained Tokenizer

`BERT` and `TOCKEN` are pre-trained tokenizers. You can use them directly by calling the `tokenize` function:

```sql
SET bm25_catalog.tokenizer = 'BERT';  -- or 'TOCKEN'
SELECT tokenize('A quick brown fox jumps over the lazy dog.');
-- {2058:1, 2474:1, 2829:1, 3899:1, 4248:1, 4419:1, 5376:1, 5831:1}
```
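
The result is a `bm25vector`: a sparse map from token ID to that token's occurrence count in the input text.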

### Train on Your Data

`UNICODE` is trained on your data during document tokenization. You can use it either with or without a trigger:

**With a trigger** (convenient but slower):

```sql
CREATE TABLE corpus (id TEXT, text TEXT, embedding bm25vector);
CREATE TRIGGER test_trigger AFTER INSERT ON corpus FOR EACH ROW EXECUTE FUNCTION unicode_tokenizer_trigger('text', 'embedding', 'test_token');
-- insert text into the table, for example:
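INSERT INTO corpus (id, text) VALUES ('doc1', 'PostgreSQL is an open-source relational database.');  -- illustrative row; the trigger fills "embedding"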
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> bm25_query_unicode_tokenize('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```

**Without a trigger** (faster, but you need to call the `document_unicode_tokenize` function manually):

```sql
CREATE TABLE corpus (id TEXT, text TEXT);
-- insert text into the table, for example:
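INSERT INTO corpus (id, text) VALUES ('doc1', 'PostgreSQL is an open-source relational database.');  -- illustrative row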
CREATE TABLE bm25_catalog.test_token (id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, token TEXT UNIQUE);
ALTER TABLE corpus ADD COLUMN embedding bm25vector;
UPDATE corpus SET embedding = document_unicode_tokenize(text, 'test_token');
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> bm25_query_unicode_tokenize('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```
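
In both variants, `bm25_query_unicode_tokenize` takes the BM25 index name, the query text, and the vocabulary table name (`test_token` here); the `<&>` operator then computes the rank used in `ORDER BY`.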

## Contribution

To add another tokenizer pre-trained on your own data, follow the steps below:

1. Implement the `Tokenizer` trait for your tokenizer (a sketch follows below).
2. (Optional) Pre-trained data can be stored under the tokenizer directory.
3. Add your tokenizer to the `TOKENIZER_NAME` GUC match branch in `token.rs`.
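
For step 1, here is a minimal sketch of what an implementation could look like. The trait shape (a single `encode` method mapping text to token IDs) and the struct are assumptions for illustration only; consult the actual `Tokenizer` trait in the source before implementing.

```rust
use std::collections::HashMap;

// Assumed trait shape, for illustration only: the real `Tokenizer`
// trait in token.rs may expose different methods and types.
pub trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
}

// Hypothetical tokenizer backed by a pre-trained vocabulary
// (step 2 suggests storing such data under the tokenizer directory).
pub struct MyTokenizer {
    vocab: HashMap<String, u32>,
}

impl Tokenizer for MyTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        // Lowercase, split on whitespace, keep only in-vocabulary tokens.
        text.split_whitespace()
            .filter_map(|word| self.vocab.get(&word.to_lowercase()).copied())
            .collect()
    }
}
```

After that, step 3 is a one-line change: the match on the `TOKENIZER_NAME` GUC in `token.rs` gains an arm that constructs your tokenizer when it is selected via `SET bm25_catalog.tokenizer = '...'`.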