# Tokenizer

Currently, we support the following tokenizers:

- `BERT`: the default uncased BERT tokenizer.
- `TOCKEN`: a Unicode tokenizer pre-trained on wiki-103-raw with `min_freq=10`.
- `UNICODE`: a Unicode tokenizer that is trained on your own data.

## Usage

### Pre-trained Tokenizer

`BERT` and `TOCKEN` are pre-trained tokenizers. You can use them directly by calling the `tokenize` function:

```sql
SET bm25_catalog.tokenizer = 'BERT';  -- or 'TOCKEN'
SELECT tokenize('A quick brown fox jumps over the lazy dog.');
-- {2058:1, 2474:1, 2829:1, 3899:1, 4248:1, 4419:1, 5376:1, 5831:1}
```
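
The result is a `bm25vector`: a sparse map from token ID to that token's occurrence count in the input text.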

### Train on Your Data

`UNICODE` is trained on your data during document tokenization. You can use it either with or without a trigger:

**With a trigger** (convenient but slower):

```sql
CREATE TABLE corpus (id TEXT, text TEXT, embedding bm25vector);
CREATE TRIGGER test_trigger AFTER INSERT ON corpus FOR EACH ROW EXECUTE FUNCTION unicode_tokenizer_trigger('text', 'embedding', 'test_token');
-- insert text into the table, for example:
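INSERT INTO corpus (id, text) VALUES ('doc1', 'PostgreSQL is an open-source relational database.');  -- illustrative row; the trigger fills "embedding"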
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> bm25_query_unicode_tokenize('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```

**Without a trigger** (faster, but you need to call the `document_unicode_tokenize` function manually):

```sql
CREATE TABLE corpus (id TEXT, text TEXT);
-- insert text into the table, for example:
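INSERT INTO corpus (id, text) VALUES ('doc1', 'PostgreSQL is an open-source relational database.');  -- illustrative row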
CREATE TABLE bm25_catalog.test_token (id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, token TEXT UNIQUE);
ALTER TABLE corpus ADD COLUMN embedding bm25vector;
UPDATE corpus SET embedding = document_unicode_tokenize(text, 'test_token');
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> bm25_query_unicode_tokenize('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```
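
In both variants, `bm25_query_unicode_tokenize` takes the BM25 index name, the query text, and the vocabulary table name (`test_token` here); the `<&>` operator then computes the rank used in `ORDER BY`.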

## Contribution

To add another tokenizer pre-trained on your own data, follow the steps below:

1. Implement the `Tokenizer` trait for your tokenizer (a sketch follows below).
2. (Optional) Pre-trained data can be stored under the tokenizer directory.
3. Add your tokenizer to the `TOKENIZER_NAME` GUC match branch in `token.rs`.
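
For step 1, here is a minimal sketch of what an implementation could look like. The trait shape (a single `encode` method mapping text to token IDs) and the struct are assumptions for illustration only; consult the actual `Tokenizer` trait in the source before implementing.

```rust
use std::collections::HashMap;

// Assumed trait shape, for illustration only: the real `Tokenizer`
// trait in token.rs may expose different methods and types.
pub trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
}

// Hypothetical tokenizer backed by a pre-trained vocabulary
// (step 2 suggests storing such data under the tokenizer directory).
pub struct MyTokenizer {
    vocab: HashMap<String, u32>,
}

impl Tokenizer for MyTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        // Lowercase, split on whitespace, keep only in-vocabulary tokens.
        text.split_whitespace()
            .filter_map(|word| self.vocab.get(&word.to_lowercase()).copied())
            .collect()
    }
}
```

After that, step 3 is a one-line change: the match on the `TOKENIZER_NAME` GUC in `token.rs` gains an arm that constructs your tokenizer when it is selected via `SET bm25_catalog.tokenizer = '...'`.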