v0.3.29

@georg-jung georg-jung released this 18 Sep 00:28
· 122 commits to master since this release

Breaking Changes

  • PreTokenizer is now internal instead of public, as it should have been from the start.
  • The publicly visible API now takes string instead of ReadOnlySpan/Memory<char>.
    • This enables better Unicode normalization handling, since a string no longer has to be created from every input first.

New Features

  • Unit tests now automatically verify tokenization correctness against Hugging Face tokenizers
  • Added support for multi-threaded tokenization
    • On an 8-core notebook CPU, multi-threaded tokenization is 3x faster than single-threaded tokenization
    • On a GitHub Actions runner it is about 2x faster
  • Inputs are Unicode-normalized prior to tokenization if (and only if) required
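
The "if (and only if) required" normalization above can be illustrated with a minimal sketch. This is not the library's code (the library is .NET); it is a language-neutral illustration in Python, and it rests on assumptions: the function name `normalize_if_required` is hypothetical, and NFD is assumed as the target form because BERT's reference tokenizer uses NFD when stripping accents. The idea is to check cheaply whether an input is already normalized and to allocate a new string only when it is not.

```python
import unicodedata


def normalize_if_required(text: str, form: str = "NFD") -> str:
    """Return `text` in the given normalization form, copying only when needed.

    Hypothetical helper for illustration; the normalization form is an assumption.
    """
    # Pure-ASCII text is unchanged by every Unicode normalization form,
    # so it can be returned as-is without any further checks.
    if text.isascii():
        return text
    # is_normalized (Python 3.8+) avoids building a new string
    # when the input is already in the requested form.
    if unicodedata.is_normalized(form, text):
        return text
    # Only inputs that actually need it pay for normalization.
    return unicodedata.normalize(form, text)
```

With this sketch, `normalize_if_required("caf\u00e9")` decomposes the precomposed "é" into "e" followed by the combining acute accent U+0301, while ASCII-only or already-decomposed inputs are returned unchanged.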