Wikipedia-trained tokenizers

This repository contains SentencePiece tokenizers trained over Wikipedia snapshots using the WikiLoader package. At the moment, this repository is maintained for our friends at the PeARS project, who are developing a multilingual, decentralised search engine, but you are of course free to use the models for any purpose. We will keep adding languages.

The vocabulary size is held constant across languages, at 8000 or 16000 wordpieces. Models are trained on the first 5M words of each snapshot.
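As an illustration of what such a training run looks like, here is a minimal sketch using the sentencepiece Python package. The input file, model prefix and options are assumptions for the example only; the actual models in this repository were produced through the WikiLoader package, possibly with different settings.

```python
# Minimal sketch: training a SentencePiece model comparable to the ones in this
# repository. Input path, model prefix and options are illustrative; the exact
# training configuration used here is handled by the WikiLoader package.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_en_first_5M_words.txt",  # hypothetical plain-text Wikipedia extract
    model_prefix="en.wiki",              # produces en.wiki.model and en.wiki.vocab
    vocab_size=16000,                    # 8000 or 16000, matching this repository
)
```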

For each language, two files are provided:

  • a .vocab file containing the list of the 8000/16000 wordpieces used by the model
  • a .model file containing the actual SentencePiece model

These are stored in the vocabs/ and models/ directories respectively, under the relevant language code; see the loading sketch below.
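The following is a minimal usage sketch showing how a downloaded .model file can be loaded with the sentencepiece Python package. The exact filename is an assumption about the directory layout; adjust it to the language code and vocabulary size you downloaded.

```python
# Minimal usage sketch: loading one of the pretrained .model files with the
# sentencepiece Python package. The path below is hypothetical.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/en/en.16000.model")

pieces = sp.encode("Decentralised search is fun.", out_type=str)  # wordpieces
ids = sp.encode("Decentralised search is fun.", out_type=int)     # piece ids
print(pieces)
print(ids)
```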

In addition, English, French, German and Malayalam have nearest neighbours files corresponding to the 16000-wordpiece models, stored in the nns folder. These were generated by a FastText model trained on 100M wordpieces with the WikiLoader package (40M for Malayalam, due to the smaller size of the corresponding Wikipedia snapshot). NB: these files do not contain nearest neighbours for every wordpiece in the vocabulary, as pieces under a certain frequency threshold are ignored.
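To make the idea concrete, here is a hedged sketch of computing nearest neighbours for wordpieces with a FastText model via gensim. This is not the pipeline used to produce the files in nns/; the corpus path, parameters and frequency threshold are assumptions for illustration only.

```python
# Illustrative sketch only: nearest neighbours for wordpieces with a gensim
# FastText model. NOT the exact pipeline behind the nns/ files; corpus path,
# hyperparameters and the frequency threshold are assumptions.
from gensim.models import FastText

# Hypothetical corpus: one SentencePiece-tokenized sentence per line.
corpus = [line.split() for line in open("wiki_en_wordpieces.txt", encoding="utf-8")]

model = FastText(sentences=corpus, vector_size=100, window=5, min_count=50)

# Nearest neighbours for a sufficiently frequent wordpiece (example piece).
for neighbour, score in model.wv.most_similar("▁search", topn=10):
    print(neighbour, round(score, 3))
```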
