Wikipedia-trained tokenizers

This repository contains SentencePiece tokenizers trained over Wikipedia snapshots using the WikiLoader package. At the moment, this repository is maintained for our friends at the PeARS project, who are developing a multilingual, decentralised search engine, but you are of course free to use the models for any purpose. We will keep adding languages.

The vocabulary size is held constant across languages, at 8000 or 16000 wordpieces. Models are trained on the first 5M words of each snapshot.
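As an illustration of what such a training run looks like, here is a minimal sketch using the sentencepiece Python package. The input file, model prefix and options are assumptions for the example only; the actual models in this repository were produced through the WikiLoader package, possibly with different settings.

```python
# Minimal sketch: training a SentencePiece model comparable to the ones in this
# repository. Input path, model prefix and options are illustrative; the exact
# training configuration used here is handled by the WikiLoader package.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_en_first_5M_words.txt",  # hypothetical plain-text Wikipedia extract
    model_prefix="en.wiki",              # produces en.wiki.model and en.wiki.vocab
    vocab_size=16000,                    # 8000 or 16000, matching this repository
)
```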

For each language, two files are provided:

  • a .vocab file containing the list of the 8000/16000 wordpieces used by the model
  • a .model file containing the actual SentencePiece model

These are stored in the vocabs/ and models/ directories respectively, under the relevant language code; see the loading sketch below.
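The following is a minimal usage sketch showing how a downloaded .model file can be loaded with the sentencepiece Python package. The exact filename is an assumption about the directory layout; adjust it to the language code and vocabulary size you downloaded.

```python
# Minimal usage sketch: loading one of the pretrained .model files with the
# sentencepiece Python package. The path below is hypothetical.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/en/en.16000.model")

pieces = sp.encode("Decentralised search is fun.", out_type=str)  # wordpieces
ids = sp.encode("Decentralised search is fun.", out_type=int)     # piece ids
print(pieces)
print(ids)
```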

In addition, English, French, German and Malayalam have nearest neighbours files corresponding to the 16000-wordpiece models, stored in the nns folder. These were generated by a FastText model trained on 100M wordpieces with the WikiLoader package (40M for Malayalam, due to the smaller size of the corresponding Wikipedia snapshot). NB: these files do not contain nearest neighbours for every wordpiece in the vocabulary, as pieces under a certain frequency threshold are ignored.
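To make the idea concrete, here is a hedged sketch of computing nearest neighbours for wordpieces with a FastText model via gensim. This is not the pipeline used to produce the files in nns/; the corpus path, parameters and frequency threshold are assumptions for illustration only.

```python
# Illustrative sketch only: nearest neighbours for wordpieces with a gensim
# FastText model. NOT the exact pipeline behind the nns/ files; corpus path,
# hyperparameters and the frequency threshold are assumptions.
from gensim.models import FastText

# Hypothetical corpus: one SentencePiece-tokenized sentence per line.
corpus = [line.split() for line in open("wiki_en_wordpieces.txt", encoding="utf-8")]

model = FastText(sentences=corpus, vector_size=100, window=5, min_count=50)

# Nearest neighbours for a sufficiently frequent wordpiece (example piece).
for neighbour, score in model.wv.most_similar("▁search", topn=10):
    print(neighbour, round(score, 3))
```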
