Support more languages #18

SSamDav · 2023-08-17T10:59:43Z

SSamDav
Aug 17, 2023

Do you plan to support more languages?

Also, it seems some of the optimizations were made with Western languages in mind. Do you plan to explore how this tokenizer works with non-Western languages?

alasdairforsythe · 2023-09-07T05:17:11Z

alasdairforsythe
Sep 7, 2023
Maintainer

TokenMonster is not language dependent. It will work with any language that uses a space as a word boundary (non-standard spaces are also fine). It will also work with languages that don't use spaces as word boundaries, but in those cases you'll need to use the unfiltered optimization mode.
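
For anyone unsure which case their corpus falls into, here is a rough heuristic sketch. It is not part of TokenMonster: the function names and the 0.05 threshold are assumptions made up for illustration. It only measures how often whitespace occurs in a text sample.

```python
def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are whitespace."""
    if not text:
        return 0.0
    # str.isspace() is also True for non-standard spaces such as U+3000.
    return sum(ch.isspace() for ch in text) / len(text)


def looks_space_delimited(text: str, threshold: float = 0.05) -> bool:
    """Heuristic guess at whether whitespace marks word boundaries.

    The 0.05 threshold is an assumption chosen for illustration,
    not a value taken from TokenMonster.
    """
    return whitespace_ratio(text) >= threshold
```

If a representative sample returns False, the corpus is probably one of the cases where the unfiltered optimization mode is the one to try.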

0 replies

Sovenok-Hacker · 2023-10-22T10:04:12Z

Sovenok-Hacker
Oct 22, 2023

I trained a model for English-Russian. It can be downloaded from IPFS: https://ipfs.io/ipfs/QmPhxHrNyogBnzxY5onAnkvvgg78RP26R1XXrv3Ka6Qc9J?filename=russian.vocab

2 replies

NotSpooky Nov 15, 2023

Could you please give me a short introduction to training it on another dataset?

Sovenok-Hacker Dec 16, 2023

Here is a guide on how to do it: https://github.com/alasdairforsythe/tokenmonster/tree/main/training
You need 100 MB+ of text in a single .txt file and around 36 hours on a modern PC (I trained mine on an Intel Core i3-10100 in ~24 hours).
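
A minimal sketch of the data-preparation step, assuming the corpus starts out as many separate .txt files that all share one encoding. It is not taken from the linked guide: `merge_text_files`, `is_large_enough`, and `MIN_BYTES` are hypothetical names, and reading "100MB+" as 100 MiB is an assumption.

```python
from pathlib import Path

# "100MB+" from the reply above, interpreted here as 100 MiB (assumption).
MIN_BYTES = 100 * 1024 * 1024


def merge_text_files(source_dir: str, output_path: str) -> int:
    """Concatenate every .txt file under source_dir into output_path.

    Files are appended in sorted path order, and a newline is added after
    any non-empty file that does not already end with one.
    Returns the number of bytes written.
    """
    out = Path(output_path).resolve()
    total = 0
    with open(out, "wb") as dst:
        for path in sorted(Path(source_dir).rglob("*.txt")):
            if path.resolve() == out:
                continue  # skip the output file if it lives inside source_dir
            last = b""
            with open(path, "rb") as src:
                # Copy in 1 MiB chunks so large files are never fully in memory.
                while chunk := src.read(1 << 20):
                    dst.write(chunk)
                    total += len(chunk)
                    last = chunk[-1:]
            if last and last != b"\n":
                dst.write(b"\n")
                total += 1
    return total


def is_large_enough(size_in_bytes: int, minimum: int = MIN_BYTES) -> bool:
    """Check a corpus size against the rough minimum mentioned above."""
    return size_in_bytes >= minimum
```

For example, `is_large_enough(merge_text_files("corpus/", "train.txt"))` tells you whether the merged file reaches that size before you commit a day or more of CPU time to training.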
