Vocabulary Adaptation for MPT and BLOOM Models

Tokenizer-Embedding Pipeline

  1. To train the Indic tokenizer and obtain the final tokenizer, follow the tokenizer_setup directory (a training sketch follows this list)
  2. To evaluate the resulting tokenizer, follow the tokenizer_evaluation directory
  3. To compute embeddings with WECHSEL, follow the Wechsel_Setup directory
  4. To initialize the model's word embedding layer, follow the InitializationWordEmbed directory (see the second sketch below)
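
As a concrete illustration of step 1, here is a minimal sketch of training a SentencePiece tokenizer on an Indic corpus. The corpus path, vocabulary size, and other hyperparameters are illustrative placeholders, not the settings used in tokenizer_setup.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a plain-text corpus, one sentence per line
# (indic_corpus.txt and all hyperparameters are placeholders).
spm.SentencePieceTrainer.train(
    input="indic_corpus.txt",
    model_prefix="indic_sp",      # writes indic_sp.model / indic_sp.vocab
    vocab_size=32000,             # illustrative size, not the repo's setting
    model_type="bpe",
    character_coverage=0.9995,    # high coverage to keep rare Indic characters
)

# Quick sanity check of the trained tokenizer.
sp = spm.SentencePieceProcessor(model_file="indic_sp.model")
print(sp.encode("उदाहरण वाक्य", out_type=str))
```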

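For steps 3 and 4, the sketch below follows the wechsel package's documented usage; the model IDs, language codes, bilingual dictionary name, and tokenizer path are assumptions for illustration, not this repository's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

# Source model and tokenizer (MPT requires trust_remote_code=True).
source_tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

# The new Indic tokenizer produced by the pipeline (placeholder path).
target_tokenizer = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")

wechsel = WECHSEL(
    load_embeddings("en"),         # fastText embeddings for the source language
    load_embeddings("hi"),         # and for the target language (Hindi as an example)
    bilingual_dictionary="hindi",  # built-in dictionary name (assumption)
)

# Map the source word embeddings onto the new vocabulary
# via bilingual word similarities.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# Initialize the model's word embedding layer with the WECHSEL embeddings.
model.resize_token_embeddings(len(target_tokenizer))
model.get_input_embeddings().weight.data.copy_(torch.from_numpy(target_embeddings))
```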
Results

  1. Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
  2. Meeting notes, to-do lists, observations, etc. are available at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD_6K47kHh_5ojd0/edit

File specification

  1. seed_data_test_split.py -> splits the seed dataset into train (90%) and test (10%) sets
  2. merge_training_seed.py -> merges the training data
  3. tokenizer_specification.py -> reports how two tokenizers relate, e.g. intersecting tokens and average tokenization length per sentence
  4. combine_tokenizer.py -> combines two tokenizers (the one used for the extended version; a merging sketch follows this list)
  5. train_tokenizer.py -> trains a tokenizer from scratch
  6. MPT_inference.py and IndicMPT_inference.py -> calculate the perplexity score from inference alone (no training)
  7. MPT_train.py and IndicMPT_train.py -> train the LoRA adapter and the word embedding layer of the model (see the peft sketch below)
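
One common way to implement the combination in item 4 is to append the Indic tokenizer's pieces to the base tokenizer's SentencePiece model proto. The sketch below assumes both tokenizers are SentencePiece models; the file names are placeholders, and combine_tokenizer.py may take a different approach.

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load both tokenizer protos (placeholder file names).
base = sp_pb2.ModelProto()
base.ParseFromString(open("base_sp.model", "rb").read())
extra = sp_pb2.ModelProto()
extra.ParseFromString(open("indic_sp.model", "rb").read())

# Append every piece from the Indic tokenizer that the base vocabulary lacks.
existing = {p.piece for p in base.pieces}
for p in extra.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        base.pieces.append(new_piece)

# Write the extended tokenizer.
with open("combined_sp.model", "wb") as f:
    f.write(base.SerializeToString())
print(f"Combined vocabulary size: {len(base.pieces)}")
```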

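For the training scripts in item 7, the peft library can train a LoRA adapter while keeping the word embedding layer fully trainable via modules_to_save. The module names below assume MPT's layer naming ("Wqkv" for the fused attention projection, "wte" for the word embeddings) and are assumptions for illustration, not a copy of the repository's configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

config = LoraConfig(
    r=8,                        # illustrative LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["Wqkv"],    # MPT's fused QKV projection gets LoRA weights
    modules_to_save=["wte"],    # the word embedding layer is trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # LoRA params + embeddings are trainable
```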
About

Code for extending the vocabulary.
