
how can I replace tokenizer TinyChart #94

Open · LilDevsy0117 opened this issue Jul 8, 2024 · 2 comments

@LilDevsy0117

I want to change the tokenizer so that the model can be applied to Korean.

I would appreciate it if you could tell me how to change LLM_PATH and, additionally, which parts of the code should be modified.

LilDevsy0117 changed the title from "how can I replace tokenizer" to "how can I replace tokenizer TinyChart" on Jul 8, 2024
@zhangliang-04 (Collaborator)

Hi @LilDevsy0117,
Our model is based on TinyLlava, which uses Phi-2 as the LLM. Phi-2 uses a byte-level BPE tokenizer, so it can naturally be applied to other languages, including Korean, after fine-tuning on your data. However, we cannot guarantee the performance, since it depends on Phi-2's Korean support.
If you want to use another LLM, try writing a class like llava_phi that supports your LLM, so it can receive image features from the ViT. Note that you may need to pre-train the projector and the LLM on general image-text datasets to get better performance, just as TinyLlava does.
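For reference, here is a rough skeleton of such a class, following upstream LLaVA's llava_llama.py pattern. The mixin names (LlavaMetaModel, LlavaMetaForCausalLM), the import path, and the prepare_inputs_labels_for_multimodal signature are assumptions carried over from LLaVA, not verified against this repo:

```python
# Sketch only: mixin names, import path, and the multimodal-prepare call
# are assumed from upstream LLaVA and may differ in this codebase.
import torch.nn as nn
from transformers import (AutoConfig, AutoModelForCausalLM, LlamaConfig,
                          LlamaForCausalLM, LlamaModel)

from tinychart.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM  # assumed path


class LlavaSynatraConfig(LlamaConfig):
    model_type = "llava_synatra"


class LlavaSynatraModel(LlavaMetaModel, LlamaModel):
    config_class = LlavaSynatraConfig


class LlavaSynatraForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
    config_class = LlavaSynatraConfig

    def __init__(self, config):
        # Skip LlamaForCausalLM.__init__ so we can install the multimodal
        # backbone instead of a plain LlamaModel (same trick as llava_llama.py).
        super(LlamaForCausalLM, self).__init__(config)
        self.model = LlavaSynatraModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.post_init()

    def get_model(self):
        return self.model

    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, labels=None,
                images=None, **kwargs):
        # Splice the ViT image features into the embedding sequence, then
        # run the ordinary language-model forward pass.
        if inputs_embeds is None:
            (input_ids, position_ids, attention_mask, past_key_values,
             inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(
                input_ids, position_ids, attention_mask, past_key_values,
                labels, images)
        return super().forward(
            input_ids=input_ids, attention_mask=attention_mask,
            position_ids=position_ids, past_key_values=past_key_values,
            inputs_embeds=inputs_embeds, labels=labels, **kwargs)


# Register the new model_type so AutoConfig/AutoModel can resolve it and the
# "model of type X to instantiate a model of type Y" warning does not appear.
AutoConfig.register("llava_synatra", LlavaSynatraConfig)
AutoModelForCausalLM.register(LlavaSynatraConfig, LlavaSynatraForCausalLM)
```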

@LilDevsy0117 (Author)

Thanks @zhangliang-04.
I am trying the following.

train.sh:

```bash
LLM_PATH=tabtoyou/KoLLaVA-v1.5-Synatra-7b   # replaced
VIT_PATH=mPLUG/TinyChart-3B-768-siglip
```

I also wrote llava_synatra.py and modified train.py as follows:
```python
config = LlavaConfig.from_pretrained(model_args.model_name_or_path)
model = LlavaLlamaForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    **bnb_model_from_pretrained_args,
    attn_implementation=None,
    torch_dtype=compute_dtype,
)
```

```python
Tokenizer, init_tokenizer = TokenizerSelect('synatra')()
tokenizer = Tokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side="right",
    use_fast=True,
)
tokenizer = init_tokenizer(tokenizer)
```
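A step that may also be needed here, since the replacement tokenizer's vocabulary can differ from the original checkpoint's (a minimal sketch; reusing unk as the pad token is an assumption):

```python
# Sketch: keep special tokens and the embedding matrix consistent
# with the replacement tokenizer.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token  # assumption: reuse unk as pad

# Resize the embeddings only if the vocabulary size actually changed.
if len(tokenizer) != model.get_input_embeddings().weight.shape[0]:
    model.resize_token_embeddings(len(tokenizer))
```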

However, I encountered the following warnings and a tokenization mismatch:

```
You are using a model of type llava to instantiate a model of type tiny_chart_synatra. This is not supported for all configurations of models and can yield errors.
WARNING:tinychart.model.multimodal_encoder.siglip_encoder:You are using a model of type clip to instantiate a model of type siglip_vision_model. This is not supported for all configurations of models and can yield errors.

WARNING: tokenization mismatch: 203 vs. 210. (ignored)
number of rounds: 1
rounds: ["A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nGenerate underlying data table of the chart. ASSISTANT: TITLE | 기술 및 기능부문에서 장비 규격 유지관리의 평가방법은 무엇인가 \n | 일반부문 및 안개요(20점) | 장비 기능 요구사항 \n 0 | 40 | 18 \n 1 | 31 | 2 \n 2 | 8 | 4 \n 3 | 20 | 9 \n 4 | 14 | 34 \n 5 | 20 | 41 \n 6 | 20 | 7"]
conversation: ["A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nGenerate underlying data table of the chart. ASSISTANT: TITLE | 기술 및 기능부문에서 장비 규격 유지관리의 평가방법은 무엇인가 \n | 일반부문 및 안개요(20점) | 장비 기능 요구사항 \n 0 | 40 | 18 \n 1 | 31 | 2 \n 2 | 8 | 4 \n 3 | 20 | 9 \n 4 | 14 | 34 \n 5 | 20 | 41 \n 6 | 20 | 7<|endoftext|>"]
tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,   320,
         1153,  1180,   342, 28705, 29164,   239,   139,   163, 28705, 31262,
        28705, 29164, 30364, 29775, 29710, 29148, 29305, 28705, 29747, 29859,
        28705, 31982, 31110, 28705, 30127, 29161, 30224, 29288, 29187, 28705,
        31523, 29135, 30240, 30979, 29538, 28705, 30449,   239,   154,   138,
        29324, 29135, 28705,    13, 28705,   342, 28705, 29415, 30192, 29775,
        29710, 28705, 31262, 28705, 30325, 29893, 29517, 28732, 28750, 28734,
        30589, 28731,   342, 28705, 29747, 29859, 28705, 29164, 30364, 28705,
        29517, 29779, 29315, 30968, 28705,    13, 28705, 28734,   342, 28705,
        28781, 28734,   342, 28705, 28740, 28783, 28705,    13, 28705, 28740,
          342, 28705, 28770, 28740,   342, 28705, 28750, 28705,    13, 28705,
        28750,   342, 28705, 28783,   342, 28705, 28781, 28705,    13, 28705,
        28770,   342, 28705, 28750, 28734,   342, 28705, 28774, 28705,    13,
        28705, 28781,   342, 28705, 28740, 28781,   342, 28705, 28770, 28781,
        28705,    13, 28705, 28782,   342, 28705, 28750, 28734,   342, 28705,
        28781, 28740, 28705,    13, 28705, 28784,   342, 28705, 28750, 28734,
          342, 28705, 28787, 28789, 28766,   416,  1009,   772, 28766, 28767])
tensor([[    1,   330, 10706,  1444,   264, 13903,  2188,   304,   396, 18278,
         10895, 13892, 28723,   415, 13892,  5212, 10865, 28725, 10537, 28725,
           304, 27057, 11194,   298,   272,  2188, 28742, 28713,  4224, 28723,
          2223,   725, 28747, 28705,  -200, 28705,    13, 23342, 14164,  1178,
          2401,   302,   272, 10968, 28723,  8602,  8048, 12738, 28747,   320,
          1153,  1180,   342, 28705, 29164,   239,   139,   163, 28705, 31262,
         28705, 29164, 30364, 29775, 29710, 29148, 29305, 28705, 29747, 29859,
         28705, 31982, 31110, 28705, 30127, 29161, 30224, 29288, 29187, 28705,
         31523, 29135, 30240, 30979, 29538, 28705, 30449,   239,   154,   138,
         29324, 29135, 28705,    13, 28705,   342, 28705, 29415, 30192, 29775,
         29710, 28705, 31262, 28705, 30325, 29893, 29517, 28732, 28750, 28734,
         30589, 28731,   342, 28705, 29747, 29859, 28705, 29164, 30364, 28705,
         29517, 29779, 29315, 30968, 28705,    13, 28705, 28734,   342, 28705,
         28781, 28734,   342, 28705, 28740, 28783, 28705,    13, 28705, 28740,
           342, 28705, 28770, 28740,   342, 28705, 28750, 28705,    13, 28705,
         28750,   342, 28705, 28783,   342, 28705, 28781, 28705,    13, 28705,
         28770,   342, 28705, 28750, 28734,   342, 28705, 28774, 28705,    13,
         28705, 28781,   342, 28705, 28740, 28781,   342, 28705, 28770, 28781,
         28705,    13, 28705, 28782,   342, 28705, 28750, 28734,   342, 28705,
         28781, 28740, 28705,    13, 28705, 28784,   342, 28705, 28750, 28734,
           342, 28705, 28787, 28789, 28766,   416,  1009,   772, 28766, 28767]])
```
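One plausible cause, though not confirmed here: the gap of exactly 7 tokens (210 vs. 203) matches `<|endoftext|>` being split into 7 pieces (`<`, `|`, `end`, `of`, `text`, `|`, `>`) by a Llama/Mistral-style SentencePiece tokenizer, rather than being a single special token as in Phi-2's vocabulary, while the conversation template still ends the assistant turn with Phi-2's `<|endoftext|>` separator. A quick check along these lines (a sketch; it assumes both checkpoints are reachable on the Hub):

```python
# Hypothetical check: compare how each tokenizer encodes the literal
# separator string that the conversation template appends after the
# assistant turn.
from transformers import AutoTokenizer

phi_tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
new_tok = AutoTokenizer.from_pretrained("tabtoyou/KoLLaVA-v1.5-Synatra-7b")

sep = "<|endoftext|>"
print(phi_tok(sep, add_special_tokens=False).input_ids)  # expect: one special-token id
print(new_tok(sep, add_special_tokens=False).input_ids)  # expect: several ids -> offset drift
```

If the separator does split like that, the conversation template and the preprocessing for this conversation style would need to use the new tokenizer's own eos token as the round separator instead of the literal `<|endoftext|>` string.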
