When I used GaLore, the learning rate was set to 8e-6, but the training rate shown was 0.001 #31707
Comments
Hey @Minami-su, which version of transformers are you on? This reminds me of an older, very similar issue (#30082) which should have been fixed by #30085 (>= v4.40.0). I'm still pretty sure it is more of a display issue.
4.42.3. However, I found that the lr and grad shown in the logs had problems; in fact, the actual values were changing.
By changes, do you mean that the actual display of lr/grad changed? I might look into it when I have time. GaLore currently uses a lot of dummy objects for display purposes, which might be causing an issue here again (just my first intuition).
The lr shown is not changing, but the actual training lr does change when I set the lr to 1e-5 and to 1e-2.
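For reference, what a healthy cosine schedule should report can be checked in isolation with plain transformers/torch APIs (a minimal sketch, independent of GaLore):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=100
)

for step in range(5):
    optimizer.step()
    scheduler.step()
    # a cosine schedule decays a little every step; a completely flat
    # readout like the one reported here points at a display problem,
    # not a scheduling one
    print(step, scheduler.get_last_lr())
```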
@Minami-su small update on my side. I could reproduce the issue with a somewhat shrunken variant of your script:

```python
import os
import torch
import transformers
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, logging
logging.set_verbosity(logging.DEBUG)
os.environ["WANDB_DISABLED"] = "true"
# model/data params
base_model: str = "gpt2" # the only required argument
data_path: str = "yahma/alpaca-cleaned"
output_dir: str = "./lora-alpaca"
# training hyperparams
batch_size: int = 32
num_epochs: int = 3
learning_rate: float = 3e-4
cutoff_len: int = 256
val_set_size: int = 2000
# llm hyperparams
train_on_inputs: bool = True # if False, masks out inputs in loss
add_eos_token: bool = False
group_by_length: bool = False # faster, but produces an odd training loss curve
# wandb params
wandb_project: str = ""
wandb_run_name: str = ""
wandb_watch: str = "" # options: false | gradients | all
wandb_log_model: str = "" # options: false | true
resume_from_checkpoint: str = None # either training checkpoint or final adapter
prompt_template_name: str = "alpaca2" # The prompt template to use, will default to alpaca.
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    print(
        f"Training Alpaca-LoRA model with params:\n"
        f"base_model: {base_model}\n"
        f"data_path: {data_path}\n"
        f"output_dir: {output_dir}\n"
        f"batch_size: {batch_size}\n"
        f"num_epochs: {num_epochs}\n"
        f"learning_rate: {learning_rate}\n"
        f"cutoff_len: {cutoff_len}\n"
        f"val_set_size: {val_set_size}\n"
        f"train_on_inputs: {train_on_inputs}\n"
        f"add_eos_token: {add_eos_token}\n"
        f"group_by_length: {group_by_length}\n"
        f"wandb_project: {wandb_project}\n"
        f"wandb_run_name: {wandb_run_name}\n"
        f"wandb_watch: {wandb_watch}\n"
        f"wandb_log_model: {wandb_log_model}\n"
        f"resume_from_checkpoint: {resume_from_checkpoint or False}\n"
        f"prompt template: {prompt_template_name}\n"
    )
assert (
    base_model
), "Please specify a --base_model, e.g. --base_model='huggyllama/llama-7b'"
# Check if parameter passed or if set within environ
use_wandb = len(wandb_project) > 0 or (
    "WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0
)
# Only overwrite environ if wandb param passed
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project
if len(wandb_watch) > 0:
    os.environ["WANDB_WATCH"] = wandb_watch
if len(wandb_log_model) > 0:
    os.environ["WANDB_LOG_MODEL"] = wandb_log_model
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
# needed for models like gpt2
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if base_model.find("qwen") != -1 or base_model.find("Qwen") != -1:
    tokenizer.add_special_tokens({"bos_token": "<|im_start|>"})
    tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})
    tokenizer.add_special_tokens({"pad_token": "<|endoftext|>"})
tokenizer.padding_side = "left"  # Allow batched inference
def tokenize(prompt, add_eos_token=True):
    # there's probably a way to do this with the tokenizer settings
    # but again, gotta move fast
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=cutoff_len,
        padding=False,
        return_tensors=None,
    )
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < cutoff_len
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    result["labels"] = result["input_ids"].copy()
    return result

def generate_and_tokenize_prompt(data_point):
    full_prompt = data_point["instruction"] + data_point["input"] + data_point["output"]
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt
print(tokenizer.pad_token_id)
print(tokenizer.pad_token)
print(tokenizer.bos_token_id)
print(tokenizer.bos_token)
print(tokenizer.eos_token_id)
print(tokenizer.eos_token)
if data_path.endswith(".json") or data_path.endswith(".jsonl"):
    data = load_dataset("json", data_files=data_path)
else:
    data = load_dataset(data_path)

if val_set_size > 0:
    train_val = data["train"].train_test_split(
        test_size=val_set_size, shuffle=True, seed=42
    )
    train_data = train_val["train"].shuffle().map(generate_and_tokenize_prompt)
    val_data = train_val["test"].shuffle().map(generate_and_tokenize_prompt)
else:
    train_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
    val_data = None
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=1,
        warmup_steps=0,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        bf16=True,
        lr_scheduler_type="cosine",
        optim="galore_adamw_8bit_layerwise",
        optim_target_modules=[
            "q_proj", "k_proj", "down_proj", "up_proj",
            "gate_proj", "v_proj", "o_proj", "lm_head",
        ],
        optim_args="rank=1024, update_proj_gap=500, scale=0.25",
        eval_strategy="steps" if val_set_size > 0 else "no",
        save_strategy="steps",
        logging_strategy="steps",
        logging_steps=10,
        eval_steps=100 if val_set_size > 0 else None,
        save_steps=200,
        output_dir=output_dir,
        save_total_limit=2,
        load_best_model_at_end=True if val_set_size > 0 else False,
        group_by_length=group_by_length,
        report_to="wandb" if use_wandb else None,
        run_name=wandb_run_name if use_wandb else None,
    ),
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
trainer.train()
model.save_pretrained(output_dir)
```

It is a display issue due to the (cosine) lr scheduler. I'm working on a fix that I'll submit in a PR. If you're interested in why this happens: in short, GaLore works param-wise, on each parameter individually, and to conform to this without interrupting it, dummy schedulers and optimizers are used as a global overhead, so that they don't interfere with the param-wise updates. In this case, the dummy scheduler was the problem: it did not follow the actual schedule, and the param-wise learning rates were discarded in the process. To be clear though, it's entirely a display issue.
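A minimal sketch of the layerwise pattern described above (a reconstruction for illustration, not the actual transformers implementation; it assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`):

```python
import torch

model = torch.nn.Linear(10, 10)

# one *real* optimizer + scheduler per parameter; the updates happen
# inside gradient hooks, outside the Trainer's view
optimizer_dict, scheduler_dict = {}, {}
for p in model.parameters():
    optimizer_dict[p] = torch.optim.AdamW([p], lr=1e-5)
    scheduler_dict[p] = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer_dict[p], T_max=100
    )

def optimizer_hook(p):
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()
    scheduler_dict[p].step()  # the per-param lr really does decay

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

# the Trainer itself only holds a placeholder optimizer, and its static
# lr is what ends up in the progress logs
dummy_param = torch.nn.Parameter(torch.zeros(1))
dummy_optimizer = torch.optim.SGD([dummy_param], lr=1e-5)
```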
@vasqu Thank you for your explanation, I figured it out.
@Minami-su PR is up, and no problem! Small edit: You should also see the changes in the lr when using warmup steps. |
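As an illustration (the values are arbitrary; the arguments mirror the repro script above), enabling warmup makes the logged lr visibly ramp up and then decay under the cosine schedule once the fix is in:

```python
args = transformers.TrainingArguments(
    output_dir="./out",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,  # lr should climb from 0 to 1e-5 over the first 100 steps
    optim="galore_adamw_8bit_layerwise",
    optim_target_modules=["q_proj", "v_proj"],
    logging_steps=10,  # the lr column should change between log lines
)
```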