Hallucinations with 8bit Whisper PEFT model - solved with full / half precision #477
-
Hey @sanchit-gandhi, I have noticed that inference speed using QLoRA (4-bit) is relatively slow too. My question is: if we have trained a model using the combination of PEFT + LoRA or QLoRA, do we have to load the model with the same bit width at inference, or, given that the adapters are learned in fp32/fp16, can we run inference in half precision with no problem? That would mean we could train a full-precision model + PEFT (using Accelerate for multiple GPUs) and use it with different dtypes at inference time.
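For reference, a minimal sketch of the training side of that setup (hypothetical base checkpoint and LoRA hyper-parameters; the point is that the LoRA adapter weights stay in full/half precision even though the base model is quantized):

```python
# Hypothetical QLoRA-style setup: 4-bit quantized base model + trainable LoRA adapters.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# The base model is frozen and quantized to 4-bit ...
base_model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # hypothetical base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# ... while the LoRA adapters are added on top and trained in higher precision.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights are saved at the end of training, so in principle the base model can be re-loaded in a different precision at inference time and the same adapters attached on top.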
-
Hey @sanchit-gandhi, I'm wondering whether this is resolved if, once training in 8-bit or 4-bit finishes, the adapter is merged back into the base model (at least for QLoRA):

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_config = PeftConfig.from_pretrained(output_dir)

# Reload the base model in fp16 and attach the trained adapters
model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    return_dict=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(model, output_dir)
model.eval()

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model/")

# Save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(args.model_id)
tokenizer.save_pretrained("/opt/ml/model/")
```
-
Observed several instances where fine-tuning Whisper with PEFT and then running inference in 8-bit precision gives ~5x slower inference speeds vs full precision, and considerably increases Whisper's propensity to hallucinate.
Table for inference speed with batch-size=1:
I'll include code snippets below, and will update these in time to use a fine-tuned PEFT checkpoint and audio sample (currently both are private):
Code to load the PEFT model in 8-bit and pass it to the pipeline:
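A minimal sketch of this step, assuming a hypothetical fine-tuned adapter at "peft-whisper-checkpoint" (fine-tuned from openai/whisper-large-v2); wrapping generation in autocast for the 8-bit model is also an assumption:

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

peft_model_id = "peft-whisper-checkpoint"  # hypothetical adapter repo / local path
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Load the base Whisper model in 8-bit and attach the fine-tuned LoRA adapters
base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, peft_model_id)

processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path)

pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

# Transcribe a (hypothetical) local audio file
with torch.cuda.amp.autocast():
    print(pipe("sample.wav", chunk_length_s=30)["text"])
```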
Loading the model weights and PEFT weights in fp32/fp16 for inference drastically improves inference time compared to 8-bit (with fp16 faster still than fp32), and retains the WER boost we get by fine-tuning with PEFT. There are almost no hallucinations when we run inference in full or half precision.
Code to load the PEFT model in fp16 and pass it to the pipeline:
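The fp16 variant only changes how the base model is loaded; the hypothetical checkpoint names are the same as above, and passing torch_dtype to the pipeline so the input features are cast to half precision is an assumption:

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

peft_model_id = "peft-whisper-checkpoint"  # hypothetical adapter repo / local path
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Load the base Whisper model in half precision and attach the fine-tuned LoRA adapters
base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, peft_model_id)

processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path)

pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
)

print(pipe("sample.wav", chunk_length_s=30)["text"])
```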
Takeaway: PEFT is great for stable, low-resource training in 8-bit. We can then leverage the fine-tuned checkpoints for fast inference in full or half precision, which avoids the hallucination issue.