
# Hugging Face `transformers` Format

## Load in Low Precision

You may apply INT4 optimizations to any Hugging Face Transformers model as follows:

```python
# load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
```
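To give an intuition for what `load_in_4bit` does, the sketch below implements symmetric INT4 quantization of a weight list in plain Python: each float is mapped to a signed integer in [-7, 7] with one shared scale. This is only an illustration of the idea; ipex_llm's actual quantization kernels and data layout differ.

```python
# Sketch: symmetric INT4 quantization, the idea behind load_in_4bit.
# Not the actual ipex_llm implementation.

def quantize_int4(weights):
    """Map floats to signed 4-bit integers in [-7, 7] with a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.7, -0.07]
q, scale = quantize_int4(weights)
approx = dequantize_int4(q, scale)
# each recovered weight is within half a quantization step of the original
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, weights))
```

The model's matrix multiplications then run against the compact 4-bit codes (dequantized on the fly), which is where the memory and bandwidth savings come from.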

After loading the Hugging Face Transformers model, you may run the optimized model as follows:

```python
# run the optimized model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

> **Tip**
>
> See the complete CPU examples here and GPU examples here.

> **Note**
>
> You may apply more low-bit optimizations (including INT8, INT5 and INT4) as follows:
>
> ```python
> model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
> ```
>
> See the CPU example here and GPU example here.
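The `load_in_low_bit` string names the quantization scheme; a name like `sym_int5` indicates a symmetric signed 5-bit representation. As a hedged illustration (the actual on-device packing is ipex_llm's internal detail), the representable integer range for a symmetric n-bit scheme can be computed as:

```python
# Sketch: representable range of a symmetric n-bit integer scheme,
# matching names like "sym_int4", "sym_int5", "sym_int8".

def sym_range(bits):
    """Symmetric signed range for an n-bit code: [-(2**(n-1) - 1), 2**(n-1) - 1]."""
    limit = 2 ** (bits - 1) - 1
    return -limit, limit

for name, bits in [("sym_int4", 4), ("sym_int5", 5), ("sym_int8", 8)]:
    lo, hi = sym_range(bits)
    print(f"{name}: weights mapped to integers in [{lo}, {hi}]")
```

More bits give a wider range and finer resolution (less quantization error) at the cost of a larger memory footprint, which is the trade-off the different `load_in_low_bit` options expose.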

## Save & Load

After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:

```python
# save the optimized low-bit model
model.save_low_bit(model_path)

# load the optimized low-bit model back
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
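A saved low-bit checkpoint is small because several low-bit codes fit in one byte: two signed 4-bit codes per byte for INT4, for example. The sketch below shows the packing idea in plain Python; ipex_llm's on-disk format is its own internal detail and differs from this.

```python
# Sketch: packing pairs of signed 4-bit codes into bytes, the reason a
# saved INT4 model is roughly 1/8 the size of an fp32 checkpoint.

def pack_int4(codes):
    """Pack pairs of 4-bit codes (each in [-7, 7]) into single bytes."""
    packed = bytearray()
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0x0F
        hi = (codes[i + 1] & 0x0F) if i + 1 < len(codes) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Recover signed 4-bit codes from packed bytes."""
    def sign(v):  # sign-extend a 4-bit two's-complement value
        return v - 16 if v >= 8 else v
    codes = []
    for b in packed:
        codes.append(sign(b & 0x0F))
        codes.append(sign(b >> 4))
    return codes[:count]

codes = [1, -5, 3, 7, -1, 0]
packed = pack_int4(codes)
assert len(packed) == 3                    # 6 codes -> 3 bytes
assert unpack_int4(packed, 6) == codes     # round-trips exactly
```

Loading with `load_low_bit` therefore skips the original fp32/fp16 weights entirely, which also makes reload much faster than re-quantizing from the full-precision model.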

> **Tip**
>
> See the complete CPU examples here and GPU examples here.