
Sharding a model checkpoint for deepspeed usage #39

Open

CoderPat opened this issue Dec 5, 2022 · 3 comments

@CoderPat commented Dec 5, 2022

Hey!
I'm using a custom version of this repo to run BLOOM-175B with DeepSpeed and it works great, thank you for this!
I'm thinking of exploring other large models (such as OPT-175B) and was wondering what the process is for creating a pre-sharded, int8 DeepSpeed checkpoint for them, similar to https://huggingface.co/microsoft/bloom-deepspeed-inference-int8.
Is there any documentation or are there example scripts for this?

@mayank31398 (Collaborator)

I am unsure about OPT's compatibility with DeepSpeed.
But if it works, you can simply pass the save_mp_checkpoint_path parameter to the init_inference method.
This will create a pre-sharded fp16 version (assuming it works :) )
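
Roughly, a minimal sketch of what that might look like; the model id, the 8-GPU mp_size, and the output path below are placeholders, and this assumes OPT works with DeepSpeed's kernel injection:

```python
# Sketch: save a pre-sharded fp16 checkpoint while initializing DeepSpeed
# inference. Launch across GPUs with: deepspeed --num_gpus 8 shard_checkpoint.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder model id; swap in the OPT checkpoint you actually want to shard.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b",
    torch_dtype=torch.float16,
)

engine = deepspeed.init_inference(
    model,
    mp_size=8,                                 # number of tensor-parallel shards
    dtype=torch.float16,
    replace_with_kernel_inject=True,           # DeepSpeed's fused inference kernels
    save_mp_checkpoint_path="./opt-sharded",   # shards get written here
)
```

DeepSpeed writes a checkpoint description JSON alongside the shards; on later runs that path can be fed back through init_inference's checkpoint argument so each rank loads only its own shard.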

For generating int8 weights (pre-sharded), look at https://github.com/microsoft/DeepSpeedExamples/blob/master/model_compression/gpt2/bash_script/run_zero_quant.sh

This script generates a quantized version of GPT-2, but it uses QAT (quantization-aware training) and therefore requires training.
I haven't personally tried this, though.

@mayank31398 (Collaborator)

Also watch out for #37

@mayank31398 (Collaborator)

If you don't have memory constraints (i.e., you have enough GPUs), I would encourage you to use fp16, since it is faster.
int8/int4 will be much faster once DeepSpeed starts supporting their kernels.
