ENH Ephemeral GPU offload support for DoRA (#1857)
Adds the concept of ephemeral GPU offloading, i.e. data involved in compute-intensive operations is copied onto the GPU just before the operation is performed, after which the result is moved back to CPU memory. This PR adds support in the DoRA initialization code, but the approach can be applied in a number of places: when the time to perform an operation on the CPU heavily dominates the time to transfer the data, ephemeral transfers add a fairly small VRAM overhead (depending on the size of the model/adapter) while speeding up certain operations by orders of magnitude.

For example, a Llama3-8B DoRA adapter with r=64 would add an overhead of 2 x (64 x 4096 x 2 + 4096 x 4096) bytes (assuming fp16), i.e. about 33 MB. A Llama3-70B adapter with r=32 would add 2 x (32 x 8192 x 2 + 8192 x 8192) bytes = about 130 MB.

By making use of ephemeral GPU offloading, more efficient juggling of data between GPU and CPU may become possible: instead of always loading as much as we can onto the GPU and then enduring CPU slowness for whatever does not fit, we intentionally leave a (modest) chunk of VRAM free for optimizations like these, and the end result is a much (MUCH) faster experience.
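The overhead figures above follow from simple arithmetic: two low-rank factors of shape (r, hidden) plus one (hidden, hidden) buffer, at 2 bytes per element for fp16. A quick sketch (the function name is illustrative, not part of the PR):

```python
def dora_ephemeral_overhead_bytes(r: int, hidden: int, bytes_per_elem: int = 2) -> int:
    # Two low-rank factors of shape (r, hidden) plus one (hidden, hidden)
    # buffer; bytes_per_elem is 2 for fp16.
    elems = 2 * r * hidden + hidden * hidden
    return bytes_per_elem * elems


# Llama3-8B-style adapter, r=64, hidden size 4096:
print(dora_ephemeral_overhead_bytes(64, 4096) / 2**20)  # prints 33.0 (MiB)
# Llama3-70B-style adapter, r=32, hidden size 8192:
print(dora_ephemeral_overhead_bytes(32, 8192) / 2**20)  # prints 129.0 (MiB)
```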
Showing 13 changed files with 324 additions and 9 deletions.
# Copyright 2024-present the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Example script demonstrating the difference in load time for a model with a DoRA adapter when using
ephemeral GPU offloading vs. doing everything purely on the CPU.

Example outputs:

$ python load_with_dora.py
--- Loading model ---
Loading checkpoint shards: 100%|██████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
--- Loading PeftModel ---
--- Done ---
Model loading time: 4.83s
PeftModel loading time: 28.14s
Use ephemeral GPU offloading: False

(Note: if this was the first time you ran the script, or if your cache was cleared, the times shown
above are invalid, due to the time taken to download the model and DoRA files. Just re-run the
script in this case.)

$ python load_with_dora.py --ephemeral_gpu_offload
--- Loading model ---
Loading checkpoint shards: 100%|██████████████████████████████| 4/4 [00:03<00:00,  1.11it/s]
--- Loading PeftModel ---
--- Done ---
Model loading time: 4.28s
PeftModel loading time: 16.59s
Use ephemeral GPU offloading: True

(Note: if this was the first time you ran the script, or if your cache was cleared, the times shown
above are invalid, due to the time taken to download the model and DoRA files. Just re-run the
script in this case.)
"""

import argparse
import time

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

from peft import PeftModel


def main():
    parser = argparse.ArgumentParser(description="Load a model with DoRA using ephemeral GPU offloading")
    parser.add_argument("--model", type=str, default="NousResearch/Hermes-2-Pro-Mistral-7B", help="Model to load")
    parser.add_argument(
        "--dora",
        type=str,
        default="peft-internal-testing/DoRA-Hermes-2-Pro-Mistral-7B",
        help="DoRA to use",
    )
    parser.add_argument("--ephemeral_gpu_offload", action="store_true", help="Use ephemeral GPU offloading")
    parser.add_argument(
        "--merge_model_path", type=str, help="Merge the model with the DoRA model and save to the given path"
    )
    args = parser.parse_args()

    peft_model_kwargs = {
        "ephemeral_gpu_offload": args.ephemeral_gpu_offload,
        "max_memory": {"cpu": "256GiB"},
        "device_map": {"": "cpu"},
    }

    # Predownload the model and adapter so download time does not skew the timings below.
    try:
        snapshot_download(repo_id=args.model)
    except Exception as e:
        print(f"Failed to download model: {e}")
        # We continue anyway as this might be e.g. a local directory
    try:
        snapshot_download(repo_id=args.dora)
    except Exception as e:
        print(f"Failed to download DoRA: {e}")
        # We continue anyway as this might be e.g. a local directory

    start = time.perf_counter()
    print("--- Loading model ---")
    model = AutoModelForCausalLM.from_pretrained(args.model)
    model_time = time.perf_counter() - start
    print("--- Loading PeftModel ---")
    peft_model = PeftModel.from_pretrained(model, args.dora, **peft_model_kwargs)
    print("--- Done ---")
    peft_model_time = time.perf_counter() - start

    print(f"Model loading time: {model_time:.2f}s")
    print(f"PeftModel loading time: {peft_model_time:.2f}s")
    print(f"Use ephemeral GPU offloading: {args.ephemeral_gpu_offload}")

    if args.merge_model_path is not None:
        merged_model = peft_model.merge_and_unload(progressbar=True)
        merged_model.save_pretrained(args.merge_model_path)


if __name__ == "__main__":
    main()
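The ephemeral transfer pattern that this script benchmarks can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual PEFT implementation; the function name is hypothetical, and it falls back to CPU when no GPU is present:

```python
import torch


def ephemeral_offload_matmul(a_cpu: torch.Tensor, b_cpu: torch.Tensor) -> torch.Tensor:
    """Copy operands to the GPU (if available), run the compute-heavy op there,
    and move the result straight back to CPU memory, so VRAM is only held transiently."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = a_cpu.to(device)  # ephemeral copy onto the accelerator
    b = b_cpu.to(device)
    result = a @ b  # compute-intensive operation runs on `device`
    return result.to("cpu")  # result lands back in CPU memory


# The operands stay resident on CPU; only temporary copies visit the GPU.
x = torch.randn(4096, 64)
y = torch.randn(64, 4096)
out = ephemeral_offload_matmul(x, y)
```

The peak extra VRAM is just the operands plus the result, which is what makes the overhead for a DoRA adapter as small as the figures quoted in the commit message.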