feat: support node smart offload, reduce peak VRAM/RAM usage #3709
Conversation
Signed-off-by: storyicon <storyicon@foxmail.com>
Do we need that AllocateVRAM node, and if so, where can we get it? I tried this with a workflow (always the same seeds and configs) that generates, does hires fix, face restore, and hand restore, and in the end it didn't make much of a noticeable difference. The run is normally over 400 seconds, so even a 20-30 second difference would be good to get, but I saw 5 seconds at most, which could be within normal variation.
AllocateVRAM is a test node, and there is no need to merge it into the default node list, which is why I have provided its implementation code in the PR description. This PR aims to reduce peak VRAM/RAM usage, thereby alleviating OOM issues. It is not intended to accelerate workflow execution, so the lack of significant change in execution time that you observed is expected. @patientx
I do want this as a node. Is it in the node library?
@guill is there a valuable idea here?
I definitely think there's value in being able to dump cached outputs in the middle of graph execution (for users on low-spec machines). That behavior would have to be re-implemented under the forward execution changes though (PR #2666). Because the forward execution PR allows the dynamic creation of graph edges during execution (which is what allows loops to work), it's currently impossible to know with certainty when the output of a node is "done" being used so that we can dump it. There are two ways we could add this behavior post execution-inversion:
Can we just disable output dumping if there is any loop behavior? That is, enable eager unloading as long as the graph is a real static DAG?
Currently, there's no way to know ahead of time which nodes may expand and which won't -- and I'm not totally sure that we would want to disable this feature based on that anyway. If people are enabling this feature, it's likely because they simply don't have the RAM/VRAM to handle keeping the models in memory. If the alternative is "execution fails with an out of memory error", loading the model from disk multiple times may be preferable anyway.
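For the static-DAG case raised above, the eager-unloading idea can be illustrated with consumer reference counting: count how many downstream nodes still need each output, and drop the output once that count reaches zero. The sketch below is a minimal illustration under the assumption of a fixed, topologically ordered graph; the function names are hypothetical and not part of ComfyUI's executor.

```python
# Minimal sketch of eager output release for a static DAG.
# Assumes the graph is fully known up front (no dynamic node expansion)
# and iterated in topological order; names here are hypothetical.
from collections import defaultdict


def execute_with_eager_release(graph, execute_node):
    """graph: {node_id: [input_node_ids]} in topological order."""
    # Count how many downstream nodes still need each node's output.
    remaining_consumers = defaultdict(int)
    for node_id, inputs in graph.items():
        for dep in inputs:
            remaining_consumers[dep] += 1

    outputs = {}
    for node_id, inputs in graph.items():  # assumed topological order
        input_values = [outputs[dep] for dep in inputs]
        outputs[node_id] = execute_node(node_id, input_values)

        # Once a dependency has no remaining consumers, drop its cached
        # output so large tensors/models can be garbage-collected.
        for dep in inputs:
            remaining_consumers[dep] -= 1
            if remaining_consumers[dep] == 0:
                del outputs[dep]
    return outputs
```

This is exactly the part that becomes unreliable once edges can be created dynamically during execution, as described above: consumer counts computed up front can grow after the fact.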
Seems to be a good addition.
For the purpose of decoupling, or to avoid reloading the model every time a node is executed, node developers tend to separate the model loading into an individual node, so that execution speed can benefit from the node cache. Although ComfyUI implements the internal model management method `model_management.load_models_gpu`, it is unrealistic to expect all custom nodes to adopt this approach given the variety of model architectures and developers.

In the current implementation, the outputs of all nodes remain referenced throughout workflow execution. This prevents some larger models or tensors from being effectively garbage-collected, resulting in CUDA out-of-memory errors.
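As a concrete illustration of the loader-as-separate-node pattern described above, a minimal ComfyUI-style custom node might look like the sketch below; the class name, checkpoint path, and return type are hypothetical and not taken from any specific node pack.

```python
# Illustrative sketch of the common "separate loader node" pattern.
# The class and load call are hypothetical; the point is that the
# returned model object lives in the node cache after execution.
import torch


class ExampleModelLoader:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"ckpt_path": ("STRING", {"default": "model.ckpt"})}}

    RETURN_TYPES = ("EXAMPLE_MODEL",)
    FUNCTION = "load"
    CATEGORY = "loaders"

    def load(self, ckpt_path):
        # The loaded weights are returned as a node output. ComfyUI caches
        # node outputs, so this model stays referenced (and resident in
        # VRAM if the node moved it to the GPU) for the rest of the run,
        # even after every consumer has finished with it.
        model = torch.load(ckpt_path, map_location="cpu")
        return (model,)
```

Because the returned model sits in the node cache, the memory it occupies cannot be reclaimed until the whole workflow finishes, which is the behavior this PR targets.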
Let's take the following workflow as an example:
Before the execution of node `24`, the models loaded by `🔎Yoloworld Model Loader` and `🔎ESAM Model Loader` should have been released, and this portion of GPU memory could have been returned to the subsequent memory-intensive `KSampler`, instead of causing a CUDA out-of-memory error in later steps.

In the example above, `AllocateVRAM` is used to simulate the GPU memory allocation scenario, and it is a simple custom node implementation. It stands in for subsequent, more complex workflows.
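For reference (the actual implementation is given in the PR description and not reproduced here), a test node of this kind could be sketched roughly as follows; the parameter names, pass-through wiring, and size parameter are illustrative assumptions.

```python
# Hedged sketch of an AllocateVRAM-style test node: it allocates a tensor
# of the requested size on the GPU to simulate the memory pressure of
# heavier downstream nodes. All names and parameters are illustrative.
import torch


class AllocateVRAM:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "any_input": ("*",),  # pass-through so the node can sit anywhere in the graph
            "size_gb": ("FLOAT", {"default": 4.0, "min": 0.0, "max": 64.0}),
        }}

    RETURN_TYPES = ("*", "VRAM_BLOCK")
    FUNCTION = "allocate"
    CATEGORY = "testing"

    def allocate(self, any_input, size_gb):
        # One float32 element is 4 bytes, so size_gb GiB needs
        # size_gb * 1024**3 / 4 elements.
        numel = int(size_gb * (1024 ** 3) / 4)
        block = torch.empty(numel, dtype=torch.float32, device="cuda")
        # Returning the tensor keeps it referenced in the node cache,
        # so the allocated VRAM cannot be reclaimed until release.
        return (any_input, block)
```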
This PR aims to automatically release unreferenced node outputs, helping to reduce peak VRAM/RAM usage during execution and mitigate out-of-memory issues.
The following modes are supported:
When setting `--node-smart-offload-level` to `2` in the launch arguments, the example workflow above runs well on an A10 GPU with 22 GB of memory.

This implementation has good compatibility, and I believe many workflows would benefit from it.
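For context, a launch argument like this is typically registered with argparse. The flag name comes from this PR, but the default value, accepted levels, and help text below are assumptions rather than the PR's actual code.

```python
# Sketch of how such a launch argument could be wired up with argparse.
# The flag name is from this PR; the default, choices, and described
# level semantics are assumptions for illustration only.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--node-smart-offload-level",
    type=int,
    default=0,
    choices=[0, 1, 2],
    help="0: disabled; higher levels release unreferenced node outputs "
         "more aggressively to reduce peak VRAM/RAM usage (assumed semantics).",
)

if __name__ == "__main__":
    args = parser.parse_args()
    print("node smart offload level:", args.node_smart_offload_level)
```

With a flag like this wired into the entrypoint, the behavior described above would be enabled with something like `python main.py --node-smart-offload-level 2`.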