feat: support node smart offload, reduce peak VRAM/RAM usage #3709

Open · wants to merge 1 commit into master
Conversation

@storyicon commented Jun 13, 2024

For the purpose of decoupling, or to avoid reloading the model every time a node is executed, node developers tend to separate model loading into an individual node so that execution speed can benefit from the node cache. Although ComfyUI provides the internal model management method model_management.load_models_gpu, it is unrealistic to expect all custom nodes to adopt this approach given the variety of model architectures and developers.

In the current implementation, the outputs of all nodes remain referenced for the entire workflow execution. This prevents larger models or tensors from being garbage-collected, which can lead to CUDA out-of-memory errors.
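To make the problem concrete, here is a minimal sketch (not code from this PR; the outputs dict below is hypothetical) of how a lingering cache reference keeps VRAM occupied:

import torch

# Hypothetical output cache that keeps every node's result alive for the whole run.
outputs = {}
outputs["model_loader"] = torch.randn(256 * 1024 * 1024, device="cuda")  # ~1 GiB of fp32

# Even after every downstream consumer has finished, the dict entry keeps the
# tensor referenced, so its VRAM cannot be reclaimed.

# Dropping the reference allows the memory to be reused:
del outputs["model_loader"]
torch.cuda.empty_cache()  # optionally return cached blocks to the CUDA driver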

Let's take the following workflow as an example:

[screenshot: example workflow]

Before node 24 executes, the models loaded by 🔎Yoloworld Model Loader and 🔎ESAM Model Loader are no longer needed; their GPU memory could be returned to the subsequent memory-intensive KSampler instead of causing a CUDA out-of-memory error in later steps.

[screenshot]

In the example above, AllocateVRAM is used to simulate GPU memory pressure; it is a simple custom node implementation:

import torch

class AnyType(str):
    # Wildcard type: never compares unequal, so it connects to any socket type.
    def __ne__(self, __value: object) -> bool:
        return False

any_type = AnyType("*")

class AllocateVRAM:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "anything": (any_type,),
                "size": ("FLOAT", {"min": 0, "max": 1024, "step": 0.01, "default": 1}),
            }
        }
    RETURN_TYPES = (any_type, "TENSOR")
    RETURN_NAMES = ("anything", "tensor")
    FUNCTION = "main"
    CATEGORY = "util"
    def main(self, anything, size):
        # Allocate `size` GiB of fp32 data on the GPU (2**30 bytes / 4 bytes per element).
        num_elements = 1_073_741_824 // 4
        tensor = torch.randn(int(float(num_elements) * size), device='cuda')
        return (anything, tensor)

It stands in for the memory demands of subsequent, more complex workflow steps.

This PR aims to automatically release unreferenced node outputs, helping to reduce peak VRAM/RAM usage during execution and mitigate out-of-memory issues.
The following modes are supported:

  • 0: disable this feature;
  • 1: release only outputs that were never referenced;
  • 2: release all currently unreferenced outputs.

With --node-smart-offload-level set to 2 in the launch arguments, the example workflow above runs well on an A10 GPU with 22 GB of memory.
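As a rough illustration of the idea, here is a sketch of the release decision per mode (this is not the PR's actual code; the cache/consumer bookkeeping and the function name below are hypothetical):

import gc
import torch

def release_unreferenced_outputs(cache, total_consumers, remaining_consumers, level):
    # cache               -- {node_id: output values} held by the executor (hypothetical)
    # total_consumers     -- {node_id: how many nodes consume this output at all}
    # remaining_consumers -- {node_id: how many consumers have not executed yet}
    # level               -- value of --node-smart-offload-level
    if level == 0:  # feature disabled
        return
    for node_id in list(cache):
        never_referenced = total_consumers.get(node_id, 0) == 0
        currently_unreferenced = remaining_consumers.get(node_id, 0) == 0
        if (level == 1 and never_referenced) or (level == 2 and currently_unreferenced):
            del cache[node_id]  # drop the Python reference to the output
    gc.collect()  # collect anything that just became unreachable
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the allocator

Run between node executions, mode 2 would let the memory held by the 🔎Yoloworld and 🔎ESAM loaders be reclaimed before the KSampler runs.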

This implementation has good compatibility, and I believe many workflows would benefit from it.

Signed-off-by: storyicon <storyicon@foxmail.com>
@patientx

Do we need that AllocateVRAM node? If so, where can we get it?

Tried this with a workflow (always the same seeds and configs) that generates, hires-fixes, face-restores, and hand-restores, and in the end it didn't make much of a noticeable difference. It normally takes over 400 sec, so even a 20-30 sec difference would be good to get, but I saw only about 5 sec at most, and that could be within normal variation.

@storyicon (Author)

AllocateVRAM is a test node, and there is no need to merge it into the default node list, which is why I have provided its implementation code in the PR description. This PR aims to reduce peak VRAM/RAM usage, thereby alleviating OOM issues. It is not intended to accelerate workflow execution, so the lack of significant changes in execution time that you observed is expected. @patientx

@mcmonkey4eva added the Feature label (A new feature to add to ComfyUI) on Jun 28, 2024
@BrechtCorbeel

I do want this as a node. Is it in the node library?

@doctorpangloss

@guill is there a valuable idea here?

@guill (Contributor) commented Jul 21, 2024

I definitely think there's value in being able to dump cached outputs in the middle of graph execution (for users on low-spec machines). That behavior would have to be re-implemented under the forward execution changes though (PR #2666).

Because the forward execution PR allows the dynamic creation of graph edges during execution (which is what allows loops to work), it's currently impossible to know with certainty when the output of a node is "done" being used so that we can dump it. There are two ways we could add this behavior post execution-inversion:

  1. In this edge case (where we have already dumped a node's output and then a new edge is added from that node's output), re-execute the node, as sketched after this list. This would increase execution time, but keep RAM/VRAM usage low.
  2. Restrict what edges a node expansion is allowed to add to the graph. Specifically, we could say that new output connections are only allowed to be added to nodes that already connect to the node performing the expansion. This would work fine for existing use-cases (loops, components, and lazy evaluation), but might restrict things in the future.
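A purely illustrative sketch of option 1 (hypothetical helper names; none of this exists in ComfyUI today):

def get_cached_output(node_id, cache, execute_node):
    # cache        -- {node_id: output}; entries may have been dropped to save memory
    # execute_node -- callable that runs the node again and returns its output
    if node_id not in cache:
        # The output was released earlier; pay the re-execution cost instead of
        # having kept the RAM/VRAM occupied for the whole run.
        cache[node_id] = execute_node(node_id)
    return cache[node_id]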

@doctorpangloss

Can we just disable output dumping if there is any loop behavior? I.e., enable eager unloading as long as the graph is a real static DAG?

@guill (Contributor) commented Jul 21, 2024

Currently, there's no way to know ahead of time which nodes may expand and which won't -- and I'm not totally sure that we would want to disable this feature based on that anyway. If people are enabling this feature, it's likely because they simply don't have the RAM/VRAM to handle keeping the models in memory. If the alternative is "execution fails with an out of memory error", loading the model from disk multiple times may be preferable anyway.

@ethanfel

seems to be a good addition
