feat: support node smart offload, reduce peak VRAM/RAM usage #3709

Open · wants to merge 1 commit into master
Conversation

@storyicon commented Jun 13, 2024

For the purpose of decoupling, or to avoid reloading the model every time a node is executed, node developers tend to separate model loading into an individual node so that execution speed can benefit from the node cache. Although ComfyUI provides the internal model management method model_management.load_models_gpu, it is unrealistic to expect all custom nodes to adopt this approach given the variety of model architectures and developers.

In the current implementation, the outputs of all nodes remain referenced for the entire workflow execution. This prevents larger models or tensors from being garbage-collected, which can lead to CUDA out-of-memory errors.
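To make the problem concrete, here is a minimal sketch (not code from this PR; the outputs dict below is hypothetical) of how a lingering cache reference keeps VRAM occupied:

import torch

# Hypothetical output cache that keeps every node's result alive for the whole run.
outputs = {}
outputs["model_loader"] = torch.randn(256 * 1024 * 1024, device="cuda")  # ~1 GiB of fp32

# Even after every downstream consumer has finished, the dict entry keeps the
# tensor referenced, so its VRAM cannot be reclaimed.

# Dropping the reference allows the memory to be reused:
del outputs["model_loader"]
torch.cuda.empty_cache()  # optionally return cached blocks to the CUDA driver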

Let's take the following workflow as an example:

[screenshot: example workflow]

Before node 24 executes, the models loaded by 🔎Yoloworld Model Loader and 🔎ESAM Model Loader are no longer needed; their GPU memory could be returned to the subsequent memory-intensive KSampler instead of causing a CUDA out-of-memory error in later steps.

[screenshot]

In the example above, AllocateVRAM is used to simulate GPU memory pressure; it is a simple custom node implementation:

import torch

class AnyType(str):
    # Wildcard type: never compares unequal, so it connects to any socket type.
    def __ne__(self, __value: object) -> bool:
        return False

any_type = AnyType("*")

class AllocateVRAM:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "anything": (any_type,),
                "size": ("FLOAT", {"min": 0, "max": 1024, "step": 0.01, "default": 1}),
            }
        }
    RETURN_TYPES = (any_type, "TENSOR")
    RETURN_NAMES = ("anything", "tensor")
    FUNCTION = "main"
    CATEGORY = "util"
    def main(self, anything, size):
        # Allocate `size` GiB of fp32 data on the GPU (2**30 bytes / 4 bytes per element).
        num_elements = 1_073_741_824 // 4
        tensor = torch.randn(int(float(num_elements) * size), device='cuda')
        return (anything, tensor)

It stands in for the memory demands of subsequent, more complex workflow steps.

This PR aims to automatically release unreferenced node outputs, helping to reduce peak VRAM/RAM usage during execution and mitigate out-of-memory issues.
The following modes are supported:

  • 0: disable this feature;
  • 1: release only outputs that were never referenced;
  • 2: release all currently unreferenced outputs.

With --node-smart-offload-level set to 2 in the launch arguments, the example workflow above runs well on an A10 GPU with 22 GB of memory.
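As a rough illustration of the idea, here is a sketch of the release decision per mode (this is not the PR's actual code; the cache/consumer bookkeeping and the function name below are hypothetical):

import gc
import torch

def release_unreferenced_outputs(cache, total_consumers, remaining_consumers, level):
    # cache               -- {node_id: output values} held by the executor (hypothetical)
    # total_consumers     -- {node_id: how many nodes consume this output at all}
    # remaining_consumers -- {node_id: how many consumers have not executed yet}
    # level               -- value of --node-smart-offload-level
    if level == 0:  # feature disabled
        return
    for node_id in list(cache):
        never_referenced = total_consumers.get(node_id, 0) == 0
        currently_unreferenced = remaining_consumers.get(node_id, 0) == 0
        if (level == 1 and never_referenced) or (level == 2 and currently_unreferenced):
            del cache[node_id]  # drop the Python reference to the output
    gc.collect()  # collect anything that just became unreachable
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the allocator

Run between node executions, mode 2 would let the memory held by the 🔎Yoloworld and 🔎ESAM loaders be reclaimed before the KSampler runs.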

This implementation has good compatibility, and I believe many workflows would benefit from it.

Signed-off-by: storyicon <storyicon@foxmail.com>
@patientx

Do we need that AllocateVRAM node? If so, where can we get it?

Tried this with a workflow (always the same seeds and configs) that generates, hires-fixes, face-restores, and hand-restores, and in the end it didn't make much of a noticeable difference. It normally takes over 400 sec, so even a 20-30 sec difference would be good to get, but I saw only about 5 sec at most, and that could be within normal variation.

@storyicon (Author)

AllocateVRAM is a test node, and there is no need to merge it into the default node list, which is why I have provided its implementation code in the PR description. This PR aims to reduce peak VRAM/RAM usage, thereby alleviating OOM issues. It is not intended to accelerate workflow execution, so the lack of significant changes in execution time that you observed is expected. @patientx

@mcmonkey4eva added the Feature label (A new feature to add to ComfyUI) on Jun 28, 2024
@BrechtCorbeel

I do want this as a node. Is it in the node library?

@doctorpangloss

@guill is there a valuable idea here?

@guill (Contributor) commented Jul 21, 2024

I definitely think there's value in being able to dump cached outputs in the middle of graph execution (for users on low-spec machines). That behavior would have to be re-implemented under the forward execution changes though (PR #2666).

Because the forward execution PR allows the dynamic creation of graph edges during execution (which is what allows loops to work), it's currently impossible to know with certainty when the output of a node is "done" being used so that we can dump it. There are two ways we could add this behavior post execution-inversion:

  1. In this edge case (where we have already dumped a node's output and then a new edge is added from that node's output), re-execute the node, as sketched after this list. This would increase execution time, but keep RAM/VRAM usage low.
  2. Restrict what edges a node expansion is allowed to add to the graph. Specifically, we could say that new output connections are only allowed to be added to nodes that already connect to the node performing the expansion. This would work fine for existing use-cases (loops, components, and lazy evaluation), but might restrict things in the future.
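A purely illustrative sketch of option 1 (hypothetical helper names; none of this exists in ComfyUI today):

def get_cached_output(node_id, cache, execute_node):
    # cache        -- {node_id: output}; entries may have been dropped to save memory
    # execute_node -- callable that runs the node again and returns its output
    if node_id not in cache:
        # The output was released earlier; pay the re-execution cost instead of
        # having kept the RAM/VRAM occupied for the whole run.
        cache[node_id] = execute_node(node_id)
    return cache[node_id]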

@doctorpangloss

Can we just disable output dumping if there is any loop behavior? I.e., enable eager unloading as long as the graph is a real static DAG?

@guill (Contributor) commented Jul 21, 2024

Currently, there's no way to know ahead of time which nodes may expand and which won't -- and I'm not totally sure that we would want to disable this feature based on that anyway. If people are enabling this feature, it's likely because they simply don't have the RAM/VRAM to handle keeping the models in memory. If the alternative is "execution fails with an out of memory error", loading the model from disk multiple times may be preferable anyway.

@ethanfel

seems to be a good addition
