I was seeing about 7-8 seconds per step with 3x A100-80G under DeepSpeed ZeRO 2, with around 58,000 MB of VRAM in use on each GPU, when training on mixed 512px and 1024px data.
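For reference, a minimal sketch of what an `accelerate` config for this kind of 3-GPU DeepSpeed ZeRO-2 run might look like. The specific values here (`num_processes`, `mixed_precision`, offload settings) are assumptions for illustration, not the settings actually used in this thread:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                    # shard optimizer state and gradients across GPUs
  gradient_accumulation_steps: 1
  offload_optimizer_device: none   # keep optimizer state on GPU
mixed_precision: bf16
num_machines: 1
num_processes: 3                   # one process per A100
```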
Hi all,
I'm trying full fine-tuning of FLUX and managed to get it working at ~20 sec/iter with batch size 1 and 512x512 resolution on a single 80G A100. This takes about 71 GB of VRAM. I'd really like to run at higher resolution and a larger batch size (batch size 1 is honestly a bit of a stretch for good results, I guess), so any advice would be very appreciated. Here is my accelerate config:
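One generic workaround when batch size 1 only just fits in memory (independent of the FLUX-specific setup above) is gradient accumulation: run several micro-batches, scale each loss, and step once. A minimal PyTorch sketch, using a tiny linear model purely for illustration:

```python
# Sketch: emulate a larger effective batch size with gradient accumulation
# when per-GPU memory only fits small micro-batches. The model and data here
# are placeholders, not the actual FLUX training setup.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()

# Full-batch gradients, for comparison.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 2. Dividing each micro-batch loss by the
# number of micro-batches makes the summed gradients match the full-batch
# mean loss, so only the peak activation memory shrinks, not the math.
model.zero_grad()
num_micro = 4
for x_chunk, y_chunk in zip(data.chunk(num_micro), target.chunk(num_micro)):
    loss = loss_fn(model(x_chunk), y_chunk) / num_micro
    loss.backward()  # gradients add up across micro-batches

# optimizer.step() would go here, once per accumulation cycle
assert torch.allclose(full_grad, model.weight.grad, atol=1e-6)
```

With `accelerate`, the same effect is exposed via `gradient_accumulation_steps` in the DeepSpeed config, so the training loop itself does not need to change.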