In this example, we'll be training a Flux.1 LoRA model using the SimpleTuner toolkit.
When you're training every component of the model, a rank-16 LoRA ends up using a bit more than 40GB of VRAM for training.
You'll need, at minimum, a single A40 GPU or, ideally, multiple A6000s. Luckily, these are readily available through providers such as TensorDock for extremely low rates (<$2/hr).
Unlike with other models, AMD and Apple GPUs do not work for training Flux.
Make sure that you have python installed. You can check this by running:
python --version
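If the python command is not found, your system may only expose the interpreter as python3, in which case the equivalent check is:
python3 --version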
Clone the SimpleTuner repository and set up the python venv:
git clone --branch=main https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
python -m venv .venv
source .venv/bin/activate
pip install -U poetry pip
Note: We're currently relying on the `main` branch here, but after the next release, we'll use the `release` branch instead.
Depending on your system, you will run one of 3 commands:
# MacOS
poetry install --no-root -C install/apple
# Linux
poetry install --no-root
# Linux with ROCM
poetry install --no-root -C install/rocm
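As an optional sanity check (not one of the official steps), you can confirm that PyTorch was installed into the venv and can see your accelerator:
# Confirm torch imports inside the venv and report its version.
python -c "import torch; print(torch.__version__)"
# Should print True on NVIDIA and ROCm systems; Apple systems use the MPS backend instead.
python -c "import torch; print(torch.cuda.is_available())"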
The current `main` branch of Diffusers does not yet have LoRA support for Flux merged in. To obtain the correct build, run the following commands:
pip uninstall diffusers
pip install git+https://github.com/huggingface/diffusers@lora-support-flux
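To confirm that the branch build was picked up rather than a stock release, you can optionally print the installed Diffusers version (the exact value reported will vary):
python -c "import diffusers; print(diffusers.__version__)"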
To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.
Copy `config/config.env.example` to `config/config.env`:
cp config/config.env.example config/config.env
There, you will need to modify the following variables:
- `MODEL_TYPE` - Set this to `lora`.
- `FLUX` - Set this to `true`.
- `MODEL_NAME` - Set this to `black-forest-labs/FLUX.1-dev`.
  - Note that you will probably need to log in to Hugging Face and be granted access to download this model. We will go over logging in to Hugging Face later in this tutorial.
- `OUTPUT_DIR` - Set this to the directory where you want to store your outputs and datasets. It's recommended to use a full path here.
- `VALIDATION_RESOLUTION` - As Flux is a 1024px model, you can set this to `1024x1024`.
  - Additionally, Flux was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them: `1024x1024,1280x768,2048x2048`
- `VALIDATION_GUIDANCE` - Use whatever you are used to selecting at inference time for Flux.
- `TRAINER_EXTRA_ARGS` - Here, you can place `--lora_rank=4` if you wish to substantially reduce the size of the LoRA being trained. This can help with VRAM use.
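Putting those together, the relevant section of `config/config.env` might look roughly like the sketch below (assuming the `export VAR=value` format used in the example file); the output path, guidance value, and rank are placeholders to adjust for your own setup:
export MODEL_TYPE="lora"
export FLUX=true
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export OUTPUT_DIR="/some/full/path/for/outputs"   # placeholder - use a full path
export VALIDATION_RESOLUTION="1024x1024"
export VALIDATION_GUIDANCE=3.5                    # whatever you normally use at inference
export TRAINER_EXTRA_ARGS="--lora_rank=4"         # optional: smaller LoRA, lower VRAM use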
Hugging Face Optimum-Quanto, tested on Apple and NVIDIA systems, can be used to reduce precision and VRAM requirements, allowing Flux to be trained with just 20GB.
Inside your SimpleTuner venv:
pip install optimum-quanto
# choices: int8-quanto, int4-quanto, int2-quanto, fp8-quanto
# int8-quanto was tested with a single subject dreambooth LoRA.
# fp8-quanto does not work on Apple systems. you must use int levels.
# int2-quanto is pretty extreme and gets the whole rank-1 LoRA down to about 13.9GB VRAM.
# may the gods have mercy on your soul, should you push things Too Far.
export TRAINER_EXTRA_ARGS="--base_model_precision=int8-quanto"
# Maybe you want the text encoders to remain full precision so your text embeds are cake.
# We unload the text encoders before training, so, that's not an issue during training time - only during pre-caching.
# Alternatively, you can go ham on quantisation here and run them in int4 or int8 mode, because no one can stop you.
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change"
# When you're quantising the model, we're not in pure bf16 anymore.
# Since adamw_bf16 will never work with this setup, select another optimiser.
# I know the spelling is different than everywhere else, but we're in too deep to fix it now.
export OPTIMIZER="adafactor" # or maybe prodigy
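If you'd rather keep these settings in `config/config.env` than export them in your shell each time, the same pieces can be combined there; this is just a sketch of one possible combination:
# Quantise the base model to int8 via Quanto, leave the text encoders untouched,
# and switch away from the bf16-only optimiser.
export TRAINER_EXTRA_ARGS="--lora_rank=4 --base_model_precision=int8-quanto --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change"
export OPTIMIZER="adafactor"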
It's crucial to have a substantial dataset to train your model on, and there are limits on how small it can be: the bare minimum dataset size is `TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS`, and it must also be larger than `VAE_BATCH_SIZE`. The dataset will not be usable if it is too small.
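For example, with purely illustrative values of `TRAIN_BATCH_SIZE=4` and `GRADIENT_ACCUMULATION_STEPS=2`, you would need at least 4 × 2 = 8 images, and with `VAE_BATCH_SIZE=8` you would need more than 8 images.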
Depending on the dataset you have, you will need to set up your dataset directory and dataloader configuration file differently. In this example, we will be using pseudo-camera-10k as the dataset.
In your `OUTPUT_DIR` directory, create a `multidatabackend.json`:
[
{
"id": "pseudo-camera-10k-flux",
"type": "local",
"crop": true,
"crop_aspect": "square",
"crop_style": "center",
"resolution": 1.0,
"minimum_image_size": 0.5,
"maximum_image_size": 1.0,
"target_downsample_size": 1.0,
"resolution_type": "area",
"cache_dir_vae": "cache/vae/flux/pseudo-camera-10k",
"instance_data_dir": "datasets/pseudo-camera-10k",
"disabled": false,
"skip_file_discovery": "",
"caption_strategy": "filename",
"metadata_backend": "json"
},
{
"id": "text-embeds",
"type": "local",
"dataset_type": "text_embeds",
"default": true,
"cache_dir": "cache/text/flux/pseudo-camera-10k",
"disabled": false,
"write_batch_size": 128
}
]
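A malformed JSON file is an easy way to stall things before training starts, so it can be worth validating the file first; this optional check uses Python's standard library rather than anything SimpleTuner-specific:
# Fails loudly if multidatabackend.json contains a syntax error.
python -m json.tool multidatabackend.json > /dev/null && echo "JSON OK"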
Then, navigate to the `OUTPUT_DIR` directory and create a `datasets` directory:
apt -y install git-lfs
mkdir -p datasets
pushd datasets
git clone https://huggingface.co/datasets/ptx0/pseudo-camera-10k
popd
This will download about 10k photograph samples to your `datasets/pseudo-camera-10k` directory, which will be automatically created for you.
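You can optionally confirm the download by counting the files that landed in the dataset directory (the exact count depends on the current state of the upstream repository); if the clone produced only small LFS pointer files instead of images, run `git lfs install` once and re-clone:
find datasets/pseudo-camera-10k -type f | wc -l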
You'll want to log in to WandB and HF Hub before beginning training, especially if you're using `PUSH_TO_HUB=true` and `--report_to=wandb`.
If you're going to be pushing items to a Git LFS repository manually, you should also run `git config --global credential.helper store`.
Run the following commands:
wandb login
and
huggingface-cli login
Follow the instructions to log in to both services.
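If you're provisioning the machine from a script rather than an interactive shell, both CLIs also accept tokens directly; the environment variables below are placeholders for your own credentials:
# Non-interactive logins using placeholder token variables.
wandb login "$WANDB_API_KEY"
huggingface-cli login --token "$HF_TOKEN"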
From the SimpleTuner directory, one simply has to run:
bash train.sh
This will begin the text embed and VAE output caching to disk.
For more information, see the dataloader and tutorial documents.
Note: It's unclear whether training on multi-aspect buckets works correctly for Flux at the moment. It's recommended to use `crop_style=random` and `crop_aspect=square`.
- A model as large as 12B has empirically performed better with lower learning rates.
- LoRA at 1e-4 might totally roast the thing. LoRA at 1e-7 does nearly nothing.
- Minimum 8bit quantisation is required for a 24G card to train this model - but 32G (V100) cards suffer a more tragic fate.
- Without quantising the model, a rank-1 LoRA sits at just over 32GB of mem use, in a way that prevents a 32G V100 from actually working
- Adafactor works, reducing VRAM to ~24G or further with sub-1024x1024 training
- Quantising the model isn't a bad thing
- It allows you to push higher batch sizes and possibly obtain a better result
- It unlocks the non-bf16 optimisers for use, such as Prodigy, Adafactor, Dadaptation, AdamW, and AdamW8Bit
- As usual, fp8 quantisation runs more slowly than int8 and might have a worse result due to the use of `e4m3fn` in Quanto
- fp16 training similarly is bad for Flux; this model wants the range of bf16
- `e5m2`-level precision is better at fp8, but we haven't looked into how to enable it yet. Sorry, H100 owners. We weep for you.
- Larger rank models might be undesirable on a 12B model due to the general training dynamics of large models.
- Try a smaller network first (rank-1, rank-4) and work your way up - they'll train faster, and might do everything you need.
- Square grid artifacts may begin appearing in the samples when you do these things (among others):
- Overtrain with low quality data
- Use too high of a learning rate
- Select a bad optimiser
- Overtraining (in general) a low-capacity network with too many images
- Undertraining (also) a high-capacity network with too few images
- Using weird aspect ratios or training data sizes
- Training for too long on square crops probably won't damage this model. Go nuts, it's great and reliable.