Merge pull request #323 from bghira/main
* apple mps: various bugfixes for LoRA training, SDXL, SD 2.x
* sd2x: various bugfixes for EMA, validations noise scheduler config
* add --disable_multiprocessing for possible performance improvements on certain systems
* metadata: abstract logic into pluggable backends
* metadata: support for parquet backend, pull data directly from Pandas dataframes
* vaecache: improve and fix logic for scan_for_errors=true
* aspect bucketing: make it more robust for extremely diverse datasets
bghira authored Mar 22, 2024
2 parents 396ff92 + 8d4de2e commit 5f3dda7
Showing 29 changed files with 5,462 additions and 1,372 deletions.
25 changes: 24 additions & 1 deletion INSTALL.md
@@ -6,10 +6,31 @@
git clone https://github.com/bghira/SimpleTuner --branch release
python -m venv .venv
pip3 install -U poetry pip
```

### macOS (Apple Silicon)

Training a model on Apple hardware may be disappointing due to the lack of memory-efficient attention - everything requires more memory here.

You will require a minimum of 24G of total memory for an SDXL LoRA at a batch size of 1.

To install the Apple-specific requirements:

```bash
poetry install --no-root -C install/apple
```

### Linux

The first command you'll run will install most of the dependencies:

```bash
poetry install --no-root
```

#### Optional, possibly not required steps

You may need to install some Linux-specific dependencies (Ubuntu is used here):

> ⚠️ This command can break certain container deployments. If it does, you'll have to redeploy the container.
@@ -36,6 +57,8 @@ If the egg install for Xformers does not work, try including `xformers` on the following line:
pip3 install --pre xformers torch torchvision torchaudio torchtriton --extra-index-url https://download.pytorch.org/whl/nightly/cu118 --force
```

### All platforms

2. For SD2.1, copy `sd21-env.sh.example` to `env.sh` - be sure to fill out the details. Try to change as little as possible.

For SDXL, copy `sdxl-env.sh.example` to `sdxl-env.sh` and then fill in the details.
14 changes: 11 additions & 3 deletions README.md
@@ -4,12 +4,13 @@
**SimpleTuner** is a repository dedicated to a set of experimental scripts designed for training optimization. The project is geared towards simplicity, with a focus on making the code easy to read and understand. This codebase serves as a shared academic exercise, and contributions to its improvement are welcome.

* Multi-GPU training
* Aspect bucketing "just works"; fill a folder of images and let it rip
* Multiple datasets can be used in a single training session, each with a different base resolution.
* VRAM-saving techniques, such as pre-computing VAE and text encoder outputs
* Full featured fine-tuning support
* Bias training (BitFit)
* LoRA training support
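Aspect bucketing boils down to grouping each image with the bucket whose aspect ratio is closest to its own, so batches never mix shapes. A minimal sketch (illustrative only - the bucket list and `assign_bucket` helper are assumptions, not SimpleTuner's actual code):

```python
def assign_bucket(width, height, buckets):
    """Return the (w, h) bucket whose aspect ratio best matches the image."""
    aspect = width / height
    # Pick the bucket minimizing the aspect-ratio distance to this image.
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))

# A few hypothetical SDXL-style buckets: square, landscape, portrait.
buckets = [(1024, 1024), (1152, 896), (896, 1152)]
assign_bucket(1920, 1080, buckets)  # landscape image -> (1152, 896)
```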

## Table of Contents

@@ -60,6 +61,13 @@ Stable Diffusion 2.1 is known for difficulty during fine-tuning, but this doesn't

EMA (exponential moving average) weights are a memory-heavy affair, but provide fantastic results at the end of training. Without them, training can still be done, but more care must be taken not to change the model too drastically, which leads to "catastrophic forgetting".
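The update rule behind EMA weights is simple; a minimal sketch, assuming a dict of scalar weights and a typical decay value (not the repository's actual implementation):

```python
def ema_update(shadow, params, decay=0.999):
    """Blend the shadow (EMA) copy toward the live parameters.

    A decay close to 1.0 means the shadow moves very little each step,
    smoothing out training noise; the memory cost is a full second copy
    of the model's weights.
    """
    return {
        name: decay * shadow[name] + (1.0 - decay) * params[name]
        for name in shadow
    }

shadow = {"w": 1.0}
for _ in range(3):
    # The live weight has jumped to 2.0; the shadow drifts toward it slowly.
    shadow = ema_update(shadow, {"w": 2.0})
```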

### GPU vendors

* NVIDIA - pretty much anything 3090 and up is a safe bet. YMMV.
* AMD - no reports yet; compatibility is unknown.
* Apple - LoRA and full u-net tuning are tested to work on an M3 Max with 128G memory, taking about **12G** of "Wired" memory and **4G** of system memory for SDXL.
* You likely need a 24G or greater machine for machine learning with M-series hardware due to the lack of memory-efficient attention.

### SDXL, 1024px

* A100-80G (EMA, large batches, LoRA @ insane batch sizes)
63 changes: 32 additions & 31 deletions helpers/arguments.py
@@ -298,6 +298,26 @@ def parse_args(input_args=None):
" This mostly applies to S3, but some shared server filesystems may benefit as well, eg. Ceph. Default: 64."
),
)
parser.add_argument(
"--disable_multiprocessing",
default=False,
action="store_true",
help=(
"If set, will use threads instead of processes during metadata caching operations."
" This is set implicitly for Apple systems, as Darwin behaves oddly with multiprocessing."
" For some systems, multiprocessing may be slower than threading, so this option is provided."
),
)
parser.add_argument(
"--aspect_bucket_worker_count",
type=int,
default=12,
help=(
"The number of workers to use for aspect bucketing. This is a CPU-bound task, so the number of workers"
" should be set to the number of CPU threads available. If you use an I/O bound backend, an even higher"
" value may make sense. Default: 12."
),
)
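The intent of `--disable_multiprocessing` and `--aspect_bucket_worker_count` can be sketched as follows (a simplified illustration; `build_metadata_executor` is a hypothetical helper, not SimpleTuner's actual caching code):

```python
import platform
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def build_metadata_executor(disable_multiprocessing=False, worker_count=12):
    """Choose a worker pool for metadata caching.

    Threads are forced on Darwin, where multiprocessing behaves oddly,
    and can also win on systems where process startup and IPC overhead
    outweigh the extra CPU parallelism.
    """
    use_threads = disable_multiprocessing or platform.system() == "Darwin"
    pool_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    return pool_cls(max_workers=worker_count)

# Example: cache some metadata (here, just string lengths) with threads.
with build_metadata_executor(disable_multiprocessing=True, worker_count=4) as pool:
    results = list(pool.map(len, ["a.png", "bb.png", "ccc.png"]))
```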
parser.add_argument(
"--cache_dir",
type=str,
@@ -421,37 +441,6 @@
" would result in a 4 megapixel image being resized to 2 megapixel before cropping to 1 megapixel."
),
)
parser.add_argument(
"--crop",
action="store_true",
help=(
"Whether to crop the input images to the resolution. If not set, the images will be downsampled"
" instead. When cropping is enabled, the images will not be resized before cropping. If training SDXL,"
" the VAE cache and aspect bucket cache will need to be (re)built so they include crop coordinates."
),
)
parser.add_argument(
"--crop_style",
default="random",
choices=["center", "centre", "corner", "random"],
help=(
"When --crop is provided, a crop style may be defined that designates which part of an image to crop to."
" The old behaviour was to crop to the lower right corner, but this isn't always ideal for training."
" The default is 'random', which will locate a random segment of the image matching the given resolution."
),
)
parser.add_argument(
"--crop_aspect",
default="square",
choices=["square", "preserve"],
help=(
"When --crop is supplied, the default behaviour is to crop to square images, which greatly simplifies aspect bucketing."
" However, --crop_aspect may be set to 'preserve', which will crop based on the --resolution_type value."
" If --resolution_type=area, the crop will be equal to the target pixel area. If --resolution_type=pixel,"
" the crop will have the smaller edge equal to the value of --resolution."
),
)
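How the three crop styles could map to crop coordinates, as a sketch (`crop_origin` is a hypothetical helper for illustration, not the repository's implementation):

```python
import random

def crop_origin(image_w, image_h, crop_w, crop_h, crop_style="random"):
    """Return the top-left (x, y) of a crop window inside the image."""
    max_x, max_y = image_w - crop_w, image_h - crop_h
    if crop_style in ("center", "centre"):
        return max_x // 2, max_y // 2
    if crop_style == "corner":
        # The old behaviour: crop to the lower-right corner.
        return max_x, max_y
    # 'random': any position that keeps the window fully inside the image.
    return random.randint(0, max_x), random.randint(0, max_y)
```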
parser.add_argument(
"--train_text_encoder",
action="store_true",
@@ -974,6 +963,18 @@
" This can be helpful when fine-tuning Stable Diffusion 2.1 on a new style."
),
)
parser.add_argument(
"--freeze_unet_strategy",
type=str,
choices=["none", "bitfit"],
default="none",
help=(
"When freezing the UNet, we can use the 'none' or 'bitfit' strategy."
" The 'bitfit' strategy will freeze all weights and leave the biases thawed."
" The default strategy is to leave the full u-net thawed."
" Freezing the weights can improve convergence during fine-tuning."
),
)
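The 'bitfit' strategy amounts to flipping `requires_grad` off for everything except bias parameters. A framework-agnostic sketch (the `Param` class stands in for e.g. `torch.nn.Parameter`; this is not the repository's actual code):

```python
class Param:
    """Stand-in for a framework parameter object (name + trainability flag)."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True

def apply_bitfit(params):
    """BitFit: freeze all weights, leaving only bias terms trainable."""
    for p in params:
        p.requires_grad = p.name.endswith("bias")
    return params

# In PyTorch this loop would run over model.named_parameters() instead.
params = [Param("down.0.weight"), Param("down.0.bias"), Param("mid.attn.weight")]
apply_bitfit(params)
```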
parser.add_argument(
"--print_filenames",
action="store_true",
