Merge pull request #323 from bghira/main
* apple mps: various bugfixes for LoRA training, SDXL, SD 2.x
* sd2x: various bugfixes for EMA, validations noise scheduler config
* add --disable_multiprocessing for possible performance improvements on certain systems
* metadata: abstract logic into pluggable backends
* metadata: support for parquet backend, pull data directly from Pandas dataframes
* vaecache: improve and fix logic for scan_for_errors=true
* aspect bucketing: make it more robust for extremely diverse datasets
bghira authored Mar 22, 2024
2 parents 396ff92 + 8d4de2e commit 5f3dda7
Showing 29 changed files with 5,462 additions and 1,372 deletions.
25 changes: 24 additions & 1 deletion INSTALL.md
@@ -6,10 +6,31 @@
git clone https://github.com/bghira/SimpleTuner --branch release
python -m venv .venv
pip3 install -U poetry pip
```

### macOS (Apple Silicon)

Training a model on Apple hardware may be disappointing due to the lack of memory-efficient attention - everything requires more memory here.

You will require a minimum of 24G of total memory for an SDXL LoRA at a batch size of 1.

To install the Apple-specific requirements:

```bash
poetry install --no-root -C install/apple
```

### Linux

The first command you'll run will install most of the dependencies:

```bash
poetry install --no-root
```

#### Optional, possibly not required steps

You may need to install some Linux-specific dependencies (Ubuntu is used here):

> ⚠️ This command can break certain container deployments. If it does, you'll have to redeploy the container.
@@ -36,6 +57,8 @@ If the egg install for Xformers does not work, try including `xformers` on the following line:
pip3 install --pre xformers torch torchvision torchaudio torchtriton --extra-index-url https://download.pytorch.org/whl/nightly/cu118 --force
```

### All platforms

2. For SD2.1, copy `sd21-env.sh.example` to `env.sh` - be sure to fill out the details. Try to change as little as possible.

For SDXL, copy `sdxl-env.sh.example` to `sdxl-env.sh` and then fill in the details.
14 changes: 11 additions & 3 deletions README.md
@@ -4,12 +4,13 @@
**SimpleTuner** is a repository dedicated to a set of experimental scripts designed for training optimization. The project is geared towards simplicity, with a focus on making the code easy to read and understand. This codebase serves as a shared academic exercise, and contributions to its improvement are welcome.

* Multi-GPU training
* Aspect bucketing "just works"; fill a folder of images and let it rip
* Multiple datasets can be used in a single training session, each with a different base resolution.
* VRAM-saving techniques, such as pre-computing VAE and text encoder outputs
* Full featured fine-tuning support
* Bias training (BitFit)
* LoRA training support
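Aspect bucketing boils down to grouping each image with the bucket whose aspect ratio is closest to its own, so batches never mix shapes. A minimal sketch (illustrative only - the bucket list and `assign_bucket` helper are assumptions, not SimpleTuner's actual code):

```python
def assign_bucket(width, height, buckets):
    """Return the (w, h) bucket whose aspect ratio best matches the image."""
    aspect = width / height
    # Pick the bucket minimizing the aspect-ratio distance to this image.
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))

# A few hypothetical SDXL-style buckets: square, landscape, portrait.
buckets = [(1024, 1024), (1152, 896), (896, 1152)]
assign_bucket(1920, 1080, buckets)  # landscape image -> (1152, 896)
```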

## Table of Contents

@@ -60,6 +61,13 @@ Stable Diffusion 2.1 is known for difficulty during fine-tuning, but this doesn't

EMA (exponential moving average) weights are a memory-heavy affair, but provide fantastic results at the end of training. Without them, training can still be done, but more care must be taken not to change the model too drastically, which leads to "catastrophic forgetting".
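The update rule behind EMA weights is simple; a minimal sketch, assuming a dict of scalar weights and a typical decay value (not the repository's actual implementation):

```python
def ema_update(shadow, params, decay=0.999):
    """Blend the shadow (EMA) copy toward the live parameters.

    A decay close to 1.0 means the shadow moves very little each step,
    smoothing out training noise; the memory cost is a full second copy
    of the model's weights.
    """
    return {
        name: decay * shadow[name] + (1.0 - decay) * params[name]
        for name in shadow
    }

shadow = {"w": 1.0}
for _ in range(3):
    # The live weight has jumped to 2.0; the shadow drifts toward it slowly.
    shadow = ema_update(shadow, {"w": 2.0})
```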

### GPU vendors

* NVIDIA - pretty much anything 3090 and up is a safe bet. YMMV.
* AMD - no reports yet; compatibility is unknown.
* Apple - LoRA and full u-net tuning are tested to work on an M3 Max with 128G memory, taking about **12G** of "Wired" memory and **4G** of system memory for SDXL.
* You likely need a 24G or greater machine for machine learning with M-series hardware due to the lack of memory-efficient attention.

### SDXL, 1024px

* A100-80G (EMA, large batches, LoRA @ insane batch sizes)
63 changes: 32 additions & 31 deletions helpers/arguments.py
@@ -298,6 +298,26 @@ def parse_args(input_args=None):
" This mostly applies to S3, but some shared server filesystems may benefit as well, eg. Ceph. Default: 64."
),
)
parser.add_argument(
"--disable_multiprocessing",
default=False,
action="store_true",
help=(
"If set, will use threads instead of processes during metadata caching operations."
" This is set implicitly for Apple systems, as Darwin behaves oddly with multiprocessing."
" For some systems, multiprocessing may be slower than threading, so this option is provided."
),
)
parser.add_argument(
"--aspect_bucket_worker_count",
type=int,
default=12,
help=(
"The number of workers to use for aspect bucketing. This is a CPU-bound task, so the number of workers"
" should be set to the number of CPU threads available. If you use an I/O bound backend, an even higher"
" value may make sense. Default: 12."
),
)
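The intent of `--disable_multiprocessing` and `--aspect_bucket_worker_count` can be sketched as follows (a simplified illustration; `build_metadata_executor` is a hypothetical helper, not SimpleTuner's actual caching code):

```python
import platform
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def build_metadata_executor(disable_multiprocessing=False, worker_count=12):
    """Choose a worker pool for metadata caching.

    Threads are forced on Darwin, where multiprocessing behaves oddly,
    and can also win on systems where process startup and IPC overhead
    outweigh the extra CPU parallelism.
    """
    use_threads = disable_multiprocessing or platform.system() == "Darwin"
    pool_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    return pool_cls(max_workers=worker_count)

# Example: cache some metadata (here, just string lengths) with threads.
with build_metadata_executor(disable_multiprocessing=True, worker_count=4) as pool:
    results = list(pool.map(len, ["a.png", "bb.png", "ccc.png"]))
```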
parser.add_argument(
"--cache_dir",
type=str,
@@ -421,37 +441,6 @@
" would result in a 4 megapixel image being resized to 2 megapixel before cropping to 1 megapixel."
),
)
parser.add_argument(
"--crop",
action="store_true",
help=(
"Whether to crop the input images to the resolution. If not set, the images will be downsampled"
" instead. When cropping is enabled, the images will not be resized before cropping. If training SDXL,"
" the VAE cache and aspect bucket cache will need to be (re)built so they include crop coordinates."
),
)
parser.add_argument(
"--crop_style",
default="random",
choices=["center", "centre", "corner", "random"],
help=(
"When --crop is provided, a crop style may be defined that designates which part of an image to crop to."
" The old behaviour was to crop to the lower right corner, but this isn't always ideal for training."
" The default is 'random', which will locate a random segment of the image matching the given resolution."
),
)
parser.add_argument(
"--crop_aspect",
default="square",
choices=["square", "preserve"],
help=(
"When --crop is supplied, the default behaviour is to crop to square images, which greatly simplifies aspect bucketing."
" However, --crop_aspect may be set to 'preserve', which will crop based on the --resolution_type value."
" If --resolution_type=area, the crop will be equal to the target pixel area. If --resolution_type=pixel,"
" the crop will have the smaller edge equal to the value of --resolution."
),
)
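How the three crop styles could map to crop coordinates, as a sketch (`crop_origin` is a hypothetical helper for illustration, not the repository's implementation):

```python
import random

def crop_origin(image_w, image_h, crop_w, crop_h, crop_style="random"):
    """Return the top-left (x, y) of a crop window inside the image."""
    max_x, max_y = image_w - crop_w, image_h - crop_h
    if crop_style in ("center", "centre"):
        return max_x // 2, max_y // 2
    if crop_style == "corner":
        # The old behaviour: crop to the lower-right corner.
        return max_x, max_y
    # 'random': any position that keeps the window fully inside the image.
    return random.randint(0, max_x), random.randint(0, max_y)
```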
parser.add_argument(
"--train_text_encoder",
action="store_true",
@@ -974,6 +963,18 @@
" This can be helpful when fine-tuning Stable Diffusion 2.1 on a new style."
),
)
parser.add_argument(
"--freeze_unet_strategy",
type=str,
choices=["none", "bitfit"],
default="none",
help=(
"When freezing the UNet, we can use the 'none' or 'bitfit' strategy."
" The 'bitfit' strategy will freeze all weights and leave the biases thawed."
" The default strategy is to leave the full u-net thawed."
" Freezing the weights can improve convergence during fine-tuning."
),
)
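The 'bitfit' strategy amounts to flipping `requires_grad` off for everything except bias parameters. A framework-agnostic sketch (the `Param` class stands in for e.g. `torch.nn.Parameter`; this is not the repository's actual code):

```python
class Param:
    """Stand-in for a framework parameter object (name + trainability flag)."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True

def apply_bitfit(params):
    """BitFit: freeze all weights, leaving only bias terms trainable."""
    for p in params:
        p.requires_grad = p.name.endswith("bias")
    return params

# In PyTorch this loop would run over model.named_parameters() instead.
params = [Param("down.0.weight"), Param("down.0.bias"), Param("mid.attn.weight")]
apply_bitfit(params)
```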
parser.add_argument(
"--print_filenames",
action="store_true",
