How to setup parquet strategy dataloader configs with embeds? #886

shu-boom · 2024-08-27T10:13:44Z


I am trying to run the photo-concept-bucket example to train SD3 with parquet files. I cloned the repository into my datasets folder, and the parquet file is 150 MB. The dataset directory is: /Users/boom/SimpleTuner/datasets/photo-concept-bucket/

System: M2 Pro, 16 GB, macOS

The config is set to the following:

export MODEL_TYPE='lora'
export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export STABLE_DIFFUSION_3=true
export PIXART_SIGMA=false
export STABLE_DIFFUSION_LEGACY=false
export KOLORS=false
export FLUX=false
export FLUX_GUIDANCE_VALUE=1.0
export FLUX_LORA_TARGET=all
export CONTROLNET=false
export USE_DORA=false
export RESUME_CHECKPOINT="latest"
export CHECKPOINTING_STEPS=150
export CHECKPOINTING_LIMIT=2
export LEARNING_RATE=8e-7 #@param {type:"number"}
export DEBUG_EXTRA_ARGS=""
export TRACKER_PROJECT_NAME="${MODEL_TYPE}-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"
export MAX_NUM_STEPS=30000
export DATALOADER_CONFIG="/Users/boom/SimpleTuner/multidatabackend.json"
export OUTPUT_DIR="/Users/boom/SimpleTuner"
export PUSH_TO_HUB="false"
export PUSH_CHECKPOINTS="true"
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME
export RESOLUTION=1
export RESOLUTION_TYPE="area"
export MINIMUM_RESOLUTION=$RESOLUTION
export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=3.0
export VALIDATION_GUIDANCE_RESCALE=0.0
export VALIDATION_GUIDANCE_REAL=1.0
export VALIDATION_NO_CFG_UNTIL_TIMESTEP=2
export VALIDATION_STEPS=100
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION
export TRAIN_BATCH_SIZE=10
export VAE_BATCH_SIZE=4
export LR_SCHEDULE="polynomial"
export LR_WARMUP_STEPS=1000
export CAPTION_DROPOUT_PROBABILITY=0.1
export METADATA_UPDATE_INTERVAL=65
export MAX_WORKERS=32
export READ_BATCH_SIZE=25
export WRITE_BATCH_SIZE=64
export IMAGE_PROCESSING_BATCH_SIZE=32
export AWS_MAX_POOL_CONNECTIONS=128
export TORCH_NUM_THREADS=8
export DELETE_ERRORED_IMAGES=0
export DELETE_SMALL_IMAGES=0
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"
export MIN_SNR_GAMMA=5
export USE_XFORMERS=false
export USE_GRADIENT_CHECKPOINTING=true
export ALLOW_TF32=true
export OPTIMIZER="adamw_bf16"
export USE_EMA=false
export EMA_DECAY=0.999
export TRAINER_EXTRA_ARGS=""
export TRAINING_SEED=42
export MIXED_PRECISION=no
export PURE_BF16=true
export TRAINING_NUM_PROCESSES=1
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS=""
export TRAINING_DYNAMO_BACKEND='no'
export TOKENIZERS_PARALLELISM=false

Following the guide: https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#parquet-caption-strategy--json-lines-datasets

My dataloader config looks like:

  {
    "id": "photo-concept-bucket",
    "type": "local",
    "instance_data_dir": "/Users/boom/SimpleTuner/datasets/photo-concept-bucket/",
    "caption_strategy": "parquet",
    "metadata_backend": "parquet",
    "parquet": {
      "path": "/Users/boom/SimpleTuner/datasets/photo-concept-bucket/photo-concept-bucket.parquet",
      "filename_column": "id",
      "caption_column": "cogvlm_caption",
      "fallback_caption_column": "tags",
      "width_column": "width",
      "height_column": "height",
      "identifier_includes_extension": false
    },
    "resolution": 1.0,
    "minimum_image_size": 0.75,
    "maximum_image_size": 2.0,
    "target_downsample_size": 1.5,
    "prepend_instance_prompt": false,
    "instance_prompt": null,
    "only_instance_prompt": false,
    "disable": false,
    "cache_dir_vae": "/Users/boom/SimpleTuner",
    "probability": 1.0,
    "skip_file_discovery": "",
    "preserve_data_backend_cache": false,
    "vae_cache_clear_each_epoch": true,
    "repeats": 1,
    "crop": true,
    "crop_aspect": "closest",
    "crop_style": "random",
    "crop_aspect_buckets": [1.0, 0.75, 1.23],
    "resolution_type": "area"
  }

On running bash train.sh, I get:

2024-08-27 11:51:24,932 [ERROR] (main) 'str' object has no attribute 'get', traceback: Traceback (most recent call last):
  File "/Users/boom/SimpleTuner/train.py", line 449, in main
    configure_multi_databackend(
  File "/Users/boom/SimpleTuner/helpers/data_backend/factory.py", line 348, in configure_multi_databackend
    dataset_type = backend.get("dataset_type", None)
AttributeError: 'str' object has no attribute 'get'
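
This traceback is consistent with the loader iterating over the parsed JSON: when the top level is a single object rather than an array, iteration yields its string keys, and calling .get on a string fails. The following is a minimal sketch of that behavior, not SimpleTuner's actual code; `first_backend_error` and the inline config are hypothetical names used only for illustration.

```python
import json

# Hypothetical stand-in for the config file: a single JSON object, not an array.
config_text = '{"id": "photo-concept-bucket", "type": "local"}'
parsed = json.loads(config_text)


def first_backend_error(config):
    """Mimic a loop that expects a list of dicts and calls .get() on each entry.

    Returns the AttributeError message if one is raised, otherwise None.
    """
    try:
        for backend in config:
            backend.get("dataset_type", None)
    except AttributeError as exc:
        return str(exc)
    return None


# Iterating a dict yields its keys (strings), so .get() is missing.
error_for_object = first_backend_error(parsed)
# Iterating a list yields the dicts themselves, so .get() works.
error_for_list = first_backend_error([parsed])

print(error_for_object)
print(error_for_list)
```

If this assumption holds, wrapping the single object in an array is what clears this particular error.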

I looked through the existing discussions and found this question: #415

It specifies the dataloader config as an array. I am not sure what to put in the text-embeds entry, since the guide linked above doesn't show one in its parquet strategy example. I tried wrapping the above config in an array, and it gives:

2024-08-27 11:57:57,681 [ERROR] (main) Your dataloader config must contain at least one image dataset AND at least one text_embed dataset. See this link for more information about dataset_type:
https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#configuration-options, traceback: Traceback (most recent call last):
  File "/Users/boom/SimpleTuner/train.py", line 449, in main
    configure_multi_databackend(
  File "/Users/boom/SimpleTuner/helpers/data_backend/factory.py", line 442, in configure_multi_databackend
    raise ValueError(
ValueError: Your dataloader config must contain at least one image dataset AND at least one text_embed dataset. See this link for more information about dataset_type: https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#configuration-options
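
The requirement in this error can be checked before launching training. The sketch below counts entries by dataset_type in a parsed config array. It is a hypothetical helper, not part of SimpleTuner. It assumes that an entry without a dataset_type key counts as "image" (the log further down reports 'dataset_type': 'image' for such an entry) and that the text embed type is spelled "text_embeds", as in the linked discussion's example.

```python
import json


def count_dataset_types(config):
    """Count entries per dataset_type in a dataloader config array.

    Assumption: an entry without a dataset_type key is treated as "image".
    """
    counts = {}
    for entry in config:
        kind = entry.get("dataset_type", "image")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


def has_required_datasets(config):
    """True if there is at least one image entry and one text_embeds entry."""
    counts = count_dataset_types(config)
    return counts.get("image", 0) >= 1 and counts.get("text_embeds", 0) >= 1


# Abbreviated, hypothetical configs for illustration only.
image_only = json.loads('[{"id": "photo-concept-bucket", "type": "local"}]')
with_embeds = image_only + [
    {"id": "alt-embed-cache", "dataset_type": "text_embeds", "type": "local"}
]

print(has_required_datasets(image_only))
print(has_required_datasets(with_embeds))
```

To check a real file, load it with json.load and pass the result to has_required_datasets.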

So I added a text_embeds entry, following the example in that discussion. The config now looks like:

  [{
    "id": "photo-concept-bucket",
    "type": "local",
    "instance_data_dir": "/Users/boom/SimpleTuner/datasets/photo-concept-bucket/",
    "caption_strategy": "parquet",
    "metadata_backend": "parquet",
    "parquet": {
      "path": "/Users/boom/SimpleTuner/datasets/photo-concept-bucket/photo-concept-bucket.parquet",
      "filename_column": "id",
      "caption_column": "cogvlm_caption",
      "fallback_caption_column": "tags",
      "width_column": "width",
      "height_column": "height",
      "identifier_includes_extension": false
    },
    "resolution": 1.0,
    "minimum_image_size": 0.75,
    "maximum_image_size": 2.0,
    "target_downsample_size": 1.5,
    "prepend_instance_prompt": false,
    "instance_prompt": null,
    "only_instance_prompt": false,
    "disable": false,
    "cache_dir_vae": "/Users/boom/SimpleTuner",
    "probability": 1.0,
    "skip_file_discovery": "",
    "preserve_data_backend_cache": false,
    "vae_cache_clear_each_epoch": true,
    "repeats": 1,
    "crop": true,
    "crop_aspect": "closest",
    "crop_style": "random",
    "crop_aspect_buckets": [1.0, 0.75, 1.23],
    "resolution_type": "area"
  },
  {
    "id": "alt-embed-cache",
    "dataset_type": "text_embeds",
    "default": false,
    "type": "local",
    "cache_dir": "/Users/boom/SimpleTuner"
  }
  ]

Note: I do not know what to put in cache_dir, so I just used my OUTPUT_DIR. It would be great to know how to configure text_embeds and what their purpose is. Running the above config gives me:

(Rank: 0) | Bucket | Image Count (per-GPU)

2024-08-27 12:09:56,318 [ERROR] (main) No images were discovered by the bucket manager in the dataset: photo-concept-bucket., traceback: Traceback (most recent call last):
  File "/Users/boom/SimpleTuner/train.py", line 449, in main
    configure_multi_databackend(
  File "/Users/boom/SimpleTuner/helpers/data_backend/factory.py", line 812, in configure_multi_databackend
    raise Exception(
Exception: No images were discovered by the bucket manager in the dataset: photo-concept-bucket.

Some extra logs generated by the above run:

2024-08-27 12:09:43,977 [INFO] (DataBackendFactory) Configuring text embed backend: alt-embed-cache
2024-08-27 12:09:43,995 [INFO] (TextEmbeddingCache) (Rank: 0) (id=alt-embed-cache) Listing all text embed cache entries
2024-08-27 12:09:46,263 [WARNING] (DataBackendFactory) No default text embed was defined, using alt-embed-cache as the default. See this page for information about the default text embed backend: https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#configuration-options
2024-08-27 12:09:46,263 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-08-27 12:09:46,263 [INFO] (DataBackendFactory) Configuring data backend: photo-concept-bucket
2024-08-27 12:09:46,264 [INFO] (DataBackendFactory) (id=photo-concept-bucket) Loading bucket manager.
2024-08-27 12:09:56,297 [INFO] (DataBackendFactory) (id=photo-concept-bucket) Refreshing aspect buckets on main process.
2024-08-27 12:09:56,298 [INFO] (ParquetMetadataBackend) Discovering new files...
2024-08-27 12:09:56,315 [WARNING] (DataBackendFactory) Key disable_validation not found in the current backend config, using the existing value 'False'.
2024-08-27 12:09:56,315 [INFO] (DataBackendFactory) Configured backend: {'id': 'photo-concept-bucket', 'config': {'vae_cache_clear_each_epoch': True, 'probability': 1.0, 'repeats': 1, 'crop': True, 'crop_aspect': 'closest', 'crop_aspect_buckets': [1.0, 0.75, 1.23], 'crop_style': 'random', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'parquet': {'path': '/Users/boom/SimpleTuner/datasets/photo-concept-bucket/photo-concept-bucket.parquet', 'filename_column': 'id', 'caption_column': 'cogvlm_caption', 'fallback_caption_column': 'tags', 'width_column': 'width', 'height_column': 'height', 'identifier_includes_extension': False}, 'caption_strategy': 'parquet', 'instance_data_dir': '/Users/boom/SimpleTuner/datasets/photo-concept-bucket/', 'maximum_image_size': 2.0, 'target_downsample_size': 1.5, 'config_version': 1}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x13f2fc670>, 'instance_data_dir': '/Users/boom/SimpleTuner/datasets/photo-concept-bucket', 'metadata_backend': <helpers.metadata.backends.parquet.ParquetMetadataBackend object at 0x13f2fc640>}
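
One possible cause of the "No images were discovered" error, which these logs do not confirm, is that instance_data_dir holds only the parquet metadata and no image files named after the values in the id column. The sketch below checks that independently of SimpleTuner. `split_ids_by_presence` is a hypothetical helper, and matching an id against a file name without its extension is an assumption based on "identifier_includes_extension": false, not the library's documented matching rule.

```python
from pathlib import Path
import tempfile


def split_ids_by_presence(ids, data_dir):
    """Return (found, missing) ids by comparing each id to file stems in data_dir.

    A file stem is the file name without its extension, e.g. "101" for "101.jpg".
    """
    stems = {p.stem for p in Path(data_dir).iterdir() if p.is_file()}
    found = [i for i in ids if str(i) in stems]
    missing = [i for i in ids if str(i) not in stems]
    return found, missing


# Self-contained demo with a temporary directory standing in for the dataset.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "101.jpg").touch()
    (Path(tmp) / "photo-concept-bucket.parquet").touch()
    found, missing = split_ids_by_presence([101, 102], tmp)

print(found, missing)
```

Against the real dataset, the ids could be read from the configured filename column, for example with pandas.read_parquet(path)["id"], assuming pandas and a parquet engine are installed. If every id lands in missing, the directory contains no matching images for the bucket manager to find.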
