Skip to content

Latest commit

 

History

History
257 lines (181 loc) · 8.06 KB

File metadata and controls

257 lines (181 loc) · 8.06 KB

Stable Diffusion 3

In this example, we'll be training a Stable Diffusion 3 model using the SimpleTuner toolkit and will be using the lora model type.

Prerequisites

Make sure that you have python installed; SimpleTuner does well with 3.10 or 3.11. Python 3.12 should not be used.

You can check this by running:

python --version

If you don't have python 3.11 installed on Ubuntu, you can try the following:

apt -y install python3.11 python3.11-venv

Container image dependencies

For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.4 image:

apt -y install nvidia-cuda-toolkit libgl1-mesa-glx

If libgl1-mesa-glx is not found, you might need to use libgl1-mesa-dri instead. Your mileage may vary.

Installation

Clone the SimpleTuner repository and set up the python venv:

git clone --branch=release https://github.com/bghira/SimpleTuner.git

cd SimpleTuner

python -m venv .venv

source .venv/bin/activate

pip install -U poetry pip

Depending on your system, you will run one of 3 commands:

# MacOS
poetry install -C install/apple

# Linux
poetry install

# Linux with ROCM
poetry install -C install/rocm

AMD ROCm follow-up steps

The following must be executed for an AMD MI300X to be useable:

apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd

Removing DeepSpeed & Bits n Bytes

These two dependencies cause numerous issues for container hosts such as RunPod and Vast.

To remove them after poetry has installed them, run the following command in the same terminal:

pip uninstall -y deepspeed bitsandbytes

Setting up the environment

To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

Configuration file

An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration. It contains some safety features that help avoid common pitfalls.

Note: This doesn't configure your dataloader. You will still have to do that manually, later.

To run it:

python configure.py

If you prefer to manually configure:

Copy config/config.json.example to config/config.json:

cp config/config.json.example config/config.json

There, you will need to modify the following variables:

{
  "model_type": "lora",
  "model_family": "sd3",
  "pretrained_model_name_or_path": "stabilityai/stable-diffusion-3-medium-diffusers",
  "output_dir": "/home/user/outputs/models",
  "validation_resolution": "1024x1024,1280x768",
  "validation_guidance": 3.0,
  "validation_prompt": "your main test prompt here",
  "user_prompt_library": "config/user_prompt_library.json"
}
  • pretrained_model_name_or_path - Set this to stabilityai/stable-diffusion-3-medium-diffusers. Note that you will need to log in to Huggingface and be granted access to download this model. We will go over logging in to Huggingface later in this tutorial.
  • MODEL_TYPE - Set this to lora.
  • MODEL_FAMILY - Set this to sd3.
  • OUTPUT_DIR - Set this to the directory where you want to store your checkpoints and validation images. It's recommended to use a full path here.
  • VALIDATION_RESOLUTION - As SD3 is a 1024px model, you can set this to 1024x1024.
    • Additionally, SD3 was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them: 1024x1024,1280x768
  • VALIDATION_GUIDANCE - SD3 benefits from a very-low value. Set this to 3.0.

There are a few more if using a Mac M-series machine:

  • mixed_precision should be set to no.

Quantised model training

Tested on Apple and NVIDIA systems, Hugging Face Optimum-Quanto can be used to reduce the precision and VRAM requirements well below the requirements of base SDXL training.

Inside your SimpleTuner venv:

pip install optimum-quanto

⚠️ If using a JSON config file, be sure to use this format in config.json instead of config.env:

{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "no_change",
  "text_encoder_2_precision": "no_change",
  "text_encoder_3_precision": "no_change",
  "optimizer": "adamw_bf16"
}

For config.env users (deprecated):

# choices: int8-quanto, int4-quanto, int2-quanto, fp8-quanto
# int8-quanto was tested with a single subject dreambooth LoRA.
# fp8-quanto does not work on Apple systems. you must use int levels.
# int2-quanto is pretty extreme and gets the whole rank-1 LoRA down to about 13.9GB VRAM.
# may the gods have mercy on your soul, should you push things Too Far.
export TRAINER_EXTRA_ARGS="--base_model_precision=int8-quanto"

# Maybe you want the text encoders to remain full precision so your text embeds are cake.
# We unload the text encoders before training, so, that's not an issue during training time - only during pre-caching.
# Alternatively, you can go ham on quantisation here and run them in int4 or int8 mode, because no one can stop you.
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change"

# When you're quantising the model, --base_model_default_dtype is set to bf16 by default. This setup requires adamw_bf16, but saves the most memory.
# adamw_bf16 only supports bf16 training, but any other optimiser will support both bf16 or fp32 training precision.
export OPTIMIZER="adamw_bf16"

Dataset considerations

It's crucial to have a substantial dataset to train your model on. There are limitations on the dataset size, and you will need to ensure that your dataset is large enough to train your model effectively. Note that the bare minimum dataset size is TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS as well as more than VAE_BATCH_SIZE. The dataset will not be useable if it is too small.

Depending on the dataset you have, you will need to set up your dataset directory and dataloader configuration file differently. In this example, we will be using pseudo-camera-10k as the dataset.

In your /home/user/simpletuner/config directory, create a multidatabackend.json:

[
  {
    "id": "pseudo-camera-10k-sd3",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1.0,
    "minimum_image_size": 0,
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "resolution_type": "area",
    "cache_dir_vae": "cache/vae/sd3/pseudo-camera-10k",
    "instance_data_dir": "/home/user/simpletuner/datasets/pseudo-camera-10k",
    "disabled": false,
    "skip_file_discovery": "",
    "caption_strategy": "filename",
    "metadata_backend": "discovery"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/sd3/pseudo-camera-10k",
    "disabled": false,
    "write_batch_size": 128
  }
]

Then, create a datasets directory:

mkdir -p datasets
pushd datasets
    huggingface-cli download --repo-type=dataset bghira/pseudo-camera-10k --local-dir=pseudo-camera-10k
popd

This will download about 10k photograph samples to your datasets/pseudo-camera-10k directory, which will be automatically created for you.

Login to WandB and Huggingface Hub

You'll want to login to WandB and HF Hub before beginning training, especially if you're using push_to_hub: true and --report_to=wandb.

If you're going to be pushing items to a Git LFS repository manually, you should also run git config --global credential.helper store

Run the following commands:

wandb login

and

huggingface-cli login

Follow the instructions to log in to both services.

Executing the training run

From the SimpleTuner directory, one simply has to run:

bash train.sh

This will begin the text embed and VAE output caching to disk.

For more information, see the dataloader and tutorial documents.