
Offload

Offload is a method of moving a model, or parts of a model, between GPU memory (VRAM) and system memory (RAM) in order to reduce the model's memory footprint and allow it to run on GPUs with lower VRAM.

Automatic offload

Tip

Automatic offload is set via Settings -> Diffusers -> Model offload mode

Balanced

Balanced offload works differently than all other offloading methods: it performs offloading only when VRAM usage exceeds a user-specified threshold.

  • Recommended for compatible high-VRAM GPUs
  • Faster than other offload methods, but requires a compatible platform and sufficient VRAM
  • Moves parts of the model in and out of VRAM depending on the user-specified threshold,
    allowing control over how much VRAM is used
  • The default memory threshold is 75% of the available GPU memory
    Configure the threshold in Settings -> Diffusers -> Max GPU memory for balanced offload mode in GB

Warning

Not compatible with Optimum.Quanto qint quantization
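
To illustrate the idea (this is a conceptual sketch, not SD.Next's actual implementation), threshold-based offload can be expressed in a few lines of PyTorch; the `maybe_offload` helper and the 0.75 default are hypothetical:

```python
import torch

def maybe_offload(module: torch.nn.Module, threshold: float = 0.75) -> None:
    # Hypothetical sketch of balanced offload: move a module to system RAM
    # only when current VRAM usage exceeds the user-specified threshold.
    free, total = torch.cuda.mem_get_info()
    used_fraction = 1.0 - free / total
    if used_fraction > threshold:
        module.to("cpu")  # parked in RAM until it is needed on the GPU again
```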

Sequential

Works on a layer-by-layer basis within each model component that is marked as offload-compatible

  • Recommended for low-VRAM GPUs
  • Much slower, but allows running large models such as FLUX even on GPUs with 6GB of VRAM

Warning

Not compatible with Quanto qint or BitsAndBytes nf4 quantization

Note

Use of --lowvram automatically triggers sequential offload
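
SD.Next enables this mode through its settings, but the underlying behavior matches the diffusers sequential CPU offload API; a minimal standalone sketch, with the checkpoint and prompt as placeholders:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Layer-by-layer offload: each layer is moved onto the GPU only while it
# executes, minimizing VRAM usage at a significant speed cost.
pipe.enable_sequential_cpu_offload()
image = pipe("a mountain landscape at dusk").images[0]
```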

Model

Works at the model-component level by offloading components that are marked as offload-compatible,
for example the VAE, text encoder, etc.

  • Recommended for medium-VRAM GPUs when balanced offload is not compatible
  • Higher compatibility than either balanced or sequential offload, but smaller memory savings

Limitations: N/A

Note

Use of --medvram automatically triggers model offload
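
In diffusers terms this corresponds to model (component-level) CPU offload; a minimal sketch, assuming an SDXL checkpoint as an example:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
# Whole components (UNet, VAE, text encoders) move between CPU and GPU
# as units: coarser than sequential offload, but much faster.
pipe.enable_model_cpu_offload()
image = pipe("a mountain landscape at dusk").images[0]
```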

Manual Offload

In addition to the above-mentioned automatic offload methods, SD.Next includes manual offload methods which are less granular and only supported for specific models (see the sketch after this list).

  • Move base model to CPU when using refiner
  • Move base model to CPU when using VAE
  • Move refiner model to CPU when not in use
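
Conceptually, each of these options amounts to moving one model off the GPU while another runs; a hypothetical sketch (the `swap_models` helper is illustrative, not SD.Next's API):

```python
import torch

def swap_models(active: torch.nn.Module, inactive: torch.nn.Module) -> None:
    # Hypothetical helper mirroring the manual offload toggles above:
    # park the model that is not in use in system RAM, then load the
    # active one onto the GPU.
    inactive.to("cpu")        # e.g. the base model while the refiner runs
    torch.cuda.empty_cache()  # return freed blocks to the CUDA allocator
    active.to("cuda")
```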

Performance Notes

  • Tested using SDXL with 2 large LoRA models
  • Sequential offload is the default for GPUs with 4GB of VRAM or less
  • Balanced offload is the default for GPUs with more than 4GB of VRAM
    Balanced offload is slower than no offload, but allows using large models such as SD35 and FLUX.1 out-of-the-box
  • Balanced offload was left at its default settings
  • LoRA overhead is measured in seconds for the first / subsequent iterations
  • LoRA mode=backup can use up to 2x system memory
    Using backup can be prohibitive with large models such as SD35 or FLUX.1
| Offload mode | LoRA type | LoRA mode | LoRA overhead (s) | End-to-end it/s | Note |
|--------------|-----------|-----------|-------------------|-----------------|------|
| none | none | N/A | N/A | 6.7 | fastest inference |
| balanced | none | N/A | N/A | 4.5 | default without LoRA |
| sequential | none | N/A | N/A | 0.6 | lowvram |
| none | native | backup | 1.8 / 0.0 | 6.0 | |
| balanced | native | backup | 1.3 / 0.0 | 2.8 | |
| sequential | native | backup | 5.8 / 0.0 | 0.5 | |
| none | native | fuse | 1.3 / 1.3 | 4.8 | |
| balanced | native | fuse | 2.8 / 2.5 | 3.1 | default with LoRA |
| sequential | native | fuse | 8.8 / 7.7 | 0.4 | |
| none | diffusers | default | 2.9 / 2.9 | 3.8 | |
| balanced | diffusers | default | 2.2 / 2.2 | 2.1 | |
| sequential | diffusers | default | 4.6 / 4.6 | 0.3 | |
| none | diffusers | fuse | 5.7 / 5.7 | 2.0 | |
| balanced | diffusers | fuse | N/A | N/A | did not complete |
| sequential | diffusers | fuse | N/A | N/A | did not complete |