# Progressive Distillation for Fast Sampling of Diffusion Models
Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation.
A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations.
Here we make two contributions to help eliminate this downside:
- First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps.
- Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps.
We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time.
On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality, achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps.
Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
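
To make the halving step concrete, below is a minimal sketch of one distillation round in PyTorch, under simplifying assumptions not taken from the text above: `teacher` and `student` are hypothetical networks predicting the clean image $x$ from the noisy latent and timestep, a cosine schedule stands in for whichever schedule is actually used, and the plain squared error omits the paper's more stable parameterizations and loss weightings.

```python
import torch

# Cosine signal/noise schedule (an assumption for this sketch), with
# z_t = alpha(t) * x + sigma(t) * eps and alpha^2 + sigma^2 = 1.
def alpha(t):
    return torch.cos(0.5 * torch.pi * t)

def sigma(t):
    return torch.sin(0.5 * torch.pi * t)

def ddim_step(model, z_t, t, s):
    """One deterministic DDIM step from time t to an earlier time s,
    for a model that predicts the clean image x from (z_t, t)."""
    x_pred = model(z_t, t)
    eps_pred = (z_t - alpha(t) * x_pred) / sigma(t)
    return alpha(s) * x_pred + sigma(s) * eps_pred

def distill_one_round(teacher, student, loader, optimizer, n_student_steps):
    """Train `student` so one of its DDIM steps matches two teacher steps;
    afterwards the student samples in n_student_steps steps, i.e. half
    the teacher's count."""
    dt = 1.0 / n_student_steps  # one student step == two teacher half-steps
    for x in loader:  # batches of images, shape (B, C, H, W)
        # Draw a timestep on the student's step grid and diffuse the data.
        i = torch.randint(1, n_student_steps + 1, (x.shape[0], 1, 1, 1))
        t = i.float() / n_student_steps
        eps = torch.randn_like(x)
        z_t = alpha(t) * x + sigma(t) * eps

        with torch.no_grad():
            # Two teacher DDIM steps of size dt/2 ...
            z_mid = ddim_step(teacher, z_t, t, t - 0.5 * dt)
            z_s = ddim_step(teacher, z_mid, t - 0.5 * dt, t - dt)
            # ... inverted through one student DDIM step to obtain the
            # x-target the student must predict to land exactly on z_s.
            s = t - dt
            ratio = sigma(s) / sigma(t)
            x_target = (z_s - ratio * z_t) / (alpha(s) - ratio * alpha(t))

        loss = ((student(z_t, t) - x_target) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Repeating this round with the distilled student serving as the next round's teacher halves the step count each time, e.g. 8192 → 4096 → ... → 4.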
---
We have presented progressive distillation, a method to drastically reduce the number of sampling steps required for high quality generation of images, and potentially other data, using diffusion models with deterministic samplers like DDIM (Song et al., 2020).
By making these models cheaper to run at test time, we hope to increase their usefulness for practical applications, for which running time and computational requirements often represent important constraints.
In the current work we limited ourselves to setups where the student model has the same architecture and number of parameters as the teacher model; in future work we hope to relax this constraint and explore settings where the student model is smaller, potentially enabling further gains in test time computational requirements.
In addition, we hope to move beyond the generation of images and explore progressive distillation of diffusion models for other data modalities, such as audio (Chen et al., 2021).
Beyond the distillation procedure itself, some of our progress was realized through different parameterizations of the diffusion model and its training loss.
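
One such parameterization, for example, trains the network to predict the velocity $v \equiv \alpha_t \epsilon - \sigma_t x$ rather than the noise $\epsilon$: writing $z_t = \alpha_t x + \sigma_t \epsilon$ with $\alpha_t^2 + \sigma_t^2 = 1$, the clean image is recovered as

$$
\hat{x} = \alpha_t z_t - \sigma_t \hat{v}_\theta(z_t),
$$

which stays well-behaved as $\alpha_t \to 0$, whereas reconstructing $\hat{x} = (z_t - \sigma_t \hat{\epsilon}_\theta(z_t))/\alpha_t$ from an $\epsilon$-prediction diverges in that limit, precisely the regime that few-step sampling stresses.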
We expect to see more progress in this direction as the community further explores this model class.