Diffusion in ML is a way of generating data from an arbitrary distribution. Even though the concept is relatively simple, the math behind it makes it hard to grasp. There are several key papers on the subject:
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015) -- applies an idea from physics to ML in order to learn an arbitrary distribution using an MLP
- Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (2020) -- applied the diffusion model to images and achieved competitive results
From a high level, diffusion models attempt to capture the 'noise' in a sample (e.g. an image), such that if we subtract the predicted noise from the input, we move closer to the distribution of the training data.
To train diffusion models, there are two processes:
- forward: adds noise to the input image
- reverse: removes noise from the input image -- this is what we want to model
To reproduce the reverse process during inference, we want to parametrize it and design a model that would predict those parameters.
Thanks to some nice math, the reverse process can be explicitly defined in terms of its parameters, which gives us the true values the model should predict. With these in hand we can simply apply MSE to the predicted parameters and that's it.
Moreover, we can sample from the modelled distribution by applying the reverse process to random noise.
Denoising Diffusion Probabilistic Model (DDPM) is the 'basic' diffusion variant. It follows the high-level intuition described above, which omits the following details:
- What does the forward process look like? What sort of noise are we adding?
- What does the true reverse process look like?
- What is the goal of the reverse process? What should the model parametrize and predict?
- What does the loss look like?
- How to put it all together & simplify?
- How is the system trained? How do we generate a new sample?
The forward process is a Markov process that gradually adds Gaussian noise to the input over $T$ steps: $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t}\mathbf{x}_{t-1}, (1-\alpha_t)\mathbf{I})$, where $\beta_t = 1-\alpha_t$ follows a fixed noise schedule.
To see how the process noises the image after $t$ steps, we can unroll the recursion $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\mathbf{e}_t$, $\mathbf{e}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, merging the independent Gaussian noise terms at every step (below, $\mathbf{e}_{t-1}, \mathbf{e}_{t-2}, \dots$ denote the merged noise, which is still standard Gaussian):
$$
\begin{aligned}
\mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\mathbf{e}_t \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\mathbf{e}_{t-1}\right) + \sqrt{1-\alpha_t}\mathbf{e}_t \\
&= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t)}\,\mathbf{e}_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\mathbf{e}_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\mathbf{e}_{t-2} \\
&= \cdots \\
&= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\mathbf{e}_0
\end{aligned}
$$
This means that after $t$ steps the noised sample can be obtained directly from $\mathbf{x}_0$ in a single step: $q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$, where $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
If $T$ is large and the schedule is chosen so that $\bar{\alpha}_T \approx 0$, then $\mathbf{x}_T$ retains (almost) no information about $\mathbf{x}_0$ and is approximately pure noise, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
The above also justifies the definition of the prior the reverse process starts from, $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$.
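This one-shot formula is what makes training practical: we can jump straight to any noise level. Below is a minimal PyTorch sketch of the forward process under these definitions; the linear $\beta_t$ schedule is one common choice, and names like `q_sample` are just illustrative.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear beta_t schedule (one common choice)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in one shot: sqrt(bar{a}_t) x_0 + sqrt(1 - bar{a}_t) e.

    Timesteps are 0-indexed here (t = 0 corresponds to the first noising step).
    """
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over batch dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# usage: noise a batch of images at random timesteps
x0 = torch.randn(8, 3, 32, 32)               # stand-in for a batch of images
t = torch.randint(0, T, (8,))
x_t = q_sample(x0, t, torch.randn_like(x0))
```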
If the noise steps $\beta_t$ are small enough, the true reverse conditionals $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ are themselves approximately Gaussian, so we parametrize the learned reverse process with a Gaussian as well:
$$
p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \sigma^2_t\mathbf{I}\right)
$$
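Given such a parametrization, sampling (as mentioned in the intro) just runs the learned chain backwards from pure noise. A minimal sketch, assuming a hypothetical network `mu_model(x_t, t)` that predicts $\mu_\theta$ and a fixed per-step $\sigma_t$:

```python
import torch

@torch.no_grad()
def p_sample_loop(mu_model, shape, T, sigmas):
    """Draw a sample by iterating x_{t-1} = mu_theta(x_t, t) + sigma_t * z, from x_T ~ N(0, I)."""
    x = torch.randn(shape)                                   # x_T: pure Gaussian noise
    for t in reversed(range(T)):                             # t = T-1, ..., 0 (0-indexed)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        mean = mu_model(x, t_batch)                          # predicted mu_theta(x_t, t)
        z = torch.randn_like(x) if t > 0 else 0.0            # no noise on the final step
        x = mean + sigmas[t] * z                             # one reverse (denoising) step
    return x
```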
So essentially we need to find the correct values for $\mu_\theta(\mathbf{x}_t, t)$ (in DDPM the variance $\sigma^2_t$ is not learned but fixed, e.g. to $\beta_t$ or $\tilde{\beta}_t$). The training target comes from the true posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$, which becomes tractable once we also condition on $\mathbf{x}_0$ and apply Bayes' rule:
$$
\begin{aligned}
q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)\,\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)} \\
&\propto \exp\left(-\frac{1}{2}\left(\frac{(\mathbf{x}_t-\sqrt{\alpha_t}\mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^2}{1-\bar{\alpha}_t}\right)\right) \\
&= \exp\left(-\frac{1}{2}\left(\frac{\mathbf{x}_t^2-2\sqrt{\alpha_t}\mathbf{x}_t\mathbf{x}_{t-1}+\alpha_t\mathbf{x}_{t-1}^2}{\beta_t} + \frac{\mathbf{x}_{t-1}^2-2\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_{t-1}\mathbf{x}_0+\bar{\alpha}_{t-1}\mathbf{x}_0^2}{1-\bar{\alpha}_{t-1}} + \cdots\right)\right) \\
&= \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)\mathbf{x}_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\mathbf{x}_0\right)\mathbf{x}_{t-1} + \cdots\right)\right)
\end{aligned}
$$
From this we derive $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t\mathbf{I})$, where:
$$
\begin{aligned}
\tilde{\beta}_t &= 1\Big/\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right) = 1\Big/\left(\frac{\alpha_t(1-\bar{\alpha}_{t-1})+\beta_t}{\beta_t(1-\bar{\alpha}_{t-1})}\right) = 1\Big/\left(\frac{\alpha_t+\beta_t-\bar{\alpha}_t}{\beta_t(1-\bar{\alpha}_{t-1})}\right) = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t, \\
\tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\mathbf{x}_0\right)\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t.
\end{aligned}
$$
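In code, these posterior parameters are a handful of tensor operations on the schedule from the earlier sketch (reusing the illustrative `betas` / `alphas` / `alpha_bars` tensors; timesteps are 0-indexed):

```python
import torch

def q_posterior(x0, x_t, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0), valid for 0-indexed timesteps t >= 1."""
    shape = (-1, *([1] * (x0.dim() - 1)))                     # broadcast over batch dims
    beta_t = betas[t].view(shape)
    alpha_t = alphas[t].view(shape)
    ab_t = alpha_bars[t].view(shape)
    ab_prev = alpha_bars[t - 1].view(shape)                   # bar{alpha}_{t-1}

    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t             # tilde{beta}_t
    mean = (ab_prev.sqrt() * beta_t / (1.0 - ab_t)) * x0 \
         + (alpha_t.sqrt() * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t   # tilde{mu}_t(x_t, x_0)
    return mean, var
```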
As with all ML models, the goal is to maximize the likelihood of the training data, i.e. to minimize the negative log-likelihood. The exact likelihood is intractable, so we bound it from above using Jensen's inequality and decompose the bound into per-step terms:
$$
\begin{aligned}
-\mathbb{E}_{q(\mathbf{x}_0)}[\log p_\theta(\mathbf{x}_0)] &= -\mathbb{E}_{q(\mathbf{x}_0)}\left[\log\mathbb{E}_{p_\theta(\mathbf{x}_{1:T})}[p_\theta(\mathbf{x}_0|\mathbf{x}_{1:T})]\right] \\
&= -\mathbb{E}_{q(\mathbf{x}_0)}\left[\log\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]\right] \\
&\leq -\mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] = \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log\frac{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})}\right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[-\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T\log\frac{q(\mathbf{x}_t|\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1|\mathbf{x}_0)}{p_\theta(\mathbf{x}_0|\mathbf{x}_1)}\right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[-\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T\log\left(\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}\cdot\frac{q(\mathbf{x}_t|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}\right) + \log\frac{q(\mathbf{x}_1|\mathbf{x}_0)}{p_\theta(\mathbf{x}_0|\mathbf{x}_1)}\right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[-\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T|\mathbf{x}_0)}{q(\mathbf{x}_1|\mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1|\mathbf{x}_0)}{p_\theta(\mathbf{x}_0|\mathbf{x}_1)}\right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\log\frac{q(\mathbf{x}_T|\mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)\right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\left[\underbrace{D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T\underbrace{D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_t} + \underbrace{\left(-\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)\right)}_{L_0}\right]
\end{aligned}
$$
From these terms:
- $L_T$ is constant w.r.t. $\theta$, since $q$ is fixed and $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, so it can be dropped from the loss.
- $L_t = D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$ is a KL divergence between two Gaussians with fixed variances, so it can be expressed analytically as $L_t = \frac{1}{2\sigma_t^2}\left\lVert\tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \mu_\theta(\mathbf{x}_t, t)\right\rVert^2 + C$, i.e. an MSE between the true and the predicted posterior means.
- $L_0 = -\log p_\theta(\mathbf{x}_0 | \mathbf{x}_1)$ -- we ignore it for now (TODO)
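Putting the pieces together, a stochastic estimate of $L_t$ is just an MSE between the true posterior mean and the predicted one. A minimal sketch with the mean-predicting parametrization, reusing the illustrative helpers from the earlier sketches (the $\tfrac{1}{2\sigma_t^2}$ weighting is dropped for simplicity):

```python
import torch

def loss_L_t(mu_model, x0, betas, alphas, alpha_bars, T):
    """One stochastic estimate of L_t: MSE between tilde{mu}_t and mu_theta."""
    b = x0.shape[0]
    t = torch.randint(1, T, (b,))                            # random 0-indexed timestep >= 1
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise         # forward: sample q(x_t | x_0)

    target_mean, _ = q_posterior(x0, x_t, t, betas, alphas, alpha_bars)   # tilde{mu}_t
    pred_mean = mu_model(x_t, t)                             # mu_theta(x_t, t)
    return torch.mean((pred_mean - target_mean) ** 2)        # weighting constants dropped
```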
TODO: express the loss for a noise-predicting model $\mathbf{e}_\theta(\mathbf{x}_t, t)$ instead of the mean-predicting $\mu_\theta(\mathbf{x}_t, t)$ (the simplified objective from Ho et al.)
TODO: rewrite from slides