## DDIM {#sec-DDIM}
One of the major drawbacks of DDPM is the lengthy time required for data generation, especially compared with other generative AI methods.
In response to this issue, an improved version of DDPM, known as Denoising Diffusion Implicit Models (DDIM), was introduced by @song2022denoising.
The key innovation of DDIM is that it significantly accelerates data generation: by refining the underlying diffusion mechanism, it reduces the number of required diffusion steps without sacrificing the quality of the generated data, making DDIM a more practical and efficient alternative for generative AI tasks.
We now introduce DDIM.
The main reason why we can decompose $L$ in @eq-decom-L in DDPM is that we have the following factorizations of the two densities:
$$
\begin{aligned}
p_{\theta}(x_{0:T}) &= p_{\theta}(x_T) \cdot \prod_{t=2}^T p_{\theta}(x_{t-1}\vert x_t) \cdot p_{\theta}(x_0\vert x_1) , \cr
q(x_{1:T}\vert x_0) &= q(x_T\vert x_0) \cdot \prod_{t=2}^{T} q(x_{t-1} \vert x_{t}, x_0).
\end{aligned}
$$ {#eq-q_prod}
DDIM considers a new forward process $\bigl( \lbrace X_0,X_1,\cdots,X_T\rbrace, \mathbf{Q}_{\sigma} \bigr),$ where $\mathbf{Q}_{\sigma}$ is a probability measure indexed by $\sigma\in [0,\infty)^T$.
This forward process is not a Markov chain, but for each $t$ the conditional density of $X_{t}$ given $X_0=x_0$ is the same as in DDPM.
Inspired by @eq-q_prod, DDIM directly defines the joint density
$$
\begin{aligned}
{\color{red}{q_{\sigma}(x_{0:T})}} := q_{\sigma}(x_T\vert x_0) \cdot \prod_{t=2}^T q_{\sigma}(x_{t-1}\vert x_t, x_0) \cdot q(x_0),
\end{aligned}
$$
where $q_{\sigma}(x_T\vert x_0):=\mathcal{N}(\sqrt{\overline{\alpha}_T}x_0,(1-\overline{\alpha}_T)\mathbf{I})$ and
$$
\begin{aligned}
{\color{red}{q_{\sigma} (x_{t-1}\vert x_t,x_0)}}
:= \mathcal{N}\biggl( \sqrt{\overline{\alpha}_{t-1}}x_0 + \sqrt{1-\overline{\alpha}_{t-1} - \sigma_t^2} \cdot \frac{x_t-\sqrt{\overline{\alpha}_t}x_0}{\sqrt{1-\overline{\alpha}_t}} , \sigma_t^2 \mathbf{I} \biggr), \quad t=2,\cdots, T.
\end{aligned}
$$
Note that $q_{\sigma}(x_{0:T})$ is indeed a density, since it is a product of densities.
It may seem odd that the joint density $q_{\sigma}(x_{0:T})$ is specified through conditional densities.
In fact, $\bigl(\lbrace X_0,X_1,\cdots,X_T \rbrace,\mathbf{Q}_{\sigma}\bigr)$ is a process satisfying the following conditions:
1. Under $\mathbf{Q}_{\sigma},$ $X_0$ has the density $q(x_0).$
2. Conditioned on $X_0=x_0,$ the process $\Bigl( \lbrace X_T,X_{T-1},\cdots, X_2,X_1\rbrace\Big\vert_{X_0=x_0}, \mathbf{Q}_{\sigma} \Bigr)$ is a Markov chain with
- the initial density $q_{\sigma}(x_T\vert x_0)= \mathcal{N}(\sqrt{\overline{\alpha}_T}x_0,(1-\overline{\alpha}_T)\mathbf{I})$ and
- the transition density
$$
\begin{aligned}
q_{\sigma} (x_{t-1}\vert x_t,x_0) = \mathcal{N}\biggl( \sqrt{\overline{\alpha}_{t-1}}x_0 + \sqrt{1-\overline{\alpha}_{t-1} - \sigma_t^2} \cdot \frac{x_t-\sqrt{\overline{\alpha}_t}x_0}{\sqrt{1-\overline{\alpha}_t}} , \sigma_t^2 \mathbf{I} \biggr), \quad t=2,\cdots, T.
\end{aligned}
$$
Note that if we write $q_{\sigma}(x_{t-1}\vert x_t,x_0)=\mathcal{N} \bigl(f(x_t,x_0,t), \sigma_t^2 \mathbf{I}\bigr),$
then the process
$$
\begin{aligned}
\Bigl( \lbrace X_T,X_{T-1},\cdots, X_2,X_1\rbrace\Big\vert_{X_0=x_0}, \mathbf{Q}_{\sigma} \Bigr)
\end{aligned}
$$
can be written as follows: conditioned on $X_0=x_0,$
$$
\begin{aligned}
X_{t-1} = f(X_t,x_0,t) + \sigma_t \xi_t, \quad t=T,\cdots, 2,
\end{aligned}
$$
where $X_T,\xi_{T-1},\xi_{T-2},\cdots, \xi_{1}$ are independent under $\mathbf{Q}_{\sigma}.$
For each $\sigma\in [0,\infty)^T,$
we can show that for this joint density $q_{\sigma}(x_{0:T}),$
$$
\begin{aligned}
q_{\sigma}(x_0) &= q(x_0), \cr
q_{\sigma} (x_t \vert x_0) &= \mathcal{N} \bigl(\sqrt{\overline{\alpha}_t}x_0, (1-\overline{\alpha}_t)\mathbf{I}\bigr) = q(x_t\vert x_0) , \quad t=1,\cdots,T.
\end{aligned}
$$
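This marginal-matching property can be checked numerically. The following sketch (a hypothetical toy schedule with scalar data; all variable names are illustrative) simulates the reversed-time chain defined above and verifies by Monte Carlo that each marginal agrees with the DDPM forward marginal $q(x_t\vert x_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy schedule with T = 5 steps; alpha_bar[t] stores \bar{alpha}_t,
# with the convention alpha_bar[0] = 1 so indices match subscripts in the text.
T = 5
beta = np.linspace(0.01, 0.2, T)
alpha_bar = np.concatenate(([1.0], np.cumprod(1.0 - beta)))
sigma = np.full(T + 1, 0.05)   # any sigma with sigma_t^2 <= 1 - alpha_bar[t-1]

x0 = 1.3                        # a fixed scalar data point (condition on X_0 = x0)
n = 200_000                     # Monte Carlo sample size

# Initial density q_sigma(x_T | x_0) = N(sqrt(abar_T) x0, 1 - abar_T).
x = np.sqrt(alpha_bar[T]) * x0 + np.sqrt(1.0 - alpha_bar[T]) * rng.standard_normal(n)

# Reversed-time transitions q_sigma(x_{t-1} | x_t, x_0) for t = T, ..., 2.
for t in range(T, 1, -1):
    mean = (np.sqrt(alpha_bar[t - 1]) * x0
            + np.sqrt(1.0 - alpha_bar[t - 1] - sigma[t] ** 2)
            * (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t]))
    x = mean + sigma[t] * rng.standard_normal(n)
    # The marginal should match the DDPM forward marginal q(x_{t-1} | x_0).
    assert abs(x.mean() - np.sqrt(alpha_bar[t - 1]) * x0) < 1e-2
    assert abs(x.std() - np.sqrt(1.0 - alpha_bar[t - 1])) < 1e-2
```

The variance bookkeeping explains why this works: each step contributes $(1-\overline{\alpha}_{t-1}-\sigma_t^2)+\sigma_t^2 = 1-\overline{\alpha}_{t-1},$ independent of $\sigma.$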
DDIM considers the backward process $\bigl( \lbrace X_T,X_{T-1},\cdots,X_1,X_0 \rbrace, \mathbf{P}_{\theta} \bigr)$ as a Markov chain with the initial distribution $p_{\theta}({x_T})=\mathcal{N}(\mathbf{0},\mathbf{I})$ and the transition density
$$
\begin{aligned}
p_{\theta}(x_0\vert x_1) &= \mathcal{N}( {\color{blue}{\widehat{x}_0(x_1,1)}}, \sigma_1^2 \mathbf{I} ) , \cr
p_{\theta}(x_{t-1}\vert x_t) &= q_{\sigma}(x_{t-1}\vert x_t,{\color{blue}{\widehat{x}_0}}) \cr
&= \mathcal{N}\biggl( \sqrt{\overline{\alpha}_{t-1}}{\color{blue}{\widehat{x}_0}} + \sqrt{1-\overline{\alpha}_{t-1} - \sigma_t^2} \cdot \frac{x_t-\sqrt{\overline{\alpha}_t}{\color{blue}{\widehat{x}_0}}}{\sqrt{1-\overline{\alpha}_t}} , \sigma_t^2 \mathbf{I} \biggr), \quad t=2,\cdots, T,
\end{aligned}
$$
where ${\color{blue}{\widehat{x}_0}}=\widehat{x}_0(x_t,t)$ satisfies
$$
\begin{aligned}
x_t = \sqrt{\overline{\alpha}_t}\cdot \widehat{x}_0 (x_t,t) + \sqrt{1-\overline{\alpha}_t} \cdot \mathtt{Net}_{\theta}(x_t,t), \quad x_t\in \mathbb R^n,\, t=1,\cdots,T.
\end{aligned}
$$
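Solving this relation for $\widehat{x}_0$ gives $\widehat{x}_0(x_t,t) = \bigl(x_t - \sqrt{1-\overline{\alpha}_t}\,\mathtt{Net}_{\theta}(x_t,t)\bigr)/\sqrt{\overline{\alpha}_t}.$ A minimal sketch of this inversion (the names `predict_x0` and `net` are hypothetical; `net` stands in for the trained $\mathtt{Net}_{\theta}$):

```python
import numpy as np

def predict_x0(x_t, t, alpha_bar, net):
    """Invert x_t = sqrt(abar_t) * x0_hat + sqrt(1 - abar_t) * net(x_t, t)
    for x0_hat; `net` is a stand-in for the trained noise predictor."""
    eps = net(x_t, t)
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])

# Sanity check with a "perfect" network that returns the true noise.
alpha_bar = np.array([1.0, 0.9, 0.8])   # toy values of \bar{alpha}_t
x0 = np.array([0.5, -1.0])
noise = np.array([0.3, 0.7])
t = 2
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
x0_hat = predict_x0(x_t, t, alpha_bar, lambda x, t: noise)
assert np.allclose(x0_hat, x0)
```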
By the construction of $q_{\sigma}$ and $p_{\theta},$ we still have the decomposition
$$
\begin{aligned}
&\mathbb E_{X_{0:T}\sim q_{\sigma}(x_{0:T})} \Bigl[ -\log \frac{p_{\theta}(X_{0:T})}{q_{\sigma}(X_{1:T}\vert X_0)} \Bigr] \cr
&= \underbrace{\mathbb E_{X_0\sim q_{\sigma}(x_0)} \biggl[ D_{\mathtt{KL}} \Bigl( \underline{q_{\sigma}(x_T \vert x_0)} \big\Vert \underline{p(x_T)} \Bigr) \Big\vert_{x_0=X_0} \biggr]}_{L_T} \cr
& \qquad + \sum_{t=2}^T
\underbrace{\mathbb E_{X_0,X_t\sim q_{\sigma}(x_0,x_{t})} \biggl[
D_{\mathtt{KL}} \Bigl(
{\underline{\color{red}{q_{\sigma}(x_{t-1} \vert x_t,x_0)}}}
\big\Vert
\underline{\color{blue}{p_{\theta}(x_{t-1}\vert x_t)} }
\Bigr)\Big \vert_{x_0,x_t=X_0,X_t}
\biggr]}_{L_{t-1}} \cr
& \qquad \qquad + \underbrace{\mathbb E_{X_0,X_1\sim q_{\sigma}(x_0,x_1)} \biggl[
-\log {\color{blue}{p_{\theta}(x_0 \vert x_1)}} \Big\vert_{x_0,x_1=X_0,X_1}
\biggr]}_{L_0}.
\end{aligned}
$$
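For $\sigma_t>0,$ both $q_{\sigma}(x_{t-1}\vert x_t,x_0)$ and $p_{\theta}(x_{t-1}\vert x_t)$ are Gaussians with the same covariance $\sigma_t^2\mathbf{I},$ so each KL term $L_{t-1}$ collapses to a scaled squared distance between the two means, $\Vert \mu_q - \mu_p \Vert^2 / (2\sigma_t^2).$ A quick numerical cross-check of this identity in one dimension (all names hypothetical):

```python
import numpy as np

# For Gaussians with equal covariance s^2 I, the KL divergence reduces to
#   KL( N(mu_q, s^2 I) || N(mu_p, s^2 I) ) = ||mu_q - mu_p||^2 / (2 s^2).
def kl_same_cov(mu_q, mu_p, s):
    return np.sum((np.asarray(mu_q) - np.asarray(mu_p)) ** 2) / (2.0 * s ** 2)

# Cross-check against a direct numerical integral of q * log(q / p) in 1D.
mu_q, mu_p, s = 0.4, -0.1, 0.7
x = np.linspace(-6.0, 6.0, 200_001)
dx = x[1] - x[0]
q = np.exp(-(x - mu_q) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
p = np.exp(-(x - mu_p) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
kl_numeric = np.sum(q * np.log(q / p)) * dx
assert abs(kl_numeric - kl_same_cov([mu_q], [mu_p], s)) < 1e-5
```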
There are two special values for $\sigma.$
- The first one is
$$
\begin{aligned}
\sigma_t = \sqrt{(1-\overline{\alpha}_{t-1})/(1-\overline{\alpha}_t)} \sqrt{ 1-\alpha_t }, \quad t = 1,\cdots, T.
\end{aligned}
$$
Under this $\sigma,$
the forward process $\bigl(\lbrace X_0,X_1,\cdots,X_T \rbrace,\mathbf{Q}_{\sigma}\bigr)$ becomes a Markov chain and DDIM reduces to the original DDPM.
- The second one is $\sigma_t = 0$ for $t=1,2,\cdots, T.$
In this case, the backward process $\bigl(\lbrace X_T,X_{T-1},\cdots,X_0 \rbrace,\mathbf{P}_{\theta}\bigr)$ becomes deterministic when we condition on $X_T = x_T.$
Such a deterministic backward process can be run along a short sub-sequence of the time steps, which greatly speeds up the sampling of diffusion models.
In this case,
$$
\begin{aligned}
q_{\sigma}(x_{t-1}\vert x_t,x_0)
&= \mathcal{N} \biggl( \sqrt{\overline{\alpha}_{t-1}}x_0 + \sqrt{1-\overline{\alpha}_{t-1} } \cdot \frac{x_t-\sqrt{\overline{\alpha}_t}x_0}{\sqrt{1-\overline{\alpha}_t}} \,\, , \,\, 0 \biggr) \cr
&= \mathcal{N} \biggl( \Bigl( \sqrt{\overline{\alpha}_{t-1}} - \frac{\sqrt{1-\overline{\alpha}_{t-1}}\,\sqrt{\overline{\alpha}_t}}{\sqrt{1-\overline{\alpha}_t}} \Bigr) x_0 \,\, + \,\, \frac{\sqrt{1-\overline{\alpha}_{t-1}}}{\sqrt{1-\overline{\alpha}_t}} x_t
\,\, , \,\, 0 \biggr) \cr
p_{\theta}(x_{t-1}\vert x_t)
&= \mathcal{N} \biggl(
\frac{1}{\sqrt{\alpha_t}} x_t + \Bigl( \sqrt{1-\overline{\alpha}_{t-1}}-\frac{\sqrt{1-\overline{\alpha}_t}}{\sqrt{\alpha_t}} \Bigr) \cdot \mathtt{Net}_{\theta}(x_t,t)
\,\, , \,\, 0 \biggr).
\end{aligned}
$$
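The $\sigma = 0$ transition above yields a fully deterministic sampler once $x_T$ is drawn. A minimal sketch of the recursion (toy schedule; `net` is a stand-in for the trained $\mathtt{Net}_{\theta},$ here replaced by a zero function just to exercise the update):

```python
import numpy as np

def ddim_step(x_t, t, alpha, alpha_bar, net):
    """One deterministic (sigma_t = 0) update x_t -> x_{t-1};
    `net` is a hypothetical stand-in for the trained noise predictor."""
    eps = net(x_t, t)
    return (x_t / np.sqrt(alpha[t])
            + (np.sqrt(1.0 - alpha_bar[t - 1])
               - np.sqrt(1.0 - alpha_bar[t]) / np.sqrt(alpha[t])) * eps)

def ddim_sample(x_T, alpha, alpha_bar, net):
    """Run x_T -> ... -> x_1, then take the point mass x0_hat(x_1, 1)."""
    T = len(alpha) - 1
    x = x_T
    for t in range(T, 1, -1):
        x = ddim_step(x, t, alpha, alpha_bar, net)
    # p_theta(x_0 | x_1) with sigma_1 = 0 concentrates at x0_hat(x_1, 1).
    return (x - np.sqrt(1.0 - alpha_bar[1]) * net(x, 1)) / np.sqrt(alpha_bar[1])

# Toy run: alpha[t] = alpha_t and alpha_bar[t] = \bar{alpha}_t, alpha_bar[0] = 1.
beta = np.linspace(0.01, 0.2, 5)
alpha = np.concatenate(([1.0], 1.0 - beta))
alpha_bar = np.cumprod(alpha)
x0 = ddim_sample(np.ones(3), alpha, alpha_bar, lambda x, t: np.zeros_like(x))
# With a zero network every step is x / sqrt(alpha_t), so x0 = x_T / sqrt(abar_T).
assert np.allclose(x0, 1.0 / np.sqrt(alpha_bar[5]))
```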
<!--
we may write
$$
\begin{aligned}
X_{t-1} = \sqrt{\overline{\alpha}_{t-1}} \widehat{X}_0 (X_t,t) + \sqrt{1-\overline{\alpha}_{t-1}} \cdot \mathtt{Net}_{\theta}(X_t,t), \quad t=T,T-1,\cdots, 1.
\end{aligned}
$$
-->