Problem: State and prove Tweedie’s formula.
Solution: Tweedie’s formula asserts that if \(\mathbf x\) is conditionally normally distributed with density \(p(\mathbf x|\boldsymbol{\mu},\sigma)=\frac{1}{\det(\sqrt{2\pi}\sigma)}e^{-(\mathbf x-\boldsymbol{\mu})^T\sigma^{-2}(\mathbf x-\boldsymbol{\mu})/2}\), then, without needing to know anything about the prior \(p(\boldsymbol{\mu}|\sigma)\) on the mean random vector \(\boldsymbol{\mu}\), one has the following Bayesian point estimate (the posterior mean) for it:
\[\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle=\mathbf x+\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\]
At first glance, one might think that \(\langle\boldsymbol{\mu}|\mathbf x\rangle\approx\mathbf x\), but Tweedie’s formula provides the empirical Bayes correction \(\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\) to the naive estimate. The proof amounts to a brute force computation of the score vector field:
\[\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\frac{\partial p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}p(\boldsymbol{\mu}|\sigma)\frac{\partial p(\mathbf x|\boldsymbol{\mu},\sigma)}{\partial\mathbf x}\]
\[=-\frac{\sigma^{-2}}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}p(\boldsymbol{\mu}|\sigma)(\mathbf x-\boldsymbol{\mu})p(\mathbf x|\boldsymbol{\mu},\sigma)\]
\[=-\frac{\sigma^{-2}}{p(\mathbf x|\sigma)}\left(\mathbf xp(\mathbf x|\sigma)-\int d\boldsymbol{\mu}\boldsymbol{\mu}p(\mathbf x|\sigma)p(\boldsymbol{\mu}|\mathbf x,\sigma)\right)\]
\[=\sigma^{-2}(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle-\mathbf x)\]
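As a numerical sanity check, consider the 1D conjugate case where the prior is itself Gaussian, \(\mu\sim\mathcal N(m_0,\tau^2)\), so that both the marginal score and the posterior mean are available in closed form (a minimal numpy sketch; the particular values of \(m_0,\tau,\sigma\) are arbitrary):

```python
import numpy as np

# Conjugate 1D setup: prior mu ~ N(m0, tau^2), likelihood x|mu ~ N(mu, sigma^2).
m0, tau, sigma = 1.5, 2.0, 0.7
x = np.linspace(-5.0, 5.0, 11)

# Marginal p(x) = N(m0, tau^2 + sigma^2), so the score is known in closed form.
score = -(x - m0) / (tau**2 + sigma**2)

# Tweedie: posterior mean = x + sigma^2 * score.
tweedie = x + sigma**2 * score

# Direct conjugate-prior posterior mean: (tau^2 x + sigma^2 m0) / (tau^2 + sigma^2).
exact = (tau**2 * x + sigma**2 * m0) / (tau**2 + sigma**2)

assert np.allclose(tweedie, exact)
```

Note that the empirical Bayes correction shrinks \(x\) towards the prior mean \(m_0\), exactly as the conjugate posterior does.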
Problem: Show that the set of Gaussians is closed under both convolution and multiplication.
Solution: The convolution of two normalized Gaussians is also a normalized Gaussian representing the probability distribution of the sum of the two independent normal random variables (i.e. it is a kind of orthonormal bilinear transformation):
\[p(\mathbf x|\boldsymbol{\mu}_1,\sigma_1)*p(\mathbf x|\boldsymbol{\mu}_2,\sigma_2)=p(\mathbf x|\boldsymbol{\mu}_{1*2},\sigma_{1*2})\]
where \(\boldsymbol{\mu}_{1*2}=\boldsymbol{\mu}_1+\boldsymbol{\mu}_2\) and \(\sigma^2_{1*2}=\sigma^2_1+\sigma^2_2\) (thus, addition of independent normal random variables is isomorphic to vector addition in \((\boldsymbol{\mu},\sigma^2)\)-space).
The product of two normalized Gaussians is an (unnormalized) Gaussian, whose normalization constant itself has a Gaussian form:
\[p(\mathbf x|\boldsymbol{\mu}_1,\sigma_1)p(\mathbf x|\boldsymbol{\mu}_2,\sigma_2)=p(\boldsymbol{\mu}_1|\boldsymbol{\mu}_2,\sigma_{1*2})p(\mathbf x|\boldsymbol{\mu}_{12},\sigma_{12})\]
where \(\sigma^{-2}_{12}\boldsymbol{\mu}_{12}=\sigma^{-2}_1\boldsymbol{\mu}_1+\sigma^{-2}_2\boldsymbol{\mu}_2\) and \(\sigma^{-2}_{12}=\sigma^{-2}_1+\sigma^{-2}_2\).
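Both identities can be checked numerically on a 1D grid (a minimal numpy sketch with arbitrarily chosen parameters), here for the product identity:

```python
import numpy as np

def gauss(x, mu, var):
    # Normalized 1D Gaussian density N(x; mu, var).
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-8, 8, 201)
mu1, var1, mu2, var2 = 1.0, 0.5, -2.0, 2.0

# Left-hand side: pointwise product of the two densities.
lhs = gauss(x, mu1, var1) * gauss(x, mu2, var2)

# Right-hand side: Gaussian normalization constant times a Gaussian in x.
var12 = 1.0 / (1.0/var1 + 1.0/var2)       # precisions add
mu12 = var12 * (mu1/var1 + mu2/var2)      # precision-weighted means
const = gauss(mu1, mu2, var1 + var2)      # note: the convolution variance appears here
rhs = const * gauss(x, mu12, var12)

assert np.allclose(lhs, rhs)
```

The appearance of the convolution variance \(\sigma^2_{1*2}\) in the normalization constant is precisely the tie between the two closure properties.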
(aside: the properties above are of course tied by the fact that Gaussians are (roughly speaking) eigenfunctions of the Fourier transform:
\[\int d\mathbf xe^{-i\mathbf k\cdot\mathbf x}p(\mathbf x|\boldsymbol{\mu},\sigma)=e^{-i\boldsymbol{\mu}\cdot\mathbf k-\mathbf k^T\sigma^{2}\mathbf k/2}\]
which intertwines convolution and multiplication).
Problem: What is the fundamental problem that diffusion models solve?
Solution: Given some prior data distribution \(p(\mathbf x_0)\) (e.g. the distribution of images or audio), and given some latent \(\mathbf x_T\) which does not lie in the support of \(p\) in the sense that \(p(\mathbf x_T)\approx 0\), a diffusion model is any (possibly non-deterministic) algorithm for transporting \(\mathbf x_T\) onto the submanifold of latents \(\mathbf x_0\) where \(p(\mathbf x_0)>0\). Roughly speaking, a diffusion model is any parametric ansatz for the score vector field \(\partial\ln p(\mathbf x_0)/\partial\mathbf x_0\) of the data distribution, so that one can get from \(\mathbf x_T\mapsto\mathbf x_0\) by taking steps along the score vector field. Diffusion models are commonly used as generative models in the sense that the latent \(\mathbf x_0\) represents a novel sample from \(p(\mathbf x_0)\), though other non-generative applications (e.g. reconstruction) do exist.
Problem: Explain the problem that denoising diffusion probabilistic models (DDPMs) solve, and contrast how they are used during inference vs. how they are trained.
Solution: DDPMs are generative diffusion models. At inference time, one begins by sampling some noisy latent \(\mathbf x_T\) where \(p(\mathbf x_T)\approx 0\), and then iteratively denoises it through a sequence \(\mathbf x_T,…,\mathbf x_0\) towards a latent \(\mathbf x_0\) on the data submanifold where \(p(\mathbf x_0)>0\), thereby generating the sample \(\mathbf x_0\). More explicitly, the sequence \(\mathbf x_T,…,\mathbf x_0\) will turn out to be a discrete-time Markov chain, and the operation of “denoising” \(\mathbf x_t\mapsto\mathbf x_{t-1}\) will really just amount to sampling from the Markov chain’s transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)=\int d\mathbf x_0p(\mathbf x_0|\mathbf x_t)p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\). As it is, this is intractable, but focus on that second term in the integrand:
\[p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\propto p(\mathbf x_t|\mathbf x_{t-1},\mathbf x_0)p(\mathbf x_{t-1}|\mathbf x_0)\]
Both of these are essentially transition kernels for the forward Markov chain \(\mathbf x_0,…,\mathbf x_T\). At this point, the DDPM paper decides to use a Gaussian forward transition kernel:
\[p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\]
where \(0<\sigma^2_1<\sigma^2_2<…<\sigma^2_T\ll 1\) are just \(T\) fixed hyperparameters (called a variance schedule, where the horizon \(T\) is also a hyperparameter). Equivalently, using the reparameterization trick, adding Gaussian noise looks like:
\[\mathbf x_t=\sqrt{1-\sigma_t^2}\mathbf x_{t-1}+\sigma_t\boldsymbol{\varepsilon}\]
where \(\boldsymbol{\varepsilon}\) is drawn from an isotropic standard normal \(p(\boldsymbol{\varepsilon})\sim e^{-|\boldsymbol{\varepsilon}|^2/2}\). The \(\sqrt{1-\sigma_t^2}\) factor is used to scale down the previous state \(\mathbf x_{t-1}\) in order to prevent the variance (i.e. diagonal entries of the covariance matrix) of \(\mathbf x_t\) from exploding (and indeed, would be exactly variance-preserving iff the data distribution covariance \(\sigma^2_{\mathbf x_0}=1\)); cf. Brownian motion of a particle attached to the origin by a spring.
Rather than iteratively stepping from \(\mathbf x_0\) to \(\mathbf x_t\) via \(O(t)\) noise samples, one can jump from \(\mathbf x_0\mapsto\mathbf x_t\) in \(O(1)\) time via just a single \(\boldsymbol{\varepsilon}\)-sample:
\[\mathbf x_t=\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\]
Or equivalently:
\[p(\mathbf x_t|\mathbf x_0)\sim e^{-|\mathbf x_t-\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0|^2/2(1-(1-\sigma^2_1)…(1-\sigma_t^2))}\]
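The closed-form jump is easy to verify against the stepwise recursion for the mean coefficient and variance of \(\mathbf x_t|\mathbf x_0\) (a minimal numpy sketch; the variance schedule is a toy, arbitrarily chosen one):

```python
import numpy as np

T = 50
sig2 = np.linspace(1e-4, 0.02, T)        # toy variance schedule (hypothetical values)

# Propagate the mean coefficient and variance of x_t | x_0 one step at a time:
# x_t = sqrt(1 - sig2_t) x_{t-1} + sig_t eps.
scale, var = 1.0, 0.0
for t in range(T):
    scale *= np.sqrt(1 - sig2[t])        # coefficient multiplying x_0
    var = (1 - sig2[t]) * var + sig2[t]  # old variance shrinks, fresh noise is added

# Closed-form jump: x_T = sqrt(prod(1-sig2)) x_0 + sqrt(1 - prod(1-sig2)) eps.
prod = np.prod(1 - sig2)
assert np.isclose(scale, np.sqrt(prod))
assert np.isclose(var, 1 - prod)
```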
Thus, the “oracle posterior” \(p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\sim e^{-|\mathbf x_{t-1}-\boldsymbol{\mu}_{t-1}|^2/2\tilde{\sigma}^2_{t-1}}\) is an isotropic Gaussian with:
\[\tilde{\sigma}^{-2}_{t-1}=\frac{1-\sigma_t^2}{\sigma_t^2}+\frac{1}{1-(1-\sigma_1^2)…(1-\sigma^2_{t-1})}\Rightarrow \tilde{\sigma}^2_{t-1}=\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\sigma^2_t\approx\sigma^2_t\]
\[\tilde{\sigma}^{-2}_{t-1}\boldsymbol{\mu}_{t-1}=\frac{1-\sigma_t^2}{\sigma_t^2}\frac{\mathbf x_t}{\sqrt{1-\sigma_t^2}}+\frac{\sqrt{(1-\sigma_1^2)…(1-\sigma^2_{t-1})}}{1-(1-\sigma^2_1)…(1-\sigma^2_{t-1})}\mathbf x_0\]
\[=\frac{\sqrt{1-\sigma_t^2}}{\sigma_t^2}\mathbf x_t+\frac{1}{1-(1-\sigma^2_1)…(1-\sigma^2_{t-1})}\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\right)\]
\[=\frac{1}{\tilde{\sigma}^2_{t-1}\sqrt{1-\sigma^2_t}}\mathbf x_t-\frac{\sigma_t^2}{\tilde{\sigma}^2_{t-1}\sqrt{(1-\sigma_t^2)(1-(1-\sigma_1^2)…(1-\sigma_t^2))}}\boldsymbol{\varepsilon}\]
Hence:
\[\boldsymbol{\mu}_{t-1}=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\boldsymbol{\varepsilon}\right)\]
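These posterior parameters can be double-checked numerically by adding the likelihood and prior precisions directly (numpy sketch; the schedule and the chosen step are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
sig2 = np.linspace(0.01, 0.2, T)    # toy variance schedule
abar = np.cumprod(1 - sig2)         # abar[j] = (1-sigma_1^2)...(1-sigma_{j+1}^2)

j = 7                               # python index for step t = j + 1
x0 = 1.3
eps = rng.standard_normal()
xt = np.sqrt(abar[j]) * x0 + np.sqrt(1 - abar[j]) * eps

# Posterior p(x_{t-1}|x_0,x_t): add the precisions of likelihood and prior,
# and precision-weight their means.
prec = (1 - sig2[j]) / sig2[j] + 1 / (1 - abar[j-1])
mean = (np.sqrt(1 - sig2[j]) / sig2[j] * xt
        + np.sqrt(abar[j-1]) / (1 - abar[j-1]) * x0) / prec

# Closed forms derived above.
var_closed = (1 - abar[j-1]) / (1 - abar[j]) * sig2[j]
mean_closed = (xt - sig2[j] / np.sqrt(1 - abar[j]) * eps) / np.sqrt(1 - sig2[j])

assert np.isclose(1 / prec, var_closed)
assert np.isclose(mean, mean_closed)
```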
Although by design the forward Markov chain has Gaussian transition kernels \(p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\), the reverse transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)\) is in general non-Gaussian. However, a theorem of Anderson on reverse-time SDEs guarantees that, in the continuum limit of many small steps, the reverse transitions do become Gaussian, and so it is not a bad approximation to simply sample \(\mathbf x_{t-1}\) from the Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t,\mathbf x_0)\) as a proxy for sampling from the true, non-Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t)\). Reparameterized, this means that inference looks like unannealed Langevin dynamics (ULD):
\[\mathbf x_{t-1}=\boldsymbol{\mu}_{t-1}+\tilde{\sigma}_{t-1}\tilde{\boldsymbol{\varepsilon}}\]
\[=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\boldsymbol{\varepsilon}\right)+\sqrt{\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\sigma_t\tilde{\boldsymbol{\varepsilon}}\]
The only problem here is that the Gaussian noise \(\boldsymbol{\varepsilon}\) that was injected to get from \(\mathbf x_0\mapsto\mathbf x_t\) is unknown.
This is where training comes in. Training consists of three independent sampling steps, followed by assembling the samples:
- Sample \(\mathbf x_0\)
- Sample \(t\in\{1,…,T\}\)
- Sample \(\boldsymbol{\varepsilon}\)
- Hence, compute \(\mathbf x_t=\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\)
Architecturally, the diffusion model itself is then a ResNet-like noise predictor \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) (e.g. a U-Net or a transformer, etc.) that seeks to predict the sampled Gaussian noise \(\boldsymbol{\varepsilon}\) that was added to \(\mathbf x_0\) to obtain \(\mathbf x_t\). This is enforced by the MSE loss function \(L(\hat{\boldsymbol{\varepsilon}},\boldsymbol{\varepsilon})=|\hat{\boldsymbol{\varepsilon}}-\boldsymbol{\varepsilon}|^2/2\) with corresponding cost function over the training set:
\[C_{\text{tr}}(\boldsymbol{\theta})=\frac{1}{N_{\text{tr}}}\sum_{i=1}^{N_{\text{tr}}}L(\hat{\boldsymbol{\varepsilon}}(\mathbf x_{t_i},t_i|\boldsymbol{\theta}),\boldsymbol{\varepsilon}_i)\]
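The training step above can be sketched as follows (a minimal numpy sketch; `eps_hat` is a hypothetical stand-in for the learned network \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\), here just returning zeros):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
sig2 = np.linspace(1e-4, 0.02, T)           # toy variance schedule
abar = np.cumprod(1 - sig2)                 # abar[t-1] = (1-sigma_1^2)...(1-sigma_t^2)

def eps_hat(x_t, t):
    # Hypothetical stand-in for the learned noise predictor.
    return np.zeros_like(x_t)

def training_loss(x0):
    t = int(rng.integers(1, T + 1))         # sample t uniformly from {1,...,T}
    eps = rng.standard_normal(x0.shape)     # sample isotropic Gaussian noise
    x_t = np.sqrt(abar[t-1]) * x0 + np.sqrt(1 - abar[t-1]) * eps  # O(1) jump
    return 0.5 * np.sum((eps_hat(x_t, t) - eps) ** 2)             # MSE loss

x0 = rng.standard_normal(4)                 # one toy "data" sample
loss = training_loss(x0)
assert loss >= 0.0
```

In practice the loss is averaged over minibatches of \((\mathbf x_0,t,\boldsymbol{\varepsilon})\) triples and minimized by stochastic gradient descent on \(\boldsymbol{\theta}\).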
The upshot is that the inference loop Markov chain dynamics are governed (for \(t=T,T-1,…,2\)) by:
\[\mathbf x_{t-1}=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\right)+\sqrt{\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\sigma_t\tilde{\boldsymbol{\varepsilon}}\]
and for \(t=1\), the final generated sample \(\mathbf x_0=\frac{1}{\sqrt{1-\sigma^2_1}}\left(\mathbf x_1-\sigma_1\hat{\boldsymbol{\varepsilon}}(\mathbf x_1,1|\boldsymbol{\theta})\right)\) should not have any noise added to it.
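The full inference loop can be sketched in numpy (1D). Assume a toy data distribution concentrated at \(\mathbf x_0=0\), for which the ideal noise predictor is known in closed form, \(\hat\varepsilon(x_t,t)=x_t/\sqrt{1-(1-\sigma_1^2)\cdots(1-\sigma_t^2)}\); the loop then recovers \(x_0=0\) (up to floating-point error) regardless of the injected noise:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
sig2 = np.linspace(1e-4, 0.05, T)    # toy variance schedule
abar = np.cumprod(1 - sig2)          # abar[t-1] = (1-sigma_1^2)...(1-sigma_t^2)

def eps_hat(x_t, t):
    # Oracle predictor for data concentrated at x0 = 0:
    # then x_t = sqrt(1 - abar_t) * eps, so eps is exactly recoverable from x_t.
    return x_t / np.sqrt(1 - abar[t-1])

x = rng.standard_normal()            # sample the noisy latent x_T
for t in range(T, 1, -1):            # t = T, T-1, ..., 2: denoise with fresh noise
    mu = (x - sig2[t-1] / np.sqrt(1 - abar[t-1]) * eps_hat(x, t)) / np.sqrt(1 - sig2[t-1])
    var = (1 - abar[t-2]) / (1 - abar[t-1]) * sig2[t-1]
    x = mu + np.sqrt(var) * rng.standard_normal()

# Final step t = 1: no noise is added.
x0 = (x - sig2[0] / np.sqrt(1 - abar[0]) * eps_hat(x, 1)) / np.sqrt(1 - sig2[0])
assert abs(x0) < 1e-6                # recovers the point mass at 0
```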
Problem: In DDPMs, explain how estimating the noise \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) is synonymous with estimating the score vector field of the data distribution \(\partial\ln p(\mathbf x_t)/\partial\mathbf x_t\).
Solution: Because \(\mathbf x_t\) is normally distributed about \(\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0\) with isotropic variance \(1-(1-\sigma_1^2)…(1-\sigma^2_t)\), Tweedie’s formula asserts that:
\[\langle\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0|\mathbf x_t\rangle=\mathbf x_t+(1-(1-\sigma_1^2)…(1-\sigma^2_t))\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}\]
Comparing this with \(\mathbf x_t=\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\), one concludes that:
\[\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}=-\frac{\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})}{\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\]
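This identity can be checked in the exactly solvable case where the data itself is standard normal, \(p(\mathbf x_0)=\mathcal N(0,1)\): then every marginal is \(p(\mathbf x_t)=\mathcal N(0,1)\) with score \(-x_t\), and the posterior-mean noise predictor is the linear regression \(E[\varepsilon|x_t]=\sqrt{1-\bar\alpha_t}\,x_t\), writing \(\bar\alpha_t=(1-\sigma_1^2)\cdots(1-\sigma_t^2)\) (a minimal numpy sketch, Monte Carlo with a generous tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
sig2 = np.linspace(1e-4, 0.02, T)   # toy variance schedule
abar = np.cumprod(1 - sig2)
t = 150
a = abar[t-1]                       # abar_t

# Data p(x0) = N(0,1): then x_t = sqrt(a) x0 + sqrt(1-a) eps is N(0,1) as well,
# and since (eps, x_t) is jointly Gaussian, E[eps|x_t] is linear in x_t.
N = 200_000
x0 = rng.standard_normal(N)
eps = rng.standard_normal(N)
x_t = np.sqrt(a) * x0 + np.sqrt(1 - a) * eps

slope = np.mean(eps * x_t) / np.mean(x_t * x_t)   # empirical regression coefficient
assert abs(slope - np.sqrt(1 - a)) < 0.02          # E[eps|x_t] = sqrt(1-a) x_t

# Score of the marginal N(0,1) is -x_t, so -eps_hat/sqrt(1-a) should have slope -1.
assert abs(-slope / np.sqrt(1 - a) - (-1.0)) < 0.03
```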
Problem: Explain how the DDPM architecture can be reformulated in terms of a stochastic differential equation, and (following work of Song et al.) deduce its distributionally equivalent probability flow ODE.
Solution: Recall the DDPM forward Markov chain transition kernel (rewritten using \(i\in\{0,…,T\}\) instead of \(t\) because in a moment \(t\in [0,T]\) will be reserved for a continuous variable):
\[\mathbf x_i=\sqrt{1-\sigma_i^2}\mathbf x_{i-1}+\sigma_i\boldsymbol{\varepsilon}\]
Taking the continuous time limit of this discrete difference equation amounts to letting \(T\to\infty\) while \(\sigma^2_i\to 0\) in such a way that their product \(T\sigma^2_i:=\sigma^2(t)\) is fixed (cf. the dipole limit in electrostatics which takes \(q\to\infty\) and \(\Delta\mathbf x\to\mathbf 0\) such that \(\boldsymbol{\pi}:=q\Delta\mathbf x\) is fixed). Hence, writing \(dt:=1/T\):
\[\mathbf x_i=\sqrt{1-\sigma^2(t)dt}\mathbf x_{i-1}+\sigma(t)\sqrt{dt}\boldsymbol{\varepsilon}\]
So after binomial expanding, isolating \(d\mathbf x:=\mathbf x_i-\mathbf x_{i-1}\), and recognizing the Wiener process \(d\mathbf w:=\sqrt{dt}\boldsymbol{\varepsilon}\):
\[d\mathbf x=-\frac{\sigma^2(t)}{2}\mathbf xdt+\sigma(t)d\mathbf w\]
This particular SDE is also an instance of an Ornstein-Uhlenbeck process, but is special in that it is also variance-preserving. The corresponding Fokker-Planck probability current density is \(-\frac{\sigma^2(t)}{2}\left(p(\mathbf x,t)\mathbf x+\frac{\partial p}{\partial\mathbf x}\right):=p(\mathbf x,t)\mathbf v_{\text{eff}}(\mathbf x,t)\) which is distributionally equivalent to the ODE \(d\mathbf x=\mathbf v_{\text{eff}}dt\) with effective velocity field:
\[\mathbf v_{\text{eff}}(\mathbf x,t):=-\frac{\sigma^2(t)}{2}\left(\mathbf x+\frac{\partial \ln p}{\partial\mathbf x}\right)\]
Hence, the SDE admits a corresponding probability flow ODE \(d\mathbf x/dt=\mathbf v_{\text{eff}}(\mathbf x,t)\) which can be integrated during inference using standard numerical algorithms (e.g. Euler, Runge-Kutta); recall the score vector field \(\partial\ln p/\partial\mathbf x\) is what the diffusion model estimates via \(\hat{\boldsymbol{\epsilon}}(\mathbf x,t|\boldsymbol{\theta})\).
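A minimal numpy sketch of integrating the probability flow ODE, assuming a constant \(\sigma^2(t)\) and a toy data distribution concentrated at the origin, so that the exact score \(-\mathbf x/(1-e^{-\sigma^2 t})\) plays the role of the learned estimate (the analytic flow contracts like \(\sqrt{1-e^{-\sigma^2 t}}\), which the Euler trajectory should track):

```python
import numpy as np

# Constant sigma(t)^2 = s2; data = point mass at 0, so the marginal at time t
# is N(0, 1 - abar(t)) with abar(t) = exp(-s2 * t), giving an exact score.
s2, T = 10.0, 1.0

def v_eff(x, t):
    abar = np.exp(-s2 * t)
    score = -x / (1 - abar)
    return -0.5 * s2 * (x + score)       # probability flow velocity

# Euler-integrate backwards from t = T to t_min (the score is singular at t = 0).
t_min, n_steps = 1e-3, 20_000
dt = (T - t_min) / n_steps
x, t = 1.0, T
for _ in range(n_steps):
    x -= v_eff(x, t) * dt                # step backward in time
    t -= dt

# The deterministic flow should have contracted x = 1.0 towards the origin
# by roughly sqrt(1 - exp(-s2 * t_min)) ~ 0.1.
assert 0.01 < abs(x) < 0.2
```

Unlike the stochastic inference loop, this trajectory is deterministic, which is what makes higher-order ODE solvers and few-step sampling applicable.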
Problem: Hence, in light of the above discussion, explain and motivate flow matching models, and describe the simplest instance of it (i.e. rectified flow).
Solution: Roughly speaking, diffusion models are trained to stochastically map signal to noise. At inference, one is therefore obliged to denoise along these stochastic trajectories in order to map from noise back to signal. But this is a bit like shooting oneself in the foot, unnecessarily making one’s life difficult with no ostensible gain. Flow matching models are (loosely speaking) what one gets after cutting through the fluff of diffusion models with Occam’s razor, replacing SDEs with ODEs (indeed, this is w.l.o.g. thanks to the probability flow ODE construction).
Intuitively, the simplest possible map \(\mathbf x_0\mapsto\mathbf x_1\) one could engineer is a simple linear interpolation for \(t\in [0,1]\):
\[\mathbf x_t=t\mathbf x_1+(1-t)\mathbf x_0\]
(aside: in the flow matching literature, the convention about whether \(\mathbf x_0\) represents signal vs. noise is sometimes reversed compared with the diffusion literature, but here it is being assumed that \(\mathbf x_0\) represents signal as in diffusion models). The relevant velocity field to be learned is thus \(d\mathbf x_t/dt=\mathbf x_1-\mathbf x_0\). A flow-matching model thus no longer seeks to estimate a score vector field, but rather a velocity vector field \(\hat{\mathbf v}(\mathbf x,t|\boldsymbol{\theta})\) via an MSE loss function \(L(\hat{\mathbf v},\mathbf v):=|\hat{\mathbf v}-\mathbf v|^2/2\) and training cost function:
\[C_{\text{tr}}(\boldsymbol{\theta})=\frac{1}{N_{\text{tr}}}\sum_{i=1}^{N_{\text{tr}}}L(\hat{\mathbf v}(\mathbf x_{t_i},t_i|\boldsymbol{\theta}),\mathbf x_1-\mathbf x_0)\]
A key advantage of learning such a simple rectified flow (whose straight-line conditional paths are also known in the flow matching literature as the optimal transport path) is that inference is very cheap, and can be achieved with very few steps since one just has to step along a nearly straight line.
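A minimal numpy sketch of why straight flows need so few inference steps: for a single toy pair, replace the learned \(\hat{\mathbf v}\) by the oracle velocity \(\mathbf x_1-\mathbf x_0\); since the velocity is constant along the path, one backward Euler step from the noise endpoint recovers the signal exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)          # toy signal sample
x1 = rng.standard_normal(4)          # toy noise sample

# Along the interpolation x_t = t*x1 + (1-t)*x0, the true velocity is constant:
v = x1 - x0                          # dx_t/dt

# Inference: integrate dx/dt = v from t = 1 (noise) down to t = 0 (signal).
# Because the path is a straight line, a single Euler step is exact.
x = x1 - 1.0 * v
assert np.allclose(x, x0)
```

A trained \(\hat{\mathbf v}(\mathbf x,t|\boldsymbol{\theta})\) averages over many such pairs, so its flow lines are only approximately straight, but the same few-step Euler integration remains accurate in practice.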





