Magnetism

Problem: Define the \(2\) words in the phrase “ideal paramagnet”. Show that a classical ideal paramagnet of \(N\) spins each with the same fixed magnetic dipole moment \(\mu:=|\boldsymbol{\mu}|\) placed in a uniform external magnetic field \(B:=|\mathbf B|\) will develop a uniform non-zero induced average magnetization \(\langle M\rangle=n\mu L(\beta\mu B)\) along the direction of \(\mathbf B\) where the Langevin function \(L(x):=\coth(x)-1/x\).

Solution: Ideal means the \(N\) spins are non-interacting with respect to each other (hence there’s no need to specify the precise geometric configuration of the \(N\) spins such as whether they’re on a lattice \(\Lambda\), etc.). Paramagnet imposes conditions on the first \(2\) Taylor expansion coefficients of \(M(B)\) about \(B=0\):

  1. Non-magnetic/absence of spontaneous magnetization \(\langle M(B=0)\rangle=0\).
  2. Positive zero-field magnetic susceptibility \(\chi_{\mu}/\mu_0:=(\partial \langle M\rangle/\partial B)_{B=0}>0\).

Classically, for a single classical spin, the state space is \((\theta,\phi)\in S^2\) and the \(\phi\)-independent Hamiltonian is \(H(\theta)=-\boldsymbol{\mu}\cdot\mathbf B=-\mu B\cos\theta\), so the single-spin partition function is:

\[Z_1=\int_{-1}^1d\cos\theta\int_0^{2\pi}d\phi\,e^{-\beta H(\theta)}=4\pi\operatorname{sinhc}(\beta\mu B)\]

where \(\operatorname{sinhc}(x):=\sinh(x)/x\).

Thus, \(F_1=-k_BT\ln Z_1\) and \(\langle\mu_x\rangle=\langle\mu_y\rangle=0\) whereas:

\[\langle\mu_z\rangle=\int_{-1}^1d\cos\theta\int_0^{2\pi}d\phi\frac{e^{\beta\mu B\cos\theta}}{Z_1}\mu\cos\theta=\frac{\partial \ln Z_1}{\partial(\beta B)}=-\frac{\partial F_1}{\partial B}=\mu L(\beta\mu B)\]

and hence the result follows from \(\langle M\rangle=n\langle\mu_z\rangle\) where \(n:=N/V\).

Problem: Show that in the high-temperature limit where \(x:=\beta\mu B\ll 1\), the Taylor expansion of the Langevin function about \(x=0\) is \(L(x)=x/3+O_{x\to 0}(x^3)\). Hence, derive Curie’s high-temperature ideal paramagnet law \(\chi_{\mu}=C/T\) for the zero-field magnetic susceptibility and state the value of the Curie constant \(C>0\).

Solution: One has:

\[L(x)=\coth x-\frac{1}{x}=\frac{\cosh x}{\sinh x}-\frac{1}{x}\approx\frac{1+x^2/2+…}{x+x^3/6+…}-\frac{1}{x}\]

\[=\frac{1}{x}\left(\frac{1+x^2/2}{1+x^2/6}-1\right)\approx\frac{1}{x}\left((1+x^2/2)(1-x^2/6)-1\right)=x/3+O_{x\to 0}(x^3)\]

Thus, it is straightforward to compute the Curie constant \(C=\mu_0 n\mu^2/3k_B\) for a classical ideal paramagnet.
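Both the Langevin function and its small-\(x\) slope of \(1/3\) are easy to check numerically (a minimal Python sketch; the series branch at tiny \(x\) is there only to avoid catastrophic cancellation in \(\coth x-1/x\)):

```python
import math

def langevin(x):
    # L(x) = coth(x) - 1/x; for tiny x use the series x/3 - x^3/45 + ...
    if abs(x) < 1e-4:
        return x / 3 - x**3 / 45
    return 1 / math.tanh(x) - 1 / x

# High-temperature limit: L(x) ~ x/3, which is what feeds into Curie's law
# chi = mu0 n mu^2 / (3 kB T) = C / T.
x = 1e-3
assert abs(langevin(x) - x / 3) < 1e-9
```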

Problem: Repeat the above analysis but for a quantum ideal paramagnet in which all \(N\) spins have the same fixed total angular momentum quantum number \(j\in\{0,1/2,1,…\}\).

Solution: Now, for a single quantum spin with state vector in the Hilbert space \(\mathbf C^{2j+1}\), the spectrum of its Hamiltonian is given by the weak-field Zeeman splitting \(E_{|j,m_j\rangle}=g_jm_j\mu_BB\) for Landé \(g\)-factor \(g_j=1+\frac{j(j+1)+s(s+1)-\ell(\ell+1)}{2j(j+1)}\) and the canonical partition function is a Dirichlet kernel:

\[Z_1=\sum_{m_j=-j}^je^{-\beta E_{|j,m_j\rangle}}=\frac{\sinh\left((j+1/2)g_j\beta\mu_BB\right)}{\sinh\left(g_j\beta\mu_BB/2\right)}\]

Repeating the same steps as above, this time one finds:

\[\langle M\rangle=ng_jj\mu_B B_j(\beta g_jj\mu_BB)\]

where the Brillouin function is defined by:

\[B_j(x):=\left(1+\frac{1}{2j}\right)\coth\left(\left(1+\frac{1}{2j}\right)x\right)-\frac{1}{2j}\coth\left(\frac{x}{2j}\right)\]

In particular, as \(j\to\infty\), \(2j+1\to\infty\) and one recovers the classical continuous angle \(\theta\in [0,\pi]\) and \(\lim_{j\to\infty}B_j(x)=L(x)\). This is consistent with the Taylor expansion \(B_j(x)=\frac{j+1}{3j}x+O_{x\to 0}(x^3)\) which leads to the quantum Curie constant \(C=\mu_0ng_j^2j(j+1)\mu_B^2/3k_B\). Instead of looking at \(j\to\infty\), one can also take the quantum limit \(j=s=1/2\) and \(\ell=0\), in which case \(g_j=2\) and:

\[B_{1/2}(x)=2\coth 2x-\coth x=\tanh x\]

so one recovers the familiar \(2\)-level system average magnetization \(\langle M\rangle=n\mu_B\tanh\beta\mu_BB\) with Curie constant \(C=\mu_0n\mu_B^2/k_B\).
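Both limits of the Brillouin function (the \(j=1/2\) hyperbolic-tangent identity and the classical \(j\to\infty\) Langevin limit) can be verified numerically from the definitions above (a quick Python sketch):

```python
import math

def brillouin(j, x):
    # B_j(x) = (1 + 1/2j) coth((1 + 1/2j) x) - (1/2j) coth(x/2j)
    a = 1 + 1 / (2 * j)
    return a / math.tanh(a * x) - (1 / (2 * j)) / math.tanh(x / (2 * j))

def langevin(x):
    return 1 / math.tanh(x) - 1 / x

x = 0.7
assert abs(brillouin(0.5, x) - math.tanh(x)) < 1e-12   # B_{1/2} = tanh
assert abs(brillouin(1e6, x) - langevin(x)) < 1e-4     # B_j -> L as j -> infinity
```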

Problem: Explain why classical statistical mechanics (when applied consistently!) predicts \(\langle\mathbf M\rangle=\mathbf 0\) irrespective of what \(\mathbf B\) is (this is called the Bohr-van Leeuwen theorem). Since the Langevin derivation used a classical stat mech approach yet was able to predict a nontrivial \(\mathbf M(\mathbf B)\), explain why it doesn’t violate the BvL theorem.

Solution: The BvL theorem follows mathematically from the fact that the canonical \(N\)-particle partition function:

\[Z(\beta,\mathbf B)=\frac{1}{h^{3N}N!}\int d^3\mathbf x_1…d^3\mathbf x_Nd^3\mathbf p_1…d^3\mathbf p_Ne^{-\beta H}\]

with \(H=\sum_{i=1}^N\frac{|\mathbf p_i-q_i\mathbf A(\mathbf x_i)|^2}{2m_i}+V(\mathbf x_1,…,\mathbf x_N)\) can instead (via change of variables) be integrated over the kinetic momenta \(m_i\mathbf v_i:=\mathbf p_i-q_i\mathbf A(\mathbf x_i)\) rather than the canonical momenta \(\mathbf p_i\) without incurring a Jacobian penalty \(\partial (m_1\mathbf v_1,…,m_N\mathbf v_N)/\partial(\mathbf p_1,…,\mathbf p_N)=1\), so \(Z(\beta,\mathbf B)=Z(\beta,\mathbf B=\mathbf 0)\) is \(\mathbf B\)-independent and hence \(F=-k_BT\ln Z\) is also \(\mathbf B\)-independent, leading to the BvL theorem \(\langle\mathbf M\rangle=-V^{-1}\partial F/\partial\mathbf B=\mathbf 0\).

By assuming that one could speak of a “fixed magnetic dipole moment \(\mu\)” for all the atoms, Langevin was implicitly quantizing the system since in hindsight \(\mu=g_jj\mu_B\) and \(j\) is quantized and fixed (indeed, making the replacement \(\mu\mapsto g_jj\mu_B\) in the Langevin magnetization maps in the \(j\to\infty\) limit directly onto the Brillouin magnetization). As a result, it is perhaps wise to regard the Langevin result for \(\langle M\rangle\) as a semi-classical formula rather than strictly belonging to classical physics (otherwise it would violate the BvL theorem!).

Posted in Blog | Leave a comment

Physics-Informed Neural Networks

Problem: Train a physics-informed neural network (PINN) on both the van der Pol oscillator and the drift-free Fokker-Planck diffusion equation.

Solution:

report

Diffusion & Flow-Matching Models

Problem: State and prove Tweedie’s formula.

Solution: Tweedie’s formula asserts that if \(p(\mathbf x|\boldsymbol{\mu},\sigma)=\frac{1}{\det(\sqrt{2\pi}\sigma)}e^{-(\mathbf x-\boldsymbol{\mu})^T\sigma^{-2}(\mathbf x-\boldsymbol{\mu})/2}\) is normally distributed, then without needing to know anything about the prior \(p(\boldsymbol{\mu}|\sigma)\) on the mean random vector \(\boldsymbol{\mu}\), one has the following Bayesian point estimate for it:

\[\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle=\mathbf x+\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\]

At first glance, one might think that \(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle\approx\mathbf x\), but Tweedie’s formula provides the empirical Bayes correction \(\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\) to the naive estimate. The proof amounts to a brute force computation of the score vector field:

\[\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\frac{\partial p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}p(\boldsymbol{\mu}|\sigma)\frac{\partial p(\mathbf x|\boldsymbol{\mu},\sigma)}{\partial\mathbf x}\]

\[=-\frac{\sigma^{-2}}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}p(\boldsymbol{\mu}|\sigma)(\mathbf x-\boldsymbol{\mu})p(\mathbf x|\boldsymbol{\mu},\sigma)\]

\[=-\frac{\sigma^{-2}}{p(\mathbf x|\sigma)}\left(\mathbf xp(\mathbf x|\sigma)-\int d\boldsymbol{\mu}\boldsymbol{\mu}p(\mathbf x|\sigma)p(\boldsymbol{\mu}|\mathbf x,\sigma)\right)\]

\[=\sigma^{-2}(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle-\mathbf x)\]
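For intuition, one can also verify Tweedie's formula in the \(1\)-dimensional conjugate case where everything is available in closed form: assuming a Gaussian prior \(\mu\sim\mathcal N(m_0,\tau^2)\) and likelihood \(\mathcal N(\mu,\sigma^2)\), the marginal is \(p(x)=\mathcal N(m_0,\tau^2+\sigma^2)\), so the score is \(-(x-m_0)/(\tau^2+\sigma^2)\) and the Tweedie estimate must reproduce the exact posterior mean (a small Python check):

```python
# Conjugate-Gaussian check of Tweedie's formula (values are arbitrary test inputs):
# prior mu ~ N(m0, tau2), likelihood x | mu ~ N(mu, s2).
m0, tau2, s2, x = 1.5, 0.8, 0.3, 2.7
score = -(x - m0) / (tau2 + s2)            # d ln p(x) / dx for the Gaussian marginal
tweedie = x + s2 * score                   # empirical Bayes correction to the naive x
posterior_mean = (tau2 * x + s2 * m0) / (tau2 + s2)  # exact conjugate posterior mean
assert abs(tweedie - posterior_mean) < 1e-12
```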

Problem: Show that the set of Gaussians is closed under both convolution and multiplication.

Solution: The convolution of two normalized Gaussians is also a normalized Gaussian representing the probability distribution of the sum of the two independent normal random variables (i.e. it is a kind of orthonormal bilinear transformation):

\[p(\mathbf x|\boldsymbol{\mu}_1,\sigma_1)*p(\mathbf x|\boldsymbol{\mu}_2,\sigma_2)=p(\mathbf x|\boldsymbol{\mu}_{1*2},\sigma_{1*2})\]

where \(\boldsymbol{\mu}_{1*2}=\boldsymbol{\mu}_1+\boldsymbol{\mu}_2\) and \(\sigma^2_{1*2}=\sigma^2_1+\sigma^2_2\) (thus, addition of independent normal random variables is isomorphic to vector addition in \((\boldsymbol{\mu},\sigma^2)\)-space).

The product of two normalized Gaussians is an (unnormalized) Gaussian whose normalization constant itself has a Gaussian form:

\[p(\mathbf x|\boldsymbol{\mu}_1,\sigma_1)p(\mathbf x|\boldsymbol{\mu}_2,\sigma_2)=p(\boldsymbol{\mu}_1|\boldsymbol{\mu}_2,\sigma_{1*2})p(\mathbf x|\boldsymbol{\mu}_{12},\sigma_{12})\]

where \(\sigma^{-2}_{12}\boldsymbol{\mu}_{12}=\sigma^{-2}_1\boldsymbol{\mu}_1+\sigma^{-2}_2\boldsymbol{\mu}_2\) and \(\sigma^{-2}_{12}=\sigma^{-2}_1+\sigma^{-2}_2\).
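A quick numerical check of the product rule in \(1\) dimension (Python; `npdf` is just the normal density written out, and the parameter values are arbitrary test inputs):

```python
import math

def npdf(x, mu, var):
    """1D normal density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu1, v1, mu2, v2 = 0.3, 0.5, -1.1, 2.0
v12 = 1.0 / (1.0 / v1 + 1.0 / v2)          # precisions add: sigma_12^{-2} = sigma_1^{-2} + sigma_2^{-2}
mu12 = v12 * (mu1 / v1 + mu2 / v2)         # precision-weighted mean
for x in (-2.0, 0.0, 1.5):
    lhs = npdf(x, mu1, v1) * npdf(x, mu2, v2)
    rhs = npdf(mu1, mu2, v1 + v2) * npdf(x, mu12, v12)   # Gaussian prefactor times Gaussian
    assert abs(lhs - rhs) < 1e-12
```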

(aside: the properties above are of course tied by the fact that Gaussians are (roughly speaking) eigenfunctions of the Fourier transform:

\[\int d\mathbf xe^{-i\mathbf k\cdot\mathbf x}p(\mathbf x|\boldsymbol{\mu},\sigma)=e^{-i\boldsymbol{\mu}\cdot\mathbf k-\mathbf k^T\sigma^{2}\mathbf k/2}\]

which intertwines convolution and multiplication).

Problem: What is the fundamental problem that diffusion models solve?

Solution: Given some prior data distribution \(p(\mathbf x_0)\) (e.g. the distribution of images or audio), and given some latent \(\mathbf x_T\) which does not lie in the support of \(p\) in the sense that \(p(\mathbf x_T)\approx 0\), a diffusion model is any (possibly non-deterministic) algorithm for transporting \(\mathbf x_T\) onto the submanifold of latents \(\mathbf x_0\) where \(p(\mathbf x_0)>0\). Roughly speaking, a diffusion model is any parametric ansatz for the score vector field \(\partial\ln p(\mathbf x_0)/\partial\mathbf x_0\) of the data distribution, so one can get from \(\mathbf x_T\mapsto\mathbf x_0\) by taking steps along the score vector field. Diffusion models are commonly used as generative models in the sense that the latent \(\mathbf x_0\) represents a novel sample from \(p(\mathbf x_0)\), though other non-generative applications (e.g. reconstruction) do exist.

Problem: Explain the problem that denoising diffusion probabilistic models (DDPMs) solve, and contrast how they are used during inference vs. how they are trained.

Solution: DDPMs are generative diffusion models. At inference time, one begins by sampling some noisy latent \(\mathbf x_T\) where \(p(\mathbf x_T)\approx 0\), then iteratively denoises it through the sequence \(\mathbf x_T,…,\mathbf x_0\) towards some latent \(\mathbf x_0\) on the data submanifold where \(p(\mathbf x_0)>0\), thereby generating the sample \(\mathbf x_0\). More explicitly, the sequence \(\mathbf x_T,…,\mathbf x_0\) will turn out to be a discrete-time Markov chain, and the operation of “denoising” \(\mathbf x_t\mapsto\mathbf x_{t-1}\) will really just amount to sampling from the Markov chain’s transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)=\int d\mathbf x_0p(\mathbf x_0|\mathbf x_t)p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\). As it is, this is intractable, but focus on that second term in the integrand:

\[p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\propto p(\mathbf x_t|\mathbf x_{t-1},\mathbf x_0)p(\mathbf x_{t-1}|\mathbf x_0)\]

Both of these are essentially transition kernels for the forward Markov chain \(\mathbf x_0,…,\mathbf x_T\) (indeed, by the Markov property of the forward chain, \(p(\mathbf x_t|\mathbf x_{t-1},\mathbf x_0)=p(\mathbf x_t|\mathbf x_{t-1})\)). At this point, the DDPM paper decides to use a Gaussian forward transition kernel:

\[p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\]

where \(0<\sigma^2_1<\sigma^2_2<…<\sigma^2_T\ll 1\) are just \(T\) fixed hyperparameters (called a variance schedule, where the horizon \(T\) is also a hyperparameter). Equivalently, using the reparameterization trick, adding Gaussian noise looks like:

\[\mathbf x_t=\sqrt{1-\sigma_t^2}\mathbf x_{t-1}+\sigma_t\boldsymbol{\varepsilon}\]

where \(\boldsymbol{\varepsilon}\) is drawn from an isotropic standard normal \(p(\boldsymbol{\varepsilon})\sim e^{-|\boldsymbol{\varepsilon}|^2/2}\). The \(\sqrt{1-\sigma_t^2}\) factor is used to scale down the previous state \(\mathbf x_{t-1}\) in order to prevent the variance (i.e. diagonal entries of the covariance matrix) of \(\mathbf x_t\) from exploding (and indeed, would be exactly variance-preserving iff the data distribution covariance \(\sigma^2_{\mathbf x_0}=1\)); cf. Brownian motion of a particle attached to the origin by a spring.

Rather than iteratively stepping from \(\mathbf x_0\) to \(\mathbf x_t\) in \(O(t)\) samples, one can jump in \(O(1)\) time from \(\mathbf x_0\mapsto\mathbf x_t\) via just a single \(\boldsymbol{\varepsilon}\)-sample:

\[\mathbf x_t=\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\]

Or equivalently:

\[p(\mathbf x_t|\mathbf x_0)\sim e^{-|\mathbf x_t-\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0|^2/2(1-(1-\sigma^2_1)…(1-\sigma_t^2))}\]

Thus, the “oracle posterior” \(p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\sim e^{-|\mathbf x_{t-1}-\boldsymbol{\mu}_{t-1}|^2/2\tilde{\sigma}^2_{t-1}}\) is an isotropic Gaussian with:

\[\tilde{\sigma}^{-2}_{t-1}=\frac{1-\sigma_t^2}{\sigma_t^2}+\frac{1}{1-(1-\sigma_1^2)…(1-\sigma^2_{t-1})}\Rightarrow \tilde{\sigma}^2_{t-1}=\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\sigma^2_t\approx\sigma^2_t\]

\[\tilde{\sigma}^{-2}_{t-1}\boldsymbol{\mu}_{t-1}=\frac{1-\sigma_t^2}{\sigma_t^2}\frac{\mathbf x_t}{\sqrt{1-\sigma_t^2}}+\frac{\sqrt{(1-\sigma_1^2)…(1-\sigma^2_{t-1})}}{1-(1-\sigma^2_1)…(1-\sigma^2_{t-1})}\mathbf x_0\]

\[=\frac{\sqrt{1-\sigma_t^2}}{\sigma_t^2}\mathbf x_t+\frac{\sqrt{(1-\sigma_1^2)…(1-\sigma^2_{t-1})}}{1-(1-\sigma^2_1)…(1-\sigma^2_{t-1})}\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\right)\]

\[=\frac{1}{\tilde{\sigma}^2_{t-1}\sqrt{1-\sigma^2_t}}\mathbf x_t-\frac{\sigma_t^2}{\tilde{\sigma}^2_{t-1}\sqrt{(1-\sigma_t^2)(1-(1-\sigma_1^2)…(1-\sigma_t^2))}}\boldsymbol{\varepsilon}\]

Hence:

\[\boldsymbol{\mu}_{t-1}=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\boldsymbol{\varepsilon}\right)\]

Although by design the forward Markov chain has Gaussian transition kernels \(p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\), the reverse transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)\) is in general non-Gaussian. However, a theorem of Anderson guarantees that it is in fact Gaussian at the level of stochastic differential equations, and so it is not a bad approximation to simply sample \(\mathbf x_{t-1}\) from the Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t,\mathbf x_0)\) as a proxy for sampling from the true, non-Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t)\). Reparameterized, this means that inference looks like unannealed Langevin dynamics (ULD):

\[\mathbf x_{t-1}=\boldsymbol{\mu}_{t-1}+\tilde{\sigma}_{t-1}\tilde{\boldsymbol{\varepsilon}}\]

\[=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\boldsymbol{\varepsilon}\right)+\sqrt{\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\sigma_t\tilde{\boldsymbol{\varepsilon}}\]

The only problem here is that the Gaussian noise \(\boldsymbol{\varepsilon}\) that was injected to get from \(\mathbf x_0\mapsto\mathbf x_t\) is unknown.

This is where training comes in. Training consists of \(3\) distinct sampling steps and assembling the samples:

  1. Sample \(\mathbf x_0\)
  2. Sample \(t\in\{1,…,T\}\)
  3. Sample \(\boldsymbol{\varepsilon}\)
  4. Hence, compute \(\mathbf x_t=\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\)

Architecturally, the diffusion model itself is then a ResNet-like noise predictor \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) (e.g. a U-Net or a transformer, etc.) that seeks to predict the sampled Gaussian noise \(\boldsymbol{\varepsilon}\) that was added to \(\mathbf x_0\) to obtain \(\mathbf x_t\). This is enforced by the MSE loss function \(L(\hat{\boldsymbol{\varepsilon}},\boldsymbol{\varepsilon})=|\hat{\boldsymbol{\varepsilon}}-\boldsymbol{\varepsilon}|^2/2\) with corresponding cost function over the training set:

\[C_{\text{tr}}(\boldsymbol{\theta})=\frac{1}{N_{\text{tr}}}\sum_{i=1}^{N_{\text{tr}}}L(\hat{\boldsymbol{\varepsilon}}(\mathbf x_{t_i},t_i|\boldsymbol{\theta}),\boldsymbol{\varepsilon}_i)\]
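The \(4\) training steps above can be condensed into a few lines (a toy Python/NumPy sketch; the placeholder `eps_hat` stands in for the actual trained network, and the linear variance schedule is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
sigma2 = np.linspace(1e-4, 2e-2, T)       # fixed variance schedule sigma_t^2 (hyperparameters)
alpha_bar = np.cumprod(1.0 - sigma2)      # (1 - sigma_1^2)...(1 - sigma_t^2)

def ddpm_training_example(x0, eps_hat):
    """Sample (t, eps), jump x0 -> x_t in O(1), and return the per-example MSE loss."""
    t = int(rng.integers(1, T + 1))       # t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return 0.5 * np.sum((eps_hat(x_t, t) - eps) ** 2)

x0 = rng.standard_normal(8)
# a dummy "network" predicting zero noise, just to exercise the pipeline
loss = ddpm_training_example(x0, eps_hat=lambda x, t: np.zeros_like(x))
assert loss >= 0.0
```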

The upshot is that the inference loop Markov chain dynamics are governed (for \(t=T,T-1,…,2\)) by:

\[\mathbf x_{t-1}=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\right)+\sqrt{\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\sigma_t\tilde{\boldsymbol{\varepsilon}}\]

and for \(t=1\), the final generated sample \(\mathbf x_0=\frac{1}{\sqrt{1-\sigma^2_1}}\left(\mathbf x_1-\sigma_1\hat{\boldsymbol{\varepsilon}}(\mathbf x_1,1|\boldsymbol{\theta})\right)\) should not have any noise added to it.
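Putting the inference loop together (a toy Python/NumPy sketch matching the two update equations above; `eps_hat` is again a stand-in for the trained noise predictor):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
sigma2 = np.linspace(1e-4, 2e-2, T)       # same illustrative variance schedule as in training
alpha_bar = np.cumprod(1.0 - sigma2)

def ddpm_sample(eps_hat, dim):
    """Ancestral sampling x_T -> ... -> x_0 with a trained noise predictor eps_hat."""
    x = rng.standard_normal(dim)                      # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        ab_t, s2_t = alpha_bar[t - 1], sigma2[t - 1]
        mu = (x - s2_t / np.sqrt(1.0 - ab_t) * eps_hat(x, t)) / np.sqrt(1.0 - s2_t)
        if t > 1:
            var = (1.0 - alpha_bar[t - 2]) / (1.0 - ab_t) * s2_t   # tilde sigma_{t-1}^2
            x = mu + np.sqrt(var) * rng.standard_normal(dim)
        else:
            x = mu                                    # no noise added on the final step
    return x

x0 = ddpm_sample(lambda x, t: np.zeros_like(x), dim=4)
assert x0.shape == (4,)
```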

Problem: In DDPMs, explain how estimating the noise \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) is synonymous with estimating the score vector field of the data distribution \(\partial\ln p(\mathbf x_t)/\partial\mathbf x_t\).

Solution: Because \(\mathbf x_t\) is normally distributed about \(\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0\) with covariance \(1-(1-\sigma_1^2)…(1-\sigma^2_t)\), Tweedie’s formula asserts that:

\[\langle\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0|\mathbf x_t\rangle=\mathbf x_t+(1-(1-\sigma_1^2)…(1-\sigma^2_t))\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}\]

Comparing this with \(\mathbf x_t=\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\), one concludes that:

\[\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}=-\frac{\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})}{\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\]

Problem: Explain how the DDPM architecture can be reformulated in terms of a stochastic differential equation, and (following work of Song et al.) deduce its distributionally equivalent probability flow ODE.

Solution: Recall the DDPM forward Markov chain transition kernel (rewritten using \(i\in\{0,…,T\}\) instead of \(t\) because in a moment \(t\in [0,T]\) will be reserved for a continuous variable).

\[\mathbf x_i=\sqrt{1-\sigma_i^2}\mathbf x_{i-1}+\sigma_i\boldsymbol{\varepsilon}\]

Taking the continuous time limit of this discrete difference equation amounts to letting \(T\to\infty\) while \(\sigma^2_i\to 0\) in such a way that their product \(T\sigma^2_i:=\sigma^2(t)\) is fixed (cf. the dipole limit in electrostatics which takes \(q\to\infty\) and \(\Delta\mathbf x\to\mathbf 0\) such that \(\boldsymbol{\pi}:=q\Delta\mathbf x\) is fixed). Hence, writing \(dt:=1/T\):

\[\mathbf x_i=\sqrt{1-\sigma^2(t)dt}\mathbf x_{i-1}+\sigma(t)\sqrt{dt}\boldsymbol{\varepsilon}\]

So after binomial expanding, isolating \(d\mathbf x:=\mathbf x_i-\mathbf x_{i-1}\), and recognizing the Wiener process \(d\mathbf w:=\sqrt{dt}\boldsymbol{\varepsilon}\):

\[d\mathbf x=-\frac{\sigma^2(t)}{2}\mathbf xdt+\sigma(t)d\mathbf w\]

This particular SDE is also an instance of an Ornstein-Uhlenbeck process, but is special in that it is also variance-preserving. The corresponding Fokker-Planck probability current density is \(-\frac{\sigma^2(t)}{2}\left(p(\mathbf x,t)\mathbf x+\frac{\partial p}{\partial\mathbf x}\right):=p(\mathbf x,t)\mathbf v_{\text{eff}}(\mathbf x,t)\) which is distributionally equivalent to the ODE \(d\mathbf x=\mathbf v_{\text{eff}}dt\) with effective velocity field:

\[\mathbf v_{\text{eff}}(\mathbf x,t):=-\frac{\sigma^2(t)}{2}\left(\mathbf x+\frac{\partial \ln p}{\partial\mathbf x}\right)\]

Hence, the SDE admits a corresponding probability flow ODE \(d\mathbf x/dt=\mathbf v_{\text{eff}}(\mathbf x,t)\) which can be integrated during inference using standard numerical algorithms (e.g. Euler, Runge-Kutta); recall the score vector field \(\partial\ln p/\partial\mathbf x\) is what the diffusion model estimates via \(\hat{\boldsymbol{\varepsilon}}(\mathbf x,t|\boldsymbol{\theta})\).
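A minimal Euler integrator for the probability flow ODE (Python sketch; as a sanity check, note that for unit-variance Gaussian data the OU process is already at its stationary distribution, the score is \(-\mathbf x\), and \(\mathbf v_{\text{eff}}\) vanishes identically, so the flow leaves \(\mathbf x\) fixed):

```python
import numpy as np

def probability_flow_euler(x, score, sigma, t_grid):
    """Explicit Euler integration of dx/dt = -(sigma(t)^2 / 2) (x + score(x, t))."""
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0                                   # negative when integrating backward in t
        x = x + dt * (-(sigma(t0) ** 2) / 2.0) * (x + score(x, t0))
    return x

# Stationary sanity check: data ~ N(0, 1) gives score(x) = -x, so v_eff = 0.
x = np.array([0.3, -1.2])
out = probability_flow_euler(x, score=lambda x, t: -x, sigma=lambda t: 1.0,
                             t_grid=np.linspace(1.0, 0.0, 11))
assert np.allclose(out, x)
```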

Problem: Hence, in light of the above discussion, explain and motivate flow matching models, and describe the simplest instance of it (i.e. rectified flow).

Solution: Roughly speaking, diffusion models are trained to stochastically map signal to noise. At inference, one is therefore obliged to denoise along these stochastic trajectories in order to map from noise back to signal. But this is a bit like shooting oneself in the foot, unnecessarily making one’s life difficult with no ostensible gain. Flow matching models are (loosely speaking) what one gets after cutting through the fluff of diffusion models with Occam’s razor, replacing SDEs with ODEs (indeed, this is w.l.o.g. thanks to the probability flow ODE construction).

Intuitively, the simplest possible map \(\mathbf x_0\mapsto\mathbf x_1\) one could engineer is a simple linear interpolation for \(t\in [0,1]\):

\[\mathbf x_t=t\mathbf x_1+(1-t)\mathbf x_0\]

(aside: in the flow matching literature, the convention about whether \(\mathbf x_0\) represents signal vs. noise is sometimes reversed compared with the diffusion literature, but here it is being assumed that \(\mathbf x_0\) represents signal as in diffusion models). The relevant velocity field to be learned is thus \(d\mathbf x_t/dt=\mathbf x_1-\mathbf x_0\). A flow-matching model thus no longer seeks to estimate a score vector field, but rather a velocity vector field \(\hat{\mathbf v}(\mathbf x,t|\boldsymbol{\theta})\) via an MSE loss function \(L(\hat{\mathbf v},\mathbf v):=|\hat{\mathbf v}-\mathbf v|^2/2\) and training cost function:

\[C_{\text{tr}}(\boldsymbol{\theta})=\frac{1}{N_{\text{tr}}}\sum_{i=1}^{N_{\text{tr}}}L(\hat{\mathbf v}(\mathbf x_{t_i},t_i|\boldsymbol{\theta}),\mathbf x_{1,i}-\mathbf x_{0,i})\]

A key advantage of learning such a simple rectified flow (whose straight-line interpolation paths are sometimes called conditional optimal-transport paths in the literature) is that inference is very easy, and can be achieved with very few steps since one just has to step along a straight line.
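The rectified-flow training loss and Euler sampler can be sketched in a few lines (toy Python/NumPy; with the exact constant velocity field \(\mathbf x_1-\mathbf x_0\) standing in for the learned \(\hat{\mathbf v}\), Euler integration is exact and recovers \(\mathbf x_0\)):

```python
import numpy as np

rng = np.random.default_rng(2)

def flow_matching_loss(v_hat, x0, x1):
    """Rectified-flow target: along x_t = t x1 + (1 - t) x0 the velocity is x1 - x0."""
    t = rng.uniform()
    x_t = t * x1 + (1.0 - t) * x0
    return 0.5 * np.sum((v_hat(x_t, t) - (x1 - x0)) ** 2)

def euler_sample(v_hat, x1, steps=8):
    """Inference: integrate dx/dt = v_hat from noise (t = 1) back to data (t = 0)."""
    x, dt = x1, 1.0 / steps
    for k in range(steps):
        x = x - dt * v_hat(x, 1.0 - k * dt)
    return x

x0, x1 = rng.standard_normal(3), rng.standard_normal(3)
v_exact = lambda x, t: x1 - x0          # the exact (here constant) velocity field
assert np.allclose(euler_sample(v_exact, x1), x0)
assert flow_matching_loss(v_exact, x0, x1) == 0.0
```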


Information Geometry

Problem: Let \(\boldsymbol{\Theta}\) be a smooth statistical manifold, and let \(D:\boldsymbol{\Theta}^2\to [0,\infty)\) be a smooth function. What does it mean for \((\boldsymbol{\Theta},D)\) to be a “divergence manifold”?

Solution: The notion of a divergence manifold relaxes the axioms of a metric space, specifically still demanding \(D(\boldsymbol{\theta} || \boldsymbol{\theta}’)\geq 0\) and \(D(\boldsymbol{\theta} || \boldsymbol{\theta}’)=0\Leftrightarrow\boldsymbol{\theta}’=\boldsymbol{\theta}\) for all \(\boldsymbol{\theta},\boldsymbol{\theta}’\in\boldsymbol{\Theta}\) but no longer enforcing symmetry \(D(\boldsymbol{\theta} || \boldsymbol{\theta}’)=D(\boldsymbol{\theta}’ || \boldsymbol{\theta})\) or the triangle inequality \(D(\boldsymbol{\theta} || \boldsymbol{\theta}^{\prime\prime})\leq D(\boldsymbol{\theta} || \boldsymbol{\theta}’) + D(\boldsymbol{\theta}’ || \boldsymbol{\theta}^{\prime\prime})\).

Problem: Explain how any divergence manifold \((\boldsymbol{\Theta},D)\) enjoys the free gift of being automatically equipped with a canonical Riemannian metric tensor field \(g_D(\boldsymbol{\theta}):T_{\boldsymbol{\theta}}(\boldsymbol{\Theta})^2\to\mathbf R\) induced by the divergence function \(D\).

Solution: The so-called Fisher information metric \(g_D(\boldsymbol{\theta})=(g_D)_{ij}(\boldsymbol{\theta})d\theta_i\otimes d\theta_j\) on the statistical manifold \(\boldsymbol{\Theta}\) induced by the divergence \(D\) is basically just its Hessian:

\[(g_D)_{ij}(\boldsymbol{\theta}):=\left(\frac{\partial^2 D(\boldsymbol{\theta}||\boldsymbol{\theta}’)}{\partial\theta’_i\partial\theta’_j}\right)_{\boldsymbol{\theta}’=\boldsymbol{\theta}}\]

Intuitively, one can think of the divergence \(D(\boldsymbol{\theta}||\boldsymbol{\theta}’)\) of \(\boldsymbol{\theta}’\in\boldsymbol{\Theta}\) from \(\boldsymbol{\theta}\in\boldsymbol{\Theta}\) as holding the “ground truth” distribution \(\boldsymbol{\theta}\) fixed while sniffing around for a proxy distribution \(\boldsymbol{\theta}’\) to approximate \(\boldsymbol{\theta}\). By the axioms of a divergence manifold, the global minimum value \(D(\boldsymbol{\theta}||\boldsymbol{\theta}’)=0\) is attained at \(\boldsymbol{\theta}’=\boldsymbol{\theta}\), so Taylor expanding about this global minimum (\((\partial D(\boldsymbol{\theta}||\boldsymbol{\theta}’)/\partial\boldsymbol{\theta}’)_{\boldsymbol{\theta}’=\boldsymbol{\theta}}=\mathbf 0\)), one has the local quadratic form:

\[D(\boldsymbol{\theta}||\boldsymbol{\theta}+d\boldsymbol{\theta})\approx\frac{1}{2}(g_D)_{ij}(\boldsymbol{\theta})d\theta_i d\theta_j\]
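As a concrete check of this Hessian construction (Python, finite differences): for a \(1\)-dimensional Gaussian family with fixed variance \(s^2\), one has \(D_{\text{KL}}(m||m')=(m-m')^2/2s^2\), whose Hessian at \(m'=m\) is the familiar Fisher information \(1/s^2\):

```python
# Finite-difference Hessian of D_KL(m || m') = (m - m')^2 / (2 s^2) in m' at m' = m.
m, s, h = 0.4, 1.7, 1e-4   # arbitrary test point, fixed std, step size

def kl(m1, m2):
    return (m1 - m2) ** 2 / (2.0 * s ** 2)

hessian = (kl(m, m + h) - 2.0 * kl(m, m) + kl(m, m - h)) / h ** 2
assert abs(hessian - 1.0 / s ** 2) < 1e-6   # matches the Fisher information 1/s^2
```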

Problem: Let \(f:(0,\infty)\to\mathbf R\) be a nonlinear convex function with a zero at \(f(1)=0\). Define the family of \(f\)-divergences \(D_f:\Theta^2\to [0,\infty)\), prove that they do indeed satisfy the axioms of a divergence manifold, and exhibit some examples of functions \(f\) and the corresponding \(f\)-divergence \(D_f\).

Solution: One has:

\[D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x p(\mathbf x|\boldsymbol{\theta}’)f\left(\frac{p(\mathbf x|\boldsymbol{\theta})}{p(\mathbf x|\boldsymbol{\theta}’)}\right)\]

By rewriting this as an expectation \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)=\langle f\left(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)\right)\rangle_{\mathbf x\sim p(\mathbf x|\boldsymbol{\theta}’)}\), applying Jensen’s inequality, using \(\int d\mathbf x p(\mathbf x|\boldsymbol{\theta})=1\) and finally using \(f(1)=0\), one establishes non-negativity \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)\geq 0\). The same \(f(1)=0\) condition also ensures \(\boldsymbol{\theta}=\boldsymbol{\theta}’\Rightarrow D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)=0\), and the converse is proven by recalling that Jensen’s inequality becomes an equality iff \(f\) is linear (forbidden by hypothesis) or its argument \(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)=\text{constant}\). But since the expectation of this constant in \(\mathbf x\sim p(\mathbf x|\boldsymbol{\theta}’)\) was \(1\), the constant itself must be \(1\), Q.E.D.

  • \(f(x):=x\ln x\) generates the asymmetric Kullback-Leibler (KL) divergence \[D_{\text{KL}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x p(\mathbf x|\boldsymbol{\theta})\ln\frac{p(\mathbf x|\boldsymbol{\theta})}{p(\mathbf x|\boldsymbol{\theta}’)}\]
  • \(f(x):=-\ln x\) generates the dual KL divergence \(D_{\text{KL}}(\boldsymbol{\theta}’||\boldsymbol{\theta})\) (in general, \(D_{xf(1/x)}(\boldsymbol{\theta}||\boldsymbol{\theta}’)=D_{f(x)}(\boldsymbol{\theta}’||\boldsymbol{\theta})\))
  • \(f(x):=|x-1|/2\) generates the total variation divergence \[d_{\text{TV}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{1}{2}\int d\mathbf x|p(\mathbf x|\boldsymbol{\theta})-p(\mathbf x|\boldsymbol{\theta}’)|\] (indeed, TVD is actually symmetric and obeys the triangle inequality so is a metric in the metric space sense, yet does not admit a Fisher information metric due to the non-differentiability of the absolute value).
  • \(f(x):=(\sqrt{x}-1)^2/2\) generates the symmetric squared Hellinger divergence \[d^2_H(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{1}{2}\int d\mathbf x\left(\sqrt{p(\mathbf x|\boldsymbol{\theta})}-\sqrt{p(\mathbf x|\boldsymbol{\theta}’)}\right)^2=1-\int d\mathbf x\sqrt{p(\mathbf x|\boldsymbol{\theta})p(\mathbf x|\boldsymbol{\theta}’)}\]
  • \(f(x):=(x-1)^2\) generates the Pearson \(\chi^2\) divergence \[D_{\chi^2}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x\frac{(p(\mathbf x|\boldsymbol{\theta})-p(\mathbf x|\boldsymbol{\theta}’))^2}{p(\mathbf x|\boldsymbol{\theta}’)}\] and \(f(x)=(x-1)^2/x\) generates the dual divergence (called the Neyman \(\chi^2\) divergence) in which one replaces \(p(\mathbf x|\boldsymbol{\theta}’)\mapsto p(\mathbf x|\boldsymbol{\theta})\) in the denominator.
  • \(f(x):=\frac{1}{2}\left(x\ln x-(x+1)\ln\frac{x+1}{2}\right)\) generates the symmetric Jensen-Shannon divergence \[d^2_{\text{JS}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{D_{\text{KL}}(p_{\boldsymbol{\theta}}||\frac{p_{\boldsymbol{\theta}}+p_{\boldsymbol{\theta}’}}{2})+D_{\text{KL}}(p_{\boldsymbol{\theta}’}||\frac{p_{\boldsymbol{\theta}}+p_{\boldsymbol{\theta}’}}{2})}{2}\]
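Several of the claims above (the dual relation \(D_{xf(1/x)}(\boldsymbol{\theta}||\boldsymbol{\theta}')=D_{f(x)}(\boldsymbol{\theta}'||\boldsymbol{\theta})\), the symmetry of TVD) can be checked numerically on discrete distributions directly from the definition \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}')=\sum_x q\,f(p/q)\) (Python sketch with arbitrary strictly positive \(p,q\)):

```python
import math

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)) for strictly positive discrete p, q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

kl = f_divergence(lambda x: x * math.log(x), p, q)         # D_KL(p || q)
kl_dual = f_divergence(lambda x: -math.log(x), p, q)        # should equal D_KL(q || p)
tv = f_divergence(lambda x: abs(x - 1) / 2, p, q)           # total variation

assert abs(kl - sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))) < 1e-12
assert abs(kl_dual - f_divergence(lambda x: x * math.log(x), q, p)) < 1e-12
assert abs(tv - f_divergence(lambda x: abs(x - 1) / 2, q, p)) < 1e-12  # TVD is symmetric
```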

Problem: What is a fundamental limitation in the family of \(f\)-divergences \(D_f\)? How do integral probability metrics such as the Wasserstein (a.k.a. earth-mover’s) distance address this shortcoming?

Solution: An \(f\)-divergence only ever compares the \(2\) distributions through their pointwise likelihood ratio \(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)\), so it is blind to the geometry of the underlying sample space: if the \(2\) distributions have disjoint (or nearly disjoint) supports, then \(D_f\) saturates (e.g. \(D_{\text{KL}}=\infty\) and \(d_{\text{TV}}=1\)) regardless of how near or far apart those supports actually are. Integral probability metrics address this by comparing distributions via test functions (or transport costs) defined on the sample space itself. In particular, the order-\(n\) Wasserstein distance minimizes a transport cost over all couplings \(\gamma\) of the \(2\) distributions:

\[W_{n}(p,p’)=\min_{\gamma}\left(\int d\gamma(\mathbf x,\mathbf x’)|\mathbf x-\mathbf x’|^n\right)^{1/n}\]

where the joint probability \(d\gamma(\mathbf x,\mathbf x’)=d\mathbf xd\mathbf x’\gamma(\mathbf x,\mathbf x’)\) is constrained to have marginals \(\int d\mathbf x’\gamma(\mathbf x,\mathbf x’)=p(\mathbf x)\) and \(\int d\mathbf x\gamma(\mathbf x,\mathbf x’)=p’(\mathbf x’)\).

In \(1\) dimension, the optimal coupling is the monotone (quantile) coupling, giving the closed form \(W_n(p,p’)=\left(\int_0^1du\,|F^{-1}(u)-F’^{-1}(u)|^n\right)^{1/n}\) where \(F,F’\) are the respective cumulative distribution functions.
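For two empirical distributions with equally many samples, the quantile coupling just matches sorted samples (Python sketch; note how translating a distribution by \(c\) costs exactly \(|c|\), even for disjoint supports where an \(f\)-divergence would have saturated):

```python
import numpy as np

def wasserstein_1d(x, y, n=1):
    """W_n between two equal-size empirical distributions in 1D: the optimal
    coupling is the monotone (quantile) one, i.e. match sorted samples."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** n)) ** (1.0 / n)

# Translating a distribution by 10 moves it a Wasserstein distance of exactly 10.
x = np.array([0.0, 1.0, 2.0])
assert np.isclose(wasserstein_1d(x, x + 10.0), 10.0)
```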

Problem: (something about Bregman divergence, comment on how KL div is the only simultaneous f and Bregman divergence)

Solution:


Autoencoders

VAE

Graph Neural Networks

Problem: Give a broad sketch of the current state of the field of research in graph neural networks.

Solution:

Problem: Okay, so now explain what a graph neural network (GNN) actually is.

Solution: A GNN is basically any neural network whose input is any (undirected/directed/mixed/multi) graph (e.g. molecules, social networks, citation networks, etc.). In order to be sensible, the GNN output (whatever it is, e.g. a binary classifier for molecule toxicity) has to genuinely be an intrinsic function of the graph structure alone, and in particular not depend on any arbitrary choice of “ordering” with which one might index the graph vertices and edges (i.e. the output must be either permutation equivariant or permutation invariant depending on its nature, cf. tensors vs. tensor components in some basis).

Problem: Explain the subclass of GNNs known as message-passing neural networks (MPNNs).

Solution: An MPNN, being a certain category of GNNs, starts its life by taking as input some graph \((V,E)\). More precisely, this looks like some feature vector \(\mathbf x_v\) (e.g. mass, charge, atomic number for atoms) for each vertex \(v\in V\) and possibly also a feature vector \(\mathbf x_e\) (e.g. bond length, bond energy, etc.) for each edge \(e\in E\). The idea is that, in a manner similar to a head of self-attention, each vertex \(v\in V\) wants to update its current state \(\mathbf x_v\) into some new state \(\mathbf x’_v\) by soaking in context from its neighbours, and in an analogous manner each edge \(e\in E\) also wants to update its current state \(\mathbf x_e\mapsto\mathbf x’_e\) based on its “neighbours” (thus it’s not quite the same as self-attention in which a token doesn’t just look at its nearest neighbour tokens, but at all the tokens in the context). In the general MPNN framework, this can be roughly broken down into \(3\) conceptual steps:

  1. Message phase: from a sender perspective, each vertex \(v\in V\) “broadcasts” a “personalized” message vector \(\mathbf m_{vv'}\) along the edge \((v,v')\in E\) connecting it to a neighbouring vertex \(v'\in V\). This message vector \(\mathbf m_{vv'}\) is any (learnable) function of its current state \(\mathbf x_v\), the current state of the receiving neighbour vertex \(\mathbf x_{v'}\), and the current edge feature vector \(\mathbf x_{vv'}\) connecting them.
  2. Aggregation phase: simultaneously, from a receiver perspective, each vertex \(v\in V\) receives the broadcasted signals from its neighbouring vertices. From this perspective, it then takes all the received message vectors and synthesizes them into a single “message summary” vector \(\mathbf m_v\) which in practice is any permutation invariant function of the message vectors \(\mathbf m_{v'v}\) it received from neighbouring vertices \(v'\in V\) (e.g. their average).
  3. Update phase: Finally, the vertex \(v\in V\) updates its own current state \(\mathbf x_v\) to some new state \(\mathbf x'_v\) using another (learnable) function of its current state \(\mathbf x_v\) and the message summary \(\mathbf m_v\).

This \(3\)-step process represents a single forward pass through \(1\) message-passing layer; several composed together define an MPNN.
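As a concrete illustration, here is a minimal numpy sketch of one such message-passing layer on a toy \(4\)-cycle graph, together with a check of the permutation equivariance demanded of any sensible GNN. The tanh maps and weight matrices `W_msg`, `W_upd` are illustrative stand-ins for the learnable message/update functions, not any particular library’s API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: a 4-cycle, stored as an adjacency list.
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
d = 3                                    # feature dimension
X = rng.normal(size=(4, d))              # vertex feature vectors x_v
W_msg = rng.normal(size=(2 * d, d))      # message-function weights (illustrative)
W_upd = rng.normal(size=(2 * d, d))      # update-function weights (illustrative)

def mpnn_layer(X, neighbours):
    X_new = np.empty_like(X)
    for v in range(len(X)):
        # 1. Message phase: m_{v'v} computed from the pair (x_{v'}, x_v).
        msgs = [np.tanh(np.concatenate([X[u], X[v]]) @ W_msg) for u in neighbours[v]]
        # 2. Aggregation phase: permutation-invariant mean of received messages.
        m_v = np.mean(msgs, axis=0)
        # 3. Update phase: combine own state with the message summary.
        X_new[v] = np.tanh(np.concatenate([X[v], m_v]) @ W_upd)
    return X_new

out = mpnn_layer(X, neighbours)

# Permutation equivariance: relabel the vertices and verify the
# output rows are relabelled in exactly the same way.
p = [2, 0, 3, 1]                          # p[v] = new label of old vertex v
Xp = np.empty_like(X)
Xp[p] = X
nb_p = {p[v]: [p[u] for u in neighbours[v]] for v in neighbours}
out_p = mpnn_layer(Xp, nb_p)
print(np.allclose(out_p[p], out))         # True
```

Because the aggregation is a mean (order-independent), relabelling the vertices merely permutes the output rows, exactly the equivariance property described above.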

Problem: Now that the general framework of MPNN architectures has been defined, walk through the following specific examples of MPNN architectures:

  • Graph convolutional networks (GCNs)
  • Graph attention networks (GATs)

Solution: A GCN layer is an MPNN layer with an especially rigid choice of the \(3\) phases: the message a vertex \(v'\) sends to its neighbour \(v\) is just its own state linearly transformed by a single shared learnable weight matrix \(W\), aggregation is a fixed degree-normalized sum over the neighbourhood (usually including \(v\) itself via an added self-loop), and the update simply applies a pointwise nonlinearity \(\sigma\); schematically, \(\mathbf x'_v=\sigma\left(\sum_{v'\in N(v)\cup\{v\}}\frac{W\mathbf x_{v'}}{\sqrt{\deg v\deg v'}}\right)\). A GAT layer instead makes the aggregation learnable in the spirit of a head of self-attention restricted to the graph’s edges: a learnable scoring function assigns each pair \((\mathbf x_v,\mathbf x_{v'})\) an attention coefficient, these coefficients are softmax-normalized over the neighbourhood \(N(v)\), and \(\mathbf x'_v\) is the resulting attention-weighted sum of the (linearly transformed) neighbour states (in practice several such attention heads are run in parallel and concatenated).
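A minimal numpy sketch of a GCN-style layer, assuming the common symmetric degree normalization \(D^{-1/2}(A+I)D^{-1/2}\) with self-loops (the weight matrix and tanh nonlinearity are illustrative choices, not a canonical implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def gcn_layer(A, X, W):
    """One GCN-style layer: degree-normalized neighbourhood average, then a shared linear map."""
    A_hat = A + np.eye(len(A))                                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} (A+I) D^{-1/2}
    return np.tanh(A_norm @ X @ W)

A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])          # adjacency matrix of a 4-cycle
X = rng.normal(size=(4, 3))               # vertex features
W = rng.normal(size=(3, 3))               # shared learnable weights (illustrative)

out = gcn_layer(A, X, W)

# Permutation equivariance: relabelling vertices with a permutation matrix P
# permutes the output rows identically.
P = np.eye(4)[[2, 0, 3, 1]]
print(np.allclose(gcn_layer(P @ A @ P.T, P @ X, W), P @ out))  # True
```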

Posted in Blog | Leave a comment

Renormalization Group

Problem: Consider a Landau-Ginzburg statistical field theory involving a single real scalar field \(\phi(\mathbf x)\) for \(\mathbf x\in\mathbf R^d\) governed by the canonically normalized free energy density:

\[\mathcal F(\phi,\partial\phi/\partial\mathbf x,…)=\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2+\frac{\phi^2}{2\xi^2}+…\]

Explain what the \(+…\) means, explain which terms have temperature \(T\)-dependence, and explain for which such terms that \(T\)-dependence actually matters.

Solution: The \(+…\) includes any terms (each with their own coupling constants) compatible with the golden trinity of constraints: locality, analyticity, and symmetry (e.g. a quartic \(g\phi^4\) coupling). The part of the free energy density \(\mathcal F\) before the \(+…\) should be compared to the Lagrangian density \(\mathcal L\) of Klein-Gordon field theory:

\[\mathcal L=\frac{1}{2c^2}\left(\frac{\partial\phi}{\partial t}\right)^2-\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2-\frac{\phi^2}{2\bar{\lambda}^2}\]

with \(\bar{\lambda}=\hbar/mc\) the reduced Compton wavelength playing a role analogous to the correlation length \(\xi\sim 1/\sqrt{|T-T_c|}\); indeed, this \(T\)-dependence in \(\xi=\xi(T)\) is (usually) the only \(T\)-dependence that matters, even though generically all the other coupling constants will also have \(T\)-dependence.

Problem: Define the (non-standard) notion of “theory space”.

Solution: Roughly speaking, “theory space” is the space of all Landau-Ginzburg statistical field theories \((\mathcal F,k^*)\) (notice it is defined not only by the free energy density \(\mathcal F\) but also the baggage of the UV cutoff \(k^*\); it is an effective field theory). The \(\mathcal F\) part can equivalently be parameterized as a countably \(\infty\)-tuple \((\xi,g,…)\) of the LAS-permitted coupling constants in \(\mathcal F\).

Problem: In broad strokes, describe the sequence of \(3\) steps that comprise a \(\mathbf k\)-space \(\zeta\)-renormalization semigroup transformation from one effective Landau-Ginzburg statistical field theory \((\mathcal F,k^*)\mapsto (\mathcal F_{\zeta},k^*)\) to another with the same UV cutoff \(k^*\) but a new Wilsonian effective free energy \(\mathcal F_{\zeta}\).

Solution: For \(\zeta\in [1,\infty)\), the corresponding \(\mathbf k\)-space \(\zeta\)-RG transformation of \((\mathcal F,k^*)\) is given by the \(3\)-step recipe:

  1. Coarse-graining \(k^*\mapsto k^*/\zeta\) (blocking in real space/integrating out shells in momentum space)
  2. Rescaling \(\mathbf k':=\zeta\mathbf k\) to recover the original UV cutoff \(k^*/\zeta\mapsto k^*\) (this leads to a reciprocal “zooming out” of space \(\mathbf x\mapsto\mathbf x/\zeta\)).
  3. Rescaling fields \(\phi':=\zeta^{\Delta}\phi\) to make \(\mathcal F_{\zeta}\) canonically normalized with respect to \(\mathcal F\).

Problem: Consider an effective Landau-Ginzburg statistical field theory of a single real scalar field \(\phi(\mathbf x)\in\mathbf R\) with \(\mathbf x\in\mathbf R^d\) whose Fourier transform \(\phi_{\mathbf k}=\int d^d\mathbf x e^{-i\mathbf k\cdot\mathbf x}\phi(\mathbf x)\) is supported only on a ball of radius \(k^*\) (the theory’s UV cutoff). The free energy density corresponds to a free (no pun intended) field:

\[\mathcal F(\phi,\partial\phi/\partial\mathbf x)=\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2+\frac{\phi^2}{2\xi^2}\]

Perform a \(\mathbf k\)-space \(\zeta\)-renormalization of this theory to find the corresponding Wilsonian effective free energy density \(\mathcal F_{\zeta}\).

Solution: Work with the free energy \(F=\int d^d\mathbf x\mathcal F\) itself instead of just its density \(\mathcal F\):

\[F[\phi]=\frac{1}{2}\int_{|\mathbf k|<k^*}\frac{d^d\mathbf k}{(2\pi)^d}\left(|\mathbf k|^2+\frac{1}{\xi^2}\right)|\phi_{\mathbf k}|^2\]

  1. Partition the support \(|\mathbf k|<k^*\) of \(\phi_{\mathbf k}\) into \(|\mathbf k|<k^*/\zeta\) and \(k^*/\zeta<|\mathbf k|<k^*\), and decompose \(\phi_{\mathbf k}=\phi^{<}_{\mathbf k}+\phi^{>}_{\mathbf k}\) piecewise according to this partition. Then one has an instance of the freshman’s dream (thanks to the disjoint supports of \(\phi^{<}_{\mathbf k}\) and \(\phi^{>}_{\mathbf k}\)):

\[|\phi_{\mathbf k}|^2=|\phi^{<}_{\mathbf k}+\phi^{>}_{\mathbf k}|^2=|\phi^{<}_{\mathbf k}|^2+|\phi^{>}_{\mathbf k}|^2\]

So:

\[F[\phi]=\frac{1}{2}\int_{|\mathbf k|<k^*/\zeta}\frac{d^d\mathbf k}{(2\pi)^d}\left(|\mathbf k|^2+\frac{1}{\xi^2}\right)|\phi^{<}_{\mathbf k}|^2+\frac{1}{2}\int_{k^*/\zeta<|\mathbf k|<k^*}\frac{d^d\mathbf k}{(2\pi)^d}\left(|\mathbf k|^2+\frac{1}{\xi^2}\right)|\phi^{>}_{\mathbf k}|^2\]

\[=F[\phi^<_{\zeta}]+F[\phi^>_{\zeta}]\]

In this case the partition function factorizes:

\[Z=\int\mathcal D\phi e^{-\beta F[\phi]}=\int\mathcal D\phi^{<}_{\zeta} e^{-\beta F[\phi^{<}_{\zeta}]}\int\mathcal D\phi^{>}_{\zeta}e^{-\beta F[\phi^{>}_{\zeta}]}=Z^{>}_{\zeta}\int\mathcal D\phi^{<}_{\zeta} e^{-\beta F[\phi^{<}_{\zeta}]}\]

where the measures are \(\mathcal D\phi^{<}_{\zeta}=\prod_{|\mathbf k|<k^*/\zeta}d\phi^{<}_{\mathbf k}\) and \(\mathcal D\phi^{>}_{\zeta}=\prod_{k^*/\zeta<|\mathbf k|<k^*}d\phi^{>}_{\mathbf k}\). The constant \(Z^>_{\zeta}\) doesn’t affect the physics, being absorbed as a constant shift into the Wilsonian effective free energy \(F_{\zeta}[\phi^{<}_{\zeta}]=F[\phi^{<}_{\zeta}]-k_BT\ln Z^{>}_{\zeta}\).

  2. Rescaling \(\mathbf k':=\zeta\mathbf k\), the Wilsonian effective free energy becomes:

\[F_{\zeta}[\phi^{<}_{\zeta}]=\frac{1}{2}\int_{|\mathbf k'|<k^*}\frac{d^d\mathbf k'}{(2\pi)^d}\zeta^{-d}\left(\zeta^{-2}|\mathbf k'|^2+\frac{1}{\xi^2}\right)|\phi^{<}_{\mathbf k'/\zeta}|^2\]

  3. Rescaling \(\phi^{<\prime}_{\mathbf k'}:=\zeta^{\Delta}\phi^{<}_{\mathbf k'/\zeta}\), the Wilsonian effective free energy becomes:

\[F_{\zeta}[\phi^{<\prime}_{\zeta}]=\frac{1}{2}\int_{|\mathbf k'|<k^*}\frac{d^d\mathbf k'}{(2\pi)^d}\zeta^{-d}\left(\zeta^{-2}|\mathbf k'|^2+\frac{1}{\xi^2}\right)\zeta^{-2\Delta}|\phi^{<\prime}_{\mathbf k'}|^2\]

so in order to canonically normalize the gradient term, one requires \(\Delta=-(d+2)/2\). This leads to the desired Wilsonian effective free energy density:

\[\mathcal F_{\zeta}(\phi^{<\prime}_{\zeta},\partial\phi^{<\prime}_{\zeta}/\partial\mathbf x)=\frac{1}{2}\biggr|\frac{\partial\phi^{<\prime}_{\zeta}}{\partial\mathbf x}\biggr|^2+\zeta^2\frac{(\phi^{<\prime}_{\zeta})^2}{2\xi^2}\]

Thus, by construction the gradient coupling is marginal (i.e. \(\zeta\)-independent) while the quadratic coupling is relevant because \(\zeta^2\to\infty\) as \(\zeta\to\infty\).
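The exponent bookkeeping in the last two steps can be checked mechanically. The helper below (names are illustrative) recomputes the \(\zeta\)-exponents of the gradient and quadratic couplings exactly, using rational arithmetic:

```python
from fractions import Fraction

def quadratic_coupling_exponents(d):
    """zeta-exponents picked up by the two quadratic couplings in one RG step.

    Bookkeeping from the derivation: each term gets zeta**(-d) from the measure
    d^d k' and zeta**(-2*Delta) from phi' = zeta**Delta * phi; the gradient term
    gets an extra zeta**(-2) from |k|^2 = zeta**(-2) |k'|^2.
    """
    Delta = Fraction(-(d + 2), 2)       # chosen to canonically normalize the gradient term
    gradient = -d - 2 - 2 * Delta       # exponent on the |k'|^2 |phi'|^2 term
    mass = -d - 2 * Delta               # exponent on the |phi'|^2 / xi^2 term
    return gradient, mass

for d in (1, 2, 3, 4):
    print(d, quadratic_coupling_exponents(d))  # always (0, 2): marginal and relevant
```

In every dimension \(d\) the gradient exponent vanishes (marginal by construction) while the quadratic coupling grows as \(\zeta^2\) (relevant).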


Convolutional Neural Networks

CNNs_Part_1

Hamilton’s Optics-Mechanics Analogy

Problem: Deduce the Hamilton-Jacobi equation of classical mechanics.

Solution: Instead of viewing the action \(S=S[\mathbf x(t)]\) as a functional of the particle’s trajectory \(\mathbf x(t)\), it can be viewed more simply as a scalar field \(S(\mathbf x,t)\) in which the initial point in spacetime \((t_0,\mathbf x_0)\) is fixed and one simply takes the on-shell trajectory from \((t_0,\mathbf x_0)\) to \((t,\mathbf x)\). Varying the endpoint \(\mathbf x\) at fixed time \(t\) then gives \(\delta S=\mathbf p\cdot\delta\mathbf x\) (this follows from the usual Noetherian boundary-term calculation) so in particular:

\[\mathbf p=\frac{\partial S}{\partial\mathbf x}\]

Intuitively, this is saying that the particle moves in a direction (the direction of the momentum \(\mathbf p\)) orthogonal to the contour surfaces of the action field \(S\), i.e. such isosurfaces can be viewed as “wavefronts”. Then the total time derivative is:

\[\dot S=L\]

But \(\dot S=\frac{\partial S}{\partial t}+\frac{\partial S}{\partial\mathbf x}\cdot\dot{\mathbf x}=\frac{\partial S}{\partial t}+\mathbf p\cdot\dot{\mathbf x}\). Thus, isolating for \(H=\mathbf p\cdot\dot{\mathbf x}-L\) yields the Hamilton-Jacobi nonlinear \(1^{\text{st}}\)-order PDE for \(S(\mathbf x,t)\):

\[-\frac{\partial S}{\partial t}=H\left(\mathbf x,\frac{\partial S}{\partial\mathbf x},t\right)\]

Problem: When \(\partial H/\partial t=0\), the Hamiltonian is conserved with energy \(H=E\), so this motivates the additive separation of variables, \(S(\mathbf x,t):=S_0(\mathbf x)-Et\) for some constant \(E\). What does the Hamilton-Jacobi equation simplify to in this case? For a single non-relativistic particle of mass \(m\) moving in a potential \(V(\mathbf x)\), what does this look like? What about in \(1\) dimension?

Solution: \[H\left(\mathbf x,\frac{\partial S_0}{\partial\mathbf x}\right)=E\]

which for \(H(\mathbf x,\mathbf p)=|\mathbf p|^2/2m+V(\mathbf x)\) looks like:

\[\frac{1}{2m}\biggr|\frac{\partial S_0}{\partial\mathbf x}\biggr|^2+V(\mathbf x)=E\]

and in \(1\) dimension is integrable to the explicit solution:

\[S_0(x)=\pm\int ^xdx’\sqrt{2m(E-V(x’))}\]

In particular, the usual trajectory \(x(t)\) can be obtained by treating \(S_0=S_0(x;E)\) as a family of solutions parameterized by the energy \(E\); this works because \(S_0\) can be viewed as a particular generating function of a canonical transformation \((\mathbf x,\mathbf p,H)\mapsto (\mathbf x',\mathbf p',H')\) in which the “boosted” Hamiltonian vanishes \(H'=0\).

\[\frac{\partial S}{\partial E}=-t_0\Rightarrow t-t_0=\frac{\partial S_0}{\partial E}=\pm\sqrt{\frac{m}{2}}\int_{x_0}^{x(t)}\frac{dx'}{\sqrt{E-V(x')}}\]
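As a sanity check of this quadrature formula, one can take the illustrative choice \(m=1\), \(V(x)=x^2/2\), \(E=1/2\) (a harmonic oscillator of unit amplitude), for which the exact answer is \(t-t_0=\arcsin x\), i.e. \(x(t)=\sin(t-t_0)\):

```python
import numpy as np

# Numerical check of t - t0 = sqrt(m/2) * int dx' / sqrt(E - V(x')) for the
# harmonic oscillator V(x) = x^2/2 with m = 1, E = 1/2 (illustrative choice):
# the closed form is t - t0 = arcsin(x).
m, E = 1.0, 0.5
V = lambda xp: 0.5 * xp**2

x = np.linspace(0.0, 0.5, 20001)          # path from x0 = 0 up to x = 1/2
f = np.sqrt(m / 2.0) / np.sqrt(E - V(x))  # integrand
dx = x[1] - x[0]
t_elapsed = dx * (0.5 * (f[0] + f[-1]) + f[1:-1].sum())  # trapezoid rule

print(t_elapsed, np.arcsin(0.5))          # both ≈ 0.523599
```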

Problem: Above, the static field \(S_0(\mathbf x)\) was introduced to simplify the Hamilton-Jacobi equation when the energy \(E\) was conserved. However, if one pulls back to the level of functionals rather than fields, one can define an analogous abbreviated action functional \(S_0[\mathbf x]\) which depends only on the path \(\mathbf x\) taken rather than the trajectory \(\mathbf x(t)\). Define \(S_0[\mathbf x]\), and moreover show that when the energy \(E\) is conserved, the on-shell path is a stationary point of \(S_0\) (this is called Maupertuis’s principle).

Solution: The abbreviated action for a single particle of mass \(m\) and (non-relativistic) energy \(E=|\mathbf p|^2/2m +V(\mathbf x)\) is:

\[S_0[\mathbf x]:=\int d\mathbf x\cdot\mathbf p=\int ds |\mathbf p|=\int ds\sqrt{2m(E-V(\mathbf x))}\]

(modifying this to \(S_0[\mathbf x]:=\int ds |\mathbf p|=\int ds\sqrt{2(E-V(\mathbf x))}\) allows for interpretation as an \(N\)-particle system in configuration space \(\mathbf x\in\mathbf R^{3N}\) with the Riemannian “mass metric” \(ds^2=m_1|d\mathbf x_1|^2+…+m_N|d\mathbf x_N|^2\)).

To find the stationary paths of \(S_0[\mathbf x]\) subject to the constraint \(H(\mathbf x,\mathbf p)=E\), one can implement a Lagrange multiplier \(\gamma(\tau)\) to perform unconstrained extremization of:

\[S[\mathbf x(\tau)]:=S_0[\mathbf x(\tau)]-\int d\tau\gamma(\tau)(H(\mathbf x,\mathbf p)-E)=\int d\tau (\mathbf p\cdot\dot{\mathbf x}-\gamma(\tau)(H(\mathbf x,\mathbf p)-E))\]

The Euler-Lagrange equations lead to Hamilton’s equations:

\[\frac{d\mathbf x}{d\tau}=\gamma\frac{\partial H}{\partial\mathbf p}\]

\[\frac{d\mathbf p}{d\tau}=-\gamma\frac{\partial H}{\partial\mathbf x}\]

provided the Lagrange multiplier \(\gamma=dt/d\tau\) encodes reparameterization invariance; with this choice it’s clear that the integrand in the functional \(S\) was nothing more than the Lagrangian \(L=\mathbf p\cdot\dot{\mathbf x}-H\) (plus an unimportant constant \(E\)) so Maupertuis’s principle reduces to the usual Hamilton’s principle.

Problem: What does Fermat’s principle in ray optics assert? Hence, derive the ray equation.

Solution: The time functional \(T=T[\mathbf x(s)]\) of a ray trajectory \(\mathbf x(s)\) is stationary on-shell. That is:

\[cT[\mathbf x(s)]=\int ds n(\mathbf x(s))\]

This is reparameterization invariant, since one can arbitrarily parameterize \(\mathbf x=\mathbf x(t)\) and replace \(ds=dt|\dot{\mathbf x}|\). The corresponding Euler-Lagrange equations are:

\[\frac{d}{dt}\left(n(\mathbf x)\frac{\dot{\mathbf x}}{|\dot{\mathbf x}|}\right)=|\dot{\mathbf x}|\frac{\partial n}{\partial\mathbf x}\]

But by choosing the natural parameterization \(t:=s\) one has \(|d\mathbf x/ds|=1\), hence the ray equation:

\[\frac{d}{ds}\left(n\frac{d\mathbf x}{ds}\right)=\frac{\partial n}{\partial\mathbf x}\]

This can also be written in terms of the curvature vector \(\boldsymbol{\kappa}=d^2\mathbf x/ds^2\):

\[\boldsymbol{\kappa}=\left(\frac{\partial\ln n}{\partial\mathbf x}\right)_{\perp d\mathbf x}\]

Problem: Starting from an arbitrary Cartesian component \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) of either the \(\mathbf E\) or \(\mathbf B\) fields of a light wave (here \(\omega=ck_0\) with \(k_0=2\pi/\lambda_0\) is the free space wavenumber), make the eikonal approximation to the dispersionless wave equation obeyed by \(\psi\) in order to obtain the (scalar) eikonal equation. By defining light rays as the integral curves of the eikonal field \(cT(\mathbf x)\) (a kind of local optical path length), reproduce the vector eikonal equation from Fermat’s principle above.

Solution: The ansatz \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) is easy to justify; the \(e^{-i\omega t}\) is just a Fourier transform factor that reduces the wave equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=\frac{n^2}{c^2}\frac{\partial^2\psi}{\partial t^2}\) to a Helmholtz equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=-n^2k_0^2\psi\). The remaining piece is just a polar parameterization of an arbitrary \(\mathbf C\)-valued spatial field \(\psi_0(\mathbf x)e^{ik_0cT(\mathbf x)}\). One obtains:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2=n^2+\frac{1}{k_0^2\psi_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_0+\frac{2i}{k_0\psi_0}\frac{\partial\psi_0}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}+\frac{i}{k_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2cT\]

The eikonal approximation amounts to taking the ray optics limit \(k_0\to\infty\) (in practice, the wavelength \(2\pi/k_0\) has to be much shorter than all other length scales), and yields the (scalar) eikonal equation:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|=n\]

A light ray is thus a trajectory \(\mathbf x(s)\) with unit tangent vector:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

The rest is an application of the chain rule:

\[\frac{d}{ds}=\frac{\partial}{\partial\mathbf x}\cdot\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}\]

followed by the identity:

\[\left(\frac{\partial cT}{\partial\mathbf x}\cdot\frac{\partial}{\partial\mathbf x}\right)\frac{\partial cT}{\partial\mathbf x}=\frac{1}{2}\frac{\partial}{\partial\mathbf x}\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2\]

to deduce the (vector) eikonal equation of motion for ray trajectories just as Fermat’s principle predicts.

Problem: Hence, what is Hamilton’s optics-mechanics analogy?

Solution: In a nutshell, the isomorphism proceeds as:

\[(n(\mathbf x), cT)\leftrightarrow (|\mathbf p(\mathbf x)|,S_0)\]

Problem: Use Hamilton’s optics-mechanics analogy to solve the brachistochrone problem (this was how Johann Bernoulli originally solved it).

Solution: By energy conservation, the speed of the particle at distance \(y>0\) below its initial dropping height is \(v=\sqrt{2gy}\). By Fermat’s principle, minimizing the time functional then amounts to treating the particle as a light ray with \(n(\mathbf x)=c/v(y)\). So the question becomes how do light rays bend in a horizontally stratified medium with \(n(y)\propto y^{-1/2}\)? The answer is given by the ray equations:

\[\frac{d}{ds}\begin{pmatrix} y^{-1/2}dx/ds \\ y^{-1/2}dy/ds\end{pmatrix}=\begin{pmatrix}0 \\ -y^{-3/2}/2\end{pmatrix}\]

The horizontal component expresses Snell’s law since \(dx/ds=\sin\theta\) (it expresses momentum conservation along the homogeneous \(\partial n/\partial x=0\) direction). Using the tangent vector constraint \(ds^2=dx^2+dy^2\) gives the ODE of a cycloid:

\[\frac{dy}{dx}=\sqrt{\frac{\text{const}}{y}-1}\]

(the vertical component ODE has an analytical solution \(y(s)=-s^2/8R+s\) which is contained in the cycloid, so is redundant information).
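A quick numerical confirmation of the cycloid claim (a sketch: the standard rolling-circle parameterization \(x=R(t-\sin t)\), \(y=R(1-\cos t)\) with \(y\) measured downward is assumed, in which case the constant in the ODE works out to \(2R\)):

```python
import numpy as np

# Check that the cycloid x = R(t - sin t), y = R(1 - cos t) satisfies the
# brachistochrone ODE dy/dx = sqrt(const/y - 1) with const = 2R.
R = 1.0
t = np.linspace(0.1, np.pi - 0.1, 500)      # descending branch
y = R * (1.0 - np.cos(t))

dydx_param = np.sin(t) / (1.0 - np.cos(t))  # (dy/dt)/(dx/dt) along the cycloid
dydx_ode = np.sqrt(2.0 * R / y - 1.0)

print(np.max(np.abs(dydx_param - dydx_ode)))  # ≈ 0 to machine precision
```

The agreement follows analytically from \(\sin^2 t=(1-\cos t)(1+\cos t)\).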

Problem: How did Hamilton’s optics-mechanics analogy inspire Schrodinger to propose his famous equations of quantum mechanics?

Solution: Essentially, Schrodinger asked: ray optics is to wave optics as classical mechanics is to what? In other words, one imagines there exists a wave theory of particles/matter and one would like to take the “inverse eikonal limit” of classical mechanics (here, “inverse eikonal limit” is usually called quantization):

Just as light rays propagate parallel to their phase fronts:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

Particles propagate parallel to their “action fronts” exactly according to Hamilton’s analogy:

\[\frac{d\mathbf x}{ds}=\frac{1}{|\mathbf p(\mathbf x)|}\frac{\partial S_0}{\partial\mathbf x}\]

Already this suggests that the action should have some phase interpretation. More precisely, it should be the phase of the particle’s de Broglie wave in units of \(\hbar\). It’s also not obvious that particles should be described by a scalar wave field rather than e.g. the electromagnetic vector wave fields of light. Schrodinger simply guessed it looked like the equivalent of “scalar diffraction theory” with a single wavefunction \(\psi(\mathbf x,t)=\psi_0(\mathbf x,t)e^{iS(\mathbf x,t)/\hbar}\). This gives the Madelung equations of quantum hydrodynamics, one of which is just a continuity equation (giving credence to the Born interpretation of \(|\psi|^2\)) and the other a quantum Hamilton-Jacobi equation which in the limit \(\hbar\to 0\) (analogous to the eikonal limit \(\lambda_0\to 0\)) simplifies to the classical Hamilton-Jacobi equation.


Pseudo-Riemannian Geometry

Problem: Define the signature of a matrix. Hence, state and prove Sylvester’s law of inertia.

Solution: The signature of an \(n\times n\) matrix \(A\) is a \(3\)-tuple \((n_+,n_-,n_0)\) where \(n_+\) is the number of positive eigenvalues of \(A\) (including multiplicity), \(n_-\) is the number of negative eigenvalues of \(A\) (including multiplicity), and \(n_0=\text{dim}\ker A\) is the multiplicity of the zero eigenvalue; thus, \(n_++n_-+n_0=n\).

Let \(A,B\) be real symmetric matrices. Then Sylvester’s law of inertia asserts that \(A\) and \(B\) are congruent matrices iff they have the same signature (which sometimes is called “inertia” because of this invariance under congruence, hence the name).

Proof: It’s easy to see that the nullity \(n_0\) is preserved by congruence transformations. If one can show that \(n_+\) is also preserved, then it follows that \(n_-\) is preserved by virtue of \(n_++n_-+n_0=n\). To show this, the idea is to argue by contradiction: assuming \(n_+\) is not preserved, dimension counting would imply a non-zero vector living in the intersection of the subspace spanned by the positive-eigenvalue eigenvectors of \(A\) and the congruence-transformed subspace spanned by the non-positive-eigenvalue eigenvectors of \(B\); evaluating the corresponding quadratic form on such a vector then yields both a strictly positive and a non-positive number, a contradiction.

Since any real, symmetric matrix \(A\) is isomorphic to a real quadratic form \(Q(\mathbf x):=\mathbf x^TA\mathbf x\), the concepts of signature and Sylvester’s law of inertia can also be reformulated in the language of quadratic forms rather than real symmetric matrices.
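A quick numerical illustration of Sylvester’s law (random matrices as illustrative stand-ins; a random \(S\) is almost surely invertible, so \(A\mapsto S^TAS\) is almost surely a congruence):

```python
import numpy as np

rng = np.random.default_rng(0)

def signature(A, tol=1e-9):
    """(n_+, n_-, n_0) of a real symmetric matrix via its eigenvalues."""
    w = np.linalg.eigvalsh(A)
    return (int((w > tol).sum()), int((w < -tol).sum()), int((np.abs(w) <= tol).sum()))

# A random real symmetric matrix and a random congruence transform A -> S^T A S:
A = rng.normal(size=(5, 5))
A = A + A.T
S = rng.normal(size=(5, 5))

print(signature(A), signature(S.T @ A @ S))  # identical 3-tuples, as Sylvester asserts
```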

Problem: Let \(X\) be a smooth \(n\)-manifold. Explain what it means to place the additional structure of a pseudo-Riemannian metric \(g\) on \(X\). Then explain how Riemannian and Lorentzian geometry are special cases of pseudo-Riemannian geometry.

Solution: A Riemannian metric \(g\) on \(X\) is a type \((0,2)\) tensor field that defines an inner product \(g_x:T_x(X)^2\to\mathbf R\) at the tangent space \(T_x(X)\) of each point \(x\in X\). The “tensor field” part is sometimes rephrased as saying that \(g_x\) is a bilinear form. Moreover, to flesh out the usual axioms of a real inner product space, the Riemannian metric tensor \(g_x\) at each \(x\in X\) must be symmetric \(g_x(v_x,v'_x)=g_x(v'_x,v_x)\) and positive-definite \(v_x\neq 0\Rightarrow g_x(v_x,v_x)>0\).

A pseudo-Riemannian metric relaxes the positive-definite requirement of a Riemannian metric, instead merely requiring non-degeneracy (i.e. at each point \(x\in X\), the zero vector \(0\) is the only vector orthogonal \(g_x(0,v_x)=0\) to all \(v_x\in T_x(X)\)). More concisely, what this is saying is that \(n_0=0\), so the signature of a pseudo-Riemannian metric may be thought of as a pair \((n_+,n_-)\).

Thus, Riemannian metrics are the special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n,0)\). Meanwhile, Lorentzian metrics are another special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n-1,1)\) (this is the relativists’ convention). Typically, spacetime \(X\) has dimension \(n=4\), so e.g. the Minkowski metric is said to have signature \((3,1)\), often written \((-,+,+,+)\).

Problem: Let \((X,g)\) be a pseudo-Riemannian manifold. Often, one writes the general expression for an infinitesimal line element on \(X\) as \(ds^2=g_{\mu\nu}dx^{\mu}dx^{\nu}\); explain the shorthand being used here.

Solution: This is nothing more than choosing some chart \(x^{\mu}\) on \(X\) and expanding the metric tensor field \(g\) in the coordinate basis \(\{dx^{\mu}\otimes dx^{\nu}\}\) of type \((0,2)\) tensor fields:

\[g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\]

for some real, symmetric scalar fields \(g_{\mu\nu}:X\to\mathbf R\). One then simply writes \(ds^2:=g\) and omits the tensor product \(\otimes\).

Problem: Let \(x(t)\in X\) be a curve on a Riemannian manifold \((X,g)\). By choosing a chart \(x^{\mu}\) to cover a suitable region of \(X\) spanned by the trajectory \(x(t)\), explain how to compute the length \(s\) of the curve \(x(t)\) using the Riemannian metric \(g\).

Solution: Heuristically, it is:

\[s=\int ds=\int\sqrt{g_{\mu\nu}dx^{\mu}dx^{\nu}}=\int dt\sqrt{g_{\mu\nu}(x(t))\dot{x}^{\mu}\dot{x}^{\nu}}\]
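As a worked numerical example (assuming the unit-sphere chart \(ds^2=d\theta^2+\sin^2\theta\,d\phi^2\), an illustrative choice), the length of the latitude circle \(\theta=\pi/3\) computed this way agrees with the exact value \(2\pi\sin(\pi/3)\):

```python
import numpy as np

# Length of the latitude circle theta = pi/3 on the unit sphere with metric
# ds^2 = dtheta^2 + sin(theta)^2 dphi^2 (an illustrative chart choice).
theta0 = np.pi / 3.0
t = np.linspace(0.0, 2.0 * np.pi, 10001)   # parameter along the curve
theta_dot, phi_dot = 0.0, 1.0              # curve: theta(t) = theta0, phi(t) = t

speed = np.sqrt(theta_dot**2 + np.sin(theta0)**2 * phi_dot**2) * np.ones_like(t)
dt = t[1] - t[0]
s = dt * (0.5 * (speed[0] + speed[-1]) + speed[1:-1].sum())  # trapezoid rule

print(s, 2.0 * np.pi * np.sin(theta0))     # both ≈ 5.441398
```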

Problem: Write down the action \(S\) for a non-relativistic particle of mass \(m\) moving on a Riemannian manifold \((X,g)\). Hence, derive the geodesic equation. What happens if the particle is relativistic?

Solution: A nonrelativistic free particle of mass \(m\) has action \(S=\int dt L\) described by the usual nonrelativistic kinetic Lagrangian:

\[L=\frac{m}{2}|\dot{\mathbf x}|^2\]

The catch is that \(|\dot{\mathbf x}|^2=g_{ij}(\mathbf x)\dot x^{i}\dot x^{j}\) depends on the Riemannian metric \(g\) in which the particle moves. Applying the Euler-Lagrange equations, one finds the equation of motion (called the geodesic equation):

\[\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\]

where the Christoffel symbols \(\Gamma^i_{jk}(\mathbf x):=\frac{1}{2}g^{i\ell}(\mathbf x)\left(\frac{\partial g_{\ell j}}{\partial x^k}+\frac{\partial g_{\ell k}}{\partial x^j}-\frac{\partial g_{jk}}{\partial x^{\ell}}\right)\) are analogous to “fictitious forces” on the particle due to the curvature of \(X\).

By contrast, a relativistic free particle has action \(S=-mcs=-mc^2\tau\). This can be recast into non-relativistic form \(S=\int dt L\) using the relativistic kinetic Lagrangian:

\[L=-mc\sqrt{g_{\mu\nu}\dot x^{\mu}\dot x^{\nu}}\]

which in flat space \(g_{\mu\nu}=\text{diag}(1,-1,-1,-1)\) reduces to the familiar \(L=-mc^2/\gamma\) (nb. more generally, the quantity under the square root is always \(\geq 0\) for the timelike worldlines of particles). Some comments:

  • The stationary action principle \(\delta S=0\) in which the action \(S\) tends to be minimized corresponds to the well-known fact that relativistic free particles travelling in straight line geodesics through spacetime maximize the Minkowski distance \(s\), or equivalently Alice experiences more proper time (i.e. ages more) than Bob in the twin paradox.
  • The relativistic action (unlike its non-relativistic counterpart) is manifestly reparameterization invariant.

Now the Euler-Lagrange geodesic equations for the relativistic Lagrangian are almost identical to the non-relativistic case except there’s an additional term:

\[\ddot x^{\mu}+\Gamma^{\mu}_{\nu\rho}(x)\dot x^{\nu}\dot x^{\rho}=\frac{\dot L}{L}\dot x^{\mu}\]

The additional term vanishes (\(\dot L=0\)) iff the worldline parameter is an affine transformation \(\tau'=a\tau+b\) of the particle’s proper time \(\tau\) (prove this by showing \(\dot{\tau}'=-aL/mc^2\)). Only in this case does the relativistic geodesic equation reduce to the non-relativistic form:

\[\frac{d^2 x^{\mu}}{d\tau'^2}+\Gamma^{\mu}_{\nu\rho}(x(\tau'))\frac{dx^{\nu}}{d\tau'}\frac{dx^{\rho}}{d\tau'}=0\]

Problem: For \(X=\mathbf R^2\), calculate all non-vanishing Christoffel symbols in the polar coordinate chart \((\rho,\phi)\).

Solution: In the polar chart, the flat metric reads \(ds^2=d\rho^2+\rho^2d\phi^2\). Rather than grinding through the explicit formula, read the Christoffel symbols off the free-particle Lagrangian \(L=\frac{m}{2}(\dot\rho^2+\rho^2\dot\phi^2)\), whose Euler-Lagrange equations are \(\ddot\rho-\rho\dot\phi^2=0\) and \(\ddot\phi+\frac{2}{\rho}\dot\rho\dot\phi=0\). Comparing with the geodesic equation, the non-vanishing Christoffel symbols are \(\Gamma^{\rho}_{\phi\phi}=-\rho\) and \(\Gamma^{\phi}_{\rho\phi}=\Gamma^{\phi}_{\phi\rho}=1/\rho\).

Note it is usually more efficient to compute Christoffel symbols this way rather than using the explicit formula in terms of partial derivatives of the metric \(g\) (though they’re of course equivalent).
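As a numerical check, one can integrate the polar geodesic equations \(\ddot\rho=\rho\dot\phi^2\), \(\ddot\phi=-\frac{2}{\rho}\dot\rho\dot\phi\) (equivalent to \(\Gamma^{\rho}_{\phi\phi}=-\rho\), \(\Gamma^{\phi}_{\rho\phi}=1/\rho\)) and confirm the geodesic traces a straight line in the plane:

```python
import numpy as np

# Geodesics of flat R^2 must be straight lines: the line x = 1, y = t
# corresponds to rho(t) = sqrt(1 + t^2), phi(t) = arctan(t).
def rhs(u):
    rho, phi, rho_dot, phi_dot = u
    return np.array([rho_dot, phi_dot, rho * phi_dot**2, -2.0 * rho_dot * phi_dot / rho])

u = np.array([1.0, 0.0, 0.0, 1.0])   # initial data of that line at t = 0
h = 1e-3
for _ in range(1000):                # RK4 integration to t = 1
    k1 = rhs(u); k2 = rhs(u + 0.5*h*k1); k3 = rhs(u + 0.5*h*k2); k4 = rhs(u + h*k3)
    u = u + (h/6.0) * (k1 + 2*k2 + 2*k3 + k4)

print(u[0], np.sqrt(2.0))            # rho(1) ≈ 1.414214
print(u[1], np.pi/4.0)               # phi(1) ≈ 0.785398
```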

Problem: Explain how the presence of a pseudo-Riemannian metric \(g\) on a smooth manifold \(X\) allows one to identify covectors in \(T_x^*(X)\) with tangent vectors in \(T_x(X)\) at each point \(x\in X\).

Solution: Just as in \(X=\mathbf R^n\) one identifies a tangent vector \(\mathbf v\) with its covector \(\mathbf v\cdot\) via the inner product \(\cdot\), at a given \(x\in X\) one uses the pseudo-inner product \(g_x\) on \(T_x(X)\) to make the exact same identification \(v_x\leftrightarrow g_x(v_x,\cdot)\) (this is sometimes called the musical isomorphism induced by \(g\), in light of the map \(\flat_x:T_x(X)\to T^*_x(X)\) defined by \(v_x^{\flat_x}(v'_x):=g_x(v_x,v'_x)\) and its inverse \(\sharp_x=\flat_x^{-1}\)).

In some chart \(x^{\mu}\), one can write the \(1\)-form \(v^{\flat}=(v^{\flat})_{\mu}dx^{\mu}\) with components \((v^{\flat})_{\mu}=v^{\flat}(\partial_{\mu})=g(v,\partial_{\mu})=g_{\mu\nu}v^{\nu}\) where \(v^{\nu}=v(x^{\nu})\) are the components of its musically isomorphic vector field \(v=v^{\nu}\partial_{\nu}\) in the same coordinate basis. Physicists typically abuse notation by merely writing \((v^{\flat})_{\mu}=g_{\mu\nu}v^{\nu}\) as just \(v_{\mu}=g_{\mu\nu}v^{\nu}\), identifying \(v^{\flat}\equiv v\) under the musical isomorphism. But then this lazy notation makes it look as if the metric \(g\) performs a trivial mechanical action of just “lowering the index” on \(v\).

Problem: Define a tensor field \(g^*\) that performs the trivial mechanical action of “raising the index” on a \(1\)-form \(A\).

Solution: Define \(g^*_x:T^*_x(X)^2\to\mathbf R\) to be a type \((2,0)\) tensor given by:

\[g^*(A,A'):=g(A^{\sharp}, A'^{\sharp})\]

Then on the one hand, in some chart \(x^{\mu}\), one has:

\[g^*(v^{\flat},v'^{\flat})=g(v,v')=g_{\mu\nu}v^{\mu}v'^{\nu}\]

On the other hand, in that same chart \(x^{\mu}\), one has:

\[g^*(v^{\flat},v'^{\flat})=(g^*)^{\mu\nu}(v^{\flat})_{\mu}(v'^{\flat})_{\nu}=(g^*)^{\mu\nu}g_{\mu\rho}v^{\rho}g_{\nu\sigma}v'^{\sigma}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}v^{\mu}v'^{\nu}\]

This implies \(g_{\mu\nu}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}\). Since \(g\) is non-degenerate (the \(n_0=0\) axiom of a pseudo-Riemannian metric), the matrix \(g_{\mu\nu}\) is invertible (\(\det g_{\mu\nu}\neq 0\)), so the only possibility is \((g^*)^{\mu\nu}g_{\nu\rho}=\delta^{\mu}_{\rho}\), i.e. \(g^*\sim g^{-1}\). It is common practice to just drop the \(*\) and write \(g^{\mu\nu}\) in lieu of \((g^*)^{\mu\nu}\). With this lazy notation, \(g^{\mu\nu}\) looks like it just “raises the index” on \(A^{\mu}=g^{\mu\nu}A_{\nu}\) (where one last abuse of notation \((A^{\sharp})^{\mu}=A^{\mu}\) has been made!). cf. the raising/lowering operators \(a^{\dagger},a\) in quantum mechanics.
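The conclusion \(g^*\sim g^{-1}\) is easy to verify numerically, with a random non-degenerate symmetric matrix standing in for the metric components at a point (an illustrative sketch, not tied to any particular metric):

```python
import numpy as np

rng = np.random.default_rng(2)

# A random symmetric bilinear form as the metric at a point (almost surely
# non-degenerate, hence invertible).
g = rng.normal(size=(4, 4))
g = g + g.T
g_star = np.linalg.inv(g)             # the (2,0) tensor g*, i.e. g^{mu nu}

v = rng.normal(size=4)
v_flat = g @ v                        # lowering: v_mu = g_{mu nu} v^nu
v_back = g_star @ v_flat              # raising:  v^mu = g^{mu nu} v_nu

print(np.allclose(g_star @ g, np.eye(4)))  # g^{mu nu} g_{nu rho} = delta^mu_rho
print(np.allclose(v_back, v))              # sharp inverts flat
```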

Problem: What is the canonical volume form on a smooth, orientable pseudo-Riemannian \(n\)-manifold \((X,g)\)? Why is it canonical?

Solution: In a chart \(x^{\mu}\) in which the pseudo-Riemannian metric has components \(g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\), the canonical volume form is \(\sqrt{|\det g|}dx^1\wedge\dots\wedge dx^n\). This is indeed a volume form because it is a top form with nowhere-vanishing coefficient (\(\det g\neq 0\) everywhere, for the same reasons as before). Moreover, it is canonical because, despite appearances, it does not depend on the choice of chart \(x^{\mu}\) (provided it defines the same orientation): \(\sqrt{|\det g|}\) is a scalar density of weight \(1\) whereas \(dx^1\wedge\dots\wedge dx^n\) is a scalar density of weight \(-1\).

Problem: Let \((X,g)\) be a smooth, \(n\)-dimensional, orientable pseudo-Riemannian manifold, and let \(\omega\in\Omega^k(X)\) be a differential \(k\)-form on \(X\). Define the Hodge dual \((n-k)\)-form \(\star\omega\in\Omega^{n-k}(X)\) on \(X\).

Solution: The Hodge dual \(\star\omega\in\Omega^{n-k}(X)\) is defined to be the unique differential \((n-k)\)-form such that for all test differential \(k\)-forms \(\omega'\in\Omega^k(X)\) one has the equality of top forms:

\[\omega'\wedge\star\omega=\widetilde{\langle\omega',\omega\rangle}\sqrt{|\det g|}dx^1\wedge\dots\wedge dx^n\]

Here, the operation \(\widetilde{\langle\omega',\omega\rangle}\) is defined for \(\omega'=A'_1\wedge\dots\wedge A'_k\) and \(\omega=A_1\wedge\dots\wedge A_k\) by the Gram determinant:

\[\widetilde{\langle\omega',\omega\rangle}:=\det\begin{pmatrix}g^*(A'_1,A_1) & \dots & g^*(A'_1,A_k) \\ \vdots & \ddots & \vdots \\ g^*(A'_k,A_1) & \dots & g^*(A'_k,A_k)\end{pmatrix}\]

and extended to arbitrary differential \(k\)-forms \(\omega',\omega\in\Omega^k(X)\) by bilinearity. It is easy to check that the Hodge star \(\star:\Omega^k(X)\to\Omega^{n-k}(X)\) is a linear transformation \(\star(\phi_1\omega_1+\phi_2\omega_2)=\phi_1\star\omega_1+\phi_2\star\omega_2\) between \(2\) vector spaces of the same dimension \(\dim\Omega^k(X)={n\choose k}={n\choose n-k}=\dim\Omega^{n-k}(X)\). Moreover, it is either an involution or behaves like multiplication by \(i\) in the sense that \(\star^2=(-1)^{k(n-k)+n_-}\).

Problem: For flat Minkowski spacetime \(X=\mathbf R^{1+3}\) with the Lorentzian metric \(ds^2=c^2dt^2-dx^2-dy^2-dz^2\) and orientation defined by the canonical volume form \(cdt\wedge dx\wedge dy\wedge dz\), compute the Hodge dual:

\[\star(e^{-x\cos y}dy\wedge dt+4tz^2 dx\wedge dz)\]

Solution: By linearity, one can focus on computing Hodge duals of the basis \(2\)-forms \(\star (dy\wedge dt)\) and \(\star(dx\wedge dz)\). Demonstrating this for \(\star (dy\wedge dt)\), the idea is to (roughly speaking) wedge it against itself and exploit the defining property of the Hodge dual:

\[dy\wedge dt\wedge\star(dy\wedge dt)=\det\begin{pmatrix}g^{yy} & g^{yt} \\ g^{ty} & g^{tt}\end{pmatrix}cdt\wedge dx\wedge dy\wedge dz\]

\[=\det\begin{pmatrix}-1 & 0 \\ 0 & 1/c^2\end{pmatrix}cdt\wedge dx\wedge dy\wedge dz\]

\[=-\frac{1}{c}dt\wedge dx\wedge dy\wedge dz\]

\[=-\frac{1}{c}dy\wedge dt\wedge dx\wedge dz\]

In this case, one can directly read off the answer \(\star(dy\wedge dt)=-c^{-1}dx\wedge dz\) or equivalently \(\star(dy\wedge dct)=dz\wedge dx\). Analogously, one can check \(\star(dx\wedge dz)=dy\wedge dct\) (consistent with the earlier identity for \(\star^2\)), so the final answer is:

\[\star(e^{-x\cos y}dy\wedge dt+4tz^2 dx\wedge dz)=\frac{e^{-x\cos y}}{c}dz\wedge dx+4tz^2dy\wedge dct\]

For more general non-orthogonal metrics \(g\), one cannot simply read off the answer like that but should instead set up an ansatz for the Hodge dual like \(\star(dy\wedge dt)=c_{tx}dt\wedge dx+c_{ty}dt\wedge dy+c_{tz}dt\wedge dz+c_{xy}dx\wedge dy+c_{xz}dx\wedge dz+c_{yz}dy\wedge dz\) and obtain the coefficients by wedging against the relevant basis differential forms.
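As a numerical sanity check, the following minimal numpy sketch (setting \(c=1\)) implements the standard component formula \((\star\omega)_{\rho\sigma}=\frac{1}{2}\sqrt{|\det g|}\,\omega^{\mu\nu}\epsilon_{\mu\nu\rho\sigma}\) (with the Levi-Civita symbol normalized by \(\epsilon_{txyz}=+1\)), which is equivalent to the wedge-product definition above, and recovers \(\star(dy\wedge dt)=-dx\wedge dz\):

```python
import numpy as np
from itertools import permutations

def parity(p):
    """Sign of the permutation p of (0, ..., len(p)-1)."""
    sgn = 1
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            if p[i] > p[j]:
                sgn = -sgn
    return sgn

# Inverse Minkowski metric diag(+1,-1,-1,-1) in coordinates (t,x,y,z) with c = 1,
# orientation dt^dx^dy^dz, and the Levi-Civita symbol eps_{0123} = +1.
g_inv = np.diag([1.0, -1.0, -1.0, -1.0])
eps = np.zeros((4, 4, 4, 4))
for p in permutations(range(4)):
    eps[p] = parity(p)

def hodge_star_2form(w):
    """(star w)_{rho sigma} = (1/2) sqrt|det g| w^{mu nu} eps_{mu nu rho sigma}; here sqrt|det g| = 1."""
    w_up = g_inv @ w @ g_inv  # raise both indices of the antisymmetric component matrix
    return 0.5 * np.einsum('mn,mnrs->rs', w_up, eps)

# omega = dy ^ dt has components omega_{20} = +1 = -omega_{02} (index order t,x,y,z).
w = np.zeros((4, 4))
w[2, 0], w[0, 2] = 1.0, -1.0
print(hodge_star_2form(w)[1, 3])  # -1.0, i.e. star(dy^dt) = -dx^dz as computed above
```

One can likewise feed in \(dx\wedge dz\) and read off \(\star(dx\wedge dz)=dy\wedge dt\), matching the solution above at \(c=1\).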

Problem: Let \((X,g)\) be a smooth, \(n\)-dimensional pseudo-Riemannian manifold. Just as the metric tensor field \(g\) induces a pseudo-inner product between tangent vectors at each point across the manifold \(X\), show that \(g\) also induces an inner product on the space \(\Omega^k(X)\) of differential \(k\)-forms on \(X\).

Solution: Given \(2\) differential \(k\)-forms \(\omega,\omega’\in\Omega^k(X)\), their \(g\)-induced inner product is defined by:

\[\langle\omega,\omega’\rangle:=\int_{X}\omega\wedge\star\omega’\]

(this is well-defined because \(\omega\wedge\star\omega’\in\Omega^n(X)\) is indeed a top form as observed earlier).

Problem: Having endowed \(\Omega^k(X)\) with an inner product, one can define the adjoint (sometimes called the codifferential) \(d^{\dagger}:\Omega^k(X)\to\Omega^{k-1}(X)\) of the exterior derivative operator \(d:\Omega^{k-1}(X)\to\Omega^k(X)\) by insisting that for arbitrary differential forms \(\omega\in\Omega^{k-1}(X),\omega’\in\Omega^k(X)\):

\[\langle d\omega,\omega’\rangle=\langle\omega,d^{\dagger}\omega’\rangle\]

(the LHS is the inner product on \(\Omega^k(X)\) whereas the RHS is the inner product on \(\Omega^{k-1}(X)\)). Show that if the underlying pseudo-Riemannian manifold \(X\) is closed, then:

\[d^{\dagger}=(-1)^{n(k+1)+1+n_-}\star d\space\star\]

Solution: Applying graded integration by parts and Stokes’ theorem:

\[\langle d\omega,\omega’\rangle=\int_{\partial X}\omega\wedge\star\omega’+(-1)^k\int_X\omega\wedge d\star\omega’\]

The first term vanishes because \(X\) is assumed to be closed (compact and without boundary, so \(\partial X=\emptyset\)), and by comparing with the expected form:

\[\langle\omega,d^{\dagger}\omega’\rangle=\int_X\omega\wedge\star d^{\dagger}\omega’\]

it is clear this would hold provided:

\[(-1)^kd\star=\star d^{\dagger}\]

Isolating for \(d^{\dagger}\) by acting with \(\star\) on both sides, using \(\star^2=(-1)^{(k-1)(n-k+1)+n_-}\), and exploiting trivial identities like \((-1)^{k^2-k}=1\) and \((-1)^{-n}=(-1)^n\) for \(k,n\in\mathbf Z\) leads to the claimed result.
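In detail, the sign bookkeeping goes as follows: acting with \(\star\) on both sides (noting \(d^{\dagger}\omega’\in\Omega^{k-1}(X)\)) gives:

\[(-1)^k\star d\star=\star^2d^{\dagger}=(-1)^{(k-1)(n-k+1)+n_-}d^{\dagger}\]

so that \(d^{\dagger}=(-1)^{k+(k-1)(n-k+1)+n_-}\star d\star\), and the exponent reduces modulo \(2\) via \(k+(k-1)(n-k+1)=kn-k^2+3k-n-1\equiv kn+n+1=n(k+1)+1\) (using \(k^2\equiv k\) and \(-n\equiv n\)), recovering the claimed sign \((-1)^{n(k+1)+1+n_-}\).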

Problem: Using the technology of \(d^{\dagger}\), define the Laplace-de Rham operator \(\Delta:\Omega^k(X)\to\Omega^k(X)\). For the case of a scalar field \(k=0\), show that the action of the Laplace-de Rham operator \(\Delta\) coincides with that of the negative Laplacian \(-\nabla^2\) (sometimes called the Laplace-Beltrami operator).

Solution: One has the positive semi-definite Laplace-de Rham operator:

\[\Delta:=(d+d^{\dagger})^2=\{d,d^{\dagger}\}\]

thanks to nilpotence \(d^2=(d^{\dagger})^2=0\). For a scalar field \(\phi\in C^{\infty}(X)\), \(d^{\dagger}\phi=0\) so:

\[\Delta\phi=d^{\dagger}d\phi=(-1)^{n_-+1}\star d(\partial_{\mu}\phi\star dx^{\mu})\]

The Hodge dual of the \(\mu^{\text{th}}\) basis \(1\)-form can be checked to be:

\[\star dx^{\mu}=(-1)^{\nu}g^{\mu\nu}\sqrt{|\det g|}dx^0\wedge…\wedge dx^{\nu-1}\wedge dx^{\nu+1}\wedge…\wedge dx^{n-1}\]

Thus:

\[d(\partial_{\mu}\phi\star dx^{\mu})=\partial_{\nu}\left(\sqrt{|\det g|}g^{\mu\nu}\partial_{\mu}\phi\right)dx^0\wedge…\wedge dx^{n-1}\]

And finally, the Hodge dual of a top form returns a scalar field:

\[\star (dx^0\wedge…\wedge dx^{n-1})=\frac{\sqrt{|\det g|}}{\det g}=\frac{(-1)^{n_-}}{\sqrt{|\det g|}}\]

Thus, one obtains:

\[\Delta\phi=-\frac{1}{\sqrt{|\det g|}}\partial_{\nu}\left(\sqrt{|\det g|}g^{\mu\nu}\partial_{\mu}\phi\right)=-\nabla^2\phi\]
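As a sanity check, the following small sympy sketch verifies this final formula for flat \(\mathbf R^2\) in polar coordinates (an illustrative choice, with \(g=\operatorname{diag}(1,r^2)\)), where it should reproduce the familiar polar Laplacian \(\nabla^2\phi=\frac{1}{r}\partial_r(r\partial_r\phi)+\frac{1}{r^2}\partial_{\theta}^2\phi\) up to the overall sign:

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
coords = [r, th]
phi = sp.Function('phi')(r, th)

# Flat R^2 in polar coordinates: g = diag(1, r^2), so sqrt|det g| = r.
g = sp.diag(1, r**2)
g_inv = g.inv()
sqrt_g = sp.sqrt(g.det())

# Laplace-de Rham on a scalar: Delta phi = -(1/sqrt|g|) d_nu(sqrt|g| g^{mu nu} d_mu phi)
lap = -sum(sp.diff(sqrt_g * g_inv[mu, nu] * sp.diff(phi, coords[mu]), coords[nu])
           for mu in range(2) for nu in range(2)) / sqrt_g

# The familiar polar Laplacian, with the overall sign flipped since Delta = -nabla^2.
expected = -(sp.diff(r * sp.diff(phi, r), r) / r + sp.diff(phi, th, 2) / r**2)
print(sp.simplify(lap - expected))  # 0
```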

Problem: Let \(\omega\in\Omega^k(X)\) be a differential \(k\)-form on a closed, Riemannian manifold \((X,g)\). Explain what it means for \(\omega\) to be harmonic. Furthermore, prove that \(\omega\) is harmonic iff it is both closed \(d\omega=0\) and co-closed \(d^{\dagger}\omega=0\) (this equivalence underlies the Hodge decomposition theorem and hence the rest of Hodge theory).

Solution: This is just a generalization of the usual notion of a harmonic scalar field obeying Laplace’s equation \(\nabla^2\phi=0\), only now the Laplacian \(\nabla^2\) is replaced by the Laplace-de Rham operator \(\Delta\) and the scalar field \(\phi\) by an arbitrary differential \(k\)-form \(\omega\in\Omega^k(X)\):

\[\Delta\omega=0\]

Because \(X\) is closed, one can say:

\[\langle\omega,\Delta\omega\rangle=\langle\omega, dd^{\dagger}\omega\rangle+\langle\omega,d^{\dagger}d\omega\rangle\]

\[=\langle d^{\dagger}\omega,d^{\dagger}\omega\rangle+\langle d\omega,d\omega\rangle\]

If \(\omega\) is harmonic, then \(\langle d^{\dagger}\omega,d^{\dagger}\omega\rangle+\langle d\omega,d\omega\rangle=\langle\omega,\Delta\omega\rangle=0\); because \(g\) is Riemannian, the inner product is positive-definite, so both non-negative terms must individually vanish, and in particular \(d\omega=d^{\dagger}\omega=0\). The converse is trivial: \(\Delta\omega=dd^{\dagger}\omega+d^{\dagger}d\omega=d0+d^{\dagger}0=0\).

Problem: Let \(X\) be a smooth manifold not necessarily equipped with a metric \(g\). What does it mean to endow \(X\) with the structure of an affine connection \(\nabla\)?

Solution: An affine connection is any map \(\nabla:\mathfrak{X}(X)^2\to\mathfrak{X}(X)\) which is \(C^{\infty}(X)\)-linear in its first tangent vector field argument but acts like a tangent vector (more precisely, like a derivation obeying the Leibniz rule \(\nabla_v(\phi v’)=v(\phi)v’+\phi\nabla_vv’\)) in its second tangent vector field argument. Thus, \(\nabla\) is not tensorial because it fails to be \(C^{\infty}(X)\)-linear in that second argument. Nevertheless, given a (possibly non-coordinate) basis \(\text{span}_{\mathbf R}\{e_{\mu}\}=\mathfrak{X}(X)\) of tangent vector fields, one can still define affine connection coefficients for \(\nabla\):

\[\nabla_{\nu}e_{\rho}=\Gamma^{\mu}_{\nu\rho}e_{\mu}\]

with \(\nabla_{\nu}:=\nabla_{e_{\nu}}\) a shorthand. The map \(\nabla_v:\mathfrak{X}(X)\to\mathfrak{X}(X)\) for some fixed tangent vector field \(v\in\mathfrak{X}(X)\) is called the covariant derivative along \(v\).

Problem: Let \(X\) be a smooth manifold. For two tangent vector fields \(v,v’\in\mathfrak{X}(X)\), given a (possibly non-coordinate) basis \(\text{span}_{\mathbf R}\{e_{\mu}\}=\mathfrak{X}(X)\) of tangent vector fields so that \(v=v^{\mu}e_{\mu}\) and \(v’=v’^{\nu}e_{\nu}\), show that:

\[\nabla_vv’=v^{\nu}(e_{\nu}(v’^{\mu})+\Gamma^{\mu}_{\nu\rho}v’^{\rho})e_{\mu}=v^{\nu}\nabla_{\nu}v’\]

Solution: This is a straightforward application of the properties of affine connections. The only thing to watch out for is that the affine connection is extended to act on scalar fields \(\phi\in C^{\infty}(X)\) in the obvious way:

\[\nabla_v\phi:=v(\phi)=\mathcal L_v\phi\]

In particular, \(\nabla_{\mu}\phi=e_{\mu}(\phi)\), and if \(e_{\mu}=\partial_{\mu}\) happens to be a coordinate basis for \(\mathfrak{X}(X)\), then the covariant derivative \(\nabla_{\mu}=\partial_{\mu}\) coincides with the partial derivative on scalar fields.
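To make the component formula concrete, here is a small sympy sketch for flat \(\mathbf R^2\) in polar coordinates (an illustrative choice; for the standard torsion-free metric-compatible connection the only non-zero coefficients are \(\Gamma^r_{\theta\theta}=-r\) and \(\Gamma^{\theta}_{r\theta}=\Gamma^{\theta}_{\theta r}=1/r\)), computing \(\nabla_{\partial_r}\partial_{\theta}=\frac{1}{r}\partial_{\theta}\):

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
coords = [r, th]
n = 2

# Connection coefficients of flat R^2 in polar coordinates (the standard
# torsion-free, metric-compatible choice), stored as Gamma[mu][nu][rho] = Gamma^mu_{nu rho}.
Gamma = [[[sp.S.Zero] * n for _ in range(n)] for _ in range(n)]
Gamma[0][1][1] = -r                      # Gamma^r_{theta theta}
Gamma[1][0][1] = Gamma[1][1][0] = 1/r    # Gamma^theta_{r theta} = Gamma^theta_{theta r}

def covariant_deriv(v, vp):
    """Components of nabla_v v': v^nu (e_nu(v'^mu) + Gamma^mu_{nu rho} v'^rho)."""
    return [sp.simplify(sum(v[nu] * (sp.diff(vp[mu], coords[nu])
                            + sum(Gamma[mu][nu][rho] * vp[rho] for rho in range(n)))
                            for nu in range(n)))
            for mu in range(n)]

# v = d_r, v' = d_theta: the rotation field stretches as one moves radially outward.
print(covariant_deriv([1, 0], [0, 1]))  # [0, 1/r], i.e. nabla_{d_r} d_theta = (1/r) d_theta
```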

Problem: For tangent vector fields \(v,v’\in\mathfrak{X}(X)\), compare the covariant derivative \(\nabla_vv’\) with the Lie derivative \(\mathcal L_vv’\) (both of which are tangent vector fields).

Solution: Basically, the Lie derivative \(\mathcal L_vv’=[v,v’]\) is \(\mathbf R\)-linear but not \(C^{\infty}(X)\)-linear in either argument, because both entries of the commutator are subject to the product rule; for example, \([\phi v,v’]=\phi[v,v’]-v’(\phi)v\) for a scalar field \(\phi\in C^{\infty}(X)\). By contrast, in the covariant derivative \(\nabla_vv’\), the \(v\)-argument is explicitly made to be \(C^{\infty}(X)\)-linear (though the \(v’\)-argument behaves in the same derivation-like way as both arguments of the Lie derivative), and this is what allows familiar linear algebraic expressions like \(\nabla_v=v^{\mu}\nabla_{\mu}\) to be written (this was proven above in the form \(\nabla_vv’=v^{\nu}\nabla_{\nu}v’\)).

Problem: Explain how to compute the covariant derivative of a \(1\)-form \(A\in\Omega^1(X)\) along a tangent vector field \(v\in\mathfrak{X}(X)\) to end up with another \(1\)-form \(\nabla_v A\).

Solution: As a \(1\)-form, \(A\) maps a tangent vector field \(v’\in\mathfrak{X}(X)\) to a scalar field \(A(v’)\in C^{\infty}(X)\). But earlier, it was already explained that one can take the covariant derivative of a scalar field, so it makes sense to compute \(\nabla_v(A(v’))=v(A(v’))\). Insisting that the product rule hold in the sense:

\[\nabla_v(A(v’))=(\nabla_vA)(v’)+A(\nabla_vv’)\]

So the desired definition ought to be:

\[(\nabla_vA)(v’):=v(A(v’))-A(\nabla_vv’)\]

or in a basis of \(\mathfrak{X}(X)\):

\[(\nabla_vA)(v’)=(\nabla_{\mu}A)_{\nu}v^{\mu}v’^{\nu}\]

with \((\nabla_{\mu}A)_{\nu}=\partial_{\mu}A_{\nu}-\Gamma^{\rho}_{\mu\nu}A_{\rho}\).
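One can also verify symbolically that this definition really does satisfy the product rule it was engineered from; the following sympy sketch (reusing the polar-coordinate connection coefficients of flat \(\mathbf R^2\), with arbitrary illustrative choices of \(A\), \(v\), \(v’\)) checks \(v(A(v’))=(\nabla A)(v,v’)+A(\nabla_vv’)\):

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
coords = [r, th]
n = 2

# Connection coefficients of flat R^2 in polar coordinates, Gamma[mu][nu][rho] = Gamma^mu_{nu rho}.
Gamma = [[[sp.S.Zero] * n for _ in range(n)] for _ in range(n)]
Gamma[0][1][1] = -r
Gamma[1][0][1] = Gamma[1][1][0] = 1/r

# Illustrative 1-form A = r^2 dr + r sin(theta) dtheta and tangent vector fields v, v'.
A = [r**2, r * sp.sin(th)]
v, vp = [1, r], [sp.cos(th), r**2]

# (nabla_mu A)_nu = d_mu A_nu - Gamma^rho_{mu nu} A_rho
nabla_A = [[sp.diff(A[nu], coords[mu]) - sum(Gamma[rho][mu][nu] * A[rho] for rho in range(n))
            for nu in range(n)] for mu in range(n)]

# (nabla_v v')^mu = v^nu (d_nu v'^mu + Gamma^mu_{nu rho} v'^rho)
nabla_v_vp = [sum(v[nu] * (sp.diff(vp[mu], coords[nu])
              + sum(Gamma[mu][nu][rho] * vp[rho] for rho in range(n))) for nu in range(n))
              for mu in range(n)]

# Product rule: v(A(v')) = (nabla A)(v, v') + A(nabla_v v')
lhs = sum(v[mu] * sp.diff(sum(A[nu] * vp[nu] for nu in range(n)), coords[mu]) for mu in range(n))
rhs = (sum(nabla_A[mu][nu] * v[mu] * vp[nu] for mu in range(n) for nu in range(n))
       + sum(A[mu] * nabla_v_vp[mu] for mu in range(n)))
print(sp.simplify(lhs - rhs))  # 0
```

The difference vanishes identically because the \(\Gamma\)-terms cancel pairwise, exactly as the product-rule derivation above demands.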

Problem: Let \(X\) be a smooth manifold equipped with an affine connection \(\nabla\). Define the type-\((1,2)\) torsion tensor field \(T\) of \(\nabla\), and compute the components of \(T\) in a coordinate basis.

Solution: For a \(1\)-form \(A\in\Omega^1(X)\) and tangent vector fields \(v,v’\in\mathfrak{X}(X)\), one has:

\[T(A,v,v’):=A(\nabla_vv’-\nabla_{v’}v-[v,v’])\]

One should check that \(T\) is indeed a tensor, since the affine connection \(\nabla\) is not a tensor as mentioned earlier. It is easy to see that \(T\) is linear in \(A\), and it is linear in \(v\) iff it is linear in \(v’\) because of antisymmetry \(T(A,v’,v)=-T(A,v,v’)\). Furthermore, it is easy to see that \(T(A,v_1+v_2,v’)=T(A,v_1,v’)+T(A,v_2,v’)\). The tricky property to prove is that \(T(A,\phi v,v’)=\phi T(A,v,v’)\) for any scalar field \(\phi\in C^{\infty}(X)\) (it must be a scalar field and not just some scalar \(\phi\in\mathbf R\) because one is trying to show that \(T\) is a tensor field not just a tensor):

\[T(A,\phi v,v’)=A(\nabla_{\phi v}v’-\nabla_{v’}(\phi v)-[\phi v,v’])\]

\[=A(\phi\nabla_vv’-v'(\phi)v-\phi\nabla_{v’}v-\phi[v,v’]+v'(\phi)v)\]

\[=A(\phi\nabla_vv’-\phi\nabla_{v’}v-\phi[v,v’])\]

\[=\phi A(\nabla_vv’-\nabla_{v’}v-[v,v’])=\phi T(A,v,v’)\]

Thus, in a coordinate basis \(\{\partial_{\mu}\}\) of \(\mathfrak{X}(X)\), the components \(T^{\rho}_{\mu\nu}:=T(dx^{\rho},\partial_{\mu},\partial_{\nu})\) are tensorial:

\[T^{\rho}_{\mu\nu}=\Gamma^{\rho}_{\mu\nu}-\Gamma^{\rho}_{\nu\mu}\]

where Clairaut’s theorem \([\partial_{\mu},\partial_{\nu}]=0\) for a coordinate basis has been used.

Problem: Let \(X\) be a smooth manifold equipped with an affine connection \(\nabla\). Define the type-\((1,3)\) Riemann curvature tensor field \(R\) of \(\nabla\), and compute the components of \(R\) in a coordinate basis.

Solution: For a \(1\)-form \(A\in\Omega^1(X)\) and tangent vector fields \(v,v’,v^{\prime\prime}\in\mathfrak{X}(X)\), one has:

\[R(A,v,v’,v^{\prime\prime}):=A(\nabla_v\nabla_{v’}v^{\prime\prime}-\nabla_{v’}\nabla_vv^{\prime\prime}-\nabla_{[v,v’]}v^{\prime\prime})\]

Similar to the torsion tensor field \(T\), one should check that \(R\) genuinely deserves its designation as a tensor field. In a coordinate basis \(\{\partial_{\mu}\}\) of \(\mathfrak{X}(X)\), the (slightly awkward but conventional) tensor components \(R^{\sigma}_{\rho\mu\nu}:=R(dx^{\sigma},\partial_{\mu},\partial_{\nu},\partial_{\rho})\) are:

\[R^{\sigma}_{\rho\mu\nu}=\partial_{\mu}\Gamma^{\sigma}_{\nu\rho}-\partial_{\nu}\Gamma^{\sigma}_{\mu\rho}+\Gamma^{\lambda}_{\nu\rho}\Gamma^{\sigma}_{\mu\lambda}-\Gamma^{\lambda}_{\mu\rho}\Gamma^{\sigma}_{\nu\lambda}\]

where again Clairaut’s theorem \([\partial_{\mu},\partial_{\nu}]=0\) for a coordinate basis has been used. Again similar to the case for \(T\), the antisymmetry \(R(A,v’,v,v^{\prime\prime})=-R(A,v,v’,v^{\prime\prime})\) manifests at the component level as \(R^{\sigma}_{\rho\nu\mu}=-R^{\sigma}_{\rho\mu\nu}\).
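As a concrete check of the component formula, the following sympy sketch (indices \(0=\theta\), \(1=\phi\)) computes \(R^{\theta}_{\phi\theta\phi}\) for the unit round \(2\)-sphere using its standard connection coefficients \(\Gamma^{\theta}_{\phi\phi}=-\sin\theta\cos\theta\) and \(\Gamma^{\phi}_{\theta\phi}=\Gamma^{\phi}_{\phi\theta}=\cot\theta\), recovering \(\sin^2\theta\) together with the antisymmetry \(R^{\sigma}_{\rho\nu\mu}=-R^{\sigma}_{\rho\mu\nu}\):

```python
import sympy as sp

th, ph = sp.symbols('theta phi')
coords = [th, ph]
n = 2

# Levi-Civita connection coefficients of the unit round 2-sphere,
# stored as Gamma[sigma][mu][rho] = Gamma^sigma_{mu rho}.
Gamma = [[[sp.S.Zero] * n for _ in range(n)] for _ in range(n)]
Gamma[0][1][1] = -sp.sin(th) * sp.cos(th)     # Gamma^theta_{phi phi}
Gamma[1][0][1] = Gamma[1][1][0] = sp.cot(th)  # Gamma^phi_{theta phi} = Gamma^phi_{phi theta}

def riemann(s, r, m, nu):
    """R^s_{r m nu} = d_m Gamma^s_{nu r} - d_nu Gamma^s_{m r}
    + Gamma^l_{nu r} Gamma^s_{m l} - Gamma^l_{m r} Gamma^s_{nu l}"""
    val = sp.diff(Gamma[s][nu][r], coords[m]) - sp.diff(Gamma[s][m][r], coords[nu])
    for l in range(n):
        val += Gamma[l][nu][r] * Gamma[s][m][l] - Gamma[l][m][r] * Gamma[s][nu][l]
    return sp.simplify(val)

# R^theta_{phi theta phi} = sin(theta)^2 encodes the unit Gauss curvature of the sphere.
print(riemann(0, 1, 0, 1))
```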

Problem: Establish the Ricci identity connecting \(T\) and \(R\) together in terms of the commutator of covariant derivatives acting on a tangent vector field.

Solution: In a coordinate basis, write \(\nabla_{\mu}v^{\rho}:=(\nabla_{\mu}v)^{\rho}=\partial_{\mu}v^{\rho}+\Gamma^{\rho}_{\mu\lambda}v^{\lambda}\) and remember that the second covariant derivative \(\nabla_{\mu}\nabla_{\nu}v^{\rho}\) treats \(\nabla v\) as a type-\((1,1)\) tensor field (so it picks up both a \(+\Gamma^{\rho}_{\mu\sigma}\nabla_{\nu}v^{\sigma}\) term and a \(-\Gamma^{\sigma}_{\mu\nu}\nabla_{\sigma}v^{\rho}\) term). Antisymmetrizing in \(\mu\leftrightarrow\nu\), the partial derivatives of \(v^{\rho}\) cancel and one is left with the Ricci identity:

\[[\nabla_{\mu},\nabla_{\nu}]v^{\rho}=R^{\rho}_{\lambda\mu\nu}v^{\lambda}-T^{\lambda}_{\mu\nu}\nabla_{\lambda}v^{\rho}\]

so for a torsion-free connection the commutator of covariant derivatives is a purely algebraic (curvature) operation on \(v\).

Problem: State and prove the fundamental theorem of pseudo-Riemannian geometry.

Solution: Let \((X,g)\) be a smooth, pseudo-Riemannian manifold. Then there exists a unique affine connection \(\nabla\) (called the Levi-Civita connection) which is torsion-free \(T=0\) and \(g\)-compatible in the sense that \(\nabla_v g=0\) for all tangent vector fields \(v\in\mathfrak{X}(X)\).

Proof: For uniqueness, cycle the \(g\)-compatibility condition \(v(g(v’,v^{\prime\prime}))=g(\nabla_vv’,v^{\prime\prime})+g(v’,\nabla_vv^{\prime\prime})\) over the \(3\) tangent vector fields and eliminate all covariant derivatives except \(\nabla_vv’\) using torsion-freeness \(\nabla_vv’-\nabla_{v’}v=[v,v’]\); the result is the Koszul formula:

\[2g(\nabla_vv’,v^{\prime\prime})=v(g(v’,v^{\prime\prime}))+v’(g(v^{\prime\prime},v))-v^{\prime\prime}(g(v,v’))+g([v,v’],v^{\prime\prime})-g([v’,v^{\prime\prime}],v)-g([v,v^{\prime\prime}],v’)\]

Since \(g\) is non-degenerate, this determines \(\nabla_vv’\) uniquely. For existence, take the Koszul formula as the definition of \(\nabla_vv’\) and check directly that it yields an affine connection which is torsion-free and \(g\)-compatible. In a coordinate basis, the Koszul formula reproduces the familiar Christoffel symbols of the Levi-Civita connection:

\[\Gamma^{\rho}_{\mu\nu}=\frac{1}{2}g^{\rho\sigma}(\partial_{\mu}g_{\nu\sigma}+\partial_{\nu}g_{\sigma\mu}-\partial_{\sigma}g_{\mu\nu})\]
