Polymers

Problem: Explain why for a random walk in \(\mathbf R^d\), the probability distribution of the random vector sum \(\mathbf x:=\mathbf x_1+…+\mathbf x_N\) of \(N\) i.i.d. random vectors \(\mathbf x_i\) each with identical mean \(\boldsymbol{\mu}:=\langle\mathbf x_i\rangle\) and identical covariance matrix \(\sigma^2:=\langle\mathbf x_i^{\otimes 2}\rangle-\boldsymbol{\mu}^{\otimes 2}\) in the large \(N\gg 1\) limit is of the form:

\[p(\mathbf x)=\frac{1}{(2\pi N)^{d/2}\det\sigma}e^{-(\mathbf x-N\boldsymbol{\mu})^T\sigma^{-2}(\mathbf x-N\boldsymbol{\mu})/2N}\]

Solution: Central limit theorem.

Problem: Explain what an ideal polymer chain is.

Solution: A model of polymer chains lacking a concept of excluded volume. The word “ideal” here is strictly weaker than e.g. the “ideal” in “ideal gas” (where it really means no interactions of any kind); here the monomers are permitted to have short-range interactions (e.g. the covalent bonds holding them together, steric hindrance, etc.) but no long-range interactions (in particular, if a later segment of the chain happens to fold back onto an earlier segment, this is allowed within the ideal chain paradigm due to the lack of excluded volume). In particular, one can think of ideal polymer chains as the universality class of polymer chains whose expected end-to-end distance scales as \(\sqrt{\langle r^2\rangle}\sim N^{1/2}\).

Problem: Consider the freely-jointed chain model of a polymer in \(\mathbf R^3\), which consists of \(N\gg 1\) monomers \(\mathbf x_i\), each of fixed length \(\ell:=|\mathbf x_i|\), that are connected to each other but are otherwise mutually non-interacting and are also isolated from any external solvent, etc. For this ideal chain, calculate the distribution of the end-to-end vector \(\mathbf x:=\sum_{i=1}^N\mathbf x_i\) of the polymer chain.

Solution: In this case, one has \(\boldsymbol{\mu}=\mathbf 0\) and writing \(\mathbf x_i:=\ell(\cos\phi\sin\theta,\sin\phi\sin\theta,\cos\theta)\) for \((\cos\theta,\phi)\in [-1,1]\times[0,2\pi]\) uniformly distributed, the covariance matrix is proportional to the identity \(\sigma^2=\ell^2/3\) (e.g. \(\langle z^2\rangle=\frac{1}{2\times 2\pi}\int_{-1}^1d\cos\theta\int_0^{2\pi}d\phi (\ell\cos\theta)^2=\ell^2/3\)). Thus:

\[p(\mathbf x)=\left(\frac{3}{2\pi N\ell^2}\right)^{3/2}e^{-3|\mathbf x|^2/2N\ell^2}\]

which implies the radial distribution function for \(r:=|\mathbf x|\):

\[p(r)=\left(\frac{3}{2\pi N\ell^2}\right)^{3/2}4\pi r^2e^{-3r^2/2N\ell^2}\]

with the expected mean-square end-to-end distance \(\langle r^2\rangle=\int_0^{\infty}dr\,p(r)r^2=N\ell^2\).
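As a numerical sanity check of \(\langle r^2\rangle=N\ell^2\) and \(\langle z^2\rangle=N\ell^2/3\), here is a minimal Monte Carlo sketch. The function name `sample_end_to_end`, the chain count and the seed are illustrative choices, not part of the derivation.

```python
import numpy as np

def sample_end_to_end(n_chains, n_monomers, bond_length, rng):
    """Sample end-to-end vectors of freely-jointed chains in R^3."""
    # Isotropic unit vectors: normalize standard normal draws.
    steps = rng.normal(size=(n_chains, n_monomers, 3))
    steps /= np.linalg.norm(steps, axis=-1, keepdims=True)
    return bond_length * steps.sum(axis=1)

rng = np.random.default_rng(0)
N, ell = 50, 1.0                      # illustrative values
x = sample_end_to_end(20000, N, ell, rng)
mean_r2 = np.mean(np.sum(x**2, axis=1))   # compare with N * ell**2
mean_z2 = np.mean(x[:, 2]**2)             # compare with N * ell**2 / 3
print(mean_r2 / (N * ell**2), mean_z2 / (N * ell**2 / 3))
```

Both printed ratios should be close to \(1\), up to Monte Carlo noise.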

(aside: for certain kinds of ring polymers that close back on themselves, the end-to-end distance is trivially \(0\). In that case, an alternative metric for quantifying the polymer chain size is the radius of gyration \(r_g\), defined by \(r_g^2:=\frac{1}{N}\sum_{i=1}^N|\mathbf r_i-\mathbf R|^2\) where each \(\mathbf r_i\) refers to the position of the \(i^{\text{th}}\) monomer (say, its center of mass, but in the coarse-grained limit it doesn’t matter), and \(\mathbf R:=\frac{1}{N}\sum_{i=1}^N\mathbf r_i\) is the center of mass of the polymer chain. By Lagrange’s identity, this can be rewritten as half the average squared distance between pairs of monomers on the chain:

\[r_g^2=\frac{1}{2N^2}\sum_{i=1}^N\sum_{j=1}^N|\mathbf r_i-\mathbf r_j|^2=\frac{1}{N^2}\sum_{i<j}|\mathbf r_i-\mathbf r_j|^2\]

This form is easier to work with because (assuming \(i<j\)) one can express the displacement \(\mathbf r_j-\mathbf r_i=\frac{\mathbf x_i+\mathbf x_j}{2}+\sum_{k=i+1}^{j-1}\mathbf x_k\) (assuming each monomer’s center of mass is at its midpoint, but this is irrelevant in the \(N\gg 1\) limit). The \(N\gg 1\) coarse-grained result for a freely-jointed chain is:

\[\langle r_g^2\rangle=\frac{\langle r^2\rangle}{6}=\frac{N\ell^2}{6}\]

so the root-mean-square radius of gyration is smaller than the root-mean-square end-to-end distance by a factor of \(\sqrt{6}\)).
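The equivalence between the center-of-mass definition of \(r_g^2\) and the pairwise form can be checked directly on arbitrary positions. The sketch below uses random points rather than an actual chain, since the identity holds for any configuration; the function names are illustrative.

```python
import numpy as np

def radius_of_gyration_sq(r):
    """r_g^2 from the definition: mean squared distance to the center of mass."""
    center = r.mean(axis=0)
    return np.mean(np.sum((r - center)**2, axis=1))

def radius_of_gyration_sq_pairs(r):
    """r_g^2 from all pairs: (1 / 2N^2) * sum_{i,j} |r_i - r_j|^2."""
    diff = r[:, None, :] - r[None, :, :]
    return np.sum(diff**2) / (2 * len(r)**2)

rng = np.random.default_rng(1)
positions = rng.normal(size=(50, 3))   # arbitrary monomer positions
rg2_def = radius_of_gyration_sq(positions)
rg2_pairs = radius_of_gyration_sq_pairs(positions)
print(rg2_def, rg2_pairs)              # the two values agree to rounding error
```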

Problem: An improvement to the freely-jointed chain model can be made if one assumes that \(\mathbf x_i\cdot\mathbf x_{i+1}=\ell^2\cos\theta\) for some fixed bond angle \(\theta\) (strictly speaking this is supplementary to what chemists usually call the “bond angle” \(\theta_{\text{chemist}}=\pi-\theta\)). However, the torsion angle \(\phi\in[0,2\pi]\) is assumed to be unrestricted (this model is called the freely rotating chain). In this case, show that for \(N\gg 1\):

\[\langle r^2\rangle\approx N\ell^2\frac{1+\cos\theta}{1-\cos\theta}\]

(note: \(\frac{1+\cos\theta}{1-\cos\theta}=\cot^2(\theta/2)\) is monotonically decreasing from a singularity at \(\theta=0\) to \(1\) at \(\theta=\pi/2\) to \(0\) at \(\theta=\pi\)). If furthermore the torsion angle \(\phi\) is non-uniform \(p(\phi)\neq 1/2\pi\) but instead subject to an achiral potential \(V(-\phi)=V(\phi)\) so that \(p(\phi)\sim e^{-\beta V(\phi)}\) (e.g. trans conformations being preferred over gauche), show that \(\langle r^2\rangle\) receives an additional correction factor of the same form as the above but with \(\cos\theta\mapsto\langle\cos\phi\rangle:=\int_0^{2\pi}d\phi p(\phi)\cos\phi\):

\[\langle r^2\rangle\approx N\ell^2\frac{1+\cos\theta}{1-\cos\theta}\frac{1+\langle\cos\phi\rangle}{1-\langle\cos\phi\rangle}\]

(this is called the hindered rotation model of the polymer chain).

Solution: The key lemma is to convince oneself that the average of any vector dotted with a circle of unit vectors in an arbitrary plane is zero. Once one realizes this, it is straightforward to check \(\langle\mathbf x_i\cdot\mathbf x_j\rangle=\ell^2\cos^{|i-j|}\theta\), and the geometric series may be summed and terms simplified in the \(N\gg 1\) limit to obtain the desired result.
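To see how quickly the geometric series approaches its \(N\gg 1\) form, one can sum \(\langle r^2\rangle=\sum_{i,j}\langle\mathbf x_i\cdot\mathbf x_j\rangle=\ell^2\sum_{i,j}\cos^{|i-j|}\theta\) exactly and compare. This is a minimal sketch; the value \(\theta=70.5^\circ\) (roughly the supplement of a tetrahedral bond angle) and the function names are illustrative assumptions.

```python
import numpy as np

def mean_r2_freely_rotating(n, ell, theta):
    """Exact <r^2> = ell^2 * sum_{i,j} cos(theta)^|i-j| for n bonds."""
    idx = np.arange(n)
    separation = np.abs(idx[:, None] - idx[None, :])
    return ell**2 * np.sum(np.cos(theta) ** separation)

def mean_r2_asymptotic(n, ell, theta):
    """Large-n limit: n * ell^2 * (1 + cos(theta)) / (1 - cos(theta))."""
    c = np.cos(theta)
    return n * ell**2 * (1 + c) / (1 - c)

n, ell, theta = 1000, 1.0, np.deg2rad(70.5)   # illustrative values
exact = mean_r2_freely_rotating(n, ell, theta)
approx = mean_r2_asymptotic(n, ell, theta)
rel_err = abs(exact - approx) / approx
print(exact, approx, rel_err)   # the relative error shrinks like 1/n
```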

Note that one can rewrite the \(2\)-point correlator:

\[\langle\mathbf x_i\cdot\mathbf x_j\rangle=\ell^2\cos^{|i-j|}\theta=\ell^2e^{-|i-j|/N_p}\]

where \(N_p:=-1/\ln\cos\theta\) is the persistence number, and \(\ell_p:=\ell N_p\) is the persistence length. For \(\theta\ll 1\), \(N_p\approx 2/\theta^2\) in which case one can take the Kratky-Porod limit \(\theta,\ell\to 0\) in such a way that \(\ell_p=2\ell/\theta^2\) is fixed. This effectively replaces the freely rotating model with a worm-like chain model in which the polymer is parameterized by a smooth trajectory \(\mathbf x(s)\) of its arc length \(s\) with unit tangent vector \(d\mathbf x/ds\) obeying:

\[\biggl\langle\frac{d\mathbf x}{ds}\cdot\frac{d\mathbf x}{ds'}\biggr\rangle=e^{-|s-s'|/\ell_p}\]

One can estimate \(\ell_p\) by modelling the worm-like chain as an Euler-Bernoulli beam! In that case, the polymer Hamiltonian functional is quadratic:

\[H[\theta(s)]=\frac{1}{2}EI\int_0^Lds\left(\frac{d\theta}{ds}\right)^2\]

hence by the equipartition theorem, the variance \(\langle\theta^2\rangle\) in the angular fluctuations is \(\frac{1}{2}EI\frac{\langle\theta^2\rangle}{\ell}=2\times\frac{1}{2}k_BT\), so equating \(\langle\theta^2\rangle=\theta^2\) from the deterministic freely rotating model, one finds the persistence length is set by the ratio of flexural rigidity to thermal energy:

\[\ell_p=\frac{EI}{k_BT}\]

and the mean-square end-to-end distance is now \(\langle r^2\rangle=\int_0^L ds'\int_0^L ds\, e^{-|s-s'|/\ell_p}=2\ell_p^2(e^{-L/\ell_p}-1+L/\ell_p)\) which in the limit \(\ell_p\ll L\) gives \(\langle r^2\rangle\approx 2\ell_pL\), i.e. a Kuhn length \(\ell_{\text{eff}}=2\ell_p\) twice the persistence length (in the opposite limit \(\ell_p\gg L\), one has the obvious \(\langle r^2\rangle\to L^2\)).
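The two limits of the worm-like chain result can be confirmed numerically. The sketch below evaluates \(2\ell_p^2(e^{-L/\ell_p}-1+L/\ell_p)\), using `np.expm1` to avoid cancellation when \(L\ll\ell_p\); the function name and the chosen ratios \(L/\ell_p\) are illustrative.

```python
import numpy as np

def wlc_mean_r2(L, lp):
    """Mean-square end-to-end distance of a worm-like chain."""
    # np.expm1(-L/lp) = exp(-L/lp) - 1, accurate even for small L/lp.
    return 2 * lp**2 * (np.expm1(-L / lp) + L / lp)

lp = 1.0
L_long, L_short = 1e4 * lp, 1e-2 * lp
flexible_ratio = wlc_mean_r2(L_long, lp) / (2 * lp * L_long)   # lp << L
rigid_ratio = wlc_mean_r2(L_short, lp) / L_short**2            # lp >> L
print(flexible_ratio, rigid_ratio)   # both close to 1
```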

For the hindered rotation model, the math is similar, just need to remember that \(\langle\sin\phi\rangle=0\) because \(V(-\phi)=V(\phi)\) is even.

Problem: Define the Kuhn length \(\ell_{\text{eff}}\) of an ideal polymer chain.

Solution: The Kuhn length of an ideal polymer chain is defined by:

\[\ell_{\text{eff}}:=\frac{\langle r^2\rangle}{L}\]

with \(L=N\ell\) the total chain length. This ensures that even if the polymer chain is not freely jointed, it may be approximated by an equivalent freely jointed chain whose “Kuhn monomers” are of length \(\ell_{\text{eff}}>\ell\). For instance, \(\ell_{\text{eff}}=\ell\cot^2(\theta/2)\) for the freely rotating chain.

Problem: Describe Flory’s mean-field theory.

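Solution: Flory’s argument balances the entropic elasticity of the ideal chain, \(F_{\text{el}}\sim k_BTr^2/N\ell^2\), against the excluded-volume repulsion between monomers, \(F_{\text{int}}\sim k_BTvN^2/r^3\), where \(v\) is the excluded volume per monomer. It is a mean-field theory because the repulsion is estimated from the average monomer density \(N/r^3\), ignoring correlations between monomers. Minimizing \(F_{\text{el}}+F_{\text{int}}\) with respect to \(r\) gives \(r^5\sim v\ell^2N^3\), so in a good solvent, \(r\sim N^{3/5}\) (the exact exponent is not \(3/5=0.6\) but actually \(\nu\approx 0.588\) arising from detailed RG flow analysis). In a \(\theta\)-solvent, \(r\sim N^{1/2}\), and in a poor solvent, \(r\sim N^{1/3}\).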


Magnetism

Problem: Define the \(2\) words in the phrase “ideal paramagnet”. Show that a classical ideal paramagnet of \(N\) spins each with the same fixed magnetic dipole moment \(\mu:=|\boldsymbol{\mu}|\) placed in a uniform external magnetic field \(B:=|\mathbf B|\) will develop a uniform non-zero induced average magnetization \(\langle M(B)\rangle=n\mu L(\beta\mu B)\) along the direction of \(\mathbf B\), where \(n\) is the number density of spins and the Langevin function is \(L(x):=\coth(x)-1/x\).

Solution: Ideal means the \(N\) spins are non-interacting with respect to each other (hence there’s no need to specify the precise geometric configuration of the \(N\) spins such as whether they’re on a lattice \(\Lambda\), etc.). Paramagnet imposes conditions on the first \(2\) Taylor expansion coefficients of \(\langle M(B)\rangle\) about \(B=0\):

  1. Non-magnetic/absence of spontaneous magnetization \(\langle M(B=0)\rangle=0\).
  2. Positive zero-field magnetic susceptibility \(\chi_{\mu}/\mu_0:=(\partial \langle M\rangle/\partial B)_{B=0}>0\).

For a single classical spin, the state space is \((\theta,\phi)\in S^2\) and the \(\phi\)-independent Hamiltonian is \(H(\theta)=-\boldsymbol{\mu}\cdot\mathbf B=-\mu B\cos\theta\), so the single-spin partition function is:

\[Z_1=\int_{-1}^1d\cos\theta\int_0^{2\pi}d\phi e^{-\beta H(\theta)}=4\pi\text{sinhc}\beta\mu B\]

where \(\text{sinhc}\,x:=\sinh x/x\). Thus, \(F_1=-k_BT\ln Z_1\) and \(\langle\mu_x\rangle=\langle\mu_y\rangle=0\) whereas:

\[\langle\mu_z\rangle=\int_{-1}^1d\cos\theta\int_0^{2\pi}d\phi\frac{e^{\beta\mu B\cos\theta}}{Z_1}\mu\cos\theta=\frac{\partial \ln Z_1}{\partial(\beta B)}=-\frac{\partial F_1}{\partial B}=\mu L(\beta\mu B)\]

and hence the result follows from \(\langle M(B)\rangle=n\langle\mu_z\rangle\) where \(n:=N/V\).

Problem: Show that in the high-temperature limit where \(x:=\beta\mu B\ll 1\), the Taylor expansion of the Langevin function about \(x=0\) is \(L(x)=x/3+O_{x\to 0}(x^3)\). Hence, derive Curie’s high-temperature ideal paramagnet law \(\chi_{\mu}=C/T\) for the zero-field magnetic susceptibility and state the value of the Curie constant \(C>0\).

Solution: One has:

\[L(x)=\coth x-\frac{1}{x}=\frac{\cosh x}{\sinh x}-\frac{1}{x}\approx\frac{1+x^2/2+…}{x+x^3/6+…}-\frac{1}{x}\]

\[=\frac{1}{x}\left(\frac{1+x^2/2}{1+x^2/6}-1\right)\approx\frac{1}{x}\left((1+x^2/2)(1-x^2/6)-1\right)=x/3+O_{x\to 0}(x^3)\]

Thus, it is straightforward to compute the Curie constant \(C=\mu_0 n\mu^2/3k_B\) for a classical ideal paramagnet.
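The small-\(x\) slope \(1/3\) and the saturation \(L(x)\to 1\) at large \(x\) are easy to confirm numerically. A minimal sketch (the test points \(x=10^{-2}\) and \(x=50\) are arbitrary choices):

```python
import numpy as np

def langevin(x):
    """Langevin function L(x) = coth(x) - 1/x, for x != 0."""
    x = np.asarray(x, dtype=float)
    return 1.0 / np.tanh(x) - 1.0 / x

x_small = 1e-2
ratio = langevin(x_small) / (x_small / 3)   # tends to 1 as x -> 0
saturated = langevin(50.0)                  # equals 1 - 1/50 up to ~e^{-100}
print(ratio, saturated)
```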

Problem: Repeat the above analysis but for a quantum ideal paramagnet in which all \(N\) spins have the same fixed total angular momentum quantum number \(j\in\{0,1/2,1,…\}\).

Solution: Now, for a single quantum spin with state vector in the manifold \(\mathbf C^{2j+1}\), the spectrum of its Hamiltonian is given by the weak-field Zeeman splitting \(E_{|j,m_j\rangle}=g_jm_j\mu_BB\) for Landé \(g\)-factor \(g_j=1+\frac{j(j+1)+s(s+1)-\ell(\ell+1)}{2j(j+1)}\) and the canonical partition function is a Dirichlet kernel:

\[Z_1=\sum_{m_j=-j}^je^{-\beta E_{|j,m_j\rangle}}=\frac{\sinh(j+1/2)g_j\beta\mu_BB}{\sinh(g_j\beta\mu_BB/2)}\]

Repeating the same steps as above, this time one finds:

\[\langle M(B)\rangle=ng_jj\mu_B B_j(\beta g_jj\mu_BB)\]

where the Brillouin function is defined by:

\[B_j(x):=\left(1+\frac{1}{2j}\right)\coth\left(1+\frac{1}{2j}\right)x-\frac{1}{2j}\coth\frac{x}{2j}\]

In particular, as \(j\to\infty\), \(2j+1\to\infty\) and one recovers the classical continuous angle \(\theta\in [0,\pi]\) and \(\lim_{j\to\infty}B_j(x)=L(x)\).

This is consistent with the Taylor expansion \(B_j(x)=\frac{j+1}{3j}x+O_{x\to 0}(x^3)\) which leads to the quantum Curie constant \(C=\mu_0ng_j^2j(j+1)\mu_B^2/3k_B\). Instead of looking at \(j\to\infty\), one can also take the quantum limit \(j=s=1/2\) and \(\ell=0\), in which case \(g_j=2\) and:

\[B_{1/2}(x)=2\coth 2x-\coth x=\tanh x\]

so one recovers the familiar \(2\)-level system average magnetization \(\langle M\rangle=n\mu_B\tanh\beta\mu_BB\) with Curie constant \(C=\mu_0n\mu_B^2/k_B\).
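The three claims about the Brillouin function (the \(j=1/2\) reduction to \(\tanh\), the \(j\to\infty\) Langevin limit, and the small-\(x\) slope \((j+1)/3j\)) can be spot-checked numerically. In this sketch, \(j=10^3\) stands in for the classical limit and \(j=3/2\) is an arbitrary test value.

```python
import numpy as np

def brillouin(j, x):
    """Brillouin function B_j(x)."""
    a = 1.0 + 1.0 / (2.0 * j)
    b = 1.0 / (2.0 * j)
    return a / np.tanh(a * x) - b / np.tanh(b * x)

def langevin(x):
    """Langevin function L(x) = coth(x) - 1/x."""
    return 1.0 / np.tanh(x) - 1.0 / x

x = np.linspace(0.1, 5.0, 50)
err_half = np.max(np.abs(brillouin(0.5, x) - np.tanh(x)))        # j = 1/2
err_classical = np.max(np.abs(brillouin(1e3, x) - langevin(x)))  # large j
slope = brillouin(1.5, 1e-3) / 1e-3     # compare with (j + 1) / (3 j) = 5/9
print(err_half, err_classical, slope)
```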

Problem: Explain why classical statistical mechanics (when applied consistently!) predicts \(\langle\mathbf M(\mathbf B)\rangle=\mathbf 0\) for all \(\mathbf B\) (this is called the Bohr-van Leeuwen theorem). Since the Langevin derivation used a classical stat mech approach yet was able to predict a nontrivial \(\mathbf M(\mathbf B)\), explain why it doesn’t violate the BvL theorem.

Solution: The BvL theorem follows mathematically from the fact that the canonical \(N\)-particle partition function:

\[Z(\beta,\mathbf B)=\frac{1}{h^{3N}N!}\int d^3\mathbf x_1…d^3\mathbf x_Nd^3\mathbf p_1…d^3\mathbf p_Ne^{-\beta H}\]

with \(H=\sum_{i=1}^N\frac{|\mathbf p_i-q_i\mathbf A(\mathbf x_i)|^2}{2m_i}+V(\mathbf x_1,…,\mathbf x_N)\) can instead (via change of variables) be integrated over the kinetic momenta \(m_i\mathbf v_i:=\mathbf p_i-q_i\mathbf A(\mathbf x_i)\) rather than the canonical momenta \(\mathbf p_i\) without incurring a Jacobian penalty \(\partial (m_1\mathbf v_1,…,m_N\mathbf v_N)/\partial(\mathbf p_1,…,\mathbf p_N)=1\), so \(Z(\beta,\mathbf B)=Z(\beta,\mathbf B=\mathbf 0)\) is \(\mathbf B\)-independent and hence \(F=-k_BT\ln Z\) is also \(\mathbf B\)-independent, leading to the BvL theorem \(\langle\mathbf M\rangle=-V^{-1}\partial F/\partial\mathbf B=\mathbf 0\).

By assuming that one could speak of a “fixed magnetic dipole moment \(\mu\)” for all the atoms, Langevin was implicitly quantizing the system since in hindsight \(\mu=g_jj\mu_B\) and \(j\) is quantized and fixed (indeed, making the replacement \(\mu\mapsto g_jj\mu_B\) in the Langevin magnetization maps in the \(j\to\infty\) limit directly onto the Brillouin magnetization). As a result, it is better to regard the Langevin result for \(\langle M(B)\rangle\) as a semi-classical formula rather than strictly belonging to classical physics (otherwise it would violate the BvL theorem!).


Physics-Informed Neural Networks

Problem: Train a physics-informed neural network (PINN) on both the van der Pol oscillator and the drift-free Fokker-Planck diffusion equation.

Solution:

report

Diffusion & Flow-Matching Models

Problem: State and prove Tweedie’s formula.

Solution: Tweedie’s formula asserts that if \(p(\mathbf x|\boldsymbol{\mu},\sigma)=\frac{1}{\det(\sqrt{2\pi}\sigma)}e^{-(\mathbf x-\boldsymbol{\mu})^T\sigma^{-2}(\mathbf x-\boldsymbol{\mu})/2}\) is normally distributed, then without needing to know anything about the prior \(p(\boldsymbol{\mu}|\sigma)\) on the mean random vector \(\boldsymbol{\mu}\), one has the following Bayesian point estimate for it:

\[\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle=\mathbf x+\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\]

At first glance, one might think that \(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle\approx\mathbf x\), but Tweedie’s formula provides the empirical Bayes correction \(\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\) to the naive estimate. The proof amounts to a brute force computation of the score vector field:

\[\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\frac{\partial p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}p(\boldsymbol{\mu}|\sigma)\frac{\partial p(\mathbf x|\boldsymbol{\mu},\sigma)}{\partial\mathbf x}\]

\[=-\frac{\sigma^{-2}}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}p(\boldsymbol{\mu}|\sigma)(\mathbf x-\boldsymbol{\mu})p(\mathbf x|\boldsymbol{\mu},\sigma)\]

\[=-\frac{\sigma^{-2}}{p(\mathbf x|\sigma)}\left(\mathbf xp(\mathbf x|\sigma)-\int d\boldsymbol{\mu}\boldsymbol{\mu}p(\mathbf x|\sigma)p(\boldsymbol{\mu}|\mathbf x,\sigma)\right)\]

\[=\sigma^{-2}(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle-\mathbf x)\]
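Although the formula needs no knowledge of the prior, it is instructive to check it in a case where the posterior mean is known in closed form. The sketch below assumes a one-dimensional Gaussian prior \(\mu\sim N(m,\tau^2)\) (an illustrative choice, as are the names `m`, `tau2`, `sigma2`), for which the marginal is \(N(m,\tau^2+\sigma^2)\) and the posterior mean is the usual precision-weighted average.

```python
import numpy as np

def marginal_score(x, m, tau2, sigma2):
    """Score of the marginal p(x) = N(m, tau2 + sigma2)."""
    return -(x - m) / (tau2 + sigma2)

def tweedie_posterior_mean(x, score, sigma2):
    """Tweedie's formula: <mu | x> = x + sigma^2 * score(x)."""
    return x + sigma2 * score

def conjugate_posterior_mean(x, m, tau2, sigma2):
    """Gaussian-Gaussian posterior mean (precision-weighted average)."""
    return (x / sigma2 + m / tau2) / (1 / sigma2 + 1 / tau2)

x = np.linspace(-3, 3, 7)
m, tau2, sigma2 = 1.0, 2.0, 0.5
lhs = tweedie_posterior_mean(x, marginal_score(x, m, tau2, sigma2), sigma2)
rhs = conjugate_posterior_mean(x, m, tau2, sigma2)
print(np.max(np.abs(lhs - rhs)))   # zero up to rounding error
```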

Problem: Show that the set of Gaussians is closed under both convolution and multiplication.

Solution: The convolution of two normalized Gaussians is also a normalized Gaussian representing the probability distribution of the sum of the two independent normal random variables (i.e. it is a kind of orthonormal bilinear transformation):

\[p(\mathbf x|\boldsymbol{\mu}_1,\sigma_1)*p(\mathbf x|\boldsymbol{\mu}_2,\sigma_2)=p(\mathbf x|\boldsymbol{\mu}_{1*2},\sigma_{1*2})\]

where \(\boldsymbol{\mu}_{1*2}=\boldsymbol{\mu}_1+\boldsymbol{\mu}_2\) and \(\sigma^2_{1*2}=\sigma^2_1+\sigma^2_2\) (thus, addition of independent normal random variables is isomorphic to vector addition in \((\boldsymbol{\mu},\sigma^2)\)-space).

The product of two normalized Gaussians is an (unnormalized) Gaussian (but one whose normalization constant itself has a Gaussian form):

\[p(\mathbf x|\boldsymbol{\mu}_1,\sigma_1)p(\mathbf x|\boldsymbol{\mu}_2,\sigma_2)=p(\boldsymbol{\mu}_1|\boldsymbol{\mu}_2,\sigma_{1*2})p(\mathbf x|\boldsymbol{\mu}_{12},\sigma_{12})\]

where \(\sigma^{-2}_{12}\boldsymbol{\mu}_{12}=\sigma^{-2}_1\boldsymbol{\mu}_1+\sigma^{-2}_2\boldsymbol{\mu}_2\) and \(\sigma^{-2}_{12}=\sigma^{-2}_1+\sigma^{-2}_2\).
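The product identity can be verified pointwise in one dimension. In this sketch the helper `gauss` takes a variance rather than a standard deviation, and the numerical parameter values are arbitrary.

```python
import numpy as np

def gauss(x, mu, var):
    """Normalized 1D Gaussian density with mean mu and variance var."""
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu1, var1, mu2, var2 = 0.3, 0.5, -1.2, 2.0
var12 = 1 / (1 / var1 + 1 / var2)             # precisions add
mu12 = var12 * (mu1 / var1 + mu2 / var2)      # precision-weighted mean

x = np.linspace(-5, 5, 11)
lhs = gauss(x, mu1, var1) * gauss(x, mu2, var2)
rhs = gauss(mu1, mu2, var1 + var2) * gauss(x, mu12, var12)
print(np.max(np.abs(lhs - rhs)))   # zero up to rounding error
```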

(aside: the properties above are of course tied by the fact that Gaussians are (roughly speaking) eigenfunctions of the Fourier transform:

\[\int d\mathbf xe^{-i\mathbf k\cdot\mathbf x}p(\mathbf x|\boldsymbol{\mu},\sigma)=e^{-i\boldsymbol{\mu}\cdot\mathbf k-\mathbf k^T\sigma^{2}\mathbf k/2}\]

which intertwines convolution and multiplication).

Problem: What is the fundamental problem that diffusion models solve?

Solution: Given some prior data distribution \(p(\mathbf x_0)\) (e.g. the distribution of images or audio), and given some latent \(\mathbf x_T\) which does not lie in the support of \(p\) in the sense that \(p(\mathbf x_T)\approx 0\), a diffusion model is any (possibly non-deterministic) algorithm for transporting \(\mathbf x_T\) onto the submanifold of latents \(\mathbf x_0\) on which \(p\) is supported, i.e. \(p(\mathbf x_0)>0\). Roughly speaking, a diffusion model is any parametric ansatz for the score vector field \(\partial\ln p(\mathbf x_0)/\partial\mathbf x_0\) of the data distribution, so one can get from \(\mathbf x_T\mapsto\mathbf x_0\) by taking steps along the score vector field. Diffusion models are commonly used as generative models in the sense that the latent \(\mathbf x_0\) represents a novel sample from \(p(\mathbf x_0)\), though other non-generative applications (e.g. reconstruction) do exist.

Problem: Explain the problem that denoising diffusion probabilistic models (DDPMs) solve, and contrast how they are used during inference vs. how they are trained.

Solution: DDPMs are generative diffusion models. At inference time, one begins by sampling some noisy latent \(\mathbf x_T\) where \(p(\mathbf x_T)\approx 0\) and iteratively denoises it through a sequence \(\mathbf x_T,…,\mathbf x_0\) towards some latent \(\mathbf x_0\) on the data submanifold where \(p(\mathbf x_0)>0\), thereby generating the sample \(\mathbf x_0\). More explicitly, the sequence \(\mathbf x_T,…,\mathbf x_0\) will turn out to be a discrete-time Markov chain, and the operation of “denoising” \(\mathbf x_t\mapsto\mathbf x_{t-1}\) will really just amount to sampling from the Markov chain’s transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)=\int d\mathbf x_0p(\mathbf x_0|\mathbf x_t)p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\). As it is, this is intractable, but focus on that second term in the integrand:

\[p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\propto p(\mathbf x_t|\mathbf x_{t-1},\mathbf x_0)p(\mathbf x_{t-1}|\mathbf x_0)\]

Both of these are essentially transition kernels for the forward Markov chain \(\mathbf x_0,…,\mathbf x_T\). At this point, the DDPM paper decides to use a Gaussian forward transition kernel:

\[p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\]

where \(0<\sigma^2_1<\sigma^2_2<…<\sigma^2_T\ll 1\) are just \(T\) fixed hyperparameters (called a variance schedule, where the horizon \(T\) is also a hyperparameter). Equivalently, using the reparameterization trick, adding Gaussian noise looks like:

\[\mathbf x_t=\sqrt{1-\sigma_t^2}\mathbf x_{t-1}+\sigma_t\boldsymbol{\varepsilon}\]

where \(\boldsymbol{\varepsilon}\) is drawn from an isotropic standard normal \(p(\boldsymbol{\varepsilon})\sim e^{-|\boldsymbol{\varepsilon}|^2/2}\). The \(\sqrt{1-\sigma_t^2}\) factor is used to scale down the previous state \(\mathbf x_{t-1}\) in order to prevent the variance (i.e. diagonal entries of the covariance matrix) of \(\mathbf x_t\) from exploding (indeed, the forward chain is exactly variance-preserving iff the data distribution has covariance \(\sigma^2_{\mathbf x_0}=1\)); cf. Brownian motion of a particle attached to the origin by a spring.

Rather than iteratively stepping from \(\mathbf x_0\) to \(\mathbf x_t\) in \(O(t)\) samples, one can jump in \(O(1)\) time from \(\mathbf x_0\mapsto\mathbf x_t\) via just a single \(\boldsymbol{\varepsilon}\)-sample:

\[\mathbf x_t=\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\]

Or equivalently:

\[p(\mathbf x_t|\mathbf x_0)\sim e^{-|\mathbf x_t-\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0|^2/2(1-(1-\sigma^2_1)…(1-\sigma_t^2))}\]
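The closed-form jump can be checked by propagating the mean coefficient and the variance of \(\mathbf x_t\) given \(\mathbf x_0\) one step at a time and comparing with the cumulative product. This is a sketch in one dimension; the linear schedule values and the helper name `q_sample` are assumptions for illustration.

```python
import numpy as np

T = 1000
sigma2 = np.linspace(1e-4, 0.02, T)     # illustrative variance schedule
alpha_bar = np.cumprod(1 - sigma2)      # (1 - sigma_1^2)...(1 - sigma_t^2)

# Iterate x_t = sqrt(1 - sigma_t^2) x_{t-1} + sigma_t eps, tracking
# x_t = coeff * x_0 + (zero-mean Gaussian of variance var).
coeff, var = 1.0, 0.0
coeffs, variances = [], []
for s2 in sigma2:
    coeff = np.sqrt(1 - s2) * coeff
    var = (1 - s2) * var + s2
    coeffs.append(coeff)
    variances.append(var)
coeffs, variances = np.array(coeffs), np.array(variances)

def q_sample(x0, t, rng):
    """Jump straight from x_0 to x_t (t is 1-indexed) with one noise draw."""
    eps = rng.normal(size=np.shape(x0))
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    return x_t, eps

print(np.allclose(coeffs, np.sqrt(alpha_bar)), np.allclose(variances, 1 - alpha_bar))
```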

Thus, the “oracle posterior” \(p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\sim e^{-|\mathbf x_{t-1}-\boldsymbol{\mu}_{t-1}|^2/2\tilde{\sigma}^2_{t-1}}\) is an isotropic Gaussian with:

\[\tilde{\sigma}^{-2}_{t-1}=\frac{1-\sigma_t^2}{\sigma_t^2}+\frac{1}{1-(1-\sigma_1^2)…(1-\sigma^2_{t-1})}\Rightarrow \tilde{\sigma}^2_{t-1}=\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\sigma^2_t\approx\sigma^2_t\]

\[\tilde{\sigma}^{-2}_{t-1}\boldsymbol{\mu}_{t-1}=\frac{1-\sigma_t^2}{\sigma_t^2}\frac{\mathbf x_t}{\sqrt{1-\sigma_t^2}}+\frac{\sqrt{(1-\sigma_1^2)…(1-\sigma^2_{t-1})}}{1-(1-\sigma^2_1)…(1-\sigma^2_{t-1})}\mathbf x_0\]

\[=\frac{\sqrt{1-\sigma_t^2}}{\sigma_t^2}\mathbf x_t+\frac{\sqrt{(1-\sigma_1^2)…(1-\sigma^2_{t-1})}}{1-(1-\sigma^2_1)…(1-\sigma^2_{t-1})}\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\right)\]

\[=\frac{1}{\tilde{\sigma}^2_{t-1}\sqrt{1-\sigma^2_t}}\mathbf x_t-\frac{\sigma_t^2}{\tilde{\sigma}^2_{t-1}\sqrt{(1-\sigma_t^2)(1-(1-\sigma_1^2)…(1-\sigma_t^2))}}\boldsymbol{\varepsilon}\]

Hence:

\[\boldsymbol{\mu}_{t-1}=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\boldsymbol{\varepsilon}\right)\]
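The algebra above can be spot-checked with scalars: compute the posterior precision and mean from the product of the two Gaussians in \(\mathbf x_{t-1}\), then compare with the closed forms for \(\tilde{\sigma}^2_{t-1}\) and \(\boldsymbol{\mu}_{t-1}\). In this sketch the schedule, the step \(t=500\) and the random draws are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = np.linspace(1e-4, 0.02, 1000)   # illustrative variance schedule
alpha_bar = np.cumprod(1 - sigma2)

t = 500                                   # 1-indexed time step
s2_t, ab_t, ab_prev = sigma2[t - 1], alpha_bar[t - 1], alpha_bar[t - 2]
x0, eps = rng.normal(size=2)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps

# Precision-weighted combination of p(x_t | x_{t-1}) and p(x_{t-1} | x_0).
precision = (1 - s2_t) / s2_t + 1 / (1 - ab_prev)
mean_from_x0 = (np.sqrt(1 - s2_t) / s2_t * x_t
                + np.sqrt(ab_prev) / (1 - ab_prev) * x0) / precision

# Closed forms in terms of x_t and the injected noise eps.
var_closed = (1 - ab_prev) / (1 - ab_t) * s2_t
mean_from_eps = (x_t - s2_t / np.sqrt(1 - ab_t) * eps) / np.sqrt(1 - s2_t)
print(1 / precision, var_closed, mean_from_x0, mean_from_eps)
```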

Although by design the forward Markov chain has Gaussian transition kernels \(p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\), the reverse transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)\) is in general non-Gaussian. However, a theorem of Anderson guarantees that it is in fact Gaussian at the level of stochastic differential equations, and so it is not a bad approximation to simply sample \(\mathbf x_{t-1}\) from the Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t,\mathbf x_0)\) as a proxy for sampling from the true, non-Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t)\). Reparameterized, this means that inference looks like unannealed Langevin dynamics (ULD):

\[\mathbf x_{t-1}=\boldsymbol{\mu}_{t-1}+\tilde{\sigma}_{t-1}\tilde{\boldsymbol{\varepsilon}}\]

\[=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\boldsymbol{\varepsilon}\right)+\sqrt{\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\sigma_t\tilde{\boldsymbol{\varepsilon}}\]

The only problem here is that the Gaussian noise \(\boldsymbol{\varepsilon}\) that was injected to get from \(\mathbf x_0\mapsto\mathbf x_t\) is unknown.

This is where training comes in. Training consists of \(3\) distinct sampling steps and assembling the samples:

  1. Sample \(\mathbf x_0\)
  2. Sample \(t\in\{1,…,T\}\)
  3. Sample \(\boldsymbol{\varepsilon}\)
  4. Hence, compute \(\mathbf x_t=\sqrt{(1-\sigma_1^2)…(1-\sigma_t^2)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\boldsymbol{\varepsilon}\)

Architecturally, the diffusion model itself is then a ResNet-like noise predictor \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) (e.g. a U-Net or a transformer, etc.) that seeks to predict the sampled Gaussian noise \(\boldsymbol{\varepsilon}\) that was added to \(\mathbf x_0\) to obtain \(\mathbf x_t\). This is enforced by the MSE loss function \(L(\hat{\boldsymbol{\varepsilon}},\boldsymbol{\varepsilon})=|\hat{\boldsymbol{\varepsilon}}-\boldsymbol{\varepsilon}|^2/2\) with corresponding cost function over the training set:

\[C_{\text{tr}}(\boldsymbol{\theta})=\frac{1}{N_{\text{tr}}}\sum_{i=1}^{N_{\text{tr}}}L(\hat{\boldsymbol{\varepsilon}}(\mathbf x_{t_i},t_i|\boldsymbol{\theta}),\boldsymbol{\varepsilon}_i)\]

The upshot is that the inference loop Markov chain dynamics are governed (for \(t=T,T-1,…,2\)) by:

\[\mathbf x_{t-1}=\frac{1}{\sqrt{1-\sigma^2_t}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\right)+\sqrt{\frac{1-(1-\sigma_1^2)…(1-\sigma_{t-1}^2)}{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\sigma_t\tilde{\boldsymbol{\varepsilon}}\]

and for \(t=1\), the final generated sample \(\mathbf x_0=\frac{1}{\sqrt{1-\sigma^2_1}}\left(\mathbf x_1-\sigma_1\hat{\boldsymbol{\varepsilon}}(\mathbf x_1,1|\boldsymbol{\theta})\right)\) should not have any noise added to it.
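The training-example construction and the inference loop can be sketched without a neural network by using a toy case where the optimal noise predictor is known. Assuming (purely for illustration) one-dimensional standard-normal data, every \(\mathbf x_t\) is also standard normal and \(\langle\boldsymbol{\varepsilon}|\mathbf x_t\rangle=\sqrt{1-(1-\sigma_1^2)…(1-\sigma_t^2)}\,\mathbf x_t\), so the hypothetical `eps_hat_exact` below stands in for a trained \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\). The schedule values are likewise assumptions.

```python
import numpy as np

T = 1000
sigma2 = np.linspace(1e-4, 0.02, T)      # illustrative variance schedule
alpha_bar = np.cumprod(1 - sigma2)

def training_example(x0, rng):
    """Steps 2-4 of training: sample t and eps, then assemble x_t."""
    t = rng.integers(1, T + 1)
    eps = rng.normal(size=np.shape(x0))
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    return x_t, t, eps

def mse_loss(eps_pred, eps):
    """Per-example loss |eps_pred - eps|^2 / 2."""
    return 0.5 * np.sum((eps_pred - eps)**2)

def eps_hat_exact(x_t, t):
    """Stand-in for a trained network, valid only for N(0, 1) data."""
    return np.sqrt(1 - alpha_bar[t - 1]) * x_t

def ddpm_sample(eps_hat, n_samples, rng):
    """Run the reverse Markov chain from t = T down to t = 1."""
    x = rng.normal(size=n_samples)       # x_T drawn from N(0, 1)
    for t in range(T, 0, -1):
        s2, ab = sigma2[t - 1], alpha_bar[t - 1]
        mean = (x - s2 / np.sqrt(1 - ab) * eps_hat(x, t)) / np.sqrt(1 - s2)
        if t == 1:
            return mean                  # no noise added on the final step
        var = (1 - alpha_bar[t - 2]) / (1 - ab) * s2
        x = mean + np.sqrt(var) * rng.normal(size=n_samples)

rng = np.random.default_rng(0)
samples = ddpm_sample(eps_hat_exact, 20000, rng)
print(samples.mean(), samples.var())     # close to 0 and 1 for this toy case
```

In a real model, `eps_hat_exact` would be replaced by the trained network and `mse_loss` would be averaged over many calls to `training_example`.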

Problem: In DDPMs, explain how estimating the noise \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) is synonymous with estimating the score vector field of the data distribution \(\partial\ln p(\mathbf x_t)/\partial\mathbf x_t\).

Solution: Because \(\mathbf x_t\) is normally distributed about \(\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0\) with covariance \(1-(1-\sigma_1^2)…(1-\sigma^2_t)\), Tweedie’s formula asserts that:

\[\langle\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0|\mathbf x_t\rangle=\mathbf x_t+(1-(1-\sigma_1^2)…(1-\sigma^2_t))\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}\]

Comparing this with \(\mathbf x_t=\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\), one concludes that:

\[\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}=-\frac{\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})}{\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}}\]

Problem: Explain how the DDPM architecture can be reformulated in terms of a stochastic differential equation, and (following work of Song et al.) deduce its distributionally equivalent probability flow ODE.

Solution: Recall the DDPM forward Markov chain transition kernel (rewritten using \(i\in\{0,…,T\}\) instead of \(t\) because in a moment \(t\in [0,T]\) will be reserved for a continuous variable).

\[\mathbf x_i=\sqrt{1-\sigma_i^2}\mathbf x_{i-1}+\sigma_i\boldsymbol{\varepsilon}\]

Taking the continuous time limit of this discrete difference equation amounts to letting \(T\to\infty\) while \(\sigma^2_i\to 0\) in such a way that their product \(T\sigma^2_i:=\sigma^2(t)\) is fixed (cf. the dipole limit in electrostatics which takes \(q\to\infty\) and \(\Delta\mathbf x\to\mathbf 0\) such that \(\boldsymbol{\pi}:=q\Delta\mathbf x\) is fixed). Hence, writing \(dt:=1/T\):

\[\mathbf x_i=\sqrt{1-\sigma^2(t)dt}\mathbf x_{i-1}+\sigma(t)\sqrt{dt}\boldsymbol{\varepsilon}\]

So after binomial expanding, isolating \(d\mathbf x:=\mathbf x_i-\mathbf x_{i-1}\), and recognizing the Wiener process \(d\mathbf w:=\sqrt{dt}\boldsymbol{\varepsilon}\):

\[d\mathbf x=-\frac{\sigma^2(t)}{2}\mathbf xdt+\sigma(t)d\mathbf w\]

This particular SDE is also an instance of an Ornstein-Uhlenbeck process, but is special in that it is also variance-preserving. The corresponding Fokker-Planck probability current density is \(-\frac{\sigma^2(t)}{2}\left(p(\mathbf x,t)\mathbf x+\frac{\partial p}{\partial\mathbf x}\right):=p(\mathbf x,t)\mathbf v_{\text{eff}}(\mathbf x,t)\) which is distributionally equivalent to the ODE \(d\mathbf x=\mathbf v_{\text{eff}}dt\) with effective velocity field:

\[\mathbf v_{\text{eff}}(\mathbf x,t):=-\frac{\sigma^2(t)}{2}\left(\mathbf x+\frac{\partial \ln p}{\partial\mathbf x}\right)\]

Hence, the SDE admits a corresponding probability flow ODE \(d\mathbf x/dt=\mathbf v_{\text{eff}}(\mathbf x,t)\) which can be integrated during inference using standard numerical algorithms (e.g. Euler, Runge-Kutta); recall the score vector field \(\partial\ln p/\partial\mathbf x\) is what the diffusion model estimates via \(\hat{\boldsymbol{\varepsilon}}(\mathbf x,t|\boldsymbol{\theta})\).
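A toy case where the score is known makes the probability flow ODE concrete. Assume (for illustration only) one-dimensional data \(x_0\sim N(0,v_0)\) and constant \(\sigma^2(t)=1\); then \(p(x,t)\) is \(N(0,v(t))\) with \(v(t)=1+(v_0-1)e^{-t}\), the score is \(-x/v(t)\), and the ODE simply rescales \(x\) by \(\sqrt{v(t)/v_0}\). The sketch below integrates the ODE with explicit Euler in both directions; all names are hypothetical.

```python
import numpy as np

def variance(t, v0, sigma2=1.0):
    """Var[x(t)] under dx = -(sigma2/2) x dt + sqrt(sigma2) dw, Var[x(0)] = v0."""
    return 1.0 + (v0 - 1.0) * np.exp(-sigma2 * t)

def v_eff(x, t, v0, sigma2=1.0):
    """Probability-flow velocity; the score of N(0, v(t)) is -x / v(t)."""
    score = -x / variance(t, v0, sigma2)
    return -0.5 * sigma2 * (x + score)

def integrate_ode(x, t_start, t_end, v0, n_steps=2000):
    """Explicit Euler; runs forward (t_end > t_start) or backward in time."""
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + dt * v_eff(x, t, v0)
        t += dt
    return x

v0, T = 0.25, 5.0
x_T = integrate_ode(1.0, 0.0, T, v0)      # signal -> noise
x_back = integrate_ode(x_T, T, 0.0, v0)   # noise -> signal
expected = np.sqrt(variance(T, v0) / v0)  # exact rescaling factor
print(x_T, expected, x_back)
```

The deterministic map stretches a sample from the narrow data distribution out to (nearly) unit variance, and running it backward returns approximately to the starting point.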

Problem: Hence, in light of the above discussion, explain and motivate flow matching models, and describe the simplest instance of it (i.e. rectified flow).

Solution: Roughly speaking, diffusion models are trained to stochastically map signal to noise. At inference, one is therefore obliged to denoise along these stochastic trajectories in order to map from noise back to signal. But this is a bit like shooting oneself in the foot, unnecessarily making one’s life difficult with no ostensible gain. Flow matching models are (loosely speaking) what one gets after cutting through the fluff of diffusion models with Occam’s razor, replacing SDEs with ODEs (indeed, this is w.l.o.g. thanks to the probability flow ODE construction).

Intuitively, the simplest possible map \(\mathbf x_0\mapsto\mathbf x_1\) one could engineer is a simple linear interpolation for \(t\in [0,1]\):

\[\mathbf x_t=t\mathbf x_1+(1-t)\mathbf x_0\]

(aside: in the flow matching literature, the convention about whether \(\mathbf x_0\) represents signal vs. noise is sometimes reversed compared with the diffusion literature, but here it is being assumed that \(\mathbf x_0\) represents signal as in diffusion models). The relevant velocity field to be learned is thus \(d\mathbf x_t/dt=\mathbf x_1-\mathbf x_0\). A flow-matching model thus no longer seeks to estimate a score vector field, but rather a velocity vector field \(\hat{\mathbf v}(\mathbf x,t|\boldsymbol{\theta})\) via an MSE loss function \(L(\hat{\mathbf v},\mathbf v):=|\hat{\mathbf v}-\mathbf v|^2/2\) and training cost function:

\[C_{\text{tr}}(\boldsymbol{\theta})=\frac{1}{N_{\text{tr}}}\sum_{i=1}^{N_{\text{tr}}}L(\hat{\mathbf v}(\mathbf x_{t_i},t_i|\boldsymbol{\theta}),\mathbf x_1-\mathbf x_0)\]

A key advantage of learning such a simple rectified flow (whose straight-line interpolation paths are sometimes loosely described as optimal-transport paths) is that inference is very easy, and can be achieved with very few steps since the trajectories to be followed are close to straight lines.
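As a toy illustration of training targets and Euler inference, assume one-dimensional signal \(x_0\sim N(m,s^2)\) and noise \(x_1\sim N(0,1)\), drawn independently. In that case the minimizer of the cost, \(\langle\mathbf x_1-\mathbf x_0|\mathbf x_t\rangle\), is available in closed form by Gaussian conditioning, so the hypothetical `v_exact` below stands in for a trained \(\hat{\mathbf v}(\mathbf x,t|\boldsymbol{\theta})\). All names and parameter values are assumptions.

```python
import numpy as np

m, s = 2.0, 0.5    # toy signal distribution x_0 ~ N(m, s^2)

def training_example(rng):
    """Return (x_t, t, target velocity x_1 - x_0) for one random pair."""
    x0 = m + s * rng.normal()
    x1 = rng.normal()
    t = rng.uniform()
    return t * x1 + (1 - t) * x0, t, x1 - x0

def v_exact(x, t):
    """Stand-in for a trained network: E[x_1 - x_0 | x_t = x] for this toy case."""
    var_t = t**2 + (1 - t)**2 * s**2      # Var[x_t]
    cov = t - (1 - t) * s**2              # Cov[x_1 - x_0, x_t]
    return -m + cov / var_t * (x - (1 - t) * m)

def generate(v_hat, n_samples, n_steps, rng):
    """Euler-integrate dx/dt = v_hat from t = 1 (noise) down to t = 0 (signal)."""
    x = rng.normal(size=n_samples)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        x = x - dt * v_hat(x, t)
    return x

rng = np.random.default_rng(0)
samples = generate(v_exact, 20000, 100, rng)
print(samples.mean(), samples.std())      # close to m and s
```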


Information Geometry

Problem: Let \(\boldsymbol{\Theta}\) be a smooth statistical manifold, and let \(D:\boldsymbol{\Theta}^2\to [0,\infty)\) be a smooth function. What does it mean for \((\boldsymbol{\Theta},D)\) to be a “divergence manifold”?

Solution: The notion of a divergence manifold relaxes the axioms of a metric space, specifically still demanding \(D(\boldsymbol{\theta} || \boldsymbol{\theta}')\geq 0\) and \(D(\boldsymbol{\theta} || \boldsymbol{\theta}')=0\Leftrightarrow\boldsymbol{\theta}'=\boldsymbol{\theta}\) for all \(\boldsymbol{\theta},\boldsymbol{\theta}'\in\boldsymbol{\Theta}\) but no longer enforcing symmetry \(D(\boldsymbol{\theta} || \boldsymbol{\theta}')=D(\boldsymbol{\theta}' || \boldsymbol{\theta})\) or the triangle inequality \(D(\boldsymbol{\theta} || \boldsymbol{\theta}^{\prime\prime})\leq D(\boldsymbol{\theta} || \boldsymbol{\theta}') + D(\boldsymbol{\theta}' || \boldsymbol{\theta}^{\prime\prime})\).

Problem: Explain how any divergence manifold \((\boldsymbol{\Theta},D)\) enjoys the free gift of being automatically equipped with a canonical Riemannian metric tensor field \(g_D(\boldsymbol{\theta}):T_{\boldsymbol{\theta}}(\boldsymbol{\Theta})^2\to\mathbf R\) induced by the divergence function \(D\).

Solution: The so-called Fisher information metric \(g_D(\boldsymbol{\theta})=(g_D)_{ij}(\boldsymbol{\theta})d\theta_i\otimes d\theta_j\) on the statistical manifold \(\boldsymbol{\Theta}\) induced by the divergence \(D\) is basically just its Hessian:

\[(g_D)_{ij}(\boldsymbol{\theta}):=\left(\frac{\partial^2 D(\boldsymbol{\theta}||\boldsymbol{\theta}’)}{\partial\theta’_i\partial\theta’_j}\right)_{\boldsymbol{\theta}’=\boldsymbol{\theta}}\]

Intuitively, one can think of the divergence \(D(\boldsymbol{\theta}||\boldsymbol{\theta}’)\) of \(\boldsymbol{\theta}’\in\boldsymbol{\Theta}\) from \(\boldsymbol{\theta}\in\boldsymbol{\Theta}\) as holding the “ground truth” distribution \(\boldsymbol{\theta}\) fixed while sniffing around for a proxy distribution \(\boldsymbol{\theta}’\) to approximate \(\boldsymbol{\theta}\). By the axioms of a divergence manifold, the global minimum value \(D(\boldsymbol{\theta}||\boldsymbol{\theta}’)=0\) is attained at \(\boldsymbol{\theta}’=\boldsymbol{\theta}\), so Taylor expanding about this global minimum (\((\partial D(\boldsymbol{\theta}||\boldsymbol{\theta}’)/\partial\boldsymbol{\theta}’)_{\boldsymbol{\theta}’=\boldsymbol{\theta}}=\mathbf 0\)), one has the local quadratic form:

\[D(\boldsymbol{\theta}||\boldsymbol{\theta}+d\boldsymbol{\theta})\approx\frac{1}{2}(g_D)_{ij}(\boldsymbol{\theta})d\theta_i d\theta_j\]
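As a numerical sanity check of the Hessian definition, the following sketch takes \(D\) to be the KL divergence on the one-parameter Bernoulli family (an illustrative choice, not one made above) and compares the finite-difference Hessian with the textbook Fisher information \(1/\theta(1-\theta)\). The function names are hypothetical.

```python
import math

def kl_bernoulli(theta, theta_p):
    """D_KL(theta || theta') for Bernoulli distributions with success probabilities theta, theta'."""
    return (theta * math.log(theta / theta_p)
            + (1 - theta) * math.log((1 - theta) / (1 - theta_p)))

def induced_metric(divergence, theta, h=1e-4):
    """Second derivative of D(theta || theta') in theta' at theta' = theta (central difference)."""
    return (divergence(theta, theta + h) - 2 * divergence(theta, theta)
            + divergence(theta, theta - h)) / h**2

theta = 0.3
g = induced_metric(kl_bernoulli, theta)
fisher = 1 / (theta * (1 - theta))   # textbook Fisher information of a Bernoulli variable

# Local quadratic form: D(theta || theta + dtheta) is close to g dtheta^2 / 2
dtheta = 1e-3
quadratic = 0.5 * g * dtheta**2
exact = kl_bernoulli(theta, theta + dtheta)
```

The two numbers `g` and `fisher` agree to within the finite-difference error, and `quadratic` approximates `exact` up to cubic corrections in `dtheta`.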

Problem: Let \(f:(0,\infty)\to\mathbf R\) be a nonlinear convex function with \(f(1)=0\). Define the family of \(f\)-divergences \(D_f:\boldsymbol{\Theta}^2\to [0,\infty)\), prove that they do indeed satisfy the axioms of a divergence manifold, and exhibit some examples of functions \(f\) and the corresponding \(f\)-divergence \(D_f\).

Solution: One has:

\[D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x p(\mathbf x|\boldsymbol{\theta}’)f\left(\frac{p(\mathbf x|\boldsymbol{\theta})}{p(\mathbf x|\boldsymbol{\theta}’)}\right)\]

By rewriting this as an expectation \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)=\langle f\left(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)\right)\rangle_{\mathbf x\sim p(\mathbf x|\boldsymbol{\theta}’)}\), applying Jensen’s inequality, using \(\int d\mathbf x p(\mathbf x|\boldsymbol{\theta})=1\) and finally using \(f(1)=0\), one establishes non-negativity \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)\geq 0\). The same \(f(1)=0\) condition also ensures \(\boldsymbol{\theta}=\boldsymbol{\theta}’\Rightarrow D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)=0\), and the converse is proven by recalling that Jensen’s inequality becomes an equality iff \(f\) is linear (forbidden by hypothesis) or its argument \(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)=\text{constant}\) (strictly speaking, this step needs \(f\) to be strictly convex at \(x=1\), not merely nonlinear, which holds for all the examples below). But since the expectation of this constant in \(\mathbf x\sim p(\mathbf x|\boldsymbol{\theta}’)\) was \(1\), the constant itself must be \(1\), Q.E.D.

  • \(f(x):=x\ln x\) generates the asymmetric Kullback-Leibler (KL) divergence \[D_{\text{KL}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x p(\mathbf x|\boldsymbol{\theta})\ln\frac{p(\mathbf x|\boldsymbol{\theta})}{p(\mathbf x|\boldsymbol{\theta}’)}\]
  • \(f(x):=-\ln x\) generates the dual KL divergence \(D_{\text{KL}}(\boldsymbol{\theta}’||\boldsymbol{\theta})\) (in general, \(D_{xf(1/x)}(\boldsymbol{\theta}||\boldsymbol{\theta}’)=D_{f(x)}(\boldsymbol{\theta}’||\boldsymbol{\theta})\))
  • \(f(x):=|x-1|/2\) generates the total variation divergence \[d_{\text{TV}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{1}{2}\int d\mathbf x|p(\mathbf x|\boldsymbol{\theta})-p(\mathbf x|\boldsymbol{\theta}’)|\] (indeed, TVD is actually symmetric and obeys the triangle inequality so is a metric in the metric space sense, yet does not admit a Fisher information metric due to the non-differentiability of the absolute value).
  • \(f(x):=(\sqrt{x}-1)^2/2\) generates the symmetric squared Hellinger divergence \[d^2_H(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{1}{2}\int d\mathbf x\left(\sqrt{p(\mathbf x|\boldsymbol{\theta})}-\sqrt{p(\mathbf x|\boldsymbol{\theta}’)}\right)^2=1-\int d\mathbf x\sqrt{p(\mathbf x|\boldsymbol{\theta})p(\mathbf x|\boldsymbol{\theta}’)}\]
  • \(f(x):=(x-1)^2\) generates the Pearson \(\chi^2\) divergence \[D_{\chi^2}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x\frac{(p(\mathbf x|\boldsymbol{\theta})-p(\mathbf x|\boldsymbol{\theta}’))^2}{p(\mathbf x|\boldsymbol{\theta}’)}\] and \(f(x)=(x-1)^2/x\) generates the dual divergence (called the Neyman \(\chi^2\) divergence) in which one replaces \(p(\mathbf x|\boldsymbol{\theta}’)\mapsto p(\mathbf x|\boldsymbol{\theta})\) in the denominator.
  • \(f(x):=\frac{1}{2}\left(x\ln x-(x+1)\ln\frac{x+1}{2}\right)\) generates the symmetric Jensen-Shannon divergence \[d^2_{\text{JS}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{D_{\text{KL}}(p_{\boldsymbol{\theta}}||\frac{p_{\boldsymbol{\theta}}+p_{\boldsymbol{\theta}’}}{2})+D_{\text{KL}}(p_{\boldsymbol{\theta}’}||\frac{p_{\boldsymbol{\theta}}+p_{\boldsymbol{\theta}’}}{2})}{2}\]
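The generating functions listed above can be checked numerically on discrete distributions, where the integral becomes a sum. This is a minimal sketch; the helper `f_divergence` and the two example distributions are hypothetical, and strictly positive probabilities are assumed so that every \(f\) is evaluated inside \((0,\infty)\).

```python
import numpy as np

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions with p, q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

generators = {
    "KL":        lambda x: x * np.log(x),
    "dual KL":   lambda x: -np.log(x),
    "TV":        lambda x: np.abs(x - 1) / 2,
    "Hellinger": lambda x: (np.sqrt(x) - 1) ** 2 / 2,
    "Pearson":   lambda x: (x - 1) ** 2,
}

p = np.array([0.5, 0.3, 0.2])   # plays the role of p(x|theta)
q = np.array([0.2, 0.5, 0.3])   # plays the role of p(x|theta')

values = {name: f_divergence(f, p, q) for name, f in generators.items()}

# Closed forms quoted in the list, evaluated directly for comparison
direct = {
    "KL":        float(np.sum(p * np.log(p / q))),
    "dual KL":   float(np.sum(q * np.log(q / p))),
    "TV":        float(0.5 * np.sum(np.abs(p - q))),
    "Hellinger": float(1 - np.sum(np.sqrt(p * q))),
    "Pearson":   float(np.sum((p - q) ** 2 / q)),
}
```

Each entry of `values` matches the corresponding entry of `direct`, every value is positive, and `f_divergence(f, p, p)` vanishes for each generator, in line with the divergence axioms.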

Problem: What is a fundamental limitation in the family of \(f\)-divergences \(D_f\)? How do integral probability metrics such as the Wasserstein (a.k.a. earth-mover’s) distance address this shortcoming?

Solution:

\[W_{n}(p,p’)=\min_{p(\mathbf x,\mathbf x’)}\left(\int dp(\mathbf x,\mathbf x’)|\mathbf x-\mathbf x’|^n\right)^{1/n}\quad\text{subject to}\quad\int d\mathbf x’p(\mathbf x,\mathbf x’)=p(\mathbf x)\text{ and }\int d\mathbf xp(\mathbf x,\mathbf x’)=p’(\mathbf x’)\]

where the joint probability is \(dp(\mathbf x,\mathbf x’)=d\mathbf xd\mathbf x’p(\mathbf x,\mathbf x’)\) and the minimization runs over all joint distributions (couplings) whose two marginals are \(p\) and \(p’\).

(include a concrete computation in \(1\) dimension).
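One possible concrete computation in \(1\) dimension is sketched below, under the assumption of two empirical samples of equal size. It relies on the standard fact that on the real line the optimal coupling pairs the sorted samples in order; the function name `wasserstein_1d` is hypothetical.

```python
import numpy as np

def wasserstein_1d(xs, ys, n=1):
    """W_n between two equal-size empirical samples on the real line.

    In 1D the optimal coupling pairs the sorted samples in order.
    """
    xs, ys = np.sort(np.asarray(xs, float)), np.sort(np.asarray(ys, float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes equally many samples")
    return float(np.mean(np.abs(xs - ys) ** n) ** (1.0 / n))

rng = np.random.default_rng(0)
base = rng.standard_normal(1000)
shifts = [0.5, 1.0, 2.0]

# Translating a sample by a moves every sorted point by a, so W_n equals a
distances = [wasserstein_1d(base, base + a) for a in shifts]
point_masses = wasserstein_1d([0.0], [3.0])
```

Here `distances` reproduces `shifts`, and two point masses a distance \(3\) apart are at Wasserstein distance \(3\), so the distance keeps track of how far apart the distributions are.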

Problem: (something about Bregman divergence, comment on how KL div is the only simultaneous f and Bregman divergence)

Solution:


Autoencoders

VAE

Graph Neural Networks

Problem: Give a broad sketch of the current state of the field of research in graph neural networks.

Solution:

Problem: Okay, so now explain what a graph neural network (GNN) actually is.

Solution: A GNN is basically any neural network whose input is any (undirected/directed/mixed/multi) graph (e.g. molecules, social networks, citation networks, etc.). In order to be sensible, the GNN output (whatever it is, e.g. a binary classifier for molecule toxicity) has to genuinely be an intrinsic function of the graph structure alone, and in particular not depend on any arbitrary choice of “ordering” with which one might index the graph vertices and edges (i.e. the output must be either permutation equivariant or permutation invariant depending on its nature, cf. tensors vs. tensor components in some basis).

Problem: Explain the subclass of GNNs known as message-passing neural networks (MPNNs).

Solution: An MPNN, being a certain category of GNNs, starts its life by taking as input some graph \((V,E)\). More precisely, this looks like some feature vector \(\mathbf x_v\) (e.g. mass, charge, atomic number for atoms) for each vertex \(v\in V\) and possibly also a feature vector \(\mathbf x_e\) (e.g. bond length, bond energy, etc.) for each edge \(e\in E\). The idea is that, in a manner similar to a head of self-attention, each vertex \(v\in V\) wants to update its current state \(\mathbf x_v\) into some new state \(\mathbf x’_v\) by soaking in context from its neighbours, and in an analogous manner each edge \(e\in E\) also wants to update its current state \(\mathbf x_e\mapsto\mathbf x’_e\) based on its “neighbours” (thus it’s not quite the same as self-attention in which a token doesn’t just look at its nearest neighbour tokens, but at all the tokens in the context). In the general MPNN framework, this can be roughly broken down into \(3\) conceptual steps:

  1. Message phase: from a sender perspective, each vertex \(v\in V\) “broadcasts” a “personalized” message vector \(\mathbf m_{vv’}\) along the edge \((v,v’)\in E\) connecting it to a neighbouring vertex \(v’\in V\). This message vector \(\mathbf m_{vv’}\) is any (learnable) function of its current state \(\mathbf x_v\), the current state of the receiving neighbour vertex \(\mathbf x_{v’}\), and the current edge feature vector \(\mathbf x_{vv’}\) connecting them.
  2. Aggregation phase: simultaneously, from a receiver perspective, each vertex \(v\in V\) receives the broadcasted signals from its neighbouring vertices. From this perspective, it then takes all the received message vectors and synthesizes them into a single “message summary” vector \(\mathbf m_v\) which in practice is any permutation invariant function of the message vectors \(\mathbf m_{v’v}\) it received from its neighbouring vertices \(v’\in V\) (e.g. their average).
  3. Update phase: Finally, the vertex \(v\in V\) updates its own current state \(\mathbf x_v\) to some new state \(\mathbf x’_v\) using another (learnable) function of its current state \(\mathbf x_v\) and the message summary \(\mathbf m_v\).

This \(3\)-step process represents a single forward pass through \(1\) message-passing layer; several composed together define an MPNN.
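The \(3\) phases can be sketched in NumPy as follows. This is only one possible instance of the framework: the single-layer `tanh` message and update functions, the mean aggregation, and all names and shapes are assumptions made for illustration, and edge-state updates are omitted.

```python
import numpy as np

def message_passing_layer(x, edges, x_edge, W_msg, W_upd):
    """One message-passing layer with mean aggregation.

    x:      (num_vertices, d) vertex features
    edges:  list of directed (sender, receiver) pairs
    x_edge: (num_edges, d_e) edge features, aligned with `edges`
    W_msg:  (2*d + d_e, d_m) weights of the message function (hypothetical choice)
    W_upd:  (d + d_m, d) weights of the update function (hypothetical choice)
    """
    num_vertices = x.shape[0]
    summed = np.zeros((num_vertices, W_msg.shape[1]))
    counts = np.zeros(num_vertices)
    for (sender, receiver), xe in zip(edges, x_edge):
        # 1. Message: function of sender state, receiver state and edge features
        message = np.tanh(np.concatenate([x[sender], x[receiver], xe]) @ W_msg)
        # 2. Aggregation: accumulate per receiver (a mean is permutation invariant)
        summed[receiver] += message
        counts[receiver] += 1
    m = summed / np.maximum(counts, 1)[:, None]
    # 3. Update: function of the current state and the message summary
    return np.tanh(np.concatenate([x, m], axis=1) @ W_upd)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]   # a path graph, both directions
x_edge = rng.standard_normal((len(edges), 2))
W_msg = rng.standard_normal((2 * 3 + 2, 5))
W_upd = rng.standard_normal((3 + 5, 3))
out = message_passing_layer(x, edges, x_edge, W_msg, W_upd)

# Relabel the vertices: old vertex i becomes vertex perm[i]
perm = np.array([2, 0, 3, 1])
x_perm = np.empty_like(x)
x_perm[perm] = x
edges_perm = [(perm[s], perm[r]) for s, r in edges]
out_perm = message_passing_layer(x_perm, edges_perm, x_edge, W_msg, W_upd)
```

Relabelling the vertices only permutes the rows of the output in the same way (`out_perm[perm]` equals `out`), which is the permutation equivariance demanded of any GNN layer.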

Problem: Now that the general framework of MPNN architectures has been defined, walk through the following specific examples of MPNN architectures:

  • Graph convolutional networks (GCNs)
  • Graph attention networks (GATs)

Solution:


Renormalization Group

Problem: Consider a Landau-Ginzburg statistical field theory involving a single real scalar field \(\phi(\mathbf x)\) for \(\mathbf x\in\mathbf R^d\) governed by the canonically normalized free energy density:

\[\mathcal F(\phi,\partial\phi/\partial\mathbf x,…)=\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2+\frac{\phi^2}{2\xi^2}+…\]

Explain what the \(+…\) means, which terms have temperature \(T\)-dependence, and for which of those terms that \(T\)-dependence actually matters.

Solution: The \(+…\) includes any terms (each with their own coupling constants) compatible with the golden trinity of constraints: locality, analyticity, and symmetry (e.g. a quartic \(g\phi^4\) coupling). The part of the free energy density \(\mathcal F\) before the \(+…\) should be compared to the Lagrangian density \(\mathcal L\) of Klein-Gordon field theory:

\[\mathcal L=\frac{1}{2c^2}\left(\frac{\partial\phi}{\partial t}\right)^2-\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2-\frac{\phi^2}{2\bar{\lambda}^2}\]

with \(\bar{\lambda}=\hbar/mc\) the reduced Compton wavelength playing a role analogous to the correlation length \(\xi\sim 1/\sqrt{|T-T_c|}\); indeed, this \(T\)-dependence in \(\xi=\xi(T)\) is (usually) the only \(T\)-dependence that matters, even though generically all the other coupling constants will also have \(T\)-dependence.

Problem: Define the (non-standard) notion of “theory space”.

Solution: Roughly speaking, “theory space” is the space of all Landau-Ginzburg statistical field theories \((\mathcal F,k^*)\) (notice it is defined not only by the free energy density \(\mathcal F\) but also the baggage of the UV cutoff \(k^*\); it is an effective field theory). The \(\mathcal F\) part can equivalently be parameterized as a countably \(\infty\)-tuple \((\xi,g,…)\) of the LAS-permitted coupling constants in \(\mathcal F\).

Problem: In broad strokes, describe the sequence of \(3\) steps that comprise a \(\mathbf k\)-space \(\zeta\)-renormalization semigroup transformation from one effective Landau-Ginzburg statistical field theory \((\mathcal F,k^*)\mapsto (\mathcal F_{\zeta},k^*)\) to another with the same UV cutoff \(k^*\) but a new Wilsonian effective free energy \(\mathcal F_{\zeta}\).

Solution: For \(\zeta\in [1,\infty)\), the corresponding \(\mathbf k\)-space \(\zeta\)-RG transformation of \((\mathcal F,k^*)\) is given by the \(3\)-step recipe:

  1. Coarse-graining \(k^*\mapsto k^*/\zeta\) (blocking in real space/integrating out shells in momentum space)
  2. Rescaling \(\mathbf k’:=\zeta\mathbf k\) to recover the original UV cutoff \(k^*/\zeta\mapsto k^*\) (this leads to a reciprocal “zooming out” of space \(\mathbf x\mapsto\mathbf x/\zeta\)).
  3. Rescale fields \(\phi’:=\zeta^{\Delta}\phi\) to make \(\mathcal F_{\zeta}\) canonically normalized with respect to \(\mathcal F\).

Problem: Consider an effective Landau-Ginzburg statistical field theory of a single real scalar field \(\phi(\mathbf x)\in\mathbf R\) with \(\mathbf x\in\mathbf R^d\) whose Fourier transform \(\phi_{\mathbf k}=\int d^d\mathbf x e^{-i\mathbf k\cdot\mathbf x}\phi(\mathbf x)\) is supported only on a ball of radius \(k^*\) (the theory’s UV cutoff). The free energy density corresponds to a free (no pun intended) field:

\[\mathcal F(\phi,\partial\phi/\partial\mathbf x)=\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2+\frac{\phi^2}{2\xi^2}\]

Perform a \(\mathbf k\)-space \(\zeta\)-renormalization of this theory to find the corresponding Wilsonian effective free energy density \(\mathcal F_{\zeta}\).

Solution: Work with the free energy \(F=\int d^d\mathbf x\mathcal F\) itself instead of just its density \(\mathcal F\):

\[F[\phi]=\frac{1}{2}\int_{|\mathbf k|<k^*}\frac{d^d\mathbf k}{(2\pi)^d}\left(|\mathbf k|^2+\frac{1}{\xi^2}\right)|\phi_{\mathbf k}|^2\]

  1. Partition the support \(|\mathbf k|<k^*\) of \(\phi_{\mathbf k}\) into \(|\mathbf k|<k^*/\zeta\) and \(k^*/\zeta<|\mathbf k|<k^*\) and based on this \(\zeta\), piecewise decompose \(\phi_{\mathbf k}=\phi^{<}_{\mathbf k}+\phi^{>}_{\mathbf k}\). Then one has an instance of the freshman’s dream (thanks to the disjoint supports of \(\phi^{<}_{\mathbf k}\) and \(\phi^{>}_{\mathbf k}\)):

\[|\phi_{\mathbf k}|^2=|\phi^{<}_{\mathbf k}+\phi^{>}_{\mathbf k}|^2=|\phi^{<}_{\mathbf k}|^2+|\phi^{>}_{\mathbf k}|^2\]

So:

\[F[\phi]=\frac{1}{2}\int_{|\mathbf k|<k^*/\zeta}\frac{d^d\mathbf k}{(2\pi)^d}\left(|\mathbf k|^2+\frac{1}{\xi^2}\right)|\phi^{<}_{\mathbf k}|^2+\frac{1}{2}\int_{k^*/\zeta<|\mathbf k|<k^*}\frac{d^d\mathbf k}{(2\pi)^d}\left(|\mathbf k|^2+\frac{1}{\xi^2}\right)|\phi^{>}_{\mathbf k}|^2\]

\[=F[\phi^<_{\zeta}]+F[\phi^>_{\zeta}]\]

In this case the partition function factorizes:

\[Z=\int\mathcal D\phi e^{-\beta F[\phi]}=\int\mathcal D\phi^{<}_{\zeta} e^{-\beta F[\phi^{<}_{\zeta}]}\int\mathcal D\phi^{>}_{\zeta}e^{-\beta F[\phi^{>}_{\zeta}]}=Z^{>}_{\zeta}\int\mathcal D\phi^{<}_{\zeta} e^{-\beta F[\phi^{<}_{\zeta}]}\]

where the measures are \(\mathcal D\phi^{<}_{\zeta}=\prod_{|\mathbf k|<k^*/\zeta}d\phi^{<}_{\mathbf k}\) and \(\mathcal D\phi^{>}_{\zeta}=\prod_{k^*/\zeta<|\mathbf k|<k^*}d\phi^{>}_{\mathbf k}\). The constant \(Z^>_{\zeta}\) doesn’t affect the physics, being absorbed as a constant shift into the Wilsonian effective free energy \(F_{\zeta}[\phi^{<}_{\zeta}]=F[\phi^{<}_{\zeta}]-k_BT\ln Z^{>}_{\zeta}\).

  2. Rescaling \(\mathbf k’:=\zeta\mathbf k\), the Wilsonian effective free energy becomes:

\[F_{\zeta}[\phi^{<}_{\zeta}]=\frac{1}{2}\int_{|\mathbf k’|<k^*}\frac{d^d\mathbf k’}{(2\pi)^d}\zeta^{-d}\left(\zeta^{-2}|\mathbf k’|^2+\frac{1}{\xi^2}\right)|\phi^{<}_{\mathbf k’/\zeta}|^2\]

  3. Rescaling \(\phi^{<\prime}_{\mathbf k’}:=\zeta^{\Delta}\phi^{<}_{\mathbf k’/\zeta}\), the Wilsonian effective free energy becomes:

\[F_{\zeta}[\phi^{<}_{\zeta}]=\frac{1}{2}\int_{|\mathbf k’|<k^*}\frac{d^d\mathbf k’}{(2\pi)^d}\zeta^{-d}\left(\zeta^{-2}|\mathbf k’|^2+\frac{1}{\xi^2}\right)\zeta^{-2\Delta}|\phi^{<\prime}_{\mathbf k’}|^2\]

so in order to canonically normalize the gradient term, one requires \(\Delta=-(d+2)/2\). This leads to the desired Wilsonian effective free energy density:

\[\mathcal F_{\zeta}(\phi^{<\prime}_{\zeta},\partial\phi^{<\prime}_{\zeta}/\partial\mathbf x)=\frac{1}{2}\biggr|\frac{\partial\phi^{<\prime}_{\zeta}}{\partial\mathbf x}\biggr|^2+\zeta^2\frac{(\phi^{<\prime}_{\zeta})^2}{2\xi^2}\]

Thus, by construction the gradient coupling is marginal (i.e. \(\zeta\)-independent) while the quadratic coupling is relevant because \(\zeta^2\to\infty\) as \(\zeta\to\infty\).
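The power counting can be double-checked with a few lines of Python. This is only a bookkeeping sketch with hypothetical function names: it collects the powers of \(\zeta\) from steps 2 and 3, and rewrites the flow of the quadratic coupling as a flow \(\xi\mapsto\xi/\zeta\) of the correlation length, since \(\zeta^2/\xi^2=1/(\xi/\zeta)^2\).

```python
def zeta_powers(d):
    """Powers of zeta multiplying the gradient and mass terms of the free theory."""
    delta = -(d + 2) / 2                  # field rescaling exponent fixed above
    gradient_power = -d - 2 - 2 * delta   # zeta^{-d} * zeta^{-2} * zeta^{-2 Delta}
    mass_power = -d - 2 * delta           # zeta^{-d} * zeta^{-2 Delta}
    return gradient_power, mass_power

def rg_flow_xi(xi, zeta):
    """Correlation length after a zeta-RG step: zeta^2 / xi^2 = 1 / (xi / zeta)^2."""
    if zeta < 1:
        raise ValueError("the RG semigroup only contains zeta >= 1")
    return xi / zeta

powers = {d: zeta_powers(d) for d in (1, 2, 3, 4)}
xi_after = rg_flow_xi(rg_flow_xi(10.0, 2.0), 5.0)   # two steps compose multiplicatively
```

In every dimension the gradient term carries \(\zeta^0\) and the mass term \(\zeta^2\), and two successive steps with \(\zeta=2\) and \(\zeta=5\) act like a single step with \(\zeta=10\), as expected of a semigroup.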


Convolutional Neural Networks

CNNs_Part_1

Hamilton’s Optics-Mechanics Analogy

Problem: Deduce the Hamilton-Jacobi equation of classical mechanics.

Solution: Instead of viewing the action \(S=S[\mathbf x(t)]\) as a functional of the particle’s trajectory \(\mathbf x(t)\), it can be viewed more simply as a scalar field \(S(\mathbf x,t)\) in which the initial point in spacetime \((t_0,\mathbf x_0)\) is fixed and one simply takes the on-shell trajectory from \((t_0,\mathbf x_0)\) to \((t,\mathbf x)\). Then varying the endpoint \(\mathbf x\) at fixed time \(t\) gives the differential \(dS=\mathbf p\cdot d\mathbf x\) (this follows from the usual Noetherian calculation) so in particular:

\[\mathbf p=\frac{\partial S}{\partial\mathbf x}\]

Intuitively, this is saying that the particle moves in a direction (the direction of the momentum \(\mathbf p\)) orthogonal to the contour surfaces of the action field \(S\), i.e. such isosurfaces can be viewed as “wavefronts”. Then the total time derivative is:

\[\dot S=L\]

But by the chain rule \(\dot S=\frac{\partial S}{\partial t}+\frac{\partial S}{\partial\mathbf x}\cdot\dot{\mathbf x}=\frac{\partial S}{\partial t}+\mathbf p\cdot\dot{\mathbf x}\). Thus, isolating for \(H=\mathbf p\cdot\dot{\mathbf x}-L\) yields the Hamilton-Jacobi nonlinear \(1^{\text{st}}\)-order PDE for \(S(\mathbf x,t)\):

\[-\frac{\partial S}{\partial t}=H\left(\mathbf x,\frac{\partial S}{\partial\mathbf x},t\right)\]
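As a quick check of the Hamilton-Jacobi equation, the sketch below uses the free particle in \(1\) dimension (an example chosen here, not one worked above), whose on-shell action from \((0,0)\) to \((x,t)\) is \(S=mx^2/2t\), and evaluates \(\partial S/\partial t+H\) by central differences. The names and the step size are illustrative assumptions.

```python
M = 1.0   # particle mass (arbitrary units)

def action(x, t):
    """On-shell action of a free particle going from (0, 0) to (x, t): S = m x^2 / (2 t)."""
    return M * x**2 / (2 * t)

def hj_residual(S, x, t, h=1e-5):
    """dS/dt + H(x, dS/dx) for the free Hamiltonian H = p^2 / 2m, via central differences."""
    dS_dt = (S(x, t + h) - S(x, t - h)) / (2 * h)
    p = (S(x + h, t) - S(x - h, t)) / (2 * h)    # momentum p = dS/dx
    return dS_dt + p**2 / (2 * M)

residual = hj_residual(action, x=1.3, t=0.7)
momentum = (action(1.3 + 1e-5, 0.7) - action(1.3 - 1e-5, 0.7)) / 2e-5
```

The residual vanishes up to finite-difference error, and `momentum` reproduces \(p=mx/t\), i.e. mass times the constant velocity of the free trajectory.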

Problem: When \(\partial H/\partial t=0\), the Hamiltonian is conserved with energy \(H=E\), so this motivates the additive separation of variables, \(S(\mathbf x,t):=S_0(\mathbf x)-Et\) for some constant \(E\). What does the Hamilton-Jacobi equation simplify to in this case? For a single non-relativistic particle of mass \(m\) moving in a potential \(V(\mathbf x)\), what does this look like? What about in \(1\) dimension?

Solution: \[H\left(\mathbf x,\frac{\partial S_0}{\partial\mathbf x}\right)=E\]

which for \(H(\mathbf x,\mathbf p)=|\mathbf p|^2/2m+V(\mathbf x)\) looks like:

\[\frac{1}{2m}\biggr|\frac{\partial S_0}{\partial\mathbf x}\biggr|^2+V(\mathbf x)=E\]

and in \(1\) dimension is integrable to the explicit solution:

\[S_0(x)=\pm\int ^xdx’\sqrt{2m(E-V(x’))}\]

In particular, the usual trajectory \(x(t)\) can be obtained by treating \(S=S_0(x;E)-Et\) as a family of solutions parameterized by the energy \(E\) and demanding that \(\partial S/\partial E\) be a constant, here called \(-t_0\); this works because \(S\) can be viewed as a particular generating function of a canonical transformation \((\mathbf x,\mathbf p,H)\mapsto (\mathbf x’,\mathbf p’,H’)\) in which the “boosted” Hamiltonian vanishes \(H’=0\), so that the new coordinates and momenta are all constants of the motion:

\[\frac{\partial S}{\partial E}=\frac{\partial S_0}{\partial E}-t=-t_0\Rightarrow t-t_0=\pm\sqrt{\frac{m}{2}}\int_{x_0}^{x(t)}\frac{dx’}{\sqrt{E-V(x’)}}\]
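The quadrature formula for \(t-t_0\) can be tested on the harmonic oscillator \(V(x)=kx^2/2\), an example picked here for illustration. Starting at the origin with amplitude \(A\), the motion is \(x(t)=A\sin(\sqrt{k/m}\,t)\), so reaching \(x=A/2\) should take \(\frac{\pi}{6}\sqrt{m/k}\). The function name, the midpoint rule and the parameter values are assumptions of this sketch.

```python
import numpy as np

def travel_time(V, E, x_start, x_end, m=1.0, num=20_000):
    """t - t0 = sqrt(m/2) * integral of dx / sqrt(E - V(x)), by the midpoint rule."""
    edges = np.linspace(x_start, x_end, num + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    dx = (x_end - x_start) / num
    return float(np.sqrt(m / 2) * np.sum(dx / np.sqrt(E - V(mid))))

m, k, A = 1.0, 4.0, 1.0
E = 0.5 * k * A**2                       # energy of an oscillation with amplitude A
t_half = travel_time(lambda x: 0.5 * k * x**2, E, 0.0, A / 2, m=m)
t_exact = np.pi / 6 * np.sqrt(m / k)     # from x(t) = A sin(sqrt(k/m) t)
```

The endpoint \(A/2\) is chosen to stay clear of the turning point \(x=A\), where the integrand has an integrable inverse-square-root singularity that a plain midpoint rule resolves only slowly.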

Problem: Above, the static field \(S_0(\mathbf x)\) was introduced to simplify the Hamilton-Jacobi equation when the energy \(E\) was conserved. However, if one pulls back to the level of functionals rather than fields, one can define an analogous abbreviated action functional \(S_0[\mathbf x]\) which depends only on the path \(\mathbf x\) taken rather than the trajectory \(\mathbf x(t)\). Define \(S_0[\mathbf x]\), and moreover show that when the energy \(E\) is conserved, the on-shell path is a stationary point of \(S_0\) (this is called Maupertuis’s principle).

Solution: The abbreviated action for a single particle of mass \(m\) and (non-relativistic) energy \(E=|\mathbf p|^2/2m +V(\mathbf x)\) is:

\[S_0[\mathbf x]:=\int d\mathbf x\cdot\mathbf p=\int ds |\mathbf p|=\int ds\sqrt{2m(E-V(\mathbf x))}\]

(modifying this to \(S_0[\mathbf x]:=\int ds |\mathbf p|=\int ds\sqrt{2(E-V(\mathbf x))}\) allows for interpretation as an \(N\)-particle system in configuration space \(\mathbf x\in\mathbf R^{3N}\) with the Riemannian “mass metric” \(ds^2=m_1|d\mathbf x_1|^2+…+m_N|d\mathbf x_N|^2\)).

To find the stationary paths of \(S_0[\mathbf x]\) subject to the constraint \(H(\mathbf x,\mathbf p)=E\), one can implement a Lagrange multiplier \(\gamma(\tau)\) to perform unconstrained extremization of:

\[S[\mathbf x(\tau)]:=S_0[\mathbf x(\tau)]-\int d\tau\gamma(\tau)(H(\mathbf x,\mathbf p)-E)=\int d\tau (\mathbf p\cdot\dot{\mathbf x}-\gamma(\tau)(H(\mathbf x,\mathbf p)-E))\]

The Euler-Lagrange equations lead to Hamilton’s equations:

\[\frac{d\mathbf x}{d\tau}=\gamma\frac{\partial H}{\partial\mathbf p}\]

\[\frac{d\mathbf p}{d\tau}=-\gamma\frac{\partial H}{\partial\mathbf x}\]

provided the Lagrange multiplier \(\gamma=dt/d\tau\) encodes reparameterization invariance; with this choice it’s clear that the integrand in the functional \(S\) was nothing more than the Lagrangian \(L=\mathbf p\cdot\dot{\mathbf x}-H\) (plus an unimportant constant \(E\)) so Maupertuis’s principle reduces to the usual Hamilton’s principle.

Problem: What does Fermat’s principle in ray optics assert? Hence, derive the ray equation.

Solution: Fermat’s principle asserts that the travel time functional \(T=T[\mathbf x(s)]\) of a ray trajectory \(\mathbf x(s)\) is stationary on-shell, where the optical path length is:

\[cT[\mathbf x(s)]=\int ds n(\mathbf x(s))\]

This is reparameterization invariant, since one can arbitrarily parameterize \(\mathbf x=\mathbf x(t)\) and replace \(ds=dt|\dot{\mathbf x}|\). The corresponding Euler-Lagrange equations are:

\[\frac{d}{dt}\left(n(\mathbf x)\frac{\dot{\mathbf x}}{|\dot{\mathbf x}|}\right)=|\dot{\mathbf x}|\frac{\partial n}{\partial\mathbf x}\]

But by choosing the natural parameterization \(t:=s\) one has \(|d\mathbf x/ds|=1\), hence the ray equation:

\[\frac{d}{ds}\left(n\frac{d\mathbf x}{ds}\right)=\frac{\partial n}{\partial\mathbf x}\]

This can also be written in terms of the curvature vector \(\boldsymbol{\kappa}=d^2\mathbf x/ds^2\):

\[\boldsymbol{\kappa}=\left(\frac{\partial\ln n}{\partial\mathbf x}\right)_{\perp d\mathbf x}\]

Problem: Starting from an arbitrary Cartesian component \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) of either the \(\mathbf E\) or \(\mathbf B\) fields of a light wave (here \(\omega=ck_0\) with \(k_0=2\pi/\lambda_0\) is the free space wavenumber), make the eikonal approximation to the dispersionless wave equation obeyed by \(\psi\) in order to obtain the (scalar) eikonal equation. By defining light rays as the integral curves of the eikonal field \(cT(\mathbf x)\) (a kind of local optical path length), reproduce the vector eikonal equation from Fermat’s principle above.

Solution: The ansatz \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) is easy to justify; the \(e^{-i\omega t}\) is just a Fourier transform factor that reduces the wave equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=\frac{n^2}{c^2}\frac{\partial^2\psi}{\partial t^2}\) to a Helmholtz equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=-n^2k_0^2\psi\). The remaining piece is just a polar parameterization of an arbitrary \(\mathbf C\)-valued spatial field \(\psi_0(\mathbf x)e^{ik_0cT(\mathbf x)}\). Substituting into the Helmholtz equation and dividing by \(k_0^2\psi\), one obtains:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2=n^2+\frac{1}{k_0^2\psi_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_0+\frac{2i}{k_0\psi_0}\frac{\partial\psi_0}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}+\frac{i}{k_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2cT\]

The eikonal approximation amounts to taking the ray optics limit \(k_0\to\infty\) (in practice, the wavelength \(2\pi/k_0\) has to be much shorter than all other length scales), and yields the (scalar) eikonal equation:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|=n\]

A light ray is thus a trajectory \(\mathbf x(s)\) with unit tangent vector:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

The rest is an application of the chain rule:

\[\frac{d}{ds}=\frac{d\mathbf x}{ds}\cdot\frac{\partial}{\partial\mathbf x}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\cdot\frac{\partial}{\partial\mathbf x}\]

followed by the identity:

\[\left(\frac{\partial cT}{\partial\mathbf x}\cdot\frac{\partial}{\partial\mathbf x}\right)\frac{\partial cT}{\partial\mathbf x}=\frac{1}{2}\frac{\partial}{\partial\mathbf x}\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2\]

to deduce the (vector) eikonal equation of motion for ray trajectories just as Fermat’s principle predicts.
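Spelled out, the chain of equalities reads as follows (a sketch that only combines the unit tangent vector, the chain rule, the identity above and the scalar eikonal equation \(|\partial cT/\partial\mathbf x|=n\)):

\[\frac{d}{ds}\left(n\frac{d\mathbf x}{ds}\right)=\frac{d}{ds}\frac{\partial cT}{\partial\mathbf x}=\frac{1}{n}\left(\frac{\partial cT}{\partial\mathbf x}\cdot\frac{\partial}{\partial\mathbf x}\right)\frac{\partial cT}{\partial\mathbf x}=\frac{1}{2n}\frac{\partial}{\partial\mathbf x}\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2=\frac{1}{2n}\frac{\partial n^2}{\partial\mathbf x}=\frac{\partial n}{\partial\mathbf x}\]

which is exactly the ray equation obtained from Fermat’s principle.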

Problem: Hence, what is Hamilton’s optics-mechanics analogy?

Solution: In a nutshell, the isomorphism proceeds as:

\[(n(\mathbf x), cT)\leftrightarrow (|\mathbf p(\mathbf x)|,S_0)\]

Problem: Use Hamilton’s optics-mechanics analogy to solve the brachistochrone problem (this was how Johann Bernoulli originally solved it).

Solution: By energy conservation, the speed of the particle at distance \(y>0\) below its initial dropping height is \(v=\sqrt{2gy}\). By Fermat’s principle, minimizing the time functional then amounts to treating the particle as a light ray with \(n(\mathbf x)=c/v(y)\). So the question becomes how do light rays bend in a horizontally stratified medium with \(n(y)\propto y^{-1/2}\)? The answer is given by the ray equations:

\[\frac{d}{ds}\begin{pmatrix} y^{-1/2}dx/ds \\ y^{-1/2}dy/ds\end{pmatrix}=\begin{pmatrix}0 \\ -y^{-3/2}/2\end{pmatrix}\]

The horizontal component expresses Snell’s law since \(dx/ds=\sin\theta\) (it expresses momentum conservation along the homogeneous \(\partial n/\partial x=0\) direction). Using the tangent vector constraint \(ds^2=dx^2+dy^2\) gives the ODE of a cycloid:

\[\frac{dy}{dx}=\sqrt{\frac{\text{const}}{y}-1}\]

(the vertical component ODE has an analytical solution \(y(s)=-s^2/8R+s\) which is contained in the cycloid, so is redundant information).
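A short numerical check that a cycloid does satisfy both the Snell invariant and the ODE above is sketched below. It assumes the standard parameterization \(x=R(\theta-\sin\theta)\), \(y=R(1-\cos\theta)\) with \(y\) measured downwards from the release point; the rolling radius \(R\) and the sampled range of \(\theta\) are illustrative choices.

```python
import numpy as np

R = 1.0
theta = np.linspace(0.2, np.pi - 0.2, 200)   # stay away from the cusp at theta = 0

# Cycloid x = R (theta - sin theta), y = R (1 - cos theta), y measured downwards
y = R * (1 - np.cos(theta))

# Derivatives with respect to theta, then dx/ds = x' / sqrt(x'^2 + y'^2)
dx = R * (1 - np.cos(theta))
dy = R * np.sin(theta)
dx_ds = dx / np.hypot(dx, dy)

snell_invariant = dx_ds / np.sqrt(y)          # n dx/ds with n = y^{-1/2}
slope_lhs = dy / dx                           # dy/dx along the curve
slope_rhs = np.sqrt(2 * R / y - 1)            # sqrt(const / y - 1) with const = 2R
```

Along the whole arc `snell_invariant` stays equal to \(1/\sqrt{2R}\), and the two slopes agree, which identifies the constant in the cycloid ODE as \(2R\).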

Problem: How did Hamilton’s optics-mechanics analogy inspire Schrödinger to propose his famous equation of quantum mechanics?

Solution: Essentially, Schrödinger asked: ray optics is to wave optics as classical mechanics is to what? In other words, one imagines there exists a wave theory of particles/matter and one would like to take the “inverse eikonal limit” of classical mechanics (here, “inverse eikonal limit” is usually called quantization):

Just as light rays propagate parallel to their phase fronts:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

Particles propagate parallel to their “action fronts” exactly according to Hamilton’s analogy:

\[\frac{d\mathbf x}{ds}=\frac{1}{|\mathbf p(\mathbf x)|}\frac{\partial S_0}{\partial\mathbf x}\]

Already this suggests that the action should have some phase interpretation. More precisely, it should be the phase of the particle’s de Broglie wave in units of \(\hbar\). It’s also not obvious that particles should be described by a scalar wave field rather than e.g. the electromagnetic vector wave fields of light. Schrödinger simply guessed it looked like the equivalent of “scalar diffraction theory” with a single wavefunction \(\psi(\mathbf x,t)=\psi_0(\mathbf x,t)e^{iS(\mathbf x,t)/\hbar}\). Substituting this polar form into the Schrödinger equation gives the Madelung equations of quantum hydrodynamics, one of which is just a continuity equation (giving credence to the Born interpretation of \(|\psi|^2\)) and the other is a quantum Hamilton-Jacobi equation which in the limit \(\hbar\to 0\) (analogous to the eikonal limit \(\lambda_0\to 0\)) simplifies to the classical Hamilton-Jacobi equation.
