Problem: Define the \(2\) words in the phrase “ideal paramagnet“. Show that a classical ideal paramagnet of \(N\) spins each with the same fixed magnetic dipole moment \(\mu:=|\boldsymbol{\mu}|\) placed in a uniform external magnetic field \(B:=|\mathbf B|\) will develop a uniform non-zero induced average magnetization \(\langle M\rangle=n\mu L(\beta\mu B)\) along the direction of \(\mathbf B\) where the Langevin function \(L(x):=\coth(x)-1/x\).
Solution: Ideal means the \(N\) spins are non-interacting with respect to each other (hence there’s no need to specify the precise geometric configuration of the \(N\) spins such as whether they’re on a lattice \(\Lambda\), etc.). Paramagnet imposes conditions on the first \(2\) Taylor expansion coefficients of \(M(B)\) about \(B=0\):
Non-magnetic/absence of spontaneous magnetization \(\langle M(B=0)\rangle=0\).
Positive zero-field magnetic susceptibility \(\chi_{\mu}/\mu_0:=\partial \langle M\rangle/\partial B|_{B=0}>0\).
Classically, for a single classical spin, the state space is \((\theta,\phi)\in S^2\) and the \(\phi\)-independent Hamiltonian is \(H(\theta)=-\boldsymbol{\mu}\cdot\mathbf B=-\mu B\cos\theta\), so the single-spin partition function is:
\[Z_1=\int_{S^2}d\Omega\, e^{\beta\mu B\cos\theta}=2\pi\int_0^{\pi}d\theta\,\sin\theta\, e^{\beta\mu B\cos\theta}=\frac{4\pi\sinh(\beta\mu B)}{\beta\mu B}\]
and hence the result follows from \(\langle M\rangle=n\langle\mu_z\rangle\) where \(n:=N/V\).
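Spelling out the standard intermediate step for completeness:
\[\langle\mu_z\rangle=\frac{1}{\beta}\frac{\partial\ln Z_1}{\partial B}=\mu\left(\coth(\beta\mu B)-\frac{1}{\beta\mu B}\right)=\mu L(\beta\mu B)\]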
Problem: Show that in the high-temperature limit where \(x:=\beta\mu B\ll 1\), the Taylor expansion of the Langevin function about \(x=0\) is \(L(x)=x/3+O_{x\to 0}(x^3)\). Hence, derive Curie’s high-temperature ideal paramagnet law \(\chi_{\mu}=C/T\) for the zero-field magnetic susceptibility and state the value of the Curie constant \(C>0\).
Solution: Using the Laurent expansion \(\coth(x)=1/x+x/3-x^3/45+O_{x\to 0}(x^5)\), one immediately finds \(L(x)=x/3+O_{x\to 0}(x^3)\). In the high-temperature limit \(x=\beta\mu B\ll 1\), the induced magnetization is therefore \(\langle M\rangle\approx n\mu^2B/3k_BT\), so the zero-field susceptibility is \(\chi_{\mu}=\mu_0\partial\langle M\rangle/\partial B|_{B=0}=\mu_0 n\mu^2/3k_BT\). Thus, it is straightforward to read off the Curie constant \(C=\mu_0 n\mu^2/3k_B\) for a classical ideal paramagnet.
Problem: Repeat the above analysis but for a quantum ideal paramagnet in which all \(N\) spins have the same fixed total angular momentum quantum number \(j\in\{0,1/2,1,…\}\).
Solution: Now, for a single quantum spin with state vector in the Hilbert space \(\mathbf C^{2j+1}\), the spectrum of its Hamiltonian is given by the weak-field Zeeman splitting \(E_{|j,m_j\rangle}=g_jm_j\mu_BB\) for Landé \(g\)-factor \(g_j=1+\frac{j(j+1)+s(s+1)-\ell(\ell+1)}{2j(j+1)}\) and the canonical partition function is a Dirichlet kernel:
\[Z_1=\sum_{m_j=-j}^{j}e^{-\beta g_jm_j\mu_BB}=\frac{\sinh\left(\frac{(2j+1)\beta g_j\mu_BB}{2}\right)}{\sinh\left(\frac{\beta g_j\mu_BB}{2}\right)}\]
Differentiating \(\ln Z_1\) as in the classical case gives \(\langle M\rangle=ng_jj\mu_B B_j(x)\) with \(x:=\beta g_jj\mu_BB\), where the Brillouin function is \(B_j(x):=\frac{2j+1}{2j}\coth\left(\frac{2j+1}{2j}x\right)-\frac{1}{2j}\coth\left(\frac{x}{2j}\right)\).
In particular, as \(j\to\infty\), \(2j+1\to\infty\) and one recovers the classical continuous angle \(\theta\in [0,\pi]\) and \(\lim_{j\to\infty}B_j(x)=L(x)\). This is consistent with the Taylor expansion \(B_j(x)=\frac{j+1}{3j}x+O_{x\to 0}(x^3)\) which leads to the quantum Curie constant \(C=\mu_0ng_j^2j(j+1)\mu_B^2/3k_B\). Instead of looking at \(j\to\infty\), one can also take the quantum limit \(j=s=1/2\) and \(\ell=0\), in which case \(g_j=2\) and:
\[B_{1/2}(x)=2\coth 2x-\coth x=\tanh x\]
so one recovers the familiar \(2\)-level system average magnetization \(\langle M\rangle=n\mu_B\tanh\beta\mu_BB\) with Curie constant \(C=\mu_0n\mu_B^2/k_B\).
Problem: Explain why classical statistical mechanics (when applied consistently!) predicts \(\langle\mathbf M\rangle=\mathbf 0\) irrespective of what \(\mathbf B\) is (this is called the Bohr-van Leeuwen theorem). Since the Langevin derivation used a classical stat mech approach yet was able to predict a nontrivial \(\mathbf M(\mathbf B)\), explain why it doesn’t violate the BvL theorem.
Solution: The BvL theorem follows mathematically from the fact that the canonical \(N\)-particle partition function:
\[Z(\beta,\mathbf B)=\frac{1}{N!h^{3N}}\int d^{3N}\mathbf x\,d^{3N}\mathbf p\;e^{-\beta H}\]
with \(H=\sum_{i=1}^N\frac{|\mathbf p_i-q_i\mathbf A(\mathbf x_i)|^2}{2m_i}+V(\mathbf x_1,…,\mathbf x_N)\) can instead (via change of variables) be integrated over the kinetic momenta \(m_i\mathbf v_i:=\mathbf p_i-q_i\mathbf A(\mathbf x_i)\) rather than the canonical momenta \(\mathbf p_i\) without incurring any Jacobian factor, since \(\partial (m_1\mathbf v_1,…,m_N\mathbf v_N)/\partial(\mathbf p_1,…,\mathbf p_N)=1\). Hence \(Z(\beta,\mathbf B)=Z(\beta,\mathbf B=\mathbf 0)\) is \(\mathbf B\)-independent, so \(F=-k_BT\ln Z\) is also \(\mathbf B\)-independent, leading to the BvL theorem \(\langle\mathbf M\rangle=-V^{-1}\partial F/\partial\mathbf B=\mathbf 0\).
By assuming that one could speak of a “fixed magnetic dipole moment \(\mu\)” for all the atoms, Langevin was implicitly quantizing the system since in hindsight \(\mu=g_jj\mu_B\) and \(j\) is quantized and fixed (indeed, making the replacement \(\mu\mapsto g_jj\mu_B\) in the Langevin magnetization maps in the \(j\to\infty\) limit directly onto the Brillouin magnetization). As a result, it is perhaps wise to regard the Langevin result for \(\langle M\rangle\) as a semi-classical formula rather than strictly belonging to classical physics (otherwise it would violate the BvL theorem!).
The physics-informed neural network (PINN) paradigm is a form of unsupervised machine learning first proposed in Raissi et al. (2019). At inference time, one freezes the model’s internal parameters and queries the PINN at an arbitrary point of the relevant domain (typically some spacetime event $(\mathbf{x}, t)$). The PINN outputs its prediction $\hat{\phi}(x,t)$ for the actual value $\phi(x,t)$ of the PDE solution at that input point, for some fixed initial/boundary conditions and constant parameters in the PDE.
This motivates a training stage based on an MSE cost function that simultaneously enforces both the initial/boundary conditions and the PDE constraint itself. At training time, one randomly samples a batch of $N_{\partial}$ points on the boundary and $N_{\text{int}}$ points in the interior of the domain. For each point, the PINN prediction $\hat{\phi}(x_i, t_i)$ is computed, and automatic differentiation (the same chain rule algorithm that powers backpropagation) is applied to obtain the PINN’s predictions for the relevant partial derivatives $\partial\phi/\partial x, \partial\phi/\partial t, \partial^2\phi/\partial x^2, \ldots$ as dictated by the PDE, which is assumed to be written in the residual form $\text{PDE}(\phi(x,t))=0$. The total cost function is then:
$$C = \frac{\lambda_{\partial}}{N_{\partial}}\sum_{i=1}^{N_{\partial}}\left|\hat{\phi}(x_i,t_i)-\phi(x_i,t_i)\right|^2+\frac{\lambda_{\text{int}}}{N_{\text{int}}}\sum_{j=1}^{N_{\text{int}}}\left|\text{PDE}(\hat{\phi}(x_j,t_j))\right|^2$$
The parameters $\lambda_{\partial}, \lambda_{\text{int}}$ are hyperparameters meant to emphasize one component of the loss more heavily (e.g. to prevent pathological underfitting such as $\hat{\phi}=0$, one should choose $\lambda_{\partial} \gg \lambda_{\text{int}}$).
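To make this concrete, here is a minimal hedged sketch of such a composite cost in PyTorch, using the 1D heat equation $\partial_t\phi = D\,\partial_x^2\phi$ purely as a stand-in residual (the function and variable names are illustrative, not this project's actual code):

import torch

def pinn_loss(model, x_b, t_b, phi_b, x_i, t_i, lam_boundary=1.0, lam_interior=1.0):
    # Boundary/initial-condition term: match the network to known values phi_b at (x_b, t_b).
    phi_hat_b = model(torch.stack([x_b, t_b], dim=1))
    loss_boundary = torch.mean((phi_hat_b.squeeze(-1) - phi_b) ** 2)

    # Interior (PDE residual) term: the 1D heat equation phi_t - D * phi_xx = 0 is used here
    # purely as a stand-in for whatever PDE(phi) = 0 one is enforcing.
    D = 1.0
    x_i = x_i.clone().requires_grad_(True)
    t_i = t_i.clone().requires_grad_(True)
    phi_hat = model(torch.stack([x_i, t_i], dim=1)).squeeze(-1)
    phi_x, phi_t = torch.autograd.grad(phi_hat.sum(), (x_i, t_i), create_graph=True)
    phi_xx = torch.autograd.grad(phi_x.sum(), x_i, create_graph=True)[0]
    loss_interior = torch.mean((phi_t - D * phi_xx) ** 2)

    return lam_boundary * loss_boundary + lam_interior * loss_interior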
In this project, I explore and quantify the difficulties that PINNs have at modelling high-frequency spatial and temporal Fourier components of PDE solutions, following work of Wang et al. (2021) which in turn follows from pioneering early work by Rahaman et al. (2018) and Xu et al. (2019) on the spectral bias and frequency principle of deep neural networks outside the specialized context of PINNs. It is hypothesized that the smoothness of the activation function (which in the experiments below are always $\tanh$ activations) could induce a smoothness prior on the PINNs responsible for this spectral bias.
The project is split into two experiments:
Van der Pol oscillator (ODE): probing spectral bias in the time domain via relaxation oscillations.
Fokker-Planck equation (PDE): probing spectral bias in the spatial domain via increasingly narrow initial conditions.
Both experiments use the same core architecture: a fully-connected multilayer perceptron (MLP) with 5 hidden layers of 64 neurons each, $\tanh$ activations, and Xavier normal weight initialization. Xavier initialization draws initial weights from $\mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 2/(n_{\text{in}} + n_{\text{out}})$, the standard scaling for networks with symmetric, saturating activations like $\tanh$: it keeps the initial pre-activations of each layer in the approximately linear region of the activation, avoiding vanishing or exploding gradients at the start of training.
For the van der Pol experiment, the input is 1-dimensional ($t \mapsto \hat{x}(t)$). For the Fokker-Planck experiment, the input is 2-dimensional ($(x,t) \mapsto \hat{p}(x,t)$).
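A minimal sketch of the MLP just described (5 hidden layers of 64 $\tanh$ units, Xavier normal initialization); the class name and details here are illustrative rather than the project's actual VanillaPINN module:

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=1, hidden_dim=64, num_hidden=5, out_dim=1):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(dim, hidden_dim), nn.Tanh()]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)
        # Xavier (Glorot) normal initialization on every linear layer.
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

For the van der Pol experiment one would instantiate it with in_dim=1, and for Fokker-Planck with in_dim=2.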
All models were trained with the Adam optimizer using a learning rate schedule consisting of a linear warmup phase followed by cosine annealing decay. Collocation points were re-sampled uniformly at random each epoch.
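A sketch of one way to build such a schedule with PyTorch's built-in schedulers, continuing with the illustrative model from the previous sketch (the warmup length and learning rate below are placeholders, not the values actually used):

from torch.optim import Adam
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = Adam(model.parameters(), lr=1e-3)
warmup_epochs, total_epochs = 500, 15000  # illustrative values
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),  # linear warmup
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),   # cosine decay
    ],
    milestones=[warmup_epochs],
)
# Call scheduler.step() once per epoch, after optimizer.step().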
All training was performed on a single NVIDIA GPU via PyTorch with CUDA. Each van der Pol model (15,000 epochs) took roughly 1-2 minutes; each Fokker-Planck model (15,000 epochs) took roughly 3-4 minutes.
The van der Pol equation is a second-order stiff nonlinear ODE first proposed by electrical engineer Balthasar van der Pol as a model of the voltage $x(t)$ in certain vacuum tube circuits:
$$\ddot{x} + \mu(x^2 - 1)\dot{x} + x = 0$$
For $\mu = 0$ this reduces to the simple harmonic oscillator $\ddot{x} + x = 0$. As $\mu$ increases, the system develops a stable limit cycle that becomes increasingly contorted in $(x, \dot{x})$ phase space, leading to relaxation oscillations with sharp, high-frequency transitions. The period grows asymptotically as $T \sim (3 – 2\ln 2)\mu$ for $\mu \gg 1$.
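The Radau reference solutions used below come from a stiff implicit integrator; a minimal sketch of what generate_reference_solution (imported later from the project module) presumably does, using SciPy:

import numpy as np
from scipy.integrate import solve_ivp

def vdp_reference(mu, t_span=(0.0, 17.0), num_points=2000):
    # Integrate the van der Pol system with the implicit Radau method (suited to stiff ODEs).
    def rhs(t, y):
        x, x_dot = y
        return [x_dot, mu * (1 - x**2) * x_dot - x]
    t_eval = np.linspace(*t_span, num_points)
    sol = solve_ivp(rhs, t_span, y0=[1.0, 0.0], method='Radau', t_eval=t_eval, rtol=1e-8, atol=1e-8)
    return sol.t, sol.y[0], sol.y[1]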
I first learned about this dynamical system in high school while reading Strogatz’s textbook on Chaos and Nonlinear Dynamics. Upon doing initial research into PINNs, I learned that they had a known deficiency with respect to high-frequency signals, and upon searching for a simple toy ODE to illustrate this, I recalled the van der Pol oscillator and how its relaxation oscillations could serve as a natural probe of the spectral bias.
In all experiments, I used the initial conditions $x(0) = 1$, $\dot{x}(0) = 0$, and varied the damping coefficient $\mu = 0, 0.1, 0.2, \ldots, 1.0, 2.0, \ldots, 10.0$. Training from scratch with Xavier initialization, only the $\mu = 0$ PINN was able to learn the sinusoidal SHM solution. To address this, I adopted a curriculum training strategy: the $\mu = 0$ checkpoint was used to initialize the $\mu = 0.1$ PINN, whose checkpoint in turn initialized the $\mu = 0.2$ PINN, and so forth.
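In code, this warm-starting amounts to copying state dicts between consecutive runs; a hypothetical sketch (train_pinn is a stand-in for the actual training routine, which is not reproduced here):

# Illustrative curriculum loop: warm-start each model from the previous mu's weights.
prev_state = None
for mu in [0.0, 0.1, 0.2, 0.3]:  # truncated list for illustration
    model = MLP(in_dim=1)
    if prev_state is not None:
        model.load_state_dict(prev_state)  # initialize from the previous checkpoint
    train_pinn(model, mu)                  # stand-in for the actual training routine
    prev_state = model.state_dict()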
In [1]:
import sys, os
import numpy as np
import matplotlib.pyplot as plt
import torch

# Add module directories to path
sys.path.insert(0, os.path.join(os.getcwd(), 'vanderpol_PINN'))
sys.path.insert(0, os.path.join(os.getcwd(), 'fokkerplanck_PINN'))

from vanderpol_PINN.model import VanillaPINN
from vanderpol_PINN.vanderpol_data import generate_reference_solution
from fokkerplanck_PINN.model import VanillaPINN2D
from fokkerplanck_PINN.fokkerplanck_data import analytical_fokker_planck

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
The figures below show the curriculum training results for $\mu = 0$ through $\mu = 6$. The left column compares the PINN prediction (blue dashed) against the reference Radau solution (black solid), alongside the training loss curve. The right column shows the corresponding phase portrait $(x, \dot{x})$ with the nullclines of the dynamical system and a background vector field.
Even with the curriculum training approach, by about $\mu \approx 5$ the PINN prediction starts deviating from the Radau solution. This is evidence for spectral bias in the time domain: the PINN cannot represent the sharp, high-frequency transitions of the relaxation oscillations.
In [2]:
def evaluate_pinn_with_derivatives(model, t_eval):
    # Evaluate a trained PINN, returning x(t) and x_dot(t) via autodiff.
    model.eval()
    t_tensor = torch.tensor(t_eval, dtype=torch.float32, device=DEVICE).unsqueeze(1)
    t_tensor.requires_grad_(True)
    x_pred = model(t_tensor)
    x_dot_pred = torch.autograd.grad(x_pred, t_tensor, grad_outputs=torch.ones_like(x_pred), create_graph=False)[0]
    return (x_pred.detach().cpu().numpy().flatten(), x_dot_pred.detach().cpu().numpy().flatten())


def plot_vdp_comparison(mu_values, ckpt_dir, T_DOMAIN=17.0):
    # Load checkpoints and plot time series + phase portraits side by side.
    for mu in mu_values:
        mu_str = f"{mu}".replace('.', 'p')
        ckpt_path = os.path.join(ckpt_dir, f'mu_{mu}_vanilla_mu_{mu_str}.pt')
        if not os.path.exists(ckpt_path):
            print(f"Checkpoint not found: {ckpt_path}, skipping.")
            continue
        checkpoint = torch.load(ckpt_path, map_location=DEVICE, weights_only=False)
        model = VanillaPINN()
        model.load_state_dict(checkpoint['model_state_dict'])
        model.to(DEVICE)

        # Reference solution
        t_ref, x_ref, xdot_ref = generate_reference_solution(mu, t_span=(0, T_DOMAIN), num_points=2000)
        # PINN prediction
        x_pinn, xdot_pinn = evaluate_pinn_with_derivatives(model, t_ref)
        # Training loss history
        loss_hist = checkpoint.get('loss_history', None)

        fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

        # --- Time series ---
        ax = axes[0]
        ax.plot(t_ref, x_ref, 'k-', lw=2, label='Reference (Radau)')
        ax.plot(t_ref, x_pinn, 'b--', lw=1.5, label='Vanilla PINN', alpha=0.8)
        ax.set_xlabel('$t$'); ax.set_ylabel('$x(t)$')
        ax.set_title(f'Time Series, $\\mu = {mu}$')
        ax.legend(fontsize=9); ax.grid(True, alpha=0.3)

        # Inset for loss curve if available
        if loss_hist is not None:
            ax_inset = ax.inset_axes([0.55, 0.55, 0.4, 0.4])
            losses = loss_hist if isinstance(loss_hist, list) else loss_hist.get('total', loss_hist)
            ax_inset.semilogy(losses, 'r-', lw=0.5)
            ax_inset.set_xlabel('Epoch', fontsize=7); ax_inset.set_ylabel('Loss', fontsize=7)
            ax_inset.tick_params(labelsize=6)

        # --- Phase portrait ---
        ax = axes[1]
        # Nullcline: x_dot = -x / (mu*(x^2 - 1))
        if mu > 0:
            x_nc = np.linspace(-3, 3, 500)
            nc_color = '#2ca02c'
            for mask in [x_nc < -1, (x_nc > -1) & (x_nc < 1), x_nc > 1]:
                seg = x_nc[mask]
                if len(seg) > 0:
                    ax.plot(seg, -seg/(mu*(seg**2 - 1)), color=nc_color, ls='--', lw=1.2, alpha=0.6)
        ax.plot(x_ref, xdot_ref, 'k-', lw=2, label='Reference (Radau)')
        ax.plot(x_pinn, xdot_pinn, 'b--', lw=1.5, label='Vanilla PINN', alpha=0.8)
        ax.plot(x_ref[0], xdot_ref[0], 'ro', ms=7, label='IC')
        ax.set_xlabel('$x$'); ax.set_ylabel('$\\dot{x}$')
        ax.set_title(f'Phase Portrait, $\\mu = {mu}$')
        ax.legend(fontsize=8); ax.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()
In [3]:
# Plot van der Pol results for the curriculum training experiment
mu_curriculum = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
plot_vdp_comparison(mu_curriculum, 'vanderpol_PINN/checkpoints')
The second experiment steps up in complexity from an ODE to a PDE. We consider the drift-free Fokker-Planck equation (i.e. the diffusion/heat equation) for a probability distribution $p(x,t)$ in one spatial dimension:
$$\frac{\partial p}{\partial t} = D \frac{\partial^2 p}{\partial x^2}$$
for diffusion constant $D$. When the initial condition is an infinite spike at the origin, $p(x,0) = \delta(x)$, the analytical solution (the heat kernel/Green’s function) is a Gaussian whose width grows with variance $2Dt$:
$$p(x,t) = \frac{1}{\sqrt{4\pi D t}} \exp\!\left(-\frac{x^2}{4Dt}\right)$$
Because a $\delta$-function is a high-frequency spike (its Fourier transform $\int_{-\infty}^{\infty}dx\, e^{-ikx}\delta(x) = 1$ contains all spatial frequencies uniformly), PINNs should have difficulty modelling it. To quantify when this failure occurs, we set $D = 1$ and initialize the PINN with $p(x,0) = \frac{1}{\sqrt{4\pi t_0}}\exp(-x^2/4t_0)$ for some $t_0 > 0$. As $t_0 \to 0$, the initial condition approaches $\delta(x)$ and should become harder for the PINN to represent.
We truncate the spatial domain to $x \in [-L, L]$ with $L = 10$ and impose Dirichlet boundary conditions $p(\pm L, t) = 0$, which is a reasonable approximation since the Gaussian tails decay exponentially. The PINN architecture is the same MLP as before but with a 2-dimensional input $(x, t)$: five hidden layers of 64 neurons each with $\tanh$ activations and Xavier normal initialization.
The cost function has the same structure as before,
$$C = \frac{1}{N_{\text{PDE}}}\sum_{j=1}^{N_{\text{PDE}}}\left|\partial_t\hat{p}(x_j,t_j)-D\,\partial_x^2\hat{p}(x_j,t_j)\right|^2+\frac{\lambda_{\text{IC}}}{N_{\text{IC}}}\sum_{k=1}^{N_{\text{IC}}}\left|\hat{p}(x_k,0)-p(x_k,0)\right|^2+\frac{\lambda_{\text{BC}}}{N_{\text{BC}}}\sum_{l=1}^{N_{\text{BC}}}\left|\hat{p}(\pm L,t_l)\right|^2$$
with $N_{\text{PDE}} = 2500$, $N_{\text{IC}} = 1000$, $N_{\text{BC}} = 1000$, $\lambda_{\text{IC}} = 10$ (set higher to prevent the trivial $\hat{p} \equiv 0$ solution from dominating), and $\lambda_{\text{BC}} = 1$. All collocation points are re-sampled uniformly at random each epoch. The model was trained for $t_0 \in \{1.0, 0.1, 0.05, 0.01, 0.001, 0.0001\}$, with 15,000 epochs per run.
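As a hedged illustration (not the project's actual code), the interior residual for this particular PDE can be computed by automatic differentiation through the 2D-input network:

def fokker_planck_residual(model, x, t, D=1.0):
    # PDE residual p_t - D * p_xx for the drift-free Fokker-Planck (heat) equation,
    # computed by autodiff through a network mapping (x, t) to p_hat.
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    p = model(torch.stack([x, t], dim=1)).squeeze(-1)
    p_x, p_t = torch.autograd.grad(p.sum(), (x, t), create_graph=True)
    p_xx = torch.autograd.grad(p_x.sum(), x, create_graph=True)[0]
    return p_t - D * p_xx

The mean square of this residual, together with the IC and BC mean-square terms weighted by $\lambda_{\text{IC}}$ and $\lambda_{\text{BC}}$, gives the cost above.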
The figures below show the PINN’s predicted profile $\hat{p}(x,t)$ (blue dashed) compared against the analytical Green’s function (black solid) at six time snapshots $t = 0, 0.2, 0.4, 0.6, 0.8, 1.0$, for decreasing values of $t_0$.
As $t_0$ decreases, the initial condition sharpens into a narrower and taller spike, whose Fourier spectrum becomes progressively more broadband (in the limit $t_0 \to 0$, the spectrum is flat since $\hat{\delta}(k) = 1$). This is where the spectral bias of standard MLPs should manifest most clearly.
For $t_0 = 1.0$ and $t_0 = 0.1$, the PINN achieves near-perfect agreement with the analytical solution at all time snapshots, since the initial Gaussian is broad and smooth (dominated by low spatial frequencies that the PINN can represent without difficulty). By $t_0 = 0.01$, the initial peak is sharp enough that small discrepancies appear in the $t = 0$ panel, though the PINN still captures the subsequent diffusive broadening reasonably well. The failure becomes catastrophic at $t_0 = 0.001$ and $t_0 = 0.0001$: the PINN cannot represent the near-$\delta$-function initial condition at all, and this failure propagates forward in time, producing profiles that are asymmetric, too broad, and shifted away from the origin. The PINN’s predictions at later times ($t = 0.6, 0.8, 1.0$) remain wrong even though the analytical solution has by then diffused into a smooth, low-frequency Gaussian. This suggests that spatial spectral bias corrupts the PINN’s internal representation of the solution globally, not just at the initial time.
In [4]:
def plot_fokkerplanck_results(t0_values, ckpt_dir, t_snapshots, D=1.0, x_range=(-10, 10)):
    # Load FP checkpoints and plot PINN vs analytical at multiple time snapshots.
    for t0 in t0_values:
        t0_str = str(t0).replace('.', 'p')
        ckpt_path = os.path.join(ckpt_dir, f'fp_t0_{t0_str}.pt')
        if not os.path.exists(ckpt_path):
            print(f"Checkpoint not found: {ckpt_path}, skipping.")
            continue
        checkpoint = torch.load(ckpt_path, map_location=DEVICE, weights_only=False)
        model = VanillaPINN2D()
        model.load_state_dict(checkpoint['model_state_dict'])
        model.to(DEVICE)
        model.eval()

        x = np.linspace(x_range[0], x_range[1], 500)
        n_snaps = len(t_snapshots)
        fig, axs = plt.subplots(2, 3, figsize=(16, 9))
        axs = axs.flatten()
        for i, t in enumerate(t_snapshots):
            ax = axs[i]
            exact_p = analytical_fokker_planck(x, t + t0, D)
            xt_grid = np.stack([x, np.full_like(x, t)], axis=1)
            xt_tensor = torch.tensor(xt_grid, dtype=torch.float32, device=DEVICE)
            with torch.no_grad():
                p_pred = model(xt_tensor).cpu().numpy().flatten()
            ax.plot(x, exact_p, 'k-', lw=2, label='Analytical')
            ax.plot(x, p_pred, 'b--', lw=1.5, label='Vanilla PINN')
            ax.set_title(f'$t = {t}$', fontsize=13)
            ax.set_xlabel('$x$')
            if i % 3 == 0:
                ax.set_ylabel('$p(x, t)$')
            ax.legend(fontsize=9)
            ax.grid(True, alpha=0.3)
        fig.suptitle(f'Fokker-Planck: PINN vs Analytical ($t_0 = {t0}$, $D = {D}$)', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
In [5]:
# Plot Fokker-Planck results for all t0 values
t0_values = [1.0, 0.1, 0.05, 0.01, 0.001, 0.0001]
t_snapshots = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
plot_fokkerplanck_results(t0_values, 'fokkerplanck_PINN/checkpoints', t_snapshots)
These two experiments probe spectral bias along complementary axes. The van der Pol experiment demonstrates bias in the temporal domain: as $\mu$ increases, the solution $x(t)$ develops sharp transitions in time that the PINN cannot capture. The Fokker-Planck experiment demonstrates bias in the spatial domain: as $t_0$ decreases, the initial condition $p(x, 0)$ develops sharp spatial structure that the PINN cannot represent. In both cases, the fundamental issue is the same: standard MLPs with smooth activation functions tend to learn low-frequency components of the target function first and struggle with high-frequency components.
Several extensions were considered but not implemented:
Parametric PINN for the van der Pol oscillator: rather than training a separate PINN for each $\mu$, one could train a single network that accepts $(t, x(0), \dot{x}(0), \mu)$ as inputs, learning the entire family of oscillators at once.
Transfer learning: following Mustajab et al. (2024), one could first train a PINN on low-frequency solutions and then transfer the weights to better learn higher-frequency solutions.
Normalization penalty for Fokker-Planck: although a normalized initial distribution $\int dx\, p(x,0) = 1$ implies normalization for all subsequent $t$ (provided the probability current vanishes sufficiently fast at the boundaries), for the smaller values of $t_0$ the PINN predictions were no longer even normalized. One way to address this would be to incorporate an explicit normalization penalty proportional to $(1 - \int dx\, p(x,t))^2$ into the cost function, estimated via Monte Carlo integration over the sampled interior points (a sketch of such a penalty appears after this list).
Fourier transform penalties: one could add terms to the cost function that penalize discrepancies in the Fourier domain, particularly at high frequencies (e.g. via a Wasserstein distance).
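For the normalization-penalty idea mentioned above, a hypothetical sketch of such a term (the uniform sampling over $[-L, L]$ and the simple Monte Carlo estimator here are assumptions, not something implemented in this project):

def normalization_penalty(model, t, num_samples=1000, L=10.0):
    # Monte Carlo estimate of (1 - integral of p_hat(x, t) dx)^2 over x in [-L, L].
    x = (2 * L) * torch.rand(num_samples) - L              # uniform samples in [-L, L]
    t_col = torch.full_like(x, t)
    p_hat = model(torch.stack([x, t_col], dim=1)).squeeze(-1)
    integral_estimate = (2 * L) * p_hat.mean()             # (b - a) * mean(f) estimator
    return (1.0 - integral_estimate) ** 2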
Raissi, M., Perdikaris, P. and Karniadakis, G.E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, pp.686-707.
Wang, S., Teng, Y. and Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5), pp.A3055-A3081.
Rahaman, N. et al. (2018). On the spectral bias of neural networks. ICML 2019.
Xu, Z.J. et al. (2019). Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5), pp.1746-1767.
Mustajab, A.H. et al. (2024). Transfer learning enhanced physics informed neural network for phase-field modeling of fracture. Theoretical and Applied Fracture Mechanics, 130, p.104321.
Solution: Tweedie’s formula asserts that if \(p(\mathbf x|\boldsymbol{\mu},\sigma)=\frac{1}{\det(\sqrt{2\pi}\sigma)}e^{-(\mathbf x-\boldsymbol{\mu})^T\sigma^{-2}(\mathbf x-\boldsymbol{\mu})/2}\) is normally distributed, then without needing to know anything about the prior \(p(\boldsymbol{\mu}|\sigma)\) on the mean random vector \(\boldsymbol{\mu}\), one has the following Bayesian point estimate for it:
\[\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle=\mathbf x+\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\]
At first glance, one might think that \(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle\approx\mathbf x\), but Tweedie’s formula provides the empirical Bayes correction \(\sigma^2\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}\) to the naive estimate. The proof amounts to a brute force computation of the score vector field:
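For completeness, that computation can be sketched as follows: since \(p(\mathbf x|\sigma)=\int d\boldsymbol{\mu}\,p(\boldsymbol{\mu}|\sigma)p(\mathbf x|\boldsymbol{\mu},\sigma)\) and \(\partial p(\mathbf x|\boldsymbol{\mu},\sigma)/\partial\mathbf x=\sigma^{-2}(\boldsymbol{\mu}-\mathbf x)p(\mathbf x|\boldsymbol{\mu},\sigma)\),
\[\frac{\partial\ln p(\mathbf x|\sigma)}{\partial\mathbf x}=\frac{1}{p(\mathbf x|\sigma)}\int d\boldsymbol{\mu}\,p(\boldsymbol{\mu}|\sigma)\,p(\mathbf x|\boldsymbol{\mu},\sigma)\,\sigma^{-2}(\boldsymbol{\mu}-\mathbf x)=\sigma^{-2}\left(\langle\boldsymbol{\mu}|\mathbf x,\sigma\rangle-\mathbf x\right)\]
which rearranges to Tweedie’s formula.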
Problem: Show that the set of Gaussians is closed under both convolution and multiplication.
Solution: The convolution of two normalized Gaussians is also a normalized Gaussian representing the probability distribution of the sum of the two independent normal random variables (i.e. it is a kind of orthonormal bilinear transformation):
where \(\boldsymbol{\mu}_{1*2}=\boldsymbol{\mu}_1+\boldsymbol{\mu}_2\) and \(\sigma^2_{1*2}=\sigma^2_1+\sigma^2_2\) (thus, addition of independent normal random variables is isomorphic to vector addition in \((\boldsymbol{\mu},\sigma^2)\)-space).
The product of two normalized Gaussians is an (unnormalized) Gaussian (but whose normalization constant itself has a Gaussian form):
where \(\sigma^{-2}_{12}\boldsymbol{\mu}_{12}=\sigma^{-2}_1\boldsymbol{\mu}_1+\sigma^{-2}_2\boldsymbol{\mu}_2\) and \(\sigma^{-2}_{12}=\sigma^{-2}_1+\sigma^{-2}_2\).
(aside: the properties above are of course tied by the fact that Gaussians are (roughly speaking) eigenfunctions of the Fourier transform:
which intertwines convolution and multiplication).
Problem: What is the fundamental problem that diffusion models solve?
Solution: Given some prior data distribution \(p(\mathbf x_0)\) (e.g. the distribution of images or audio), and given some latent \(\mathbf x_T\) which does not lie in the support of \(p\) in the sense that \(p(\mathbf x_T)\approx 0\), a diffusion model is any (possibly non-deterministic) algorithm for transporting \(\mathbf x_T\) onto the submanifold of latents \(\mathbf x_0\) for which \(p(\mathbf x_0)>0\) is supported. Roughly speaking, a diffusion model is any parametric ansatz for the score vector field \(\partial\ln p(\mathbf x_0)/\partial\mathbf x_0\) of the data distribution, so one can get from \(\mathbf x_T\mapsto\mathbf x_0\) by taking steps along the score vector field. Diffusion models are commonly used as generative models in the sense that the latent \(\mathbf x_0\) represents a novel sample from \(p(\mathbf x_0)\) though other non-generative applications (e.g. reconstruction) do exist.
Problem: Explain the problem that denoising diffusion probabilistic models (DDPMs) solve, and contrast how they are used during inference vs. how they are trained.
Solution: DDPMs are generative diffusion models. At inference time, one begins by sampling some noisy latent \(\mathbf x_T\) where \(p(\mathbf x_T)\approx 0\) and iteratively denoising it through a sequence \(\mathbf x_T,…,\mathbf x_0\) towards some submanifold latent \(\mathbf x_0\) where \(p(\mathbf x_0)>0\), thereby generating the sample \(\mathbf x_0\). More explicitly, the sequence \(\mathbf x_T,…,\mathbf x_0\) will turn out to be a discrete-time Markov chain, and the operation of “denoising” \(\mathbf x_t\mapsto\mathbf x_{t-1}\) will really just amount to sampling from the Markov chain’s transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)=\int d\mathbf x_0p(\mathbf x_0|\mathbf x_t)p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\). As it is, this is intractable, but focus on that second term in the integrand:
\[p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)=\frac{p(\mathbf x_t|\mathbf x_{t-1},\mathbf x_0)\,p(\mathbf x_{t-1}|\mathbf x_0)}{p(\mathbf x_t|\mathbf x_0)}=\frac{p(\mathbf x_t|\mathbf x_{t-1})\,p(\mathbf x_{t-1}|\mathbf x_0)}{p(\mathbf x_t|\mathbf x_0)}\]
Both of these are essentially transition kernels for the forward Markov chain \(\mathbf x_0,…,\mathbf x_T\). At this point, the DDPM paper decides to use a Gaussian forward transition kernel:
\[p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\]
where \(0<\sigma^2_1<\sigma^2_2<…<\sigma^2_T\ll 1\) are just \(T\) fixed hyperparameters (called a variance schedule, where the horizon \(T\) is also a hyperparameter). Equivalently, using the reparameterization trick, adding Gaussian noise looks like:
\[\mathbf x_t=\sqrt{1-\sigma_t^2}\,\mathbf x_{t-1}+\sigma_t\boldsymbol{\varepsilon}\]
where \(\boldsymbol{\varepsilon}\) is drawn from an isotropic standard normal \(p(\boldsymbol{\varepsilon})\sim e^{-|\boldsymbol{\varepsilon}|^2/2}\). The \(\sqrt{1-\sigma_t^2}\) factor is used to scale down the previous state \(\mathbf x_{t-1}\) in order to prevent the variance (i.e. diagonal entries of the covariance matrix) of \(\mathbf x_t\) from exploding (and indeed, would be exactly variance-preserving iff the data distribution covariance \(\sigma^2_{\mathbf x_0}=1\)); cf. Brownian motion of a particle attached to the origin by a spring.
Rather than iteratively stepping from \(\mathbf x_0\) to \(\mathbf x_t\) in \(O(t)\) samples, one can jump in \(O(1)\) time from \(\mathbf x_0\mapsto\mathbf x_t\) via just a single \(\boldsymbol{\varepsilon}\)-sample:
\[\mathbf x_t=\sqrt{\bar\alpha_t}\,\mathbf x_0+\sqrt{1-\bar\alpha_t}\,\boldsymbol{\varepsilon},\qquad\bar\alpha_t:=(1-\sigma_1^2)(1-\sigma_2^2)\cdots(1-\sigma_t^2)\]
Thus, the “oracle posterior” \(p(\mathbf x_{t-1}|\mathbf x_0,\mathbf x_t)\sim e^{-|\mathbf x_{t-1}-\boldsymbol{\mu}_{t-1}|^2/2\tilde{\sigma}^2_{t-1}}\) is an isotropic Gaussian with:
\[\boldsymbol{\mu}_{t-1}=\frac{\sqrt{\bar\alpha_{t-1}}\,\sigma_t^2\,\mathbf x_0+\sqrt{1-\sigma_t^2}\,(1-\bar\alpha_{t-1})\,\mathbf x_t}{1-\bar\alpha_t},\qquad\tilde\sigma^2_{t-1}=\frac{(1-\bar\alpha_{t-1})\,\sigma_t^2}{1-\bar\alpha_t}\]
Although by design the forward Markov chain has Gaussian transition kernels \(p(\mathbf x_t|\mathbf x_{t-1})\sim e^{-|\mathbf x_t-\sqrt{1-\sigma_t^2}\mathbf x_{t-1}|^2/2\sigma_t^2}\), the reverse transition kernel \(p(\mathbf x_{t-1}|\mathbf x_t)\) is in general non-Gaussian. However, a theorem of Anderson guarantees that it is in fact Gaussian at the level of stochastic differential equations, and so it is not a bad approximation to simply sample \(\mathbf x_{t-1}\) from the Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t,\mathbf x_0)\) as a proxy for sampling from the true, non-Gaussian \(p(\mathbf x_{t-1}|\mathbf x_t)\). Reparameterized, this means that inference looks like unannealed Langevin dynamics (ULD):
\[\mathbf x_{t-1}=\frac{1}{\sqrt{1-\sigma_t^2}}\left(\mathbf x_t-\frac{\sigma_t^2}{\sqrt{1-\bar\alpha_t}}\boldsymbol{\varepsilon}\right)+\tilde\sigma_{t-1}\boldsymbol{\varepsilon}',\qquad\boldsymbol{\varepsilon}'\sim\mathcal N(\mathbf 0,\mathbf 1)\]
The only problem here is that the Gaussian noise \(\boldsymbol{\varepsilon}\) that was injected to get from \(\mathbf x_0\mapsto\mathbf x_t\) is unknown.
This is where training comes in. Training consists of \(3\) distinct sampling steps and assembling the samples:
Sample a clean latent \(\mathbf x_0\) from the training set (i.e. from the data distribution \(p(\mathbf x_0)\)).
Sample a timestep \(t\) uniformly from \(\{1,…,T\}\).
Sample Gaussian noise \(\boldsymbol{\varepsilon}\) from the isotropic standard normal, and assemble \(\mathbf x_t=\sqrt{\bar\alpha_t}\mathbf x_0+\sqrt{1-\bar\alpha_t}\boldsymbol{\varepsilon}\).
Architecturally, the diffusion model itself is then a ResNet-like noise predictor \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) (e.g. a U-Net or a transformer, etc.) that seeks to predict the sampled Gaussian noise \(\boldsymbol{\varepsilon}\) that was added to \(\mathbf x_0\) to obtain \(\mathbf x_t\). This is enforced by the MSE loss function \(L(\hat{\boldsymbol{\varepsilon}},\boldsymbol{\varepsilon})=|\hat{\boldsymbol{\varepsilon}}-\boldsymbol{\varepsilon}|^2/2\) with corresponding cost function over the training set:
\[C(\boldsymbol{\theta})=\left\langle\frac{1}{2}\left|\boldsymbol{\varepsilon}-\hat{\boldsymbol{\varepsilon}}\left(\sqrt{\bar\alpha_t}\mathbf x_0+\sqrt{1-\bar\alpha_t}\boldsymbol{\varepsilon},\,t\,\middle|\,\boldsymbol{\theta}\right)\right|^2\right\rangle_{\mathbf x_0,t,\boldsymbol{\varepsilon}}\]
and for \(t=1\), the final generated sample \(\mathbf x_0=\frac{1}{\sqrt{1-\sigma^2_1}}\left(\mathbf x_1-\sigma_1\hat{\boldsymbol{\varepsilon}}(\mathbf x_1,1|\boldsymbol{\theta})\right)\) should not have any noise added to it.
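A minimal sketch of one DDPM training step implementing these sampling steps and the MSE objective (written in PyTorch purely for illustration; eps_model is a stand-in for whatever U-Net/transformer noise predictor is actually used):

import torch

def ddpm_training_step(eps_model, x0_batch, sigma2):
    # sigma2: 1D tensor of length T holding the variance schedule sigma_t^2.
    T = sigma2.shape[0]
    alpha_bar = torch.cumprod(1.0 - sigma2, dim=0)               # alpha_bar_t = prod_{i<=t} (1 - sigma_i^2)
    t = torch.randint(0, T, (x0_batch.shape[0],))                # one timestep per example (index 0 <-> t = 1)
    eps = torch.randn_like(x0_batch)                             # sample Gaussian noise
    ab = alpha_bar[t].view(-1, *([1] * (x0_batch.dim() - 1)))    # broadcast to x0's shape
    x_t = torch.sqrt(ab) * x0_batch + torch.sqrt(1.0 - ab) * eps # O(1) jump from x_0 to x_t
    eps_hat = eps_model(x_t, t)                                  # predict the injected noise
    return 0.5 * torch.mean((eps_hat - eps) ** 2)                # MSE loss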
Problem: In DDPMs, explain how estimating the noise \(\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\) is synonymous with estimating the score vector field of the data distribution \(\partial\ln p(\mathbf x_t)/\partial\mathbf x_t\).
Solution: Because \(\mathbf x_t\) is normally distributed about \(\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0\) with covariance \(1-(1-\sigma_1^2)…(1-\sigma^2_t)\), Tweedie’s formula asserts that (writing \(\bar\alpha_t:=(1-\sigma_1^2)\cdots(1-\sigma_t^2)\) as before):
\[\left\langle\sqrt{\bar\alpha_t}\,\mathbf x_0\,\middle|\,\mathbf x_t\right\rangle=\mathbf x_t+\left(1-\bar\alpha_t\right)\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}\]
Comparing this with \(\mathbf x_t=\sqrt{(1-\sigma^2_1)…(1-\sigma^2_t)}\mathbf x_0+\sqrt{1-(1-\sigma^2_1)…(1-\sigma^2_t)}\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})\), one concludes that:
\[\frac{\partial\ln p(\mathbf x_t)}{\partial\mathbf x_t}=-\frac{\hat{\boldsymbol{\varepsilon}}(\mathbf x_t,t|\boldsymbol{\theta})}{\sqrt{1-\bar\alpha_t}}\]
Problem: Explain how the DDPM architecture can be reformulated in terms of a stochastic differential equation, and (following work of Song et al.) deduce its distributionally equivalent probability flow ODE.
Solution: Recall the DDPM forward Markov chain transition kernel (rewritten using \(i\in\{0,…,T\}\) instead of \(t\) because in a moment \(t\in [0,T]\) will be reserved for a continuous variable):
\[\mathbf x_i=\sqrt{1-\sigma_i^2}\,\mathbf x_{i-1}+\sigma_i\boldsymbol{\varepsilon}\]
Taking the continuous time limit of this discrete difference equation amounts to letting \(T\to\infty\) while \(\sigma^2_i\to 0\) in such a way that their product \(T\sigma^2_i:=\sigma^2(t)\) is fixed (cf. the dipole limit in electrostatics which takes \(q\to\infty\) and \(\Delta\mathbf x\to\mathbf 0\) such that \(\boldsymbol{\pi}:=q\Delta\mathbf x\) is fixed). Hence, writing \(dt:=1/T\):
\[\mathbf x_i=\sqrt{1-\sigma^2(t)\,dt}\,\mathbf x_{i-1}+\sigma(t)\sqrt{dt}\,\boldsymbol{\varepsilon}\]
So after binomial expanding, isolating \(d\mathbf x:=\mathbf x_i-\mathbf x_{i-1}\), and recognizing the Wiener process \(d\mathbf w:=\sqrt{dt}\boldsymbol{\varepsilon}\):
\[d\mathbf x=-\frac{\sigma^2(t)}{2}\mathbf x\,dt+\sigma(t)\,d\mathbf w\]
This particular SDE is also an instance of an Ornstein-Uhlenbeck process, but is special in that it is also variance-preserving. The corresponding Fokker-Planck probability current density is \(-\frac{\sigma^2(t)}{2}\left(p(\mathbf x,t)\mathbf x+\frac{\partial p}{\partial\mathbf x}\right):=p(\mathbf x,t)\mathbf v_{\text{eff}}(\mathbf x,t)\) which is distributionally equivalent to the ODE \(d\mathbf x=\mathbf v_{\text{eff}}dt\) with effective velocity field:
\[\mathbf v_{\text{eff}}(\mathbf x,t)=-\frac{\sigma^2(t)}{2}\left(\mathbf x+\frac{\partial\ln p(\mathbf x,t)}{\partial\mathbf x}\right)\]
Hence, the SDE admits a corresponding probability flow ODE \(d\mathbf x/dt=\mathbf v_{\text{eff}}(\mathbf x,t)\) which can be integrated during inference using standard numerical algorithms (e.g. Euler, Runge-Kutta); recall the score vector field \(\partial\ln p/\partial\mathbf x\) is what the diffusion model estimates via \(\hat{\boldsymbol{\epsilon}}(\mathbf x,t|\boldsymbol{\theta})\).
Problem: Hence, in light of the above discussion, explain and motivate flow matching models, and describe the simplest instance of it (i.e. rectified flow).
Solution: Roughly speaking, diffusion models are trained to stochastically map signal to noise. At inference, one is therefore obliged to denoise along these stochastic trajectories in order to map from noise back to signal. But this is a bit like shooting oneself in the foot, unnecessarily making one’s life difficult with no ostensible gain. Flow matching models are (loosely speaking) what one gets after cutting through the fluff of diffusion models with Occam’s razor, replacing SDEs with ODEs (indeed, this is w.l.o.g. thanks to the probability flow ODE construction).
Intuitively, the simplest possible map \(\mathbf x_0\mapsto\mathbf x_1\) one could engineer is a simple linear interpolation for \(t\in [0,1]\):
\[\mathbf x_t=t\mathbf x_1+(1-t)\mathbf x_0\]
(aside: in the flow matching literature, the convention about whether \(\mathbf x_0\) represents signal vs. noise is sometimes reversed compared with the diffusion literature, but here it is being assumed that \(\mathbf x_0\) represents signal as in diffusion models). The relevant velocity field to be learned is thus \(d\mathbf x_t/dt=\mathbf x_1-\mathbf x_0\). A flow-matching model thus no longer seeks to estimate a score vector field, but rather a velocity vector field \(\hat{\mathbf v}(\mathbf x,t|\boldsymbol{\theta})\) via an MSE loss function \(L(\hat{\mathbf v},\mathbf v):=|\hat{\mathbf v}-\mathbf v|^2/2\) and training cost function:
\[C(\boldsymbol{\theta})=\left\langle\frac{1}{2}\left|\hat{\mathbf v}(\mathbf x_t,t|\boldsymbol{\theta})-(\mathbf x_1-\mathbf x_0)\right|^2\right\rangle_{\mathbf x_0,\mathbf x_1,t}\]
A key advantage of learning such a simple rectified flow (also called optimal transport) is that inference is very easy, and can be achieved with very few steps since one just has to step along a straight line.
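A minimal sketch of the corresponding rectified-flow training step (again PyTorch pseudocode purely for illustration; v_model is a stand-in velocity network, and x1 denotes the noise endpoint, matching the convention above that \(\mathbf x_0\) is signal):

def rectified_flow_training_step(v_model, x0_batch):
    # Pair each data sample x0 with a noise sample x1, pick a random time, and regress
    # the predicted velocity onto the constant target x1 - x0 along the straight line.
    x1 = torch.randn_like(x0_batch)
    t = torch.rand(x0_batch.shape[0]).view(-1, *([1] * (x0_batch.dim() - 1)))
    x_t = t * x1 + (1.0 - t) * x0_batch
    v_target = x1 - x0_batch
    v_hat = v_model(x_t, t.flatten())
    return 0.5 * torch.mean((v_hat - v_target) ** 2)

At inference one then integrates \(d\mathbf x/dt=\hat{\mathbf v}\) from \(t=1\) down to \(t=0\) with a handful of Euler (or higher-order) steps.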
Problem: Let \(\boldsymbol{\Theta}\) be a smooth statistical manifold, and let \(D:\boldsymbol{\Theta}^2\to [0,\infty)\) be a smooth function. What does it mean for \((\boldsymbol{\Theta},D)\) to be a “divergence manifold“?
Solution: The notion of a divergence manifold relaxes the axioms of a metric space, specifically still demanding \(D(\boldsymbol{\theta} || \boldsymbol{\theta}’)\geq 0\) and \(D(\boldsymbol{\theta} || \boldsymbol{\theta}’)=0\Leftrightarrow\boldsymbol{\theta}’=\boldsymbol{\theta}\) for all \(\boldsymbol{\theta},\boldsymbol{\theta}’\in\boldsymbol{\Theta}\) but no longer enforcing symmetry \(D(\boldsymbol{\theta} || \boldsymbol{\theta}’)=D(\boldsymbol{\theta}’ || \boldsymbol{\theta})\) or the triangle inequality \(D(\boldsymbol{\theta} || \boldsymbol{\theta}^{\prime\prime})\leq D(\boldsymbol{\theta} || \boldsymbol{\theta}’) + D(\boldsymbol{\theta}’ || \boldsymbol{\theta}^{\prime\prime})\).
Problem: Explain how any divergence manifold \((\boldsymbol{\Theta},D)\) enjoys the free gift of being automatically equipped with a canonical Riemannian metric tensor field \(g_D(\boldsymbol{\theta}):T_{\boldsymbol{\theta}}(\boldsymbol{\Theta})^2\to\mathbf R\) induced by the divergence function \(D\).
Solution: The so-called Fisher information metric \(g_D(\boldsymbol{\theta})=(g_D)_{ij}(\boldsymbol{\theta})d\theta_i\otimes d\theta_j\) on the statistical manifold \(\boldsymbol{\Theta}\) induced by the divergence \(D\) is basically just its Hessian:
\[(g_D)_{ij}(\boldsymbol{\theta}):=\left.\frac{\partial^2 D(\boldsymbol{\theta}||\boldsymbol{\theta}')}{\partial\theta'_i\,\partial\theta'_j}\right|_{\boldsymbol{\theta}'=\boldsymbol{\theta}}\]
Intuitively, one can think of the divergence \(D(\boldsymbol{\theta}||\boldsymbol{\theta}’)\) of \(\boldsymbol{\theta}’\in\boldsymbol{\Theta}\) from \(\boldsymbol{\theta}\in\boldsymbol{\Theta}\) as holding the “ground truth” distribution \(\boldsymbol{\theta}\) fixed while sniffing around for a proxy distribution \(\boldsymbol{\theta}’\) to approximate \(\boldsymbol{\theta}\). By the axioms of a divergence manifold, the global minimum value \(D(\boldsymbol{\theta}||\boldsymbol{\theta}’)=0\) is attained at \(\boldsymbol{\theta}’=\boldsymbol{\theta}\), so Taylor expanding about this global minimum (\((\partial D(\boldsymbol{\theta}||\boldsymbol{\theta}’)/\partial\boldsymbol{\theta}’)_{\boldsymbol{\theta}’=\boldsymbol{\theta}}=\mathbf 0\)), one has the local quadratic form:
\[D(\boldsymbol{\theta}||\boldsymbol{\theta}+d\boldsymbol{\theta})=\frac{1}{2}(g_D)_{ij}(\boldsymbol{\theta})\,d\theta_i\,d\theta_j+O(|d\boldsymbol{\theta}|^3)\]
Problem: Let \(f:(0,\infty)\to\mathbf R\) be a nonlinear convex function with a zero at \(f(1)=0\). Define the family of \(f\)-divergences \(D_f:\Theta^2\to [0,\infty)\), prove that they do indeed satisfy the axioms of a divergence manifold, and exhibit some examples of functions \(f\) and the corresponding \(f\)-divergence \(D_f\).
Solution: One has:
\[D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x p(\mathbf x|\boldsymbol{\theta}’)f\left(\frac{p(\mathbf x|\boldsymbol{\theta})}{p(\mathbf x|\boldsymbol{\theta}’)}\right)\]
By rewriting this as an expectation \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)=\langle f\left(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)\right)\rangle_{\mathbf x\sim p(\mathbf x|\boldsymbol{\theta}’)}\), applying Jensen’s inequality, using \(\int d\mathbf x p(\mathbf x|\boldsymbol{\theta})=1\) and finally using \(f(1)=0\), one establishes non-negativity \(D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)\geq 0\). The same \(f(1)=0\) condition also ensures \(\boldsymbol{\theta}=\boldsymbol{\theta}’\Rightarrow D_f(\boldsymbol{\theta}||\boldsymbol{\theta}’)=0\), and the converse is proven by recalling that Jensen’s inequality becomes an equality iff \(f\) is linear (forbidden by hypothesis) or its argument \(p(\mathbf x|\boldsymbol{\theta})/p(\mathbf x|\boldsymbol{\theta}’)=\text{constant}\). But since the expectation of this constant in \(\mathbf x\sim p(\mathbf x|\boldsymbol{\theta}’)\) was \(1\), the constant itself must be \(1\), Q.E.D.
\(f(x):=x\ln x\) generates the asymmetric Kullback-Leibler(KL) divergence \[D_{\text{KL}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x p(\mathbf x|\boldsymbol{\theta})\ln\frac{p(\mathbf x|\boldsymbol{\theta})}{p(\mathbf x|\boldsymbol{\theta}’)}\]
\(f(x):=-\ln x\) generates the dual KL divergence \(D_{\text{KL}}(\boldsymbol{\theta}’||\boldsymbol{\theta})\) (in general, \(D_{xf(1/x)}(\boldsymbol{\theta}||\boldsymbol{\theta}’)=D_{f(x)}(\boldsymbol{\theta}’||\boldsymbol{\theta})\))
\(f(x):=|x-1|/2\) generates the total variation divergence \[d_{\text{TV}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{1}{2}\int d\mathbf x|p(\mathbf x|\boldsymbol{\theta})-p(\mathbf x|\boldsymbol{\theta}’)|\] (indeed, TVD is actually symmetric and obeys the triangle inequality so is a metric in the metric space sense, yet does not admit a Fisher information metric due to the non-differentiability of the absolute value).
\(f(x):=(x-1)^2\) generates the Pearson \(\chi^2\) divergence \[D_{\chi^2}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\int d\mathbf x\frac{(p(\mathbf x|\boldsymbol{\theta})-p(\mathbf x|\boldsymbol{\theta}’))^2}{p(\mathbf x|\boldsymbol{\theta}’)}\] and \(f(x)=(x-1)^2/x\) generates the dual divergence (called the Neyman \(\chi^2\) divergence) in which one replaces \(p(\mathbf x|\boldsymbol{\theta}’)\mapsto p(\mathbf x|\boldsymbol{\theta})\) in the denominator.
\(f(x):=\frac{1}{2}\left(x\ln x-(x+1)\ln\frac{x+1}{2}\right)\) generates the symmetric Jensen-Shannon divergence \[d^2_{\text{JS}}(\boldsymbol{\theta}||\boldsymbol{\theta}’):=\frac{D_{\text{KL}}(p_{\boldsymbol{\theta}}||\frac{p_{\boldsymbol{\theta}}+p_{\boldsymbol{\theta}’}}{2})+D_{\text{KL}}(p_{\boldsymbol{\theta}’}||\frac{p_{\boldsymbol{\theta}}+p_{\boldsymbol{\theta}’}}{2})}{2}\]
Problem: What is a fundamental limitation in the family of \(f\)-divergences \(D_f\)? How do integral probability metrics such as the Wasserstein (a.k.a. earth-mover’s) distance address this shortcoming?
$\textbf{Problem}$: Apply the PRAC-DTDT-ID workflow to a (vanilla) $\textit{autoencoder}$.
$\textbf{Solution}$:
Problem: to learn a lower-dimensional latent manifold representation of the input manifold (this is justified by the $\textit{submanifold hypothesis}$, namely that the data-generating distribution $p(\mathbf x)$ is essentially supported only on some lower-dimensional submanifold $\mathbf R^{<d}$ of the input manifold $\cong\mathbf R^d$).
Representation: assume the abstract data manifold has already been coordinatized via some atlas of charts into feature vectors $\mathbf x\in\mathbf R^d$.
Architecture: an encoder neural network $\mathbf e_{\boldsymbol{\phi}}:\mathbf R^d\to\mathbf R^{<d}$ that maps input feature vectors $\mathbf x\in\mathbf R^d$ to a lower-dimensional latent manifold $\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\in\mathbf R^{<d}$. This is then passed through a decoder neural network $\mathbf d_{\boldsymbol{\theta}}:\mathbf R^{<d}\to\mathbf R^d$ that attempts to reconstruct the original input feature vector $\mathbf d_{\boldsymbol{\theta}}(\mathbf e_{\boldsymbol{\phi}}(\mathbf x))\approx\mathbf x\in\mathbf R^d$ from its encoded latent manifold representation $\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\in\mathbf R^{<d}$. In this context, the latent manifold $\mathbf R^{<d}$ is also called the $\textit{bottleneck}$ as it’s the narrowest pinch point (a hidden layer of $<d$ neurons) in the autoencoder. If it were not $<d$ but rather $\geq d$, then the autoencoder could just memorize the input data without learning the relevant submanifold structure, so by forcing the information to flow through a bottleneck (a deliberately lossy form of compression), the autoencoder has to encode efficiently to survive the bottleneck.
Cost: since the overall goal of the autoencoder (by virtue of its etymology!) is to essentially implement an identity function $\mathbf x\mapsto\mathbf x$ (“reconstruction of $\mathbf x$”), the reconstruction loss function may be taken to be the usual MSE:
$$L(\hat{\mathbf x},\mathbf x)=\frac{|\hat{\mathbf x}-\mathbf x|^2}{2}$$
so the corresponding cost function is just the expected reconstruction loss over the empirical approximation of $p(\mathbf x)$ due to $N$ i.i.d. samples $\mathbf x_1,…,\mathbf x_N\in\mathbf R^d$ from $p(\mathbf x)$:
$$C(\boldsymbol{\theta},\boldsymbol{\phi})=\frac{1}{N}\sum_{i=1}^NL(\mathbf d_{\boldsymbol{\theta}}(\mathbf e_{\boldsymbol{\phi}}(\mathbf x_i)),\mathbf x_i)$$
Data: one of the advantages of autoencoders is that it is a data-agnostic framework which is thus completely general and can be applied to learn more compact representations of essentially any kind of data (e.g. images, text, audio, …).
Train: use the Adam optimizer.
Dev: use to fine-tune hyperparameters like the bottleneck depth $<d$.
Test: self-explanatory.
Iterate: self-explanatory.
Deploy: self-explanatory.
$\textbf{Problem}$: Having fleshed out the theory of autoencoders, implement a vanilla autoencoder in PyTorch to learn a $16$-dimensional latent manifold representation of the MNIST dataset (where each $28\times 28$ image is flattened and normalized into a feature vector $\mathbf x\in [0,1]^{784}$).
$\textbf{Solution}$:
In [ ]:
import torch
import torch.nn as nn


class VanillaAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_features=784, out_features=128),
            nn.ReLU(),
            nn.Linear(in_features=128, out_features=16),
        )
        self.decoder = nn.Sequential(
            nn.Linear(in_features=16, out_features=128),
            nn.ReLU(),
            nn.Linear(in_features=128, out_features=784),
            nn.Sigmoid()  # because MNIST pixel intensities will be normalized [0, 1]
        )

    def forward(self, x):
        # x is a batch of grayscale images with shape (batch_size, 1, 28, 28)
        batch_size = x.shape[0]
        x = x.reshape(batch_size, 784)  # or x.reshape(-1, 784)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        x_hat = x_hat.reshape(batch_size, 1, 28, 28)
        return x_hat
In [20]:
from torch.optim import Adam

mnist_autoencoder = VanillaAutoencoder()
loss_func = nn.BCELoss()  # rather than MSE b/c MNIST images will have [0, 1] normalized pixel intensities
optimizer = Adam(mnist_autoencoder.parameters(), lr=1e-3)
In [21]:
from datasets import load_dataset
from torchvision import transforms
from torch.utils.data import DataLoader

hf_mnist = load_dataset("mnist")
normalize_tensor = transforms.ToTensor()

def normalize_batch(batch):
    # Apply the [0, 1] normalization to every image in the current batch
    batch["image"] = [normalize_tensor(img) for img in batch["image"]]
    return batch

hf_mnist.set_transform(normalize_batch)
train_loader = DataLoader(dataset=hf_mnist["train"], batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=hf_mnist["test"], batch_size=256, shuffle=False)

num_epochs = 10
C_test = []
for epoch in range(num_epochs):
    mnist_autoencoder.train()
    for batch in train_loader:
        x = batch["image"]
        optimizer.zero_grad()
        x_hat = mnist_autoencoder(x)
        loss = loss_func(x_hat, x)
        loss.backward()
        optimizer.step()

    mnist_autoencoder.eval()
    epoch_loss = 0
    with torch.inference_mode():
        for batch in test_loader:
            x = batch["image"]
            x_hat = mnist_autoencoder(x)
            batch_loss = loss_func(x_hat, x)
            epoch_loss += batch_loss.item()
    # Calculate the average loss over all test batches
    avg_loss = epoch_loss / len(test_loader)
    C_test.append(avg_loss)
    print(f"Epoch [{epoch+1}/{num_epochs}] | Test Loss: {avg_loss:.4f}")
Epoch [1/10] | Test Loss: 0.1403
Epoch [2/10] | Test Loss: 0.1294
Epoch [3/10] | Test Loss: 0.1249
Epoch [4/10] | Test Loss: 0.1220
Epoch [5/10] | Test Loss: 0.1197
Epoch [6/10] | Test Loss: 0.1182
Epoch [7/10] | Test Loss: 0.1170
Epoch [8/10] | Test Loss: 0.1159
Epoch [9/10] | Test Loss: 0.1152
Epoch [10/10] | Test Loss: 0.1145
$\textbf{Problem}$: Following up on the above, seeing the test cost $C_{\text{test}}(\boldsymbol{\phi},\boldsymbol{\theta})$ decreasing as both $\boldsymbol{\phi},\boldsymbol{\theta}$ are being updated by gradient descent is one metric, but another test one can do is to directly inspect the geometry of the latent manifold $\mathbf R^{16}$. Specifically, define a linear interpolation function that takes $2$ distinct images $\mathbf x\neq\mathbf x’\in [0, 1]^{784}$ and, for $10$ discrete time steps $t=0, 0.1, 0.2,…,1$, returns the corresponding $10$ reconstructed images $\mathbf d_{\boldsymbol{\theta}}((1-t)\mathbf e_{\boldsymbol{\phi}}(\mathbf x)+t\mathbf e_{\boldsymbol{\phi}}(\mathbf x’))\in [0,1]^{784}$.
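One minimal way to implement such an interpolation with the VanillaAutoencoder defined above (the helper name and details here are illustrative):

def interpolate_digits(model, x_a, x_b, num_steps=10):
    # Linearly interpolate between the latent codes of two images and decode each step.
    # x_a, x_b: single MNIST images (any shape with 784 pixels, values in [0, 1]).
    model.eval()
    with torch.inference_mode():
        z_a = model.encoder(x_a.reshape(1, 784))
        z_b = model.encoder(x_b.reshape(1, 784))
        ts = torch.linspace(0, 1, num_steps)
        decoded = [model.decoder((1 - t) * z_a + t * z_b).reshape(28, 28) for t in ts]
    return decoded

Plotting the ten decoded frames side by side makes the blurry cross-fade discussed in the next problem easy to see.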
$\textbf{Problem}$: Hence, comment on the key shortcoming of the vanilla autoencoder architecture, and explain how a variational autoencoder architecture addresses this shortcoming.
$\textbf{Solution}$: The shortcoming of the vanilla autoencoder architecture is that the latent manifold representation of the data it learns looks like grouping each of the $10$ digits $0, 1, …, 9$ into topologically disconnected “islands” in the latent manifold. Thus, forcing the decoder $\mathbf d_{\boldsymbol{\theta}}$ to “swim” between $2$ such islands leads to a blurry interpolation (e.g. the $7$ is gradually turning off as the $2$ is gradually turning on, leading to something that awkwardly almost looks like a $3$).
Variational autoencoders (VAEs) address this shortcoming of vanilla autoencoders by injecting stochasticity into both the encoder and decoder, transforming them from $\textit{deterministic}$ functions into $\textit{random}$ vectors.
Instead of a deterministic encoder $\mathbf x\in\mathbf R^d\mapsto\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\in\mathbf R^{<d}$, VAEs use a $\textit{probabilistic}$ encoder composed of $2$ steps:
(Deterministic) Map each feature vector $\mathbf x\in\mathbf R^d\mapsto (\langle\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\rangle,\ln\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf x))\in\mathbf R^{2<d}$ to some Gaussian cloud $p_{\boldsymbol{\phi}}(\mathbf e|\mathbf x)=e^{-(\mathbf e-\langle\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\rangle)^T\sigma^{-2}_{\boldsymbol{\phi}}(\mathbf x)(\mathbf e-\langle\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\rangle)/2}/\det(\sqrt{2\pi}\sigma_{\boldsymbol{\phi}}(\mathbf x))$ in the latent manifold (diagonal covariance matrix $\sigma^2_{\boldsymbol{\phi}}(\mathbf x)=\text{diag}(\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf x))\in\mathbf R^{<d\times<d}$)
(Random) Sample a concrete latent vector $\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\in\mathbf R^{<d}$ from $p_{\boldsymbol{\phi}}(\mathbf e|\mathbf x)$. In order to ensure the gradient $\partial\mathbf e_{\boldsymbol{\phi}}(\mathbf x)/\partial\boldsymbol{\phi}$ exists, apply the trick of sampling a random vector $\boldsymbol{\varepsilon}\in\mathbf R^{<d}$ from a standard isotropic normal distribution $p(\boldsymbol{\varepsilon})=e^{-|\boldsymbol{\varepsilon}|^2/2}/(2\pi)^{<d/2}$ and hence $\textit{reparameterizing}$ $\mathbf e_{\boldsymbol{\phi}}(\mathbf x)=\langle\mathbf e_{\boldsymbol{\phi}}(\mathbf x)\rangle+\boldsymbol{\varepsilon}\odot\boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf x)$, treating $\boldsymbol{\varepsilon}$ as if it were a constant vector during backpropagation.
Similar to the probabilistic encoder, VAEs use a $\textit{probabilistic}$ decoder composed of $2$ analogous steps:
(Deterministic) Map each latent vector $\mathbf e\in\mathbf R^{<d}\mapsto d_{\boldsymbol{\theta}}(\mathbf e)\in\mathbf R^d$ to a Gaussian cloud back in the input manifold with fixed unit covariance $p_{\boldsymbol{\theta}}(\hat{\mathbf x}|\mathbf e)=e^{-|\hat{\mathbf x}-d_{\boldsymbol{\theta}}(\mathbf e)|^2/2}/(2\pi)^{d/2}$.
(Deterministic) Although one’s natural instinct might be to sample from $p_{\boldsymbol{\theta}}(\hat{\mathbf x}|\mathbf e)$ to generate an actual decoded output, it turns out empirically it’s better to just pluck out the mean $d_{\boldsymbol{\theta}}(\mathbf e)$ and call it a day (equivalently, to sample from $p_{\boldsymbol{\theta}}(\hat{\mathbf x}|\mathbf e)=\delta^d(\hat{\mathbf x}-d_{\boldsymbol{\theta}}(\mathbf e))$).
Finally, there’s one other aspect, specifically the C in PRAC-DTDT-ID, on which variational autoencoders differ from vanilla autoencoders, and which also explains the name “variational”. Recall that cost functions in ML often arise from maximum-likelihood estimation (or some variant like MAP). In this case, one would like to maximize the likelihood $p_{\boldsymbol{\theta}}(\hat{\mathbf x}):=p(\hat{\mathbf x}|\boldsymbol{\theta})$ that the $\textit{decoder}$ generates an output $\hat{\mathbf x}\in\mathbf R^d$ with respect to its parameters $\boldsymbol{\theta}$. One takes the log, reformulates the decoder’s generative process $p_{\boldsymbol{\theta}}(\hat{\mathbf x})=\int d^{<d}\mathbf e\, p(\mathbf e)p_{\boldsymbol{\theta}}(\hat{\mathbf x}|\mathbf e)$ in terms of the same standard normal latent prior $p(\mathbf e)=e^{-|\mathbf e|^2/2}/(2\pi)^{<d/2}$ that was used earlier when sampling $\boldsymbol{\varepsilon}$, replaces the intractable posterior over latents by the Gaussian encoder distribution $p_{\boldsymbol{\phi}}(\mathbf e|\hat{\mathbf x})$ via importance sampling, and applies Jensen’s inequality to the concave $\ln$ function to obtain an $\textit{evidence lower bound}$ (ELBO) on the log-likelihood consisting of distinct $\textit{reconstruction}$ and $\textit{regularization}$ contributions:
$$\ln p_{\boldsymbol{\theta}}(\hat{\mathbf x})\geq\langle\ln p_{\boldsymbol{\theta}}(\hat{\mathbf x}|\mathbf e)\rangle_{\mathbf e\sim p_{\boldsymbol{\phi}}(\mathbf e|\hat{\mathbf x})}-D_{\text{KL}}(p_{\boldsymbol{\phi}}(\mathbf e|\hat{\mathbf x})||p(\mathbf e))=:-L_{\text{ELBO}}(\boldsymbol{\phi},\boldsymbol{\theta},\hat{\mathbf x})$$
where the Kullback-Leibler divergence is:
$$D_{\text{KL}}(p_{\boldsymbol{\phi}}(\mathbf e|\hat{\mathbf x})||p(\mathbf e)):=\langle\ln p_{\boldsymbol{\phi}}(\mathbf e|\hat{\mathbf x})/p(\mathbf e)\rangle_{\mathbf e\sim p_{\boldsymbol{\phi}}(\mathbf e|\hat{\mathbf x})}=-\frac{n}{2}-\text{Tr}\ln\sigma_{\boldsymbol{\phi}}(\hat{\mathbf x})+\frac{|\langle\mathbf e_{\boldsymbol{\phi}}(\hat{\mathbf x})\rangle|^2+|\boldsymbol{\sigma}_{\boldsymbol{\phi}}(\hat{\mathbf x})|^2}{2}$$
(here $n$ denotes the latent dimension $<d$).
Then, similar to e.g. variational principles in quantum mechanics, the VAE cost function to be $\textit{minimized}$ is the empirical expectation:
$$C(\boldsymbol{\phi},\boldsymbol{\theta})=\frac{1}{N}\sum_{i=1}^NL_{\text{ELBO}}(\boldsymbol{\phi},\boldsymbol{\theta},\hat{\mathbf x}_i)$$
The probabilistic decoder $d_{\boldsymbol{\theta}}$ of a VAE is often used in isolation (i.e. discarding the probabilistic encoder) as a generative model by sampling from the latent manifold’s prior $p(\mathbf e)=e^{-|\mathbf e|^2/2}/(2\pi)^{<d/2}$ and simply forward passing $\mathbf e\in\mathbf R^{<d}\mapsto d_{\boldsymbol{\theta}}(\mathbf e)\in\mathbf R^d$.
$\textbf{Problem}$: Implement a variational autoencoder on the MNIST dataset with a $16$-dimensional latent manifold.
$\textbf{Solution}$:
In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from datasets import load_dataset
from torchvision import transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt


class VariationalAutoencoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder_shared = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
        )
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 784),
            nn.Sigmoid()  # constrain outputs to [0, 1]
        )

    def encode(self, x):
        h = self.encoder_shared(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        batch_size = x.shape[0]
        x_flat = x.view(batch_size, -1)
        mu, logvar = self.encode(x_flat)
        z = self.reparameterize(mu, logvar)
        x_hat_flat = self.decoder(z)
        x_hat = x_hat_flat.view(batch_size, 1, 28, 28)
        return x_hat, mu, logvar


def elbo_loss(x_hat, x, mu, logvar):
    BCE = F.binary_cross_entropy(x_hat, x, reduction='sum')
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD


hf_mnist = load_dataset("mnist")
to_tensor = transforms.ToTensor()

def format_batch(batch):
    batch["image"] = [to_tensor(img) for img in batch["image"]]
    return batch

hf_mnist.set_transform(format_batch)
train_loader = DataLoader(dataset=hf_mnist["train"], batch_size=128, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vae = VariationalAutoencoder(latent_dim=16).to(device)
optimizer = Adam(vae.parameters(), lr=1e-3)
num_epochs = 15

print("Constructing the Generative Universe...")
for epoch in range(num_epochs):
    vae.train()
    epoch_loss = 0
    for batch in train_loader:
        x = batch["image"].to(device)
        optimizer.zero_grad()
        x_hat, mu, logvar = vae(x)
        loss = elbo_loss(x_hat, x, mu, logvar)
        loss.backward()
        epoch_loss += loss.item()
        optimizer.step()
    avg_loss = epoch_loss / len(train_loader.dataset)
    print(f"Epoch [{epoch+1}/{num_epochs}] | Average ELBO Loss: {avg_loss:.4f}")

print("\nScaffolding removed. Hallucinating novel digits...")
vae.eval()
num_samples = 10
with torch.inference_mode():
    z_random = torch.randn(num_samples, 16).to(device)
    generated_flat = vae.decoder(z_random)
    generated_images = generated_flat.view(num_samples, 28, 28).cpu().numpy()

fig, axes = plt.subplots(1, num_samples, figsize=(15, 2))
for i in range(num_samples):
    axes[i].imshow(generated_images[i], cmap='gray')
    axes[i].axis('off')
plt.suptitle("Hallucinated Digits Sampled from N(0, I)", fontsize=16)
plt.tight_layout()
plt.show()
Constructing the Generative Universe...
Epoch [1/15] | Average ELBO Loss: 187.7829
Epoch [2/15] | Average ELBO Loss: 137.8783
Epoch [3/15] | Average ELBO Loss: 124.7550
Epoch [4/15] | Average ELBO Loss: 118.2380
Epoch [5/15] | Average ELBO Loss: 114.7054
Epoch [6/15] | Average ELBO Loss: 112.2805
Epoch [7/15] | Average ELBO Loss: 110.6472
Epoch [8/15] | Average ELBO Loss: 109.4060
Epoch [9/15] | Average ELBO Loss: 108.4367
Epoch [10/15] | Average ELBO Loss: 107.5782
Epoch [11/15] | Average ELBO Loss: 106.9144
Epoch [12/15] | Average ELBO Loss: 106.3025
Epoch [13/15] | Average ELBO Loss: 105.8472
Epoch [14/15] | Average ELBO Loss: 105.4330
Epoch [15/15] | Average ELBO Loss: 104.9902
Scaffolding removed. Hallucinating novel digits...
Problem: Give a broad sketch of the current state of the field of research in graph neural networks.
Solution:
Problem: Okay, so now explain what a graph neural network (GNN) actually is.
Solution: A GNN is basically any neural network whose input is any (undirected/directed/mixed/multi) graph (e.g. molecules, social networks, citation networks, etc.). In order to be sensible, the GNN output (whatever it is, e.g. a binary classifier for molecule toxicity) has to genuinely be an intrinsic function of the graph structure alone, and in particular not depend on any arbitrary choice of “ordering” with which one might index the graph vertices and edges (i.e. the output must be either permutation equivariant or permutation invariant depending on its nature, cf. tensors vs. tensor components in some basis).
Problem: Explain the subclass of GNNs known as message-passing neural networks (MPNNs).
Solution: An MPNN, being a certain category of GNNs, starts its life by taking as input some graph \((V,E)\). More precisely, this looks like some feature vector \(\mathbf x_v\) (e.g. mass, charge, atomic number for atoms) for each vertex \(v\in V\) and possibly also a feature vector \(\mathbf x_e\) (e.g. bond length, bond energy, etc.) for each edge \(e\in E\). The idea is that, in a manner similar to a head of self-attention, each vertex \(v\in V\) wants to update its current state \(\mathbf x_v\) into some new state \(\mathbf x’_v\) by soaking in context from its neighbours, and in an analogous manner each edge \(e\in E\) also wants to update its current state \(\mathbf x_e\mapsto\mathbf x’_e\) based on its “neighbours” (thus it’s not quite the same as self-attention in which a token doesn’t just look at its nearest neighbour tokens, but at all the tokens in the context). In the general MPNN framework, this can be roughly broken down into \(3\) conceptual steps:
Message phase: from a sender perspective, each vertex \(v\in V\) “broadcasts” a “personalized” message vector \(\mathbf m_{vv’}\) along the edge \((v,v’)\in E\) connecting it to a neighbouring vertex \(v’\in V\). This message vector \(\mathbf m_{vv’}\) is any (learnable) function of its current state \(\mathbf x_v\), the current state of the receiving neighbour vertex \(\mathbf x_{v’}\), and the current edge feature vector \(\mathbf x_{vv’}\) connecting them.
Aggregation phase: simultaneously, from a receiver perspective, each vertex \(v\in V\) receives the broadcasted signals from its neighbouring vertices. From this perspective, it then takes all the received message vectors and synthesizes them into a single “message summary” vector \(\mathbf m_v\) which in practice is any permutation invariant function of the message vectors \(\mathbf m_{vv’}\) it received from neighbouring vertices \(v’\in V\) (e.g. their average).
Update phase: Finally, the vertex \(v\in V\) updates its own current state \(\mathbf x_v\) to some new state \(\mathbf x’_v\) using another (learnable) function of its current state \(\mathbf x_v\) and the message summary \(\mathbf m_v\).
This \(3\)-step process represents a single forward pass through \(1\) message-passing layer; several composed together define an MPNN.
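To make this concrete, here is a minimal NumPy sketch (a hand-rolled toy, not any standard library API) of a single message-passing layer: the message and update functions are taken to be simple learnable affine maps followed by a ReLU, and mean-aggregation is used for permutation invariance. All names and dimensions below are illustrative assumptions.
import numpy as np

class MessagePassingLayer:
    def __init__(self, d_v, d_e, d_msg, seed=0):
        rng = np.random.default_rng(seed)
        # Message function: maps (x_v, x_v', x_vv') -> m_vv'
        self.W_msg = rng.normal(size=(d_msg, 2*d_v + d_e))
        # Update function: maps (x_v, m_v) -> x'_v
        self.W_upd = rng.normal(size=(d_v, d_v + d_msg))

    def forward(self, X, E, edges):
        # X: (num_vertices, d_v) vertex features; E: (num_edges, d_e) edge features
        # edges: list of (sender, receiver) index pairs
        num_vertices, d_v = X.shape
        d_msg = self.W_msg.shape[0]
        msg_sum = np.zeros((num_vertices, d_msg))
        msg_count = np.zeros(num_vertices)
        # Message phase: each sender v broadcasts a personalized message along (v, v')
        for e_idx, (v, v_prime) in enumerate(edges):
            z = np.concatenate([X[v], X[v_prime], E[e_idx]])
            m = np.maximum(0, self.W_msg @ z)  # ReLU(W [x_v, x_v', x_vv'])
            # Aggregation phase: receiver v' accumulates messages (mean is permutation invariant)
            msg_sum[v_prime] += m
            msg_count[v_prime] += 1
        msg_mean = msg_sum / np.maximum(msg_count, 1)[:, None]
        # Update phase: each vertex combines its current state with its message summary
        return np.maximum(0, np.concatenate([X, msg_mean], axis=1) @ self.W_upd.T)

# Toy usage: a 3-vertex path graph with 2 directed edges
X = np.random.randn(3, 4); E = np.random.randn(2, 2)
layer = MessagePassingLayer(d_v=4, d_e=2, d_msg=8)
print(layer.forward(X, E, edges=[(0, 1), (1, 2)]).shape)  # (3, 4)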
Problem: Now that the general framework of MPNN architectures has been defined, walk through the following specific examples of MPNN architectures:
Problem: Consider a Landau-Ginzburg statistical field theory involving a single real scalar field \(\phi(\mathbf x)\) for \(\mathbf x\in\mathbf R^d\) governed by the canonically normalized free energy density:
Explain what the \(+…\) means, explain which terms have temperature \(T\)-dependence, and explain for which such terms does that \(T\)-dependence actually matter?
Solution: The \(+…\) includes any terms (each with their own coupling constants) compatible with the golden trinity of constraints: locality, analyticity, and symmetry (e.g. a quartic \(g\phi^4\) coupling). The part of the free energy density \(\mathcal F\) before the \(+…\) should be compared to the Lagrangian density \(\mathcal L\) of Klein-Gordon field theory:
with \(\bar{\lambda}=\hbar/mc\) the reduced Compton wavelength playing a role analogous to the correlation length \(\xi\sim 1/\sqrt{|T-T_c|}\); indeed, this \(T\)-dependence in \(\xi=\xi(T)\) is (usually) the only \(T\)-dependence that matters, even though generically all the other coupling constants will also have \(T\)-dependence.
Problem: Define the (non-standard) notion of “theory space”.
Solution: Roughly speaking, “theory space” is the space of all Landau-Ginzburg statistical field theories \((\mathcal F,k^*)\) (notice it is defined not only by the free energy density \(\mathcal F\) but also the baggage of the UV cutoff \(k^*\); it is an effective field theory). The \(\mathcal F\) part can equivalently be parameterized as a countably \(\infty\)-tuple \((\xi,g,…)\) of the LAS-permitted coupling constants in \(\mathcal F\).
Problem: In broad strokes, describe the sequence of \(3\) steps that comprise a \(\mathbf k\)-space \(\zeta\)-renormalization semigroup transformation from one effective Landau-Ginzburg statistical field theory \((\mathcal F,k^*)\mapsto (\mathcal F_{\zeta},k^*)\) to another with the same UV cutoff \(k^*\) but a new Wilsonian effective free energy \(\mathcal F_{\zeta}\).
Solution: For \(\zeta\in [1,\infty)\), the corresponding \(\mathbf k\)-space \(\zeta\)-RG transformation of \((\mathcal F,k^*)\) is given by the \(3\)-step recipe:
Coarse-graining \(k^*\mapsto k^*/\zeta\) (blocking in real space/integrating out shells in momentum space)
Rescaling \(\mathbf k’:=\zeta\mathbf k\) to recover the original UV cutoff \(k^*/\zeta\mapsto k^*\) (this leads to a reciprocal “zooming out” of space \(\mathbf x\mapsto\mathbf x/\zeta\)).
Rescaling fields \(\phi’:=\zeta^{\Delta}\phi\) to make \(\mathcal F_{\zeta}\) canonically normalized with respect to \(\mathcal F\).
Problem: Consider an effective Landau-Ginzburg statistical field theory of a single real scalar field \(\phi(\mathbf x)\in\mathbf R\) with \(\mathbf x\in\mathbf R^d\) whose Fourier transform \(\phi_{\mathbf k}=\int d^d\mathbf x e^{-i\mathbf k\cdot\mathbf x}\phi(\mathbf x)\) is supported only on a ball of radius \(k^*\) (the theory’s UV cutoff). The free energy density corresponds to a free (no pun intended) field:
Perform a \(\mathbf k\)-space \(\zeta\)-renormalization of this theory to find the corresponding Wilsonian effective free energy density \(\mathcal F_{\zeta}\).
Solution: Work with the free energy \(F=\int d^d\mathbf x\mathcal F\) itself instead of just its density \(\mathcal F\):
Partition the support \(|\mathbf k|<k^*\) of \(\phi_{\mathbf k}\) into \(|\mathbf k|<k^*/\zeta\) and \(k^*/\zeta<|\mathbf k|<k^*\) and based on this \(\zeta\), piecewise decompose \(\phi_{\mathbf k}=\phi^{<}_{\mathbf k}+\phi^{>}_{\mathbf k}\). Then one has an instance of the freshman’s dream (thanks to the disjoint supports of \(\phi^{<}_{\mathbf k}\) and \(\phi^{>}_{\mathbf k}\)):
where the measures are \(\mathcal D\phi^{<}_{\zeta}=\prod_{|\mathbf k|<k^*/\zeta}d\phi^{<}_{\mathbf k}\) and \(\mathcal D\phi^{>}_{\zeta}=\prod_{k^*/\zeta<|\mathbf k|<k^*}d\phi^{>}_{\mathbf k}\). The constant \(Z^>_{\zeta}\) doesn’t affect the physics, being absorbed as a constant shift into the Wilsonian effective free energy \(F_{\zeta}[\phi^{<}_{\zeta}]=F[\phi^{<}_{\zeta}]-k_BT\ln Z^{>}_{\zeta}\).
2. Rescaling \(\mathbf k’:=\zeta\mathbf k\), the Wilsonian effective free energy becomes:
so in order to canonically normalize the gradient term, one requires \(\Delta=-(d+2)/2\). This leads to the desired Wilsonian effective free energy density:
Thus, by construction the gradient coupling is marginal (i.e. \(\zeta\)-independent) while the quadratic coupling is relevant because \(\zeta^2\to\infty\) as \(\zeta\to\infty\).
$\textbf{Problem}$: Write functions that take an arbitrary grayscale image and convolve it with the horizontal and vertical Sobel edge detection kernels. Apply both functions to a grayscale image of your choice.
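A minimal sketch of what such functions could look like (assuming the sobel_horizontal, sobel_vertical, and random_grayscale_img names reused in the padded example below; the random $10\times 10$ image is just a stand-in for an image of one's choice):
import numpy as np
import matplotlib.pyplot as plt

# 3x3 Sobel kernels (horizontal/vertical gradient detectors)
SOBEL_H = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])
SOBEL_V = SOBEL_H.T

def _valid_convolve2d(img, kernel):
    # "Valid" (unpadded) 2D convolution of an (n, m) image with a small square kernel
    # NB: like most ML "convolutions", this is really a cross-correlation (the kernel is not flipped)
    n, m = img.shape
    k = kernel.shape[0]
    out = np.zeros((n - k + 1, m - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+k, j:j+k] * kernel)
    return out

def sobel_horizontal(img):
    return _valid_convolve2d(img, SOBEL_H)

def sobel_vertical(img):
    return _valid_convolve2d(img, SOBEL_V)

# Example: a random 10x10 grayscale image (so that the padded version below is 12x12)
np.random.seed(0)
random_grayscale_img = np.random.randint(0, 256, size=(10, 10))
plt.imshow(sobel_horizontal(random_grayscale_img), cmap="gray"); plt.show()
plt.imshow(sobel_vertical(random_grayscale_img), cmap="gray"); plt.show()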
$\textbf{Problem}$: Above, the $3\times 3$ Sobel edge detection kernels were convolved with an $n\times m$
grayscale image, leading to an $(n-2)\times (m-2)$ edge map. However, although for $n, m \gg 1$ this $-2$ shrinking effect
is not particularly noticeable, over many layers of convolutions it can compound undesirably. Thus, one solution is to use $\textit{padding}$ to prevent this shrinking effect. Show how this works.
$\textbf{Solution}$: (by the way, the above unpadded version is sometimes called a $\textit{valid}$ convolution,
whereas the padded version is sometimes called a $\textit{same}$ convolution).
In [2]:
padded_random_grayscale_img = np.pad(random_grayscale_img, ((1,1),(1,1)), 'constant')
print("Padded random grayscale image shape: ", padded_random_grayscale_img.shape)
plt.imshow(padded_random_grayscale_img, cmap='gray')
plt.show()
print("Horizontal Sobel")
plt.imshow(sobel_horizontal(padded_random_grayscale_img), cmap="gray")
plt.show()
print("Vertical Sobel")
plt.imshow(sobel_vertical(padded_random_grayscale_img), cmap="gray")
plt.show()
Padded random grayscale image shape: (12, 12)
Horizontal Sobel
Vertical Sobel
$\textbf{Problem}$: Generate a random $15\times 15$ grayscale image, a random $5\times 5$ kernel, and perform a $\textit{stride}$ convolution of the image with the kernel using a stride size of $2$.
$\textbf{Solution}$: The dimension of the output image in this (unpadded) case is $(15-5)/2+1=6$.
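A minimal sketch (hypothetical variable names; the last line verifies the $(15-5)/2+1=6$ output dimension):
np.random.seed(1)
img_15 = np.random.randint(0, 256, size=(15, 15))
kernel_5 = np.random.randn(5, 5)

def strided_convolve2d(img, kernel, stride):
    # Unpadded convolution with a square kernel and a given stride
    n, m = img.shape
    k = kernel.shape[0]
    out_n = (n - k)//stride + 1
    out_m = (m - k)//stride + 1
    out = np.zeros((out_n, out_m))
    for i in range(out_n):
        for j in range(out_m):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    return out

print(strided_convolve2d(img_15, kernel_5, stride=2).shape)  # (6, 6)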
$\textbf{Problem}$: Suppose one has a $28\times 28\times 192$ volume at some point inside a convolutional neural network. Explain how an unpadded, unit-stride convolution (which in this case is nevertheless spatially a $\textit{same}$ convolution) can be performed on this volume to obtain an output volume of dimensions $28\times 28\times 32$.
$\textbf{Solution}$: Use $32$ filters of size $1\times 1\times 192$ with stride $1$ and no padding; this is a common technique to reduce channel number while preserving spatial dimensions.
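As a quick sanity check on the cost: such a layer has only $32\times(1\times 1\times 192+1)=6176$ parameters, and each of the $28\times 28\times 32$ output entries is just a $192$-dimensional dot product taken across channels at a single spatial location (plus a bias).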
$\textbf{Problem}$: Implement simplified versions of the LeNet-5, AlexNet, and VGG-16 convolutional neural networks with randomly initialized weights and biases (here simplified means that we will not implement batch normalization, dropout, or any activation functions). Run a forward pass through each without bothering to train the parameters.
$\textbf{Solution}$:
In [18]:
class ConvLayer():
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        # Initialize weights and biases
        self.weights = np.random.randn(out_channels, in_channels, kernel_size, kernel_size)
        self.biases = np.random.randn(out_channels)

    def forward(self, x):
        # x shape: (batch_size, in_channels, height, width)
        batch_size, in_channels, height, width = x.shape
        # Calculate output dimensions
        out_height = (height + 2*self.padding - self.kernel_size)//self.stride + 1
        out_width = (width + 2*self.padding - self.kernel_size)//self.stride + 1
        # Initialize output
        out = np.zeros((batch_size, self.out_channels, out_height, out_width))
        # Apply padding if needed
        if self.padding > 0:
            x = np.pad(x, ((0,0), (0,0), (self.padding,self.padding), (self.padding,self.padding)), 'constant')
        # Convolution operation
        for b in range(batch_size):
            for c in range(self.out_channels):
                for h in range(out_height):
                    for w in range(out_width):
                        # Extract the receptive field
                        receptive_field = x[b, :, h*self.stride:h*self.stride+self.kernel_size, w*self.stride:w*self.stride+self.kernel_size]
                        # Perform convolution
                        out[b, c, h, w] = np.sum(receptive_field*self.weights[c]) + self.biases[c]
        return out

class PoolingLayer():
    def __init__(self, kernel_size, stride=2, mode="max"):
        self.kernel_size = kernel_size
        self.stride = stride
        self.mode = mode

    def forward(self, x):
        # x shape: (batch_size, channels, height, width)
        batch_size, channels, height, width = x.shape
        # Calculate output dimensions
        out_height = height//self.stride
        out_width = width//self.stride
        # Initialize output
        out = np.zeros((batch_size, channels, out_height, out_width))
        # Max pooling operation
        for b in range(batch_size):
            for c in range(channels):
                for h in range(out_height):
                    for w in range(out_width):
                        # Extract the receptive field
                        receptive_field = x[b, c, h*self.stride:h*self.stride+self.kernel_size, w*self.stride:w*self.stride+self.kernel_size]
                        # Perform max pooling
                        if self.mode == "max":
                            out[b, c, h, w] = np.max(receptive_field)
                        elif self.mode == "avg":
                            out[b, c, h, w] = np.mean(receptive_field)
        return out

class FCLayer():
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features
        # Initialize weights and biases
        self.weights = np.random.randn(out_features, in_features)
        self.biases = np.random.randn(out_features)

    def forward(self, x):
        # x shape: (batch_size, in_features)
        batch_size = x.shape[0]
        # Initialize output
        out = np.zeros((batch_size, self.out_features))
        # Perform matrix multiplication
        for b in range(batch_size):
            out[b] = np.dot(self.weights, x[b]) + self.biases
        return out
In [19]:
# Le-Net 5 Implementation
num_batches = 3
np.random.seed(42)
batch_rand_gray_imgs = np.random.randint(0, 256, size=(num_batches, 1, 32, 32))

# Create layers
layer1 = ConvLayer(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=0)
layer2 = PoolingLayer(kernel_size=2, stride=2, mode="avg")
layer3 = ConvLayer(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
layer4 = PoolingLayer(kernel_size=2, stride=2, mode="avg")
layer5 = FCLayer(in_features=400, out_features=120)  # both fully connected layers assume the input image dimensions are (h,w,c) = (32, 32, 1)
layer6 = FCLayer(in_features=120, out_features=84)
layer7 = FCLayer(in_features=84, out_features=1)  # y_hat is int b/w 0 and 9 to classify handwritten digit, nowadays prefer softmax

# Forward pass
output = layer1.forward(batch_rand_gray_imgs)
output = layer2.forward(output)
output = layer3.forward(output)
output = layer4.forward(output)
output = output.reshape(num_batches, -1)
output = layer5.forward(output)
output = layer6.forward(output)
output = layer7.forward(output)
print(output)
$\textbf{Problem}$: Why is a CNN a better choice of architecture than an MLP for computer vision tasks?
$\textbf{Solution}$: As a simple example, given a convolutional layer of a CNN in which a square $n\times n\times c$ image is convolved (without striding or padding) with $N$ kernels each of size $k\times k\times c$, producing an output volume of dimensions $(n-k+1)\times (n-k+1)\times N$, the number of parameters is $N(ck^2+1)$ (each of the $N$ kernels has $ck^2$ weights and a single bias term). By contrast, if one were to instead implement a fully connected layer of dense connections between the flattened input image of size $cn^2$ and the flattened output volume of size $N(n-k+1)^2$, the number of weights alone would be $cn^2N(n-k+1)^2$ (much larger!). Essentially, this is due to the $\textit{parameter sharing}$ property of CNNs together with the sparsity of their connections (each output unit only sees a small receptive field). Finally, the convolutional structure is better at respecting the $\textit{translational}$ structure of the image (unlike the MLP, where pixels are just flattened arbitrarily, losing their spatial information).
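To attach some (hypothetical) numbers to this: for $n=32$, $c=3$, $k=5$, $N=8$, the convolutional layer has $8\times(3\times 5^2+1)=608$ parameters, whereas the dense alternative would need roughly $3\times 32^2\times 8\times 28^2\approx 1.9\times 10^7$ weights.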
$\textbf{Problem}$: Explain what a $\textit{residual block}$ is and how several residual blocks may be composed together to define a $\textit{residual neural network}$ (ResNet).
$\textbf{Solution}$: A residual block may be viewed as a simple additive perturbation to a standard $2$-layer feedforward neural network. Recall that, given an input feature vector $\mathbf x$, a feedforward layer $(W,\mathbf b)$ may be viewed as the composition of a $\textit{linear layer}$ $\mathbf x\mapsto W\mathbf x + \mathbf b$ with a nonlinear activation function $\boldsymbol{\sigma}$. Thus, the overall action of a $2$-layer feedforward looks something like:
$$\mathbf x\mapsto\boldsymbol{\sigma}(W_2\boldsymbol{\sigma}(W_1\mathbf x + \mathbf b_1) + \mathbf b_2)$$
By contrast, in a residual block, the overall action looks like:
$$\mathbf x\mapsto\boldsymbol{\sigma}(W_2\boldsymbol{\sigma}(W_1\mathbf x + \mathbf b_1) + \mathbf b_2 + \mathbf x)$$
(assuming compatible dimensions for vector addition with $\mathbf x$). Thus, although the $1^{\text{st}}$ internal layer of the residual block is no different from a standard feedforward, the difference lies in the insertion of a so-called $\textit{skip connection}$ $+\mathbf x$ in between the linear layer and the activation function of the $2^{\text{nd}}$ layer. A ResNet is then literally just a bunch of residual blocks composed together!
$\textbf{Problem}$: Give some hand-wavy intuition/justification for what makes ResNets interesting/useful.
$\textbf{Solution}$: Consider the following $\textit{thought experiment}$. A standard feedforward neural network classifier with $20$ layers, post-training, has a misclassification rate of $15\%$ $\textit{on the training set}$ itself. Clearly, this seems to be a bias/underfitting problem rather than a variance/overfitting problem because it’s on the $\textit{training set}$. So one reasons that, to make the network more expressive and improve its performance on the training set, one might decide to append another $36$ feedforward layers on top of the $20$ layers already there, thus creating a $56$-layer feedforward. Theoretically, it’s manifest that the optimal training set error can only get lower because one could simply have the first $20$ layers doing exactly what they were doing before, and then have the remaining $36$ layers implement an identity function $\mathbf x\mapsto\mathbf x$. In practice however, it turns out that, paradoxically, training set error actually increases!
The problem is that it turns out to be surprisingly delicate to learn an identity mapping $\mathbf x\mapsto\mathbf x$ due to exploding/vanishing gradient problems during cost function minimization. So rather than struggling so hard to learn the identity, why not redesign the architecture such that the identity $\mathbf x\mapsto\mathbf x$ is the $\textit{default}$ mapping, and then merely learn whatever perturbation (a.k.a. $\textit{residual}$, hence the network’s name) $\boldsymbol{\varepsilon}(\mathbf x)$ to this identity mapping is needed to learn the actual underlying map of interest $\mathbf x\mapsto \mathbf x+\boldsymbol{\varepsilon}(\mathbf x)$. Roughly speaking, in the earlier notation, one has parameterized the residual by $\boldsymbol{\varepsilon}(\mathbf x)=W_2\boldsymbol{\sigma}(W_1\mathbf x + \mathbf b_1) + \mathbf b_2$, assuming the outer activation function $\boldsymbol{\sigma}$ may be neglected (or if it happens to be a ReLU, this argument is even more appealing!). Notice then that (loosely) $W_1=W_2=\mathbf b_1=\mathbf b_2=\mathbf 0$ gives $\boldsymbol{\varepsilon}(\mathbf x)=\mathbf 0$, hence learning an exact identity map $\mathbf x\mapsto\mathbf x$. But it’s very easy to coerce weights and biases towards $\mathbf 0$ using standard regularization techniques.
In [33]:
# ResNets were first used in computer vision, so the W_1, W_2 matrices above really correspond to convolutional layers in a CNN structure
class ResBlock():
    def __init__(self, in_channels, out_channels, stride=1):
        self.conv1 = ConvLayer(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.conv2 = ConvLayer(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        # Identity x path (Projection Shortcut if dimensions change)
        self.projection = None
        if stride != 1 or in_channels != out_channels:
            # 1x1 convolution to match dimensions
            self.projection = ConvLayer(in_channels, out_channels, kernel_size=1, stride=stride, padding=0)

    def relu(self, x):
        return np.maximum(0, x)

    def forward(self, x):
        identity = x
        # 1. First transformation + activation
        out = self.conv1.forward(x)
        out = self.relu(out)
        # 2. Second transformation
        out = self.conv2.forward(out)
        # 3. Match dimensions of identity if needed
        if self.projection is not None:
            identity = self.projection.forward(x)
        # 4. Skip Connection Addition
        out += identity
        # 5. Final activation
        out = self.relu(out)
        return out
$\textbf{Problem}$: Just like a ResNet is made from composing a sequence of residual blocks together, an inception neural network (InceptionNet) is made from composing a sequence of inception blocks together (mostly true; in practice there may also be some other side branches, pooling layers, etc.). Explain the architecture of an inception block.
$\textbf{Solution}$: Whereas typical CNN architectures require one to choose a filter size, an inception block asks “why not try several filter sizes?”. To this effect, a naive inception block implementation might look something like:
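Since the original diagram is omitted here, the following is just a rough sketch built from the hand-rolled ConvLayer class above: parallel $1\times 1$, $3\times 3$ and $5\times 5$ convolutions, each padded to preserve the spatial dimensions, concatenated along the channel axis (a real inception block also carries a pooling branch).
class NaiveInceptionBlock():
    def __init__(self, in_channels, out_1x1, out_3x3, out_5x5):
        self.branch1 = ConvLayer(in_channels, out_1x1, kernel_size=1, stride=1, padding=0)
        self.branch3 = ConvLayer(in_channels, out_3x3, kernel_size=3, stride=1, padding=1)
        self.branch5 = ConvLayer(in_channels, out_5x5, kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        # Each branch preserves height/width, so the outputs can be stacked channel-wise
        return np.concatenate([self.branch1.forward(x),
                               self.branch3.forward(x),
                               self.branch5.forward(x)], axis=1)

# e.g. a (1, 3, 8, 8) input volume becomes a (1, 4+8+4, 8, 8) output volume
block = NaiveInceptionBlock(in_channels=3, out_1x1=4, out_3x3=8, out_5x5=4)
print(block.forward(np.random.randn(1, 3, 8, 8)).shape)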
$\textbf{Problem}$: Needless to say, one of the drawbacks to asking “why not try several filter sizes?” is that, by having to literally try several filter sizes, the naive inception block implementation above can get computationally expensive in terms of the number of multiplications and additions required per forward pass. Explain how the introduction of $1\times 1$ $\textit{bottleneck layers}$ can help reduce computational burden without sacrificing too much accuracy.
$\textbf{Solution}$: The idea is to insert $1\times 1$ convolutional layers before the more computationally expensive $3\times 3$ and $5\times 5$ convolutions. These $1\times 1$ convolutions act as “bottlenecks” by reducing the number of channels (and thus the computational cost) before the larger convolutions are applied. For example, instead of applying a $5\times 5$ convolution directly to 256 input channels, we can first apply a $1\times 1$ convolution to reduce the channels to, say, 64, and then apply the $5\times 5$ convolution to the reduced-channel tensor. This significantly reduces the number of parameters and computations while maintaining most of the representational power of the larger filters.
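To attach some (hypothetical) numbers to this: applying $32$ filters of size $5\times 5\times 256$ directly to an $H\times W\times 256$ volume costs about $H\cdot W\cdot 32\cdot 5^2\cdot 256\approx 2.0\times 10^5\,HW$ multiplications, whereas first bottlenecking to $64$ channels costs about $H\cdot W\cdot 64\cdot 256 + H\cdot W\cdot 32\cdot 5^2\cdot 64\approx 6.8\times 10^4\,HW$, roughly a $3\times$ saving.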
$\textbf{Problem}$: Explain how the MobileNet architecture replaces the standard convolutional layer with a depthwise separable convolutional layer.
$\textbf{Solution}$: The idea is to break the standard convolutional layer into two parts: a $\textit{depthwise convolutional layer}$ followed by a $\textit{pointwise convolutional layer}$. The depthwise convolution applies a single $k\times k$ spatial filter to each input channel separately (so it mixes information across height and width but not across channels), while the pointwise convolution is a $1\times 1$ convolution that mixes information across channels at each spatial location (but not across height and width). Composed together, the two approximate a standard convolution at a fraction of the cost.
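As a rough cost comparison (per output position, with $c_{\text{in}}$ input channels, $c_{\text{out}}$ output channels, and $k\times k$ spatial filters): a standard convolution performs about $k^2 c_{\text{in}} c_{\text{out}}$ multiplications, while the depthwise ($k^2 c_{\text{in}}$) plus pointwise ($c_{\text{in}} c_{\text{out}}$) pair performs about $k^2 c_{\text{in}} + c_{\text{in}} c_{\text{out}}$, a reduction by a factor of roughly $1/c_{\text{out}} + 1/k^2$.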
Problem: Deduce the Hamilton-Jacobi equation of classical mechanics.
Solution: Instead of viewing the action \(S=S[\mathbf x(t)]\) as a functional of the particle’s trajectory \(\mathbf x(t)\), it can be viewed more simply as a scalar field \(S(\mathbf x,t)\) in which the initial point in spacetime \((t_0,\mathbf x_0)\) is fixed and one simply takes the on-shell trajectory from \((t_0,\mathbf x_0)\) to \((t,\mathbf x)\). Then the total differential \(dS=\mathbf p\cdot d\mathbf x\) (follows from the usual Noetherian calculation) so in particular:
Intuitively, this is saying that the particle moves in a direction (the direction of the momentum \(\mathbf p\)) orthogonal to the contour surfaces of the action field \(S\), i.e. such isosurfaces can be viewed as “wavefronts”. Then the total time derivative is:
\[\dot S=L\]
But the chain rule gives \(\dot S=\frac{\partial S}{\partial t}+\frac{\partial S}{\partial\mathbf x}\cdot\dot{\mathbf x}=\frac{\partial S}{\partial t}+\mathbf p\cdot\dot{\mathbf x}\). Thus, isolating for \(H=\mathbf p\cdot\dot{\mathbf x}-L\) yields the Hamilton-Jacobi nonlinear \(1^{\text{st}}\)-order PDE for \(S(\mathbf x,t)\):
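Explicitly, substituting \(\mathbf p=\partial S/\partial\mathbf x\) into the Hamiltonian:
\[\frac{\partial S}{\partial t}+H\left(\mathbf x,\frac{\partial S}{\partial\mathbf x},t\right)=0\]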
Problem: When \(\partial H/\partial t=0\), the Hamiltonian is conserved with energy \(H=E\), so this motivates the additive separation of variables, \(S(\mathbf x,t):=S_0(\mathbf x)-Et\) for some constant \(E\). What does the Hamilton-Jacobi equation simplify to in this case? For a single non-relativistic particle of mass \(m\) moving in a potential \(V(\mathbf x)\), what does this look like? What about in \(1\) dimension?
and in \(1\) dimension is integrable to the explicit solution:
\[S_0(x)=\pm\int ^xdx’\sqrt{2m(E-V(x’))}\]
In particular, the usual trajectory \(x(t)\) can be obtained by treating \(S=S_0(x;E)-Et\) as a family of solutions parameterized by the energy \(E\) (e.g. by setting \(\partial S/\partial E\) equal to a constant); this works because \(S\) can be viewed as a particular generating function of a canonical transformation \((\mathbf x,\mathbf p,H)\mapsto (\mathbf x’,\mathbf p’,H’)\) in which the “boosted” Hamiltonian vanishes \(H’=0\).
Problem: Above, the static field \(S_0(\mathbf x)\) was introduced to simplify the Hamilton-Jacobi equation when the energy \(E\) was conserved. However, if one pulls back to the level of functionals rather than fields, one can define an analogous abbreviated action functional \(S_0[\mathbf x]\) which depends only on the path \(\mathbf x\) taken rather than the trajectory \(\mathbf x(t)\). Define \(S_0[\mathbf x]\), and moreover show that when the energy \(E\) is conserved, the on-shell path is a stationary point of \(S_0\) (this is called Maupertuis’s principle).
Solution: The abbreviated action for a single particle of mass \(m\) and (non-relativistic) energy \(E=|\mathbf p|^2/2m +V(\mathbf x)\) is:
(modifying this to \(S_0[\mathbf x]:=\int ds |\mathbf p|=\int ds\sqrt{2(E-V(\mathbf x))}\) allows for interpretation as an \(N\)-particle system in configuration space \(\mathbf x\in\mathbf R^{3N}\) with the Riemannian “mass metric” \(ds^2=m_1|d\mathbf x_1|^2+…+m_N|d\mathbf x_N|^2\)).
To find the stationary paths of \(S_0[\mathbf x]\) subject to the constraint \(H(\mathbf x,\mathbf p)=E\), one can implement a Lagrange multiplier \(\gamma(\tau)\) to perform unconstrained extremization of:
provided the Lagrange multiplier \(\gamma=dt/d\tau\) encodes reparameterization invariance; with this choice it’s clear that the integrand in the functional \(S\) was nothing more than the Lagrangian \(L=\mathbf p\cdot\dot{\mathbf x}-H\) (plus an unimportant constant \(E\)) so Maupertuis’s principle reduces to the usual Hamilton’s principle.
Problem: What does Fermat’s principle in ray optics assert? Hence, derive the ray equation.
Solution: The time functional \(T=T[\mathbf x(s)]\) of a ray trajectory \(\mathbf x(s)\) is stationary on-shell. That is:
\[cT[\mathbf x(s)]=\int ds n(\mathbf x(s))\]
This is reparameterization invariant, since one can arbitrarily parameterize \(\mathbf x=\mathbf x(t)\) and replace \(ds=dt|\dot{\mathbf x}|\). The corresponding Euler-Lagrange equations are:
Problem: Starting from an arbitrary Cartesian component \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) of either the \(\mathbf E\) or \(\mathbf B\) fields of a light wave (here \(\omega=ck_0\) with \(k_0=2\pi/\lambda_0\) is the free space wavenumber), make the eikonal approximation to the dispersionless wave equation obeyed by \(\psi\) in order to obtain the (scalar) eikonal equation. By defining light rays as the integral curves of the gradient of the eikonal field \(cT(\mathbf x)\) (a kind of local optical path length), reproduce the vector eikonal equation from Fermat’s principle above.
Solution: The ansatz \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) is easy to justify; the \(e^{-i\omega t}\) is just a Fourier transform factor that reduces the wave equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=\frac{n^2}{c^2}\frac{\partial^2\psi}{\partial t^2}\) to a Helmholtz equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=-n^2k_0^2\psi\). The remaining piece is just a polar parameterization of an arbitrary \(\mathbf C\)-valued spatial field \(\psi_0(\mathbf x)e^{ik_0cT(\mathbf x)}\). One obtains:
The eikonal approximation amounts to taking the ray optics limit \(k_0\to\infty\) (in practice, the wavelength \(2\pi/k_0\) has to be much shorter than all other length scales), and yields the (scalar) eikonal equation:
Problem: Use Hamilton’s optics-mechanics analogy to solve the brachistochrone problem (this was how Johann Bernoulli originally solved it).
Solution: By energy conservation, the speed of the particle at distance \(y>0\) below its initial dropping height is \(v=\sqrt{2gy}\). By Fermat’s principle, minimizing the time functional then amounts to treating the particle as a light ray with \(n(\mathbf x)=c/v(y)\). So the question becomes how do light rays bend in a horizontally stratified medium with \(n(y)\propto y^{-1/2}\)? The answer is given by the ray equations:
The horizontal component expresses Snell’s law since \(dx/ds=\sin\theta\) (it expresses momentum conservation along the homogeneous \(\partial n/\partial x=0\) direction). Using the tangent vector constraint \(ds^2=dx^2+dy^2\) gives the ODE of a cycloid:
\[\frac{dy}{dx}=\sqrt{\frac{\text{const}}{y}-1}\]
(the vertical component ODE has an analytical solution \(y(s)=-s^2/8R+s\) which is contained in the cycloid, so is redundant information).
Problem: How did Hamilton’s optics-mechanics analogy inspire Schrodinger to propose his famous equations of quantum mechanics?
Solution: Essentially, Schrodinger asked: ray optics is to wave optics as classical mechanics is to what? In other words, one imagines there exists a wave theory of particles/matter and one would like to take the “inverse eikonal limit” of classical mechanics (here, “inverse eikonal limit” is usually called quantization):
Just as light rays propagate perpendicular to their phase fronts (i.e. along the gradient of the eikonal):
Already this suggests that the action should have some phase interpretation. More precisely, it should be the phase of the particle’s de Broglie wave in units of \(\hbar\). It’s also not obvious that particles should be described by a scalar wave field rather than e.g. the electromagnetic vector wave fields of light. Schrodinger simply guessed it looked like the equivalent of “scalar diffraction theory” with a single wavefunction \(\psi(\mathbf x,t)=\psi_0(\mathbf x,t)e^{iS(\mathbf x,t)/\hbar}\). This gives the Madelung equations of quantum hydrodynamics, one of which is just a continuity equation (giving credence to the Born interpretation of \(|\psi|^2\)) and the other is a quantum Hamilton-Jacobi equation which in the limit \(\hbar\to 0\) (analogous to the eikonal limit \(\lambda_0\to 0\)) simplifies to the classical Hamilton-Jacobi equation.
Problem: Define the signature of a matrix. Hence, state and prove Sylvester’s law of inertia.
Solution: The signature of an \(n\times n\) matrix \(A\) is a \(3\)-tuple \((n_+,n_-,n_0)\) where \(n_+\) is the number of positive eigenvalues of \(A\) (including multiplicity), \(n_-\) is the number of negative eigenvalues of \(A\) (including multiplicity), and \(n_0=\text{dim}\ker A\) is the multiplicity of the zero eigenvalue; thus, \(n_++n_-+n_0=n\).
Let \(A,B\) be real symmetric matrices. Then Sylvester’s law of inertia asserts that \(A\) and \(B\) are congruent matrices iff they have the same signature (which sometimes is called “inertia” because of this invariance under congruence, hence the name).
Proof: It’s easy to see that the nullity \(n_0\) is preserved by congruence transformations. If one can show that \(n_+\) is also preserved, then it follows that \(n_-\) is preserved too by virtue of \(n_++n_-+n_0=n\). To show this, the idea is to prove by contradiction: assuming \(n_+\) is not preserved, dimension counting would imply a non-zero vector living in the intersection of the subspace spanned by the positive-eigenvalue eigenvectors of \(A\) and the congruence-transformed subspace spanned by the non-positive-eigenvalue eigenvectors of \(B\), on which the corresponding quadratic form would have to be simultaneously positive and non-positive (a contradiction).
Since any real, symmetric matrix \(A\) is isomorphic to a real quadratic form \(Q(\mathbf x):=\mathbf x^TA\mathbf x\), the concepts of signature and Sylvester’s law of inertia can also be reformulated in the language of quadratic forms rather than real symmetric matrices.
Problem: Let \(X\) be a smooth \(n\)-manifold. Explain what it means to place the additional structure of a pseudo-Riemannian metric \(g\) on \(X\). Then explain how Riemannian and Lorentzian geometry are special cases of pseudo-Riemannian geometry.
Solution: A Riemannian metric \(g\) on \(X\) is a type \((0,2)\) tensor field that defines an inner product \(g_x:T_x(X)^2\to\mathbf R\) at the tangent space \(T_x(X)\) of each point \(x\in X\). The “tensor field” part is sometimes rephrased as saying that \(g_x\) is a bilinear form. Moreover, to flesh out the usual axioms of a real inner product space, the Riemannian metric tensor \(g_x\) at each \(x\in X\) must be symmetric \(g_x(v_x,v’_x)=g_x(v’_x,v_x)\) and positive-definite \(v_x\neq 0\Rightarrow g_x(v_x,v_x)>0\).
A pseudo-Riemannian metric relaxes the positive-definite requirement of a Riemannian metric, instead merely requiring non-degeneracy (i.e. at each point \(x\in X\), the zero vector \(0\) is the only vector orthogonal \(g_x(0,v_x)=0\) to all \(v_x\in T_x(X)\)). More concisely, what this is saying is that \(n_0=0\), so the signature of a pseudo-Riemannian metric may be thought of as a pair \((n_+,n_-)\).
Thus, Riemannian metrics are the special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n,0)\). Meanwhile, Lorentzian metrics are another special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n-1,1)\) (this is the relativists’ “mostly plus” convention). Typically, spacetime \(X\) has dimension \(n=4\), so e.g. the Minkowski metric is said to have signature \((3,1):=(-,+,+,+)\).
Problem: Let \((X,g)\) be a pseudo-Riemannian manifold. Often, one writes the general expression for an infinitesimal line element on \(X\) as \(ds^2=g_{\mu\nu}dx^{\mu}dx^{\nu}\); explain the shorthand being used here.
Solution: This is nothing more than choosing some chart \(x^{\mu}\) on \(X\) and expanding the metric tensor field \(g\) in the coordinate basis \(\{dx^{\mu}\otimes dx^{\nu}\}\) of type \((0,2)\) tensor fields:
\[g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\]
for some real, symmetric scalar fields \(g_{\mu\nu}:X\to\mathbf R\). One then simply writes \(ds^2:=g\) and omits the tensor product \(\otimes\).
Problem: Let \(x(t)\in X\) be a curve on a Riemannian manifold \((X,g)\). By choosing a chart \(x^{\mu}\) to cover a suitable region of \(X\) spanned by the trajectory \(x(t)\), explain how to compute the length \(s\) of the curve \(x(t)\) using the Riemannian metric \(g\).
Problem: Write down the action \(S\) for a non-relativistic particle of mass \(m\) moving on a Riemannian manifold \((X,g)\). Hence, derive the geodesic equation. What happens if the particle is relativistic?
Solution: A nonrelativistic free particle of mass \(m\) has action \(S=\int dt L\) described by the usual nonrelativistic kinetic Lagrangian:
\[L=\frac{m}{2}|\dot{\mathbf x}|^2\]
The catch is that \(|\dot{\mathbf x}|^2=g_{ij}(\mathbf x)\dot x^{i}\dot x^{j}\) depends on the Riemannian metric \(g\) in which the particle moves. Applying the Euler-Lagrange equations, one finds the equation of motion (called the geodesic equation):
\[\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\]
where the Christoffel symbols \(\Gamma^i_{jk}(\mathbf x):=\frac{1}{2}g^{i\ell}(\mathbf x)\left(\frac{\partial g_{\ell j}}{\partial x^k}+\frac{\partial g_{\ell k}}{\partial x^j}-\frac{\partial g_{jk}}{\partial x^{\ell}}\right)\) are analogous to “fictitious forces” on the particle due to the curvature of \(X\).
By contrast, a relativistic free particle has action \(S=-mcs=-mc^2\tau\). This can be recast into non-relativistic form \(S=\int dt L\) using the relativistic kinetic Lagrangian:
which in flat space \(g_{\mu\nu}=\text{diag}(1,-1,-1,-1)\) reduces to the familiar \(L=-mc^2/\gamma\) (nb. more generally, the quantity under the square root is always \(\geq 0\) for the timelike worldlines of particles). Some comments:
The stationary action principle \(\delta S=0\) in which the action \(S\) tends to be minimized corresponds to the well-known fact that relativistic free particles travelling in straight line geodesics through spacetime maximize the Minkowski distance \(s\), or equivalently Alice experiences more proper time (i.e. ages more) than Bob in the twin paradox.
The relativistic action (unlike its non-relativistic counterpart) is manifestly reparameterization invariant.
Now the Euler-Lagrange geodesic equations for the relativistic Lagrangian are almost identical to the non-relativistic case except there’s an additional term:
The additional term vanishes (\(\dot L=0\)) iff the worldline is parameterized by an affine transformation \(\tau’=a\tau+b\) of the particle’s proper time \(\tau\) (prove this by showing \(\dot{\tau}’=-aL/mc^2\) along the worldline). Only in this case does the relativistic geodesic equation reduce to the non-relativistic form:
Problem: For \(X=\mathbf R^2\), calculate all non-vanishing Christoffel symbols in the polar coordinate chart \((\rho,\phi)\).
Solution:
Note it is usually more efficient to compute Christoffel symbols this way rather than using the explicit formula in terms of partial derivatives of the metric \(g\) (though they’re of course equivalent).
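For the record, one way to carry out the omitted computation: the free-particle Lagrangian in polar coordinates is \(L=\frac{m}{2}(\dot\rho^2+\rho^2\dot\phi^2)\), whose Euler-Lagrange equations are
\[\ddot\rho-\rho\dot\phi^2=0,\qquad\ddot\phi+\frac{2}{\rho}\dot\rho\dot\phi=0\]
so comparing with the geodesic equation \(\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\), the only non-vanishing Christoffel symbols are
\[\Gamma^{\rho}_{\phi\phi}=-\rho,\qquad\Gamma^{\phi}_{\rho\phi}=\Gamma^{\phi}_{\phi\rho}=\frac{1}{\rho}\]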
Problem: Explain how the presence of a pseudo-Riemannian metric \(g\) on a smooth manifold \(X\) allows one to identify covectors in \(T_x^*(X)\) with tangent vectors in \(T_x(X)\) at each point \(x\in X\).
Solution: Just as in \(X=\mathbf R^n\) one identifies a tangent vector \(\mathbf v\) with its covector \(\mathbf v\cdot\) via the inner product \(\cdot\), so at a given \(x\in X\), one uses the pseudo-inner product \(g_x\) on \(T_x(X)\) to make the exact same identification \(v_x\leftrightarrow g_x(v_x,\space)\) (this is sometimes called the musical isomorphism induced by \(g\) in light of the map \(\flat_x:T_x(X)\to T^*_x(X)\) defined by \(v_x^{\flat_x}(v’_x):=g_x(v_x,v’_x)\) and its inverse \(\sharp_x=\flat_x^{-1}\)).
In some chart \(x^{\mu}\), one can write the \(1\)-form \(v^{\flat}=(v^{\flat})_{\mu}dx^{\mu}\) with components \((v^{\flat})_{\mu}=v^{\flat}(\partial_{\mu})=g(v,\partial_{\mu})=g_{\mu\nu}v^{\nu}\) where \(v^{\nu}=v(x^{\nu})\) are the components of its musically isomorphic vector field \(v=v^{\nu}\partial_{\nu}\) in the same coordinate basis. Physicists typically abuse notation by merely writing \((v^{\flat})_{\mu}=g_{\mu\nu}v^{\nu}\) as just \(v_{\mu}=g_{\mu\nu}v^{\nu}\), identifying \(v^{\flat}\equiv v\) under the musical isomorphism. But then this lazy notation makes it look as if the metric \(g\) performs a trivial mechanical action of just “lowering the index” on \(v\).
Problem: Define a tensor field \(g^*\) that performs the trivial mechanical action of “raising the index” on a \(1\)-form \(A\).
Solution: Define \(g^*_x:T^*_x(X)^2\to\mathbf R\) to be a type \((2,0)\) tensor given by:
\[g^*(A,A’):=g(A^{\sharp}, A’^{\sharp})\]
Then on the one hand, in some chart \(x^{\mu}\), one has:
This implies \(g_{\mu\nu}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}\). Since \(g\) is non-degenerate (the \(n_0=0\) axiom of a pseudo-Riemannian metric), it follows that \(\det g_{\mu\nu}\neq 0\), i.e. the component matrix \(g_{\mu\nu}\) is invertible, so the only possibility is \((g^*)^{\mu\nu}g_{\nu\rho}=\delta^{\mu}_{\rho}\), i.e. \(g^*\sim g^{-1}\). It is common practice to just drop the \(*\) and write \(g^{\mu\nu}\) in lieu of \((g^*)^{\mu\nu}\). With this lazy notation, \(g^{\mu\nu}\) looks like it just “raises the index” on \(A^{\mu}=g^{\mu\nu}A_{\nu}\) (where one last abuse of notation \((A^{\sharp})^{\mu}=A^{\mu}\) has been made!). cf. the raising/lowering operators \(a^{\dagger},a\) in quantum mechanics.
Problem: What is the canonical volume form on a smooth, orientable pseudo-Riemannian \(n\)-manifold \((X,g)\)? Why is it canonical?
Solution: In a chart \(x^{\mu}\) in which the pseudo-Riemannian metric has components \(g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\), the canonical volume form is \(\sqrt{|\det g|}dx^1\wedge…\wedge dx^n\). This is indeed a volume form because it is a top form whose coefficient is nowhere vanishing (\(\det g\neq 0\) everywhere for the same reasons as before). Moreover, it is canonical because, despite appearances, it does not depend on the choice of chart \(x^{\mu}\) (provided it’s the same orientation), as \(\sqrt{|\det g|}\) is a scalar density of weight \(1\) whereas \(dx^1\wedge…\wedge dx^n\) is a scalar density of weight \(-1\).
Problem: Let \((X,g)\) be a smooth, \(n\)-dimensional, orientable pseudo-Riemannian manifold, and let \(\omega\in\Omega^k(X)\) be a differential \(k\)-form on \(X\). Define the Hodge dual \((n-k)\)-form \(\star\omega\in\Omega^{n-k}(X)\) on \(X\).
Solution: The Hodge dual \(\star\omega\in\Omega^{n-k}(X)\) is defined to be the unique differential \((n-k)\)-form such that for all test differential \(k\)-forms \(\omega’\in\Omega^k(X)\) one has the equality of top forms:
Here, the operation \(\widetilde{\langle\omega’,\omega\rangle}\) is defined for \(\omega’=A’_1\wedge…\wedge A’_k\) and \(\omega=A_1\wedge…\wedge A_k\) by the Gram determinant:
and extended to arbitrary differential \(k\)-forms \(\omega’,\omega\in\Omega^k(X)\) by bilinearity. It is easy to check that the Hodge star \(\star:\Omega^k(X)\to\Omega^{n-k}(X)\) is a linear transformation \(\star(\phi_1\omega_1+\phi_2\omega_2)=\phi_1\star\omega_1+\phi_2\star\omega_2\) between \(2\) vector spaces of the same dimension \(\dim\Omega^k(X)={n\choose k}={n\choose n-k}=\dim\Omega^{n-k}(X)\). Moreover, it is either an involution or behaves like multiplication by \(i\) in the sense that \(\star^2=(-1)^{k(n-k)+n_-}\).
Problem: For flat Minkowski spacetime \(X=\mathbf R^{1+3}\) with the Lorentzian metric \(ds^2=c^2dt^2-dx^2-dy^2-dz^2\) and orientation defined by the canonical volume form \(cdt\wedge dx\wedge dy\wedge dz\), compute the Hodge dual:
Solution: By linearity, one can focus on computing Hodge duals of the basis \(2\)-forms \(\star (dy\wedge dt)\) and \(\star(dx\wedge dz)\). Demonstrating this for \(\star (dy\wedge dt)\), the idea is to (roughly speaking) wedge it against itself and exploit the defining property of the Hodge dual:
In this case, one can directly read off the answer \(\star(dy\wedge dt)=-c^{-1}dx\wedge dz\) or equivalently \(\star(dy\wedge dct)=dz\wedge dx\). Analogously, one can check \(\star(dx\wedge dz)=dy\wedge dct\) (consistent with the earlier identity for \(\star^2\)), so the final answer is:
For more general non-orthogonal metrics \(g\), one cannot simply read off the answer like that but should set up an ansatz for the Hodge dual like \(\star(dy\wedge dt)=c_{tx}dt\wedge dx+c_{ty}dt\wedge dy+c_{tz}dt\wedge dz+c_{xy}dx\wedge dy+c_{xz}dx\wedge dz+c_{yz}dy\wedge dz\) and obtaining the coefficients by wedging against the relevant basis differential forms.
Problem: Let \((X,g)\) be a smooth, \(n\)-dimensional pseudo-Riemannian manifold. Just as the metric tensor field \(g\) induces a pseudo-inner product between tangent vectors at each point across the manifold \(X\), show that \(g\) also induces an inner product on the space \(\Omega^k(X)\) of differential \(k\)-forms on \(X\).
Solution: Given \(2\) differential \(k\)-forms \(\omega,\omega’\in\Omega^k(X)\), their \(g\)-induced inner product is defined by:
(this is well-defined because \(\omega\wedge\star\omega’\in\Omega^n(X)\) is indeed a top form as observed earlier).
Problem: Having endowed \(\Omega^k(X)\) with an inner product, one can define the adjoint (sometimes called the codifferential) \(d^{\dagger}:\Omega^k(X)\to\Omega^{k-1}(X)\) of the exterior derivative operator \(d:\Omega^{k-1}(X)\to\Omega^k(X)\) by insisting that for arbitrary differential forms \(\omega\in\Omega^{k-1}(X),\omega’\in\Omega^k(X)\):
(the LHS is the inner product on \(\Omega^k(X)\) whereas the RHS is the inner product on \(\Omega^{k-1}(X)\)). Show that if the underlying pseudo-Riemannian manifold \(X\) is closed, then:
Isolating for \(d^{\dagger}\) by acting with \(\star\) on both sides, using \(\star^2=(-1)^{(k-1)(n-k+1)+n_-}\), and exploiting trivial identities like \((-1)^{k^2-k}=1\) and \((-1)^{-n}=(-1)^n\) for \(k,n\in\mathbf Z\) leads to the claimed result.
Problem: Using the technology of \(d^{\dagger}\), define the Laplace-de Rham operator \(\Delta:\Omega^k(X)\to\Omega^k(X)\). For the case of a scalar field \(k=0\), show that the action of the Laplace-de Rham operator \(\Delta\) coincides with that of the negative Laplacian \(-\nabla^2\) (sometimes called the Laplace-Beltrami operator).
Solution: One has the positive semi-definite Laplace-de Rham operator:
\[\Delta:=(d+d^{\dagger})^2=\{d,d^{\dagger}\}\]
thanks to nilpotence \(d^2=(d^{\dagger})^2=0\). For a scalar field \(\phi\in C^{\infty}(X)\), \(d^{\dagger}\phi=0\) so:
Problem: Let \(\omega\in\Omega^k(X)\) be a differential \(k\)-form on a closed, Riemannian manifold \((X,g)\). Explain what it means for \(\omega\) to be harmonic. Furthermore, prove that \(\omega\) is harmonic iff it is both closed \(d\omega=0\) and co-closed \(d^{\dagger}\omega=0\) (this equivalence underlies the Hodge decomposition theorem and hence the rest of Hodge theory).
Solution: This is just a generalization of the usual notion of a harmonic scalar field obeying Laplace’s equation \(\nabla^2\phi=0\), only now the Laplacian \(\nabla^2\) is replaced by the Laplace-de Rham operator \(\Delta\) and the scalar field \(\phi\) by an arbitrary differential \(k\)-form \(\omega\in\Omega^k(X)\):
If \(\omega\) is harmonic \(\Delta\omega=0\), then \(0=\langle\omega,\Delta\omega\rangle=\langle\omega,dd^{\dagger}\omega\rangle+\langle\omega,d^{\dagger}d\omega\rangle=\langle d^{\dagger}\omega,d^{\dagger}\omega\rangle+\langle d\omega,d\omega\rangle\) by the adjoint property; because \(g\) is Riemannian, both terms are non-negative and hence must vanish separately, so in particular \(d\omega=d^{\dagger}\omega=0\). The converse is trivial: \(\Delta\omega=dd^{\dagger}\omega+d^{\dagger}d\omega=d0+d^{\dagger}0=0\).
Problem: Let \(X\) be a smooth manifold not necessarily equipped with a metric \(g\). What does it mean to endow \(X\) with the additional structure of an affine connection \(\nabla\)?
Solution: An affine connection is any map \(\nabla:\mathfrak{X}(X)^2\to\mathfrak{X}(X)\) which is linear in its first tangent vector field argument but acts like a tangent vector (more precisely, like a derivation) in its second tangent vector field argument. Thus, \(\nabla\) is not tensorial because it is not linear in that second argument. Nevertheless, given a (possibly non-coordinate) basis \(\text{span}_{\mathbf R}\{e_{\mu}\}=\mathfrak{X}(X)\) of tangent vector fields, one can still define affine connection coefficients for \(\nabla\):
with \(\nabla_{\nu}:=\nabla_{e_{\nu}}\) a shorthand. The map \(\nabla_v:\mathfrak{X}(X)\to\mathfrak{X}(X)\) for some fixed tangent vector field \(v\in\mathfrak{X}(X)\) is called the covariant derivative along \(v\).
Problem: Let \(X\) be a smooth manifold. For two tangent vector fields \(v,v’\in\mathfrak{X}(X)\), given a (possibly non-coordinate) basis \(\text{span}_{\mathbf R}\{e_{\mu}\}=\mathfrak{X}(X)\) of tangent vector fields so that \(v=v^{\mu}e_{\mu}\) and \(v’=v’^{\nu}e_{\nu}\), show that:
Solution: This is a straightforward application of the properties of affine connections. The only thing to watch out for is that the affine connection is extended to act on scalar fields \(\phi\in C^{\infty}(X)\) in the obvious way:
\[\nabla_v\phi:=v(\phi)=\mathcal L_v\phi\]
In particular, \(\nabla_{\mu}\phi=e_{\mu}(\phi)\), and if \(e_{\mu}=\partial_{\mu}\) happens to be a coordinate basis for \(\mathfrak{X}(X)\), then the covariant derivative \(\nabla_{\mu}=\partial_{\mu}\) coincides with the partial derivative on scalar fields.
Problem: For tangent vector fields \(v,v’\in\mathfrak{X}(X)\), compare the covariant derivative \(\nabla_vv’\) with the Lie derivative \(\mathcal L_vv’\) (both of which are tangent vector fields).
Solution: Basically, in the case of the Lie derivative \(\mathcal L_vv’=[v,v’]\), both entries of the commutator are subject to the product (Leibniz) rule, for example \(\mathcal L_{\phi v}v’=[\phi v,v’]=\phi[v,v’]-v’(\phi)v\) for a scalar field \(\phi\in C^{\infty}(X)\), so in particular neither argument is \(C^{\infty}(X)\)-linear (it is additive but not homogeneous under multiplication by scalar fields). By contrast, in the covariant derivative \(\nabla_vv’\), the \(v\)-argument is explicitly made to be \(C^{\infty}(X)\)-linear (though the \(v’\) argument behaves in the same derivation-like way as both arguments of the Lie derivative), and this is what allows familiar linear algebraic expressions like \(\nabla_v=v^{\mu}\nabla_{\mu}\) to be written (this was proven above in the form \(\nabla_vv’=v^{\nu}\nabla_{\nu}v’\)).
Problem: Explain how to compute the covariant derivative of a \(1\)-form \(A\in\Omega^1(X)\) along a tangent vector field \(v\in\mathfrak{X}(X)\) to end up with another \(1\)-form \(\nabla_v A\).
Solution: As a \(1\)-form, \(A\) maps a tangent vector field \(v’\in\mathfrak{X}(X)\) to a scalar field \(A(v’)\in C^{\infty}(X)\). But earlier, it was already explained one can take the covariant derivative of a scalar field, so it makes sense to compute \(\nabla_v(A(v’))=v(A(v’))\). However, by insisting that the product rule holds in the sense:
with \((\nabla_{\mu}A)_{\nu}=\partial_{\mu}A_{\nu}-\Gamma^{\rho}_{\mu\nu}A_{\rho}\).
Problem: Let \(X\) be a smooth manifold equipped with an affine connection \(\nabla\). Define the type-\((1,2)\) torsion tensor field \(T\) of \(\nabla\), and compute the components of \(T\) in a coordinate basis.
Solution: For a \(1\)-form \(A\in\Omega^1(X)\) and tangent vector fields \(v,v’\in\mathfrak{X}(X)\), one has:
\[T(A,v,v’):=A(\nabla_vv’-\nabla_{v’}v-[v,v’])\]
One should check that \(T\) is indeed a tensor, since the affine connection \(\nabla\) is not a tensor as mentioned earlier. It is easy to see that \(T\) is linear in \(A\), and it is linear in \(v\) iff it is linear in \(v’\) because of antisymmetry \(T(A,v’,v)=-T(A,v,v’)\). Furthermore, it is easy to see that \(T(A,v_1+v_2,v’)=T(A,v_1,v’)+T(A,v_2,v’)\). The tricky property to prove is that \(T(A,\phi v,v’)=\phi T(A,v,v’)\) for any scalar field \(\phi\in C^{\infty}(X)\) (it must be a scalar field and not just some scalar \(\phi\in\mathbf R\) because one is trying to show that \(T\) is a tensor field not just a tensor):
Thus, in a coordinate basis \(\{\partial_{\mu}\}\) of \(\mathfrak{X}(X)\), the components \(T^{\rho}_{\mu\nu}:=T(dx^{\rho},\partial_{\mu},\partial_{\nu})\) are tensorial:
where Clairaut’s theorem \([\partial_{\mu},\partial_{\nu}]=0\) for a coordinate basis has been used.
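Explicitly, with the connection coefficients \(\Gamma^{\rho}_{\mu\nu}\) defined by \(\nabla_{\mu}\partial_{\nu}=\Gamma^{\rho}_{\mu\nu}\partial_{\rho}\) (the convention consistent with \((\nabla_{\mu}A)_{\nu}=\partial_{\mu}A_{\nu}-\Gamma^{\rho}_{\mu\nu}A_{\rho}\) above), this reads:
\[T^{\rho}_{\mu\nu}=\Gamma^{\rho}_{\mu\nu}-\Gamma^{\rho}_{\nu\mu}\]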
Problem: Let \(X\) be a smooth manifold equipped with an affine connection \(\nabla\). Define the type-\((1,3)\) Riemann curvature tensor field \(R\) of \(\nabla\), and compute the components of \(R\) in a coordinate basis.
Solution: For a \(1\)-form \(A\in\Omega^1(X)\) and tangent vector fields \(v,v’,v^{\prime\prime}\in\mathfrak{X}(X)\), one has:
Similar to the torsion tensor field \(T\), one should check that \(R\) genuinely deserves its designation as a tensor field. In a coordinate basis \(\{\partial_{\mu}\}\) of \(\mathfrak{X}(X)\), the (slightly awkward but conventional) tensor components \(R^{\sigma}_{\rho\mu\nu}:=R(dx^{\sigma},\partial_{\mu},\partial_{\nu},\partial_{\rho})\) are:
where again Clairaut’s theorem \([\partial_{\mu},\partial_{\nu}]=0\) for a coordinate basis has been used. Again similar to the case for \(T\), the antisymmetry \(R(A,v’,v,v^{\prime\prime})=-R(A,v,v’,v^{\prime\prime})\) manifests at the component level as \(R^{\sigma}_{\rho\nu\mu}=-R^{\sigma}_{\rho\mu\nu}\).
Problem: (about the Ricci identity connecting \(T\) and \(R\) together in terms of the commutator of covariant derivatives)
Solution:
Problem: State and prove the fundamental theorem of pseudo-Riemannian geometry.
Solution: Let \((X,g)\) be a smooth, pseudo-Riemannian manifold. Then there exists a unique affine connection \(\nabla\) (called the Levi-Civita connection) which is torsion-free \(T=0\) and \(g\)-compatible in the sense that \(\nabla_v g=0\) for all tangent vector fields \(v\in\mathfrak{X}(X)\).