
Hamilton’s Optics-Mechanics Analogy

Problem: Deduce the Hamilton-Jacobi equation of classical mechanics.

Solution: Instead of viewing the action \(S=S[\mathbf x(t)]\) as a functional of the particle’s trajectory \(\mathbf x(t)\), it can be viewed more simply as a scalar field \(S(\mathbf x,t)\) in which the initial point in spacetime \((t_0,\mathbf x_0)\) is fixed and one simply takes the on-shell trajectory from \((t_0,\mathbf x_0)\) to \((t,\mathbf x)\). Varying the endpoint \(\mathbf x\) at fixed time \(t\) then gives \(\delta S=\mathbf p\cdot\delta\mathbf x\) (this follows from the usual Noetherian boundary-term calculation), so in particular:

\[\mathbf p=\frac{\partial S}{\partial\mathbf x}\]

Intuitively, this is saying that the particle moves in a direction (the direction of the momentum \(\mathbf p\)) orthogonal to the contour surfaces of the action field \(S\), i.e. such isosurfaces can be viewed as “wavefronts”. Then the total time derivative is:

\[\dot S=L\]

But the chain rule also gives \(\dot S=\frac{\partial S}{\partial t}+\frac{\partial S}{\partial\mathbf x}\cdot\dot{\mathbf x}=\frac{\partial S}{\partial t}+\mathbf p\cdot\dot{\mathbf x}\). Thus, solving for the Hamiltonian \(H=\mathbf p\cdot\dot{\mathbf x}-L\) yields the Hamilton-Jacobi nonlinear \(1^{\text{st}}\)-order PDE for \(S(\mathbf x,t)\):

\[-\frac{\partial S}{\partial t}=H\left(\mathbf x,\frac{\partial S}{\partial\mathbf x},t\right)\]

Problem: When \(\partial H/\partial t=0\), the Hamiltonian is conserved with energy \(H=E\), so this motivates the additive separation of variables, \(S(\mathbf x,t):=S_0(\mathbf x)-Et\) for some constant \(E\). What does the Hamilton-Jacobi equation simplify to in this case? For a single non-relativistic particle of mass \(m\) moving in a potential \(V(\mathbf x)\), what does this look like? What about in \(1\) dimension?

Solution: \[H\left(\mathbf x,\frac{\partial S_0}{\partial\mathbf x}\right)=E\]

which for \(H(\mathbf x,\mathbf p)=|\mathbf p|^2/2m+V(\mathbf x)\) looks like:

\[\frac{1}{2m}\biggr|\frac{\partial S_0}{\partial\mathbf x}\biggr|^2+V(\mathbf x)=E\]

and in \(1\) dimension is integrable to the explicit solution:

\[S_0(x)=\pm\int ^xdx’\sqrt{2m(E-V(x’))}\]

In particular, the usual trajectory \(x(t)\) can be obtained by treating \(S=S(x,t;E)=S_0(x;E)-Et\) as a family of solutions parameterized by the energy \(E\); this works because \(S\) can be viewed as a particular generating function of a canonical transformation \((\mathbf x,\mathbf p,H)\mapsto (\mathbf x’,\mathbf p’,H’)\) in which the “boosted” Hamiltonian vanishes \(H’=0\).

\[\frac{\partial S}{\partial E}=-t_0\Rightarrow t-t_0=\frac{\partial S_0}{\partial E}=\pm\sqrt{\frac{m}{2}}\int_{x_0}^{x(t)}\frac{dx’}{\sqrt{E-V(x’)}}\]
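As a quick numerical sanity check (not part of the original post), the quadrature above can be evaluated for the harmonic oscillator \(V(x)=\frac{1}{2}m\omega^2x^2\) with amplitude \(A\) (so \(E=\frac{1}{2}m\omega^2A^2\)), where it has the closed form \(t-t_0=\frac{1}{\omega}\arcsin(x/A)\); a minimal Python sketch:

```python
import math

def time_of_flight(V, E, m, x0, x1, n=20_000):
    """Midpoint-rule evaluation of t - t0 = sqrt(m/2) * integral dx / sqrt(E - V(x))."""
    h = (x1 - x0) / n
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * h
        total += h / math.sqrt(E - V(x))
    return math.sqrt(m / 2) * total

# Harmonic oscillator V(x) = (1/2) m w^2 x^2 with amplitude A, so E = (1/2) m w^2 A^2;
# the closed form is t - t0 = (1/w) * (arcsin(x1/A) - arcsin(x0/A)).
m, w, A = 1.0, 2.0, 1.0
E = 0.5 * m * w**2 * A**2
t = time_of_flight(lambda x: 0.5 * m * w**2 * x**2, E, m, 0.0, A / 2)
expected = math.asin(0.5) / w   # = pi / (6 w)
```

The numeric quadrature and the closed form agree to high precision (the integrand is smooth since the endpoint \(A/2\) is away from the turning point \(A\)).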

Problem: Above, the static field \(S_0(\mathbf x)\) was introduced to simplify the Hamilton-Jacobi equation when the energy \(E\) was conserved. However, if one pulls back to the level of functionals rather than fields, one can define an analogous abbreviated action functional \(S_0[\mathbf x]\) which depends only on the path \(\mathbf x\) taken rather than the trajectory \(\mathbf x(t)\). Define \(S_0[\mathbf x]\), and moreover show that when the energy \(E\) is conserved, the on-shell path is a stationary point of \(S_0\) (this is called Maupertuis’s principle).

Solution: The abbreviated action for a single particle of mass \(m\) and (non-relativistic) energy \(E=|\mathbf p|^2/2m +V(\mathbf x)\) is:

\[S_0[\mathbf x]:=\int d\mathbf x\cdot\mathbf p=\int ds |\mathbf p|=\int ds\sqrt{2m(E-V(\mathbf x))}\]

(modifying this to \(S_0[\mathbf x]:=\int ds |\mathbf p|=\int ds\sqrt{2(E-V(\mathbf x))}\) allows for interpretation as an \(N\)-particle system in configuration space \(\mathbf x\in\mathbf R^{3N}\) with the Riemannian “mass metric” \(ds^2=m_1|d\mathbf x_1|^2+…+m_N|d\mathbf x_N|^2\)).

To find the stationary paths of \(S_0[\mathbf x]\) subject to the constraint \(H(\mathbf x,\mathbf p)=E\), one can implement a Lagrange multiplier \(\gamma(\tau)\) to perform unconstrained extremization of:

\[S[\mathbf x(\tau)]:=S_0[\mathbf x(\tau)]-\int d\tau\gamma(\tau)(H(\mathbf x,\mathbf p)-E)=\int d\tau (\mathbf p\cdot\dot{\mathbf x}-\gamma(\tau)(H(\mathbf x,\mathbf p)-E))\]

The Euler-Lagrange equations lead to Hamilton’s equations:

\[\frac{d\mathbf x}{d\tau}=\gamma\frac{\partial H}{\partial\mathbf p}\]

\[\frac{d\mathbf p}{d\tau}=-\gamma\frac{\partial H}{\partial\mathbf x}\]

provided the Lagrange multiplier \(\gamma=dt/d\tau\) encodes reparameterization invariance; with this choice it’s clear that the integrand in the functional \(S\) was nothing more than the Lagrangian \(L=\mathbf p\cdot\dot{\mathbf x}-H\) (plus an unimportant constant \(E\)) so Maupertuis’s principle reduces to the usual Hamilton’s principle.

Problem: What does Fermat’s principle in ray optics assert? Hence, derive the ray equation.

Solution: Fermat’s principle asserts that the travel-time functional \(T=T[\mathbf x(s)]\) of a ray trajectory \(\mathbf x(s)\) is stationary on-shell, where the optical path length is:

\[cT[\mathbf x(s)]=\int ds n(\mathbf x(s))\]

This is reparameterization invariant, since one can arbitrarily parameterize \(\mathbf x=\mathbf x(t)\) and replace \(ds=dt|\dot{\mathbf x}|\). The corresponding Euler-Lagrange equations are:

\[\frac{d}{dt}\left(n(\mathbf x)\frac{\dot{\mathbf x}}{|\dot{\mathbf x}|}\right)=|\dot{\mathbf x}|\frac{\partial n}{\partial\mathbf x}\]

But by choosing the natural parameterization \(t:=s\) one has \(|d\mathbf x/ds|=1\), hence the ray equation:

\[\frac{d}{ds}\left(n\frac{d\mathbf x}{ds}\right)=\frac{\partial n}{\partial\mathbf x}\]

This can also be written in terms of the curvature vector \(\boldsymbol{\kappa}=d^2\mathbf x/ds^2\):

\[\boldsymbol{\kappa}=\left(\frac{\partial\ln n}{\partial\mathbf x}\right)_{\perp d\mathbf x}\]
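A minimal numerical illustration (a Python sketch with a made-up linear index profile \(n(y)=1+y\), not from the original post): integrating the ray equation in “momentum” form \(\mathbf p:=n\,d\mathbf x/ds\), the Snell invariant \(n\sin\theta\) is conserved in a stratified medium, and the eikonal constraint \(|\mathbf p|=n\) is maintained along the ray:

```python
import math

def trace_ray(n, grad_n, x0, y0, theta0, s_max, h=1e-4):
    """Euler-integrate d/ds(n dX/ds) = grad n, with 'momentum' P = n * (unit tangent);
    theta0 is the launch angle measured from the y-axis, so sin(theta) = dx/ds."""
    X = [x0, y0]
    n0 = n(x0, y0)
    P = [n0 * math.sin(theta0), n0 * math.cos(theta0)]
    for _ in range(int(s_max / h)):
        norm = math.hypot(P[0], P[1])   # |P| = n along an exact ray
        X[0] += h * P[0] / norm
        X[1] += h * P[1] / norm
        gx, gy = grad_n(X[0], X[1])
        P[0] += h * gx
        P[1] += h * gy
    return X, P

# Stratified medium n(y) = 1 + y: the ray bends, but n sin(theta) (= P[0]) is invariant.
n = lambda x, y: 1.0 + y
grad_n = lambda x, y: (0.0, 1.0)
X, P = trace_ray(n, grad_n, 0.0, 0.0, math.radians(30), s_max=1.0)
```

Since \(\partial n/\partial x=0\) here, the horizontal momentum is conserved exactly by the integrator, while \(|\mathbf p|=n\) holds only up to the discretization error.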

Problem: Starting from an arbitrary Cartesian component \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) of either the \(\mathbf E\) or \(\mathbf B\) fields of a light wave (here \(\omega=ck_0\) with \(k_0=2\pi/\lambda_0\) is the free space wavenumber), make the eikonal approximation to the dispersionless wave equation obeyed by \(\psi\) in order to obtain the (scalar) eikonal equation. By defining light rays as the integral curves of the eikonal field \(cT(\mathbf x)\) (a kind of local optical path length), reproduce the vector eikonal equation from Fermat’s principle above.

Solution: The ansatz \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) is easy to justify; the \(e^{-i\omega t}\) is just a Fourier transform factor that reduces the wave equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=\frac{n^2}{c^2}\frac{\partial^2\psi}{\partial t^2}\) to a Helmholtz equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=-n^2k_0^2\psi\). The remaining piece is just a polar parameterization of an arbitrary \(\mathbf C\)-valued spatial field \(\psi_0(\mathbf x)e^{ik_0cT(\mathbf x)}\). One obtains:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2=n^2+\frac{1}{k_0^2\psi_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_0+\frac{2i}{k_0\psi_0}\frac{\partial\psi_0}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}+\frac{i}{k_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2cT\]

The eikonal approximation amounts to taking the ray optics limit \(k_0\to\infty\) (in practice, the wavelength \(2\pi/k_0\) has to be much shorter than all other length scales), and yields the (scalar) eikonal equation:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|=n\]

A light ray is thus a trajectory \(\mathbf x(s)\) with unit tangent vector:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

The rest is an application of the chain rule:

\[\frac{d}{ds}=\frac{\partial}{\partial\mathbf x}\cdot\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}\]

followed by the identity:

\[\left(\frac{\partial cT}{\partial\mathbf x}\cdot\frac{\partial}{\partial\mathbf x}\right)\frac{\partial cT}{\partial\mathbf x}=\frac{1}{2}\frac{\partial}{\partial\mathbf x}\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2\]

to deduce the (vector) eikonal equation of motion for ray trajectories just as Fermat’s principle predicts.

Problem: Hence, what is Hamilton’s optics-mechanics analogy?

Solution: In a nutshell, the isomorphism proceeds as:

\[(n(\mathbf x), cT)\leftrightarrow (|\mathbf p(\mathbf x)|,S_0)\]

Problem: Use Hamilton’s optics-mechanics analogy to solve the brachistochrone problem (this was how Johann Bernoulli originally solved it).

Solution: By energy conservation, the speed of the particle at distance \(y>0\) below its initial dropping height is \(v=\sqrt{2gy}\). By Fermat’s principle, minimizing the time functional then amounts to treating the particle as a light ray with \(n(\mathbf x)=c/v(y)\). So the question becomes: how do light rays bend in a horizontally stratified medium with \(n(y)\propto y^{-1/2}\)? The answer is given by the ray equations:

\[\frac{d}{ds}\begin{pmatrix} y^{-1/2}dx/ds \\ y^{-1/2}dy/ds\end{pmatrix}=\begin{pmatrix}0 \\ -y^{-3/2}/2\end{pmatrix}\]

The horizontal component expresses Snell’s law since \(dx/ds=\sin\theta\) (it expresses momentum conservation along the homogeneous \(\partial n/\partial x=0\) direction). Using the tangent vector constraint \(ds^2=dx^2+dy^2\) gives the ODE of a cycloid:

\[\frac{dy}{dx}=\sqrt{\frac{\text{const}}{y}-1}\]

(the vertical component ODE has an analytical solution \(y(s)=-s^2/8R+s\) which is contained in the cycloid, so is redundant information).
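One can verify directly (a small Python check, not in the original post) that the standard cycloid parameterization \(x=R(\theta-\sin\theta)\), \(y=R(1-\cos\theta)\) satisfies this ODE with \(\text{const}=2R\):

```python
import math

# Cycloid x = R(t - sin t), y = R(1 - cos t): its parametric slope (dy/dt)/(dx/dt)
# should match sqrt(const/y - 1) with const = 2R at every point of the arc.
R = 1.5
slopes, rhs = [], []
for t in (0.5, 1.0, 2.0, 3.0):
    y = R * (1 - math.cos(t))
    slopes.append(math.sin(t) / (1 - math.cos(t)))
    rhs.append(math.sqrt(2 * R / y - 1))
```

Indeed \(\frac{dy}{dx}=\frac{\sin\theta}{1-\cos\theta}=\cot(\theta/2)=\sqrt{\frac{1+\cos\theta}{1-\cos\theta}}=\sqrt{\frac{2R}{y}-1}\).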

Problem: How did Hamilton’s optics-mechanics analogy inspire Schrodinger to propose his famous equations of quantum mechanics?

Solution: Essentially, Schrodinger asked: ray optics is to wave optics as classical mechanics is to what? In other words, one imagines there exists a wave theory of particles/matter and one would like to take the “inverse eikonal limit” of classical mechanics (here, “inverse eikonal limit” is usually called quantization):

Just as light rays propagate parallel to their phase fronts:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

Particles propagate parallel to their “action fronts” exactly according to Hamilton’s analogy:

\[\frac{d\mathbf x}{ds}=\frac{1}{|\mathbf p(\mathbf x)|}\frac{\partial S_0}{\partial\mathbf x}\]

Already this suggests that the action should have some phase interpretation. More precisely, it should be the phase of the particle’s de Broglie wave in units of \(\hbar\). It’s also not obvious that particles should be described by a scalar wave field rather than e.g. the electromagnetic vector wave fields of light. Schrodinger simply guessed it looked like the equivalent of “scalar diffraction theory” with a single wavefunction \(\psi(\mathbf x,t)=\psi_0(\mathbf x,t)e^{iS(\mathbf x,t)/\hbar}\). This gives the Madelung equations of quantum hydrodynamics, one of which is just a continuity equation (giving credence to the Born interpretation of \(|\psi|^2\)) and the other of which is a quantum Hamilton-Jacobi equation which in the limit \(\hbar\to 0\) (analogous to the eikonal limit \(\lambda_0\to 0\)) reduces to the classical Hamilton-Jacobi equation.


Pseudo-Riemannian Geometry

Problem: Define the signature of a matrix. Hence, state and prove Sylvester’s law of inertia.

Solution: The signature of an \(n\times n\) matrix \(A\) is a \(3\)-tuple \((n_+,n_-,n_0)\) where \(n_+\) is the number of positive eigenvalues of \(A\) (including multiplicity), \(n_-\) is the number of negative eigenvalues of \(A\) (including multiplicity), and \(n_0=\text{dim}\ker A\) is the multiplicity of the zero eigenvalue; thus, \(n_++n_-+n_0=n\).

Let \(A,B\) be real symmetric matrices. Then Sylvester’s law of inertia asserts that \(A\) and \(B\) are congruent matrices iff they have the same signature (which sometimes is called “inertia” because of this invariance under congruence, hence the name).

Proof: It’s easy to see that the nullity \(n_0\) is preserved by congruence transformations (congruence preserves rank). If one can show that \(n_+\) is also preserved, then \(n_-\) is preserved too by virtue of \(n_++n_-+n_0=n\). To show this, suppose for contradiction that \(n_+\) is not preserved, say \(n_+(A)>n_+(B)\) for congruent \(B=P^TAP\); then by dimension counting, the subspace spanned by the positive-eigenvalue eigenvectors of \(A\) must intersect the \(P\)-image of the subspace spanned by the non-positive-eigenvalue eigenvectors of \(B\) in a non-zero vector, on which the quadratic form would be simultaneously positive and non-positive, a contradiction.

Since any real, symmetric matrix \(A\) is isomorphic to a real quadratic form \(Q(\mathbf x):=\mathbf x^TA\mathbf x\), the concepts of signature and Sylvester’s law of inertia can also be reformulated in the language of quadratic forms rather than real symmetric matrices.
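As a concrete illustration (a sketch with a hand-picked \(2\times 2\) example, not in the original), one can check that a congruence transform \(B=P^TAP\) preserves the signature:

```python
import math

def sym2x2_signature(a, b, c):
    """Signature (n+, n-, n0) of [[a, b], [b, c]] via its closed-form eigenvalues
    ((a + c) +/- sqrt((a - c)^2 + 4 b^2)) / 2."""
    d = math.sqrt((a - c) ** 2 + 4 * b ** 2)
    eigs = [(a + c + d) / 2, (a + c - d) / 2]
    return (sum(e > 1e-12 for e in eigs),
            sum(e < -1e-12 for e in eigs),
            sum(abs(e) <= 1e-12 for e in eigs))

# A = diag(1, -1); congruence by the invertible P = [[2, 1], [1, 1]] gives
# B = P^T A P = [[3, 1], [1, 0]] (computed by hand), which must have the same signature.
sig_A = sym2x2_signature(1.0, 0.0, -1.0)
sig_B = sym2x2_signature(3.0, 1.0, 0.0)
```

Although the eigenvalues of \(B=\begin{pmatrix}3&1\\1&0\end{pmatrix}\), namely \((3\pm\sqrt{13})/2\), differ from those of \(A\), the signature \((1,1,0)\) is the same.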

Problem: Let \(X\) be a smooth \(n\)-manifold. Explain what it means to place the additional structure of a pseudo-Riemannian metric \(g\) on \(X\). Then explain how Riemannian and Lorentzian geometry are special cases of pseudo-Riemannian geometry.

Solution: A Riemannian metric \(g\) on \(X\) is a type \((0,2)\) tensor field that defines an inner product \(g_x:T_x(X)^2\to\mathbf R\) at the tangent space \(T_x(X)\) of each point \(x\in X\). The “tensor field” part is sometimes rephrased as saying that \(g_x\) is a bilinear form. Moreover, to flesh out the usual axioms of a real inner product space, the Riemannian metric tensor \(g_x\) at each \(x\in X\) must be symmetric \(g_x(v_x,v’_x)=g_x(v’_x,v_x)\) and positive-definite \(v_x\neq 0\Rightarrow g_x(v_x,v_x)>0\).

A pseudo-Riemannian metric relaxes the positive-definite requirement of a Riemannian metric, instead merely requiring non-degeneracy (i.e. at each point \(x\in X\), the zero vector \(0\) is the only vector orthogonal \(g_x(0,v_x)=0\) to all \(v_x\in T_x(X)\)). More concisely, what this is saying is that \(n_0=0\), so the signature of a pseudo-Riemannian metric may be thought of as a pair \((n_+,n_-)\).

Thus, Riemannian metrics are the special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n,0)\). Meanwhile, Lorentzian metrics are another special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n-1,1)\) (this is the relativists’ mostly-plus convention). Typically, spacetime \(X\) has dimension \(n=4\), so e.g. the Minkowski metric is said to have signature \((3,1):=(-,+,+,+)\).

Problem: Let \((X,g)\) be a pseudo-Riemannian manifold. Often, one writes the general expression for an infinitesimal line element on \(X\) as \(ds^2=g_{\mu\nu}dx^{\mu}dx^{\nu}\); explain the shorthand being used here.

Solution: This is nothing more than choosing some chart \(x^{\mu}\) on \(X\) and expanding the metric tensor field \(g\) in the coordinate basis \(\{dx^{\mu}\otimes dx^{\nu}\}\) of type \((0,2)\) tensor fields:

\[g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\]

for some real, symmetric scalar fields \(g_{\mu\nu}:X\to\mathbf R\). One then simply writes \(ds^2:=g\) and omits the tensor product \(\otimes\).

Problem: Let \(x(t)\in X\) be a curve on a Riemannian manifold \((X,g)\). By choosing a chart \(x^{\mu}\) to cover a suitable region of \(X\) spanned by the trajectory \(x(t)\), explain how to compute the length \(s\) of the curve \(x(t)\) using the Riemannian metric \(g\).

Solution: Heuristically, it is:

\[s=\int ds=\int\sqrt{g_{\mu\nu}dx^{\mu}dx^{\nu}}=\int dt\sqrt{g_{\mu\nu}(x(t))\dot{x}^{\mu}\dot{x}^{\nu}}\]
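This heuristic is straightforward to implement; as a sketch (not in the original post, using the polar metric \(ds^2=d\rho^2+\rho^2d\phi^2\) on \(\mathbf R^2\) for concreteness), the circumference of a circle of radius \(\rho=2\) comes out to \(2\pi\rho=4\pi\):

```python
import math

def curve_length(g, curve, dcurve, t0, t1, n=1000):
    """s = integral dt sqrt(g_{mu nu}(x(t)) x'^mu x'^nu) by the midpoint rule;
    g(x) returns the metric components as a matrix at the point x."""
    h = (t1 - t0) / n
    s = 0.0
    for i in range(n):
        t = t0 + (i + 0.5) * h
        x, dx = curve(t), dcurve(t)
        s += h * math.sqrt(sum(g(x)[mu][nu] * dx[mu] * dx[nu]
                               for mu in range(len(dx)) for nu in range(len(dx))))
    return s

# Polar metric ds^2 = d(rho)^2 + rho^2 d(phi)^2 on R^2: the circle rho = 2,
# phi in [0, 2 pi) should have length 2 pi rho = 4 pi.
g_polar = lambda x: [[1.0, 0.0], [0.0, x[0] ** 2]]
length = curve_length(g_polar, lambda t: (2.0, t), lambda t: (0.0, 1.0), 0.0, 2 * math.pi)
```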

Problem: Write down the action \(S\) for a non-relativistic particle of mass \(m\) moving on a Riemannian manifold \((X,g)\). Hence, derive the geodesic equation. What happens if the particle is relativistic?

Solution: A nonrelativistic free particle of mass \(m\) has action \(S=\int dt L\) described by the usual nonrelativistic kinetic Lagrangian:

\[L=\frac{m}{2}|\dot{\mathbf x}|^2\]

The catch is that \(|\dot{\mathbf x}|^2=g_{ij}(\mathbf x)\dot x^{i}\dot x^{j}\) depends on the Riemannian metric \(g\) in which the particle moves. Applying the Euler-Lagrange equations, one finds the equation of motion (called the geodesic equation):

\[\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\]

where the Christoffel symbols \(\Gamma^i_{jk}(\mathbf x):=\frac{1}{2}g^{i\ell}(\mathbf x)\left(\frac{\partial g_{\ell j}}{\partial x^k}+\frac{\partial g_{\ell k}}{\partial x^j}-\frac{\partial g_{jk}}{\partial x^{\ell}}\right)\) are analogous to “fictitious forces” on the particle due to the curvature of \(X\).

By contrast, a relativistic free particle has action \(S=-mcs=-mc^2\tau\). This can be recast into non-relativistic form \(S=\int dt L\) using the relativistic kinetic Lagrangian:

\[L=-mc\sqrt{g_{\mu\nu}\dot x^{\mu}\dot x^{\nu}}\]

which in flat space \(g_{\mu\nu}=\text{diag}(1,-1,-1,-1)\) reduces to the familiar \(L=-mc^2/\gamma\) (nb. more generally, the quantity under the square root is always \(\geq 0\) for the timelike worldlines of particles). Some comments:

  • The stationary action principle \(\delta S=0\), in which the action \(S=-mcs\) tends to be minimized, corresponds to the well-known fact that relativistic free particles travelling along straight-line geodesics through spacetime maximize the Minkowski distance \(s\); equivalently, Alice experiences more proper time (i.e. ages more) than Bob in the twin paradox.
  • The relativistic action (unlike its non-relativistic counterpart) is manifestly reparameterization invariant.

Now the Euler-Lagrange geodesic equations for the relativistic Lagrangian are almost identical to the non-relativistic case except there’s an additional term:

\[\ddot x^{\mu}+\Gamma^{\mu}_{\nu\rho}(x)\dot x^{\nu}\dot x^{\rho}=\frac{\dot L}{L}\dot x^{\mu}\]

The additional term vanishes \(\dot L=0\) iff the worldline is parameterized by an affine transformation \(\tau’=a\tau+b\) of the particle’s proper time \(\tau\) (prove this by showing \(\dot{\tau}’=-aL/mc^2\)). Only in this case does the relativistic geodesic equation reduce to the non-relativistic form:

\[\frac{d^2 x^{\mu}}{d\tau’^2}+\Gamma^{\mu}_{\nu\rho}(x(\tau’))\frac{dx^{\nu}}{d\tau’}\frac{dx^{\rho}}{d\tau’}=0\]

Problem: For \(X=\mathbf R^2\), calculate all non-vanishing Christoffel symbols in the polar coordinate chart \((\rho,\phi)\).

Solution: In the polar chart, the flat metric is \(ds^2=d\rho^2+\rho^2d\phi^2\), so a free particle has Lagrangian \(L=\frac{m}{2}(\dot\rho^2+\rho^2\dot\phi^2)\). The Euler-Lagrange equations are:

\[\ddot\rho-\rho\dot\phi^2=0\]

\[\ddot\phi+\frac{2}{\rho}\dot\rho\dot\phi=0\]

Comparing with the geodesic equation \(\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\), the only non-vanishing Christoffel symbols are \(\Gamma^{\rho}_{\phi\phi}=-\rho\) and \(\Gamma^{\phi}_{\rho\phi}=\Gamma^{\phi}_{\phi\rho}=1/\rho\).

Note it is usually more efficient to compute Christoffel symbols this way (reading them off the Euler-Lagrange equations of the kinetic Lagrangian) rather than using the explicit formula in terms of partial derivatives of the metric \(g\) (though they’re of course equivalent).
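As a numerical cross-check (a Python sketch, not in the original post): a straight line in the plane is a geodesic of flat \(\mathbf R^2\), so in polar coordinates it should satisfy \(\ddot\rho-\rho\dot\phi^2=0\) and \(\ddot\phi+\frac{2}{\rho}\dot\rho\dot\phi=0\), i.e. \(\Gamma^{\rho}_{\phi\phi}=-\rho\) and \(\Gamma^{\phi}_{\rho\phi}=\Gamma^{\phi}_{\phi\rho}=1/\rho\):

```python
import math

# A straight line in the plane is a geodesic of flat R^2, so in polar coordinates it
# should satisfy rho'' - rho phi'^2 = 0 and phi'' + (2/rho) rho' phi' = 0.
# Check both residuals along the line x(t) = (1 + t, 2t) by central finite differences.
def polar(t):
    x, y = 1 + t, 2 * t
    return math.hypot(x, y), math.atan2(y, x)

h, t = 1e-4, 0.3
(r0, p0), (r1, p1), (r2, p2) = polar(t - h), polar(t), polar(t + h)
dr, dp = (r2 - r0) / (2 * h), (p2 - p0) / (2 * h)
ddr, ddp = (r2 - 2 * r1 + r0) / h**2, (p2 - 2 * p1 + p0) / h**2
residual_rho = ddr - r1 * dp**2
residual_phi = ddp + (2 / r1) * dr * dp
```

Both residuals vanish up to the finite-difference truncation and round-off error.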

Problem: Explain how the presence of a pseudo-Riemannian metric \(g\) on a smooth manifold \(X\) allows one to identify covectors in \(T_x^*(X)\) with tangent vectors in \(T_x(X)\) at each point \(x\in X\).

Solution: Just as in \(X=\mathbf R^n\) one identifies a tangent vector \(\mathbf v\) with its covector \(\mathbf v\cdot\) via the inner product \(\cdot\), so at a given \(x\in X\), one uses the pseudo-inner product \(g_x\) on \(T_x(X)\) to make the exact same identification \(v_x\leftrightarrow g_x(v_x,\cdot)\) (this is sometimes called the musical isomorphism induced by \(g\) in light of the map \(\flat_x:T_x(X)\to T^*_x(X)\) defined by \(v_x^{\flat_x}(v’_x):=g_x(v_x,v’_x)\) and its inverse \(\sharp_x=\flat_x^{-1}\)).

In some chart \(x^{\mu}\), one can write the \(1\)-form \(v^{\flat}=(v^{\flat})_{\mu}dx^{\mu}\) with components \((v^{\flat})_{\mu}=v^{\flat}(\partial_{\mu})=g(v,\partial_{\mu})=g_{\mu\nu}v^{\nu}\) where \(v^{\nu}=v(x^{\nu})\) are the components of its musically isomorphic vector field \(v=v^{\nu}\partial_{\nu}\) in the same coordinate basis. Physicists typically abuse notation by merely writing \((v^{\flat})_{\mu}=g_{\mu\nu}v^{\nu}\) as just \(v_{\mu}=g_{\mu\nu}v^{\nu}\), identifying \(v^{\flat}\equiv v\) under the musical isomorphism. But then this lazy notation makes it look as if the metric \(g\) performs a trivial mechanical action of just “lowering the index” on \(v\).

Problem: Define a tensor field \(g^*\) that performs the trivial mechanical action of “raising the index” on a \(1\)-form \(A\).

Solution: Define \(g^*_x:T^*_x(X)^2\to\mathbf R\) to be a type \((2,0)\) tensor given by:

\[g^*(A,A’):=g(A^{\sharp}, A’^{\sharp})\]

Then on the one hand, in some chart \(x^{\mu}\), one has:

\[g^*(v^{\flat},v’^{\flat})=g(v,v’)=g_{\mu\nu}v^{\mu}v’^{\nu}\]

On the other hand, in that same chart \(x^{\mu}\), one has:

\[g^*(v^{\flat},v’^{\flat})=(g^*)^{\mu\nu}(v^{\flat})_{\mu}(v’^{\flat})_{\nu}=(g^*)^{\mu\nu}g_{\mu\rho}v^{\rho}g_{\nu\sigma}v’^{\sigma}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}v^{\mu}v’^{\nu}\]

This implies \(g_{\mu\nu}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}\). Since \(g\) is non-degenerate (the \(n_0=0\) axiom of a pseudo-Riemannian metric), the matrix \(g_{\mu\nu}\) is invertible with \(\det g_{\mu\nu}\neq 0\), so the only possibility is \((g^*)^{\mu\nu}g_{\nu\rho}=\delta^{\mu}_{\rho}\), i.e. \(g^*\sim g^{-1}\). It is common practice to just drop the \(*\) and write \(g^{\mu\nu}\) in lieu of \((g^*)^{\mu\nu}\). With this lazy notation, \(g^{\mu\nu}\) looks like it just “raises the index” on \(A^{\mu}=g^{\mu\nu}A_{\nu}\) (where one last abuse of notation \((A^{\sharp})^{\mu}=A^{\mu}\) has been made!). cf. the raising/lowering operators \(a^{\dagger},a\) in quantum mechanics.
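A toy illustration of this index gymnastics (a Python sketch using the mostly-minus Minkowski metric \(\text{diag}(1,-1,-1,-1)\), chosen purely for concreteness): lowering an index with \(g_{\mu\nu}\) and raising it back with \((g^*)^{\mu\nu}=(g^{-1})^{\mu\nu}\) recovers the original components:

```python
# Minkowski metric g = diag(1, -1, -1, -1) (mostly-minus, chosen for concreteness):
# lowering an index with g_{mu nu} and raising it back with (g*)^{mu nu} = (g^{-1})^{mu nu}
# must recover the original components.
g = [1.0, -1.0, -1.0, -1.0]            # diagonal components g_{mu mu}
g_inv = [1.0 / e for e in g]           # diagonal metric => inverse is entrywise reciprocal

v = [2.0, 3.0, -1.0, 0.5]                               # v^mu
v_lower = [g[mu] * v[mu] for mu in range(4)]            # v_mu = g_{mu nu} v^nu
v_back = [g_inv[mu] * v_lower[mu] for mu in range(4)]   # (g*)^{mu nu} v_nu
```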

Problem: What is the canonical volume form on a smooth, orientable pseudo-Riemannian \(n\)-manifold \(X\)? Why is it canonical?

Solution: In a chart \(x^{\mu}\) in which the pseudo-Riemannian metric has components \(g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\), the canonical volume form is \(\sqrt{|\det g|}dx^1\wedge…\wedge dx^n\). This is indeed a volume form because it is a top form whose coefficient is nowhere vanishing (\(\det g\neq 0\) for the same reasons as before). Moreover, it is canonical because it does not depend on the choice of chart \(x^{\mu}\) (provided it has the same orientation), as \(\sqrt{|\det g|}\) is a scalar density of weight \(1\) whereas \(dx^1\wedge…\wedge dx^n\) is a scalar density of weight \(-1\).

Problem: Let \(X\) be a smooth, \(n\)-dimensional, orientable pseudo-Riemannian manifold, and let \(\omega\in\Omega^k(X)\) be a differential \(k\)-form on \(X\). Define the Hodge dual \(n-k\)-form \(\star\omega\in\Omega^{n-k}(X)\) on \(X\).

Solution: The Hodge dual \(\star\omega\in\Omega^{n-k}(X)\) is defined to be the unique differential \(n-k\)-form such that for every \(\eta\in\Omega^k(X)\):

\[\eta\wedge\star\omega=g^*(\eta,\omega)\sqrt{|\det g|}dx^1\wedge…\wedge dx^n\]

where \(g^*(\eta,\omega)\) denotes the pseudo-inner product on \(k\)-forms induced by the metric (for decomposable \(k\)-forms, the determinant of the pairwise \(g^*\)-pairings of their \(1\)-form factors).


Reinforcement Learning (Part \(1\))

Problem: How does the paradigm of reinforcement learning (RL) fit within the broader context of machine learning?

Solution: It is instructive to compare/contrast reinforcement learning with supervised learning (SL). In this way, it will be seen that RL can in fact be viewed as a generalization of SL:

  • In SL, one works with training feature vectors \(\mathbf x_i\); in RL, these manifest as the state vector \(\mathbf x_t\) of an agent (similar to how it’s okay to conflate the position vector \(\mathbf x(t)\) of a particle in classical mechanics with the particle itself, there’s no harm in conflating the state vector \(\mathbf x_t\) with the agent itself, see map-territory relation). The difference is purely a perspective shift reflected in the choice of subscript variable \(i=1,2,3,…\) vs. \(t=0,1,2,…\).
  • In SL, the feature vectors \(\mathbf x_i\) are i.i.d. draws from the same underlying data-generating distribution. In RL, the initial state vector \(\mathbf x_{t=0}\) is drawn from an initial distribution, and subsequent state vectors \(\mathbf x_t\) are dependent on the previous state \(\mathbf x_{t-1}\) (discrete-time Markov chain).
  • In SL, the model output \(\hat y(\mathbf x|\boldsymbol{\theta})\) is analogous to an action \(\mathbf f\) outputted by an agent’s policy function \(\pi(\mathbf f|\mathbf x,t)\).
  • In SL, each training feature vector \(\mathbf x_i\) is labelled by its correct output \(y_i\). This concept has no analog in reinforcement learning as there is no notion of a “correct” action for the agent to take.
  • In SL, the feedback mechanism is provided by a loss function \(L(\hat y,y)\). By contrast, in RL the feedback mechanism is provided by a reward signal \(r_t\).
  • In SL, the objective is to find optimal parameters \(\boldsymbol{\theta}^*=\text{argmin}_{\boldsymbol{\theta}}\sum_{i=1}^{N_{\text{train}}}L(\hat y(\mathbf x_i|\boldsymbol{\theta}),y_i)\) minimizing the training cost over the \(N_{\text{train}}\) training examples. In RL, the objective is to find an optimal agent policy \(\pi^*=\text{argmax}_{\pi}\langle\sum_{t=0}^T\gamma^tr_t|\pi\rangle\) maximizing the expected return over an episode of horizon \(T\leq\infty\) (which depends on if/when the agent reaches a terminal state \(\mathbf x_T\)).
  • In SL, there is often a distinction between training, cross-validation, and testing datasets, with a golden rule often being that the model should obviously not be able to see any of the testing data. In RL, it is societally acceptable to train on your test set 🙂

Problem: Imagine the set of all states \(\mathbf x\), i.e. the agent’s state space. On this state space, one can impose a scalar field \(v_{\pi}(\mathbf x,t)\); what is the meaning of this field? Give a simple example thereof.

Solution: The scalar field \(v_{\pi}(\mathbf x,t)\) can roughly be thought of as describing how “valuable” the state \(\mathbf x\) is at time \(t\). More precisely, if one were to initialize an agent \(\pi\) at state \(\mathbf x_t:=\mathbf x\), then \(v_{\pi}(\mathbf x,t)\) is the expected return:

\[v_{\pi}(\mathbf x,t):=\biggr\langle\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\biggr|\mathbf x,\pi\biggr\rangle\]

Consider a \(\pi\)-creature (of \(3\)Blue\(1\)Brown fame) symbolizing an agent with policy \(\pi\). This \(\pi\)-creature has to complete a maze. Within the RL framework, this can be modelled as a Markov decision process (MDP) in which each square is one possible state \(\mathbf x\) of the agent, the maze structure defines the allowed actions \(\mathbf f\) from each state \(\mathbf x\), and one can work with an undiscounted \(\gamma=1\) return in which the reward is \(r_t=-1\) for each action taken. Thus, the optimal policy \(\pi^*\) is a time-independent, deterministic policy \(\mathbf f=\pi^*(\mathbf x)\) that gets the \(\pi\)-creature from its initial state to the terminal state in the fewest number of moves. With that in mind, one can label on top of each square \(\mathbf x\) its corresponding optimal value \(v_{\pi^*}(\mathbf x)\).
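Such optimal maze values can be computed with a short value-iteration sketch (Python; the \(3\times 3\) layout below is made up for illustration, not the blog’s maze):

```python
import itertools

# Made-up 3x3 gridworld maze: one wall, terminal goal cell, reward r = -1 per move,
# gamma = 1. Value iteration on v(x) = max_f (-1 + v(x')) converges to
# v*(x) = -(fewest moves from x to the goal).
walls = {(1, 1)}
goal = (2, 2)
cells = [c for c in itertools.product(range(3), repeat=2) if c not in walls]

def neighbors(c):
    i, j = c
    return [n for n in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)) if n in cells]

v = {c: 0.0 for c in cells}
for _ in range(50):                      # plenty of sweeps for a 3x3 grid
    for c in cells:
        if c != goal:
            v[c] = max(-1.0 + v[n] for n in neighbors(c))
```

For example, the corner \((0,0)\) is \(4\) moves from the goal \((2,2)\) around the central wall, so its optimal value converges to \(-4\).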

Problem: The value function \(v_{\pi}(\mathbf x,t)\) is passive; it simply assesses how good/bad it is to be at \(\mathbf x_t:=\mathbf x\). In order to remedy this, one can look at the quality function \(q_{\pi}(\mathbf x,\mathbf f,t)\); explain how this takes on a more active role compared to the passive nature of the value function \(v_{\pi}(\mathbf x,t)\).

Solution: Because \(q_{\pi}(\mathbf x,\mathbf f,t)\) assesses the quality of the agent choosing the action \(\mathbf f_t:=\mathbf f\) while in state \(\mathbf x_t:=\mathbf x\). That is, it is a more refined conditional expected return:

\[q_{\pi}(\mathbf x,\mathbf f,t)=\biggr\langle\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\biggr|\mathbf x,\mathbf f,\pi\biggr\rangle\]

Problem: How come the agent’s policy, value function and quality function are sometimes respectively written as \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) without the \(t\) argument?

Solution: There could be \(3\) reasons:

  • Similar to the above maze example, sometimes it just doesn’t depend on \(t\).
  • If one is seeking to maximize the agent’s expected return over an infinite horizon \(T=\infty\); in such a case, it is both intuitively and mathematically clear that \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) should all be invariant with respect to time translations.
  • If the horizon is finite \(T<\infty\), then it is standard to package \(\mathbf x\) and \(t\) together, thereby working with an “augmented” state vector \(\mathbf x’:=(\mathbf x,t)\). Then everything can be made to look identical to the infinite horizon \(T=\infty\) case by replacing \(\mathbf x\mapsto\mathbf x’\).

(cf. distinction between state variables and path variables in thermodynamics).

Problem: State and prove the law of iterated expectation.

Solution: If \(X,Y\) are any random variables, then:

\[\langle X\rangle=\langle\langle X|Y\rangle\rangle\]

The proof is easy as long as one is careful to interpret all the expectations correctly. For instance, \(\langle X|Y\rangle\) is not a scalar but a random variable with respect to \(Y\):

\[\langle X|Y\rangle=\int dx p(x|Y) x\]

Thus, it is clear that the outer expectation must also be with respect to \(Y\) alone:

\[\langle\langle X|Y\rangle\rangle=\int dy p(y)\langle X|y\rangle=\int dxdy p(y)p(x|y)x\]

Rewriting in terms of the joint distribution \(p(x,y)=p(y)p(x|y)\) and integrating out \(\int dy p(x,y)=p(x)\), one finally obtains:

\[=\int dx p(x) x=\langle X\rangle\]

Of course, this also generalizes easily to identities such as the following equality of random variables (w.r.t. \(Z\)):

\[\langle X|Z\rangle=\langle\langle X|Y,Z\rangle|Z\rangle\]
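The law of iterated expectation is easy to confirm by brute-force enumeration (a Python sketch with an arbitrary made-up \(2\times 2\) joint pmf):

```python
# Verify <X> = <<X|Y>> exactly on a small discrete joint pmf p(x, y) (made-up numbers).
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

E_X = sum(prob * x for (x, y), prob in p.items())

p_Y = {y: sum(prob for (_, y2), prob in p.items() if y2 == y) for y in (0, 1)}
E_X_given_Y = {y: sum(prob * x for (x, y2), prob in p.items() if y2 == y) / p_Y[y]
               for y in (0, 1)}
E_E = sum(p_Y[y] * E_X_given_Y[y] for y in (0, 1))   # outer expectation over Y
```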

Problem: Show that the value function \(v_{\pi}(\mathbf x)\) satisfies the following identity (to be used as a lemma later):

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

Solution: Denoting the return random variable by:

\[R_t:=\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\]

The fundamental observation is that \(R_t\) obeys a simple recurrence relation:

\[R_t=r_{t+1}+\gamma R_{t+1}\]

Taking suitable conditional expectations thereof:

\[\langle R_t|\mathbf x_t,\pi\rangle=\langle r_{t+1}|\mathbf x_t,\pi\rangle+\gamma\langle R_{t+1}|\mathbf x_t,\pi\rangle\]

The LHS is the definition of \(v_{\pi}(\mathbf x_t)\). Applying the law of iterated expectation to the \(2^{\text{nd}}\) term on the RHS:

\[\langle R_{t+1}|\mathbf x_t,\pi\rangle=\langle\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle|\mathbf x_t,\pi\rangle\]

The Markov property ensures that \(\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle=\langle R_{t+1}|\mathbf x_{t+1},\pi\rangle=v_{\pi}(\mathbf x_{t+1})\). One thus obtains the desired result:

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

Problem: How are \(v_{\pi}(\mathbf x_t)\) and \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) related to each other?

Solution: Apply the law of iterated expectations:

\[v_{\pi}(\mathbf x_t):=\langle R_t|\mathbf x_t,\pi\rangle=\langle\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle|\mathbf x_t,\pi\rangle=\langle q_{\pi}(\mathbf x_t,\mathbf f_t)|\mathbf x_t,\pi\rangle\]

Since \(\mathbf x_t\) is being conditioned upon, the expectation must be over \(\mathbf f_t\) so one can explicitly write it as:

\[=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)q_{\pi}(\mathbf x_t,\mathbf f_t)\]

But trivially \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\). Thus, one has an expression for \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). Now one would like to proceed the other way, expressing \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) in terms of \(v_{\pi}(\mathbf x_{t+1})\). This can be achieved by fleshing out the expectation from the earlier lemma:

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

\[=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

Further expand \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)\) and recognize again \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\) and \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)=p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) because the action \(\mathbf f_t\) is already fixed. Moving the summation \(\sum_{\mathbf f_t}\) to the outside so as to compare with the earlier identity expressing \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), one can pattern-match:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

(which in hindsight is manifest…)

Problem: Hence, deduce the Bellman equations for the value and quality functions.

Solution: To obtain the Bellman equation for \(v_{\pi}(\mathbf x_t)\), substitute the above expression for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) into the expression for \(v_{\pi}(\mathbf x_t)\):

\[v_{\pi}(\mathbf x_t)=\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

Similarly, to get the Bellman equation for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), substitute vice versa:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\left(r_{t+1}+\gamma\sum_{\mathbf f_{t+1}}p(\mathbf f_{t+1}|\mathbf x_{t+1},\pi)q_{\pi}(\mathbf x_{t+1},\mathbf f_{t+1})\right)\]

Intuitively, the Bellman equations relate the value of every state with the values of states that can be transitioned into, thereby providing a system of simultaneous equations that in principle can be solved to deduce all state values. A key caveat for this dynamic programming approach to work is that one must have complete knowledge of the environment dynamics \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) (similar remarks apply to the qualities of different state-action pairs).
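To make the “system of simultaneous equations” point concrete, here is a minimal sketch on an assumed toy MDP (a \(3\)-state random walk with one terminal state, reward \(-1\) per step, \(\gamma=1\), uniformly random policy); with complete knowledge of the dynamics, the Bellman equations for \(v_{\pi}\) are just a linear system:

```python
import numpy as np

# Toy MDP (assumed for illustration): states 0, 1 are non-terminal, state 2 is terminal.
# Under a uniformly random policy, "left" from state 0 bounces back to 0,
# "right" moves one state up; every step from a non-terminal state yields reward -1.
gamma = 1.0
P = np.array([[0.5, 0.5],   # from state 0: 0.5 -> state 0, 0.5 -> state 1
              [0.5, 0.0]])  # from state 1: 0.5 -> state 0, 0.5 -> terminal (value 0)
r = np.array([-1.0, -1.0])  # expected one-step reward in each non-terminal state

# Bellman equation v = r + gamma * P v, i.e. (I - gamma * P) v = r
v = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v)  # -> [-6. -4.]
```

The terminal state’s value is fixed at \(0\), which is why its column can simply be dropped from the transition matrix.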

Problem: The above Bellman equations apply to a generic agent policy \(\pi\); how do they specialize to the case of an optimal policy \(\pi^*\)?

Solution: Reverting back to the form of the “pre-Bellman” equations and using the key insight that the optimal policy \(\pi^*(\mathbf f_t|\mathbf x_t)=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)}\) should only assign non-zero probability to actions of the best quality, one has:

\[v_{\pi^*}(\mathbf x_t)=\text{max}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)\]

\[q_{\pi^*}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi^*}(\mathbf x_{t+1}))\]

which, upon substituting one into the other, yield the Bellman optimality equations.

Problem: In light of Bellman optimality, discuss how policy evaluation and policy improvement work. Hence, explain the notion of generalized policy iteration (GPI).

Solution: Imagine the space of all \((\pi,v)\) pairs, where \(\pi(\mathbf f_t|\mathbf x_t)\) is any policy distribution and \(v(\mathbf x_t)\) is the value function of some policy. Within this space, there are \(2\) canonical subspaces. One is the set \((\pi,v_{\pi})\) of all pairs where \(v_{\pi}\) is specifically the value function associated to policy \(\pi\), and the other subspace \((\pi_v,v)\) where \(\pi_v(\mathbf f_t|\mathbf x_t):=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\) is greedy with respect to the value function \(v\), where \(q(\mathbf x_t,\mathbf f_t)=\langle r_{t+1}+\gamma v(\mathbf x_{t+1})|\mathbf x_t,\mathbf f_t\rangle\) (in particular, notice \(q\neq q_{\pi}\); the policy \(\pi\) does not appear in the definition of \(q\)).

  • Policy evaluation is an algorithm for projecting \((v,\pi)\mapsto (v_{\pi},\pi)\). The idea is to view \(v\) as a random initial guess for the underlying true policy value function \(v_{\pi}\) (though any terminal states \(\mathbf x_T\) should be initialized \(v(\mathbf x_T):=0\)). Then, sweeping through each state \(\mathbf x_t\), one updates the value of \(v(\mathbf x_t)\) using the Bellman equation involving the (known) randomly initialized values of the other states:

\[v(\mathbf x_t)\mapsto\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{\mathbf x_{t+1},r_{t+1}}p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v(\mathbf x_{t+1}))\]

With sufficiently many sweeps over the state space, general theorems guarantee convergence \(v\to v_{\pi}\).

  • Policy improvement is an algorithm for projecting onto the other canonical subspace \((v,\pi)\mapsto (v,\pi_v)\). Essentially, just act greedily:
    \[\pi(\mathbf f_t|\mathbf x_t)\mapsto\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi}(\mathbf x_t,\mathbf f_t)}\]
    This is rigorously justified by the policy improvement theorem.

Policy iteration roughly amounts to alternating between the \(2\) steps of policy evaluation and policy improvement. However, there is some leeway in how one does this: for instance, one need not run the policy evaluation step to convergence, but can instead perform just a single sweep over the state space before each improvement step (this special case is called value iteration; more broadly, any such interleaving of evaluation and improvement is called generalized policy iteration). GPI can be shown to still funnel/converge onto the optimal policy \(\pi^*\), essentially because \(\pi\) is greedy with respect to its own value function \(v_{\pi}\) iff \(\pi=\pi^*\).
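As a concrete sketch of value iteration, consider an assumed toy chain MDP (\(5\) states, deterministic left/right moves, reward \(-1\) per step, state \(4\) terminal, \(\gamma=1\)), sweeping the state space with the Bellman optimality update until convergence:

```python
import numpy as np

n_states, gamma = 5, 1.0   # states 0..4; state 4 is terminal (value fixed at 0)
v = np.zeros(n_states)

for sweep in range(1000):
    delta = 0.0
    for s in range(n_states - 1):               # sweep the non-terminal states
        # deterministic moves, clipped at the boundaries; reward is -1 per step
        q_values = [-1.0 + gamma * v[np.clip(s + a, 0, n_states - 1)]
                    for a in (-1, +1)]
        new_v = max(q_values)                   # Bellman optimality update
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v                            # in-place ("asynchronous") update
    if delta < 1e-12:                           # converged
        break

print(v)  # optimal values: -(number of steps to reach the terminal state)
```

The optimal value of each state is minus its distance to the terminal state, i.e. \(v=(-4,-3,-2,-1,0)\).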

Problem: Distinguish between model-free vs. model-based methods/algorithms in reinforcement learning.

Solution: In both cases, the fundamental limitation (which previously was taken for granted) is incomplete knowledge of the environment dynamics \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\); instead, one can only sample trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=0},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=1},…\) when running a policy \(\pi\) through the MDP. Recalling that the fundamental objective of RL is to compute \(\pi^*=\text{argmax}_{\pi}\langle R_t|\pi\rangle\), a model-free method for doing so is one which does not attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (here \(p\) is what the word “model” refers to, i.e. a model of the environment dynamics), whereas model-based methods do attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) and thereby obtain \(\pi^*\) through GPI as outlined earlier (see the OpenAI Spinning Up documentation for a nice diagram of this taxonomy and further comments).

Problem: Explain how Monte Carlo evaluation and Monte Carlo control are both model-free methods that accomplish the same actions as their respective model-based cousins, i.e. policy evaluation and policy improvement.

Solution: Fix \(\pi\). The key is to recognize that \(v_{\pi}(\mathbf x_t)\) is an expectation of the return random variable starting at \(\mathbf x_t\), and Monte Carlo point estimators can always be applied to approximate expectation values:

\[v_{\pi}(\mathbf x_t)=\langle R_t|\mathbf x_t,\pi\rangle\approx\frac{1}{N_{\mathbf x_t}}\sum_{n=1}^{N_{\mathbf x_t}}R_n\]

where \(N_{\mathbf x_t}\) is the number of times the MDP passes through state \(\mathbf x_t\) in however many trajectories, and \(R_n\) is the return following the \(n^{\text{th}}\) pass through \(\mathbf x_t\) within some MDP trajectory.

Lemma: the arithmetic mean \(\hat{\mu}_n:=\frac{1}{n}\sum_{i=1}^nx_i\) obeys the recurrence relation \(\hat{\mu}_n=\hat{\mu}_{n-1}+\frac{1}{n}(x_n-\hat{\mu}_{n-1})\)

Proof: trivial, but it is instructive to highlight the convex-combination nature of the recurrence \(\hat{\mu}_n=\left(1-\frac{1}{n}\right)\hat{\mu}_{n-1}+\frac{1}{n}x_n\): for large \(n\gg 1\), the majority \(1-1/n\) of the weight is given to the prior value of the mean \(\hat{\mu}_{n-1}\), while only a tiny sliver \(1/n\) is given to the new sample \(x_n\).

Thus, rather than waiting for a bunch of trajectories to finish before applying the Monte Carlo estimate above, one can introduce an intermediate estimator initialized at \(\hat v_{\pi}(\mathbf x_t):=0\) that applies (at the termination of each trajectory) the following iterative rule for each time \(n=1,2,…,N_{\mathbf x_t}\) that \(\mathbf x_t\) was visited:

\[\hat v_{\pi}(\mathbf x_t)\mapsto\hat v_{\pi}(\mathbf x_t)+\alpha_n(R_n-\hat v_{\pi}(\mathbf x_t))\]

where \(\alpha_n:=1/n\). Since this is just a mathematical decomposition of the original Monte Carlo estimate, it is an equally valid model-free method for \(v_{\pi}\) estimation. Now comes a curveball: one common modification is to replace \(\alpha_n\mapsto\alpha\) with an \(n\)-independent constant \(\alpha\in[0,1]\) (this is called constant-\(\alpha\) MC).

If one were presented with the recurrence relation \(\hat{\mu}_n=\hat{\mu}_{n-1}+\frac{1}{n}(x_n-\hat{\mu}_{n-1})\) and initial condition \(\hat{\mu}_0=0\), one could claim the explicit solution to be \(\hat{\mu}_n=\frac{1}{n}\sum_{i=1}^nx_i\). However, if instead one were presented with the recurrence relation \(\hat{\mu}’_n=\hat{\mu}’_{n-1}+\alpha(x_n-\hat{\mu}’_{n-1})\) and same initial condition \(\hat{\mu}’_0=0\), what would be the solution for \(\hat{\mu}’_n\)? It’s easy to check \(\hat{\mu}’_n=\alpha\sum_{i=1}^n(1-\alpha)^{n-i}x_i\); thus, recent samples are given exponentially more weight, and it’s biased towards underestimating \(\langle\hat{\mu}’_n\rangle-\mu=-\mu(1-\alpha)^n\), though is asymptotically unbiased \(\lim_{n\to\infty}\langle\hat{\mu}’_n\rangle=\mu\). However, unlike the asymptotically vanishing variance \(\sigma^2_{\hat{\mu}_n}=\sigma^2/n\) of the standard mean estimator \(\hat{\mu}_n\), the variance \(\sigma^2_{\hat{\mu}’_n}=\frac{\alpha\sigma^2}{2-\alpha}(1-(1-\alpha)^{2n})\to\frac{\alpha\sigma^2}{2-\alpha}\approx\alpha\sigma^2/2\) of \(\hat{\mu}’_n\) is bounded from below, hence the estimate \(\hat{\mu}’_n\) will forever oscillate about \(\mu\). It should make sense \(\sigma^2_{\hat{\mu}’_n}\propto\alpha\), i.e. a larger step size \(\alpha\) means bigger oscillations, though it also means faster convergence of \(\hat{\mu}’_n\to\mu\) because of the \((1-\alpha)^n\).
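A quick numerical check of the two recurrences (the exact \(1/n\) running mean and the constant-\(\alpha\), exponentially weighted variant, verified against its closed-form solution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=500)   # i.i.d. samples with mean mu = 3
alpha = 0.1

mu = 0.0        # running-mean estimator, alpha_n = 1/n
mu_c = 0.0      # constant-alpha estimator
for n, xn in enumerate(x, start=1):
    mu += (xn - mu) / n
    mu_c += alpha * (xn - mu_c)

# the 1/n recurrence reproduces the batch mean exactly (up to float roundoff)
print(abs(mu - x.mean()))
# the constant-alpha recurrence matches its closed-form exponentially weighted sum
closed = alpha * sum((1 - alpha) ** (len(x) - i) * xi
                     for i, xi in enumerate(x, start=1))
print(abs(mu_c - closed))
```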

This all raises the question: why bother with constant-\(\alpha\) MC? One standard answer: in control settings the target being estimated is nonstationary (the policy \(\pi\) keeps being improved), so it pays to weight recent returns more heavily rather than average uniformly over returns generated under stale policies.

Having used constant-\(\alpha\) MC to estimate \(v_{\pi}(\mathbf x_t)\) in a model-free way for some fixed agent policy \(\pi\), one can naturally ask about model-free estimation of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). It turns out to just be:

\[\hat q_{\pi}(\mathbf x_t,\mathbf f_t)\mapsto\hat q_{\pi}(\mathbf x_t,\mathbf f_t)+\alpha(R_n-\hat q_{\pi}(\mathbf x_t,\mathbf f_t))\]

Conceptually, the standard way to convince oneself of this is that because \(\pi\) was fixed, it could be thought of as part of the incomplete knowledge of the environment dynamics \(p\) so that the agent never takes any actions but just lets the environment push its state around. This perspective shift from MDP trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=0},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=1},…\) to MRP trajectories \(\mathbf x_{t=0},r_{t=0},\mathbf x_{t=1},r_{t=1},…\) amounts to viewing the MRP states \(\mathbf x_t\) as corresponding to augmented state-action pairs \((\mathbf x_t,\mathbf f_t)\) in the MDP. But then it is mathematically apparent that \(v_{\pi}(\mathbf x_t)\) in the MRP is just \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) in the MDP, so constant-\(\alpha\) MC works for \(q_{\pi}\)-evaluation! Thus, Monte Carlo control chooses to be greedy with respect to \(q_{\pi}\), and since \(q_{\pi}\) could be estimated in a model-free way, this therefore implements model-free policy improvement.

Problem: Explain the exploration-exploitation tradeoff in reinforcement learning and how a soft policy \(\pi\) can help to address this.

Solution: Greedily exploiting high-value states all the time could lead to the agent settling into a “locally” optimal but not globally optimal policy \(\pi^*\); the only way to know for sure is to explore all state-action sequences, for instance by ensuring that one’s policy is soft: \(\pi(\mathbf f_t|\mathbf x_t)>0\) for all states \(\mathbf x_t\) and actions \(\mathbf f_t\). A crude way to guarantee softness during Monte Carlo control is to use an \(\varepsilon\)-greedy policy (a better name would be either a \((1-\varepsilon)\)-greedy policy or an \(\varepsilon\)-soft policy), though there are other more refined ways to do so.
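A minimal sketch of constructing an \(\varepsilon\)-soft (\(\varepsilon\)-greedy) distribution from estimated qualities (the action space and \(q\)-values below are made up purely for illustration):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Return a policy distribution over actions: the greedy action gets
    probability 1 - epsilon + epsilon/|A|, every other action epsilon/|A|."""
    n = len(q_values)
    pi = np.full(n, epsilon / n)          # exploration floor: every action > 0
    pi[np.argmax(q_values)] += 1 - epsilon
    return pi

pi = epsilon_greedy(np.array([0.2, 1.5, -0.3, 0.9]), epsilon=0.1)
print(pi)  # soft: all entries strictly positive, summing to 1
```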

Problem: Having distinguished between model-based and model-free methods in RL, one can make an orthogonal distinction between on-policy and off-policy methods. Explain and motivate this taxonomy.

Solution: The constant-\(\alpha\) MC \(q_{\pi}\)-evaluation followed by \(\varepsilon\)-greedy MC control is a simple model-free, on-policy RL algorithm for discovering \(\pi^*\). This is because the trajectory samples used to evaluate \(q_{\pi}\) came from the same policy \(\pi\) that is subsequently being improved. However, in practice the trajectory samples may come from some (known) behavior policy \(\pi_b\) even though the goal is to evaluate \(q_{\pi}\) so as to improve another, different (known) target policy \(\pi\neq\pi_b\). This is where off-policy methods come in (cf. the distinction between on-shell vs. off-shell particles in physics). The obvious way to bridge \(\pi_b\) and \(\pi\) is via importance sampling:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle\]

\[\approx\frac{1}{N_{\mathbf x_t}}\sum_{n=1}^{N_{\mathbf x_t}}\frac{\pi(\mathbf f_{t+1}|\mathbf x_{t+1})…\pi(\mathbf f_T|\mathbf x_T)}{\pi_b(\mathbf f_{t+1}|\mathbf x_{t+1})…\pi_b(\mathbf f_T|\mathbf x_T)}R_n\]

where it is essential to emphasize that the trajectories in this sum are sampled under the behavior policy \(\pi_b\)! This is model-free because all the transition probabilities have cancelled out in the importance sampling quotient.

Posted in Blog | Leave a comment

Density Functional Theory

Problem: In one sentence, what is the essence of DFT?

Solution: To replace \(\Psi\mapsto n\), where the number density of a system of \(N\) identical quantum particles (usually electrons) \(n(\mathbf x)\) is:

\[n(\mathbf x):=N\int d^3\mathbf x_2…d^3\mathbf x_N |\Psi(\mathbf x,\mathbf x_2,…,\mathbf x_N)|^2\]

in terms of its \(N\)-body wavefunction \(\Psi(\mathbf x,\mathbf x_2,…,\mathbf x_N)\). Thus, \(\int d^3\mathbf x n(\mathbf x)=N\) is normalized.

Problem: Write down the non-relativistic Hamiltonian \(H\) of a molecule. Attempt to express the expected energy \(E=\langle\Psi|H|\Psi\rangle\) in many-body state \(|\Psi\rangle\in\bigwedge^N\left(L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\right)\) in terms of \(n(\mathbf x)\) defined above; this functional \(E=E[n]\) of the density \(n\) explains the name “density functional theory”.

Solution: (the Born-Oppenheimer approximation is implicitly used whereby the electrons move on an effective/external potential energy surface \(V_{\text{ext}}(\mathbf x)\) defined by a collection of stationary nuclei)

\[H=\sum_{i=1}^N\left(\frac{|\mathbf P_i|^2}{2m}+V_{\text{ext}}(\mathbf X_i)\right)+\frac{1}{2}\sum_{1\leq i\neq j\leq N}\frac{\alpha\hbar c}{|\mathbf X_i-\mathbf X_j|}\]

One can check that:

\[E=\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x_1…d^3\mathbf x_N\biggr|\frac{\partial\Psi}{\partial\mathbf x_i}\biggr|^2+\int d^3\mathbf x n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n_2(\mathbf x,\mathbf x’)}{|\mathbf x-\mathbf x’|}\]

where \(n_2(\mathbf x,\mathbf x’)=N(N-1)\int d^3\mathbf x_3…d^3\mathbf x_N|\Psi(\mathbf x,\mathbf x’,\mathbf x_3,…,\mathbf x_N)|^2\) so in particular \((N-1)n(\mathbf x)=\int d^3\mathbf x’ n_2(\mathbf x,\mathbf x’)\).

Problem: Write down the Thomas-Fermi approximation for \(E[n]\), and explain why it sucks. Also derive the von Weizsäcker correction to the Thomas-Fermi energy functional.

Solution: The Thomas-Fermi functional treats the electrons as a locally ideal Fermi gas, thus having the usual Pauli pressure \(p\propto n^{5/3}\) in \(\mathbf R^3\) and thus (kinetic) energy density \(3p/2\) scaling likewise. Furthermore, the exchange-correlation functional is taken to vanish \(E_{\text{xc}}[n]=0\) (thus, Thomas-Fermi is often said to be semiclassical):

\[E_{\text{TF}}[n]=\frac{3\hbar^2(3\pi^2)^{2/3}}{10m}\int d^3\mathbf x n^{5/3}(\mathbf x)+\int d^3\mathbf x n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n(\mathbf x)n(\mathbf x’)}{|\mathbf x-\mathbf x’|}\]

One has:

\[\frac{\delta}{\delta n(\mathbf x)}\left(E_{\text{TF}}-\mu\left(\int d^3\mathbf xn(\mathbf x)-N\right)\right)=\frac{\hbar^2k^2_F(\mathbf x)}{2m}+V_{\text{eff}}(\mathbf x)-\mu=0\]

where \(k_F(\mathbf x):=(3\pi^2n(\mathbf x))^{1/3}\) and \(V_{\text{eff}}(\mathbf x)=V_{\text{ext}}(\mathbf x)+\alpha\hbar c\int d^3\mathbf x’\frac{n(\mathbf x’)}{|\mathbf x-\mathbf x’|}\).

(aside: although the Thomas-Fermi model is meant to be applied to molecules, a result of Teller showed that \(E_{\text{TF molecule}}>2E_{\text{TF atom}}\), so Thomas-Fermi theory predicts the molecular state is unstable with respect to dissociation.) In a central external potential such as that due to a single nucleus \(V_{\text{ext}}(r)=-Z\alpha\hbar c/r\), the effective potential \(V_{\text{eff}}(r)\) and electron density \(n(r)\) are both isotropic as well. Starting from Poisson’s equation:

\[\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\frac{V_{\text{eff}}(r)}{-e}=-\frac{-en(r)}{\varepsilon_0}\]

with \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2f(r)=\frac{1}{r^2}\frac{d}{dr}\left(r^2\frac{df}{dr}\right)=\frac{1}{r}\frac{d^2}{dr^2}(rf(r))\) (use the latter) one can check that by defining \(f(r):=\left(\frac{8}{3\pi a_0}\right)^2rk^2_F(r)\), one recovers the Thomas-Fermi ODE:

\[\frac{d^2 f}{dr^2}=\frac{f^{3/2}(r)}{\sqrt{r}}\]

The von Weizsäcker term is a correction to the kinetic energy functional in the Thomas-Fermi model that accounts for density gradients:

\[T_{\text{vW}}[n]=\frac{\hbar^2}{8m}\int d^3\mathbf x\frac{|\partial n/\partial\mathbf x|^2}{n(\mathbf x)}\]

Heuristically, it can be justified by writing the expected kinetic energy of a single electron \(\frac{\hbar^2}{2m}\int d^3\mathbf x |\partial\psi/\partial\mathbf x|^2\) with wavefunction \(\psi(\mathbf x)\in\mathbf R\) and replacing \(\psi(\mathbf x)\mapsto\sqrt{n(\mathbf x)}\).
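As a sanity check of this heuristic: for a single orbital the von Weizsäcker term is exact, e.g. for the hydrogen ground state (atomic units, \(n(r)=e^{-2r}/\pi\)) it reproduces the exact kinetic energy \(\langle T\rangle=1/2\):

```python
import numpy as np

# hydrogen ground-state density in atomic units (hbar = m = a0 = 1)
r = np.linspace(1e-6, 30.0, 300_000)
n = np.exp(-2 * r) / np.pi
dn = -2 * np.exp(-2 * r) / np.pi          # dn/dr (the density is isotropic)

# T_vW = (1/8) ∫ |dn/dr|^2 / n * 4 pi r^2 dr
integrand = (dn ** 2 / n) * 4 * np.pi * r ** 2 / 8
T_vW = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(r))  # trapezoid rule

print(T_vW)  # ≈ 0.5 Hartree, the exact kinetic energy of the hydrogen ground state
```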

Problem: Prove the two Hohenberg-Kohn theorems.

Solution: First, it is useful to prove:

  • Lemma: If \(V_{\text{ext}}(\mathbf x)\neq V’_{\text{ext}}(\mathbf x)\) represent \(2\) physically distinct external potentials (i.e. they differ by more than just some additive constant), then the corresponding molecular Hamiltonians \(H\neq H’\) do not share any non-zero eigenstates.
  • Proof: Suppose for sake of contradiction that there exists \(|\Psi\rangle\neq 0\) such that
    \[H|\Psi\rangle=E|\Psi\rangle\]
    \[H’|\Psi\rangle=E’|\Psi\rangle\]
    Subtracting, one obtains:
    \[(H’-H)|\Psi\rangle=(E’-E)|\Psi\rangle\]
    Inspecting the form of the molecular Hamiltonian \(H\) given earlier, one sees that the “universal” part (i.e. the kinetic energy and electron-electron Coulomb repulsion terms) cancel, leaving only the non-universal residue \(H’-H=\sum_{i=1}^NV’_{\text{ext}}(\mathbf X_i)-V_{\text{ext}}(\mathbf X_i)\). But the RHS \(E’-E\) is \(\mathbf X_i\)-independent for all \(i=1,…,N\); this means \(V’_{\text{ext}}(\mathbf x)-V_{\text{ext}}(\mathbf x)=\text{const}\) which is a contradiction.

With this in mind, the first HK theorem follows by specializing to the respective (distinct by the above lemma) ground states \(|\Psi_0\rangle\neq|\Psi’_0\rangle\) of \(H\) and \(H’\). Apply the variational bound both ways (strictness holds because, by the lemma, neither ground state can be an eigenstate of the other Hamiltonian):

\[E_0=\langle\Psi_0|H|\Psi_0\rangle<\langle\Psi’_0|H|\Psi’_0\rangle=E’_0+\int d^3\mathbf x\,n’(\mathbf x)\left(V_{\text{ext}}(\mathbf x)-V’_{\text{ext}}(\mathbf x)\right)\]

\[E’_0=\langle\Psi’_0|H’|\Psi’_0\rangle<\langle\Psi_0|H’|\Psi_0\rangle=E_0+\int d^3\mathbf x\,n(\mathbf x)\left(V’_{\text{ext}}(\mathbf x)-V_{\text{ext}}(\mathbf x)\right)\]

If the two ground states shared the same density \(n(\mathbf x)=n’(\mathbf x)\), then adding these inequalities would yield the contradiction \(E_0+E’_0<E_0+E’_0\). Hence distinct external potentials produce distinct ground-state densities, i.e. the ground-state density \(n(\mathbf x)\) uniquely determines \(V_{\text{ext}}(\mathbf x)\) (up to an additive constant); this is the first Hohenberg-Kohn theorem.

Problem: Demonstrate the first Hohenberg-Kohn theorem for \(2\) electrons in a \(1\)-dimensional harmonic trap.

Solution:

Problem: Derive the Kohn-Sham equations.

Solution: Introduce a fictitious system of \(N\) non-interacting electrons constrained to reproduce the same density \(n(\mathbf x)\) as the interacting system. Minimizing the energy functional over the fictitious orbitals \(\psi_i\) then yields single-particle Schrödinger-like equations (the Kohn-Sham equations) \(\left(-\frac{\hbar^2}{2m}\bigl|\frac{\partial}{\partial\mathbf x}\bigr|^2+V_{\text{eff}}(\mathbf x)\right)\psi_i=\varepsilon_i\psi_i\) with \(n(\mathbf x)=\sum_i|\psi_i(\mathbf x)|^2\), where \(V_{\text{eff}}=V_{\text{ext}}+V_{\text{Hartree}}+V_{\text{xc}}\) includes the exchange-correlation potential \(V_{\text{xc}}=\delta E_{\text{xc}}/\delta n\).

Problem:

Posted in Blog | Leave a comment

PyTorch Fundamentals (Part \(2\))

Problem: Do an end-to-end walkthrough of the PyTorch machine learning workflow using the most basic univariate linear regression example. In particular, generate some linear data over a normalized feature space (whose slope \(w\) and intercept \(b\) would in practice be a priori unknown), split that linear data into training and testing subsets (no cross-validation dataset needed for this simple example), define the linear layer class, instantiate a model object of the class, and starting from random values of \(w\) and \(b\), use stochastic gradient descent with learning rate \(\alpha=0.01\) to minimize the training cost function \(C_{\text{train}}(w,b)\) based on an \(L^1\) loss. Iterate SGD for \(300\) epochs, and for every \(20\) epochs, record the current value of \(C_{\text{train}}(w,b)\) and the current value of \(C_{\text{test}}(w,b)\). Plot these cost function curves as a function of the epoch number \(0, 20, 40,…\). Save the final state dictionary of the model’s learned parameters \(w,b\) post-training, and load it back onto a new instance of the model class.

Solution:

pytorch_workflow
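Since the full solution lives in the linked notebook, here is a library-free sketch of the same workflow in plain NumPy (the data-generating slope \(w=0.7\) and intercept \(b=0.3\), the zero initialization, and the \(80/20\) split are illustrative assumptions; full-batch subgradient descent on the \(L^1\) cost stands in for PyTorch’s SGD optimizer and L1Loss):

```python
import numpy as np

# synthetic linear data over a normalized feature space
w_true, b_true = 0.7, 0.3      # assumed "unknown" ground-truth parameters
X = np.linspace(0.0, 1.0, 50)
y = w_true * X + b_true

# train/test split (no shuffling needed for this noiseless toy example)
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

w, b = 0.0, 0.0                # parameters to be learned (zero-initialized here)
lr = 0.01                      # learning rate alpha
history = []                   # (epoch, train cost, test cost) every 20 epochs

for epoch in range(301):
    err = w * X_train + b - y_train
    # subgradient of the L1 cost C = mean|err|
    w -= lr * np.mean(np.sign(err) * X_train)
    b -= lr * np.mean(np.sign(err))
    if epoch % 20 == 0:
        c_train = np.mean(np.abs(w * X_train + b - y_train))
        c_test = np.mean(np.abs(w * X_test + b - y_test))
        history.append((epoch, c_train, c_test))

print(w, b)  # should approach w_true = 0.7 and b_true = 0.3
```

In the PyTorch version, the subgradient step is replaced by `loss.backward()` plus an optimizer step, and the learned `state_dict` of \(w,b\) is what gets saved and reloaded.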
Posted in Blog | Leave a comment

PyTorch Fundamentals (Part \(1\))

Problem: Illustrate some of the basic fundamentals involved in using the PyTorch deep learning library. In particular, discuss the attributes of PyTorch tensors (e.g. dtype, CPU/GPU devices, etc.), how to generate random PyTorch tensors with/without seeding, and operations that can be performed on and between PyTorch tensors.

Solution:

pytorch_fundamentals
Posted in Blog | Leave a comment

Self-Attention in Transformers

Problem: Explain how the transformer architecture works at a mathematical level (e.g. as outlined in the Attention Is All You Need paper).

Solution:

  1. (Tokenization) Partition the inputted natural language text into a sequence of tokens \(\tau_1,…,\tau_N\) (here \(N\) is bounded from above by the LLM’s context size).
  2. (Embedding) Each token \(\tau_i\) is embedded as a vector \(\tau_i\mapsto\mathbf x_i\in\mathbf R^{n_e}\) that contains information about the token’s generic meaning as well as its position in the inputted natural language text. Here, the hyperparameter \(n_e\) is the dimension of the embedding space.
  3. (Single Self-Attention Head) For each embedding vector \(\mathbf x_i\), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\) and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it according to a convex linear combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context, specifically:

\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]

where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are key and value matrices associated to the inputted context (filled with column vectors here rather than the ML convention of row vectors). This map that takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x’_i=\mathbf x_i+\Delta\mathbf x_i\) is called a head of self-attention. The \(1/\sqrt{n_{qk}}\) scaling in the softmax temperature is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) will have mean \(0\) and variance \(n_{qk}\), hence the need to normalize by \(\sqrt{n_{qk}}\) to ensure \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\).

  4. (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel for several heads of self-attention; each head will propose a displacement update to each of the \(N\) original embeddings \(\mathbf x_i\); add up all of them.
  5. (Multilayer Perceptron) Essentially Linear \(\to\) ReLU \(\to\) Linear, applied to each embedding independently. It is hypothesized that facts are stored in this part of the transformer.
  6. (Layers) Alternate between the multi-headed self-attention blocks and MLP blocks; finally, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x’_N\) of the last token \(\tau_N\) in the context, by applying an unembedding matrix \(W_{\mathbf u}\) to obtain logits \(\mathbf u=W_{\mathbf u}\mathbf x’_N\) and running them through a softmax \(\text{softmax}(\mathbf u)\).
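A single head of self-attention as described above can be sketched in a few lines (column-vector convention as in the text; the dimensions and random weights are illustrative only, and causal masking is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_qk, N = 8, 4, 6           # embedding dim, query/key dim, context length

X = rng.normal(size=(n_e, N))    # token embeddings as columns
W_q = rng.normal(size=(n_qk, n_e))
W_k = rng.normal(size=(n_qk, n_e))
W_v = rng.normal(size=(n_e, n_e))

Q, K, V = W_q @ X, W_k @ X, W_v @ X

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(K.T @ Q / np.sqrt(n_qk))  # column i = attention weights for token i
dX = V @ A                            # update vectors Delta x_i as columns
X_out = X + dX                        # contextualized embeddings
```

Each column of `A` is a probability vector, so each \(\Delta\mathbf x_i\) is indeed a convex combination of the value vectors.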

Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).

Solution: Essentially, since an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token, by seeding it with some corpus of text \(T\), the LLM can sample according to the probability distribution it generates for the next token, and append that to \(T\mapsto T+\tau\). Then, simply repeat this except pretend that \(T+\tau\) was the seed all along. In this way, generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.
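The autoregressive loop can be sketched with a stand-in for the transformer (the toy vocabulary and random-logit “model” below are made up; a real LLM would replace `next_token_distribution` with a full forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_distribution(tokens):
    """Stand-in for a transformer forward pass: returns a probability
    distribution over the vocabulary given the context so far."""
    logits = rng.normal(size=len(vocab))          # toy: random logits
    e = np.exp(logits - logits.max())
    return e / e.sum()

T = ["the"]                                        # seed text
for _ in range(8):                                 # generate 8 tokens
    p = next_token_distribution(T)                 # predict next-token distribution
    tau = rng.choice(vocab, p=p)                   # sample the next token
    T.append(str(tau))                             # append, then repeat with T + tau

print(" ".join(T))
```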

Posted in Blog | Leave a comment

JAX Fundamentals (Part \(1\))

JAX_tutorial
Posted in Blog | Leave a comment

Monte Carlo Methods

Problem: Distinguish between Las Vegas methods and Monte Carlo methods.

Solution: Both are umbrella terms referring to broad classes of methods that draw (repeatedly) from (not necessarily i.i.d.) random variables to compute the value of some deterministic variable. Here, “compute” means “find that deterministic value with certainty” in the case of a Las Vegas method whereas “compute” means “(point) estimate” in the case of a Monte Carlo method. Thus, their key difference lies in how they trade off time vs. accuracy: a Las Vegas method has a random runtime but guarantees a correct answer upon termination, whereas a Monte Carlo method runs in bounded time but returns an answer that is only approximately/probably correct.

Problem: Given a scalar continuous random variable with probability density function \(p(x)\), how can one sample faithfully from such an \(x\sim p(x)\)?

Solution: Fundamentally, a computer can (approximately) draw from a uniform random variable \(u\in [0,1]\) (using seed-based pseudorandom number generators for instance). In order to convert this \(u\to x\), the standard way is to use the cumulative distribution function of \(x\):

\[\int_{-\infty}^xdx’p(x’):=u\]

Then, as the probability density of \(u\) is uniform \(p(u)=1\), the probability density of \(x\) will be \(p(x)=\frac{du}{dx}p(u)\). As the CDF is a monotonically increasing bijection, there always exists an \(x=x(u)\) for each \(u\in[0,1]\). And intuitively, this whole sampling scheme should make a lot of sense: regions where \(p(x)\) is large correspond to steep portions of the CDF, hence to wide intervals of \(u\), and so are hit more often by the uniform draws.
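A concrete instance of the inverse-CDF trick for the exponential distribution \(p(x)=\lambda e^{-\lambda x}\), whose CDF \(1-e^{-\lambda x}=u\) inverts to \(x=-\ln(1-u)/\lambda\):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=1_000_000)        # uniform draws on [0, 1)
x = -np.log(1 - u) / lam               # inverse CDF of the exponential

print(x.mean())  # should be close to the exact mean 1/lam = 0.5
```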

Problem: So, it seems like the above “inverse CDF trick” has solved the problem of sampling from an arbitrary random variable \(x\) (and it generalizes to the case of a higher-dimensional random vector \(\mathbf x\in\mathbf R^n\) by sampling conditionally). So how come in many practical applications this “sampling” problem can still be difficult?

Solution: In practice, one may not have access to \(p(\mathbf x)\) itself, but only the general shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) of it (e.g. in Bayesian inference). One might object that \(Z\) is known, namely \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\) can be written explicitly as an \(n\)-dimensional integral, so wherein lies the difficulty? The difficulty lies in evaluating the integral, which suffers from the curse of dimensionality if \(n\gg 1\) is large (just imagine approximating the integral by a sum, and then realizing that the number of terms \(N\) in such a series would grow as \(N\sim O(\exp n)\)). The inverse CDF trick thus fails because it requires one to have \(p(\mathbf x)\) itself, not merely \(\tilde p(\mathbf x)\).

Problem: Describe the Monte Carlo method known as importance sampling for estimating the expectation \(\langle f(\mathbf x)\rangle\) of a random variable \(f(\mathbf x)\).

Solution: Importance sampling consists of a deliberate reframing of the integrand of the expectation inner product integral:

\[\langle f(\mathbf x)\rangle=\int d^n\mathbf x p(\mathbf x)f(\mathbf x)=\int d^n\mathbf x q(\mathbf x)\frac{p(\mathbf x)f(\mathbf x)}{q(\mathbf x)}\]

in other words, replacing \(p(\mathbf x)\mapsto q(\mathbf x)\) and \(f(\mathbf x)\mapsto p(\mathbf x)f(\mathbf x)/q(\mathbf x)\). There are \(2\) cases in which importance sampling can be useful. The first case is related to the previous problem: when sampling directly from \(p(\mathbf x)\) is too difficult, \(q(\mathbf x)\) should be chosen to be easier to both sample from and evaluate than \(p(\mathbf x)\). The second case is that, even if sampling directly from \(p(\mathbf x)\) is feasible, it can still be useful to instead sample from \(q(\mathbf x)\) as a means of reducing the variance of the Monte Carlo estimator. That is, whereas the original Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(p(\mathbf x)\) is \(\sum_{i=1}^Nf(\mathbf x_i)/N\) with variance \(\sigma^2_{f(\mathbf x)}/N\), the new Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws from \(q(\mathbf x)\) is \(\sum_{i=1}^Np(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)N\) with variance \(\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}/N\); for importance sampling to be useful, \(q(\mathbf x)\) should be chosen such that:

\[\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}<\sigma^2_{f(\mathbf x)}\]

One can explicitly find the function \(q(\mathbf x)\) that minimizes the variance functional of \(q(\mathbf x)\) (using a Lagrange multiplier to enforce normalization \(\int d^n\mathbf x q(\mathbf x)=1\)):

\[q(\mathbf x)\sim p(\mathbf x)|f(\mathbf x)|\]

In and of itself, this \(q(\mathbf x)\) cannot be used because it contains the difficult piece \(p(\mathbf x)\). Nevertheless, motivated by this result, one can heuristically importance sample using a \(q(\mathbf x)\) which ideally is large at \(\mathbf x\in\mathbf R^n\) where \(p(\mathbf x)|f(\mathbf x)|\) is large. Intuitively, this is because that’s the integrand in the expectation inner product integral, so locations where it’s large will contribute most to the integral, hence are the most “important” regions to sample from with \(q(\mathbf x)\).

One glaring flaw with the above discussion is that if \(Z\) is not known, then it will also be impossible to sample from \(p(\mathbf x)\). Instead, suppose as before that only the shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) is known with \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\). Thus, the expectation is:

\[\frac{\int d^n\mathbf x\tilde{p}(\mathbf x)f(\mathbf x)}{\int d^n\mathbf x\tilde{p}(\mathbf x)}\]

The key is to notice that the denominator is just the \(f(\mathbf x)=1\) special case of the numerator, hence the importance sampling trick can be applied to both numerator and denominator, thereby obtaining the (biased, but asymptotically unbiased) self-normalized Monte Carlo estimator:

\[\frac{\sum_{i=1}^N\tilde p(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)}{\sum_{i=1}^N\tilde p(\mathbf x_i)/q(\mathbf x_i)}\]

for the expectation.
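A sketch of the self-normalized estimator using an unnormalized Gaussian shape \(\tilde p(x)=e^{-x^2/2}\) (so the true \(\langle x^2\rangle=1\)) and an easy-to-sample uniform proposal \(q\) (both choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

p_tilde = lambda x: np.exp(-x ** 2 / 2)   # unnormalized target shape (Z unknown)

# proposal q = Uniform(-8, 8): easy to both sample from and evaluate
x = rng.uniform(-8.0, 8.0, size=N)
q = 1.0 / 16.0                             # proposal density (constant)

w = p_tilde(x) / q                         # unnormalized importance weights
estimate = np.sum(w * x ** 2) / np.sum(w)  # self-normalized estimator

print(estimate)  # should be close to <x^2> = 1 under the standard Gaussian
```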

Problem: Describe what Markov chain Monte Carlo (MCMC) methods are about.

Solution: Whereas in “naive” Monte Carlo methods the random vector draws \(\mathbf x_1,…,\mathbf x_N\) are often i.i.d., here the key innovation is that the sequence of random vectors is no longer i.i.d.; rather, \(\mathbf x_t\) depends on \(\mathbf x_{t-1}\) (but, conditioned on \(\mathbf x_{t-1}\), is independent of all the earlier \(\mathbf x_{<t-1}\)). In other words, the sequence \(\mathbf x_1,…,\mathbf x_N\) forms a (discrete-time) Markov chain, i.e. there exists a transition kernel with “matrix elements” \(p(\mathbf x’|\mathbf x)\) for any pair of states \(\mathbf x,\mathbf x’\in\mathbf R^n\) (assuming the DTMC is \(t\)-independent/homogeneous). The goal of an MCMC method is to find a suitable \(p(\mathbf x’|\mathbf x)\) such that the stationary \(N\to\infty\) distribution of the chain matches \(p(\mathbf x)\). Thus, it’s basically an “MCMC sampler”.

    Problem: What does it mean to burn in an MCMC run? What does it mean to thin an MCMC run?

    Solution: Given a DTMC \(\mathbf x_1,…,\mathbf x_N\), the idea of burning-in the chain is to discard the first few samples \(\mathbf x_1,\mathbf x_2,…\) as they’re correlated with the (arbitrary) choice of starting \(\mathbf x_1\) and so are not representative of \(p(\mathbf x)\) (equivalently, one hopes to burn-in for the first \(N_b\) samples where \(N_b\) is around the mixing time of the Markov chain). Burn-in therefore gives the MCMC run a chance to find the high-probability regions of \(p(\mathbf x)\).

    Thinning an MCMC run means keeping only e.g. every \(10^{\text{th}}\) point of the DTMC; this reduces the correlation between retained samples, since the goal is to sample (approximately) independently from \(p(\mathbf x)\).
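    Both operations are just array slicing; as a small sketch (the chain here is a placeholder array, and the burn-in length \(N_b\) and thinning stride \(k\) are illustrative choices):

```python
import numpy as np

# Placeholder for an MCMC run: N samples (rows) in n = 3 dimensions.
chain = np.random.default_rng(1).standard_normal((10_000, 3))

N_b, k = 1_000, 10       # burn-in length and thinning stride (tuning choices)

# Drop the first N_b samples, then keep every k-th sample thereafter.
kept = chain[N_b::k]
```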

    Problem: The Metropolis-Hastings method is a general template for a fairly broad family of MCMC methods. Explain how it works.

    Solution: In order for \(p(\mathbf x)\) to be the stationary distribution of a DTMC, by definition one requires it to be an eigenvector (with eigenvalue \(1\)) of the transition kernel:

    \[\int d^n\mathbf x\,p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x’)\]

    A sufficient (though not necessary) condition for this is if the chain is in detailed balance:

    \[p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x|\mathbf x’)p(\mathbf x’)\]

    for all \(\mathbf x,\mathbf x’\in\mathbf R^n\). The idea of Metropolis-Hastings is to conceptually decompose the transition from state \(\mathbf x\to\mathbf x’\) into \(2\) simpler steps, namely:

    1. Propose the transition \(\mathbf x\to\mathbf x’\) with proposal probability \(p_p(\mathbf x’|\mathbf x)\) (this distribution \(p_p\) needs be easy to sample \(\mathbf x’\) from for any \(\mathbf x\)!)
    2. Accept/reject the proposed transition \(\mathbf x\to\mathbf x’\) with an acceptance probability \(p_a(\mathbf x’|\mathbf x)\) (this distribution also needs to be easy to sample from!)

    Thus, \(p(\mathbf x’|\mathbf x)=p_a(\mathbf x’|\mathbf x)p_p(\mathbf x’|\mathbf x)\). The ratio of acceptance probabilities from detailed balance is therefore:

    \[\frac{p_a(\mathbf x’|\mathbf x)}{p_a(\mathbf x|\mathbf x’)}=\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\]

    The ratio on the RHS is known (furthermore, because it’s a ratio, neither \(p_p\) nor \(p\) needs to be normalized! In other words, one can replace \(p_p\mapsto \tilde p_p\) and \(p\mapsto\tilde p\)). The task remains to specify the form of \(p_a(\mathbf x’|\mathbf x)\). The detailed balance condition doesn’t give a unique solution for \(p_a\), but heuristically one can obtain a unique solution by further insisting that the walker accept moves as often as possible. That is, w.l.o.g. suppose the RHS is computed to be \(>1\), so that \(p_a(\mathbf x’|\mathbf x)>p_a(\mathbf x|\mathbf x’)\). This says that the transition \(\mathbf x\to\mathbf x’\) is in some sense more favorable than the reverse transition \(\mathbf x’\to\mathbf x\), so if at time \(t\) the MCMC walker were at \(\mathbf x_t=\mathbf x\), then at time \(t+1\) it should definitely move to state \(\mathbf x_{t+1}:=\mathbf x’\). In other words, one would like to set \(p_a(\mathbf x’|\mathbf x):=1\), and hence \(p_a(\mathbf x|\mathbf x’)=\frac{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}\). But this fixes the form of the function \(p_a\)! Specifically, to evaluate \(p_a(\mathbf x’|\mathbf x)\), one should set it to \(1\) if the RHS ratio is \(>1\), and to the RHS ratio itself if it is \(<1\). An elegant way to encode all of this into a single compact expression is just:

    \[p_a(\mathbf x’|\mathbf x)=\min\left(1,\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\right)\]

    Practically, to actually implement this Metropolis-Hastings acceptance probability \(p_a\), one can essentially flip a “biased” coin with probability \(p_a(\mathbf x’|\mathbf x)\) of landing “heads” (meaning to accept the transition \(\mathbf x\to\mathbf x’\)), e.g. drawing \(u\in[0,1]\) uniformly at random and declaring acceptance iff \(u\leq p_a(\mathbf x’|\mathbf x)\).

    (aside: if the proposal probability \(p_p(\mathbf x’|\mathbf x)\sim e^{-|\mathbf x’-\mathbf x|^2/2\sigma^2}\) is taken to be normally and isotropically distributed about \(\mathbf x\), then it is symmetric \(p_p(\mathbf x’|\mathbf x)=p_p(\mathbf x|\mathbf x’)\) and the MH acceptance probability simplifies to \(\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(\mathbf x’)}{p(\mathbf x)}\) (in which case this is sometimes just called the Metropolis method). Another tip is that the burn-in period of the MCMC walk can often be used to tune the variance \(\sigma^2\) of the above Gaussian \(p_p\) distribution, namely if the burn-in period lasts for \(N_b\) samples, then look at the number of times \(N_a\) the MCMC walker accepted a proposed state from \(p_p\); a rule-of-thumb is that in high dimensions \(n\gg 1\), one should tune \(\sigma^2\) such that the fraction of accepted proposals \(N_a/N_b\approx 23.4\%\), see “Weak convergence and optimal scaling of random walk Metropolis algorithms” by Roberts, Gelman, and Gilks (\(1997\)). Finally, one more MCMC trick worth mentioning is that one can always run e.g. \(\sim 100\) MCMC sequences in parallel and sample randomly from them as an alternative to thinning).
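    Putting the pieces together, here is a minimal random-walk Metropolis sketch using the symmetric Gaussian proposal just described, so the acceptance ratio reduces to \(\tilde p(\mathbf x’)/\tilde p(\mathbf x)\) (worked in log-space for numerical stability); the \(2\)D standard-normal target and the step size \(\sigma=1\) are illustrative choices:

```python
import numpy as np

def metropolis(log_p_tilde, x0, sigma, n_steps, rng):
    """Random-walk Metropolis: symmetric isotropic Gaussian proposal,
    so p_a = min(1, p~(x')/p~(x)); only the unnormalized log-density
    log_p_tilde is needed (the normalizer Z cancels in the ratio)."""
    x = np.asarray(x0, dtype=float)
    log_p = log_p_tilde(x)
    samples, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + sigma * rng.standard_normal(x.shape)  # propose x'
        log_p_prop = log_p_tilde(x_prop)
        # accept iff u <= p~(x')/p~(x), i.e. log u <= log p~(x') - log p~(x)
        if np.log(rng.uniform()) <= log_p_prop - log_p:
            x, log_p = x_prop, log_p_prop
            n_accept += 1
        samples.append(x.copy())        # on rejection, the old x repeats
    return np.array(samples), n_accept / n_steps

# Illustrative target: standard 2D Gaussian, log p~(x) = -|x|^2/2.
rng = np.random.default_rng(0)
samples, accept_rate = metropolis(lambda x: -0.5 * x @ x,
                                  np.zeros(2), sigma=1.0,
                                  n_steps=20_000, rng=rng)
```

    Note that a rejected proposal still contributes a sample (the walker re-records its current state); silently dropping rejections would bias the chain away from \(p(\mathbf x)\).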

    Problem: Explain how Gibbs sampling works.

    Solution: Whereas a Metropolis-Hastings sampler moves around in state space like a queen in chess, a Gibbs sampler moves around like a rook. That is, it only ever moves along the \(n\) coordinate directions in \(\mathbf x=(x_1,…,x_n)\). Indeed, the Gibbs sampler can be viewed as a special case of a Metropolis-Hastings sampler in which the proposal distribution \(p_p\) is given by:

    \[p_p(\mathbf x’|\mathbf x):=p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)\]

    where \(i=1,…,n\) is cycled through the \(n\) coordinate directions. For each \(1\leq i\leq n\), the MH acceptance probability is:

    \[\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(x_i|x’_1,…,x’_{i-1},x’_{i+1},…,x’_n)\delta(x_1-x’_1)…\delta(x_{i-1}-x’_{i-1})\delta(x_{i+1}-x’_{i+1})…\delta(x_n-x’_n)p(\mathbf x’)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)p(\mathbf x)}\]

    \[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x’_i,…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_n)}\]

    \[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}\]

    \[=1\]

    so \(\min(1,1)=1\); in other words, the Gibbs sampler always accepts, because each proposal is drawn from the exact conditional slice \(p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\) of the target (assumed to be tractable to sample from even if the full joint distribution \(p(\mathbf x)\) is not), so the proposals are already authentic samples in that direction. For a Gibbs sampler, only after cycling through all \(n\) coordinate directions \(i=1,…,n\) does it count as \(1\) MCMC iteration.

    (aside: a common variant of the basic Gibbs sampling formalism described above is block Gibbs sampling, in which one may choose to update \(2\) or more features simultaneously by sampling them jointly from their conditional distribution given all the other features held fixed; this enables diagonal movements within that subspace which, depending on how correlated those features are, can significantly increase the effective sample size (ESS)).
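    As a small worked sketch of the coordinate-wise sweep (the bivariate Gaussian target with correlation \(\rho=0.8\) is an illustrative choice; its conditionals \(x_1|x_2\sim\mathcal N(\rho x_2,1-\rho^2)\), and symmetrically for \(x_2|x_1\), are known in closed form, which is exactly the situation Gibbs sampling requires):

```python
import numpy as np

# Gibbs sampler for a bivariate Gaussian with unit marginals and
# correlation rho: each coordinate is resampled from its exact
# conditional, so every "proposal" is accepted (min(1, 1) = 1).
rho = 0.8
rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0          # arbitrary starting state
samples = []
for _ in range(20_000):
    # one full MCMC iteration = one sweep through both coordinates
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples.append((x1, x2))
samples = np.array(samples)
```

    The rook-like moves are visible in the update pattern: each line changes exactly one coordinate while holding the other fixed.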

    Problem: Explain how the Hamiltonian Monte Carlo method works.

    Solution:
