Renormalization Group

Problem: Consider a Landau-Ginzburg theory involving a single real scalar field \(\phi(\mathbf x)\) for \(\mathbf x\in\mathbf R^n\) governed by the canonically normalized free energy density:

\[\mathcal F(\phi,\partial\phi/\partial\mathbf x,…)=\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2+\frac{\phi^2}{2\xi^2}+…\]

Explain what the \(+…\) means, explain which terms have temperature \(T\)-dependence, and explain for which such terms that \(T\)-dependence actually matters.

Solution: The \(+…\) includes any terms (each with their own coupling constants) compatible with the golden trinity of constraints: locality, analyticity, and symmetry (e.g. a quartic \(g\phi^4\) coupling). The part of the free energy density \(\mathcal F\) before the \(+…\) should be compared to the Lagrangian density \(\mathcal L\) of Klein-Gordon field theory:

\[\mathcal L=\frac{1}{2c^2}\left(\frac{\partial\phi}{\partial t}\right)^2-\frac{1}{2}\biggr|\frac{\partial\phi}{\partial\mathbf x}\biggr|^2-\frac{\phi^2}{2\bar{\lambda}^2}\]

with \(\bar{\lambda}=\hbar/mc\) the reduced Compton wavelength playing a role analogous to the correlation length \(\xi\sim 1/\sqrt{|T-T_c|}\); indeed, this \(T\)-dependence in \(\xi=\xi(T)\) is the only \(T\)-dependence that matters, even though generically all the other coupling constants will also have \(T\)-dependence.
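As a numerical aside (my own sketch, restricted to an assumed \(1\)-dimensional quadratic truncation of the theory), one can check the interpretation of \(\xi\) as the correlation length: the two-point correlator of the Gaussian theory solves \((-\nabla^2+\xi^{-2})G=\delta\), and should therefore decay like \(e^{-|x|/\xi}\):

```python
import numpy as np

# Numerical sanity check (my own sketch, not part of the original problem):
# in the 1D Gaussian theory, the two-point correlator G satisfies
# (-d^2/dx^2 + 1/xi^2) G(x) = delta(x), so G should decay like exp(-|x|/xi).
xi = 1.5
L, N = 30.0, 1601                 # box size and grid resolution (xi << L)
x = np.linspace(-L/2, L/2, N)
dx = x[1] - x[0]

# finite-difference operator -d^2/dx^2 + 1/xi^2 with Dirichlet boundaries
A = (np.diag(np.full(N, 2.0/dx**2 + 1.0/xi**2))
     + np.diag(np.full(N - 1, -1.0/dx**2), 1)
     + np.diag(np.full(N - 1, -1.0/dx**2), -1))

b = np.zeros(N)
b[N//2] = 1.0/dx                  # discrete delta-function source at x = 0
G = np.linalg.solve(A, b)

# fit the decay rate of log G on an intermediate window; expect slope = -1/xi
mask = (x > 2.0) & (x < 8.0)
slope = np.polyfit(x[mask], np.log(G[mask]), 1)[0]
print(slope, -1.0/xi)             # both ~ -0.667
```

The divergence \(\xi\to\infty\) as \(T\to T_c\) then corresponds to the decay rate of correlations going to zero.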


Convolutional Neural Networks

CNNs_Part_1

Hamilton’s Optics-Mechanics Analogy

Problem: Deduce the Hamilton-Jacobi equation of classical mechanics.

Solution: Instead of viewing the action \(S=S[\mathbf x(t)]\) as a functional of the particle’s trajectory \(\mathbf x(t)\), it can be viewed more simply as a scalar field \(S(\mathbf x,t)\) in which the initial point in spacetime \((t_0,\mathbf x_0)\) is fixed and one simply takes the on-shell trajectory from \((t_0,\mathbf x_0)\) to \((t,\mathbf x)\). Then the total differential \(dS=\mathbf p\cdot d\mathbf x\) (follows from the usual Noetherian calculation) so in particular:

\[\mathbf p=\frac{\partial S}{\partial\mathbf x}\]

Intuitively, this is saying that the particle moves in a direction (the direction of the momentum \(\mathbf p\)) orthogonal to the contour surfaces of the action field \(S\), i.e. such isosurfaces can be viewed as “wavefronts”. Then the total time derivative is:

\[\dot S=L\]

But by the chain rule, \(\dot S=\frac{\partial S}{\partial t}+\frac{\partial S}{\partial\mathbf x}\cdot\dot{\mathbf x}=\frac{\partial S}{\partial t}+\mathbf p\cdot\dot{\mathbf x}\). Thus, solving for \(H=\mathbf p\cdot\dot{\mathbf x}-L\) yields the Hamilton-Jacobi nonlinear \(1^{\text{st}}\)-order PDE for \(S(\mathbf x,t)\):

\[-\frac{\partial S}{\partial t}=H\left(\mathbf x,\frac{\partial S}{\partial\mathbf x},t\right)\]

Problem: When \(\partial H/\partial t=0\), the Hamiltonian is conserved with energy \(H=E\), so this motivates the additive separation of variables, \(S(\mathbf x,t):=S_0(\mathbf x)-Et\) for some constant \(E\). What does the Hamilton-Jacobi equation simplify to in this case? For a single non-relativistic particle of mass \(m\) moving in a potential \(V(\mathbf x)\), what does this look like? What about in \(1\) dimension?

Solution: \[H\left(\mathbf x,\frac{\partial S_0}{\partial\mathbf x}\right)=E\]

which for \(H(\mathbf x,\mathbf p)=|\mathbf p|^2/2m+V(\mathbf x)\) looks like:

\[\frac{1}{2m}\biggr|\frac{\partial S_0}{\partial\mathbf x}\biggr|^2+V(\mathbf x)=E\]

and in \(1\) dimension is integrable to the explicit solution:

\[S_0(x)=\pm\int^xdx'\sqrt{2m(E-V(x'))}\]

In particular, the usual trajectory \(x(t)\) can be obtained by treating \(S(x,t;E)=S_0(x;E)-Et\) as a family of solutions parameterized by the energy \(E\); this works because \(S\) can be viewed as a particular generating function of a canonical transformation \((\mathbf x,\mathbf p,H)\mapsto (\mathbf x',\mathbf p',H')\) in which the “boosted” Hamiltonian vanishes \(H'=0\), so the variable conjugate to \(E\) is a constant of the motion, say \(-t_0\):

\[\frac{\partial S}{\partial E}=\frac{\partial S_0}{\partial E}-t=-t_0\Rightarrow t-t_0=\pm\sqrt{\frac{m}{2}}\int_{x_0}^{x(t)}\frac{dx'}{\sqrt{E-V(x')}}\]
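As a numerical sanity check (my own, with assumed units \(m=\omega=A=1\)), this quadrature can be evaluated for the harmonic oscillator \(V(x)=\frac{1}{2}m\omega^2x^2\), where the analytic answer is \(t-t_0=\arcsin(x/A)/\omega\), i.e. the familiar \(x(t)=A\sin(\omega(t-t_0))\):

```python
import numpy as np

# Quick numerical check (my own example, units m = omega = A = 1): for the
# harmonic oscillator V(x) = m omega^2 x^2 / 2 at energy E = m omega^2 A^2 / 2,
# the quadrature should give t - t0 = arcsin(x/A)/omega.
m, omega, A = 1.0, 1.0, 1.0
E = 0.5*m*omega**2*A**2
V = lambda x: 0.5*m*omega**2*x**2

x_end = 0.5*A                                  # stay away from the turning point x = A
xs = np.linspace(0.0, x_end, 20001)
f = 1.0/np.sqrt(E - V(xs))
t_elapsed = np.sqrt(m/2.0)*np.sum(0.5*(f[1:] + f[:-1])*np.diff(xs))   # trapezoid rule
print(t_elapsed, np.arcsin(x_end/A)/omega)     # both ~ pi/6 ~ 0.5236
```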

Problem: Above, the static field \(S_0(\mathbf x)\) was introduced to simplify the Hamilton-Jacobi equation when the energy \(E\) was conserved. However, if one pulls back to the level of functionals rather than fields, one can define an analogous abbreviated action functional \(S_0[\mathbf x]\) which depends only on the path \(\mathbf x\) taken rather than the trajectory \(\mathbf x(t)\). Define \(S_0[\mathbf x]\), and moreover show that when the energy \(E\) is conserved, the on-shell path is a stationary point of \(S_0\) (this is called Maupertuis’s principle).

Solution: The abbreviated action for a single particle of mass \(m\) and (non-relativistic) energy \(E=|\mathbf p|^2/2m +V(\mathbf x)\) is:

\[S_0[\mathbf x]:=\int d\mathbf x\cdot\mathbf p=\int ds |\mathbf p|=\int ds\sqrt{2m(E-V(\mathbf x))}\]

(modifying this to \(S_0[\mathbf x]:=\int ds |\mathbf p|=\int ds\sqrt{2(E-V(\mathbf x))}\) allows for interpretation as an \(N\)-particle system in configuration space \(\mathbf x\in\mathbf R^{3N}\) with the Riemannian “mass metric” \(ds^2=m_1|d\mathbf x_1|^2+…+m_N|d\mathbf x_N|^2\)).

To find the stationary paths of \(S_0[\mathbf x]\) subject to the constraint \(H(\mathbf x,\mathbf p)=E\), one can implement a Lagrange multiplier \(\gamma(\tau)\) to perform unconstrained extremization of:

\[S[\mathbf x(\tau)]:=S_0[\mathbf x(\tau)]-\int d\tau\gamma(\tau)(H(\mathbf x,\mathbf p)-E)=\int d\tau (\mathbf p\cdot\dot{\mathbf x}-\gamma(\tau)(H(\mathbf x,\mathbf p)-E))\]

The Euler-Lagrange equations lead to Hamilton’s equations:

\[\frac{d\mathbf x}{d\tau}=\gamma\frac{\partial H}{\partial\mathbf p}\]

\[\frac{d\mathbf p}{d\tau}=-\gamma\frac{\partial H}{\partial\mathbf x}\]

provided the Lagrange multiplier \(\gamma=dt/d\tau\) encodes reparameterization invariance; with this choice it’s clear that the integrand in the functional \(S\) was nothing more than the Lagrangian \(L=\mathbf p\cdot\dot{\mathbf x}-H\) (plus an unimportant constant \(E\)) so Maupertuis’s principle reduces to the usual Hamilton’s principle.

Problem: What does Fermat’s principle in ray optics assert? Hence, derive the ray equation.

Solution: The time functional \(T=T[\mathbf x(s)]\) of a ray trajectory \(\mathbf x(s)\) is stationary on-shell. That is:

\[cT[\mathbf x(s)]=\int ds n(\mathbf x(s))\]

This is reparameterization invariant, since one can arbitrarily parameterize \(\mathbf x=\mathbf x(t)\) and replace \(ds=dt|\dot{\mathbf x}|\). The corresponding Euler-Lagrange equations are:

\[\frac{d}{dt}\left(n(\mathbf x)\frac{\dot{\mathbf x}}{|\dot{\mathbf x}|}\right)=|\dot{\mathbf x}|\frac{\partial n}{\partial\mathbf x}\]

But by choosing the natural parameterization \(t:=s\) one has \(|d\mathbf x/ds|=1\), hence the ray equation:

\[\frac{d}{ds}\left(n\frac{d\mathbf x}{ds}\right)=\frac{\partial n}{\partial\mathbf x}\]

This can also be written in terms of the curvature vector \(\boldsymbol{\kappa}=d^2\mathbf x/ds^2\):

\[\boldsymbol{\kappa}=\left(\frac{\partial\ln n}{\partial\mathbf x}\right)_{\perp d\mathbf x}\]
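As a sanity check on the ray equation (my own toy example, with an assumed linear profile \(n(y)=1+0.2y\)), one can integrate it numerically and confirm the Snell invariant, i.e. conservation of the horizontal component of \(n\,d\mathbf x/ds\) in a stratified medium:

```python
import numpy as np

# Illustrative ray-tracing sketch (my own assumed example): integrate the ray
# equation d/ds (n dx/ds) = dn/dx in the stratified medium n(y) = 1 + 0.2 y.
# Since n is homogeneous in x, the horizontal component of the "optical
# momentum" p = n dx/ds must be conserved (Snell's law).
n = lambda r: 1.0 + 0.2*r[1]
grad_n = lambda r: np.array([0.0, 0.2])

r = np.array([0.0, 0.0])
u0 = np.array([np.cos(1.0), np.sin(1.0)])   # unit launch direction
p = n(r)*u0

ds, steps = 1e-3, 2000                      # integrate up to path length s = 2
snell0 = p[0]
for _ in range(steps):
    p = p + ds*grad_n(r)                    # dp/ds = grad n
    r = r + ds*p/np.linalg.norm(p)          # dr/ds = unit tangent = p/|p|

print(p[0] - snell0)                        # 0: Snell invariant exactly conserved
print(np.linalg.norm(p) - n(r))             # ~ 0: |p| = n preserved by the flow
```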

Problem: Starting from an arbitrary Cartesian component \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) of either the \(\mathbf E\) or \(\mathbf B\) fields of a light wave (here \(\omega=ck_0\) with \(k_0=2\pi/\lambda_0\) is the free space wavenumber), make the eikonal approximation to the dispersionless wave equation obeyed by \(\psi\) in order to obtain the (scalar) eikonal equation. By defining light rays as the integral curves of the eikonal field \(cT(\mathbf x)\) (a kind of local optical path length), reproduce the vector eikonal equation from Fermat’s principle above.

Solution: The ansatz \(\psi(\mathbf x,t)=\psi_0(\mathbf x)e^{i(k_0cT(\mathbf x)-\omega t)}\) is easy to justify; the \(e^{-i\omega t}\) is just a Fourier transform factor that reduces the wave equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=\frac{n^2}{c^2}\frac{\partial^2\psi}{\partial t^2}\) to a Helmholtz equation \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi=-n^2k_0^2\psi\). The remaining piece is just a polar parameterization of an arbitrary \(\mathbf C\)-valued spatial field \(\psi_0(\mathbf x)e^{ik_0cT(\mathbf x)}\). One obtains:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2=n^2+\frac{1}{k_0^2\psi_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_0+\frac{2i}{k_0\psi_0}\frac{\partial\psi_0}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}+\frac{i}{k_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2cT\]

The eikonal approximation amounts to taking the ray optics limit \(k_0\to\infty\) (in practice, the wavelength \(2\pi/k_0\) has to be much shorter than all other length scales), and yields the (scalar) eikonal equation:

\[\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|=n\]

A light ray is thus a trajectory \(\mathbf x(s)\) with unit tangent vector:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

The rest is an application of the chain rule:

\[\frac{d}{ds}=\frac{\partial}{\partial\mathbf x}\cdot\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial}{\partial\mathbf x}\cdot\frac{\partial cT}{\partial\mathbf x}\]

followed by the identity:

\[\left(\frac{\partial cT}{\partial\mathbf x}\cdot\frac{\partial}{\partial\mathbf x}\right)\frac{\partial cT}{\partial\mathbf x}=\frac{1}{2}\frac{\partial}{\partial\mathbf x}\biggr|\frac{\partial cT}{\partial\mathbf x}\biggr|^2\]

to deduce the (vector) eikonal equation of motion for ray trajectories just as Fermat’s principle predicts.

Problem: Hence, what is Hamilton’s optics-mechanics analogy?

Solution: In a nutshell, the isomorphism proceeds as:

\[(n(\mathbf x), cT)\leftrightarrow (|\mathbf p(\mathbf x)|,S_0)\]

Problem: Use Hamilton’s optics-mechanics analogy to solve the brachistochrone problem (this was how Johann Bernoulli originally solved it).

Solution: By energy conservation, the speed of the particle at distance \(y>0\) below its initial dropping height is \(v=\sqrt{2gy}\). By Fermat’s principle, minimizing the time functional then amounts to treating the particle as a light ray with \(n(\mathbf x)=c/v(y)\). So the question becomes: how do light rays bend in a horizontally stratified medium with \(n(y)\propto y^{-1/2}\)? The answer is given by the ray equations:

\[\frac{d}{ds}\begin{pmatrix} y^{-1/2}dx/ds \\ y^{-1/2}dy/ds\end{pmatrix}=\begin{pmatrix}0 \\ -y^{-3/2}/2\end{pmatrix}\]

The horizontal component expresses Snell’s law since \(dx/ds=\sin\theta\) (it expresses momentum conservation along the homogeneous \(\partial n/\partial x=0\) direction). Using the tangent vector constraint \(ds^2=dx^2+dy^2\) gives the ODE of a cycloid:

\[\frac{dy}{dx}=\sqrt{\frac{\text{const}}{y}-1}\]

(the vertical component ODE has the analytical solution \(y(s)=s-s^2/8R\), which is already contained in the cycloid, so it carries no new information).
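As a quick numerical confirmation (my own, identifying \(\text{const}=2R\)), the standard cycloid parameterization can be checked directly against this ODE:

```python
import numpy as np

# Verification (my own check, identifying const = 2R): the cycloid
# x = R(theta - sin theta), y = R(1 - cos theta) satisfies dy/dx = sqrt(2R/y - 1).
R = 1.0
theta = np.linspace(0.3, 2.5, 500)          # avoid the cusp at theta = 0
y = R*(1.0 - np.cos(theta))
dydx = np.sin(theta)/(1.0 - np.cos(theta))  # (dy/dtheta) / (dx/dtheta)
rhs = np.sqrt(2.0*R/y - 1.0)
print(np.max(np.abs(dydx - rhs)))           # ~ 0 up to floating point
```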

Problem: How did Hamilton’s optics-mechanics analogy inspire Schrodinger to propose his famous equations of quantum mechanics?

Solution: Essentially, Schrodinger asked: ray optics is to wave optics as classical mechanics is to what? In other words, one imagines there exists a wave theory of particles/matter and one would like to take the “inverse eikonal limit” of classical mechanics (here, “inverse eikonal limit” is usually called quantization):

Just as light rays propagate parallel to their phase fronts:

\[\frac{d\mathbf x}{ds}=\frac{1}{n}\frac{\partial cT}{\partial\mathbf x}\]

Particles propagate parallel to their “action fronts” exactly according to Hamilton’s analogy:

\[\frac{d\mathbf x}{ds}=\frac{1}{|\mathbf p(\mathbf x)|}\frac{\partial S_0}{\partial\mathbf x}\]

Already this suggests that the action should have some phase interpretation. More precisely, it should be the phase of the particle’s de Broglie wave in units of \(\hbar\). It’s also not obvious that particles should be described by a scalar wave field rather than e.g. the electromagnetic vector wave fields of light. Schrodinger simply guessed it looked like the equivalent of “scalar diffraction theory” with a single wavefunction \(\psi(\mathbf x,t)=\psi_0(\mathbf x,t)e^{iS(\mathbf x,t)/\hbar}\). This gives the Madelung equations of quantum hydrodynamics, one of which is just a continuity equation (giving credence to the Born interpretation of \(|\psi|^2\)) and the other of which is a quantum Hamilton-Jacobi equation that in the limit \(\hbar\to 0\) (analogous to the eikonal limit \(\lambda_0\to 0\)) simplifies to the classical Hamilton-Jacobi equation.
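To flesh out that last claim (a standard computation, sketched here in this post’s notation): substituting the polar ansatz \(\psi=\psi_0e^{iS/\hbar}\) into the Schrodinger equation \(i\hbar\frac{\partial\psi}{\partial t}=-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi+V\psi\) and separating imaginary and real parts yields the two Madelung equations: a continuity equation for the density \(\psi_0^2\):

\[\frac{\partial\psi_0^2}{\partial t}+\frac{\partial}{\partial\mathbf x}\cdot\left(\psi_0^2\frac{1}{m}\frac{\partial S}{\partial\mathbf x}\right)=0\]

and a quantum Hamilton-Jacobi equation:

\[-\frac{\partial S}{\partial t}=\frac{1}{2m}\biggr|\frac{\partial S}{\partial\mathbf x}\biggr|^2+V-\frac{\hbar^2}{2m\psi_0}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_0\]

The last term (the quantum potential) carries the only explicit factor of \(\hbar\), so it drops out in the limit \(\hbar\to 0\), recovering the classical Hamilton-Jacobi equation.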


Pseudo-Riemannian Geometry

Problem: Define the signature of a matrix. Hence, state and prove Sylvester’s law of inertia.

Solution: The signature of an \(n\times n\) matrix \(A\) is a \(3\)-tuple \((n_+,n_-,n_0)\) where \(n_+\) is the number of positive eigenvalues of \(A\) (including multiplicity), \(n_-\) is the number of negative eigenvalue of \(A\) (including multiplicity), and \(n_0=\text{dim}\ker A\) is the multiplicity of the zero eigenvalue; thus, \(n_++n_-+n_0=n\).

Let \(A,B\) be real symmetric matrices. Then Sylvester’s law of inertia asserts that \(A\) and \(B\) are congruent matrices iff they have the same signature (which sometimes is called “inertia” because of this invariance under congruence, hence the name).

Proof: It’s easy to see that the nullity \(n_0\) is preserved by congruence transformations. If one can show that \(n_+\) is also preserved, then it follows that \(n_-\) is preserved by virtue of \(n_++n_-+n_0=n\). To show this, the idea is to argue by contradiction: assuming \(n_+\) is not preserved, dimension counting would imply a non-zero vector living in the intersection of the subspace spanned by the positive-eigenvalue eigenvectors of \(A\) and the congruence-transformed subspace spanned by the non-positive-eigenvalue eigenvectors of \(B\); such a vector would make the associated quadratic form simultaneously positive and non-positive, a contradiction.

Since any real, symmetric matrix \(A\) is isomorphic to a real quadratic form \(Q(\mathbf x):=\mathbf x^TA\mathbf x\), the concepts of signature and Sylvester’s law of inertia can also be reformulated in the language of quadratic forms rather than real symmetric matrices.
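A quick numerical illustration (my own sketch, not a proof) of this invariance under congruence:

```python
import numpy as np

# Numerical illustration (my own): a congruence transformation B = P^T A P with
# invertible P preserves the signature (n+, n-, n0) of a real symmetric matrix A.
def signature(M, tol=1e-8):
    ev = np.linalg.eigvalsh(M)
    return (int(np.sum(ev > tol)), int(np.sum(ev < -tol)),
            int(np.sum(np.abs(ev) <= tol)))

A = np.diag([2.0, -3.0, 5.0, -1.0])                # signature (2, 2, 0) by construction
P = np.random.default_rng(0).normal(size=(4, 4))   # generically invertible
B = P.T @ A @ P
print(signature(A), signature(B))                  # both (2, 2, 0)
```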

Problem: Let \(X\) be a smooth \(n\)-manifold. Explain what it means to place the additional structure of a pseudo-Riemannian metric \(g\) on \(X\). Then explain how Riemannian and Lorentzian geometry are special cases of pseudo-Riemannian geometry.

Solution: A Riemannian metric \(g\) on \(X\) is a type \((0,2)\) tensor field that defines an inner product \(g_x:T_x(X)^2\to\mathbf R\) at the tangent space \(T_x(X)\) of each point \(x\in X\). The “tensor field” part is sometimes rephrased as saying that \(g_x\) is a bilinear form. Moreover, to flesh out the usual axioms of a real inner product space, the Riemannian metric tensor \(g_x\) at each \(x\in X\) must be symmetric \(g_x(v_x,v’_x)=g_x(v’_x,v_x)\) and positive-definite \(v_x\neq 0\Rightarrow g_x(v_x,v_x)>0\).

A pseudo-Riemannian metric relaxes the positive-definite requirement of a Riemannian metric, instead merely requiring non-degeneracy (i.e. at each point \(x\in X\), the zero vector \(0\) is the only vector orthogonal \(g_x(0,v_x)=0\) to all \(v_x\in T_x(X)\)). More concisely, what this is saying is that \(n_0=0\), so the signature of a pseudo-Riemannian metric may be thought of as a pair \((n_+,n_-)\).

Thus, Riemannian metrics are the special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n,0)\). Meanwhile, Lorentzian metrics are another special subset of pseudo-Riemannian metrics for which \((n_+,n_-)=(n-1,1)\) (this is the relativists’ convention). Typically, spacetime \(X\) has dimension \(n=4\), so e.g. the Minkowski metric is said to have signature \((3,1):=(-,+,+,+)\).

Problem: Let \((X,g)\) be a pseudo-Riemannian manifold. Often, one writes the general expression for an infinitesimal line element on \(X\) as \(ds^2=g_{\mu\nu}dx^{\mu}dx^{\nu}\); explain the shorthand being used here.

Solution: This is nothing more than choosing some chart \(x^{\mu}\) on \(X\) and expanding the metric tensor field \(g\) in the coordinate basis \(\{dx^{\mu}\otimes dx^{\nu}\}\) of type \((0,2)\) tensor fields:

\[g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\]

for some real, symmetric scalar fields \(g_{\mu\nu}:X\to\mathbf R\). One then simply writes \(ds^2:=g\) and omits the tensor product \(\otimes\).

Problem: Let \(x(t)\in X\) be a curve on a Riemannian manifold \((X,g)\). By choosing a chart \(x^{\mu}\) to cover a suitable region of \(X\) spanned by the trajectory \(x(t)\), explain how to compute the length \(s\) of the curve \(x(t)\) using the Riemannian metric \(g\).

Solution: Heuristically, it is:

\[s=\int ds=\int\sqrt{g_{\mu\nu}dx^{\mu}dx^{\nu}}=\int dt\sqrt{g_{\mu\nu}(x(t))\dot{x}^{\mu}\dot{x}^{\nu}}\]
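As a minimal worked example (my own), consider the polar chart on flat \(\mathbf R^2\), where the procedure above should return the circumference of the unit circle:

```python
import numpy as np

# Worked example (my own): on X = R^2 in the polar chart, the metric components
# are g = diag(1, rho^2), i.e. ds^2 = d rho^2 + rho^2 d phi^2. The unit circle
# rho(t) = 1, phi(t) = t for t in [0, 2 pi] should then have length 2 pi.
t = np.linspace(0.0, 2.0*np.pi, 10001)
rho, phi = np.ones_like(t), t
rho_dot, phi_dot = np.gradient(rho, t), np.gradient(phi, t)
speed = np.sqrt(rho_dot**2 + rho**2*phi_dot**2)            # sqrt(g_munu xdot^mu xdot^nu)
length = np.sum(0.5*(speed[1:] + speed[:-1])*np.diff(t))   # trapezoid rule
print(length)                                              # ~ 6.2832
```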

Problem: Write down the action \(S\) for a non-relativistic particle of mass \(m\) moving on a Riemannian manifold \((X,g)\). Hence, derive the geodesic equation. What happens if the particle is relativistic?

Solution: A nonrelativistic free particle of mass \(m\) has action \(S=\int dt L\) described by the usual nonrelativistic kinetic Lagrangian:

\[L=\frac{m}{2}|\dot{\mathbf x}|^2\]

The catch is that \(|\dot{\mathbf x}|^2=g_{ij}(\mathbf x)\dot x^{i}\dot x^{j}\) depends on the Riemannian metric \(g\) in which the particle moves. Applying the Euler-Lagrange equations, one finds the equation of motion (called the geodesic equation):

\[\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\]

where the Christoffel symbols \(\Gamma^i_{jk}(\mathbf x):=\frac{1}{2}g^{i\ell}(\mathbf x)\left(\frac{\partial g_{\ell j}}{\partial x^k}+\frac{\partial g_{\ell k}}{\partial x^j}-\frac{\partial g_{jk}}{\partial x^{\ell}}\right)\) are analogous to “fictitious forces” on the particle due to the curvature of \(X\).

By contrast, a relativistic free particle has action \(S=-mcs=-mc^2\tau\). This can be recast into non-relativistic form \(S=\int dt L\) using the relativistic kinetic Lagrangian:

\[L=-mc\sqrt{g_{\mu\nu}\dot x^{\mu}\dot x^{\nu}}\]

which in flat space \(g_{\mu\nu}=\text{diag}(1,-1,-1,-1)\) reduces to the familiar \(L=-mc^2/\gamma\) (nb. more generally, the quantity under the square root is always \(\geq 0\) for the timelike worldlines of particles). Some comments:

  • The stationary action principle \(\delta S=0\) in which the action \(S\) tends to be minimized corresponds to the well-known fact that relativistic free particles travelling in straight line geodesics through spacetime maximize the Minkowski distance \(s\), or equivalently Alice experiences more proper time (i.e. ages more) than Bob in the twin paradox.
  • The relativistic action (unlike its non-relativistic counterpart) is manifestly reparameterization invariant.

Now the Euler-Lagrange geodesic equations for the relativistic Lagrangian are almost identical to the non-relativistic case except there’s an additional term:

\[\ddot x^{\mu}+\Gamma^{\mu}_{\nu\rho}(x)\dot x^{\nu}\dot x^{\rho}=\frac{\dot L}{L}\dot x^{\mu}\]

The additional term vanishes (\(\dot L=0\)) iff the worldline is parameterized by an affine transformation \(\tau'=a\tau+b\) of the particle’s proper time \(\tau\) (prove this by showing \(\dot{\tau}'=-aL/mc^2\)). Only in this case does the relativistic geodesic equation reduce to the non-relativistic form:

\[\frac{d^2 x^{\mu}}{d\tau’^2}+\Gamma^{\mu}_{\nu\rho}(x(\tau’))\frac{dx^{\nu}}{d\tau’}\frac{dx^{\rho}}{d\tau’}=0\]

Problem: For \(X=\mathbf R^2\), calculate all non-vanishing Christoffel symbols in the polar coordinate chart \((\rho,\phi)\).

Solution: In the polar chart, the flat metric is \(ds^2=d\rho^2+\rho^2d\phi^2\), so the free-particle Lagrangian is \(L=\frac{m}{2}(\dot\rho^2+\rho^2\dot\phi^2)\). The Euler-Lagrange equations read:

\[\ddot\rho-\rho\dot\phi^2=0\]

\[\ddot\phi+\frac{2}{\rho}\dot\rho\dot\phi=0\]

Comparing with the geodesic equation \(\ddot x^i+\Gamma^i_{jk}\dot x^j\dot x^k=0\), one reads off the non-vanishing Christoffel symbols:

\[\Gamma^{\rho}_{\phi\phi}=-\rho,\qquad\Gamma^{\phi}_{\rho\phi}=\Gamma^{\phi}_{\phi\rho}=\frac{1}{\rho}\]

Note it is usually more efficient to compute Christoffel symbols this way rather than using the explicit formula in terms of partial derivatives of the metric \(g\) (though they’re of course equivalent).
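As a cross-check (my own sketch), one can also evaluate the explicit formula \(\Gamma^i_{jk}=\frac{1}{2}g^{i\ell}\left(\partial_jg_{\ell k}+\partial_kg_{\ell j}-\partial_{\ell}g_{jk}\right)\) numerically with finite differences and compare against the polar-coordinate results \(\Gamma^{\rho}_{\phi\phi}=-\rho\) and \(\Gamma^{\phi}_{\rho\phi}=1/\rho\):

```python
import numpy as np

# Cross-check (my own) of the polar-coordinate Christoffel symbols against the
# explicit formula Gamma^i_jk = (1/2) g^il (d_j g_lk + d_k g_lj - d_l g_jk),
# using central finite differences at the point (rho, phi) = (2, 0.7).
def g(q):                                   # metric ds^2 = drho^2 + rho^2 dphi^2
    rho, phi = q
    return np.array([[1.0, 0.0], [0.0, rho**2]])

def christoffel(q, h=1e-6):
    ginv = np.linalg.inv(g(q))
    dg = np.zeros((2, 2, 2))                # dg[l, j, k] = d_l g_jk
    for l in range(2):
        e = np.zeros(2); e[l] = h
        dg[l] = (g(q + e) - g(q - e))/(2*h)
    Gamma = np.zeros((2, 2, 2))             # Gamma[i, j, k]
    for i in range(2):
        for j in range(2):
            for k in range(2):
                Gamma[i, j, k] = 0.5*sum(ginv[i, l]*(dg[j, l, k] + dg[k, l, j]
                                                     - dg[l, j, k]) for l in range(2))
    return Gamma

Gamma = christoffel(np.array([2.0, 0.7]))
print(Gamma[0, 1, 1], Gamma[1, 0, 1])       # expect -rho = -2 and 1/rho = 0.5
```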

Problem: Explain how the presence of a pseudo-Riemannian metric \(g\) on a smooth manifold \(X\) allows one to identify covectors in \(T_x^*(X)\) with tangent vectors in \(T_x(X)\) at each point \(x\in X\).

Solution: Just as in \(X=\mathbf R^n\) one identifies a tangent vector \(\mathbf v\) with the covector \(\mathbf v\cdot\) via the dot product \(\cdot\), at a given \(x\in X\) one uses the pseudo-inner product \(g_x\) on \(T_x(X)\) to make the exact same identification \(v_x\leftrightarrow g_x(v_x,\cdot)\) (this is sometimes called the musical isomorphism induced by \(g\), in light of the map \(\flat_x:T_x(X)\to T^*_x(X)\) defined by \(v_x^{\flat_x}(v'_x):=g_x(v_x,v'_x)\) and its inverse \(\sharp_x=\flat_x^{-1}\)).

In some chart \(x^{\mu}\), one can write the \(1\)-form \(v^{\flat}=(v^{\flat})_{\mu}dx^{\mu}\) with components \((v^{\flat})_{\mu}=v^{\flat}(\partial_{\mu})=g(v,\partial_{\mu})=g_{\mu\nu}v^{\nu}\) where \(v^{\nu}=v(x^{\nu})\) are the components of its musically isomorphic vector field \(v=v^{\nu}\partial_{\nu}\) in the same coordinate basis. Physicists typically abuse notation by merely writing \((v^{\flat})_{\mu}=g_{\mu\nu}v^{\nu}\) as just \(v_{\mu}=g_{\mu\nu}v^{\nu}\), identifying \(v^{\flat}\equiv v\) under the musical isomorphism. But then this lazy notation makes it look as if the metric \(g\) performs a trivial mechanical action of just “lowering the index” on \(v\).

Problem: Define a tensor field \(g^*\) that performs the trivial mechanical action of “raising the index” on a \(1\)-form \(A\).

Solution: Define \(g^*_x:T^*_x(X)^2\to\mathbf R\) to be a type \((2,0)\) tensor given by:

\[g^*(A,A’):=g(A^{\sharp}, A’^{\sharp})\]

Then on the one hand, in some chart \(x^{\mu}\), one has:

\[g^*(v^{\flat},v’^{\flat})=g(v,v’)=g_{\mu\nu}v^{\mu}v’^{\nu}\]

On the other hand, in that same chart \(x^{\mu}\), one has:

\[g^*(v^{\flat},v’^{\flat})=(g^*)^{\mu\nu}(v^{\flat})_{\mu}(v’^{\flat})_{\nu}=(g^*)^{\mu\nu}g_{\mu\rho}v^{\rho}g_{\nu\sigma}v’^{\sigma}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}v^{\mu}v’^{\nu}\]

This implies \(g_{\mu\nu}=(g^*)^{\rho\sigma}g_{\mu\rho}g_{\nu\sigma}\). Since \(g\) is non-degenerate (the \(n_0=0\) axiom of a pseudo-Riemannian metric), the component matrix \(g_{\mu\nu}\) is invertible (\(\det g_{\mu\nu}\neq 0\)), so the only possibility is \((g^*)^{\mu\nu}g_{\nu\rho}=\delta^{\mu}_{\rho}\), i.e. \(g^*\sim g^{-1}\). It is common practice to just drop the \(*\) and write \(g^{\mu\nu}\) in lieu of \((g^*)^{\mu\nu}\). With this lazy notation, \(g^{\mu\nu}\) looks like it just “raises the index” on \(A^{\mu}=g^{\mu\nu}A_{\nu}\) (where one last abuse of notation \((A^{\sharp})^{\mu}=A^{\mu}\) has been made!). cf. the raising/lowering operators \(a^{\dagger},a\) in quantum mechanics.

Problem: What is the canonical volume form on a smooth, orientable pseudo-Riemannian \(n\)-manifold \(X\)? Why is it canonical?

Solution: In a chart \(x^{\mu}\) in which the pseudo-Riemannian metric has components \(g=g_{\mu\nu}dx^{\mu}\otimes dx^{\nu}\), the canonical volume form is \(\sqrt{|\det g|}dx^1\wedge…\wedge dx^n\). This is indeed a volume form because it is a nowhere-vanishing top form (\(\det g\neq 0\) everywhere, by non-degeneracy as before). Moreover, it is canonical because it does not depend on the choice of chart \(x^{\mu}\) (provided the new chart has the same orientation), as \(\sqrt{|\det g|}\) is a scalar density of weight \(1\) whereas \(dx^1\wedge…\wedge dx^n\) is a scalar density of weight \(-1\), so their product transforms as a scalar.

Problem: Let \(X\) be a smooth, \(n\)-dimensional, orientable pseudo-Riemannian manifold, and let \(\omega\in\Omega^k(X)\) be a differential \(k\)-form on \(X\). Define the Hodge dual \(n-k\)-form \(\star\omega\in\Omega^{n-k}(X)\) on \(X\).

Solution: The Hodge dual \(\star\omega\in\Omega^{n-k}(X)\) is defined to be the unique differential \(n-k\)-form such that …


Reinforcement Learning (Part \(1\))

Problem: How does the paradigm of reinforcement learning (RL) fit within the broader context of machine learning?

Solution: It is instructive to compare/contrast reinforcement learning with supervised learning (SL). In this way, it will be seen that RL can in fact be viewed as a generalization of SL:

  • The training feature vectors \(\mathbf x_i\) of SL manifest in RL as the state vector \(\mathbf x_t\) of an agent (similar to how it’s okay to conflate the position vector \(\mathbf x(t)\) of a particle in classical mechanics with the particle itself, there’s no harm in conflating the state vector \(\mathbf x_t\) with the agent itself, see map-territory relation). The difference is purely a perspective shift reflected in the choice of subscript variable \(i=1,2,3,…\) vs. \(t=0,1,2,…\).
  • In SL, the feature vectors \(\mathbf x_i\) are i.i.d. draws from the same underlying data-generating distribution. In RL, the initial state vector \(\mathbf x_{t=0}\) is drawn from an initial distribution, and subsequent state vectors \(\mathbf x_t\) are dependent on the previous state \(\mathbf x_{t-1}\) (discrete-time Markov chain).
  • In SL, the model output \(\hat y(\mathbf x|\boldsymbol{\theta})\) is analogous to an action \(\mathbf f\) output by an agent’s policy function \(\pi(\mathbf f|\mathbf x,t)\) in RL.
  • In SL, each training feature vector \(\mathbf x_i\) is labelled by its correct output \(y_i\). This concept has no analog in reinforcement learning as there is no notion of a “correct” action for the agent to take.
  • In SL, the feedback mechanism is provided by a loss function \(L(\hat y,y)\). By contrast, in RL the feedback mechanism is provided by a reward signal \(r_t\).
  • In SL, the objective is to find optimal parameters \(\boldsymbol{\theta}^*=\text{argmin}_{\boldsymbol{\theta}}\sum_{i=1}^{N_{\text{train}}}L(\hat y(\mathbf x_i|\boldsymbol{\theta}),y_i)\) minimizing the training cost over the \(N_{\text{train}}\) training examples. In RL, the objective is to find an optimal agent policy \(\pi^*=\text{argmax}_{\pi}\langle\sum_{t=0}^T\gamma^tr_t|\pi\rangle\) maximizing the expected return over an episode of horizon \(T\leq\infty\) (which depends on if/when the agent reaches a terminal state \(\mathbf x_T\)).
  • In SL, there is often a distinction between training, cross-validation, and testing datasets, with a golden rule often being that the model should obviously not be able to see any of the testing data. In RL, it is societally acceptable to train on your test set 🙂

Problem: Imagine the set of all states \(\mathbf x\), i.e. the agent’s state space. On this state space, one can impose a scalar field \(v_{\pi}(\mathbf x,t)\); what is the meaning of this field? Give a simple example thereof.

Solution: The scalar field \(v_{\pi}(\mathbf x,t)\) can roughly be thought of as describing how “valuable” the state \(\mathbf x\) is at time \(t\). More precisely, if one were to initialize an agent \(\pi\) at state \(\mathbf x_t:=\mathbf x\), then \(v_{\pi}(\mathbf x,t)\) is the expected return:

\[v_{\pi}(\mathbf x,t):=\biggr\langle\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\biggr|\mathbf x,\pi\biggr\rangle\]

Consider a \(\pi\)-creature (of \(3\)Blue\(1\)Brown fame) symbolizing an agent with policy \(\pi\). This \(\pi\)-creature has to complete a maze. Within the RL framework, this can be modelled as an MDP in which each square is one possible state \(\mathbf x\) of the agent, the maze structure defines the allowed actions \(\mathbf f\) from each state \(\mathbf x\), and one can work with an undiscounted \(\gamma=1\) return in which the reward is \(r_t=-1\) for each action taken. Thus, the optimal policy \(\pi^*\) is a time-independent, deterministic policy \(\mathbf f=\pi^*(\mathbf x)\) that gets the \(\pi\)-creature from its initial state to the terminal state in the fewest number of moves. With that in mind, one can label on top of each square \(\mathbf x\) its corresponding optimal value \(v_{\pi^*}(\mathbf x)\) (note, this image was generated using Nano Banana Pro, some of the calculated values are just wrong but the idea is clear):
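The maze example can be caricatured in a few lines (my own toy construction, a \(1\)D “corridor” rather than a real maze): value iteration with reward \(r_t=-1\) per move and \(\gamma=1\) converges to \(v_{\pi^*}(\mathbf x)=-(\text{moves remaining})\):

```python
import numpy as np

# Toy version (my own) of the maze example: a 1D "corridor" MDP with states
# 0..4, terminal state 4, actions left/right, reward -1 per move, gamma = 1.
# Value iteration should converge to v*(x) = -(number of moves to reach 4).
n_states, terminal = 5, 4
v = np.zeros(n_states)
for _ in range(100):                                        # value-iteration sweeps
    for x in range(n_states):
        if x == terminal:
            continue
        moves = [max(x - 1, 0), min(x + 1, n_states - 1)]   # allowed actions
        v[x] = max(-1.0 + v[x2] for x2 in moves)            # Bellman optimality backup
print(v)   # v* = [-4, -3, -2, -1, 0]
```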

Problem: The value function \(v_{\pi}(\mathbf x,t)\) is passive; it simply assesses how good/bad it is to be at \(\mathbf x_t:=\mathbf x\). In order to remedy this, one can look at the quality function \(q_{\pi}(\mathbf x,\mathbf f,t)\); explain how this takes on a more active role compared to the passive nature of the value function \(v_{\pi}(\mathbf x,t)\).

Solution: Because \(q_{\pi}(\mathbf x,\mathbf f,t)\) assesses the quality of the agent choosing the action \(\mathbf f_t:=\mathbf f\) while in state \(\mathbf x_t:=\mathbf x\). That is, it is a more refined conditional expected return:

\[q_{\pi}(\mathbf x,\mathbf f,t)=\biggr\langle\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\biggr|\mathbf x,\mathbf f,\pi\biggr\rangle\]

Problem: How come the agent’s policy, value function and quality function are sometimes respectively written as \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) without the \(t\) argument?

Solution: There could be \(3\) reasons:

  • Similar to the above maze example, sometimes these functions just don’t depend on \(t\).
  • If one is seeking to maximize the agent’s expected return over an infinite horizon \(T=\infty\); in such a case, it is both intuitively and mathematically clear that \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) should all be invariant with respect to time translations.
  • If the horizon is finite \(T<\infty\), then it is standard to package \(\mathbf x\) and \(t\) together, thereby working with an “augmented” state vector \(\mathbf x’:=(\mathbf x,t)\). Then everything can be made to look identical to the infinite horizon \(T=\infty\) case by replacing \(\mathbf x\mapsto\mathbf x’\).

(cf. distinction between state variables and path variables in thermodynamics).

Problem: State and prove the law of iterated expectation.

Solution: If \(X,Y\) are any random variables, then:

\[\langle X\rangle=\langle\langle X|Y\rangle\rangle\]

The proof is easy as long as one is careful to interpret all the expectations correctly. For instance, \(\langle X|Y\rangle\) is not a scalar but a random variable with respect to \(Y\):

\[\langle X|Y\rangle=\int dx p(x|Y) x\]

Thus, it is clear that the outer expectation must also be with respect to \(Y\) alone:

\[\langle\langle X|Y\rangle\rangle=\int dy p(y)\langle X|y\rangle=\int dxdy p(y)p(x|y)x\]

Rewriting in terms of the joint distribution \(p(x,y)=p(y)p(x|y)\) and integrating out \(\int dy p(x,y)=p(x)\), one finally obtains:

\[=\int dx p(x) x=\langle X\rangle\]

Of course, this also generalizes easily to identities such as the following equality of random variables (w.r.t. \(Z\)):

\[\langle X|Z\rangle=\langle\langle X|Y,Z\rangle|Z\rangle\]
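The law of iterated expectation is easy to verify exactly on a small discrete example (my own numbers):

```python
import numpy as np

# Exact discrete check (my own numbers) of E[X] = E[E[X|Y]] for a joint
# distribution p(x, y) over x in {0, 1, 2} and y in {0, 1}.
p = np.array([[0.10, 0.20],      # p[x, y]
              [0.05, 0.25],
              [0.30, 0.10]])
xs = np.array([0.0, 1.0, 2.0])

E_X = np.sum(xs[:, None]*p)                       # direct expectation
p_y = p.sum(axis=0)                               # marginal p(y)
E_X_given_y = (xs[:, None]*p).sum(axis=0)/p_y     # array of E[X | Y = y]
E_iter = np.sum(p_y*E_X_given_y)                  # outer expectation over Y
print(E_X, E_iter)   # equal
```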

Problem: Show that the value function \(v_{\pi}(\mathbf x)\) satisfies the following identity (to be used as a lemma later):

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

Solution: Denoting the return random variable by:

\[R_t:=\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\]

The fundamental observation is that \(R_t\) obeys a simple recurrence relation:

\[R_t=r_{t+1}+\gamma R_{t+1}\]

Taking suitable conditional expectations thereof:

\[\langle R_t|\mathbf x_t,\pi\rangle=\langle r_{t+1}|\mathbf x_t,\pi\rangle+\gamma\langle R_{t+1}|\mathbf x_t,\pi\rangle\]

The LHS is the definition of \(v_{\pi}(\mathbf x_t)\). Applying the law of iterated expectation to the \(2^{\text{nd}}\) term on the RHS:

\[\langle R_{t+1}|\mathbf x_t,\pi\rangle=\langle\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle|\mathbf x_t,\pi\rangle\]

The Markov property ensures that \(\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle=\langle R_{t+1}|\mathbf x_{t+1},\pi\rangle=v_{\pi}(\mathbf x_{t+1})\). One thus obtains the desired result:

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

Problem: How are \(v_{\pi}(\mathbf x_t)\) and \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) related to each other?

Solution: Apply the law of iterated expectations:

\[v_{\pi}(\mathbf x_t):=\langle R_t|\mathbf x_t,\pi\rangle=\langle\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle|\mathbf x_t,\pi\rangle=\langle q_{\pi}(\mathbf x_t,\mathbf f_t)|\mathbf x_t,\pi\rangle\]

Since \(\mathbf x_t\) is being conditioned upon, the expectation must be over \(\mathbf f_t\) so one can explicitly write it as:

\[=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)q_{\pi}(\mathbf x_t,\mathbf f_t)\]

But trivially \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\). Thus, one has an expression for \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). Now one would like to proceed the other way, relating \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) to \(v_{\pi}\). This can be achieved by fleshing out the expectation from the earlier lemma:

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

\[=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

Further expand \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)\) and recognize again \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\) and \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)=p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) because the action \(\mathbf f_t\) is already fixed. Moving the summation \(\sum_{\mathbf f_t}\) to the outside so as to compare with the earlier identity expressing \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), one can pattern-match:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

(which in hindsight is manifest…)

Problem: Hence, deduce the Bellman equations for the value and quality functions.

Solution: To obtain the Bellman equation for \(v_{\pi}(\mathbf x_t)\), substitute the above expression for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) into the expression for \(v_{\pi}(\mathbf x_t)\):

\[v_{\pi}(\mathbf x_t)=\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

Similarly, to get the Bellman equation for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) substitute vice versa:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\left(r_{t+1}+\gamma\sum_{\mathbf f_{t+1}}p(\mathbf f_{t+1}|\mathbf x_{t+1},\pi)q_{\pi}(\mathbf x_{t+1},\mathbf f_{t+1})\right)\]

Intuitively, the Bellman equations relate the value of every state with the values of states that can be transitioned into, thereby providing a system of simultaneous equations that in principle can be solved to deduce all state values. A key caveat for this dynamic programming approach to work is that one must have complete knowledge of the environment dynamics \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) (similar remarks apply to the qualities of different state-action pairs).
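Since the Bellman equation for \(v_{\pi}\) is linear in the state values, for a small finite state space the system can be solved directly as \(v=r_{\pi}+\gamma P_{\pi}v\). Here is a sketch on a made-up \(2\)-state chain (with the policy already folded into the transition matrix and expected rewards):

```python
import numpy as np

gamma = 0.9
# made-up 2-state chain with the policy pi folded in:
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])   # P_pi[s, s'] = p(s'|s) under pi
r_pi = np.array([1.0, 0.0])     # expected one-step reward from each state

# Bellman equation v = r_pi + gamma P_pi v  <=>  (I - gamma P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

The solution satisfies the Bellman equation by construction; with complete knowledge of the dynamics, dynamic programming is just (possibly iterative) linear algebra.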

Problem: The above Bellman equations apply to a generic agent policy \(\pi\); how do they specialize to the case of an optimal policy \(\pi^*\)?

Solution: Reverting back to the form of the “pre-Bellman” equations and using the key insight that the optimal policy \(\pi^*(\mathbf f_t|\mathbf x_t)=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)}\) should only assign non-zero probability to actions of the best quality, one has:

\[v_{\pi^*}(\mathbf x_t)=\text{max}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)\]

\[q_{\pi^*}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi^*}(\mathbf x_{t+1}))\]

which lead to the Bellman optimality equations.

Problem: In light of Bellman optimality, discuss how policy evaluation and policy improvement work. Hence, explain the notion of generalized policy iteration (GPI).

Solution: Imagine the space of all \((\pi,v)\) pairs, where \(\pi(\mathbf f_t|\mathbf x_t)\) is any policy distribution and \(v(\mathbf x_t)\) is the value function of some policy. Within this space, there are \(2\) canonical subspaces. One is the set \((\pi,v_{\pi})\) of all pairs where \(v_{\pi}\) is specifically the value function associated to policy \(\pi\), and the other subspace \((\pi_v,v)\) where \(\pi_v(\mathbf f_t|\mathbf x_t):=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\) is greedy with respect to the value function \(v\), where \(q(\mathbf x_t,\mathbf f_t)=\langle r_{t+1}+\gamma v(\mathbf x_{t+1})|\mathbf x_t,\mathbf f_t\rangle\) (in particular, notice \(q\neq q_{\pi}\); the policy \(\pi\) does not appear in the definition of \(q\)).

  • Policy evaluation is an algorithm for projecting \((v,\pi)\mapsto (v_{\pi},\pi)\). The idea is to view \(v\) as a random initial guess for the underlying true policy value function \(v_{\pi}\) (though any terminal states \(\mathbf x_T\) should be initialized \(v(\mathbf x_T):=0\)). Then, sweeping through each state \(\mathbf x_t\), one updates the value of \(v(\mathbf x_t)\) using the Bellman equation involving the (known) randomly initialized values of the other states:

\[v(\mathbf x_t)\mapsto\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{\mathbf x_{t+1},r_{t+1}}p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v(\mathbf x_{t+1}))\]

With sufficiently many sweeps over the state space, general theorems guarantee convergence \(v\to v_{\pi}\).

  • Policy improvement is an algorithm for projecting onto the other canonical subspace \((v,\pi)\mapsto (v,\pi_v)\). Essentially, one just acts greedily:
    \[\pi(\mathbf f_t|\mathbf x_t)\mapsto\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\]
    This update is rigorously justified by the policy improvement theorem.

Policy iteration roughly amounts to alternating between the \(2\) steps of policy evaluation and policy improvement. However, there is some leeway in how one alternates: for instance, one need not run the policy evaluation step to convergence, but can instead perform just a single sweep over the state space each time (this special case of generalized policy iteration is called value iteration). Such generalized policy iteration schemes can be shown to still funnel/converge onto the optimal policy \(\pi^*\), as one can prove that \(\pi\) is greedy with respect to its own value function \(v_{\pi}\) iff \(\pi=\pi^*\).
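Value iteration can be sketched on a made-up \(1\)-dimensional chain MDP (deterministic moves, reward \(1\) received upon entering the terminal state):

```python
import numpy as np

# Value iteration on a made-up chain MDP: states 0..3, state 3 terminal;
# actions move left/right deterministically; reward 1 on entering state 3.
gamma = 0.9
n_states, actions = 4, (-1, +1)

def step(s, a):
    s2 = min(max(s + a, 0), n_states - 1)
    r = 1.0 if (s2 == n_states - 1 and s != n_states - 1) else 0.0
    return s2, r

v = np.zeros(n_states)
for _ in range(100):  # each sweep: one-step lookahead + max over actions
    v = np.array([0.0 if s == n_states - 1 else
                  max(r + gamma * v[s2] for s2, r in (step(s, a) for a in actions))
                  for s in range(n_states)])

# v converges to v* = [gamma^2, gamma, 1, 0]
```

The values funnel onto \(v^*(\mathbf x)=\gamma^{d(\mathbf x)-1}\) with \(d(\mathbf x)\) the distance to the goal, exactly as the Bellman optimality equation predicts.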

Problem: Distinguish between model-free vs. model-based methods/algorithms in reinforcement learning.

Solution: In both cases, the fundamental limitation is incomplete knowledge of the environment dynamics \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (complete knowledge was previously taken for granted); instead, one can only sample trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=0},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=1},…\) by running a policy \(\pi\) through the MDP. Recalling that the fundamental objective of RL is to compute \(\pi^*=\text{argmax}_{\pi}\langle R_t|\pi\rangle\), a model-free method is one which does not attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (here \(p\) is what the word “model” refers to, i.e. a model of the environment dynamics), whereas model-based methods do attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) and thereby obtain \(\pi^*\) through GPI as outlined earlier (see the OpenAI Spinning Up documentation for a nice diagram of this taxonomy and further comments).

Problem: Explain how Monte Carlo evaluation and Monte Carlo control are both model-free methods that accomplish the same actions as their respective model-based cousins, i.e. policy evaluation and policy improvement.

Solution: Fix \(\pi\). The key is to recognize that \(v_{\pi}(\mathbf x_t)\) is an expectation of the return random variable starting at \(\mathbf x_t\), and Monte Carlo point estimators can always be applied to approximate expectation values:

\[v_{\pi}(\mathbf x_t)=\langle R_t|\mathbf x_t,\pi\rangle\approx\frac{1}{N_{\mathbf x_t}}\sum_{n=1}^{N_{\mathbf x_t}}R_n\]

where \(N_{\mathbf x_t}\) is the number of times the MDP passes through state \(\mathbf x_t\) in however many trajectories, and \(R_n\) is the return following the \(n^{\text{th}}\) pass through \(\mathbf x_t\) within some MDP trajectory.

Lemma: the arithmetic mean \(\hat{\mu}_n:=\frac{1}{n}\sum_{i=1}^nx_i\) obeys the recurrence relation \(\hat{\mu}_n=\hat{\mu}_{n-1}+\frac{1}{n}(x_n-\hat{\mu}_{n-1})\)

Proof: trivial, but instructive to highlight the convex linear combination nature of the recurrence \(\hat{\mu}_n=\left(1-\frac{1}{n}\right)\hat{\mu}_{n-1}+\frac{1}{n}x_n\), so for large \(n\gg 1\), the majority \(1-1/n\) of the weight is given to the prior value of the mean \(\hat{\mu}_{n-1}\), only a tiny sliver \(1/n\) is given to the new sample \(x_n\).
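The lemma is easily checked numerically on some arbitrary samples:

```python
# Check the running-mean recurrence mu_n = mu_{n-1} + (x_n - mu_{n-1})/n
# against the batch arithmetic mean (sample values chosen arbitrarily).
xs = [2.0, -1.0, 4.0, 0.5, 3.0]

mu = 0.0
for n, x in enumerate(xs, start=1):
    mu += (x - mu) / n  # convex combination: (1 - 1/n) mu + (1/n) x

batch_mean = sum(xs) / len(xs)
```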

Thus, rather than waiting for a bunch of trajectories to finish before applying the Monte Carlo estimate above, one can introduce an intermediate estimator initialized at \(\hat v_{\pi}(\mathbf x_t):=0\) that applies (at the termination of each trajectory) the following iterative rule for each time \(n=1,2,…,N_{\mathbf x_t}\) that \(\mathbf x_t\) was visited:

\[\hat v_{\pi}(\mathbf x_t)\mapsto\hat v_{\pi}(\mathbf x_t)+\alpha_n(R_n-\hat v_{\pi}(\mathbf x_t))\]

where \(\alpha_n:=1/n\). Since this is just a mathematical decomposition of the original Monte Carlo estimate, it is an equally valid model-free method for \(v_{\pi}\) estimation. Now comes a curveball: one common modification is to replace \(\alpha_n\mapsto\alpha\) with an \(n\)-independent constant \(\alpha\in[0,1]\) (this is called constant-\(\alpha\) MC).
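As a sketch, here is constant-\(\alpha\) every-visit MC evaluation on a made-up deterministic chain (the policy always moves right, so every episode yields identical returns and the estimates converge geometrically onto the exact values):

```python
gamma, alpha = 0.9, 0.1
v_hat = [0.0] * 4  # states 0..3, state 3 terminal

for _episode in range(300):
    # deterministic trajectory: visit states 0, 1, 2, then terminate in 3;
    # rewards r_{t+1} are 0, 0, 1 along the way (made-up chain)
    states, rewards = [0, 1, 2], [0.0, 0.0, 1.0]
    G = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + gamma * G                # return following the visit to s
        v_hat[s] += alpha * (G - v_hat[s])  # constant-alpha MC update

# v_hat converges to the true values v(0) = gamma^2, v(1) = gamma, v(2) = 1
```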

If one were presented with the recurrence relation \(\hat{\mu}_n=\hat{\mu}_{n-1}+\frac{1}{n}(x_n-\hat{\mu}_{n-1})\) and initial condition \(\hat{\mu}_0=0\), one could claim the explicit solution to be \(\hat{\mu}_n=\frac{1}{n}\sum_{i=1}^nx_i\). However, if instead one were presented with the recurrence relation \(\hat{\mu}’_n=\hat{\mu}’_{n-1}+\alpha(x_n-\hat{\mu}’_{n-1})\) and same initial condition \(\hat{\mu}’_0=0\), what would be the solution for \(\hat{\mu}’_n\)? It’s easy to check \(\hat{\mu}’_n=\alpha\sum_{i=1}^n(1-\alpha)^{n-i}x_i\); thus, recent samples are given exponentially more weight, and it’s biased towards underestimating \(\langle\hat{\mu}’_n\rangle-\mu=-\mu(1-\alpha)^n\), though is asymptotically unbiased \(\lim_{n\to\infty}\langle\hat{\mu}’_n\rangle=\mu\). However, unlike the asymptotically vanishing variance \(\sigma^2_{\hat{\mu}_n}=\sigma^2/n\) of the standard mean estimator \(\hat{\mu}_n\), the variance \(\sigma^2_{\hat{\mu}’_n}=\frac{\alpha\sigma^2}{2-\alpha}(1-(1-\alpha)^{2n})\to\frac{\alpha\sigma^2}{2-\alpha}\approx\alpha\sigma^2/2\) of \(\hat{\mu}’_n\) is bounded from below, hence the estimate \(\hat{\mu}’_n\) will forever oscillate about \(\mu\). It should make sense \(\sigma^2_{\hat{\mu}’_n}\propto\alpha\), i.e. a larger step size \(\alpha\) means bigger oscillations, though it also means faster convergence of \(\hat{\mu}’_n\to\mu\) because of the \((1-\alpha)^n\).
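The claimed closed-form solution \(\hat{\mu}’_n=\alpha\sum_{i=1}^n(1-\alpha)^{n-i}x_i\) of the constant-\(\alpha\) recurrence is easily verified numerically:

```python
# Check the closed form mu'_n = alpha * sum_i (1 - alpha)^(n-i) x_i
# against the recurrence mu'_n = mu'_{n-1} + alpha (x_n - mu'_{n-1}),
# mu'_0 = 0, on arbitrary samples.
alpha = 0.3
xs = [2.0, -1.0, 4.0, 0.5, 3.0]

mu = 0.0
for x in xs:
    mu += alpha * (x - mu)

n = len(xs)
closed = alpha * sum((1 - alpha) ** (n - i) * x for i, x in enumerate(xs, start=1))
```

The exponential weights \((1-\alpha)^{n-i}\) make the recency bias explicit.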

This all raises the question: why bother with constant-\(\alpha\) MC? One standard answer: during generalized policy iteration the policy keeps changing, so the target \(v_{\pi}\) is nonstationary; the exponential forgetting of constant-\(\alpha\) MC allows the estimate to track such a moving target, whereas the \(1/n\) weights of the standard mean estimator would remain dominated by stale returns from old policies.

Having used constant-\(\alpha\) MC to estimate \(v_{\pi}(\mathbf x_t)\) in a model-free way for some fixed agent policy \(\pi\), one can naturally ask about model-free estimation of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). It turns out to just be:

\[\hat q_{\pi}(\mathbf x_t,\mathbf f_t)\mapsto\hat q_{\pi}(\mathbf x_t,\mathbf f_t)+\alpha(R_n-\hat q_{\pi}(\mathbf x_t,\mathbf f_t))\]

Conceptually, the standard way to convince oneself of this is that because \(\pi\) was fixed, it could be thought of as part of the incomplete knowledge of the environment dynamics \(p\) so that the agent never takes any actions but just lets the environment push its state around. This perspective shift from MDP trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=0},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=1},…\) to MRP trajectories \(\mathbf x_{t=0},r_{t=0},\mathbf x_{t=1},r_{t=1},…\) amounts to viewing the MRP states \(\mathbf x_t\) as corresponding to augmented state-action pairs \((\mathbf x_t,\mathbf f_t)\) in the MDP. But then it is mathematically apparent that \(v_{\pi}(\mathbf x_t)\) in the MRP is just \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) in the MDP, so constant-\(\alpha\) MC works for \(q_{\pi}\)-evaluation! Thus, Monte Carlo control chooses to be greedy with respect to \(q_{\pi}\), and since \(q_{\pi}\) could be estimated in a model-free way, this therefore implements model-free policy improvement.

Problem: Explain the exploration-exploitation tradeoff in reinforcement learning and how a soft policy \(\pi\) can help to address this.

Solution: Greedily exploiting high-value states all the time could lead the agent to settle into a policy that is only “locally” optimal rather than globally optimal; the only way to know for sure is to explore all state-action sequences, for instance by ensuring that one’s policy is soft: \(\pi(\mathbf f_t|\mathbf x_t)>0\) for all states \(\mathbf x_t\) and actions \(\mathbf f_t\). A crude way to guarantee softness during Monte Carlo control is to use an \(\varepsilon\)-greedy policy (a better name would be either a \((1-\varepsilon)\)-greedy policy or an \(\varepsilon\)-soft policy), though there are other, more refined ways to do so.

Problem: Having distinguished between model-based and model-free methods in RL, one can make an orthogonal distinction between on-policy and off-policy methods. Explain and motivate this taxonomy.

Solution: The constant-\(\alpha\) MC \(q_{\pi}\)-evaluation followed by \(\varepsilon\)-greedy MC control is a simple model-free, on-policy RL algorithm for discovering \(\pi^*\). This is because the trajectory samples used to evaluate \(q_{\pi}\) came from the same policy \(\pi\) that is subsequently being improved. However, in practice the trajectory samples may come from some (known) behavior policy \(\pi_b\) even though the goal is to evaluate \(q_{\pi}\) so as to improve another, different (known) target policy \(\pi\neq\pi_b\). This is where off-policy methods come in (cf. the distinction between on-shell vs. off-shell particles in physics). The obvious way to bridge \(\pi_b\) and \(\pi\) is via importance sampling:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle\]

\[\approx\frac{1}{N_{\mathbf x_t}}\sum_{n=1}^{N_{\mathbf x_t}}\frac{\pi(\mathbf f_{t+1}|\mathbf x_{t+1})…\pi(\mathbf f_T|\mathbf x_T)}{\pi_b(\mathbf f_{t+1}|\mathbf x_{t+1})…\pi_b(\mathbf f_T|\mathbf x_T)}R_n\]

where it is essential to emphasize that the trajectories in this sum are sampled under the behavior policy \(\pi_b\)! This is model-free because all the transition probabilities have cancelled out in the importance sampling quotient.
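The identity underlying this (reweighting by \(\pi/\pi_b\) converts an expectation under \(\pi_b\) into one under \(\pi\)) can be checked by exact enumeration on a made-up one-step example:

```python
# Importance-sampling identity on a one-step toy problem: the expected
# reward under the target policy pi, computed from behavior-policy pi_b
# weights by exact enumeration (all numbers made up).
pi   = {"a": 0.8, "b": 0.2}   # target policy
pi_b = {"a": 0.5, "b": 0.5}   # behavior policy (must cover pi's support)
reward = {"a": 1.0, "b": -1.0}

# direct expectation under pi:
direct = sum(pi[f] * reward[f] for f in pi)

# expectation under pi_b of the importance-weighted reward:
weighted = sum(pi_b[f] * (pi[f] / pi_b[f]) * reward[f] for f in pi)
```

With samples in place of the exact sums, the same cancellation is what makes the off-policy estimator above work.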

Posted in Blog | Leave a comment

Density Functional Theory

Problem: In one sentence, what is the essence of DFT?

Solution: To replace \(\Psi\mapsto n\), where the number density of a system of \(N\) identical quantum particles (usually electrons) \(n(\mathbf x)\) is:

\[n(\mathbf x):=N\int d^3\mathbf x_2…d^3\mathbf x_N |\Psi(\mathbf x,\mathbf x_2,…,\mathbf x_N)|^2\]

in terms of its \(N\)-body wavefunction \(\Psi(\mathbf x,\mathbf x_2,…,\mathbf x_N)\). Thus, \(n\) is normalized such that \(\int d^3\mathbf x\,n(\mathbf x)=N\).
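As a numerical sanity check (in a \(1\)-dimensional analogue, with a made-up product state rather than a properly antisymmetrized fermionic one), the density indeed integrates to \(N\):

```python
import numpy as np

# 1-D analogue of the density formula: for N = 2 particles in the
# product state Psi(x1, x2) = phi(x1) phi(x2) (illustrative only; a real
# fermionic state would be antisymmetrized over space and spin), check
# that n(x) = N * int dx2 |Psi(x, x2)|^2 integrates to N.
x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
phi = np.pi ** -0.25 * np.exp(-x ** 2 / 2)   # normalized Gaussian orbital

Psi = np.outer(phi, phi)                      # Psi(x1, x2) on the grid
n = 2 * (np.abs(Psi) ** 2).sum(axis=1) * dx   # integrate out x2
total = n.sum() * dx                          # should equal N = 2
```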

Problem: Write down the non-relativistic Hamiltonian \(H\) of a molecule. Attempt to express the expected energy \(E=\langle\Psi|H|\Psi\rangle\) in the many-body state \(|\Psi\rangle\in\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\) in terms of \(n(\mathbf x)\) defined above; this functional \(E=E[n]\) of the density \(n\) explains the name “density functional theory”.

Solution: (the Born-Oppenheimer approximation is implicitly used whereby the electrons move on an effective/external potential energy surface \(V_{\text{ext}}(\mathbf x)\) defined by a collection of stationary nuclei)

\[H=\sum_{i=1}^N\left(\frac{|\mathbf P_i|^2}{2m}+V_{\text{ext}}(\mathbf X_i)\right)+\frac{1}{2}\sum_{1\leq i\neq j\leq N}\frac{\alpha\hbar c}{|\mathbf X_i-\mathbf X_j|}\]

One can check that:

\[E=\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x_1…d^3\mathbf x_N\biggr|\frac{\partial\Psi}{\partial\mathbf x_i}\biggr|^2+\int d^3\mathbf x n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n_2(\mathbf x,\mathbf x’)}{|\mathbf x-\mathbf x’|}\]

where \(n_2(\mathbf x,\mathbf x’)=N(N-1)\int d^3\mathbf x_3…d^3\mathbf x_N|\Psi(\mathbf x,\mathbf x’,\mathbf x_3,…,\mathbf x_N)|^2\) so in particular \((N-1)n(\mathbf x)=\int d^3\mathbf x’ n_2(\mathbf x,\mathbf x’)\).

Problem: Write down the Thomas-Fermi approximation for \(E[n]\), and explain why it sucks. Also derive the von Weizsäcker correction to the Thomas-Fermi energy functional.

Solution: The Thomas-Fermi functional treats the electrons as a locally ideal Fermi gas, thus having the usual Pauli pressure \(p\propto n^{5/3}\) in \(\mathbf R^3\) and thus (kinetic) energy density \(3p/2\) scaling likewise. Furthermore, the exchange-correlation functional is taken to vanish \(E_{\text{xc}}[n]=0\) (thus, Thomas-Fermi is often said to be semiclassical):

\[E_{\text{TF}}[n]=\frac{3\hbar^2(3\pi^2)^{2/3}}{10m}\int d^3\mathbf x n^{5/3}(\mathbf x)+\int d^3\mathbf x n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n(\mathbf x)n(\mathbf x’)}{|\mathbf x-\mathbf x’|}\]

One has:

\[\frac{\delta}{\delta n(\mathbf x)}\left(E_{\text{TF}}-\mu\left(\int d^3\mathbf xn(\mathbf x)-N\right)\right)=\frac{\hbar^2k^2_F(\mathbf x)}{2m}+V_{\text{eff}}(\mathbf x)-\mu=0\]

where \(k_F(\mathbf x):=(3\pi^2n(\mathbf x))^{1/3}\) and \(V_{\text{eff}}(\mathbf x)=V_{\text{ext}}(\mathbf x)+\alpha\hbar c\int d^3\mathbf x’\frac{n(\mathbf x’)}{|\mathbf x-\mathbf x’|}\).

Aside: although the Thomas-Fermi model is meant to be applied to molecules, a result of Teller showed that \(E_{\text{TF molecule}}>2E_{\text{TF atom}}\), so Thomas-Fermi theory predicts the molecular state is unstable with respect to dissociation. In a central external potential such as due to a single nucleus \(V_{\text{ext}}(r)=-Z\alpha\hbar c/r\), the effective potential \(V_{\text{eff}}(r)\) and electron density \(n(r)\) are both isotropic as well. Starting from Poisson’s equation:

\[\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\frac{V_{\text{eff}}(r)}{-e}=-\frac{-en(r)}{\varepsilon_0}\]

with \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2f(r)=\frac{1}{r^2}\frac{d}{dr}\left(r^2\frac{df}{dr}\right)=\frac{1}{r}\frac{d^2}{dr^2}(rf(r))\) (use the latter) one can check that by defining \(f(r):=\left(\frac{8}{3\pi a_0}\right)^2rk^2_F(r)\), one recovers the Thomas-Fermi ODE:

\[\frac{d^2 f}{dr^2}=\frac{f^{3/2}(r)}{\sqrt{r}}\]

The von Weizsäcker term is a correction to the kinetic energy functional in the Thomas-Fermi model that accounts for density gradients:

\[T_{\text{vW}}[n]=\frac{\hbar^2}{8m}\int d^3\mathbf x\frac{|\partial n/\partial\mathbf x|^2}{n(\mathbf x)}\]

Heuristically, it can be justified by writing the expected kinetic energy of a single electron \(\frac{\hbar^2}{2m}\int d^3\mathbf x |\partial\psi/\partial\mathbf x|^2\) with wavefunction \(\psi(\mathbf x)\in\mathbf R\) and replacing \(\psi(\mathbf x)\mapsto\sqrt{n(\mathbf x)}\).
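This heuristic can be checked numerically: for a single real orbital \(\psi=\sqrt{n}\), the von Weizsäcker functional reproduces the exact kinetic energy. A \(1\)-dimensional sketch with \(\hbar=m=1\) and a Gaussian orbital (the harmonic oscillator ground state, whose kinetic energy is \(1/4\)):

```python
import numpy as np

# For a single real orbital psi = sqrt(n), the von Weizsacker functional
# is exact: (1/8) int |n'|^2 / n = (1/2) int |psi'|^2  (hbar = m = 1).
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
psi = np.pi ** -0.25 * np.exp(-x ** 2 / 2)  # Gaussian orbital
n = psi ** 2

dpsi = np.gradient(psi, dx)
dn = np.gradient(n, dx)

T_exact = 0.5 * (dpsi ** 2).sum() * dx       # (1/2) int |psi'|^2 = 1/4
T_vW = 0.125 * (dn ** 2 / n).sum() * dx      # von Weizsacker functional
```

The equality is an identity: \(|n'|^2/n=(2\psi\psi')^2/\psi^2=4|\psi'|^2\), so the \(1/8\) prefactor reproduces the \(1/2\).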

Problem: Prove the two Hohenberg-Kohn theorems.

Solution: First, it is useful to prove:

  • Lemma: If \(V_{\text{ext}}(\mathbf x)\neq V’_{\text{ext}}(\mathbf x)\) represent \(2\) physically distinct external potentials (i.e. they differ by more than just some additive constant), then the corresponding molecular Hamiltonians \(H\neq H’\) do not share any non-zero eigenstates.
  • Proof: Suppose for sake of contradiction that there exists \(|\Psi\rangle\neq 0\) such that
    \[H|\Psi\rangle=E|\Psi\rangle\]
    \[H’|\Psi\rangle=E’|\Psi\rangle\]
    Subtracting, one obtains:
    \[(H’-H)|\Psi\rangle=(E’-E)|\Psi\rangle\]
    Inspecting the form of the molecular Hamiltonian \(H\) given earlier, one sees that the “universal” part (i.e. the kinetic energy and electron-electron Coulomb repulsion terms) cancel, leaving only the non-universal residue \(H’-H=\sum_{i=1}^NV’_{\text{ext}}(\mathbf X_i)-V_{\text{ext}}(\mathbf X_i)\). But the RHS \(E’-E\) is \(\mathbf X_i\)-independent for all \(i=1,…,N\); this means \(V’_{\text{ext}}(\mathbf x)-V_{\text{ext}}(\mathbf x)=\text{const}\) which is a contradiction.

With this in mind, the first HK theorem follows by specializing to the respective (distinct by the above lemma) ground states \(|\Psi_0\rangle\neq|\Psi’_0\rangle\) of \(H\) and \(H’\). Apply the variational bound both ways (the inequalities are strict because, by the lemma, \(|\Psi’_0\rangle\) is not an eigenstate of \(H\), nor \(|\Psi_0\rangle\) of \(H’\)):

\[E_0<\langle\Psi’_0|H|\Psi’_0\rangle=E’_0+\int d^3\mathbf x\,n’_0(\mathbf x)\left(V_{\text{ext}}(\mathbf x)-V’_{\text{ext}}(\mathbf x)\right)\]

\[E’_0<\langle\Psi_0|H’|\Psi_0\rangle=E_0+\int d^3\mathbf x\,n_0(\mathbf x)\left(V’_{\text{ext}}(\mathbf x)-V_{\text{ext}}(\mathbf x)\right)\]

where \(n_0,n’_0\) are the respective ground state densities. If one had \(n_0=n’_0\), then adding the two inequalities would yield the contradiction \(E_0+E’_0<E_0+E’_0\). Hence physically distinct external potentials produce distinct ground state densities; equivalently, the ground state density \(n_0(\mathbf x)\) uniquely determines \(V_{\text{ext}}(\mathbf x)\) (up to an additive constant) and thereby the entire Hamiltonian \(H\). This is the first HK theorem. The second HK theorem then follows by defining the universal functional \(F[n]:=\min_{\Psi\mapsto n}\langle\Psi|T+V_{ee}|\Psi\rangle\) over all states \(\Psi\) with density \(n\) (Levy constrained search), so that the density functional \(E[n]:=F[n]+\int d^3\mathbf x\,n(\mathbf x)V_{\text{ext}}(\mathbf x)\) satisfies \(E[n]\geq E_0\) for every admissible density, with equality iff \(n=n_0\); minimizing \(E[n]\) over densities therefore recovers the ground state energy and density.

Problem: Demonstrate the first Hohenberg-Kohn theorem for \(2\) electrons in a \(1\)-dimensional harmonic trap.

Solution:

Problem: Derive the Kohn-Sham equations.

Solution: Introduce a fictitious system of \(N\) non-interacting electrons constrained to reproduce the same density \(n(\mathbf x)\) as the interacting system, writing \(n(\mathbf x)=\sum_{i=1}^N|\psi_i(\mathbf x)|^2\) in terms of orthonormal single-particle orbitals \(\psi_i\). Decomposing the energy functional as \(E[n]=T_s[n]+\int d^3\mathbf x\,n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n(\mathbf x)n(\mathbf x’)}{|\mathbf x-\mathbf x’|}+E_{\text{xc}}[n]\) (with \(T_s\) the kinetic energy of the fictitious electron gas and \(E_{\text{xc}}\) absorbing everything else) and varying with respect to the orbitals (with Lagrange multipliers \(\varepsilon_i\) enforcing normalization) yields the Kohn-Sham equations:

\[\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2+V_{\text{ext}}(\mathbf x)+\alpha\hbar c\int d^3\mathbf x’\frac{n(\mathbf x’)}{|\mathbf x-\mathbf x’|}+\frac{\delta E_{\text{xc}}}{\delta n(\mathbf x)}\right)\psi_i(\mathbf x)=\varepsilon_i\psi_i(\mathbf x)\]

to be solved self-consistently, since the effective potential itself depends on the density \(n\).

Problem:

Posted in Blog | Leave a comment

PyTorch Fundamentals (Part \(2\))

Problem: Do an end-to-end walkthrough of the PyTorch machine learning workflow using the most basic univariate linear regression example. In particular, generate some linear data over a normalized feature space (whose slope \(w\) and intercept \(b\) would in practice be a priori unknown), split that linear data into training and testing subsets (no cross-validation dataset needed for this simple example), define the linear layer class, instantiate a model object of the class, and starting from random values of \(w\) and \(b\), use stochastic gradient descent with learning rate \(\alpha=0.01\) to minimize the training cost function \(C_{\text{train}}(w,b)\) based on an \(L^1\) loss. Iterate SGD for \(300\) epochs, and for every \(20\) epochs, record the current value of \(C_{\text{train}}(w,b)\) and the current value of \(C_{\text{test}}(w,b)\). Plot these cost function curves as a function of the epoch number \(0, 20, 40,…\). Save the final state dictionary of the model’s learned parameters \(w,b\) post-training, and load it back onto a new instance of the model class.

Solution:

pytorch_workflow
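The embedded notebook contains the full walkthrough; the skeleton of the workflow might look like the following sketch (assuming made-up true parameters \(w=0.7\), \(b=0.3\)):

```python
import torch
from torch import nn

torch.manual_seed(42)

# Generate linear data over a normalized feature space (w, b would in
# practice be unknown; 0.7 and 0.3 are made-up illustrative values).
w_true, b_true = 0.7, 0.3
X = torch.linspace(0, 1, 50).unsqueeze(1)
y = w_true * X + b_true

split = int(0.8 * len(X))  # 80/20 train/test split
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=1, out_features=1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegression()
loss_fn = nn.L1Loss()  # L^1 loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

history = []  # (epoch, train cost, test cost) every 20 epochs
for epoch in range(300):
    model.train()
    loss = loss_fn(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 20 == 0:
        model.eval()
        with torch.inference_mode():
            test_loss = loss_fn(model(X_test), y_test)
        history.append((epoch, loss.item(), test_loss.item()))

# Save the learned state dict and load it onto a fresh instance.
torch.save(model.state_dict(), "linear_model.pt")
model2 = LinearRegression()
model2.load_state_dict(torch.load("linear_model.pt"))
```

The recorded `history` tuples are what one would plot as the training/testing cost curves versus epoch number.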
Posted in Blog | Leave a comment

PyTorch Fundamentals (Part \(1\))

Problem: Illustrate some of the basic fundamentals involved in using the PyTorch deep learning library. In particular, discuss the attributes of PyTorch tensors (e.g. dtype, CPU/GPU devices, etc.), how to generate random PyTorch tensors with/without seeding, and operations that can be performed on and between PyTorch tensors.

Solution:

pytorch_fundamentals
Posted in Blog | Leave a comment

Self-Attention in Transformers

Problem: Explain how the transformer architecture works at a mathematical level (e.g. as outlined in the Attention Is All You Need paper).

Solution:

  1. (Tokenization) Partition the inputted natural language text into a sequence of tokens \(\tau_1,…,\tau_N\) (here \(N\) is bounded from above by the LLM’s context size).
  2. (Embedding) Each token \(\tau_i\) is embedded as a vector \(\tau_i\mapsto\mathbf x_i\in\mathbf R^{n_e}\) that contains information about the token’s generic meaning as well as its position in the inputted natural language text. Here, the hyperparameter \(n_e\) is the dimension of the embedding space.
  3. (Single Self-Attention Head) For each embedding vector \(\mathbf x_i\), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\) and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it according to a convex linear combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context, specifically:

\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]

where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are key and value matrices associated to the inputted context (filled with column vectors here rather than the ML convention of row vectors). This map that takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x’_i=\mathbf x_i+\Delta\mathbf x_i\) is called a head of self-attention. The \(1/\sqrt{n_{qk}}\) scaling in the softmax temperature is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) will have mean \(0\) and variance \(n_{qk}\), hence the need to normalize by \(\sqrt{n_{qk}}\) to ensure \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\).

  4. (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel for several heads of self-attention; each head will propose a displacement update to each of the \(N\) original embeddings \(\mathbf x_i\); add up all of them.
  5. (Multilayer Perceptron) Linear, ReLU, Linear basically. It is hypothesized that facts are stored in this part of the transformer.
  6. (Layers) Alternate between the multi-headed self-attention blocks and MLP blocks; finally, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x’_N\) of the last token \(\tau_N\) in the context, by applying an unembedding matrix \(W_{\mathbf u}\) to obtain logits \(\mathbf u=W_{\mathbf u}\mathbf x’_N\) and running them through a softmax \(\text{softmax}(\mathbf u)\).
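A single head of self-attention (step \(3\) above, in the same column-vector convention) can be sketched in NumPy with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal single self-attention head; all dimensions and weights are
# made up for illustration (random in place of learned parameters).
n_e, n_qk, N = 8, 4, 5
W_q = rng.normal(size=(n_qk, n_e))
W_k = rng.normal(size=(n_qk, n_e))
W_v = rng.normal(size=(n_e, n_e))
X = rng.normal(size=(n_e, N))  # embeddings x_1..x_N as columns

Q, K, V = W_q @ X, W_k @ X, W_v @ X

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(K.T @ Q / np.sqrt(n_qk))  # column i holds weights for query q_i
dX = V @ A                            # updates Delta x_i as columns
X_out = X + dX                        # contextualized embeddings
```

Each column of `A` is a probability distribution over the \(N\) positions, so each \(\Delta\mathbf x_i\) is indeed a convex combination of the value vectors.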

Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).

Solution: Essentially, an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token. By seeding it with some corpus of text \(T\), the LLM can sample from the probability distribution it generates for the next token \(\tau\), and append that token \(T\mapsto T+\tau\). Then, simply repeat this, except pretend that \(T+\tau\) was the seed all along. This is how generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.

Posted in Blog | Leave a comment

JAX Fundamentals (Part \(1\))

JAX_tutorial
Posted in Blog | Leave a comment