Reinforcement Learning

Problem: How does the paradigm of reinforcement learning (RL) fit within the broader context of machine learning?

Solution: It is instructive to compare/contrast reinforcement learning with supervised learning. In this way, it will be seen that RL can in fact be viewed as a generalization of SL:

  • In SL, one has training feature vectors \(\mathbf x_i\); in RL, these manifest as the state vector \(\mathbf x_t\) of an agent (similar to how it’s okay to conflate the position vector \(\mathbf x(t)\) of a particle in classical mechanics with the particle itself, there’s no harm in conflating the state vector \(\mathbf x_t\) with the agent itself, see map-territory relation). The difference is purely a perspective shift reflected in the choice of subscript variable \(i=1,2,3,…\) vs. \(t=0,1,2,…\).
  • In SL, the feature vectors \(\mathbf x_i\) are i.i.d. draws from the same underlying data-generating distribution. In RL, the initial state vector \(\mathbf x_{t=0}\) is drawn from an initial distribution, and each subsequent state vector \(\mathbf x_t\) depends on the previous state \(\mathbf x_{t-1}\) (and on the action taken there), forming a discrete-time Markov chain.
  • In SL, the model output \(\hat y(\mathbf x|\boldsymbol{\theta})\) is analogous to an action \(\mathbf f\) outputted by an agent’s policy function \(\pi(\mathbf f|\mathbf x,t)\).
  • In SL, each training feature vector \(\mathbf x_i\) is labelled by its correct output \(y_i\). This concept has no analog in reinforcement learning as there is no notion of a “correct” action for the agent to take.
  • In SL, the feedback mechanism is provided by a loss function \(L(\hat y,y)\). By contrast, in RL the feedback mechanism is provided by a reward signal \(r_t\).
  • In SL, the objective is to find optimal parameters \(\boldsymbol{\theta}^*=\text{argmin}_{\boldsymbol{\theta}}\sum_{i=1}^{N_{\text{train}}}L(\hat y(\mathbf x_i|\boldsymbol{\theta}),y_i)\) minimizing the training cost over the \(N_{\text{train}}\) training examples. In RL, the objective is to find an optimal agent policy \(\pi^*=\text{argmax}_{\pi}\langle\sum_{t=0}^T\gamma^tr_t|\pi\rangle\) maximizing the expected return over an episode of horizon \(T\leq\infty\) (which depends on if/when the agent reaches a terminal state \(\mathbf x_T\)).
  • In SL, there is often a distinction between training, cross-validation, and testing datasets, with a golden rule often being that the model should obviously not be able to see any of the testing data. In RL, it is societally acceptable to train on your test set 🙂

Problem: Imagine the set of all states \(\mathbf x\), i.e. the agent’s state space. On this state space, one can impose a scalar field \(v_{\pi}(\mathbf x,t)\); what is the meaning of this field? Give a simple example thereof.

Solution: The scalar field \(v_{\pi}(\mathbf x,t)\) can roughly be thought of as describing how “valuable” the state \(\mathbf x\) is at time \(t\). More precisely, if one were to initialize an agent \(\pi\) at state \(\mathbf x_t:=\mathbf x\), then \(v_{\pi}(\mathbf x,t)\) is the expected return:

\[v_{\pi}(\mathbf x,t):=\biggr\langle\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\biggr|\mathbf x,\pi\biggr\rangle\]

Consider a \(\pi\)-creature (of \(3\)Blue\(1\)Brown fame) symbolizing an agent with policy \(\pi\). This \(\pi\)-creature has to complete a maze. Within the RL framework, this can be modelled as an MDP in which each square is one possible state \(\mathbf x\) of the agent, the maze structure defines the allowed actions \(\mathbf f\) from each state \(\mathbf x\), and one can work with an undiscounted \(\gamma=1\) return in which the reward is \(r_t=-1\) for each action taken. Thus, the optimal policy \(\pi^*\) is a time-independent, deterministic policy \(\mathbf f=\pi^*(\mathbf x)\) that gets the \(\pi\)-creature from its initial state to the terminal state in the fewest moves. With that in mind, one can label each square \(\mathbf x\) with its corresponding optimal value \(v_{\pi^*}(\mathbf x)\) (note: the accompanying image was generated using Nano Banana Pro; some of the calculated values are just wrong, but the idea is clear).
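
As a complement (a toy example of my own, not from the post), here is a minimal sketch that computes such optimal values \(v_{\pi^*}(\mathbf x)=-(\text{fewest moves to the goal})\) for a small hypothetical maze by breadth-first search backwards from the terminal square:

```python
from collections import deque

# Toy maze: '#' = wall, 'G' = terminal/goal square, '.' = free square (hypothetical layout).
maze = ["....#G",
        ".##.#.",
        ".#....",
        "...##."]

rows, cols = len(maze), len(maze[0])
goal = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "G")

# Undiscounted reward r_t = -1 per move, so v_{pi*}(x) = -(shortest number of moves from x to G).
# BFS backwards from the goal computes this shortest distance for every reachable square.
v_star = {goal: 0}
queue = deque([goal])
while queue:
    r, c = queue.popleft()
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#" and (nr, nc) not in v_star:
            v_star[(nr, nc)] = v_star[(r, c)] - 1   # one more move => value decreases by 1
            queue.append((nr, nc))

# Print the optimal value v_{pi*}(x) on top of each square (walls shown as ####).
for r in range(rows):
    print(" ".join(f"{v_star[(r, c)]:4d}" if (r, c) in v_star else "####" for c in range(cols)))
```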

Problem: The value function \(v_{\pi}(\mathbf x,t)\) is passive; it simply assesses how good/bad it is to be at \(\mathbf x_t:=\mathbf x\). In order to remedy this, one can look at the quality function \(q_{\pi}(\mathbf x,\mathbf f,t)\); explain how this takes on a more active role compared to the passive nature of the value function \(v_{\pi}(\mathbf x,t)\).

Solution: Because \(q_{\pi}(\mathbf x,\mathbf f,t)\) assesses the quality of the agent choosing the action \(\mathbf f_t:=\mathbf f\) while in state \(\mathbf x_t:=\mathbf x\). That is, it is a more refined conditional expected return:

\[q_{\pi}(\mathbf x,\mathbf f,t)=\biggr\langle\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\biggr|\mathbf x,\mathbf f,\pi\biggr\rangle\]

Problem: How come the agent’s policy, value function and quality function are sometimes respectively written as \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) without the \(t\) argument?

Solution: There could be \(3\) reasons:

  • Similar to the above maze example, sometimes it just doesn’t depend on \(t\).
  • If one is seeking to maximize the agent’s expected return over an infinite horizon \(T=\infty\): in such a case, it is both intuitively and mathematically clear that \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) should all be invariant with respect to time translations.
  • If the horizon is finite \(T<\infty\), then it is standard to package \(\mathbf x\) and \(t\) together, thereby working with an “augmented” state vector \(\mathbf x’:=(\mathbf x,t)\). Then everything can be made to look identical to the infinite horizon \(T=\infty\) case by replacing \(\mathbf x\mapsto\mathbf x’\).

(cf. distinction between state variables and path variables in thermodynamics).

Problem: State and prove the law of iterated expectation.

Solution: If \(X,Y\) are any random variables, then:

\[\langle X\rangle=\langle\langle X|Y\rangle\rangle\]

The proof is easy as long as one is careful to interpret all the expectations correctly. For instance, \(\langle X|Y\rangle\) is not a scalar but a random variable with respect to \(Y\):

\[\langle X|Y\rangle=\int dx p(x|Y) x\]

Thus, it is clear that the outer expectation must also be with respect to \(Y\) alone:

\[\langle\langle X|Y\rangle\rangle=\int dy p(y)\langle X|y\rangle=\int dxdy p(y)p(x|y)x\]

Rewriting in terms of the joint distribution \(p(x,y)=p(y)p(x|y)\) and integrating out \(\int dy p(x,y)=p(x)\), one finally obtains:

\[=\int dx p(x) x=\langle X\rangle\]

Of course, this also generalizes easily to identities such as the following equality of random variables (w.r.t. \(Z\)):

\[\langle X|Z\rangle=\langle\langle X|Y,Z\rangle|Z\rangle\]
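
As a quick numerical sanity check (my own addition, not part of the original proof), one can verify \(\langle X\rangle=\langle\langle X|Y\rangle\rangle\) by simulation for a simple hierarchical model where the conditional expectation \(\langle X|Y\rangle\) is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Hierarchical model: Y ~ Uniform(0, 1), then X | Y ~ Normal(mean=Y, std=1),
# so <X|Y> = Y as a random variable, and hence <<X|Y>> = <Y> = 1/2.
Y = rng.uniform(0.0, 1.0, size=N)
X = rng.normal(loc=Y, scale=1.0)

print("<X>     ≈", X.mean())   # direct estimate of <X>
print("<<X|Y>> ≈", Y.mean())   # estimate via the known conditional mean <X|Y> = Y
# Both should be close to 0.5, illustrating the law of iterated expectation.
```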

Problem: Show that the value function \(v_{\pi}(\mathbf x)\) satisfies the following identity (to be used as a lemma later):

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

Solution: Denoting the return random variable by:

\[R_t:=\sum_{t’=t+1}^T\gamma^{t’-t-1}r_{t’}\]

The fundamental observation is that \(R_t\) obeys a simple recurrence relation:

\[R_t=r_{t+1}+\gamma R_{t+1}\]

Taking suitable conditional expectations thereof:

\[\langle R_t|\mathbf x_t,\pi\rangle=\langle r_{t+1}|\mathbf x_t,\pi\rangle+\gamma\langle R_{t+1}|\mathbf x_t,\pi\rangle\]

The LHS is the definition of \(v_{\pi}(\mathbf x_t)\). Applying the law of iterated expectation to the \(2^{\text{nd}}\) term on the RHS:

\[\langle R_{t+1}|\mathbf x_t,\pi\rangle=\langle\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle|\mathbf x_t,\pi\rangle\]

The Markov property ensures that \(\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle=\langle R_{t+1}|\mathbf x_{t+1},\pi\rangle=v_{\pi}(\mathbf x_{t+1})\). One thus obtains the desired result:

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

Problem: How are \(v_{\pi}(\mathbf x_t)\) and \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) related to each other?

Solution: Apply the law of iterated expectations:

\[v_{\pi}(\mathbf x_t):=\langle R_t|\mathbf x_t,\pi\rangle=\langle\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle|\mathbf x_t,\pi\rangle=\langle q_{\pi}(\mathbf x_t,\mathbf f_t)|\mathbf x_t,\pi\rangle\]

Since \(\mathbf x_t\) is being conditioned upon, the expectation must be over \(\mathbf f_t\) so one can explicitly write it as:

\[=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)q_{\pi}(\mathbf x_t,\mathbf f_t)\]

But trivially \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\). Thus, one has an expression for \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). Now one would like to proceed the other way, relating \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) back to \(v_{\pi}\). This can be achieved by fleshing out the expectation from the earlier lemma:

\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]

\[=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

Further expand \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)\) and recognize again \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\) and \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)=p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) because the action \(\mathbf f_t\) is already fixed. Moving the summation \(\sum_{\mathbf f_t}\) to the outside so as to compare with the earlier identity expressing \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), one can pattern-match:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

(which in hindsight is manifest…)

Problem: Hence, deduce the Bellman equations for the value and quality functions.

Solution: To obtain the Bellman equation for \(v_{\pi}(\mathbf x_t)\), substitute the above expression for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) into the expression for \(v_{\pi}(\mathbf x_t)\):

\[v_{\pi}(\mathbf x_t)=\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]

Similarly, to get the Bellman equation for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), substitute vice versa:

\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\left(r_{t+1}+\gamma\sum_{\mathbf f_{t+1}}p(\mathbf f_{t+1}|\mathbf x_{t+1},\pi)q_{\pi}(\mathbf x_{t+1},\mathbf f_{t+1})\right)\]

Intuitively, the Bellman equations relate the value of every state with the values of states that can be transitioned into, thereby providing a system of simultaneous equations that in principle can be solved to deduce all state values. A key caveat for this dynamic programming approach to work is that one must have complete knowledge of the environment dynamics \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) (similar remarks apply to the qualities of different state-action pairs).

Problem: The above Bellman equations apply to a generic agent policy \(\pi\); how do they specialize to the case of an optimal policy \(\pi^*\)?

Solution: Reverting back to the form of the “pre-Bellman” equations and using the key insight that the optimal policy \(\pi^*(\mathbf f_t|\mathbf x_t)=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)}\) should only assign non-zero probability to actions of the best quality, one has:

\[v_{\pi^*}(\mathbf x_t)=\text{max}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)\]

\[q_{\pi^*}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi^*}(\mathbf x_{t+1}))\]

which lead to the Bellman optimality equations.

Problem: In light of Bellman optimality, discuss how policy evaluation and policy improvement work. Hence, explain the notion of generalized policy iteration (GPI).

Solution: Imagine the space of all \((\pi,v)\) pairs, where \(\pi(\mathbf f_t|\mathbf x_t)\) is any policy distribution and \(v(\mathbf x_t)\) is any candidate value function. Within this space, there are \(2\) canonical subspaces. One is the set of pairs \((\pi,v_{\pi})\) where \(v_{\pi}\) is specifically the value function associated to the policy \(\pi\); the other is the set of pairs \((\pi_v,v)\) where \(\pi_v(\mathbf f_t|\mathbf x_t):=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\) is \(\varepsilon=0\)-greedy with respect to the value function \(v\), with \(q(\mathbf x_t,\mathbf f_t)=\langle r_{t+1}+\gamma v(\mathbf x_{t+1})|\mathbf x_t,\mathbf f_t\rangle\) (in particular, notice \(q\neq q_{\pi}\); no policy \(\pi\) appears in the definition of \(q\)).

  • Policy evaluation is an algorithm for projecting \((\pi,v)\mapsto (\pi,v_{\pi})\). The idea is to view \(v\) as a random initial guess for the underlying true policy value function \(v_{\pi}\) (though any terminal states \(\mathbf x_T\) should be initialized \(v(\mathbf x_T):=0\)). Then, sweeping through each state \(\mathbf x_t\), one updates the value of \(v(\mathbf x_t)\) using the Bellman equation evaluated with the current estimates at the other states:

\[v(\mathbf x_t)\mapsto\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{\mathbf x_{t+1},r_{t+1}}p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v(\mathbf x_{t+1}))\]

With sufficiently many sweeps over the state space, general theorems guarantee convergence \(v\to v_{\pi}\).

  • Policy improvement is an algorithm for projecting onto the other canonical subspace \((\pi,v)\mapsto (\pi_v,v)\). Essentially, just act \(\varepsilon=0\)-greedy:
    \[\pi(\mathbf f_t|\mathbf x_t)\mapsto\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi}(\mathbf x_t,\mathbf f_t)}\]
    This is rigorously justified by the policy improvement theorem.

Policy iteration amounts to alternating between the \(2\) steps of policy evaluation and policy improvement until the policy stops changing. However, there is leeway in how one interleaves the \(2\) steps; for instance, one need not run the policy evaluation step to completion, but can instead perform just a single sweep over the state space before each improvement step (this extreme truncation is called value iteration). Generalized policy iteration (GPI) refers to this whole family of schemes in which evaluation and improvement interact at whatever granularity; they can be shown to still funnel/converge onto the optimal policy \(\pi^*\), essentially because \(\pi\) is \(\varepsilon=0\)-greedy with respect to its own value function \(v_{\pi}\) iff \(\pi=\pi^*\).
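
To make the dynamic-programming flavour of GPI concrete, here is a minimal sketch (a toy example of my own, not from the post) of policy iteration on a small deterministic gridworld with \(r_t=-1\) per move and \(\gamma=1\), alternating full policy-evaluation sweeps with \(\varepsilon=0\)-greedy policy improvement:

```python
import numpy as np

# Toy 4x4 gridworld (my own example): terminal states at (0, 0) and (3, 3),
# deterministic moves clipped at the walls, reward -1 per move, gamma = 1.
SIZE, GAMMA, THETA = 4, 1.0, 1e-8
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
states = [(r, c) for r in range(SIZE) for c in range(SIZE)]
terminals = {(0, 0), (SIZE - 1, SIZE - 1)}

def step(s, a):
    """Known environment dynamics: next state and reward for taking action a in state s."""
    r = min(max(s[0] + a[0], 0), SIZE - 1)
    c = min(max(s[1] + a[1], 0), SIZE - 1)
    return (r, c), -1.0

# Initial deterministic policy: move up until the top row, then left (reaches (0, 0) from anywhere).
policy = {s: ((-1, 0) if s[0] > 0 else (0, -1)) for s in states}
v = {s: 0.0 for s in states}                           # v(terminal) := 0 stays fixed

while True:
    # Policy evaluation: sweep the Bellman update until v stops changing.
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue
            s2, r = step(s, policy[s])
            new_v = r + GAMMA * v[s2]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < THETA:
            break
    # Policy improvement: act (epsilon = 0)-greedily w.r.t. q(x, f) = r + gamma * v(x').
    stable = True
    for s in states:
        if s in terminals:
            continue
        best = max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:
        break

print(np.array([[v[(r, c)] for c in range(SIZE)] for r in range(SIZE)]))
# Each entry is -(fewest moves to the nearest terminal corner), e.g. v[(1, 2)] = -3.
```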

Problem: Distinguish between model-free vs. model-based methods/algorithms in reinforcement learning.

Solution: In both cases, the fundamental limitation is incomplete knowledge of the environment dynamics \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (which previously was taken for granted); instead, one can only sample trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=0},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=1},…\) by running a policy \(\pi\) through the MDP. Recalling that the fundamental objective of RL is to compute \(\pi^*=\text{argmax}_{\pi}\langle R_t|\pi\rangle\), a model-free method for doing so is one which does not attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (here \(p\) is what the word “model” refers to, i.e. a model of the environment dynamics), whereas model-based methods do attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) and can thereby obtain \(\pi^*\) through GPI as outlined earlier (see the OpenAI Spinning Up documentation for a nice diagram of this taxonomy and further comments).

Problem: Explain how Monte Carlo evaluation and Monte Carlo control are both model-free methods that accomplish the same actions as their respective model-based cousins, i.e. policy evaluation and policy improvement.

Solution: Fix \(\pi\). The key is to recognize that \(v_{\pi}(\mathbf x_t)\) is an expectation of the return random variable starting at \(\mathbf x_t\), and Monte Carlo point estimators can always be applied to approximate expectation values:

\[v_{\pi}(\mathbf x_t)=\langle R_t|\mathbf x_t,\pi\rangle\approx\frac{1}{N_{\mathbf x_t}}\sum_{n=1}^{N_{\mathbf x_t}}R_n\]

where \(N_{\mathbf x_t}\) is the number of times the MDP passes through state \(\mathbf x_t\) in however many trajectories, and \(R_n\) is the return following the \(n^{\text{th}}\) pass through \(\mathbf x_t\) within some MDP trajectory.

Lemma: the arithmetic mean \(\hat{\mu}_n:=\frac{1}{n}\sum_{i=1}^nx_i\) obeys the recurrence relation \(\hat{\mu}_n=\hat{\mu}_{n-1}+\frac{1}{n}(x_n-\hat{\mu}_{n-1})\)

Proof: trivial, but instructive to highlight the convex linear combination nature of the recurrence \(\hat{\mu}_n=\left(1-\frac{1}{n}\right)\hat{\mu}_{n-1}+\frac{1}{n}x_n\): for large \(n\gg 1\), the majority \(1-1/n\) of the weight is given to the prior value of the mean \(\hat{\mu}_{n-1}\), while only a tiny sliver \(1/n\) is given to the new sample \(x_n\).

Thus, rather than waiting for a bunch of trajectories to finish before applying the Monte Carlo estimate above, one can introduce an intermediate estimator initialized at \(\hat v_{\pi}(\mathbf x_t):=0\) that applies (at the termination of each trajectory) the following iterative rule for each time \(n=1,2,…,N_{\mathbf x_t}\) that \(\mathbf x_t\) was visited:

\[\hat v_{\pi}(\mathbf x_t)\mapsto\hat v_{\pi}(\mathbf x_t)+\alpha_n(R_n-\hat v_{\pi}(\mathbf x_t))\]

where \(\alpha_n:=1/n\). Since this is just a mathematical decomposition of the original Monte Carlo estimate, it is an equally valid model-free method for \(v_{\pi}\) estimation. Now comes a curveball: one common modification is to replace \(\alpha_n\mapsto\alpha\) with an \(n\)-independent constant \(\alpha\in[0,1]\) (this is called constant-\(\alpha\) MC).

If one were presented with the recurrence relation \(\hat{\mu}_n=\hat{\mu}_{n-1}+\frac{1}{n}(x_n-\hat{\mu}_{n-1})\) and initial condition \(\hat{\mu}_0=0\), one could claim the explicit solution to be \(\hat{\mu}_n=\frac{1}{n}\sum_{i=1}^nx_i\). However, if instead one were presented with the recurrence relation \(\hat{\mu}’_n=\hat{\mu}’_{n-1}+\alpha(x_n-\hat{\mu}’_{n-1})\) and same initial condition \(\hat{\mu}’_0=0\), what would be the solution for \(\hat{\mu}’_n\)? It’s easy to check \(\hat{\mu}’_n=\alpha\sum_{i=1}^n(1-\alpha)^{n-i}x_i\); thus, recent samples are given exponentially more weight, and it’s biased towards underestimating \(\langle\hat{\mu}’_n\rangle-\mu=-\mu(1-\alpha)^n\), though is asymptotically unbiased \(\lim_{n\to\infty}\langle\hat{\mu}’_n\rangle=\mu\). However, unlike the asymptotically vanishing variance \(\sigma^2_{\hat{\mu}_n}=\sigma^2/n\) of the standard mean estimator \(\hat{\mu}_n\), the variance \(\sigma^2_{\hat{\mu}’_n}=\frac{\alpha\sigma^2}{2-\alpha}(1-(1-\alpha)^{2n})\to\frac{\alpha\sigma^2}{2-\alpha}\approx\alpha\sigma^2/2\) of \(\hat{\mu}’_n\) is bounded from below, hence the estimate \(\hat{\mu}’_n\) will forever oscillate about \(\mu\). It should make sense \(\sigma^2_{\hat{\mu}’_n}\propto\alpha\), i.e. a larger step size \(\alpha\) means bigger oscillations, though it also means faster convergence of \(\hat{\mu}’_n\to\mu\) because of the \((1-\alpha)^n\).
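
The contrast between the exactly-averaging \(\alpha_n=1/n\) estimator and the exponentially-weighted constant-\(\alpha\) estimator is easy to visualize numerically; here is a minimal sketch (my own, with an arbitrary choice of distribution and \(\alpha\)):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, alpha, N = 3.0, 1.0, 0.1, 2000
x = rng.normal(mu, sigma, size=N)            # i.i.d. samples with true mean mu

mean_1n, mean_const = 0.0, 0.0               # both estimators initialized at 0
for n, xn in enumerate(x, start=1):
    mean_1n += (xn - mean_1n) / n            # alpha_n = 1/n  ->  exact running mean
    mean_const += alpha * (xn - mean_const)  # constant alpha ->  exponentially-weighted mean

print("1/n estimate:        ", mean_1n)      # variance shrinks like sigma^2 / n
print("constant-alpha est.: ", mean_const)   # keeps oscillating about mu forever
print("theoretical floor std:", np.sqrt(alpha * sigma**2 / (2 - alpha)))
```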

This all begs the question: why bother with constant-\(\alpha\) MC? One standard answer is non-stationarity: in control settings the policy \(\pi\) keeps changing (and hence so does the target \(v_{\pi}\)), so it is desirable to weight recent returns more heavily rather than average over an increasingly stale history (come back to this…).

Having used constant-\(\alpha\) MC to estimate \(v_{\pi}(\mathbf x_t)\) in a model-free way for some fixed agent policy \(\pi\), one can naturally ask about model-free estimation of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). It turns out to just be:

\[\hat q_{\pi}(\mathbf x_t,\mathbf f_t)\mapsto\hat q_{\pi}(\mathbf x_t,\mathbf f_t)+\alpha(R_n-\hat q_{\pi}(\mathbf x_t,\mathbf f_t))\]

Conceptually, the standard way to convince oneself of this is that because \(\pi\) was fixed, it could be thought of as part of the incomplete knowledge of the environment dynamics \(p\) so that the agent never takes any actions but just lets the environment push its state around. This perspective shift from MDP trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=0},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=1},…\) to MRP trajectories \(\mathbf x_{t=0},r_{t=0},\mathbf x_{t=1},r_{t=1},…\) amounts to viewing the MRP states \(\mathbf x_t\) as corresponding to augmented state-action pairs \((\mathbf x_t,\mathbf f_t)\) in the MDP. But then it is mathematically apparent that \(v_{\pi}(\mathbf x_t)\) in the MRP is just \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) in the MDP, so constant-\(\alpha\) MC works for \(q_{\pi}\)-evaluation!
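
As an illustration (a toy example of my own, echoing the classic random-walk task), here is a sketch of constant-\(\alpha\), every-visit MC evaluation of \(v_{\pi}\) for the equiprobable random policy on a small chain MDP whose true values are known to be \(1/6,2/6,…,5/6\):

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, ALPHA, GAMMA, N_EPISODES = 5, 0.02, 1.0, 20_000

# Chain MDP: non-terminal states 0..4, terminals just beyond each end.
# Policy pi: move left or right with probability 1/2. Reward +1 only on exiting to the right.
v_hat = np.zeros(N_STATES)                      # constant-alpha MC estimate of v_pi

for _ in range(N_EPISODES):
    s, trajectory = N_STATES // 2, []           # start every episode in the middle
    while True:
        s_next = s + rng.choice([-1, +1])
        reward = 1.0 if s_next == N_STATES else 0.0
        trajectory.append((s, reward))          # (state visited, reward received on leaving it)
        if s_next < 0 or s_next == N_STATES:    # hit a terminal state
            break
        s = s_next
    # After the episode terminates, compute the return following each visit and update.
    G = 0.0
    for s_t, r_next in reversed(trajectory):
        G = r_next + GAMMA * G
        v_hat[s_t] += ALPHA * (G - v_hat[s_t])  # constant-alpha MC update

print("estimated v_pi:", np.round(v_hat, 3))
print("true v_pi:     ", np.round(np.arange(1, N_STATES + 1) / (N_STATES + 1), 3))
```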


Density Functional Theory

Problem: In one sentence, what is the essence of DFT?

Solution: To replace \(\Psi\mapsto n\), where the number density of a system of \(N\) identical quantum particles (usually electrons) \(n(\mathbf x)\) is:

\[n(\mathbf x):=N\int d^3\mathbf x_2…d^3\mathbf x_N |\Psi(\mathbf x,\mathbf x_2,…,\mathbf x_N)|^2\]

in terms of its \(N\)-body wavefunction \(\Psi(\mathbf x,\mathbf x_2,…,\mathbf x_N)\). Thus, \(\int d^3\mathbf x n(\mathbf x)=N\) is normalized.

Problem: Write down the non-relativistic Hamiltonian \(H\) of a molecule. Attempt to express the expected energy \(E=\langle\Psi|H|\Psi\rangle\) in many-body state \(|\Psi\rangle\in\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\) in terms of \(n(\mathbf x)\) defined above; this functional \(E=E[n]\) of the density \(n\) explains the name “density functional theory“.

Solution: (the Born-Oppenheimer approximation is implicitly used whereby the electrons move on an effective/external potential energy surface \(V_{\text{ext}}(\mathbf x)\) defined by a collection of stationary nuclei)

\[H=\sum_{i=1}^N\left(\frac{|\mathbf P_i|^2}{2m}+V_{\text{ext}}(\mathbf X_i)\right)+\frac{1}{2}\sum_{1\leq i\neq j\leq N}\frac{\alpha\hbar c}{|\mathbf X_i-\mathbf X_j|}\]

One can check that:

\[E=\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x_1…d^3\mathbf x_N\biggr|\frac{\partial\Psi}{\partial\mathbf x_i}\biggr|^2+\int d^3\mathbf x n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n_2(\mathbf x,\mathbf x’)}{|\mathbf x-\mathbf x’|}\]

where \(n_2(\mathbf x,\mathbf x’)=N(N-1)\int d^3\mathbf x_3…d^3\mathbf x_N|\Psi(\mathbf x,\mathbf x’,\mathbf x_3,…,\mathbf x_N)|^2\) so in particular \((N-1)n(\mathbf x)=\int d^3\mathbf x’ n_2(\mathbf x,\mathbf x’)\).

Problem: Write down the Thomas-Fermi approximation for \(E[n]\), and explain why it sucks. Also derive the von Weizsäcker correction to the Thomas-Fermi energy functional.

Solution: The Thomas-Fermi functional treats the electrons as a locally ideal Fermi gas, thus having the usual Pauli pressure \(p\propto n^{5/3}\) in \(\mathbf R^3\) and thus (kinetic) energy density \(3p/2\) scaling likewise. Furthermore, the exchange-correlation functional is taken to vanish \(E_{\text{xc}}[n]=0\) (thus, Thomas-Fermi is often said to be semiclassical):

\[E_{\text{TF}}[n]=\frac{3\hbar^2(3\pi^2)^{2/3}}{10m}\int d^3\mathbf x n^{5/3}(\mathbf x)+\int d^3\mathbf x n(\mathbf x)V_{\text{ext}}(\mathbf x)+\frac{\alpha\hbar c}{2}\int d^3\mathbf x d^3\mathbf x’\frac{n(\mathbf x)n(\mathbf x’)}{|\mathbf x-\mathbf x’|}\]

One has:

\[\frac{\delta}{\delta n(\mathbf x)}\left(E_{\text{TF}}-\mu\left(\int d^3\mathbf xn(\mathbf x)-N\right)\right)=\frac{\hbar^2k^2_F(\mathbf x)}{2m}+V_{\text{eff}}(\mathbf x)-\mu=0\]

where \(k_F(\mathbf x):=(3\pi^2n(\mathbf x))^{1/3}\) and \(V_{\text{eff}}(\mathbf x)=V_{\text{ext}}(\mathbf x)+\alpha\hbar c\int d^3\mathbf x’\frac{n(\mathbf x’)}{|\mathbf x-\mathbf x’|}\).

Aside: although the Thomas-Fermi model is meant to be applied to molecules, a result of Teller showed that \(E_{\text{TF molecule}}>2E_{\text{TF atom}}\) so Thomas-Fermi theory predicts the molecular state is unstable with respect to dissociation. In a central external potential such as due to a single nucleus \(V_{\text{ext}}(r)=-Z\alpha\hbar c/r\), the effective potential \(V_{\text{eff}}(r)\) and electron density \(n(r)\) are both isotropic as well. Starting from Poisson’s equation:

\[\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\frac{V_{\text{eff}}(r)}{-e}=-\frac{-en(r)}{\varepsilon_0}\]

with \(\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2f(r)=\frac{1}{r^2}\frac{d}{dr}\left(r^2\frac{df}{dr}\right)=\frac{1}{r}\frac{d^2}{dr^2}(rf(r))\) (use the latter) one can check that by defining \(f(r):=\left(\frac{8}{3\pi a_0}\right)^2rk^2_F(r)\), one recovers the Thomas-Fermi ODE:

\[\frac{d^2 f}{dr^2}=\frac{f^{3/2}(r)}{\sqrt{r}}\]
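
As a numerical aside (my own sketch, not from the post): in the standard dimensionless form of this equation with neutral-atom boundary conditions \(f(0)=1,\ f(\infty)=0\) (to which I am assuming the post's \(f\) reduces after suitable rescaling), the correct initial slope is famously \(f'(0)\approx-1.588\), and the ODE can be integrated numerically, e.g.:

```python
import numpy as np
from scipy.integrate import solve_ivp

def thomas_fermi(r, y):
    """y = (f, f'); the Thomas-Fermi ODE f'' = f^{3/2} / sqrt(r)."""
    f, fp = y
    return [fp, max(f, 0.0) ** 1.5 / np.sqrt(r)]   # clamp at 0 to avoid complex powers

# Neutral-atom boundary conditions f(0) = 1, f(infinity) = 0; the standard
# shooting value for the initial slope is f'(0) ≈ -1.588071.
r0, r_max = 1e-6, 10.0
sol = solve_ivp(thomas_fermi, (r0, r_max), [1.0, -1.588071],
                t_eval=np.linspace(r0, r_max, 201), rtol=1e-8, atol=1e-10)

for i in range(0, 201, 40):
    print(f"r = {sol.t[i]:5.2f}   f(r) = {sol.y[0][i]:.5f}")
# f decays monotonically over this range; pushing to much larger r requires refining
# the initial slope by shooting, since the solution is very sensitive to it.
```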

The von Weizsäcker term is a correction to the kinetic energy functional in the Thomas-Fermi model that accounts for density gradients:

\[T_{\text{vW}}[n]=\frac{\hbar^2}{8m}\int d^3\mathbf x\frac{|\partial n/\partial\mathbf x|^2}{n(\mathbf x)}\]

Heuristically, it can be justified by writing the expected kinetic energy of a single electron \(\frac{\hbar^2}{2m}\int d^3\mathbf x |\partial\psi/\partial\mathbf x|^2\) with wavefunction \(\psi(\mathbf x)\in\mathbf R\) and replacing \(\psi(\mathbf x)\mapsto\sqrt{n(\mathbf x)}\).

Problem: Prove the two Hohenberg-Kohn theorems.

Solution: First, it is useful to prove:

  • Lemma: If \(V_{\text{ext}}(\mathbf x)\neq V’_{\text{ext}}(\mathbf x)\) represent \(2\) physically distinct external potentials (i.e. they differ by more than just some additive constant), then the corresponding molecular Hamiltonians \(H\neq H’\) do not share any non-zero eigenstates.
  • Proof: Suppose for sake of contradiction that there exists \(|\Psi\rangle\neq 0\) such that
    \[H|\Psi\rangle=E|\Psi\rangle\]
    \[H'|\Psi\rangle=E'|\Psi\rangle\]
    Subtracting, one obtains:
    \[(H’-H)|\Psi\rangle=(E’-E)|\Psi\rangle\]
    Inspecting the form of the molecular Hamiltonian \(H\) given earlier, one sees that the “universal” part (i.e. the kinetic energy and electron-electron Coulomb repulsion terms) cancel, leaving only the non-universal residue \(H’-H=\sum_{i=1}^NV’_{\text{ext}}(\mathbf X_i)-V_{\text{ext}}(\mathbf X_i)\). But the RHS \(E’-E\) is \(\mathbf X_i\)-independent for all \(i=1,…,N\); this means \(V’_{\text{ext}}(\mathbf x)-V_{\text{ext}}(\mathbf x)=\text{const}\) which is a contradiction.

    With this in mind, the first HK theorem (that the ground-state density determines the external potential up to an additive constant) follows by specializing to the respective (distinct by the above lemma) ground states \(|\Psi_0\rangle\neq|\Psi'_0\rangle\) of \(H\) and \(H'\), and supposing for the sake of contradiction that they nevertheless share the same density \(n_0(\mathbf x)=n'_0(\mathbf x)\). Apply the variational bound both ways (assuming non-degenerate ground states):

    \[E_0<\langle\Psi'_0|H|\Psi'_0\rangle=E'_0+\int d^3\mathbf x\,n'_0(\mathbf x)\left(V_{\text{ext}}(\mathbf x)-V'_{\text{ext}}(\mathbf x)\right)\]

    \[E'_0<\langle\Psi_0|H'|\Psi_0\rangle=E_0+\int d^3\mathbf x\,n_0(\mathbf x)\left(V'_{\text{ext}}(\mathbf x)-V_{\text{ext}}(\mathbf x)\right)\]

    Adding the \(2\) inequalities and using \(n_0=n'_0\) yields the contradiction \(E_0+E'_0<E_0+E'_0\); hence distinct external potentials cannot share the same ground-state density, i.e. the map \(V_{\text{ext}}\mapsto n_0\) is injective.

Problem: Demonstrate the first Hohenberg-Kohn theorem for \(2\) electrons in a \(1\)-dimensional harmonic trap.

Solution:

Problem: Derive the Kohn-Sham equations.

Solution: Fictitious ideal electron gas with density \(n(\mathbf x)\).

Problem:


PyTorch Fundamentals (Part \(2\))

Problem: Do an end-to-end walkthrough of the PyTorch machine learning workflow using the most basic univariate linear regression example. In particular, generate some linear data over a normalized feature space (whose slope \(w\) and intercept \(b\) would in practice be a priori unknown), split that linear data into training and testing subsets (no cross-validation dataset needed for this simple example), define the linear layer class, instantiate a model object of the class, and starting from random values of \(w\) and \(b\), use stochastic gradient descent with learning rate \(\alpha=0.01\) to minimize the training cost function \(C_{\text{train}}(w,b)\) based on an \(L^1\) loss. Iterate SGD for \(300\) epochs, and for every \(20\) epochs, record the current value of \(C_{\text{train}}(w,b)\) and the current value of \(C_{\text{test}}(w,b)\). Plot these cost function curves as a function of the epoch number \(0, 20, 40,…\). Save the final state dictionary of the model’s learned parameters \(w,b\) post-training, and load it back onto a new instance of the model class.

Solution:

pytorch_workflow
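
The linked notebook contains the full walkthrough; below is a condensed sketch of the same workflow with the stated hyperparameters (the ground-truth slope and intercept values here are arbitrary choices of mine):

```python
import torch
from torch import nn
import matplotlib.pyplot as plt

torch.manual_seed(42)

# Generate linear data y = w*x + b over a normalized feature space x in [0, 1).
w_true, b_true = 0.7, 0.3                      # "unknown" ground truth (arbitrary values)
X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # shape (50, 1)
y = w_true * X + b_true

# Train/test split (80/20); no cross-validation set for this simple example.
n_train = int(0.8 * len(X))
X_train, y_train, X_test, y_test = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

class LinearRegressionModel(nn.Module):
    """Univariate linear layer: y_hat = w*x + b with randomly initialized w, b."""
    def __init__(self):
        super().__init__()
        self.linear_layer = nn.Linear(in_features=1, out_features=1)
    def forward(self, x):
        return self.linear_layer(x)

model = LinearRegressionModel()
loss_fn = nn.L1Loss()                                     # L^1 (mean absolute error) loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent

epochs, train_costs, test_costs = [], [], []
for epoch in range(300):
    model.train()
    loss = loss_fn(model(X_train), y_train)    # training cost C_train(w, b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 20 == 0:                        # record both costs every 20 epochs
        model.eval()
        with torch.inference_mode():
            test_loss = loss_fn(model(X_test), y_test)
        epochs.append(epoch)
        train_costs.append(loss.item())
        test_costs.append(test_loss.item())

plt.plot(epochs, train_costs, label="train cost")
plt.plot(epochs, test_costs, label="test cost")
plt.xlabel("epoch"); plt.ylabel(r"$L^1$ cost"); plt.legend(); plt.show()

# Save the learned state dict and load it back onto a fresh instance of the model class.
torch.save(model.state_dict(), "linear_model.pth")
reloaded = LinearRegressionModel()
reloaded.load_state_dict(torch.load("linear_model.pth"))
print(reloaded.state_dict())                   # learned w, b should be close to 0.7, 0.3
```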

PyTorch Fundamentals (Part \(1\))

Problem: Illustrate some of the basic fundamentals involved in using the PyTorch deep learning library. In particular, discuss the attributes of PyTorch tensors (e.g. dtype, CPU/GPU devices, etc.), how to generate random PyTorch tensors with/without seeding, and operations that can be performed on and between PyTorch tensors.

Solution:

pytorch_fundamentals
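
Again, the linked notebook has the full tour; here is a minimal taste (my own condensed sketch) of the basics mentioned above: tensor attributes, seeded randomness, and operations on/between tensors.

```python
import torch

# Attributes: every tensor carries a dtype, a device, and a shape.
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.dtype, t.device, t.shape)              # torch.float32, cpu, torch.Size([2, 2])

# Move to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
t = t.to(device)

# Random tensors, with and without seeding (seeding makes the draws reproducible).
a = torch.rand(2, 3)
torch.manual_seed(0)
b = torch.rand(2, 3)
torch.manual_seed(0)
c = torch.rand(2, 3)
print(torch.equal(b, c))                       # True: same seed, same pseudorandom draws

# Operations on and between tensors: elementwise arithmetic, matrix multiplication, reshaping.
print(b + c, b * c, sep="\n")                  # elementwise
print(b @ c.T)                                 # matrix multiplication: (2,3) @ (3,2) -> (2,2)
print(b.reshape(3, 2), b.mean(), b.argmax())   # reshape and aggregation
```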

Self-Attention in Transformers

Problem: Explain how the transformer architecture works at a mathematical level (e.g. as outlined in the Attention Is All You Need paper).

Solution:

  1. (Tokenization) Partition the inputted natural language text into a sequence of tokens \(\tau_1,…,\tau_N\) (here \(N\) is bounded from above by the LLM’s context size).
  2. (Embedding) Each token \(\tau_i\) is embedded as a vector \(\tau_i\mapsto\mathbf x_i\in\mathbf R^{n_e}\) that contains information about the token’s generic meaning as well as its position in the inputted natural language text. Here, the hyperparameter \(n_e\) is the dimension of the embedding space.
  3. (Single Self-Attention Head) For each embedding vector \(\mathbf x_i\), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\) and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it according to a convex linear combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context, specifically:

\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]

where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are key and value matrices associated to the inputted context (filled with column vectors here rather than the ML convention of row vectors). This map that takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x’_i=\mathbf x_i+\Delta\mathbf x_i\) is called a head of self-attention. The \(1/\sqrt{n_{qk}}\) scaling in the softmax temperature is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) will have mean \(0\) and variance \(n_{qk}\), hence the need to normalize by \(\sqrt{n_{qk}}\) to ensure \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\). (A small numerical sketch of a single attention head is given after this list.)

4. (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel for several heads of self-attention; each head will propose a displacement update to each of the \(N\) original embeddings \(\mathbf x_i\); add up all of them.

5. (Multilayer Perceptron) Linear, ReLU, Linear basically. It is hypothesized that facts are stored in this part of the transformer.

6. (Layers) Alternate between the multi-headed self-attention blocks and MLP blocks, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x’_N\) of the last token \(\tau_N\) in the context by applying an unembedding matrix \(\mathbf u=W_{\mathbf u}\mathbf x’_N\) and running it through a softmax \(\text{softmax}(\mathbf u)\).
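
To make step \(3\) concrete, here is a minimal numerical sketch of a single self-attention head (toy dimensions and random weights of my own; no masking, batching, or low-rank value factorization):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_e, n_qk = 4, 8, 3            # toy context length, embedding dim, query/key dim

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(n_e, N))     # columns are the token embeddings x_1, ..., x_N
W_q = rng.normal(size=(n_qk, n_e))
W_k = rng.normal(size=(n_qk, n_e))
W_v = rng.normal(size=(n_e, n_e))

Q, K, V = W_q @ X, W_k @ X, W_v @ X           # query, key, value matrices (column convention)
A = softmax(K.T @ Q / np.sqrt(n_qk), axis=0)  # column i holds the attention weights for token i
dX = V @ A                                    # each Delta x_i is a convex combination of the v_j
X_contextualized = X + dX                     # x_i -> x_i + Delta x_i

print(A.sum(axis=0))              # each column sums to 1 (convex weights)
print(X_contextualized.shape)     # (n_e, N)
```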

    Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).

    Solution: Essentially, since an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token, by seeding it with some corpus of text \(T\), the LLM can sample according to the probability distribution it generates for the next token, and append that to \(T\mapsto T+\tau\). Then, simply repeat this except pretend that \(T+\tau\) was the seed all along. In this way, generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.


    JAX Fundamentals (Part \(1\))

    JAX_tutorial

    Monte Carlo Methods

    Problem: Distinguish between Las Vegas methods and Monte Carlo methods.

    Solution: Both are umbrella terms referring to broad classes of methods that draw (repeatedly) from (not necessarily i.i.d.) random variables to compute the value of some deterministic variable. Here, “compute” means “find that deterministic value with certainty” in the case of a Las Vegas method, whereas “compute” means “(point) estimate” in the case of a Monte Carlo method. Thus, their key difference lies in how they trade off time \(t\) vs. accuracy \(\alpha\).

    Problem: Given a scalar continuous random variable with probability density function \(p(x)\), how can one sample faithfully from such an \(x\sim p(x)\)?

    Solution: Fundamentally, a computer can (approximately) draw from a uniform random variable \(u\in [0,1]\) (using seed-based pseudorandom number generators for instance). In order to convert this \(u\to x\), the standard way is to use the cumulative distribution function of \(x\):

    \[\int_{-\infty}^xdx’p(x’):=u\]

    Then, as the probability density of \(u\) is uniform \(p(u)=1\), the probability density of \(x\) will be \(p(x)=\frac{du}{dx}p(u)\). As the CDF is a monotonically increasing bijection (for a strictly positive density), there always exists an \(x=x(u)\) for each \(u\in[0,1]\). And intuitively, this whole sampling scheme should make a lot of sense.
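
A minimal sketch of the inverse-CDF trick (my own example): for the exponential density \(p(x)=\lambda e^{-\lambda x}\), the CDF is \(u=1-e^{-\lambda x}\), which inverts to \(x=-\ln(1-u)/\lambda\):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, N = 2.0, 1_000_000

u = rng.uniform(0.0, 1.0, size=N)      # the computer's primitive: uniform draws on [0, 1]
x = -np.log(1.0 - u) / lam             # inverse CDF of the exponential distribution

# Sanity checks against the known exponential moments: mean 1/lambda, variance 1/lambda^2.
print(x.mean(), 1 / lam)
print(x.var(), 1 / lam**2)
```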

    Problem: So, it seems like the above “inverse CDF trick” has solved the problem of sampling from an arbitrary random variable \(x\) (and it generalizes to the case of a higher-dimensional random vector \(\mathbf x\in\mathbf R^n\) by sampling conditionally). So how come in many practical applications this “sampling” problem can still be difficult?

    Solution: In practice, one may not have access to \(p(\mathbf x)\) itself, but only the general shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) of it (e.g. in Bayesian inference). One might object that \(Z\) is known, namely \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\) can be written explicitly as an \(n\)-dimensional integral, so wherein lies the difficulty? The difficulty lies in evaluating the integral, which suffers from the curse of dimensionality if \(n\gg 1\) is large (just imagine approximating the integral by a sum, and then realizing that the number of terms \(N\) in such a series would grow as \(N\sim O(\exp n)\)). The inverse CDF trick thus fails because it requires one to have \(p(\mathbf x)\) itself, not merely \(\tilde p(\mathbf x)\).

    Problem: Describe the Monte Carlo method known as importance sampling for estimating the expectation \(\langle f(\mathbf x)\rangle\) of a random variable \(f(\mathbf x)\).

    Solution: Importance sampling consists of a deliberate reframing of the integrand of the expectation inner product integral:

    \[\langle f(\mathbf x)\rangle=\int d^n\mathbf x p(\mathbf x)f(\mathbf x)=\int d^n\mathbf x q(\mathbf x)\frac{p(\mathbf x)f(\mathbf x)}{q(\mathbf x)}\]

    in other words, replacing \(p(\mathbf x)\mapsto q(\mathbf x)\) and \(f(\mathbf x)\mapsto p(\mathbf x)f(\mathbf x)/q(\mathbf x)\). There are \(2\) cases in which importance sampling can be useful. The first case is related to the previous problem, namely when sampling directly from \(p(\mathbf x)\) is too difficult, in which case \(q(\mathbf x)\) should be chosen to be easier to both sample from and evaluate than \(p(\mathbf x)\). The second case is, even if sampling directly from \(p(\mathbf x)\) is feasible, it can still be useful to instead sample from \(q(\mathbf x)\) as a means of reducing the variance of the Monte Carlo estimator. That is, whereas the original Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(p(\mathbf x)\) is \(\sum_{i=1}^Nf(\mathbf x_i)/N\) with variance \(\sigma^2_{f(\mathbf x)}/N\), the new Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws from \(q(\mathbf x)\) is \(\sum_{i=1}^Np(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)N\) with variance \(\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}/N\); for importance sampling to be useful, \(q(\mathbf x)\) should be chosen such that:

    \[\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}<\sigma^2_{f(\mathbf x)}\]

    One can explicitly find the function \(q(\mathbf x)\) that minimizes the variance functional of \(q(\mathbf x)\) (using a Lagrange multiplier to enforce normalization \(\int d^n\mathbf x q(\mathbf x)=1\)):

    \[q(\mathbf x)\sim p(\mathbf x)|f(\mathbf x)|\]

    In and of itself, this \(q(\mathbf x)\) cannot be used because it contains the difficult piece \(p(\mathbf x)\). Nevertheless, motivated by this result, one can heuristically importance sample using a \(q(\mathbf x)\) which ideally is large at \(\mathbf x\in\mathbf R^n\) where \(p(\mathbf x)|f(\mathbf x)|\) is large. Intuitively, this is because that’s the integrand in the expectation inner product integral, so locations where it’s large will contribute most to the integral, hence are the most “important” regions to sample from with \(q(\mathbf x)\).

    One glaring flaw with the above discussion is that if \(Z\) is not known, then one cannot even evaluate \(p(\mathbf x)\), and hence cannot evaluate the importance weights \(p(\mathbf x_i)/q(\mathbf x_i)\). Instead, suppose as before that only the shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) is known with \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\). Thus, the expectation is:

    \[\frac{\int d^n\mathbf x\tilde{p}(\mathbf x)f(\mathbf x)}{\int d^n\mathbf x\tilde{p}(\mathbf x)}\]

    The key is to notice that the denominator is just the \(f(\mathbf x)=1\) special case of the numerator, hence the importance sampling trick can be applied to both numerator and denominator, thereby obtaining the (biased but asymptotically unbiased) Monte Carlo estimator:

    \[\frac{\sum_{i=1}^N\tilde p(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)}{\sum_{i=1}^N\tilde p(\mathbf x_i)/q(\mathbf x_i)}\]

    for the expectation.
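
A minimal numerical sketch (my own toy choice of \(p\), \(q\), and \(f\)): estimate \(\langle f(x)\rangle\) under a standard normal \(p\) for \(f(x)=x^4\) (true value \(3\)), once by direct sampling from \(p\) and once by importance sampling from a wider normal \(q\) that puts more mass where \(p(x)|f(x)|\) is large:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 200_000
f = lambda x: x**4                    # <f(x)> under p = N(0,1) is exactly 3

# Direct Monte Carlo: sample from p itself.
xp = rng.normal(0.0, 1.0, size=N)
est_direct = f(xp).mean()

# Importance sampling: sample from a wider proposal q = N(0, 2^2), reweight by p/q.
xq = rng.normal(0.0, 2.0, size=N)
w = norm.pdf(xq, 0, 1) / norm.pdf(xq, 0, 2)
est_is = (w * f(xq)).mean()

# Self-normalized variant (only the shapes of p and q are needed, not their normalizations).
est_snis = (w * f(xq)).sum() / w.sum()

print(est_direct, est_is, est_snis)                         # all should be close to 3
print("direct-MC estimator variance ≈", f(xp).var() / N)
print("IS estimator variance ≈       ", (w * f(xq)).var() / N)  # noticeably smaller here
```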

    Problem: Describe what Markov chain Monte Carlo (MCMC) methods are about.

    Solution: Whereas in “naive” Monte Carlo methods the random vector draws \(\mathbf x_1,…,\mathbf x_N\) are often i.i.d., here the key innovation is that the sequence of random vectors is no longer i.i.d.; rather, \(\mathbf x_t\) depends on \(\mathbf x_{t-1}\) (and, conditioned on \(\mathbf x_{t-1}\), is independent of all the earlier \(\mathbf x_{<t-1}\)). In other words, the sequence \(\mathbf x_1,…,\mathbf x_N\) forms a (discrete-time) Markov chain, i.e. there exists a transition kernel with “matrix elements” \(p(\mathbf x’|\mathbf x)\) for any pair of states \(\mathbf x,\mathbf x’\in\mathbf R^n\) (assuming the DTMC is \(t\)-independent/homogeneous). The goal of an MCMC method is to find a suitable \(p(\mathbf x’|\mathbf x)\) such that the stationary \(N\to\infty\) distribution of the chain matches \(p(\mathbf x)\). Thus, it’s basically an “MCMC sampler”.

    Problem: What does it mean to burn in an MCMC run? What does it mean to thin an MCMC run?

    Solution: Given a DTMC \(\mathbf x_1,…,\mathbf x_N\), the idea of burning-in the chain is to discard the first few samples \(\mathbf x_1,\mathbf x_2,…\) as they’re correlated with the (arbitrary) choice of starting \(\mathbf x_1\) and so are not representative of \(p(\mathbf x)\) (equivalently, one hopes to burn-in for the first \(N_b\) samples where \(N_b\) is around the mixing time of the Markov chain). Burn-in therefore gives the MCMC run a chance to find the high-probability regions of \(p(\mathbf x)\).

    Thinning an MCMC run means only keeping e.g. every \(10^{\text{th}}\) DTMC sample; this reduces the correlation between adjacent retained samples, since the goal is to sample (approximately) independently from \(p(\mathbf x)\).

    Problem: The Metropolis-Hastings method is a general template for a fairly broad family of MCMC methods. Explain how it works.

    Solution: In order for \(p(\mathbf x)\) to be the stationary distribution of a DTMC, by definition one requires it to be an eigenvector (with eigenvalue \(1\)) of the transition kernel:

    \[\int d^n\mathbf xp(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x’)\]

    A sufficient (though not necessary) condition for this is if the chain is in detailed balance:

    \[p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x|\mathbf x’)p(\mathbf x’)\]

    for all \(\mathbf x,\mathbf x’\in\mathbf R^n\). The idea of Metropolis-Hastings is to conceptually decompose the transition from state \(\mathbf x\to\mathbf x’\) into \(2\) simpler steps, namely:

    1. Propose the transition \(\mathbf x\to\mathbf x’\) with proposal probability \(p_p(\mathbf x’|\mathbf x)\) (this distribution \(p_p\) needs be easy to sample \(\mathbf x’\) from for any \(\mathbf x\)!)
    2. Accept/reject the proposed transition \(\mathbf x\to\mathbf x’\) with an acceptance probability \(p_a(\mathbf x’|\mathbf x)\) (this distribution also needs to be easy to sample from!)

    Thus, \(p(\mathbf x’|\mathbf x)=p_a(\mathbf x’|\mathbf x)p_p(\mathbf x’|\mathbf x)\). The ratio of acceptance probabilities from detailed balance is therefore:

    \[\frac{p_a(\mathbf x'|\mathbf x)}{p_a(\mathbf x|\mathbf x')}=\frac{p_p(\mathbf x|\mathbf x')p(\mathbf x')}{p_p(\mathbf x'|\mathbf x)p(\mathbf x)}\]

    The ratio on the RHS is known (furthermore, because it’s a ratio, neither \(p_p\) nor \(p\) need be normalized! In other words, one can replace \(p_p\mapsto \tilde p_p\) and \(p\mapsto\tilde p\)). It remains to specify the form of \(p_a(\mathbf x'|\mathbf x)\). The detailed balance condition doesn’t give a unique solution for \(p_a\), but heuristically one can obtain a unique solution by further insisting that the walker accept proposals as often as possible. That is, w.l.o.g. suppose the RHS is computed to be \(>1\), so that \(p_a(\mathbf x'|\mathbf x)>p_a(\mathbf x|\mathbf x')\); the transition \(\mathbf x\to\mathbf x'\) is in this sense more favourable than the reverse transition \(\mathbf x'\to\mathbf x\), so one would like to make it as probable as possible, i.e. set \(p_a(\mathbf x'|\mathbf x):=1\) and hence \(p_a(\mathbf x|\mathbf x')=\frac{p_p(\mathbf x'|\mathbf x)p(\mathbf x)}{p_p(\mathbf x|\mathbf x')p(\mathbf x')}\). But this fixes the form of the function \(p_a\): to evaluate \(p_a(\mathbf x'|\mathbf x)\), set it to \(1\) if the RHS ratio is \(>1\), and set it to the RHS ratio itself if the ratio is \(<1\). An elegant way to encode all of this into a single compact expression is just:

    \[p_a(\mathbf x’|\mathbf x)=\min\left(1,\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\right)\]

    Practically, to actually implement this Metropolis-Hastings acceptance probability \(p_a\), one can essentially flip a “biased” coin with probability \(p_a(\mathbf x’|\mathbf x)\) of landing “heads” (meaning to accept the transition \(\mathbf x\to\mathbf x’\)), e.g. randomly picking a number \(u\in[0,1]\) and declaring acceptance iff \(u\leq p_a(\mathbf x’|\mathbf x)\).

    (aside: if the proposal probability \(p_p(\mathbf x’|\mathbf x)\sim e^{-|\mathbf x’-\mathbf x|^2/2\sigma^2}\) is taken to be normally and isotropically distributed about \(\mathbf x\), then it is symmetric \(p_p(\mathbf x’|\mathbf x)=p_p(\mathbf x|\mathbf x’)\) and the MH acceptance probability simplifies \(\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(\mathbf x’)}{p(\mathbf x)}\) (in which case this is sometimes just called the Metropolis method). Another tip is that the burn-in period of the MCMC walk can often be used to tune the variance \(\sigma^2\) of the above Gaussian \(p_p\) distribution, namely if the burn-in period lasts for \(N_b\) samples, then look at the number of times \(N_a\) the MCMC walker accepted a proposed state from \(p_p\); a rule-of-thumb is that in high-dimensions \(n\gg 1\), one should tune \(\sigma^2\) such that the fraction of accepted proposals \(N_a/N_b\approx 23.4\%\), see “Weak convergence and optimal scaling of random walk Metropolis algorithms” by Roberts, Gelman, and Gilks (\(1997\)). Finally, one more MCMC trick worth mentioning is that one can always run e.g. \(\sim 100\) MCMC sequences in parallel and sample randomly from them as an alternative to thinning).
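
A minimal random-walk Metropolis sketch (my own example): sample from an unnormalized 1D target \(\tilde p(x)\propto e^{-x^4/4}\) (whose normalization would otherwise require a nontrivial integral), using the symmetric Gaussian proposal so that the acceptance ratio reduces to \(\tilde p(x')/\tilde p(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)
log_p_tilde = lambda x: -x**4 / 4.0                    # log of the unnormalized target shape

n_samples, burn_in, sigma = 100_000, 5_000, 1.0
x, chain, n_accept = 0.0, [], 0
for t in range(n_samples):
    x_prop = x + sigma * rng.normal()                  # symmetric Gaussian proposal
    log_alpha = min(0.0, log_p_tilde(x_prop) - log_p_tilde(x))
    if rng.uniform() < np.exp(log_alpha):              # accept with prob min(1, p~(x')/p~(x))
        x, n_accept = x_prop, n_accept + 1
    chain.append(x)

chain = np.array(chain[burn_in:])                      # discard the burn-in portion
print("acceptance fraction:", n_accept / n_samples)
print("sample mean (~0 by symmetry):", chain.mean())
print("sample <x^2>:", (chain**2).mean())

# Check against deterministic quadrature of the (normalized) 1D target.
xs = np.linspace(-5, 5, 20001)
w = np.exp(log_p_tilde(xs))
print("quadrature <x^2>:", (xs**2 * w).sum() / w.sum())
```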

    Problem: Explain how Gibbs sampling works.

    Solution: Whereas a Metropolis-Hastings sampler moves around in state space like a queen in chess, a Gibbs sampler moves around like a rook. That is, it only ever moves along the \(n\) coordinate directions in \(\mathbf x=(x_1,…,x_n)\). Indeed, the Gibbs sampler can be viewed as a special case of a Metropolis-Hastings sampler in which the proposal distribution \(p_p\) is given by:

    \[p_p(\mathbf x’|\mathbf x):=p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)\]

    where \(i=1,…,n\) is cycled through the \(n\) coordinate directions. For each \(1\leq i\leq n\), the MH acceptance probability is:

    \[\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(x_i|x’_1,…,x’_{i-1},x’_{i+1},…,x’_n)\delta(x_1-x’_1)…\delta(x_{i-1}-x’_{i-1})\delta(x_{i+1}-x’_{i+1})…\delta(x_n-x’_n)p(\mathbf x’)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)p(\mathbf x)}\]

    \[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x’_i,…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_n)}\]

    \[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}\]

    \[=1\]

    so \(\text{min}(1,1)=1\); in other words, the Gibbs sampler always accepts because the proposal is coming from the conditional probability slices themselves (assumed to be known even if the full joint distribution \(p(\mathbf x)\) is unknown), so they must be authentic samples in a sense. For a Gibbs sampler, only after cycling through all \(n\) coordinate directions \(i=1,…,n\) does it count as \(1\) MCMC iteration.

    (aside: a common variant of the basic Gibbs sampling formalism described above is block Gibbs sampling in which one may choose to update \(2\) or more features simultaneously, conditioning on the distribution of those features given all other features fixed; this would enable diagonal movements within that subspace which, depending how correlated those features are, could significantly increase the effective sample size (ESS)).
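
A minimal sketch of Gibbs sampling (my own example) for a bivariate Gaussian with correlation \(\rho\), where both conditional slices \(p(x_1|x_2)\) and \(p(x_2|x_1)\) are themselves Gaussian and hence trivial to sample from:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_iter, burn_in = 0.8, 50_000, 1_000

# Target: zero-mean bivariate Gaussian with unit variances and correlation rho.
# Conditional slices: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2 | x1.
x1, x2, samples = 0.0, 0.0, []
for _ in range(n_iter):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # update coordinate 1 (always accepted)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # update coordinate 2 (always accepted)
    samples.append((x1, x2))                         # one full cycle = one MCMC iteration

samples = np.array(samples[burn_in:])
print(np.cov(samples.T))                             # should approach [[1, rho], [rho, 1]]
```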

    Problem: Explain how the Hamiltonian Monte Carlo method works.

    Solution:


    Differential Geometry

    Problem: What does it mean for a topological space \(X\) to be locally homeomorphic to a topological space \(Y\)? Hence, what does it mean for a topological space \(X\) to be locally Euclidean?

    Solution: \(X\) is said to be locally homeomorphic to \(Y\) iff every point \(x\in X\) has a neighbourhood homeomorphic to an open subset of \(Y\). In particular, \(X\) is said to be locally Euclidean iff it is locally homeomorphic to Euclidean space \(Y=\mathbf R^n\) for some \(n\in\mathbf N\) called the dimension of \(X\).

    Explicitly, this means every point \(x\in X\) is associated to at least one neighbourhood \(U_x\) along with at least one homeomorphism \(\phi_x:U_x\to\phi_x(U_x)\subset\mathbf R^n\); the ordered pair \((U_x,\phi_x)\) is an example of a chart on \(X\), and any collection of charts covering \(X\) (such as \(\{(U_x,\phi_x):x\in X\}\)) is called an atlas for \(X\). Hence, \(X\) is locally Euclidean iff there exists an atlas for \(X\).

    (caution: there is a distinct concept of a local homeomorphism from \(X\to Y\); existence of such implies that \(X\) is locally homeomorphic to \(Y\), but the converse is false).

    Problem: Give an example of topological spaces \(X\) and \(Y\) such that \(X\) is locally homeomorphic to \(Y\) but not vice versa.

    Solution: \(X=(0,1)\) and \(Y=[0,1)\). Thus, be wary that local homeomorphicity is not a symmetric relation of topological spaces.

    Problem: Let \(X\) be an \(n\)-dimensional locally Euclidean topological space. What additional topological constraints are typically placed on \(X\) in order for it to qualify as a topological \(n\)-manifold?

    Solution:

    1. \(X\) is Hausdorff.
    2. \(X\) is second countable.

    That being said, conceptually the most important criterion to remember is the local Euclideanity of \(X\). Topological manifolds are the most basic/general/fundamental of manifolds, onto which additional structures may be attached.

    Problem: Let \(X\) be a topological \(n\)-manifold. What additional piece of structure should be attached to \(X\) so as to consider it a \(C^k\)-differentiable \(n\)-manifold?

    Solution: The additional piece of structure is a \(C^k\)-atlas. A \(C^k\)-atlas for \(X\) is just an atlas for \(X\) with the property that all overlapping charts are compatible with each other. This means that if \((U_1,\phi_1),(U_2,\phi_2)\) are \(2\) charts of a \(C^k\)-atlas for \(X\) such that \(U_1\cap U_2\neq\emptyset\), then the transition map \(\phi_2\circ\phi_1^{-1}:\phi_1(U_1\cap U_2)\subset\mathbf R^n\to\phi_2(U_1\cap U_2)\subset\mathbf R^n\) is \(C^k\)-differentiable (equivalent to \(\phi_1\circ\phi_2^{-1}\) being \(C^k\)-differentiable).

    An important point to mention is that a given topological \(n\)-manifold may have multiple different \(C^k\)-atlases. According to the above definition, that would seem to give rise to \(2\) distinct \(C^k\)-manifolds…but intuitively that would be undesirable; they should be viewed as really the same \(C^k\) manifold. This motivates imposing an equivalence relation on \(C^k\)-manifolds by asserting that two \(C^k\)-manifolds are equivalent iff their corresponding \(C^k\)-atlases are compatible…which just means that the union of their \(C^k\)-atlases is itself a \(C^k\)-atlas, or equivalently every chart in one \(C^k\)-atlas is compatible with every chart in the other \(C^k\)-atlas. This is sometimes formalized by introducing a so-called “maximal \(C^k\)-atlas”, but the basic idea should be clear.

    Finally, note that topological manifolds are just \(C^0\)-manifolds because homeomorphisms are continuous. In physics, the most common case is to deal with \(C^{\infty}\)-manifolds like spacetime in GR or classical configuration/phase spaces in CM, or state space in thermodynamics, also called smooth manifolds.

    Problem: Let \(S^1\) be the circle, a \(1\)-dimensional smooth manifold which is regarded as being embedded in \(\mathbf R^2\) via the set of points \(x^2+y^2=1\). Explain why \(S^1\) is locally (but not globally) homeomorphic to the real line \(\mathbf R\). Furthermore, explain whether or not the pair \((S^1,\theta)\) (where \(\theta:(\cos\theta,\sin\theta)\in S^1\mapsto\theta\in[0,2\pi)\) is the obvious angular coordinate mapping) is a chart on \(S^1\) or not.

    Solution: \(S^1\) is locally homeomorphic to \(\mathbf R\) because, roughly speaking, every little strip of open arc in \(S^1\), upon zooming in, looks like a strip of straight line in \(\mathbf R\). However, \(S^1\) is not globally homeomorphic to \(\mathbf R\) because \(S^1\) is compact whereas \(\mathbf R\) is unbounded. The pair \((S^1,\theta)\) is not a chart on \(S^1\) because its image \([0,2\pi)\) isn’t open in \(\mathbf R\).

    Problem: Hence, demonstrate an example of an atlas for \(S^1\).

    Solution: To circumvent the problem above, one requires \(2\) charts to cover \(S^1\), thereby forming an atlas for \(S^1\). For example, a chart \(\theta_1:S^1-\{(1,0)\}\to (0,2\pi)\) similar to the above but with the point \((1,0)\) removed, and a second chart \(\theta_2:S^1-\{(-1,0)\}\to (-\pi,\pi)\) with the antipodal point \((-1,0)\) removed. The domains of these \(2\) charts overlap on the open upper and lower semicircles, so the transition function on this overlapping domain is:

    \[\theta_2(\theta_1^{-1}(\theta))=\theta[\theta\in(0,\pi)]+(\theta-2\pi)[\theta\in(\pi,2\pi)]\]

    which is indeed \(C^{\infty}\).

    Problem: Define the real associative algebra \(C^k(X\to\mathbf R)\).

    Solution: This is simply the space of all \(C^k\)-differentiable scalar fields on the \(C^k\)-manifold \(X\). To be precise, \(f\in C^k(X\to\mathbf R)\) iff for all charts \((U,\phi)\) in the \(C^k\)-atlas defining the differential structure of \(X\), the map \(f\circ\phi^{-1}:\phi(U)\to\mathbf R\) is \(C^k\)-differentiable.

    Problem: What is a tangent vector to a \(C^1\)-manifold \(X\) at a point \(x\in X\)?

    Solution: In the algebraic formulation, a tangent vector is simply any \(\mathbf R\)-derivation on \(C^1(X\to\mathbf R)\) at a specific point \(x\in X\) on the manifold. Concretely, this means that a tangent vector at \(x\in X\) is any real-valued function \(\partial_{\mu}|_x:C^1(X\to\mathbf R)\to\mathbf R\) which is linear and satisfies the product rule:

    \[\partial_{\mu}|_x(\alpha f+\beta g)=\alpha\partial_{\mu}|_x(f)+\beta\partial_{\mu}|_x(g)\]

    \[\partial_{\mu}|_x(fg)=f(x)\partial_{\mu}|_x(g)+\partial_{\mu}|_x(f)g(x)\]

    A more intuitive but equivalent formulation of what “tangent vector” means is to consider trajectories \(\gamma(t)\in X\) through the manifold \(X\) passing through the point \(x\in X\) at time \(t\). Then, choosing an arbitrary chart \((U,\phi)\) in a neighbourhood of \(x\in U\), one considers the coordinatized version of \(\gamma(t)\in U\subset X\), namely \(\phi(\gamma(t))\in\mathbf R^n\) which will be differentiable with some derivative \(\frac{d}{dt}\phi(\gamma(t))\in\mathbf R^n\). All curves \(\gamma(t)\) with the same derivative at \(x\) are considered to form an equivalence class, and the value of that derivative is then effectively the tangent vector. Unlike the algebraic formulation, here one has to check that this definition is independent of the choice of neighbourhood chart \((U,\phi)\) of \(x\in U\subset X\).

    Problem: What is the tangent space \(T_x(X)\) to a \(C^1\)-manifold \(X\) at a point \(x\in X\)?

    Solution: \(T_x(X)\) is simply the set of all tangent vectors to \(X\) at \(x\in X\). As the name of its elements (i.e. “tangent vectors“) suggests, \(T_x(X)\) is a real, \(n\)-dimensional vector space. By choosing a neighbourhood chart \((U,\phi)\) containing \(x\in U\), this provides \(n\) local coordinates \(\phi(x)=(x^0(x),…,x^{n-1}(x))\) on \(U\). An explicit basis for \(T_x(X)\) then consists of the \(n\) partial derivative tangent vector operators \(\partial_{\mu}|_x\in T_x(X)\) for \(\mu=0,…,n-1\) defined by:

    \[\partial_{\mu}|_x(f):=\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}\]

    (which is well-defined because \(f\in C^1(X\to\mathbf R)\)). Any tangent vector \(v_x\in T_x(X)\) at \(x\in X\) can then be written as a linear combination of these basis tangent vectors:

    \[v_x=v^{\mu}\partial_{\mu}|_x\]

    with real coefficients \(v^{\mu}\in\mathbf R\) called the contravariant components of \(v_x\in T_x(X)\) in the coordinate basis \(\partial_{\mu}|_x\).

    Problem: A given tangent vector \(v_x\in T_x(X)\) may be written with contravariant components in \(2\) distinct coordinate bases:

    \[v_x=v^{\mu}\partial_{\mu}|_x=v’^{\nu}\partial’_{\nu}|_x\]

    simply from picking \(2\) different charts \((U,\phi),(U’,\phi’)\) in the \(C^1\)-atlas of \(X\) containing \(x\in U\cap U’\). Describe how the contravariant components \(v’^{\nu}\) may be obtained from the contravariant components \(v^{\mu}\).

    Solution: Act on an arbitrary \(f\in C^1(X\to\mathbf R)\) to obtain:

    \[v_x(f)=v^{\mu}\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=v’^{\nu}\frac{\partial (f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

    Now insert a “resolution of the identity” \(f\circ\phi’^{-1}\circ\phi’\circ\phi^{-1}\) and because \(X\) is a \(C^1\)-manifold, the transition map \(\phi’\circ\phi^{-1}\) will be differentiable and in particular, by the chain rule:

    \[\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}\frac{\partial(f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

    where \(\phi'(x)=(x’^0(x),…,x’^{n-1}(x))\). This simple chain rule identity can by itself already be viewed as a passive change of \(T_x(X)\)-basis:

    \[\frac{\partial}{\partial x’^{\nu}}\biggr|_{x}=\frac{\partial x^{\mu}}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\frac{\partial}{\partial x^{\mu}}\biggr|_x\]

    Or equivalently, as an active change of the contravariant components via the Jacobian matrix:

    \[v’^{\nu}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}v^{\mu}\]
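
As a quick sanity check of this transformation law (a minimal sympy sketch; the Cartesian/polar charts and the sample components below are illustrative choices, not taken from the text above):

import sympy as sp

# Cartesian chart (x, y) and polar chart (r, theta) on R^2 \ {0}
x, y = sp.symbols('x y', real=True)

# transition map phi' o phi^{-1}: (x, y) -> (r, theta)
r_of = sp.sqrt(x**2 + y**2)
theta_of = sp.atan2(y, x)

# Jacobian matrix dx'^nu / dx^mu of the transition map
J = sp.Matrix([[sp.diff(r_of, x), sp.diff(r_of, y)],
               [sp.diff(theta_of, x), sp.diff(theta_of, y)]])

# contravariant components v^mu of a tangent vector in the Cartesian basis,
# evaluated at the point (x, y) = (1, 1)
v_cart = sp.Matrix([2, 3])
pt = {x: 1, y: 1}

# transformed components v'^nu = (dx'^nu/dx^mu) v^mu in the (d/dr, d/dtheta) basis
v_polar = sp.simplify(J.subs(pt) * v_cart)
print(v_polar)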

    Problem: What is a vector field \(v\) on a \(C^1\)-manifold \(X\)?

    Solution: The more intuitive way to understand it is as a map \(v:X\to TX\), where \(TX:=\bigsqcup_{x\in X}T_x(X)\) is the so-called tangent bundle of \(X\). This basically corresponds to assigning a tangent vector \(v_x\in T_x(X)\) rooted at each point \(x\in X\) across the manifold \(X\). Alternatively, a more algebraic definition of a vector field is as a map from scalar fields to scalar fields \(v:C^1(X\to\mathbf R)\to C^1(X\to\mathbf R)\), defined so that for a scalar field \(f:X\to\mathbf R\) on the manifold \(X\), the new scalar field \(v(f):X\to\mathbf R\) can be thought of as a global directional derivative map \(v(f)(x):=v_x(f)\) giving the directional derivative \(v_x(f)\in\mathbf R\) of the scalar field \(f\) at each point \(x\in X\) according to some assignment \(v_x\in T_x(X)\) of tangent vectors (in the spirit of the first definition).

    Problem: Given a \(C^1\)-manifold \(X\), and a neighbourhood of a point \(x\in X\) with local coordinates \((x^0,…,x^{n-1})\), explain how a vector field \(v\) may be expressed in this coordinate basis over said neighbourhood.

    Solution: \(v=v^{\mu}\partial_{\mu}\) is essentially a directional derivative operator. Just to be clear, there is a standard abuse of notation going on here where the underlying chart \(\phi\) defining the local coordinates \((x^0,…,x^{n-1})\) is swept under the rug, so for instance \(v(f)(x)=v^{\mu}(x)\frac{\partial f}{\partial x^{\mu}}(x)\) where the shorthand \(\frac{\partial f}{\partial x^{\mu}}(x)=\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}(\phi(x))\) is being used.

    Problem: Given \(2\) vector fields \(v,w\) on the same \(C^2\)-manifold \(X\), explain why the composition \(v\circ w\) (and hence also \(w\circ v\)) is not a vector field on \(X\).

Solution: Although the composition \(v\circ w\) is linear, it fails to satisfy the product rule: expanding \(v(w(fg))=v(fw(g)+gw(f))=v(f)w(g)+fv(w(g))+v(g)w(f)+gv(w(f))\) produces the extra cross terms \(v(f)w(g)+v(g)w(f)\), reflecting the fact that \(v\circ w\) is a second-order differential operator rather than a derivation.

    Problem: Explain how the above situation can be remedied.

    Solution: Consider instead the commutator \([v,w]:C^2(X\to\mathbf R)\to C^2(X\to\mathbf R)\). This is now both linear, but also importantly obeys the product rule:

    \[[v,w](fg)=f[v,w](g)+g[v,w](f)\]

    Indeed, as with any commutator this makes the space of all vector fields not only a real vector space but also a Lie algebra. In a common coordinate basis \((x^0,…,x^{n-1})\) where \(v=v^{\mu}\partial_{\mu}\) and \(w=w^{\nu}\partial_{\nu}\), the commutator has the representation:

    \[[v,w]=\left(v^{\mu}\partial_{\mu}w^{\nu}-w^{\mu}\partial_{\mu}v^{\nu}\right)\partial_{\nu}\]
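
As a quick symbolic check (a sketch using sympy; the two particular vector fields below are made up for illustration), one can verify that this coordinate formula agrees with \(v\circ w-w\circ v\) applied to an arbitrary scalar field, the second-derivative terms cancelling pairwise:

import sympy as sp

x0, x1 = sp.symbols('x0 x1', real=True)
coords = (x0, x1)
f = sp.Function('f')(x0, x1)          # arbitrary test scalar field

# components of two vector fields v = v^mu d_mu and w = w^nu d_nu
v = [x0*x1, sp.sin(x0)]
w = [x1**2, x0 + x1]

def apply(field, g):
    # directional derivative: field(g) = field^mu * dg/dx^mu
    return sum(field[mu] * sp.diff(g, coords[mu]) for mu in range(2))

# commutator applied to f directly: v(w(f)) - w(v(f))
lhs = apply(v, apply(w, f)) - apply(w, apply(v, f))

# coordinate formula: [v,w]^nu = v^mu d_mu w^nu - w^mu d_mu v^nu
bracket = [apply(v, w[nu]) - apply(w, v[nu]) for nu in range(2)]
rhs = apply(bracket, f)

print(sp.simplify(sp.expand(lhs - rhs)))  # 0: the mixed second derivatives of f cancel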


    Support Vector Machines

    Problem: Explain how a hard-margin support vector machine would perform binary classification.

Solution: Conceptually, it’s simple. Given a training set of \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) each associated with some binary target label \(y_1,…,y_N\in\{-1,1\}\) (notice the \(2\) binary classes are \(y=-1\) and \(y=1\) rather than the more conventional \(y=0\) and \(y=1\); this is simply a matter of convenience), the goal of an SVM is to compute the unique maximum-margin hyperplane in \(\mathbf R^n\) that separates the \(y=-1\) and \(y=1\) classes as well as possible. For \(n=2\) this can be visualized as trying to find the “widest possible street” between \(2\) neighborhoods:

    Mathematically, if the hyperplane decision boundary \(\mathbf w\cdot\mathbf x+b=0\) is defined by a normal vector \(\mathbf w\in\mathbf R^n\) such that the SVM classifies points with \(\mathbf w\cdot\mathbf x+b\geq 0\) as \(y=1\) and \(\mathbf w\cdot\mathbf x+b\leq 0\) as \(y=-1\) (more compactly, \(\hat y_{\text{SVM}}(\mathbf x|\mathbf w,b)=\text{sgn}(\mathbf w\cdot\mathbf x+b)\)), and the overall normalization of \(\mathbf w\) and \(b\) is fixed by stipulating that the “gutters” of the street are defined by the hyperplanes \(\mathbf w\cdot\mathbf x+b=\pm 1\) passing through the support (feature) vectors (drawn as bigger red/green dots in the diagram), then the “width of the street” is the distance between these \(2\) “gutter hyperplanes”, namely \(2/|\mathbf w|\). Since one would like to maximize this margin \(2/|\mathbf w|\), this is equivalent to minimizing the quadratic \(|\mathbf w|^2/2\). Thus, one can formulate this “hard-margin SVM” algorithm as the programming problem:

    \[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R:\forall i=1,…,N,y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1}\frac{|\mathbf w|^2}{2}\]

    where the \(N\) constraints \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1\) (or vectorially \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1\)) are just saying that all \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\) in the training set should lie on the correct side of the decision boundary hyperplane \(\mathbf w\cdot\mathbf x+b=0\).

    Problem: What is the glaring problem with a hard-margin SVM? How does a soft-margin SVM help to address this issue?

    Solution: The basic problem is that the hard-margin SVM is way too sensitive to outliers, e.g. a single \(y=-1\) feature vector amidst the \(y=1\) neighborhood would cause the hard-margin SVM to be unable to find a solution (because indeed there would be no hyperplane that perfectly separates all the training data into their correct classes). Intuitively, it seems like in such a case the best way to deal with this bias-variance tradeoff is to shrug and ignore the outliers:

    Mathematically, the usual way to encode this is by introducing \(N\) “slack variables” \(\xi_1,…,\xi_N\geq 0\) such that the earlier \(N\) constraints are relaxed to \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1-\xi_i\). However, to avoid cutting too much slack (i.e. incurring too much misclassification on the training set), their use should be penalized by the \(\ell^1\)-norm \(|\boldsymbol{\xi}|_1:=\boldsymbol{\xi}\cdot\mathbf{1}=\sum_{i=1}^N\xi_i\). To this effect, the programming problem is modified to that of soft-margin SVM:

    \[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R,\boldsymbol{\xi}\in\mathbf R^N:\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi},\boldsymbol{\xi}\geq\mathbf 0}\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1\]

    where \(C\geq 0\) is a “strictness” hyperparameter that must be selected ahead of time to balance hard vs. soft SVM.

    Problem: The above represents the primal formulation of the programming problem in terms of so-called primal variables \(\mathbf w,b,\boldsymbol{\xi}\). Show how the KKT conditions enable one to obtain a corresponding dual formulation of the programming problem in terms of a dual variable \(\boldsymbol{\lambda}\):

    \[\text{argmax}_{\boldsymbol{\lambda}\in\mathbf R^N:\boldsymbol{\lambda}\cdot\mathbf y=0,\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1}|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

    Solution: The idea is to combine the \(2\) constraints \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi}\) and \(\boldsymbol{\xi}\geq\mathbf 0\) using \(2\) KKT multipliers \(\boldsymbol{\lambda},\boldsymbol{\mu}\) into the Lagrangian:

    \[L(\mathbf w,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})=\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1-\boldsymbol{\lambda}\cdot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})-\boldsymbol{\mu}\cdot\boldsymbol{\xi}\]

    The KKT necessary conditions assert:

    \[\frac{\partial L}{\partial\mathbf w}=\mathbf 0\Rightarrow\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\]

    \[\frac{\partial L}{\partial b}=0\Rightarrow\boldsymbol{\lambda}\cdot\mathbf y=0\]

    \[\frac{\partial L}{\partial\boldsymbol{\xi}}=\mathbf 0\Rightarrow C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\]

    in addition to the \(2\) original constraints on the primal variables (primal feasibility), and dual feasibility \(\boldsymbol{\lambda},\boldsymbol{\mu}\geq\mathbf 0\) on both dual variables (as a corollary, \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)), and complementary slackness \(\boldsymbol{\lambda}\odot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})=\boldsymbol{\mu}\odot\boldsymbol{\xi}=\mathbf 0\). The idea of the support vectors is clearly seen in the \(1^{\text{st}}\) of these complementary slackness conditions.

    By substituting the stationary conditions found back into the Lagrangian, one can eliminate all the primal variables (and even eliminate one of the dual variables \(\boldsymbol{\mu}\)) to obtain the claim:

    \[L(\boldsymbol{\lambda})=|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

    which is a standard quadratic programming exercise with efficient solutions for \(\boldsymbol{\lambda}\).

Aside: since the objective is convex and all constraints are affine, strong duality holds for this quadratic program, so the min over the primal variables and the max over the dual variables of the Lagrangian may be interchanged; this is what justifies solving the dual problem in place of the primal.
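
As a rough numerical illustration (a sketch only: it uses scipy.optimize.minimize with the generic SLSQP solver on made-up toy data and a linear kernel, whereas a real implementation would use a dedicated QP/SMO routine), one can solve this dual problem and recover \(\mathbf w\) and \(b\) directly:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# toy data: columns of X are feature vectors x_i, two roughly separable clusters
N = 40
X = np.hstack([rng.normal(-2, 1, (2, N // 2)), rng.normal(2, 1, (2, N // 2))])
y = np.array([-1] * (N // 2) + [1] * (N // 2), dtype=float)
C = 1.0

K = X.T @ X                      # Gram matrix of pairwise dot products

def neg_dual(lam):
    q = lam * y                  # lambda ⊙ y
    return -(lam.sum() - 0.5 * q @ K @ q)

res = minimize(neg_dual, x0=np.zeros(N), method='SLSQP',
               bounds=[(0.0, C)] * N,
               constraints=[{'type': 'eq', 'fun': lambda lam: lam @ y}])
lam = res.x

# recover the primal variables: w = X(lambda ⊙ y), and b from an on-margin support vector
w = X @ (lam * y)
S = np.argmax((lam > 1e-6) & (lam < C - 1e-6))   # index with 0 < lambda_S < C
b = y[S] - w @ X[:, S]
print(w, b)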

    Problem: Although soft-margin SVM is a significant improvement over hard-margin SVM, it too has a glaring issue. What is that issue and how can it be addressed?

    Solution: Whether hard-margin or soft-margin, SVMs are ultimately linear binary classifiers; they can only directly learn a hyperplane decision boundary. Therefore, if the data is simply linearly inseparable, then even soft-margin SVM would be unwise to apply directly.

However, there is a creative solution; just take the \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) and apply a nonlinear transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) from the current feature space \(\mathbf R^n\) to some higher-dimensional feature space \(\mathbf R^{>n}\) (essentially a kind of feature engineering), thereby obtaining \(N\) transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\). Ideally, if the nonlinear transformation \(\prime\) is well-chosen, then the transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\) will become linearly separable in the higher-dimensional feature space \(\mathbf R^{>n}\), in which case the usual soft-margin SVM may be applied in \(\mathbf R^{>n}\). Then, pulling the SVM (soft) maximal-margin hyperplane in \(\mathbf R^{>n}\) back to the original feature space \(\mathbf R^n\) (i.e. taking its preimage under \(\prime\); the map itself is generally not invertible), one obtains a nonlinear decision boundary in \(\mathbf R^n\), thereby effectively extending the applicability of the SVM method to linearly inseparable data! See this YouTube video for an example.

    Problem: Continuing off the previous problem, suppose one is already in the transformed space with feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\), and one has solved the dual programming problem in this space to obtain some optimal \(\boldsymbol{\lambda}\in\mathbf R^N\). How is binary classification thus performed?

    Solution: Based on the previous analysis, the SVM binary classifier in this transformed feature space \(\mathbf R^{>n}\) may be written:

    \[\hat y_{\text{SVM}}(\mathbf x’)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_i\mathbf x’\cdot\mathbf x’_i+y_S-\sum_{i=1}^N\lambda_iy_i\mathbf x’_S\cdot\mathbf x’_i\right)\]

where \(\mathbf x’_S\) is any support feature vector lying exactly on its gutter hyperplane, i.e. one with \(0<\lambda_S<C\) (so that \(\xi_S=0\) and \(y_S(\mathbf w\cdot\mathbf x’_S+b)=1\), whence \(b=y_S-\mathbf w\cdot\mathbf x’_S\)). Note by complementary slackness that most \(\lambda_i=0\) vanish (these correspond to the non-support feature vectors \(\mathbf x_i\)), so the sums \(\sum_{i=1}^N\) may also be written as sums over only the support vectors.

    Problem: Explain how the expression above can be simplified by means of the kernel trick.

    Solution: Basically, the feature space transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) is a complicated mapping of vectors to vectors. By contrast, only their scalar dot products \(\mathbf x’\cdot\mathbf x’_i\) appear in the SVM classifier \(\hat y_{\text{SVM}}(\mathbf x)\), and scalars are simpler than vectors. The kernel trick essentially invites one to forget about the details of the transformation \(\prime\) and just directly specify an analytical expression for the dot product in the transformed feature space \(K(\mathbf x,\tilde{\mathbf x}):=\mathbf x’\cdot\tilde{\mathbf x}’\), where here \(K:\mathbf R^n\times\mathbf R^n\to\mathbf R\) is called the kernel function. Effectively, this means one never has to actually visit the transformed feature space \(\mathbf R^{>n}\) since the kernel is perfectly happy to just take the feature vectors in the original space \(\mathbf R^n\). Thus, one can write:

    \[\hat y_{\text{SVM}}(\mathbf x)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_iK(\mathbf x,\mathbf x_i)+y_S-\sum_{i=1}^N\lambda_iy_iK(\mathbf x_S,\mathbf x_i)\right)\]

    Problem: What is the Gaussian (radial basis function) kernel \(K_{\sigma}(\mathbf x,\tilde{\mathbf x})\)? Sketch why it’s a valid kernel function.

    Solution: This is by far the most popular kernel for nonlinear SVM classification, involving a hyperparameter \(\sigma\):

    \[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x-\tilde{\mathbf x}|^2/2\sigma^2}\]

Algebraically, it is clearly equivalent to:

    \[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}e^{\mathbf x\cdot\tilde{\mathbf x}/\sigma^2}\]

    Taylor expanding the last factor:

    \[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}\sum_{d=0}^{\infty}\frac{(\mathbf x\cdot\tilde{\mathbf x})^d}{\sigma^{2d}d!}\]

so it’s clearly an inner product in a (countably-infinite-dimensional!) feature space, being a convergent sum with positive coefficients of the degree-\(d\) polynomial kernels \(K_d(\mathbf x,\tilde{\mathbf x})=(\mathbf x\cdot\tilde{\mathbf x})^d\) (each a valid kernel in its own right), multiplied by positive functions of \(\mathbf x\) and \(\tilde{\mathbf x}\) separately.
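
Operationally, being a valid kernel means every Gram matrix it generates is symmetric positive semi-definite (Mercer’s condition); a quick numerical spot-check on randomly generated points (a sketch, with arbitrary choices of \(\sigma\) and sample size):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))          # 30 random feature vectors in R^5
sigma = 1.5

# Gaussian (RBF) kernel Gram matrix K_ij = exp(-|x_i - x_j|^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-10)        # True: all eigenvalues are (numerically) >= 0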

    Problem: What were the main problems with SVMs that spurred the deep learning renaissance via the development of artificial neural networks?

Solution: The main issue with SVMs is their lack of scalability. Roughly speaking, the computational cost of training an SVM on a dataset of \(N\) training examples scales like \(O(N^2)\) in the best case (essentially because the Gram matrix \(X^TX\) is an \(N\times N\) matrix of all pairwise interactions of features); even with modern tricks like sequential minimal optimization (SMO), this scaling is unavoidable. Another, smaller issue is that SVMs rely on a fixed, hand-designed kernel (i.e. hand-engineered features) rather than learning a feature representation from the data, which is precisely what deep neural networks excel at.

    Problem: Write a short Python program using the scikit-learn library to train a support vector machine on the Kaggle Titanic competition dataset.

    Solution:

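One possible minimal sketch (assuming the competition’s train.csv has been downloaded locally; the columns follow the standard Titanic schema, and the preprocessing and hyperparameter choices here are illustrative rather than tuned):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# load the Kaggle Titanic training data (downloaded as train.csv)
df = pd.read_csv('train.csv')
y = df['Survived']
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]

numeric = ['Age', 'SibSp', 'Parch', 'Fare', 'Pclass']
categorical = ['Sex', 'Embarked']

preprocess = ColumnTransformer([
    # impute missing ages/fares with the median, then standardize
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric),
    # impute missing embarkation port with the mode, then one-hot encode
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

# soft-margin SVM with a Gaussian (RBF) kernel; C and gamma are the usual hyperparameters
model = Pipeline([('prep', preprocess),
                  ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))])

scores = cross_val_score(model, X, y, cv=5)
print(f'5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')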

    Numerical Computation

    Problem: In numerical computation, what are the \(2\) main kinds of rounding error?

Solution: Overflow error (\(N\approx\infty\)) and, perhaps even more dangerous, underflow error (\(\varepsilon\approx 0\)); the two are in some sense inverses of each other:

    \[0=\frac{1}{\infty}\]

    Problem: Explain how rounding errors can affect evaluation of the softmax activation function and strategies to address it.

Solution: Given a vector \(\mathbf x\in\mathbf R^n\), the corresponding softmax probability vector is the \(\ell^1\)-normalized \(e^{\mathbf x}/|e^{\mathbf x}|_1\), a point on the probability simplex. If the components of \(\mathbf x\) are very negative, then the corresponding components of \(e^{\mathbf x}\) can underflow (in the worst case the entire denominator \(|e^{\mathbf x}|_1\) underflows to \(0\), causing a division by zero), whereas if any component is very positive, then overflow becomes a possibility.

To address these numerical stability issues, the trick is to notice that softmax is invariant under “diagonal” translations along \(\mathbf 1\in\mathbf R^n\), i.e. \(\mathbf x\mapsto\mathbf x-\lambda\mathbf 1\) leads to \(e^{\mathbf x-\lambda\mathbf 1}=e^{\mathbf x}\odot e^{-\lambda\mathbf 1}=e^{-\lambda}e^{\mathbf x}\) and hence the \(\ell^1\)-norm \(|e^{\mathbf x-\lambda\mathbf 1}|_1=e^{-\lambda}|e^{\mathbf x}|_1\) is also merely scaled by the same factor \(e^{-\lambda}\). Choosing \(\lambda:=\text{max}(\mathbf x)\) to be the maximum component of \(\mathbf x\) (which coincides with the \(\ell^{\infty}\)-norm \(|\mathbf x|_{\infty}\) when \(\mathbf x\geq\mathbf 0\) componentwise) would eliminate any overflow error that may have been present in evaluating the numerator \(e^{\mathbf x}\) by instead evaluating \(e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}\). For the same reason, the shifted denominator \(|e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}|_1\) is also immune to overflow error. Furthermore, the denominator is also resistant to underflow error because at least one term in the series for \(|e^{\mathbf x-\lambda\mathbf 1}|_1=\sum_{i}e^{x_i-\lambda}\) will be \(e^0=1\gg 0\), namely term \(i=\text{argmax}(\mathbf x)\). The only possible problem is that the modified numerator could still experience an underflow error; this would be bad if subsequently one were to evaluate the information content (loss function) of a softmax outcome, since taking \(\log(0)=-\infty\) would propagate and typically produce nans downstream. For this case, one can apply the same shift inside the logarithm (the log-sum-exp trick) to numerically stabilize the computation of the log-softmax directly, as sketched below.
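
A minimal numpy sketch of both stabilizations (the shifted softmax and the log-sum-exp-stabilized log-softmax):

import numpy as np

def softmax(x):
    # numerically stable softmax: shift by max(x) before exponentiating
    z = x - np.max(x)                 # largest shifted component is 0, so no overflow
    e = np.exp(z)
    return e / e.sum()                # denominator >= 1, so it cannot underflow to 0

def log_softmax(x):
    # stable log-softmax: avoids log(0) when a softmax component underflows
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))   # the log-sum-exp trick

x = np.array([1000.0, 0.0, -1000.0])  # naive exp(x) would overflow/underflow
print(softmax(x))        # ~[1, 0, 0] without any nan/inf
print(log_softmax(x))    # finite log-probabilities, even for the tiny components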

    Problem: Describe the numerical analysis phenomenon of catastrophic cancellation.

    Solution: Subtraction \(f(x,y):=x-y\) is ill-conditioned when \(x\approx y\). This is because even if \(x\) and \(y\) have small relative errors, the relative error in their difference \(f(x,y)\) can still be substantial.
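
A small illustration (using one standard textbook example): computing \(\sqrt{x+1}-\sqrt{x}\) for large \(x\) directly, versus in the algebraically equivalent form \(1/(\sqrt{x+1}+\sqrt{x})\) which avoids the subtraction:

import numpy as np

x = 1e16
naive = np.sqrt(x + 1) - np.sqrt(x)           # the two terms agree in essentially all
                                              # of their significant digits, which cancel
stable = 1.0 / (np.sqrt(x + 1) + np.sqrt(x))  # equivalent form with no subtraction

print(naive)   # 0.0    -- catastrophic cancellation: every significant digit is lost
print(stable)  # ~5e-9  -- correct to full precision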

    Problem: What does the condition number of a matrix \(X\) measure?

Solution: The condition number measures how numerically stable it is to invert \(X\mapsto X^{-1}\) (equivalently, how much relative errors can be amplified when solving linear systems \(X\mathbf a=\mathbf b\)). In general it is the ratio of the largest to smallest singular values of \(X\); for a normal matrix (e.g. a symmetric Hessian) this reduces to the ratio of the largest to smallest eigenvalues by absolute magnitude:

    \[\frac{\text{max}|\text{spec}(X)|}{\text{min}|\text{spec}(X)|}\]

For instance, a singular (non-invertible) matrix has \(\text{min}|\text{spec}(X)|=0\) and hence an infinite condition number.
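
A quick numerical check (with made-up matrices) that numpy’s 2-norm condition number agrees with the eigenvalue ratio in the symmetric case, and diverges for a singular matrix:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])          # symmetric, eigenvalues 4 and 2

eig = np.abs(np.linalg.eigvalsh(A))
print(eig.max() / eig.min())        # 2.0
print(np.linalg.cond(A))            # 2.0 (2-norm condition number)

B = np.array([[1.0, 0.0],
              [0.0, 0.0]])          # singular: smallest eigenvalue is 0
print(np.linalg.cond(B))            # inf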

    Problem: Show that during gradient descent of a scalar field \(f(\mathbf x)\), if at some point \(\mathbf x\) in the optimization landscape, one has \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) (e.g. such as if the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) is just generally positive-definite or the gradient \(\frac{\partial f}{\partial\mathbf x}\) is an eigenvector of the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) with positive eigenvalue), then by a line search logic, the most effective local learning rate \(\alpha(\mathbf x)>0\) is given by:

    \[\alpha(\mathbf x)=\frac{\left|\frac{\partial f}{\partial\mathbf x}\right|^2}{\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}}\]

    Solution: Since gradient descent takes a step \(\mathbf x\mapsto\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\), and the goal is to decrease the value of the function \(f\) as much as possible to get closer to the minimum, one would like to maximize:

    \[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\]

    as a function of the learning rate \(\alpha>0\). Assuming that a quadratic approximation is valid:

    \[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\approx \alpha\left|\frac{\partial f}{\partial\mathbf x}\right|^2-\frac{\alpha^2}{2}\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}\]

which, being a downward-opening parabola in \(\alpha\) precisely when \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\), has a well-defined maximum; setting the derivative with respect to \(\alpha\) to zero yields the claimed learning rate \(\alpha(\mathbf x)\).
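
For a purely quadratic \(f\), the second-order expansion is exact, so the formula is easy to verify numerically (a sketch with an arbitrary positive-definite Hessian and starting point):

import numpy as np

# quadratic f(x) = 0.5 x^T H x with a fixed positive-definite Hessian H
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x = np.array([1.0, -2.0])
g = grad(x)

# line-search learning rate: alpha = |g|^2 / (g . H g)
alpha = (g @ g) / (g @ H @ g)

# compare the one-step decrease against nearby fixed learning rates
for a in [0.5 * alpha, alpha, 1.5 * alpha]:
    print(a, f(x) - f(x - a * g))   # the decrease is largest at a = alpha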

Problem: Explain why gradient descent behaves like this in this case. What can be done to address it?

    Solution: The problem is that the Hessian is ill-conditioned (in this case the condition number is \(5\) everywhere); the curvature in the \((1,1)\) direction is \(5\times\) greater than the curvature in the \((1,-1)\) direction. Gradient descent wastes time repeatedly descending “canyon walls” because it sees those as the steepest features. A \(2^{\text{nd}}\)-order optimization algorithm (e.g. Newton’s method) that uses information from the Hessian would do better.
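
A tiny comparison (on a made-up quadratic whose Hessian has condition number \(5\), echoing the scenario above) of fixed-learning-rate gradient descent against a single Newton step:

import numpy as np

# f(x) = 0.5 x^T H x, with curvature 5x larger along (1,1) than along (1,-1)
u1 = np.array([1.0, 1.0]) / np.sqrt(2)
u2 = np.array([1.0, -1.0]) / np.sqrt(2)
H = 5.0 * np.outer(u1, u1) + 1.0 * np.outer(u2, u2)

grad = lambda x: H @ x
x = np.array([1.0, 2.0])

# fixed-learning-rate gradient descent: overshoots back and forth across the steep
# (1,1) direction while crawling along the flat (1,-1) direction
x_gd = x.copy()
for _ in range(10):
    x_gd = x_gd - 0.3 * grad(x_gd)
print(np.linalg.norm(x_gd))         # ~0.02: still away from the minimum at 0

# Newton's method uses the Hessian and (for a quadratic) jumps to the minimum in one step
x_newton = x - np.linalg.solve(H, grad(x))
print(np.linalg.norm(x_newton))     # ~0 (machine precision)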

    Problem: Explain how the Karush-Kuhn-Tucker (KKT) conditions generalize the method of Lagrange multipliers.

    Solution: The idea is, in addition to equality constraints like \(x+y=1\), one can also allow for inequality constraints like \(x^2+y^2\leq 1\) (indeed, an equality \(a=b\) is just \(2\) inequalities \(a\leq b\) and \(a\geq b\)). Then, constrained optimization of a scalar field \(f(\mathbf x)\) subject to a bunch of equality constraints \(g_1(\mathbf x)=g_2(\mathbf x)=…=0\) and a bunch of inequality constraints \(h_1(\mathbf x),h_2(\mathbf x),…\leq 0\) is equivalent to unconstrained optimization of the Lagrangian:

    \[L(\mathbf x,\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h})=f(\mathbf x)+\boldsymbol{\lambda}_{\mathbf g}\cdot\mathbf g(\mathbf x)+\boldsymbol{\lambda}_{\mathbf h}\cdot\mathbf h(\mathbf x)\]

    where \(\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h}\) are called KKT multipliers. The KKT necessary (but not sufficient!) conditions for optimality are then (ignoring technicalities about equating objects of possibly different dimensions):

1. \(\frac{\partial L}{\partial\mathbf x}=\mathbf 0\) (stationarity)
    2. \(\mathbf g(\mathbf x)=\mathbf 0\geq \mathbf h(\mathbf x)\) (primal feasibility)
    3. \(\boldsymbol{\lambda}_{\mathbf h}\geq 0\) (if the inequalities were instead written as \(\mathbf h(\mathbf x)\geq\mathbf 0\) then this would instead require \(\boldsymbol{\lambda}_{\mathbf h}\leq 0\)) (dual feasibility)
    4. \(\boldsymbol{\lambda}_{\mathbf h}\odot\mathbf h(\mathbf x)=\mathbf 0\) (complementary slackness)
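
A minimal worked example (the objective and constraint are made up for illustration): minimize \(f(x,y)=(x-2)^2+(y-2)^2\) subject to the single inequality \(h(x,y)=x+y-2\leq 0\); the constraint is active at the optimum \((1,1)\), and the KKT conditions can be checked directly:

import numpy as np
from scipy.optimize import minimize

f = lambda z: (z[0] - 2)**2 + (z[1] - 2)**2
h = lambda z: z[0] + z[1] - 2                      # constraint h(z) <= 0

# scipy's SLSQP expects inequality constraints in the form fun(z) >= 0, so pass -h
res = minimize(f, x0=np.zeros(2), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda z: -h(z)}])
z = res.x                                           # -> approximately (1, 1)

# verify the KKT conditions at the optimum
grad_f = 2 * (z - 2)                                # gradient of the objective, ~(-2, -2)
grad_h = np.array([1.0, 1.0])                       # gradient of the constraint
lam = -grad_f[0] / grad_h[0]                        # stationarity: grad_f + lam*grad_h = 0

print(z)                     # primal feasibility: h(z) ~ 0 <= 0
print(lam)                   # dual feasibility: lam ~ 2 >= 0
print(lam * h(z))            # complementary slackness: lam * h(z) ~ 0 (constraint active)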