PyTorch Fundamentals (Part \(1\))

Problem: Illustrate some of the basic fundamentals involved in using the PyTorch deep learning library. In particular, discuss the attributes of PyTorch tensors (e.g. dtype, CPU/GPU devices, etc.), how to generate random PyTorch tensors with/without seeding, and operations that can be performed on and between PyTorch tensors.

Solution:

pytorch_fundamentals
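Since the attached notebook is not reproduced here, the following is a minimal sketch (not the notebook itself) of the tensor attributes, seeded random generation, and tensor operations mentioned in the problem statement:

```python
import torch

# Attributes of a tensor: dtype, device, shape
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(x.dtype, x.device, x.shape)        # torch.float32 cpu torch.Size([2, 2])

# Move the tensor to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)

# Random tensors: seeding the global generator makes the draws reproducible
torch.manual_seed(0)
a = torch.rand(2, 2, device=device)      # uniform on [0, 1)
b = torch.randn(2, 2, device=device)     # standard normal

# Operations on and between tensors
print(x + a)                             # elementwise addition
print(x * a)                             # elementwise (Hadamard) product
print(x @ b)                             # matrix multiplication
print(x.sum(), x.mean())                 # reductions
```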

Self-Attention in Transformers

Problem: Explain how the transformer architecture works at a mathematical level (e.g. as outlined in the Attention Is All You Need paper).

Solution:

  1. (Tokenization) Partition the inputted natural language text into a sequence of tokens \(\tau_1,…,\tau_N\) (here \(N\) is bounded from above by the LLM's context size).
  2. (Embedding) Each token \(\tau_i\) is embedded as a vector \(\tau_i\mapsto\mathbf x_i\in\mathbf R^{n_e}\) that contains information about the token’s generic meaning as well as its position in the inputted natural language text. Here, the hyperparameter \(n_e\) is the dimension of the embedding space.
  3. (Single Self-Attention Head) For each embedding vector \(\mathbf x_i\), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\) and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it according to a convex linear combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context, specifically:

\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]

where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are key and value matrices associated to the inputted context (filled with column vectors here rather than the ML convention of row vectors). This map that takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x'_i=\mathbf x_i+\Delta\mathbf x_i\) is called a head of self-attention (a code sketch of a single head is given after this list). The \(1/\sqrt{n_{qk}}\) scaling in the softmax temperature is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) will have mean \(0\) and variance \(n_{qk}\), hence the need to normalize by \(\sqrt{n_{qk}}\) to ensure \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\).

4. (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel for several heads of self-attention; each head will propose a displacement update to each of the \(N\) original embeddings \(\mathbf x_i\); add up all of them.

5. (Multilayer Perceptron) Pass each updated embedding (independently, position-wise) through a small feed-forward network, typically Linear \(\to\) ReLU \(\to\) Linear, and add the result back to the embedding. It is hypothesized that much of the model's factual knowledge is stored in this part of the transformer.

6. (Layers) Alternate between multi-headed self-attention blocks and MLP blocks for many layers. Finally, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x'_N\) of the last token \(\tau_N\) in the context: apply an unembedding matrix to obtain logits \(\mathbf u=W_{\mathbf u}\mathbf x'_N\) and run them through a softmax \(\text{softmax}(\mathbf u)\).
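As referenced in step 3, the following is a minimal sketch of a single self-attention head in roughly the notation above (tokens stored as rows rather than columns, no causal mask, and random matrices standing in for learned weights; the dimensions are made up for illustration):

```python
import torch

torch.manual_seed(0)
N, n_e, n_qk = 5, 16, 8                 # context length, embedding dim, query/key dim

X  = torch.randn(N, n_e)                # rows are the embeddings x_1, ..., x_N
Wq = torch.randn(n_qk, n_e)             # query weight matrix
Wk = torch.randn(n_qk, n_e)             # key weight matrix
Wv = torch.randn(n_e, n_qk) @ torch.randn(n_qk, n_e)   # low-rank value matrix W_up @ W_down

Q = X @ Wq.T                            # (N, n_qk) queries
K = X @ Wk.T                            # (N, n_qk) keys
V = X @ Wv.T                            # (N, n_e)  values

A = torch.softmax(Q @ K.T / n_qk**0.5, dim=-1)   # (N, N) attention weights, rows sum to 1
dX = A @ V                              # update vectors Delta x_i
X_out = X + dX                          # contextualized embeddings x'_i
```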

    Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).

    Solution: Essentially, an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token. So, by seeding it with some corpus of text \(T\), the LLM can sample a token \(\tau\) according to the probability distribution it generates for the next token, and append it: \(T\mapsto T+\tau\). Then, simply repeat this, except pretend that \(T+\tau\) was the seed all along. This is how generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.
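A minimal sketch of this autoregressive loop is given below; the model and tokenizer objects are hypothetical placeholders (not any particular library's API), with the model assumed to return next-token logits for the last position:

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    """Repeatedly sample the next token and append it to the running context."""
    token_ids = tokenizer.encode(prompt)              # T -> tau_1, ..., tau_N
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(token_ids))       # assumed shape: (vocab_size,)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()  # sample tau
        token_ids.append(next_id)                     # T -> T + tau, then repeat
    return tokenizer.decode(token_ids)
```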


    JAX Fundamentals (Part \(1\))

    JAX_tutorial

    Monte Carlo Methods

    Problem: Distinguish between Las Vegas methods and Monte Carlo methods.

    Solution: Both are umbrella terms referring to broad classes of methods that draw (repeatedly) from (not necessarily i.i.d.) random variables to compute the value of some deterministic variable. Here, “compute” means “find that deterministic value with certainty” in the case of a Las Vegas method whereas “compute” means “(point) estimate” in the case of a Monte Carlo method. Thus, their key difference lies in how they trade off time \(t\) vs. accuracy \(\alpha\): a Las Vegas method gambles with its runtime (random \(t\)) but never with its answer (exact \(\alpha\)), whereas a Monte Carlo method runs for a controlled amount of time but returns only an estimate whose accuracy improves with more draws.

    Problem: Given a scalar continuous random variable with probability density function \(p(x)\), how can one sample faithfully from such an \(x\sim p(x)\)?

    Solution: Fundamentally, a computer can (approximately) draw from a uniform random variable \(u\in [0,1]\) (using seed-based pseudorandom number generators for instance). In order to convert this \(u\to x\), the standard way is to use the cumulative distribution function of \(x\):

    \[\int_{-\infty}^xdx'\,p(x')=u\]

    Then, as the probability density of \(u\) is uniform, \(p(u)=1\), the probability density of the induced \(x\) will be \(\frac{du}{dx}p(u)=\frac{du}{dx}=p(x)\), i.e. exactly the desired density (since \(u(x)\) is by construction the CDF of \(x\)). As the CDF is a monotonically increasing bijection (on the support of \(p\)), there always exists an \(x=x(u)\) for each \(u\in[0,1]\). Intuitively, this whole sampling scheme should make a lot of sense: regions where \(p(x)\) is large occupy a correspondingly large fraction of the CDF's \([0,1]\) range, so a uniform \(u\) lands in them more often.
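A minimal sketch of the inverse CDF trick, assuming an exponential target density whose CDF can be inverted by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: p(x) = lam * exp(-lam * x) for x >= 0, whose CDF is F(x) = 1 - exp(-lam * x).
# Solving F(x) = u gives x = -ln(1 - u) / lam.
lam = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)   # the uniform draws a computer can produce
x = -np.log(1.0 - u) / lam                # inverse-CDF transform

print(x.mean())                           # should be close to 1/lam = 0.5
```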

    Problem: So, it seems like the above “inverse CDF trick” has solved the problem of sampling from an arbitrary random variable \(x\) (and it generalizes to the case of a higher-dimensional random vector \(\mathbf x\in\mathbf R^n\) by sampling conditionally). So how come in many practical applications this “sampling” problem can still be difficult?

    Solution: In practice, one may not have access to \(p(\mathbf x)\) itself, but only the general shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) of it (e.g. in Bayesian inference). One might object that \(Z\) is known, namely \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\) can be written explicitly as an \(n\)-dimensional integral, so wherein lies the difficulty? The difficulty lies in evaluating the integral, which suffers from the curse of dimensionality if \(n\gg 1\) is large (just imagine approximating the integral by a sum, and then realizing that the number of terms \(N\) in such a series would grow as \(N\sim O(\exp n)\)). The inverse CDF trick thus fails because it requires one to have \(p(\mathbf x)\) itself, not merely \(\tilde p(\mathbf x)\).

    Problem: Describe the Monte Carlo method known as importance sampling for estimating the expectation \(\langle f(\mathbf x)\rangle\) of a random variable \(f(\mathbf x)\).

    Solution: Importance sampling consists of a deliberate reframing of the integrand of the expectation inner product integral:

    \[\langle f(\mathbf x)\rangle=\int d^n\mathbf x p(\mathbf x)f(\mathbf x)=\int d^n\mathbf x q(\mathbf x)\frac{p(\mathbf x)f(\mathbf x)}{q(\mathbf x)}\]

    in other words, replacing \(p(\mathbf x)\mapsto q(\mathbf x)\) and \(f(\mathbf x)\mapsto p(\mathbf x)f(\mathbf x)/q(\mathbf x)\). There are \(2\) cases in which importance sampling can be useful. The first case is related to the previous problem, namely when sampling directly from \(p(\mathbf x)\) is too difficult, in which case \(q(\mathbf x)\) should be chosen to be easier to both sample from and evaluate than \(p(\mathbf x)\). The second case is, even if sampling directly from \(p(\mathbf x)\) is feasible, it can still be useful to instead sample from \(q(\mathbf x)\) as a means of reducing the variance of the Monte Carlo estimator. That is, whereas the original Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(p(\mathbf x)\) is \(\sum_{i=1}^Nf(\mathbf x_i)/N\) with variance \(\sigma^2_{f(\mathbf x)}/N\), the new Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws from \(q(\mathbf x)\) is \(\sum_{i=1}^Np(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)N\) with variance \(\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}/N\); for importance sampling to be useful, \(q(\mathbf x)\) should be chosen such that:

    \[\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}<\sigma^2_{f(\mathbf x)}\]

    One can explicitly find the function \(q(\mathbf x)\) that minimizes the variance functional of \(q(\mathbf x)\) (using a Lagrange multiplier to enforce normalization \(\int d^n\mathbf x q(\mathbf x)=1\)):

    \[q(\mathbf x)\sim p(\mathbf x)|f(\mathbf x)|\]

    In and of itself, this \(q(\mathbf x)\) cannot be used because it contains the difficult piece \(p(\mathbf x)\). Nevertheless, motivated by this result, one can heuristically importance sample using a \(q(\mathbf x)\) which ideally is large at \(\mathbf x\in\mathbf R^n\) where \(p(\mathbf x)|f(\mathbf x)|\) is large. Intuitively, this is because that’s the integrand in the expectation inner product integral, so locations where it’s large will contribute most to the integral, hence are the most “important” regions to sample from with \(q(\mathbf x)\).

    One glaring flaw with the above discussion is that if \(Z\) is not known, then \(p(\mathbf x)\) cannot even be evaluated inside the importance weights \(p(\mathbf x_i)/q(\mathbf x_i)\) (let alone sampled from). Instead, suppose as before that only the shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) is known with \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\). Thus, the expectation is:

    \[\frac{\int d^n\mathbf x\tilde{p}(\mathbf x)f(\mathbf x)}{\int d^n\mathbf x\tilde{p}(\mathbf x)}\]

    The key is to notice that the denominator is just the \(f(\mathbf x)=1\) special case of the numerator, hence the importance sampling trick can be applied to both numerator and denominator, thereby obtaining the (biased but asymptotically unbiased) self-normalized Monte Carlo estimator:

    \[\frac{\sum_{i=1}^N\tilde p(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)}{\sum_{i=1}^N\tilde p(\mathbf x_i)/q(\mathbf x_i)}\]

    for the expectation.
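A minimal sketch of this self-normalized estimator, assuming a toy target whose unnormalized shape is a standard Gaussian with its normalization constant deliberately omitted (so the true answer is known) and a wider Gaussian proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

p_tilde = lambda x: np.exp(-0.5 * x**2)          # unnormalized target shape
sigma_q = 2.0                                    # proposal q = N(0, sigma_q^2)
q = lambda x: np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))
f = lambda x: x**2                               # estimate <x^2> (true value: 1)

x = rng.normal(0.0, sigma_q, size=N)             # i.i.d. draws from q
w = p_tilde(x) / q(x)                            # unnormalized importance weights
print(np.sum(w * f(x)) / np.sum(w))              # self-normalized estimate, ~1.0
```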

    Problem: Describe what Markov chain Monte Carlo (MCMC) methods are about.

    Solution: Whereas in “naive” Monte Carlo methods the random vector draws \(\mathbf x_1,…,\mathbf x_N\) are often i.i.d., here the key innovation is that the sequence of random vectors is no longer i.i.d.: rather, \(\mathbf x_t\) depends on \(\mathbf x_{t-1}\) (and, given \(\mathbf x_{t-1}\), is conditionally independent of all the earlier \(\mathbf x_{<t-1}\)). In other words, the sequence \(\mathbf x_1,…,\mathbf x_N\) forms a (discrete-time) Markov chain, i.e. there exists a transition kernel with “matrix elements” \(p(\mathbf x'|\mathbf x)\) for any pair of states \(\mathbf x,\mathbf x'\in\mathbf R^n\) (assuming the DTMC is \(t\)-independent/homogeneous). The goal of an MCMC method is to find a suitable \(p(\mathbf x'|\mathbf x)\) such that the stationary \(N\to\infty\) distribution of the chain matches \(p(\mathbf x)\); the chain then acts as an “MCMC sampler” for \(p(\mathbf x)\).

    Problem: What does it mean to burn in an MCMC run? What does it mean to thin an MCMC run?

    Solution: Given a DTMC \(\mathbf x_1,…,\mathbf x_N\), the idea of burning-in the chain is to discard the first few samples \(\mathbf x_1,\mathbf x_2,…\) as they’re correlated with the (arbitrary) choice of starting \(\mathbf x_1\) and so are not representative of \(p(\mathbf x)\) (equivalently, one hopes to burn-in for the first \(N_b\) samples where \(N_b\) is around the mixing time of the Markov chain). Burn-in therefore gives the MCMC run a chance to find the high-probability regions of \(p(\mathbf x)\).

    Thinning an MCMC run means keeping only, say, every \(10^{\text{th}}\) point of the chain; this reduces the correlation between adjacent retained samples, since the goal is to mimic independent draws from \(p(\mathbf x)\).

    Problem: The Metropolis-Hastings method is a general template for a fairly broad family of MCMC methods. Explain how it works.

    Solution: In order for \(p(\mathbf x)\) to be the stationary distribution of a DTMC, by definition one requires it to be an eigenvector (with eigenvalue \(1\)) of the transition kernel:

    \[\int d^n\mathbf x\,p(\mathbf x'|\mathbf x)p(\mathbf x)=p(\mathbf x')\]

    A sufficient (though not necessary) condition for this is if the chain is in detailed balance:

    \[p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x|\mathbf x’)p(\mathbf x’)\]

    for all \(\mathbf x,\mathbf x’\in\mathbf R^n\). The idea of Metropolis-Hastings is to conceptually decompose the transition from state \(\mathbf x\to\mathbf x’\) into \(2\) simpler steps, namely:

    1. Propose the transition \(\mathbf x\to\mathbf x'\) with proposal probability \(p_p(\mathbf x'|\mathbf x)\) (this distribution \(p_p\) needs to be easy to sample \(\mathbf x'\) from for any \(\mathbf x\)!)
    2. Accept/reject the proposed transition \(\mathbf x\to\mathbf x'\) with an acceptance probability \(p_a(\mathbf x'|\mathbf x)\) (accepting or rejecting is just a biased coin flip, so all that's needed is that \(p_a\) be easy to evaluate!)

    Thus, \(p(\mathbf x’|\mathbf x)=p_a(\mathbf x’|\mathbf x)p_p(\mathbf x’|\mathbf x)\). The ratio of acceptance probabilities from detailed balance is therefore:

    \[\frac{p_a(\mathbf x'|\mathbf x)}{p_a(\mathbf x|\mathbf x')}=\frac{p_p(\mathbf x|\mathbf x')p(\mathbf x')}{p_p(\mathbf x'|\mathbf x)p(\mathbf x)}\]

    The ratio on the RHS is known (furthermore, because it's a ratio, neither \(p_p\) nor \(p\) needs to be normalized! In other words, one can replace \(p_p\mapsto \tilde p_p\) and \(p\mapsto\tilde p\)). The task remains to specify the form of \(p_a(\mathbf x'|\mathbf x)\). The detailed balance condition doesn't give a unique solution for \(p_a\), but heuristically one can get a unique solution by further insisting that the walker accept moves as often as possible. That is, w.l.o.g. suppose the RHS is computed to be \(>1\), so that \(p_a(\mathbf x'|\mathbf x)>p_a(\mathbf x|\mathbf x')\). This says that the transition \(\mathbf x\to\mathbf x'\) is favored over the reverse transition \(\mathbf x'\to\mathbf x\), so to keep the acceptance rate as high as possible one would like to set \(p_a(\mathbf x'|\mathbf x):=1\) (the largest value a probability can take) and hence, from detailed balance, \(p_a(\mathbf x|\mathbf x')=\frac{p_p(\mathbf x'|\mathbf x)p(\mathbf x)}{p_p(\mathbf x|\mathbf x')p(\mathbf x')}\). But this fixes the form of the function \(p_a\)! Specifically, to evaluate \(p_a(\mathbf x'|\mathbf x)\): if the RHS ratio is \(>1\), set \(p_a(\mathbf x'|\mathbf x)=1\), whereas if the RHS ratio is \(<1\), set \(p_a(\mathbf x'|\mathbf x)\) equal to the RHS ratio. An elegant way to encode all of this into a single compact expression is just:

    \[p_a(\mathbf x’|\mathbf x)=\min\left(1,\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\right)\]

    Practically, to actually implement this Metropolis-Hastings acceptance probability \(p_a\), one can essentially flip a “biased” coin with probability \(p_a(\mathbf x'|\mathbf x)\) of landing “heads” (meaning to accept the transition \(\mathbf x\to\mathbf x'\)), e.g. by drawing a uniformly random number \(u\in[0,1]\) and declaring acceptance iff \(u\leq p_a(\mathbf x'|\mathbf x)\).

    (aside: if the proposal probability \(p_p(\mathbf x’|\mathbf x)\sim e^{-|\mathbf x’-\mathbf x|^2/2\sigma^2}\) is taken to be normally and isotropically distributed about \(\mathbf x\), then it is symmetric \(p_p(\mathbf x’|\mathbf x)=p_p(\mathbf x|\mathbf x’)\) and the MH acceptance probability simplifies \(\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(\mathbf x’)}{p(\mathbf x)}\) (in which case this is sometimes just called the Metropolis method). Another tip is that the burn-in period of the MCMC walk can often be used to tune the variance \(\sigma^2\) of the above Gaussian \(p_p\) distribution, namely if the burn-in period lasts for \(N_b\) samples, then look at the number of times \(N_a\) the MCMC walker accepted a proposed state from \(p_p\); a rule-of-thumb is that in high-dimensions \(n\gg 1\), one should tune \(\sigma^2\) such that the fraction of accepted proposals \(N_a/N_b\approx 23.4\%\), see “Weak convergence and optimal scaling of random walk Metropolis algorithms” by Roberts, Gelman, and Gilks (\(1997\)). Finally, one more MCMC trick worth mentioning is that one can always run e.g. \(\sim 100\) MCMC sequences in parallel and sample randomly from them as an alternative to thinning).
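A minimal sketch of a random-walk Metropolis sampler along these lines (symmetric Gaussian proposal, a made-up bimodal 1D target shape, burn-in discarded):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target shape: a mixture of two unit-width bumps at +/- 2
p_tilde = lambda x: np.exp(-0.5 * (x - 2)**2) + np.exp(-0.5 * (x + 2)**2)

sigma = 1.0                                # std dev of the Gaussian proposal p_p
n_steps, n_burn = 50_000, 5_000
x = 0.0                                    # arbitrary starting state
samples, n_accept = [], 0

for t in range(n_steps):
    x_prop = x + sigma * rng.normal()                      # propose x -> x'
    p_accept = min(1.0, p_tilde(x_prop) / p_tilde(x))      # symmetric p_p cancels
    if rng.uniform() <= p_accept:                          # flip the biased coin
        x, n_accept = x_prop, n_accept + 1
    if t >= n_burn:                                        # discard burn-in
        samples.append(x)

print(n_accept / n_steps, np.mean(samples))                # acceptance rate; mean ~ 0
```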

    Problem: Explain how Gibbs sampling works.

    Solution: Whereas a Metropolis-Hastings sampler moves around in state space like a queen in chess, a Gibbs sampler moves around like a rook. That is, it only ever moves along the \(n\) coordinate directions in \(\mathbf x=(x_1,…,x_n)\). Indeed, the Gibbs sampler can be viewed as a special case of a Metropolis-Hastings sampler in which the proposal distribution \(p_p\) is given by:

    \[p_p(\mathbf x’|\mathbf x):=p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)\]

    where \(i=1,…,n\) is cycled through the \(n\) coordinate directions. For each \(1\leq i\leq n\), the MH acceptance probability is:

    \[\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(x_i|x’_1,…,x’_{i-1},x’_{i+1},…,x’_n)\delta(x_1-x’_1)…\delta(x_{i-1}-x’_{i-1})\delta(x_{i+1}-x’_{i+1})…\delta(x_n-x’_n)p(\mathbf x’)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)p(\mathbf x)}\]

    \[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x’_i,…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_n)}\]

    \[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}\]

    \[=1\]

    so \(\text{min}(1,1)=1\); in other words, the Gibbs sampler always accepts because the proposal is coming from the conditional probability slices themselves (assumed to be known even if the full joint distribution \(p(\mathbf x)\) is unknown), so they must be authentic samples in a sense. For a Gibbs sampler, only after cycling through all \(n\) coordinate directions \(i=1,…,n\) does it count as \(1\) MCMC iteration.

    (aside: a common variant of the basic Gibbs sampling formalism described above is block Gibbs sampling in which one may choose to update \(2\) or more features simultaneously, conditioning on the distribution of those features given all other features fixed; this would enable diagonal movements within that subspace which, depending how correlated those features are, could significantly increase the effective sample size (ESS)).
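A minimal sketch of a Gibbs sampler, assuming a bivariate Gaussian target with correlation whose conditional slices are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                                  # correlation of the target bivariate normal
n_iter, n_burn = 20_000, 2_000
x1, x2 = 0.0, 0.0
samples = []

for t in range(n_iter):
    # Conditional slices of the standard bivariate normal:
    # p(x1 | x2) = N(rho * x2, 1 - rho^2), and symmetrically for p(x2 | x1)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    if t >= n_burn:
        samples.append((x1, x2))           # one sweep over both coordinates = 1 iteration

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])        # should be close to rho
```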

    Problem: Explain how the Hamiltonian Monte Carlo method works.

    Solution:


    Differential Geometry

    Problem: What does it mean for a topological space \(X\) to be locally homeomorphic to a topological space \(Y\)? Hence, what does it mean for a topological space \(X\) to be locally Euclidean?

    Solution: \(X\) is said to be locally homeomorphic to \(Y\) iff every point \(x\in X\) has a neighbourhood homeomorphic to an open subset of \(Y\). In particular, \(X\) is said to be locally Euclidean iff it is locally homeomorphic to Euclidean space \(Y=\mathbf R^n\) for some \(n\in\mathbf N\) called the dimension of \(X\).

    Explicitly, this means every point \(x\in X\) is associated to at least one neighbourhood \(U_x\) along with at least one homeomorphism \(\phi_x:U_x\to\phi_x(U_x)\subset\mathbf R^n\); the ordered pair \((U_x,\phi_x)\) is an example of a chart on \(X\), and any collection of charts covering \(X\) (such as \(\{(U_x,\phi_x):x\in X\}\)) is called an atlas for \(X\). Hence, \(X\) is locally Euclidean iff there exists an atlas for \(X\).

    (caution: there is a distinct concept of a local homeomorphism from \(X\to Y\); existence of such implies that \(X\) is locally homeomorphic to \(Y\), but the converse is false).

    Problem: Give an example of topological spaces \(X\) and \(Y\) such that \(X\) is locally homeomorphic to \(Y\) but not vice versa.

    Solution: \(X=(0,1)\) and \(Y=[0,1)\): every point of \((0,1)\) has a neighbourhood homeomorphic to an open subset of \([0,1)\), but no neighbourhood of the endpoint \(0\in[0,1)\) is homeomorphic to an open subset of \((0,1)\). Thus, be wary that being locally homeomorphic is not a symmetric relation between topological spaces.

    Problem: Let \(X\) be an \(n\)-dimensional locally Euclidean topological space. What additional topological constraints are typically placed on \(X\) in order for it to qualify as a topological \(n\)-manifold?

    Solution:

    1. \(X\) is Hausdorff.
    2. \(X\) is second countable.

    That being said, conceptually the most important criterion to remember is the local Euclideanity of \(X\). Topological manifolds are the most basic/general/fundamental of manifolds, onto which additional structures may be attached.

    Problem: Let \(X\) be a topological \(n\)-manifold. What additional piece of structure should be attached to \(X\) so as to consider it a \(C^k\)-differentiable \(n\)-manifold?

    Solution: The additional piece of structure is a \(C^k\)-atlas. A \(C^k\)-atlas for \(X\) is just an atlas for \(X\) with the property that all overlapping charts are compatible with each other. This means that if \((U_1,\phi_1),(U_2,\phi_2)\) are \(2\) charts of a \(C^k\)-atlas for \(X\) such that \(U_1\cap U_2\neq\emptyset\), then the transition map \(\phi_2\circ\phi_1^{-1}:\phi_1(U_1\cap U_2)\subset\mathbf R^n\to\phi_2(U_1\cap U_2)\subset\mathbf R^n\) is \(C^k\)-differentiable (equivalent to \(\phi_1\circ\phi_2^{-1}\) being \(C^k\)-differentiable).

    An important point to mention is that a given topological \(n\)-manifold may have multiple different \(C^k\)-atlases. According to the above definition, that would seem to give rise to \(2\) distinct \(C^k\)-manifolds…but intuitively that would be undesirable; they should be viewed as really the same \(C^k\) manifold. This motivates imposing an equivalence relation on \(C^k\)-manifolds by asserting that two \(C^k\)-manifolds are equivalent iff their corresponding \(C^k\)-atlases are compatible…which just means that the union of their \(C^k\)-atlases is itself a \(C^k\)-atlas, or equivalently every chart in one \(C^k\)-atlas is compatible with every chart in the other \(C^k\)-atlas. This is sometimes formalized by introducing a so-called “maximal \(C^k\)-atlas”, but the basic idea should be clear.

    Finally, note that topological manifolds are just \(C^0\)-manifolds because homeomorphisms are continuous. In physics, the most common case is that of \(C^{\infty}\)-manifolds, also called smooth manifolds, such as spacetime in GR, classical configuration/phase spaces in CM, or state space in thermodynamics.

    Problem: Let \(S^1\) be the circle, a \(1\)-dimensional smooth manifold which is regarded as being embedded in \(\mathbf R^2\) via the set of points \(x^2+y^2=1\). Explain why \(S^1\) is locally (but not globally) homeomorphic to the real line \(\mathbf R\). Furthermore, explain whether or not the pair \((S^1,\theta)\) (where \(\theta:(\cos\theta,\sin\theta)\in S^1\mapsto\theta\in[0,2\pi)\) is the obvious angular coordinate mapping) is a chart on \(S^1\) or not.

    Solution: \(S^1\) is locally homeomorphic to \(\mathbf R\) because, roughly speaking, every little strip of open arc in \(S^1\), upon zooming in, looks like a strip of straight line in \(\mathbf R\). However, \(S^1\) is not globally homeomorphic to \(\mathbf R\) because \(S^1\) is compact whereas \(\mathbf R\) is unbounded. The pair \((S^1,\theta)\) is not a chart on \(S^1\) because its image \([0,2\pi)\) isn’t open in \(\mathbf R\).

    Problem: Hence, demonstrate an example of an atlas for \(S^1\).

    Solution: To circumvent the problem above, one requires \(2\) charts to cover \(S^1\), thereby forming an atlas for \(S^1\). For example, a chart \(\theta_1:S^1-\{(1,0)\}\to (0,2\pi)\) similar to the above but with the point \((1,0)\) removed, and a second chart \(\theta_2:S^1-\{(-1,0)\}\to (-\pi,\pi)\) with the antipodal point \((-1,0)\) removed. The domains of these \(2\) charts overlap on the upper and lower open semicircles, so the transition function on this overlap is:

    \[\theta_2(\theta_1^{-1}(\theta))=\theta[\theta\in(0,\pi)]+(\theta-2\pi)[\theta\in(\pi,2\pi)]\]

    which is indeed \(C^{\infty}\).

    Problem: Define the real associative algebra \(C^k(X\to\mathbf R)\).

    Solution: This is simply the space of all \(C^k\)-differentiable scalar fields on the \(C^k\)-manifold \(X\). To be precise, \(f\in C^k(X\to\mathbf R)\) iff for all charts \((U,\phi)\) in the \(C^k\)-atlas defining the differential structure of \(X\), the map \(f\circ\phi^{-1}:\phi(U)\to\mathbf R\) is \(C^k\)-differentiable.

    Problem: What is a tangent vector to a \(C^1\)-manifold \(X\) at a point \(x\in X\)?

    Solution: In the algebraic formulation, a tangent vector is simply any \(\mathbf R\)-derivation on \(C^1(X\to\mathbf R)\) at a specific point \(x\in X\) on the manifold. Concretely, this means that a tangent vector at \(x\in X\) is any real-valued function \(\partial_{\mu}|_x:C^1(X\to\mathbf R)\to\mathbf R\) which is linear and satisfies the product rule:

    \[\partial_{\mu}|_x(\alpha f+\beta g)=\alpha\partial_{\mu}|_x(f)+\beta\partial_{\mu}|_x(g)\]

    \[\partial_{\mu}|_x(fg)=f(x)\partial_{\mu}|_x(g)+\partial_{\mu}|_x(f)g(x)\]

    A more intuitive but equivalent formulation of what “tangent vector” means is to consider trajectories \(\gamma(t)\in X\) through the manifold \(X\) passing through the point \(x\in X\) at time \(t\). Then, choosing an arbitrary chart \((U,\phi)\) in a neighbourhood of \(x\in U\), one considers the coordinatized version of \(\gamma(t)\in U\subset X\), namely \(\phi(\gamma(t))\in\mathbf R^n\) which will be differentiable with some derivative \(\frac{d}{dt}\phi(\gamma(t))\in\mathbf R^n\). All curves \(\gamma(t)\) with the same derivative at \(x\) are considered to form an equivalence class, and the value of that derivative is then effectively the tangent vector. Unlike the algebraic formulation, here one has to check that this definition is independent of the choice of neighbourhood chart \((U,\phi)\) of \(x\in U\subset X\).

    Problem: What is the tangent space \(T_x(X)\) to a \(C^1\)-manifold \(X\) at a point \(x\in X\)?

    Solution: \(T_x(X)\) is simply the set of all tangent vectors to \(X\) at \(x\in X\). As the name of its elements (i.e. “tangent vectors“) suggests, \(T_x(X)\) is a real, \(n\)-dimensional vector space. By choosing a neighbourhood chart \((U,\phi)\) containing \(x\in U\), this provides \(n\) local coordinates \(\phi(x)=(x^0(x),…,x^{n-1}(x))\) on \(U\). An explicit basis for \(T_x(X)\) then consists of the \(n\) partial derivative tangent vector operators \(\partial_{\mu}|_x\in T_x(X)\) for \(\mu=0,…,n-1\) defined by:

    \[\partial_{\mu}|_x(f):=\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}\]

    (which is well-defined because \(f\in C^1(X\to\mathbf R)\)). Any tangent vector \(v_x\in T_x(X)\) at \(x\in X\) can then be written as a linear combination of these basis tangent vectors:

    \[v_x=v^{\mu}\partial_{\mu}|_x\]

    with real coefficients \(v^{\mu}\in\mathbf R\) called the contravariant components of \(v_x\in T_x(X)\) in the coordinate basis \(\partial_{\mu}|_x\).

    Problem: A given tangent vector \(v_x\in T_x(X)\) may be written with contravariant components in \(2\) distinct coordinate bases:

    \[v_x=v^{\mu}\partial_{\mu}|_x=v’^{\nu}\partial’_{\nu}|_x\]

    simply from picking \(2\) different charts \((U,\phi),(U’,\phi’)\) in the \(C^1\)-atlas of \(X\) containing \(x\in U\cap U’\). Describe how the contravariant components \(v’^{\nu}\) may be obtained from the contravariant components \(v^{\mu}\).

    Solution: Act on an arbitrary \(f\in C^1(X\to\mathbf R)\) to obtain:

    \[v_x(f)=v^{\mu}\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=v’^{\nu}\frac{\partial (f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

    Now insert a “resolution of the identity” \(f\circ\phi’^{-1}\circ\phi’\circ\phi^{-1}\) and because \(X\) is a \(C^1\)-manifold, the transition map \(\phi’\circ\phi^{-1}\) will be differentiable and in particular, by the chain rule:

    \[\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}\frac{\partial(f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

    where \(\phi'(x)=(x’^0(x),…,x’^{n-1}(x))\). This simple chain rule identity can by itself already be viewed as a passive change of \(T_x(X)\)-basis:

    \[\frac{\partial}{\partial x’^{\nu}}\biggr|_{x}=\frac{\partial x^{\mu}}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\frac{\partial}{\partial x^{\mu}}\biggr|_x\]

    Or equivalently, as an active change of the contravariant components via the Jacobian matrix:

    \[v’^{\nu}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}v^{\mu}\]

    Problem: What is a vector field \(v\) on a \(C^1\)-manifold \(X\)?

    Solution: The more intuitive way to understand it is as a map \(v:X\to TX\), where \(TX:=\bigsqcup_{x\in X}T_x(X)\) is the so-called tangent bundle of \(X\). This basically corresponds to assigning a tangent vector \(v_x\in T_x(X)\) rooted at each point \(x\in X\) across the manifold \(X\). Alternatively, a more algebraic definition of a vector field is as a map from scalar fields to scalar fields \(v:C^1(X\to\mathbf R)\to C^1(X\to\mathbf R)\), defined so that for a scalar field \(f:X\to\mathbf R\) on the manifold \(X\), the new scalar field \(v(f):X\to\mathbf R\) can be thought of as a global directional derivative map \(v(f)(x):=v_x(f)\) giving the directional derivative \(v_x(f)\in\mathbf R\) of the scalar field \(f\) at each point \(x\in X\) according to some assignment \(v_x\in T_x(X)\) of tangent vectors (in the spirit of the first definition).

    Problem: Given a \(C^1\)-manifold \(X\), and a neighbourhood of a point \(x\in X\) with local coordinates \((x^0,…,x^{n-1})\), explain how a vector field \(v\) may be expressed in this coordinate basis over said neighbourhood.

    Solution: \(v=v^{\mu}\partial_{\mu}\) is essentially a directional derivative operator. Just to be clear, there is a standard abuse of notation going on here where the underlying chart \(\phi\) defining the local coordinates \((x^0,…,x^{n-1})\) is swept under the rug, so for instance \(v(f)(x)=v^{\mu}(x)\frac{\partial f}{\partial x^{\mu}}(x)\) where the shorthand \(\frac{\partial f}{\partial x^{\mu}}(x)=\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}(\phi(x))\) is being used.

    Problem: Given \(2\) vector fields \(v,w\) on the same \(C^2\)-manifold \(X\), explain why the composition \(v\circ w\) (and hence also \(w\circ v\)) is not a vector field on \(X\).

    Solution: Although it’s linear, it fails to satisfy the product rule: expanding, \((v\circ w)(fg)=fv(w(g))+gv(w(f))+v(f)w(g)+v(g)w(f)\), and the two cross terms spoil the Leibniz rule (reflecting the fact that \(v\circ w\) is a second-order differential operator).

    Problem: Explain how the above situation can be remedied.

    Solution: Consider instead the commutator \([v,w]:C^2(X\to\mathbf R)\to C^2(X\to\mathbf R)\). This is still linear, but importantly it also obeys the product rule:

    \[[v,w](fg)=f[v,w](g)+g[v,w](f)\]

    Indeed, as with any commutator this makes the space of all vector fields not only a real vector space but also a Lie algebra. In a common coordinate basis \((x^0,…,x^{n-1})\) where \(v=v^{\mu}\partial_{\mu}\) and \(w=w^{\nu}\partial_{\nu}\), the commutator has the representation:

    \[[v,w]=\left(v^{\mu}\partial_{\mu}w^{\nu}-w^{\mu}\partial_{\mu}v^{\nu}\right)\partial_{\nu}\]


    Support Vector Machines

    Problem: Explain how a hard-margin support vector machine would perform binary classification.

    Solution: Conceptually, it’s simple. Given a training set of \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) each associated with some binary target label \(y_1,…,y_N\in\{-1,1\}\) (notice the \(2\) binary classes are \(y=-1\) and \(y=1\) rather than the more conventional \(y=0\) and \(y=1\); this is simply a matter of convenience), the goal of an SVM is to compute the unique hyperplane in \(\mathbf R^n\) that separates the \(y=-1\) and \(y=1\) classes as well as possible. For \(n=2\) this can be visualized as trying to find the “widest possible street” between \(2\) neighborhoods.

    Mathematically, if the hyperplane decision boundary \(\mathbf w\cdot\mathbf x+b=0\) is defined by a normal vector \(\mathbf w\in\mathbf R^n\) such that the SVM classifies points with \(\mathbf w\cdot\mathbf x+b\geq 0\) as \(y=1\) and \(\mathbf w\cdot\mathbf x+b\leq 0\) as \(y=-1\) (more compactly, \(\hat y_{\text{SVM}}(\mathbf x|\mathbf w,b)=\text{sgn}(\mathbf w\cdot\mathbf x+b)\)), and the overall normalization of \(\mathbf w\) and \(b\) is fixed by stipulating that the “gutters” of the street are defined by the hyperplanes \(\mathbf w\cdot\mathbf x+b=\pm 1\) passing through the support (feature) vectors (drawn as bigger red/green dots in the diagram), then the “width of the street” is the distance between these \(2\) “gutter hyperplanes”, namely \(2/|\mathbf w|\). Since one would like to maximize this margin \(2/|\mathbf w|\), this is equivalent to minimizing the quadratic \(|\mathbf w|^2/2\). Thus, one can formulate this “hard-margin SVM” algorithm as the programming problem:

    \[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R:\forall i=1,…,N,y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1}\frac{|\mathbf w|^2}{2}\]

    where the \(N\) constraints \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1\) (or vectorially \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1\)) are just saying that all \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\) in the training set should lie on the correct side of the decision boundary hyperplane \(\mathbf w\cdot\mathbf x+b=0\).

    Problem: What is the glaring problem with a hard-margin SVM? How does a soft-margin SVM help to address this issue?

    Solution: The basic problem is that the hard-margin SVM is way too sensitive to outliers, e.g. a single \(y=-1\) feature vector amidst the \(y=1\) neighborhood would cause the hard-margin SVM to be unable to find a solution (because indeed there would be no hyperplane that perfectly separates all the training data into their correct classes). Intuitively, it seems like in such a case the best way to deal with this bias-variance tradeoff is to shrug and ignore the outliers.

    Mathematically, the usual way to encode this is by introducing \(N\) “slack variables” \(\xi_1,…,\xi_N\geq 0\) such that the earlier \(N\) constraints are relaxed to \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1-\xi_i\). However, to avoid cutting too much slack (i.e. incurring too much misclassification on the training set), their use should be penalized by the \(\ell^1\)-norm \(|\boldsymbol{\xi}|_1:=\boldsymbol{\xi}\cdot\mathbf{1}=\sum_{i=1}^N\xi_i\). To this effect, the programming problem is modified to that of soft-margin SVM:

    \[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R,\boldsymbol{\xi}\in\mathbf R^N:\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi},\boldsymbol{\xi}\geq\mathbf 0}\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1\]

    where \(C\geq 0\) is a “strictness” hyperparameter that must be selected ahead of time to balance hard vs. soft SVM.

    Problem: The above represents the primal formulation of the programming problem in terms of so-called primal variables \(\mathbf w,b,\boldsymbol{\xi}\). Show how the KKT conditions enable one to obtain a corresponding dual formulation of the programming problem in terms of a dual variable \(\boldsymbol{\lambda}\):

    \[\text{argmax}_{\boldsymbol{\lambda}\in\mathbf R^N:\boldsymbol{\lambda}\cdot\mathbf y=0,\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1}|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

    Solution: The idea is to combine the \(2\) constraints \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi}\) and \(\boldsymbol{\xi}\geq\mathbf 0\) using \(2\) KKT multipliers \(\boldsymbol{\lambda},\boldsymbol{\mu}\) into the Lagrangian:

    \[L(\mathbf w,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})=\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1-\boldsymbol{\lambda}\cdot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})-\boldsymbol{\mu}\cdot\boldsymbol{\xi}\]

    The KKT necessary conditions assert:

    \[\frac{\partial L}{\partial\mathbf w}=\mathbf 0\Rightarrow\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\]

    \[\frac{\partial L}{\partial b}=0\Rightarrow\boldsymbol{\lambda}\cdot\mathbf y=0\]

    \[\frac{\partial L}{\partial\boldsymbol{\xi}}=\mathbf 0\Rightarrow C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\]

    in addition to the \(2\) original constraints on the primal variables (primal feasibility), and dual feasibility \(\boldsymbol{\lambda},\boldsymbol{\mu}\geq\mathbf 0\) on both dual variables (as a corollary, \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)), and complementary slackness \(\boldsymbol{\lambda}\odot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})=\boldsymbol{\mu}\odot\boldsymbol{\xi}=\mathbf 0\). The idea of the support vectors is clearly seen in the \(1^{\text{st}}\) of these complementary slackness conditions.

    By substituting the stationary conditions found back into the Lagrangian, one can eliminate all the primal variables (and even eliminate one of the dual variables \(\boldsymbol{\mu}\)) to obtain the claim:

    \[L(\boldsymbol{\lambda})=|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

    which is a standard quadratic programming exercise with efficient solutions for \(\boldsymbol{\lambda}\).

    (aside: strong duality holds here, since the primal problem is a convex quadratic program satisfying Slater's condition; this is what justifies interchanging the min over the primal variables with the max over the dual variables when passing from the primal to the dual formulation.)

    Problem: Although soft-margin SVM is a significant improvement over hard-margin SVM, it too has a glaring issue. What is that issue and how can it be addressed?

    Solution: Whether hard-margin or soft-margin, SVMs are ultimately linear binary classifiers; they can only directly learn a hyperplane decision boundary. Therefore, if the data is simply linearly inseparable, then even soft-margin SVM would be unwise to apply directly.

    However, there is a creative solution; just take the \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) and apply a nonlinear transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) from the current feature space \(\mathbf R^n\) to some higher-dimensional feature space \(\mathbf R^{>n}\) (essentially a kind of feature engineering), thereby obtaining \(N\) transformed feature vectors \(\mathbf x'_1,…,\mathbf x'_N\in\mathbf R^{>n}\). Ideally, if the nonlinear transformation \(\prime\) is well-chosen, then the transformed feature vectors \(\mathbf x'_1,…,\mathbf x'_N\) will become linearly separable in the higher-dimensional feature space \(\mathbf R^{>n}\), in which case the usual soft-margin SVM may be applied in \(\mathbf R^{>n}\). Then, pulling the resulting (soft) maximal-margin hyperplane in \(\mathbf R^{>n}\) back to the original feature space \(\mathbf R^n\) (i.e. taking its preimage under \(\prime\)), one obtains a nonlinear decision boundary in \(\mathbf R^n\), thereby effectively extending the applicability of the SVM method to linearly inseparable data! See this YouTube video for an example.

    Problem: Continuing off the previous problem, suppose one is already in the transformed space with feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\), and one has solved the dual programming problem in this space to obtain some optimal \(\boldsymbol{\lambda}\in\mathbf R^N\). How is binary classification thus performed?

    Solution: Based on the previous analysis, the SVM binary classifier in this transformed feature space \(\mathbf R^{>n}\) may be written:

    \[\hat y_{\text{SVM}}(\mathbf x’)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_i\mathbf x’\cdot\mathbf x’_i+y_S-\sum_{i=1}^N\lambda_iy_i\mathbf x’_S\cdot\mathbf x’_i\right)\]

    where \(\mathbf x'_S\) is any support feature vector (one with \(0<\lambda_S<C\), so that it sits exactly on a gutter). Note by complementary slackness that most \(\lambda_i\) vanish (those corresponding to non-support feature vectors \(\mathbf x_i\)), so the sums \(\sum_{i=1}^N\) may also be written as sums over only the support vectors.

    Problem: Explain how the expression above can be simplified by means of the kernel trick.

    Solution: Basically, the feature space transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) is a complicated mapping of vectors to vectors. By contrast, only their scalar dot products \(\mathbf x’\cdot\mathbf x’_i\) appear in the SVM classifier \(\hat y_{\text{SVM}}(\mathbf x)\), and scalars are simpler than vectors. The kernel trick essentially invites one to forget about the details of the transformation \(\prime\) and just directly specify an analytical expression for the dot product in the transformed feature space \(K(\mathbf x,\tilde{\mathbf x}):=\mathbf x’\cdot\tilde{\mathbf x}’\), where here \(K:\mathbf R^n\times\mathbf R^n\to\mathbf R\) is called the kernel function. Effectively, this means one never has to actually visit the transformed feature space \(\mathbf R^{>n}\) since the kernel is perfectly happy to just take the feature vectors in the original space \(\mathbf R^n\). Thus, one can write:

    \[\hat y_{\text{SVM}}(\mathbf x)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_iK(\mathbf x,\mathbf x_i)+y_S-\sum_{i=1}^N\lambda_iy_iK(\mathbf x_S,\mathbf x_i)\right)\]

    Problem: What is the Gaussian (radial basis function) kernel \(K_{\sigma}(\mathbf x,\tilde{\mathbf x})\)? Sketch why it’s a valid kernel function.

    Solution: This is by far the most popular kernel for nonlinear SVM classification, involving a hyperparameter \(\sigma\):

    \[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x-\tilde{\mathbf x}|^2/2\sigma^2}\]

    algebraically, it is clearly equivalent to:

    \[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}e^{\mathbf x\cdot\tilde{\mathbf x}/\sigma^2}\]

    Taylor expanding the last factor:

    \[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}\sum_{d=0}^{\infty}\frac{(\mathbf x\cdot\tilde{\mathbf x})^d}{\sigma^{2d}d!}\]

    so it’s clearly a (countably-infinite-dimensional!) inner product because of the presence of the degree-\(d\) polynomial kernels \(K_d(\mathbf x,\tilde{\mathbf x})=(\mathbf x\cdot\tilde{\mathbf x})^d\).
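A quick numerical sanity check of this kernel against scikit-learn (whose rbf_kernel is parametrized by \(\gamma=1/2\sigma^2\) rather than by \(\sigma\)):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.5

# scikit-learn computes exp(-gamma * |x - x~|^2), so set gamma = 1 / (2 * sigma^2)
K_sklearn = rbf_kernel(X, X, gamma=1.0 / (2 * sigma**2))
K_manual = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1) / (2 * sigma**2))
print(np.allclose(K_sklearn, K_manual))    # True
```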

    Problem: What were the main problems with SVMs that spurred the deep learning renaissance via the development of artificial neural networks?

    Solution: The main issue with SVMs is their lack of scalability. Roughly speaking, the computational cost of training an SVM on a dataset of \(N\) training examples scales like \(O(N^2)\) in the best case (essentially because the Gram matrix \(X^TX\) is an \(N\times N\) matrix of all pairwise inner products between training feature vectors), and even with modern tricks like sequential minimal optimization this is hard to avoid. Another smaller issue is that

    Problem: Write a short Python program using the scikit-learn library to train a support vector machine on the Kaggle Titanic competition dataset.

    Solution:

    titanic
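Since the attached notebook is not reproduced here, the following is a minimal sketch (not the notebook itself), assuming the standard Kaggle train.csv with its usual columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("train.csv")              # Kaggle Titanic training data
y = df["Survived"]
X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]

numeric = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
categorical = ["Sex"]
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Soft-margin SVM with a Gaussian (RBF) kernel; C and gamma are the hyperparameters above
model = make_pipeline(preprocess, SVC(kernel="rbf", C=1.0, gamma="scale"))
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_val, y_val))           # validation accuracy
```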

    Numerical Computation

    Problem: In numerical computation, what are the \(2\) main kinds of rounding error?

    Solution: Overflow error (a large number \(N\) gets rounded to \(\infty\)) and, perhaps even more dangerous, underflow error (a small number \(\varepsilon\) gets rounded to \(0\)); the two are in some sense inverses of each other:

    \[0=\frac{1}{\infty}\]

    Problem: Explain how rounding errors can affect evaluation of the softmax activation function and strategies to address it.

    Solution: Given a vector \(\mathbf x\in\mathbf R^n\), the corresponding softmax probability \(\ell^1\)-unit vector is \(e^{\mathbf x}/|e^{\mathbf x}|_1\in S^{n-1}\). If one of the components of \(\mathbf x\) is very negative, then the corresponding component of \(e^{\mathbf x}\) can underflow, whereas if it is instead very positive, then overflow becomes a possibility.

    To address these numerical stability issues, the trick is to notice that softmax is invariant under “diagonal” translations along \(\mathbf 1\in\mathbf R^n\), i.e. \(\mathbf x\mapsto\mathbf x-\lambda\mathbf 1\) leads to \(e^{\mathbf x-\lambda\mathbf 1}=e^{\mathbf x}\odot e^{-\lambda\mathbf 1}=e^{-\lambda}e^{\mathbf x}\) and hence the \(\ell^1\)-norm \(|e^{\mathbf x-\lambda\mathbf 1}|_1=e^{-\lambda}|e^{\mathbf x}|_1\) is also merely scaled by the same factor \(e^{-\lambda}\). Choosing \(\lambda:=\text{max}(\mathbf x)\) to be the maximum component of \(\mathbf x\) (which coincides with the \(\ell^{\infty}\) norm if \(\mathbf x\geq 0\) is positive semi-definite) eliminates any overflow error that may have been present in evaluating the numerator \(e^{\mathbf x}\) by instead evaluating \(e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}\). For the same reason, the shifted denominator \(|e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}|_1\) is also immune to overflow error. Furthermore, the denominator is also resistant to underflow error because at least one term in the series for \(|e^{\mathbf x-\lambda\mathbf 1}|_1=\sum_{i}e^{x_i-\lambda}\) will be \(e^0=1\gg 0\), namely term \(i=\text{argmax}(\mathbf x)\). The only possible problem is that the modified numerator could still experience an underflow error; this would be bad if subsequently one were to evaluate the information content (loss function) of a softmax outcome, in which \(\log(0)=-\infty\) would then typically propagate into a nan downstream. For this case, one can apply a similar trick (the log-sum-exp trick) to numerically stabilize the computation of this logarithm.
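A minimal sketch of the shifted softmax, together with the companion log-sum-exp trick for its logarithm:

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the max-subtraction trick: immune to overflow in exp."""
    z = x - np.max(x)                      # shift by lambda = max(x); softmax is invariant
    e = np.exp(z)                          # largest exponent is now exp(0) = 1
    return e / np.sum(e)

def stable_log_softmax(x):
    """log-softmax via log-sum-exp: avoids ever taking log(0)."""
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))   # the sum contains at least one term equal to 1

x = np.array([1000.0, 1001.0, -1000.0])
print(stable_softmax(x))                   # finite, sums to 1 (naive exp(1000) overflows)
print(stable_log_softmax(x))               # finite log-probabilities, no nan
```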

    Problem: Describe the numerical analysis phenomenon of catastrophic cancellation.

    Solution: Subtraction \(f(x,y):=x-y\) is ill-conditioned when \(x\approx y\). This is because even if \(x\) and \(y\) have small relative errors, the relative error in their difference \(f(x,y)\) can still be substantial.

    Problem: What does the condition number of a matrix \(X\) measure?

    Solution: The condition number measures how numerically stable it is to invert \(X\mapsto X^{-1}\) (equivalently, how much solving \(X\mathbf a=\mathbf b\) can amplify small relative errors in the inputs). For a normal (e.g. symmetric) matrix \(X\) it is given by the ratio of the largest to smallest eigenvalues of \(X\) by absolute magnitude (for a general matrix, singular values take the place of eigenvalue magnitudes):

    \[\frac{\text{max}|\text{spec}(X)|}{\text{min}|\text{spec}(X)|}\]

    For instance, a non-invertible matrix would have \(\text{min}|\text{spec}(X)|=0\) and thus an infinite condition number.
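For example, using numpy's cond (which uses the ratio of singular values, agreeing with the eigenvalue ratio for the symmetric matrices below):

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 3.0]])                 # symmetric, eigenvalues 5 and 1
print(np.linalg.cond(A))                   # 5.0 = max|eig| / min|eig|

B = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10]])         # nearly singular
print(np.linalg.cond(B))                   # enormous: inverting B amplifies input errors
```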

    Problem: Show that during gradient descent of a scalar field \(f(\mathbf x)\), if at some point \(\mathbf x\) in the optimization landscape, one has \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) (e.g. such as if the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) is just generally positive-definite or the gradient \(\frac{\partial f}{\partial\mathbf x}\) is an eigenvector of the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) with positive eigenvalue), then by a line search logic, the most effective local learning rate \(\alpha(\mathbf x)>0\) is given by:

    \[\alpha(\mathbf x)=\frac{\left|\frac{\partial f}{\partial\mathbf x}\right|^2}{\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}}\]

    Solution: Since gradient descent takes a step \(\mathbf x\mapsto\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\), and the goal is to decrease the value of the function \(f\) as much as possible to get closer to the minimum, one would like to maximize:

    \[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\]

    as a function of the learning rate \(\alpha>0\). Assuming that a quadratic approximation is valid:

    \[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\approx \alpha\left|\frac{\partial f}{\partial\mathbf x}\right|^2-\frac{\alpha^2}{2}\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}\]

    which is a downward-opening parabola in \(\alpha\) precisely when \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) (the \(\alpha^2\) coefficient is then negative), so it has a well-defined maximum; setting the \(\alpha\)-derivative to zero yields the claimed \(\alpha(\mathbf x)\).
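A quick numerical check on a quadratic \(f(\mathbf x)=\frac{1}{2}\mathbf x^TA\mathbf x\), for which the quadratic approximation is exact:

```python
import numpy as np

A = np.array([[5.0, 0.0],
              [0.0, 1.0]])                 # positive-definite Hessian
x = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x

g = A @ x                                  # gradient at x
alpha = (g @ g) / (g @ A @ g)              # line-search learning rate from above

print(f(x), f(x - alpha * g))              # f decreases
# neighbouring step sizes do worse, consistent with alpha maximizing the decrease
print(f(x - 0.5 * alpha * g), f(x - 1.5 * alpha * g))
```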

    Problem: Consider gradient descent on a quadratic “canyon” whose Hessian has condition number \(5\), with the curvature in the \((1,1)\) direction \(5\times\) greater than in the \((1,-1)\) direction, so that the iterates zigzag back and forth across the canyon walls instead of moving along the canyon floor. Explain why gradient descent behaves like this. What can be done to address it?

    Solution: The problem is that the Hessian is ill-conditioned (in this case the condition number is \(5\) everywhere); the curvature in the \((1,1)\) direction is \(5\times\) greater than the curvature in the \((1,-1)\) direction. Gradient descent wastes time repeatedly descending “canyon walls” because it sees those as the steepest features. A \(2^{\text{nd}}\)-order optimization algorithm (e.g. Newton’s method) that uses information from the Hessian would do better.

    Problem: Explain how the Karush-Kuhn-Tucker (KKT) conditions generalize the method of Lagrange multipliers.

    Solution: The idea is, in addition to equality constraints like \(x+y=1\), one can also allow for inequality constraints like \(x^2+y^2\leq 1\) (indeed, an equality \(a=b\) is just \(2\) inequalities \(a\leq b\) and \(a\geq b\)). Then, constrained optimization of a scalar field \(f(\mathbf x)\) subject to a bunch of equality constraints \(g_1(\mathbf x)=g_2(\mathbf x)=…=0\) and a bunch of inequality constraints \(h_1(\mathbf x),h_2(\mathbf x),…\leq 0\) is equivalent to unconstrained optimization of the Lagrangian:

    \[L(\mathbf x,\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h})=f(\mathbf x)+\boldsymbol{\lambda}_{\mathbf g}\cdot\mathbf g(\mathbf x)+\boldsymbol{\lambda}_{\mathbf h}\cdot\mathbf h(\mathbf x)\]

    where \(\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h}\) are called KKT multipliers. The KKT necessary (but not sufficient!) conditions for optimality are then (ignoring technicalities about equating objects of possibly different dimensions):

    1. \(\frac{\partial L}{\partial\mathbf x}=\mathbf 0\) (stationarity)
    2. \(\mathbf g(\mathbf x)=\mathbf 0\geq \mathbf h(\mathbf x)\) (primal feasibility)
    3. \(\boldsymbol{\lambda}_{\mathbf h}\geq 0\) (if the inequalities were instead written as \(\mathbf h(\mathbf x)\geq\mathbf 0\) then this would instead require \(\boldsymbol{\lambda}_{\mathbf h}\leq 0\)) (dual feasibility)
    4. \(\boldsymbol{\lambda}_{\mathbf h}\odot\mathbf h(\mathbf x)=\mathbf 0\) (complementary slackness)

    Information Theory & Inference

    Problem: Draw a schematic of a binary symmetric channel (BSC) with bit flip probability \(p_f\).

    Solution: Classically, the schematic consists of the two input bits \(0\) and \(1\), each transmitted faithfully with probability \(1-p_f\), with “crossed” arrows indicating that either bit gets flipped into the other with probability \(p_f\).

    On the other hand, taking a more quantum perspective, in the computational basis \((|0\rangle,|1\rangle)\) one might define a binary symmetric channel as a kind of quantum logic gate:

    \[\text{BSC}_{p_f}=\begin{pmatrix}\sqrt{1-p_f}&\sqrt{p_f}\\\sqrt{p_f}&\sqrt{1-p_f}\end{pmatrix}\]

    The word “symmetric” in the name emphasizes that the bit flip probability from \(0\to 1\) is identical to the bit flip probability \(1\to 0\).

    Problem: Consider the radio communication link between Galileo and Earth, and suppose that the noisy communication channel between them may be modelled as a BSC with bit flip probability \(p_f=0.01\). In a sequence of \(N=10^5\) bit transmissions, what is the probability that the error rate is less than \(1\%\)?

    Solution: This means that one requires \(N'<0.01N=10^3\) bit flips. Each such \(N'\) has binomial probability \({{N}\choose{N'}}p_f^{N'}(1-p_f)^{N-N'}\) so the total probability is given by the cumulative distribution function:

    \[\sum_{N'=0}^{999}{{N}\choose{N'}}p_f^{N'}(1-p_f)^{N-N'}\]
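A quick numerical evaluation (assuming scipy is available):

```python
from scipy.stats import binom

N, p_f = 10**5, 0.01
# P(fewer than 1000 bit flips out of 100,000) = binomial CDF evaluated at N' = 999
print(binom.cdf(999, N, p_f))              # ~0.49, i.e. essentially a coin flip
```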

    Problem: What are the \(2\) solutions for addressing noise in communication channels? Which one is preferable?

    Solution:

    1. The physical solution: build more reliable hardware, cooling circuits, etc. (basically all the stuff one learns in a typical experimental physics class).
    2. The system solution: accept the noisy communication channel as is, and add communication systems to it so as to detect and correct errors. This is the goal of coding theory.

    The system solution is preferable because it only comes at an increased computational cost whereas the former comes at an increased financial cost. But more importantly, the improvements arising from the system solution can be very dramatic, unlike the incremental improvements of the physical solution.

    Problem: What is the distinction between information theory and coding theory?

    Solution: The difference between information theory and coding theory is analogous to the distinction between theoretical physics and experimental physics; the former is concerned with the theoretical limitations of the aforementioned communication systems (e.g. “what is the best error-correcting performance possible over the space of all possible error-correcting algorithms?”) whereas the latter is interested in specifically designing practical such algorithms (also called codes) for error correction.

    Problem: Draw a schematic that depicts the high-level structure of an error-correcting code.

    Solution: Coding = Encoding + Decoding:

    Here \(\mathbf s\) is the source message vector which is encoded as a transmitted message vector \(\mathbf t=\mathbf s+\textbf{redundancy}\). After transmission across the noisy communication channel, the transmitted message vector has been distorted into \(\tilde{\mathbf t}=\mathbf t+\textbf{noise}\). This is then decoded and received as a message vector \(\tilde{\mathbf s}\) with the hope that \(\tilde{\mathbf s}=\mathbf s\) with probability close to \(1\).

    Problem: Consider a first naive attempt at designing an error-correcting code for a BSC, known as a repetition code. Explain how that works with an example.

    Solution: A choice of error-correcting code basically amounts to a choice of encoder + decoder, since again coding = encoding + decoding. For the repetition code, the encoder is literally a repetition of each bit a predefined number of times, say \(3\) in the example below. The corresponding optimal decoder turns out to be the obvious “majority-vote” among the triplets (assuming \(p_f<1/2\); if it were the case that \(p_f>1/2\) then instead a “minority-vote” decoder would be better!)

    (aside: “optimal” means the decoder should map each \(\tilde{\mathbf t}\) to \(\tilde{\mathbf s}=\text{argmax}_{\mathbf s}p(\mathbf s|\tilde{\mathbf t})\) and in particular as part of computing this conditional probability, another implicit assumption is that both \(p(0)=p(1)=1/2\) have equal prior probabilities)

    Here \(\mathbf n\) is a sparse noise vector each of whose bits is sampled from a \(p_f\)-Bernoulli distribution, where \(0\) means no bit flip and \(1\) means a bit flip, and \(\tilde{\mathbf t}\equiv\mathbf t+\mathbf n\pmod{2}\), or using the xor notation, \(\tilde{\mathbf t}=\mathbf t\oplus\mathbf n\).

    Problem: Show that, compared to no repetition for which the bit flip probability across the BSC is \(p_f<1/2\), the \(3\)-fold repetition code has an effective bit flip probability \(p^{\text{eff}}_f<p_f\) which is reduced compared to the original bit flip probability \(p_f\).

    Solution: In order to get a bit flip, either \(2\) or all \(3\) bits in the \(3\)-fold repetition encoded vector \(\mathbf t\) have to be flipped. Thus, the effective bit flip probability \(p^{\text{eff}}_f\) is a cubic function of \(p_f\):

    \[p^{\text{eff}}_f=3p_f^2(1-p_f)+p_f^3=3p_f^2-2p_f^3\]

    In order to have \(p^{\text{eff}}_f<p_f\), this requires:

    \[3p_f-2p_f^2<1\Rightarrow 2(p_f-1/2)(p_f-1)>0\]

    which is satisfied for \(p_f<1/2\).
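
As a quick sanity check (a minimal sketch, assuming numpy is available), one can Monte Carlo the \(3\)-fold repetition code with a majority-vote decoder and compare against the closed form \(3p_f^2-2p_f^3\):

```python
import numpy as np
rng = np.random.default_rng(0)

p_f, trials = 0.1, 500_000
bits = rng.integers(0, 2, trials)                      # source bits
flips = (rng.random((trials, 3)) < p_f).astype(int)    # BSC noise on the 3 copies
received = (bits[:, None] + flips) % 2                 # each bit transmitted 3 times
decoded = (received.sum(axis=1) >= 2).astype(int)      # majority vote

print((decoded != bits).mean())   # empirical effective bit flip probability
print(3 * p_f**2 - 2 * p_f**3)    # closed form: 0.028
```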

    Problem: Instead of \(N=3\) repetitions, generalize the above expression for the effective bit flip probability \(p^{\text{eff}}_f\) to an arbitrary odd number \(N\) of repetitions.

    Solution: The effective bit flip probability \(p^{\text{eff}}_f\) is now a degree-\(N\) polynomial in \(p_f\):

    \[p^{\text{eff}}_f=\sum_{n=(N+1)/2}^N{N\choose n}p_f^n(1-p_f)^{N-n}\]

    which for \(p_f\ll 0.5\) is dominated by the first term \(n=(N+1)/2\).
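
A small helper (a sketch, not from the original post) makes this formula concrete and confirms that it reduces to the earlier cubic for \(N=3\):

```python
from math import comb

def p_eff(N, p_f):
    """Effective bit flip probability of an N-fold repetition code (N odd, majority vote)."""
    return sum(comb(N, n) * p_f**n * (1 - p_f)**(N - n)
               for n in range((N + 1) // 2, N + 1))

print(p_eff(3, 0.1))                                       # 0.028 = 3 p^2 - 2 p^3
print([round(p_eff(N, 0.1), 6) for N in (1, 3, 5, 7, 9)])  # rapid decay with N
```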

    Problem: Describe how the \((7,4)\) Hamming error correction code works, and explain why it’s an example of a linear block code.

    Solution: The idea is that if the source vector \(\mathbf s\) has length \(L_{\mathbf s}\) and the encoded transmitted vector \(\mathbf t\) has some larger length \(L_{\mathbf t}>L_{\mathbf s}\), then one has the “rate” \(L_{\mathbf s}/L_{\mathbf t}=4/7\). Specifically, for every \(4\) source bits \(\mathbf s=s_1s_2s_3s_4\), the transmitted vector will contain \(7\) bits \(\mathbf t=t_1t_2t_3t_4t_5t_6t_7\) such that \(t_{1,2,3,4}:=s_{1,2,3,4}\) and \(t_{5,6,7}\) are uniquely specified by the requirements that \(t_1+t_2+t_3+t_5\equiv t_2+t_3+t_4+t_6\equiv t_1+t_3+t_4+t_7\equiv 0\pmod 2\); in other words, \(t_{5,6,7}\) are said to be parity-check bits in the sense that they check the parity of a certain subset of the source bits \(s_{1,2,3,4}\). This encoding can be depicted visually as such:

    where the parity of each of the \(3\) circles has to be even. For example, if \(\mathbf s=1011\), then the Hamming encoding of this would be \(\mathbf t=1011001\).

    As for the \((7,4)\) Hamming decoder, for generic \(p_f<1/2\) it can be a bit tricky but if one assumes \(p_f\ll 1/2\), then it is pretty safe to assume that there will be at most \(1\) bit flip during transmission across the noisy binary symmetric communication channel \(\mathbf t\mapsto\tilde{\mathbf t}\). In that case, one can take a given \(\tilde{\mathbf t}\) and compute its syndrome vector \(\mathbf z\); i.e. in each of the \(3\) circles, has the parity remained even \((0)\) or has it become odd \((1)\)? For instance, getting a syndrome \(\mathbf z=000\) means that most likely no bits were flipped and thus there are no errors! Or, if \(\mathbf z=011\), then under the premise that there was only a single bit flip, it must have been \(s_4\), etc. In general, \(\tilde{\mathbf t}\) can have \(8\) possible syndromes \(\mathbf z\), and each such syndrome \(\mathbf z\) maps uniquely onto the \(7\) possible single bit flips together with the possibility of zero bit flips.

    The Hamming code is a block code because the parity-check bits look at whole blocks (or subsets) of the source bitstring \(\mathbf s\), rather than one bit of \(\mathbf s\) at a time (as was essentially done in repetition coding). The Hamming code is also said to be a linear code (mod \(2\)) because the encoding operation is linear in the sense that (assuming \(\mathbf s,\mathbf t\) are both column vectors):

    \[\mathbf t=G^T\mathbf s\]

    where the \(7\times 4\) generator matrix is given by:

    \[G^T=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\\1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\]

    Or transposing both sides, \(\mathbf t^T=\mathbf s^TG\), and in some coding theory texts one often takes \(\mathbf s,\mathbf t\) to be row vectors by default in which case this would just be written \(\mathbf t=\mathbf sG\). Of course, the columns of \(G^T\) represent the encodings \(\mathbf t\) of the “basis” bitstrings \(\mathbf s=1000,0100,0010,0001\).
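
For concreteness, here is a minimal numpy sketch of the encoder, reproducing the worked example \(\mathbf s=1011\mapsto\mathbf t=1011001\):

```python
import numpy as np

# Generator matrix G^T from above; its columns are the encodings of 1000, 0100, 0010, 0001
G_T = np.array([[1,0,0,0],
                [0,1,0,0],
                [0,0,1,0],
                [0,0,0,1],
                [1,1,1,0],
                [0,1,1,1],
                [1,0,1,1]])

s = np.array([1, 0, 1, 1])
t = G_T @ s % 2
print(t)   # [1 0 1 1 0 0 1], i.e. t = 1011001
```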

    Problem: The \((7,4)\) Hamming decoder \(\tilde{\mathbf t}\to\mathbf z\to\tilde{\mathbf s}\) uses a syndrome vector \(\mathbf z\) as an intermediate in the decoding process. Show that the map from \(\tilde{\mathbf t}\to\mathbf z\) is linear in the sense that:

    \[\mathbf z=H\tilde{\mathbf t}\]

    where the parity-check matrix \(H:=(P,1_3)\), and \(P=\begin{pmatrix}1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\) also appears in the generator matrix via:

    \[G^T=\begin{pmatrix}1_4\\P\end{pmatrix}\]

    Show that \(HG^T=0\) and hence if there were no errors such that \(\tilde{\mathbf t}=\mathbf t\), then the corresponding syndrome vector \(\mathbf z=\mathbf 0\Leftrightarrow\mathbf t\in\text{ker}H\). Hence, explain why a maximum-likelihood decoder is equivalent to finding the most probable noise vector \(\mathbf{n}\) for which \(\mathbf z=H\mathbf n\).

Solution: Each syndrome bit is just the mod-\(2\) sum of the bits in its parity circle, which is precisely what the corresponding row of \(H\) computes, so \(\mathbf z=H\tilde{\mathbf t}\) by direct inspection. And clearly \(HG^T=P+P=2P=0\) because \(2\equiv 0\pmod 2\). It follows then that \(\mathbf t\in\text{ker}H\) because \(\mathbf t=G^T\mathbf s\). Thus, because \(\tilde{\mathbf t}=\mathbf t+\mathbf n\), acting on both sides with \(H\) results in \(\mathbf z=H\mathbf n\); since the syndrome depends only on the noise, maximum-likelihood decoding amounts to finding the most probable (for \(p_f<1/2\), the sparsest) noise vector \(\mathbf n\) consistent with the observed syndrome \(\mathbf z=H\mathbf n\).
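
These identities are easy to verify numerically (a small sketch, assuming numpy); note also that the syndrome of a single bit flip in position \(j\) is just the \(j^{\text{th}}\) column of \(H\), which is exactly what the decoder's syndrome table exploits:

```python
import numpy as np

P = np.array([[1,1,1,0],
              [0,1,1,1],
              [1,0,1,1]])
G_T = np.vstack([np.eye(4, dtype=int), P])   # 7x4 generator (as G^T)
H = np.hstack([P, np.eye(3, dtype=int)])     # 3x7 parity-check matrix

print((H @ G_T) % 2)                         # the zero 3x4 matrix, i.e. H G^T = 0 (mod 2)

t = G_T @ np.array([1, 0, 1, 1]) % 2         # a valid codeword, so its syndrome vanishes
n = np.array([0, 0, 0, 1, 0, 0, 0])          # flip the 4th bit
print(H @ t % 2)                             # [0 0 0]
print(H @ ((t + n) % 2) % 2, H[:, 3])        # syndrome [0 1 1] = 4th column of H
```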

    Problem: As the \((7,4)\) Hamming code is a block code, given a possibly very long source message, the only way to implement the code is to chop up the message into blocks of length \(L_{\mathbf s}=4\); naturally, one can ask, for any given block in this “assembly line”, what’s the probability \(p(\tilde{\mathbf s}\neq\mathbf s)\) of it being decoded incorrectly? By contrast, a different question one can ask is what is the effective bit flip probability \(p_f^{\text{eff}}\) within one of these blocks? (give both to leading order in \(p_f\))

Solution: First, notice that without the Hamming code (i.e. transmitting the \(4\) source bits as-is), any nonzero number of bit flips leads to a decoding error \(\tilde{\mathbf s}\neq\mathbf s\); as \(p_f\ll 1/2\), the dominant process is having a single bit flip, so in this case \(p(\tilde{\mathbf s}\neq\mathbf s)={4\choose{1}}p_f(1-p_f)^3+…=4p_f+O(p_f^2)\).

    By contrast, with the \((7,4)\) Hamming code, the block error probability is no longer linear in \(p_f\), but quadratic. This is simply because the \((7,4)\) Hamming code is designed to correct all \(1\)-bit flip errors, thereby eradicating the dominant process above. Unfortunately, any \(2\) bit flips (implicitly on distinct bits) in the \(7\)-bit encoding \(\mathbf t\) will not be corrected, leading to block error \(\tilde{\mathbf s}\neq\mathbf s\) (identical remarks apply to \(3,4,5,6,7\) bit flips); after all, no matter what syndrome vector \(\mathbf z\) one finds in \(\tilde{\mathbf t}\), the Hamming decoder tells one to unflip at most \(1\) bit because it operates under the assumption that there was only \(1\) bit flip, so it is literally impossible to fully correct a \(\tilde{\mathbf t}\) containing more than \(1\) bit flip. Thus:

    \[p(\tilde{\mathbf s}\neq\mathbf s)=\sum_{n=2}^7{7\choose{n}}p_f^n(1-p_f)^{7-n}=21p_f^2+O(p_f^3)\]

    By contrast, the effective bit flip probability is in general:

    \[p_f^{\text{eff}}=\frac{1}{L_{\mathbf s}}\sum_{i=1}^{L_{\mathbf s}}p(\tilde s_i\neq s_i)\]

But here all \(L_{\mathbf s}=4\) source bits are indistinguishable (indeed all \(L_{\mathbf t}=7\) transmitted bits are indistinguishable), so one can focus arbitrarily on e.g. \(i=1\) and ask what is \(p_f^{\text{eff}}=p(\tilde s_1\neq s_1)\). There are then \(2\) distinct ways to obtain the same answer (to leading order); the first is to explicitly enumerate all the possible \(2\) bit flip errors in \(\tilde{\mathbf t}\) such that the final decoded source vector \(\tilde{\mathbf s}\) will have an error in bit \(\tilde s_1\neq s_1\): clearly there are \(6\) such combinations in which \(s_1\) is flipped in addition to one of the other \(6\) bits (the decoder will then always flip a third bit, which does not rescue \(s_1\)). Meanwhile, one can check there are also \(3\) other configurations in which \(s_1\) is not initially flipped, but rather the decoder incorrectly flips it when going from \(\tilde{\mathbf t}\to\tilde{\mathbf s}\):

    Since each of these are still \(2\) bit flip processes, one thus has:

    \[p_f^{\text{eff}}=9p_f^2(1-p_f)^5+…=9p_f^2+O(p_f^3)\]

To leading order, one thus observes that \(p_f^{\text{eff}}=\frac{3}{7}p(\tilde{\mathbf s}\neq\mathbf s)\); the \(3/7\) prefactor has a simple interpretation as saying that whenever there were \(2\) bit flips in \(\tilde{\mathbf t}\), there would inevitably be \(3\) erroneous bits in the decoded \(\tilde{\mathbf s}\) (prior to truncating away the \(3\) parity-check bits), so \(3\) out of the \(7\) bits in \(\tilde{\mathbf s}\) are in error, and all \(7\) bits are indistinguishable. This thus has a very Bayesian interpretation where \(p(\tilde{\mathbf s}\neq\mathbf s)\) is the prior and \(3/7\) is the conditional probability of a bit like \(s_1\) being flipped if it’s known that the whole \(7\)-bit block has suffered a \(3\)-bit error.
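
Both leading-order coefficients (\(21\) for the block error and \(9\) for the per-bit error) can be checked by brute-force simulation of the full encode–corrupt–decode pipeline; a minimal sketch assuming numpy:

```python
import numpy as np
rng = np.random.default_rng(0)

P = np.array([[1,1,1,0],
              [0,1,1,1],
              [1,0,1,1]])
G_T = np.vstack([np.eye(4, dtype=int), P])   # 7x4 generator (as G^T)
H = np.hstack([P, np.eye(3, dtype=int)])     # 3x7 parity-check matrix

# syndrome -> the single bit most likely to have flipped (columns of H are all distinct)
syndrome_to_bit = {tuple(H[:, j]): j for j in range(7)}

def decode(t_tilde):
    z = tuple(H @ t_tilde % 2)
    t_hat = t_tilde.copy()
    if any(z):                                # nonzero syndrome: unflip the implicated bit
        t_hat[syndrome_to_bit[z]] ^= 1
    return t_hat[:4]                          # the first 4 bits are the source bits

p_f, trials = 0.01, 200_000
block_errors = bit_errors = 0
for _ in range(trials):
    s = rng.integers(0, 2, 4)
    t = G_T @ s % 2
    noise = (rng.random(7) < p_f).astype(int)
    s_hat = decode((t + noise) % 2)
    block_errors += int(np.any(s_hat != s))
    bit_errors += int(s_hat[0] != s[0])       # track one representative source bit

print(block_errors / trials, 21 * p_f**2)     # both roughly 2.1e-3
print(bit_errors / trials, 9 * p_f**2)        # both roughly 9e-4
```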

    Problem: How many distinct elements are there in \(\ker H\)? What is the interpretation of such an element?

Solution: Since \(H\) has rank \(3\), its kernel is a \(4\)-dimensional subspace of \(\mathbf F_2^7\), so there are \(2^4=16\) elements in \(\ker H\); indeed \(\ker H\) is precisely the set of valid codewords \(\mathbf t=G^T\mathbf s\). Interpreted as a pattern of bit flips/noise, each of the \(15\) nonzero elements \(\mathbf n\in\ker H\) is an undetectable error, since its corresponding syndrome is \(\mathbf z=H\mathbf n=\mathbf 0\). For example:

    Problem: The \((7,4)\) Hamming code is not the only one; what other possible \((L_{\mathbf t},L_{\mathbf s})\) Hamming codes are possible?

Solution: Any for which \(L_{\mathbf t}-L_{\mathbf s}=\log_2(L_{\mathbf t}+1)\), i.e. \(L_{\mathbf t}=2^m-1\) and \(L_{\mathbf s}=2^m-1-m\) for some integer \(m\) (so \((3,1),(7,4),(15,11),(31,26),…\)); the RHS is the number of parity-check bits, which grows merely logarithmically with the total transmission size \(L_{\mathbf t}\)!

    Problem: What are the \(2\) main interpretations of probability?

Solution: There are the frequentist and Bayesian interpretations of probability; the former interprets probabilities as long-run event frequencies while the latter interprets probabilities as degrees of belief (this is possible provided the concept of “degree of belief” satisfies certain reasonable Cox axioms). In general, probability theory is simply an extension of logic to account for uncertainty.

    Problem: Whenever dealing with multiple random variables \(X,Y,…\), what is the fundamental object to look at? What can be derived from this fundamental object? (cf. the partition function in statistical mechanics as serving the same sort of “all-encompassing” role).

    Solution: The fundamental object which one should prioritize figuring out is the joint probability distribution function \(p_{(X,Y,…)}(x,y,…)\) of all the random variables \((X,Y,…)\). Consider a simple visualization of this concept for just \(2\) real-valued random variables \(X, Y\) and the marginal as well as conditional probability distributions one can obtain from their joint probability distribution \(p_{(X,Y)}(x,y)\equiv p(x,y)\):

    Problem: A certain disease is present among \(1\%\) of the world’s population, and a medical test for the disease has a \(95\%\) true positive rate and a \(90\%\) true negative rate. If Joe tests positive for the disease, what are his odds of actually having it?

    Solution: For these sorts of problems, there is essentially a \(3\)-step recipe to solving them:

    1. Map out the complete marginal distribution of the prior.
    2. For each value of the prior, “unshovel” your way back to the joint distribution w.r.t. all values of the new data.
    3. Hence, compute whatever you need to.

    In this case, the prior \(X\) may be modelled as a Bernoulli random variable taking on the value \(x=0\) if a person doesn’t have the disease and \(x=1\) if they do. Following the steps:
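
Carrying out the arithmetic explicitly (note the \(90\%\) true negative rate implies a \(10\%\) false positive rate):

\[p(x=1|+)=\frac{p(x=1)\,p(+|x=1)}{p(x=1)\,p(+|x=1)+p(x=0)\,p(+|x=0)}=\frac{0.01\times 0.95}{0.01\times 0.95+0.99\times 0.10}=\frac{0.0095}{0.1085}\approx 8.8\%\]

or, phrased as odds, roughly \(0.0095:0.0990\approx 1:10.4\) in favour of Joe not having the disease despite the positive test.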

    Problem: Suppose that on a certain online store, products can be rated either positively or negatively. There are \(3\) different sellers of the same product. Seller \(A\) has \(10\) reviews with \(100\%\) positive ratings, seller \(B\) has \(50\) reviews with \(96\%\) positive ratings, and seller \(C\) has \(200\) reviews with \(93\%\) positive ratings. Which seller should one buy from?

    Solution: This is another Bayesian inference problem (draw a picture with \(p_+\) on the horizontal axis and \(N_+\) on the vertical axis!). The model is that the seller has some underlying fixed, but unknown probability \(p_+\) of delivering a positive experience and \(1-p_+\) of delivering a negative experience to a user. A priori, if one had no knowledge about the review information of other customers, then the prior \(p_+\) can be modelled by a uniform distribution \(p(p_+)=1\) for \(p_+\in[0,1]\). But then, equipped with the review info of \(N_+\) positive reviews out of \(N\), Bayesian updating says that this uniform prior should be updated to a beta distributed posterior:

    \[p(p_+|N_+,N)=\frac{p(p_+)p(N_+,N|p_+)}{\int_0^1dp_+p(p_+)p(N_+,N|p_+)}=\frac{{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}{\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}=\frac{(N+1)!p_+^{N_+}(1-p_+)^{N-N_+}}{N_+!(N-N_+)!}\]

    where the beta function \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\) normalization has been (optionally) used.

    A reasonable goal would be to maximize one’s own probability of a positive experience. This means taking the posterior \(p(p_+|N_+,N)\) computed above and recycling that as one’s prior \(p(p_+|N_+,N)\mapsto p(p_+)\). Then, the probability \(p(+)\) of a positive experience coincides with the expected value of \(p_+\):

    \[p(+)=\int_0^1 dp_+p(p_+)p_+=\frac{\textrm{B}(N_++2,N-N_++1)}{\textrm{B}(N_++1,N-N_++1)}=\frac{N_++1}{N+2}\]

this result is called Laplace’s rule of succession, and may be remembered by taking the original \(N\) reviews, and appending \(2\) new fictitious reviews in which one is positive and one is negative. For instance, if a product had no reviews so that \(N=N_+=0\), then reasonably enough Laplace’s rule of succession predicts one to have a \(p(+)=1/2\) probability of a positive experience. Applying it to each of the sellers in the question, one finds that seller \(B\) has the highest probability \(p(+)\), so one should buy from them.
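
Explicitly, plugging each seller’s review counts into Laplace’s rule of succession:

\[p_A(+)=\frac{10+1}{10+2}\approx 0.917,\qquad p_B(+)=\frac{48+1}{50+2}\approx 0.942,\qquad p_C(+)=\frac{186+1}{200+2}\approx 0.926\]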

Remember that Bayesian inference is always predicated on assumptions, which above meant a uniform equiprobable ignorance \(p(p_+)=1\) at the very beginning about \(p_+\), though it has been suggested by Jeffreys, Haldane, etc. to use a different prior like \(p(p_+)=\frac{1}{\pi\sqrt{p_+(1-p_+)}}\) or \(p(p_+)\sim\frac{1}{p_+(1-p_+)}\) to model one’s ignorance. Of course, this would modify how Laplace’s rule of succession looks with each of these priors.

    Problem: (for fun!) Prove the earlier beta function relation \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\). Show that in the special case of positive integers \(z,w\in\mathbf Z^+\), the result may be derived combinatorially.

    Solution:

    (loosely speaking, this is reminiscent of the identity for the exponential function \(\exp(z)\exp(w)=\exp(z+w)\) except for the \(\Gamma\) function there is an “extra factor” in the form of the beta function \(B(z,w)\)). In the special case where \(z,w\in\mathbf Z^+\), the initial temptation would be to expand \((1-x)^{w-1}\) using the binomial theorem, but it turns out there is a much cuter way to evaluate the integral by telling a story.

    Imagine seeing a random distribution of \(N+1\) balls on the unit interval \([0,1]\), of which \(1\) of them is pink while the other \(N\) are black. There are \(2\) ways they could’ve gotten there. On the one hand, it may be that there were initially \(N+1\) black balls, and then one of them was painted pink at random, and then the balls were thrown altogether onto \([0,1]\). Or it may be that the \(N+1\) black balls were first thrown altogether onto \([0,1]\), and subsequently one of them was painted pink at random. Clearly, these \(2\) processes are equivalent definitions of a discrete random variable \(N_+\) defined as the number of black balls to the left of the pink ball on \([0,1]\). This has probability mass function \(p(N_+)\) in the first story given by:

    \[p(N_+)=\int_0^1dp_+p(\text{pink thrown at }p_+)p(N_+|\text{pink thrown at }p_+)\]

    \[=\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}\]

    On the other hand, according to the second story, \(N_+=0,1,…,N\) each with uniform probability!

    \[p(N_+)=\frac{1}{N+1}\]

    Hence it is established that:

    \[\int_0^1 dp_+p_+^{N_+}(1-p_+)^{N-N_+}=\frac{N_+!(N-N_+)!}{(N+1)!}\]

which is the special case \(z=N_++1\), \(w=N-N_++1\) of the earlier more general result for \(z,w\in\mathbf C\).

    Problem: Given a discrete random variable \(X\) with probability mass function \(p(x)\), define the information content \(I(x)\) associated to drawing outcome \(x\) from \(X\). Hence, define the entropy \(S_X\) of \(X\).

    Solution: A picture is worth a thousand words:

where \(I(x):=-\log p(x)\) is the surprise associated with drawing \(x\) from \(X\), and \(S_X:=\sum_xp(x)I(x)=-\sum_x p(x)\log p(x)\) (which in base \(2\) can also be written \(\sum_x I(x)2^{-I(x)}\)) is the average surprise of the discrete random variable \(X\). The choice of base in the \(\log\) defines the unit of information (e.g. bits, nats, etc.).

    Problem: Show that for \(X\perp Y\) independent discrete random variables, their joint entropy \(S_{(X,Y)}=S_X+S_Y\).

    Solution: This is really rooted in Shannon’s fundamental axioms that the joint information content for independent random variables should be additive \(I_{(X,Y)}(x,y)=I_X(x)+I_Y(y)\):

    \[S_{(X,Y)}:=-\sum_{(x,y)}p(x,y)\log p(x,y)=-\sum_{(x,y)}p(x)p(y)\log p(x)-\sum_{(x,y)}p(x)p(y)\log p(y)=S_X+S_Y\]

    Problem: Give the intuition behind the Kullback-Leibler divergence \(D_{\text{KL}}(\tilde X|X)\) of a random variable \(\tilde X\) with respect to another random variable \(X\) (both defined over the same sample space).

    Solution: Any random variable \(X\) inherently contains some non-negative average level of surprise \(S_X\geq 0\) because of the word “random” in “random variable” (indeed, \(S_X=0\) iff \(p(x)=1\) for some outcome \(x\)). That’s one kind of surprise. But in practice, if one is under the illusion that the outcomes \(x\) are being drawn from a random variable \(\tilde X\) when in fact they are actually following a different random variable \(X\neq\tilde X\), then the perceived degree of surprise \(S_{\tilde X|X}=-\sum_xp_X(x)\log p_{\tilde X}(x)\) (called the cross-entropy of \(\tilde X\) with respect to \(X\)) would intuitively be larger than the average surprise \(S_X\) inherent in the intrinsic randomness of \(X\) itself. The KL divergence of \(\tilde X\) with respect to \(X\) thus measures how surprising the data is purely due to the wrong model \(\tilde X\) being used in lieu of the ground truth \(X\):

    \[D_{\textrm{KL}}(\tilde X|X):=S_{\tilde X|X}-S_X\]

    thus, it should be intuitively clear that \(D_{\textrm{KL}}(\tilde X|X)\geq 0\), a result known as the Gibbs inequality.
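
A tiny numerical illustration of the Gibbs inequality (a sketch assuming numpy, with an arbitrarily chosen pair of distributions over a \(3\)-outcome sample space):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # ground-truth distribution of X
q = np.array([0.25, 0.25, 0.5])  # wrongly-assumed model distribution

S_X = -(p * np.log2(p)).sum()        # entropy of X, in bits
S_cross = -(p * np.log2(q)).sum()    # cross-entropy of the model with respect to X
D_KL = S_cross - S_X

print(S_X, S_cross, D_KL)            # D_KL comes out positive (about 0.31 bits here)
```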

    Problem: What is the intuition behind Jensen’s inequality.

    Solution: Take any “smiling” curve and hang a bunch of masses on the curve. Then Jensen’s inequality says that their center of mass will also lie above the curve.

    More precisely, for a convex function \(f(x)\) (which means any secant line lies above the function’s graph, or algebraically \(f(\lambda x+(1-\lambda)x’)\leq\lambda f(x)+(1-\lambda)f(x’)\) for \(\lambda\in[0,1]\)), one has:

    \[\langle f(x)\rangle\geq f(\langle x\rangle)\]

with equality \(\langle f(x)\rangle=f(\langle x\rangle)\) iff \(x\) is (almost surely) constant, assuming \(f\) is strictly convex (if \(f\) has affine pieces, equality can also occur when \(x\) is supported on one such piece). More explicitly, for any convex linear combination of the form \(p_1x_1+p_2x_2+…+p_Nx_N\) where each \(p_i\geq 0\) and \(\sum_{i=1}^Np_i=1\), one has:

    \[\sum_i p_if(x_i)\geq f\left(\sum_ip_ix_i\right)\]

    The inequality is reversed for concave functions \(f(x)\).

    Problem: Unstable particles are emitted from a source and decay at a distance \(x\geq 0\), an exponentially distributed real random variable with characteristic length \(x_0\). Decay events can only be observed if they occur in a window extending from \(x=1\) to \(x=20\) (arbitrary units). If \(100\) decays are registered at some locations \(1\leq x_1,x_2,…,x_{100}\leq 20\), estimate the characteristic length \(x_0\).

    Solution: The prior distribution of \(x\) is, as stated, an exponential decay:

    \[p(x|x_0)=\frac{e^{-x/x_0}}{x_0}[x\geq 0]\]

    which is normalized \(\int_{-\infty}^{\infty}dxp(x|x_0)=1\). After learning that \(x\in [1,20]\), the posterior distribution of \(x\) becomes:

    \[p(x|x\in [1,20],x_0)=\frac{p(x|x_0)p(x\in[1,20]|x, x_0)}{p(x\in[1,20]|x_0)}\]

    but if \(x\) is measured, and happens to lie in \(x\in[1,20]\), then \(p(x\in[1,20]|x, x_0)=1\), otherwise if \(x\notin [1,20]\) then \(p(x\in[1,20]|x, x_0)=0\); all in all one has the compact Iverson bracket expression \(p(x\in[1,20]|x, x_0)=[x\in [1,20]]\). The denominator is the integral of the numerator and indeed is just the usual way that probability density functions are meant to be used:

    \[p(x\in[1,20]|x_0)=\int_{-\infty}^{\infty}dxp(x|x_0)p(x\in[1,20]|x,x_0)=\int_1^{20}dx\frac{e^{-x/x_0}}{x_0}=e^{-1/x_0}-e^{-20/x_0}\]

    Now, using this updated prior, one has:

    \[p(\{x_1,…,x_N\}|x\in[1,20],x_0)=p(x_1|x\in[1,20],x_0)…p(x_N|x\in[1,20],x_0)=\frac{e^{-(x_1+…+x_N)/x_0}}{x_0^N(e^{-1/x_0}-e^{-20/x_0})^N}\]

    But one would instead like to Bayesian-infer \(p(x_0|\{x_1,…,x_N\},x\in[1,20])\) so as to find the most probable value of the characteristic length \(x_0\):

    \[p(x_0|\{x_1,…,x_N\},x\in[1,20])=\frac{p(x_0|x\in[1,20])p(\{x_1,…,x_N\}|x\in[1,20],x_0)}{p(\{x_1,…,x_N\}|x\in[1,20])}\]

    Clearly, the only missing puzzle piece in here is the prior distribution \(p(x_0|x\in[1,20])\) on the characteristic length itself. Even without specifying this, one can still gain a lot of insight by plotting \(p(\{x_1,…,x_N\}|x\in[1,20],x_0)\) as a function of \(x_0\), or more precisely, to get the key idea, plot the likelihood \(p(x|x\in[1,20],x_0)\) of \(x_0\) for various fixed \(x\in[1,20]\), and notice in particular a peak in \(x_0\) as \(x\) decreases.
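
To make this concrete, here is a minimal numerical sketch (assuming numpy, and using synthetic data drawn from a hypothetical true \(x_0=5\) since the actual \(100\) measured locations are not given) that maximizes the window-truncated likelihood over \(x_0\):

```python
import numpy as np
rng = np.random.default_rng(0)

# Synthetic decay locations: exponential draws kept only if inside the window [1, 20]
x0_true, window = 5.0, (1.0, 20.0)
samples = []
while len(samples) < 100:
    x = rng.exponential(x0_true)
    if window[0] <= x <= window[1]:
        samples.append(x)
x = np.array(samples)

def log_likelihood(x0):
    # log of prod_i e^{-x_i/x0} / (x0 (e^{-1/x0} - e^{-20/x0}))
    return (-x.sum() / x0 - len(x) * np.log(x0)
            - len(x) * np.log(np.exp(-window[0] / x0) - np.exp(-window[1] / x0)))

grid = np.linspace(1.0, 30.0, 2000)
x0_hat = grid[np.argmax([log_likelihood(x0) for x0 in grid])]
print(x0_hat)   # maximum-likelihood estimate; should land in the vicinity of x0_true
```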

    Posted in Blog | Leave a comment

    SVD, Pseudoinverse, and PCA

    Problem: State the form of the singular value decomposition (SVD) of an arbitrary linear operator \(X:\mathbf C^m\to\mathbf C^n\).

    Solution: The SVD of \(X\) is given by:

    \[X=U_2\Sigma U^{\dagger}_1\]

where \(U_1\in U(m)\) and \(U_2\in U(n)\) are unitary operators and \(\Sigma:\mathbf C^m\to\mathbf C^n\) is a (generally rectangular) diagonal operator with non-negative real entries, the singular values of \(X\). Strictly speaking, one should prove that such a factorization of “rotate-stretch-rotate” really does exist for any \(X\); the details are omitted here.

    Problem: In practice, how does one find the SVD of a linear operator \(X\)?

    Solution: The idea is to look at the \(2\) Hermitian, positive semi-definite operators constructed from \(X\):

    \[XX^{\dagger}=U_2\Sigma\Sigma^{\dagger}U_2^{\dagger}\]

    \[X^{\dagger}X=U_1\Sigma^{\dagger}\Sigma U_1^{\dagger}\]

And in particular, by the spectral theorem, both \(XX^{\dagger}\) and \(X^{\dagger}X\) have an orthonormal eigenbasis (of \(\mathbf C^n\), \(\mathbf C^m\) respectively) corresponding to real, non-negative eigenvalues \(\sigma^2_1\geq\sigma^2_2\geq…\geq\sigma^2_{\text{min}(n,m)}\) (the squared singular values), so since the above matches the standard eigendecomposition template, it’s clear that computing the SVD of \(X\) just amounts to diagonalizing \(XX^{\dagger}\) and \(X^{\dagger}X\) (actually just \(1\) has to be diagonalized, namely diagonalize \(XX^{\dagger}\) if \(n\leq m\) and diagonalize \(X^{\dagger}X\) if \(m\leq n\). Then, because the nonzero eigenvalues \(\sigma^2_1,\sigma^2_2,…,\sigma^2_{\text{min}(n,m)}\) are identical for both \(XX^{\dagger}\) and \(X^{\dagger}X\), it’s easy to find the eigenvectors of the other: each eigenvector \(\mathbf u\) of \(X^{\dagger}X\) with eigenvalue \(\sigma^2\neq 0\) maps to the eigenvector \(X\mathbf u/\sigma\) of \(XX^{\dagger}\)).
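
In practice one rarely diagonalizes by hand; a quick numerical sketch (assuming numpy) confirms that the squared singular values coincide with the eigenvalues of \(XX^{\dagger}\):

```python
import numpy as np
rng = np.random.default_rng(0)

n, m = 4, 6
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

# numpy convention: X = U @ diag(s) @ Vh, so U plays the role of U_2 and Vh^dagger of U_1
U, s, Vh = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, (U * s) @ Vh))          # True: X = U_2 Sigma U_1^dagger

evals = np.linalg.eigvalsh(X @ X.conj().T)   # ascending eigenvalues of X X^dagger
print(np.allclose(np.sort(s**2), evals))     # True: they are exactly the sigma_i^2
```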

    Problem: The most important application of SVD is to estimate a complicated linear operator \(X:\mathbf C^m\to\mathbf C^n\) by a low-rank approximation. Explain how this is achieved.

    Solution: Order the \(\text{min}(n,m)\) singular values of \(X:\mathbf C^m\to\mathbf C^n\) from greatest to least as before \(\sigma^2_1,\sigma^2_2,…\sigma^2_{\text{min}(n,m)}\). Write the column eigenvectors \(U_1=(\mathbf u_1^{(1)}, \mathbf u_1^{(2)},…,\mathbf u_1^{(m)})\) and \(U_2=(\mathbf u_2^{(1)}, \mathbf u_2^{(2)},…,\mathbf u_2^{(n)})\). Then the singular value decomposition may also be written:

\[X=U_2\Sigma U^{\dagger}_1=\sum_{i=1}^{\text{min}(n,m)}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

    In particular, the Eckart-Young theorem states the best rank-\(r\) approximation to \(X\) (for \(r\leq\text{rank}(X)\)) is given by a truncating the above series at the first \(r\) terms:

\[X\approx \sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

    Here, the word “best” is with respect to the Frobenius norm:

\[\text{argmin}_{\tilde X\in\mathbf C^{n\times m}:\text{rank}(\tilde X)=r}\sqrt{\text{Tr}((X-\tilde X)^{\dagger}(X-\tilde X))}=\sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

    (insert proof here:)
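
Even without the proof, the Eckart-Young statement is easy to check numerically; a small sketch (assuming numpy) also verifies the known fact that the Frobenius error of the best rank-\(r\) approximation equals \(\sqrt{\sum_{i>r}\sigma_i^2}\):

```python
import numpy as np
rng = np.random.default_rng(1)

X = rng.standard_normal((8, 12))
U, s, Vh = np.linalg.svd(X, full_matrices=False)

r = 3
X_r = (U[:, :r] * s[:r]) @ Vh[:r, :]      # truncate the SVD at the first r terms

# Frobenius error of the rank-r truncation vs. root-sum-square of discarded singular values
print(np.linalg.norm(X - X_r, 'fro'), np.sqrt((s[r:]**2).sum()))
```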

    Problem: Here is another application of SVD: given a diagonal operator \(\Sigma:\mathbf C^m\to\mathbf C^n\), explain how to compute its (Moore-Penrose) pseudoinverse \(\Sigma^+:\mathbf C^n\to\mathbf C^m\). Hence, how can the pseudoinverse of a general linear operator \(X:\mathbf C^m\to\mathbf C^n\) be computed?

    Solution: \(\Sigma^+\) is almost like the usual inverse \(\Sigma^{-1}\) in that one simply inverts all the eigenvalues on the diagonal. But the exception is if \(\Sigma\) has a zero eigenvalue…then ordinarily \(\Sigma^{-1}\) would not exist, so for this case the pseudoinverse is created and the instruction is to simply leave the zeros be (rather than trying to invert them and get \(1/0=\infty\)). Finally, also don’t forget to transpose the result. For a general \(X\), the quickest way to compute the pseudoinverse is to use its SVD:

    \[X=U_2\Sigma U^{\dagger}_1\]

    Ordinarily trying to take the inverse (even if \(n\neq m\)) would appear to give:

    \[X^{-1}=U_1\Sigma^{-1}U^{\dagger}_2\]

    But since \(\Sigma^{-1}\) may not exist, the pseudoinverse of \(X\) is thus given by:

    \[X^+=U_1\Sigma^+U_2^{\dagger}\]

    (of course, if \(X^{-1}\) actually exists then the pseudoinverse is in fact not “pseudo” at all \(X^+=X^{-1}\)).

    Problem: In general, given a vector \(\mathbf y\in\mathbf C^n\) and linear operator \(X:\mathbf C^m\to\mathbf C^n\), the equation \(\mathbf y=X\mathbf x\) only has a unique solution for \(\mathbf x\in\mathbf C^m\) provided that \(X\) is invertible, i.e. \(\mathbf x=X^{-1}\mathbf y\). If this isn’t the case however, one can still compute the pseudoinverse \(X^+\); what interpretation does \(X^+\mathbf y\in\mathbf C^m\) have?

Solution: There are \(2\) cases; if \(m\geq n\) (and \(X\) has full rank \(n\)), then there will be many solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\), and in particular \(X^+\mathbf y\) is also an element of this solution space (i.e. \(XX^+\mathbf y=\mathbf y\)) but has the additional property that it is closest to the origin:

    \[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m:\mathbf y=X\mathbf x}|\mathbf x|\]

By contrast, if \(m\leq n\), then generically there will be no solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\) and in that case of course \(X^+\mathbf y\) is also not a solution, but nevertheless it is the closest one can come to an actual solution in the sense that:

    \[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m}|\mathbf y-X\mathbf x|\]
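
Both interpretations are easy to see numerically with numpy’s built-in pseudoinverse (a sketch; real matrices are used for simplicity):

```python
import numpy as np
rng = np.random.default_rng(2)

# Underdetermined case (m > n): many solutions, X^+ y picks the minimum-norm one
X_wide = rng.standard_normal((3, 5))               # maps C^5 -> C^3
y = rng.standard_normal(3)
x_min_norm = np.linalg.pinv(X_wide) @ y
print(np.allclose(X_wide @ x_min_norm, y))         # True: it really solves y = X x

# Overdetermined case (m < n): generically no solution, X^+ y is the least-squares one
X_tall = rng.standard_normal((5, 3))               # maps C^3 -> C^5
y = rng.standard_normal(5)
x_lstsq = np.linalg.pinv(X_tall) @ y
print(np.allclose(x_lstsq, np.linalg.lstsq(X_tall, y, rcond=None)[0]))   # True
```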

    Problem: Explain the what and how of conducting principal component analysis (PCA) of a bunch of (say \(m\)) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\).

    Solution: Earlier, \(X:\mathbf C^m\to\mathbf C^n\) simply had the interpretation of an arbitrary linear operator, in which case the interpretation of \(XX^{\dagger}\) and \(X^{\dagger}X\) were both likewise arbitrary. However, one intuitive interpretation is to view \(X\) as a data matrix; if one takes the \(m\) columns of \(X=(\mathbf x_1,…,\mathbf x_m)\) to represent the \(m\) feature vectors in \(\mathbf C^n\), then the Gram matrix \(X^{\dagger}X\) takes on a statistical interpretation as an autocorrelation matrix:

    \[X^{\dagger}X=
    \begin{pmatrix} |\mathbf{x}_1|^2 & \cdots & \mathbf{x}_1^{\dagger}\mathbf{x}_m \\
    \vdots & \ddots & \vdots \\
    \mathbf{x}_m^{\dagger}\mathbf{x}_1 & \cdots & |\mathbf{x}_m|^2
    \end{pmatrix}\]

or alternatively, if \(X\mapsto X-\boldsymbol{\mu}\otimes\mathbf{1}_m\) has already been mean-subtracted (where \(\boldsymbol{\mu}=\sum_{i=1}^m\mathbf x_i/m\)) so as to have zero-mean, then \(XX^{\dagger}/m\) is just the covariance matrix of the data (while \(X^{\dagger}X\) remains the Gram matrix of sample-sample inner products). Based on the earlier discussion of SVD, it follows that the columns of \(U_2\in U(n)\) give the principal axes of the covariance matrix \(XX^{\dagger}/m\), and that the corresponding singular values in \(\Sigma\) measure (up to a factor of \(\sqrt m\)) the standard deviation of the data along those principal axes (as variance is commonly denoted \(\sigma^2\), this also explains the choice of notation in the SVD). Note that in such applications, one is almost always in the regime \(m\gg n\) so that there will be \(n\) singular values \(\sigma_1,…,\sigma_n\). Each principal axis together with its corresponding singular value is said to be a principal component (PC) of the data \(X\), hence the name. The “analysis” part comes from analyzing how the total variance \(\sum_{i=1}^n\sigma_i^2/m\) of the data \(X\) is explained by each PC, specifically the fraction of it explained by the \(j^{\text{th}}\) PC is \(\sigma_j^2/\sum_{i=1}^n\sigma_i^2\). If the singular values are ordered \(\sigma_1\geq…\geq\sigma_n\), then PCA also provides a lossy data compression algorithm, specifically, in analogy to the low-rank approximation application of SVD, one can pick some \(r< n\) and project all \(m\) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\) onto the \(r\)-dimensional subspace spanned by the first \(r\) principal axes associated to the first \(r\) singular values \(\sigma_1,…,\sigma_r\).

If the data matrix contained the feature vectors as its rows instead, \(X=(\mathbf x_1,…,\mathbf x_m)^{\dagger}\), then repeat the above discussion with the roles of \(XX^{\dagger}\) and \(X^{\dagger}X\) (and of \(U_2\) and \(U_1\)) interchanged.
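
A minimal PCA sketch along the above lines (assuming numpy, with synthetic data whose variance is deliberately concentrated along one feature direction):

```python
import numpy as np
rng = np.random.default_rng(3)

# m = 500 feature vectors in R^3 stored as columns, with one dominant direction
n, m = 3, 500
X = np.array([[3.0], [0.5], [0.2]]) * rng.standard_normal((n, m))

Xc = X - X.mean(axis=1, keepdims=True)     # mean-subtract each feature
U, s, Vh = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / (s**2).sum()            # fraction of total variance per principal component
print(explained)                           # the first PC dominates

r = 1
scores = U[:, :r].T @ Xc                   # project the data onto the first r principal axes
print(scores.shape)                        # (1, 500): a lossy 1-dimensional compression
```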

    Posted in Blog | Leave a comment

    Hartree-Fock Method

    Problem: The Hamiltonian of an atom may be written in the perturbative form:

    \[H=H_0+V\]

    what are \(H_0\) and \(V\)?

    Solution: If the atom has \(N\) electrons and atomic number \(Z\) (if neutral then \(Z=N\) but here one can also allow the possibility of ions), then the state space of the atom is considered to be \(\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\), i.e. only the \(N\) electrons are considered part of the system, not the nucleus; rather, the nucleus plays an external role, providing the background Coulomb potential in which the electrons move. Thus, \(H_0\) is given by the one-body operator:

    \[H_0=\sum_{i=1}^N\frac{|\mathbf P_i|^2}{2m}+V_{\text{ext}}(\mathbf X_i)\]

    where \(V_{\text{ext}}(\mathbf x)=-\frac{Z\alpha\hbar c}{|\mathbf x|}\) and the perturbative interaction potential is given by:

    \[V=\sum_{1\leq i<j\leq N}\frac{\alpha\hbar c}{|\mathbf X_i-\mathbf X_j|}\]

    Problem: What assumptions are implicitly going into the form of \(H\) above?

    Solution: Complete disregard of special relativity; specifically, no fine/hyperfine structure effects, and also using the Coulomb potential which is strictly only valid for electrostatics (with relativistic corrections assumed to be small).

    Problem: The goal as always in QM is to find the energies \(E\) and corresponding eigenstates of \(H\) in \(\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\). However, for \(N\geq 2\), no exact such \(H\)-eigenstates are known, due to the presence of the repulsive electron-electron interactions in \(V\); if for a moment one assumes \(V=0\), then the exact \(H\)-eigenstates and energies are again known…what are they?

    Solution: The ground state consists of occupying (also called filling) the first \(N\) available single-particle eigenstates of the hydrogenic \(Z\)-atom (antisymmetrized appropriately) as per Pauli exclusion; this of course is analogous to the usual Fermi sea picture where \(V_{\text{ext}}=0\) and \(H_0\)-eigenstates are just plane waves \(|\mathbf k\rangle\). Excited states are then analogous to excitations of the Fermi sea.

    Problem: What does it actually mean to say that e.g. a certain atom like \(\text{Cr}\) has ground state “electron configuration” \(1s^22s^22p^63s^23p^64s^13d^5\)?

    Solution: On the one hand, electron configuration is just a notation for a particular ket in what is almost an occupation number representation for the single-electron basis \(\{|n\ell m_{\ell}m_s\rangle\}\), e.g.

    \[1s^22s^1=|N_{|100\uparrow\rangle}=1, N_{|100\downarrow\rangle}=1,N_{|200\uparrow\rangle}=1\rangle\]

though of course the last occupation number might also have been \(N_{|200\downarrow\rangle}=1\); thus, electron configuration is a coarse-graining of the occupation number representation over the quantum numbers \(m_{\ell},m_s\). I would conjecture then that the ground state electron configuration assigned to a given atom is the one which has the largest overlap with its true ground state in the presence of the Coulomb repulsion \(V\) (and even including all the relativistic effects). This can be formalized in terms of the Hartree-Fock ansatz explained later.

    Problem: As a warmup, show how first-order time-independent perturbation theory may be applied to estimate the ground state energy of a helium atom. How does it compare with the experimental result?

Solution: If the Coulomb repulsion between the two electrons in the helium atom is assumed to be “weak”, then \(V=\frac{\alpha\hbar c}{|\mathbf X_1-\mathbf X_2|}\) may be treated as a perturbation of \(H_0\). The ground state \(|\Psi\rangle\) of \(H_0\) has position-space wavefunction:

    \[\langle\mathbf x_1,\mathbf x_2|\Psi\rangle=\psi_{100}(\mathbf x_1)\psi_{100}(\mathbf x_2)\otimes\frac{|\uparrow\downarrow\rangle-|\downarrow\uparrow\rangle}{\sqrt 2}\]

    where \(\psi_{100}(r)=\sqrt{\frac{Z^3}{\pi a_0^3}}e^{-Zr/a_0}\). Accordingly, because \(V\) is blind to the spin degrees of freedom, \(1^{\text{st}}\)-order perturbation theory asserts that the \(1^{\text{st}}\)-order correction to the non-interacting ground state energy \(-(Z\alpha)^2mc^2\approx -109\text{ eV}\) is given by:

    \[\alpha\hbar c\int d^3\mathbf x_1 d^3\mathbf x_2\frac{|\psi_{100}(\mathbf x_1)|^2|\psi_{100}(\mathbf x_2)|^2}{|\mathbf x_1-\mathbf x_2|}\]

    \[=\alpha\hbar c\left(\frac{Z^3}{\pi a_0^3}\right)^2\int d^3\mathbf x_1 e^{-2Zr_1/a_0}\int d^3\mathbf x_2\frac{e^{-2Zr_2/a_0}}{|\mathbf x_1-\mathbf x_2|}\]

    First, to evaluate the inner integral, notice that on symmetry grounds one can take \(\mathbf x_1=r_1\hat{\mathbf z}\) without loss of generality, thereby aligning it with the usual spherical coordinates to obtain:

    \[\int d^3\mathbf x_2\frac{e^{-2Zr_2/a_0}}{|\mathbf x_1-\mathbf x_2|}=2\pi\int_0^{\infty}dr_2 r_2^2e^{-2Zr_2/a_0}\int_{-1}^1\frac{d\cos\theta}{\sqrt{r_1^2+r_2^2-2r_1r_2\cos\theta}}\]

    \[=2\pi\int_0^{\infty}dr_2 r_2^2e^{-2Zr_2/a_0}\frac{r_1+r_2-|r_1-r_2|}{r_1r_2}\]

    \[=2\pi\int_0^{r_1}dr_2 r_2^2e^{-2Zr_2/a_0}\frac{2}{r_1}+2\pi\int_{r_1}^{\infty}dr_2 r_2^2e^{-2Zr_2/a_0}\frac{2}{r_2}\]

    \[= \pi \left( \frac{a_0^3}{Z^3 r_1} – \left( \frac{a_0^2}{Z^2} + \frac{a_0^3}{Z^3 r_1} \right) e^{-2Zr_1/a_0} \right)\]

    using integration by parts. Plugging this result back into the main integral and this time doing the \(\int_0^{\infty}dr_1\) integrals quickly using the \(\Gamma\) function, one eventually ends up with a positive \(1^{\text{st}}\)-order energy correction \(5Z\alpha^2mc^2/8\) (of course it should be positive given that the Coulomb repulsion \(V\) is positive-definite). The total ground state energy is therefore:

    \[(-Z^2+5Z/8)\alpha^2mc^2\approx -74.8\text{ eV}\]

    which for \(Z=2\) is close to the actual helium ground state energy \(\approx -79\text{ eV}\).

    Problem: Show that a variational ansatz:

\[\Psi(\mathbf x_1,\mathbf x_2|Z_{\text{eff}})=\frac{Z_{\text{eff}}^3}{\pi a_0^3}e^{-Z_{\text{eff}}(r_1+r_2)/a_0}\]

    with the effective nuclear charge \(Z_{\text{eff}}\) the corresponding variational parameter to be optimized, produces a better estimate of the ground state energy of a helium atom. What is the physical interpretation of \(Z_{\text{eff}}\)?

    Solution: The variational ansatz must do at least as well as \(1^{\text{st}}\)-order perturbation theory because setting \(Z_{\text{eff}}:=Z\) would reproduce the perturbative result. Thus, the ground state energy to be minimized is:

    \[E_0(Z_{\text{eff}})=\int d^3\mathbf x_1 d^3\mathbf x_2\Psi^{\dagger}(\mathbf x_1,\mathbf x_2|Z_{\text{eff}})H_Z\Psi(\mathbf x_1,\mathbf x_2|Z_{\text{eff}})\]

    where \(H=H_Z\) emphasizes that the original helium Hamiltonian \(H\) is being used with \(Z=2\). Of course though, it can be rewritten as:

    \[H_Z=H_{Z_{\text{eff}}}+(Z_{\text{eff}}-Z)\alpha\hbar c\left(\frac{1}{|\mathbf X_1|}+\frac{1}{|\mathbf X_2|}\right)\]

    So:

    \[E_0(Z_{\text{eff}})=(-Z_{\text{eff}}^2+5Z_{\text{eff}}/8)\alpha^2mc^2+2(Z_{\text{eff}}-Z)\alpha\hbar c\int d^3\mathbf x\frac{|\psi_{100}(r|Z_{\text{eff}})|^2}{r}\]

    \[=(-Z_{\text{eff}}^2+5Z_{\text{eff}}/8)\alpha^2mc^2+2(Z_{\text{eff}}-Z)\alpha\hbar c\frac{Z_{\text{eff}}^3}{\pi a_0^3}\int_0^{\infty}dr 4\pi r^2\frac{e^{-2Z_{\text{eff}}r/a_0}}{r}\]

    \[=(-Z_{\text{eff}}^2+5Z_{\text{eff}}/8)\alpha^2mc^2+2Z_{\text{eff}}(Z_{\text{eff}}-Z)\alpha^2mc^2\]

\[=\left(Z_{\text{eff}}^2+\left(\frac{5}{8}-2Z\right)Z_{\text{eff}}\right)\alpha^2mc^2\]

    which is clearly minimized when \(Z_{\text{eff}}=Z-\frac{5}{16}=1.6875\). The phenomenon \(Z_{\text{eff}}<Z\) is a manifestation of screening.
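
The small optimization above is easily reproduced symbolically (a sketch assuming sympy; energies are expressed in units of \(\alpha^2mc^2\approx 27.2\text{ eV}\)):

```python
import sympy as sp

Z = 2                                    # helium
Zeff = sp.symbols('Z_eff', positive=True)
hartree_eV = 27.211                      # alpha^2 m c^2 in eV

E = Zeff**2 + (sp.Rational(5, 8) - 2*Z) * Zeff    # variational energy from above

Zeff_star = sp.solve(sp.diff(E, Zeff), Zeff)[0]
print(Zeff_star)                                      # 27/16 = Z - 5/16 = 1.6875
print(float(E.subs(Zeff, Zeff_star)) * hartree_eV)    # about -77.5 eV (variational)
print((-Z**2 + 5 * Z / 8) * hartree_eV)               # about -74.8 eV (1st-order perturbative)
```

As expected, the variational estimate of roughly \(-77.5\text{ eV}\) lies between the perturbative \(-74.8\text{ eV}\) and the experimental \(\approx -79\text{ eV}\).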

    Problem: What does the Aufbau principle assert?

    Solution: Empirically (though with several exceptions) the occupation of single-electron states in the ground states of the atoms (as per their electron configurations) proceeds like:

    which of course was subsequently built into the topology of the periodic table.

    Problem: Previously, both perturbation theory and the variational method were used to estimate the ground state energy of the helium atom. One can now ask: what about the \(1^{\text{st}}\) excited state?

    Solution: From the perspective of \(H_0\), the \(1^{\text{st}}\) excited manifold has \(16\)-fold degeneracy; specifically, by applying the \(2\)-antisymmetrizer \(\sqrt{2}\mathcal A_2\) to any of the following, one obtains an \(H_0\)-eigenstate in the \(1^{\text{st}}\) excited subspace \(\text{ker}(H_0-E_21)\cap\bigwedge^2L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\):

    (aka Slater determinants). Even though this of course provides a basis for the \(16\)-dimensional space \(\text{ker}(H_0-E_21)\cap\bigwedge^2L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\), some of these Slater determinants (e.g. \(\sqrt{2}\mathcal A_2|100\uparrow,200\downarrow\rangle\)) are not eigenstates of \(V\). Rather, because \([V,\mathbf L^2]=[V,L_z]=0\) is isotropic and \([V,\mathbf S^2]=[V,S_z]=0\) is furthermore spin-blind, where \(\mathbf L=\mathbf L_1+\mathbf L_2\) and \(\mathbf S=\mathbf S_1+\mathbf S_2\), it makes sense to instead work in an unentangled basis, specifically where each basis state factorizes into the product of a spatial wavefunction and a spinorial factor. This is achieved by the usual Clebsch-Gordan decomposition \((1/2)^{\otimes 2}=0\oplus 1\), and means that one should instead hybridize some of the Slater determinants:

so that one is still left with \(16\) states at the end of the day, but now all of them are also simultaneous \((\mathbf S^2,S_z)\)-eigenstates. Now degenerate perturbation theory is very easy. Given \(|\Psi\rangle\in\biggr\{\frac{|100,200\rangle+|200,100\rangle}{\sqrt{2}}|0,0\rangle,\frac{|100,200\rangle-|200,100\rangle}{\sqrt{2}}|1,-1\rangle,…\biggr\}\) where the quantum numbers for \(\mathbf S^2,S_z\) are being used, and because \(V\) is isotropic (so the \(m_{\ell}=1,0,-1\) states of the \(2p\) orbital remain degenerate among themselves), one expects that to \(1^{\text{st}}\)-order the \(1^{\text{st}}\)-excited manifold’s \(16\)-fold degeneracy will split into \(4\) distinct energy levels, namely the \(1s2s\) singlet and triplet and the \(1s2p\) singlet and triplet (with degeneracies \(1,3,3,9\) respectively). Specifically:

    \[\langle\Psi|V|\Psi\rangle=J\pm K\]

where \(J\) is called the direct energy and \(K\) the exchange energy (both are computed between the \(2\) particular \(|n\ell m_{\ell}\rangle\) states involved, and the \(+\) sign is taken if the spatial wavefunction is symmetric, while the \(-\) sign is taken if it’s antisymmetric). For example, for \(|\Psi\rangle=\frac{|100,211\rangle-|211,100\rangle}{\sqrt{2}}|1,1\rangle\), take the \(-\) sign with:

    \[J_{100,211}=\langle 100,211|V|100,211\rangle=\alpha\hbar c\int d^3\mathbf x_1 d^3\mathbf x_2\frac{|\psi_{100}(\mathbf x_1)\psi_{211}(\mathbf x_2)|^2}{|\mathbf x_1-\mathbf x_2|}=\frac{59}{243}Z\alpha^2mc^2\]

    \[K_{100,211}=\langle 100,211|V|211,100\rangle=\alpha\hbar c\int d^3\mathbf x_1 d^3\mathbf x_2\frac{\psi^{\dagger}_{100}(\mathbf x_1)\psi^{\dagger}_{211}(\mathbf x_2)\psi_{211}(\mathbf x_1)\psi_{100}(\mathbf x_2)}{|\mathbf x_1-\mathbf x_2|}=\frac{112}{6561}Z\alpha^2mc^2\]

And indeed one can make an energy-splitting diagram that gives credence to the Aufbau rule (of course, the energies of all \(H_0\)-eigenstates increase with the electronic Coulomb repulsion \(V\)):

    Note if one had instead used the Slater basis, ultimately one would still arrive at the same answer through having to find eigenvalues of block-diagonal matrices with blocks that look like \(\begin{pmatrix} J&-K\\-K&J\end{pmatrix}\).

(aside: a key selection rule \(\Delta s=0\) for \(\text E1\) transitions combined with the fact that \(s=0\) for the ground state means that starting in any \(s=1\) excited state would seem to be metastable…and indeed this is a forbidden transition, leading to a lifetime around \(\tau\approx 2.2\text{ hours}\) for the lowest \(s=1\) state. Historically, the \(s=0\) states were dubbed “parahelium” and the \(s=1\) states “orthohelium”, originally believed to be two distinct species of helium).

    Problem: After all the perturbation theory math in the above problem, what’s the key conceptual takeaway?

    Solution: For unentangled states which may be factorized as the product of a spatial wavefunction and a spinorial component, when the spatial wavefunction is antisymmetric, the corresponding fermions are more likely to be found far from each other, and consequently the direct energy would be suppressed by this destructive quantum statistical interference (hence \(J-K\)). Conversely, symmetric spatial wavefunctions are enhanced by a sort of constructive quantum statistical interference (hence \(J+K\)).

    Of course, one could also apply a variational approach just like was done for the ground state. This would confirm that electron screening is the key physics at play.

    Problem: Show that for the Hartree product ansatz:

    \[\Psi(\mathbf x_1,…,\mathbf x_N)=\psi_1(\mathbf x_1)…\psi_N(\mathbf x_N)\]

    where \(\psi_1,…,\psi_N\) are \(N\) normalized single-particle wavefunctions, the expected energy \(\langle\Psi|H|\Psi\rangle\) with respect to the usual atomic Hamiltonian \(H\) is given by:

    \[\langle\Psi|H|\Psi\rangle=-\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x\psi^{\dagger}_i(\mathbf x)\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_i(\mathbf x)-Z\alpha\hbar c\sum_{i=1}^N\int d^3\mathbf x\frac{|\psi_i(\mathbf x)|^2}{|\mathbf x|}\]

    \[+\alpha\hbar c\sum_{1\leq i<j\leq N}\int d^3\mathbf x d^3\mathbf x’\frac{|\psi_i(\mathbf x)\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

(the last term is a sum over all pairs of direct energies \(J_{ij}\); there are no exchange energies \(K_{ij}\) yet because the Hartree ansatz \(|\Psi\rangle\) does not live in \(\bigwedge^N L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\), indeed the ansatz doesn’t have any knowledge of the \(\mathbf C^2\) spinor! These deficiencies will be addressed later by the Hartree-Fock ansatz).

    Solution: The key point about the Hartree product ansatz is that it’s separable, so just plug and separate integrals and remember that the \(\psi_i\) are normalized.

    Problem: One would like to estimate the ground state wavefunction \(\Psi(\mathbf x_1,…,\mathbf x_N)\) of the \(N\)-electron atom within the Hartree framework (essentially mean-field theory). How can this be achieved?

    Solution: By constrained minimization of the energy functional of \(\psi_1,…,\psi_N\):

    \[\langle\Psi|H|\Psi\rangle-\sum_{i=1}^NE_i\left(\int d^3\mathbf x|\psi_i(\mathbf x)|^2-1\right)\]

    where \(E_1,…,E_N\) are \(N\) Lagrange multipliers enforcing normalization constraints on the \(\psi_i\). Setting the functional derivative with respect to \(\psi^{\dagger}_i\) to zero is the quickest way to get the Schrodinger-equation like answer for each of the \(N\) single-particle wavefunctions \(\psi_i\):

    \[\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2-\frac{Z\alpha\hbar c}{|\mathbf x|}+V_i(\mathbf x)\right)\psi_i(\mathbf x)=E_i\psi_i(\mathbf x)\]

    where:

    \[V_i(\mathbf x)=\alpha\hbar c\sum_{j\neq i}\int d^3\mathbf x’\frac{|\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

    is the Coulomb electrostatic potential energy experienced by the \(i^{\text{th}}\) electron due to the other \(N-1\) electrons (cf. the classical Coulomb electrostatic potential \(\phi(\mathbf x)=\frac{1}{4\pi\varepsilon_0}\int d^3\mathbf x’\frac{\rho(\mathbf x’)}{|\mathbf x-\mathbf x’|}\) with the charge density \(\rho(\mathbf x’)=-e|\psi(\mathbf x’)|^2\)). This system of \(N\) coupled nonlinear integrodifferential PDEs for \(\psi_1,…,\psi_N\) are also called the Hartree equations.

    Problem: How are the Hartree equations solved numerically?

    Solution: One can begin with a crude guess for the ground state \(\psi_i\) (such as the usual “Fermi sea” picture with hydrogenic atomic orbitals) and for each \(i=1,…,N\) obtain the interaction potential \(V_i(\mathbf x)\). Then, numerically solve the resulting \(i=1,…,N\) Schrodinger equations to obtain new \(\psi_i\), and repeat this process until hopefully it converges (this is also called the self-consistent field method).

Problem: Once these SCF iterations converge (hopefully) to some optimal set of \(N\) single-particle wavefunctions \(\psi^{\star}_i\), the corresponding Hartree estimate of the ground state wavefunction is of course \(\Psi^{\star}(\mathbf x_1,…,\mathbf x_N)=\psi^{\star}_1(\mathbf x_1)…\psi^{\star}_N(\mathbf x_N)\), but what about its corresponding ground state energy?

    Solution: Although on the one hand this is just \(\langle\Psi^{\star}|H|\Psi^{\star}\rangle\), one can actually also write this in terms of the optimal Lagrange multipliers \(E^{\star}_i\), specifically because from the Hartree equations one can isolate:

    \[E^{\star}_i=\int d^3\mathbf x\psi^{\star\dagger}_i(\mathbf x)\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2-\frac{Z\alpha\hbar c}{|\mathbf x|}+V^{\star}_i(\mathbf x)\right)\psi^{\star}_i(\mathbf x)\]

    So summing \(\sum_{i=1}^NE^{\star}_i\) would almost give \(\langle\Psi^{\star}|H|\Psi^{\star}\rangle\) except that it double-counts the direct energy term \(\sum_{i=1}^N\sum_{j\neq i}=2\sum_{1\leq i<j\leq N}\) so overall:

    \[\langle\Psi^{\star}|H|\Psi^{\star}\rangle=\sum_{i=1}^NE^{\star}_i-\alpha\hbar c\sum_{1\leq i<j\leq N}\int d^3\mathbf x d^3\mathbf x’\frac{|\psi^{\star}_i(\mathbf x)\psi^{\star}_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

    (more precisely, since this is a variational estimate, the true ground state energy must be less than this).

    Problem: Qualitatively, why is the ground state electron configuration of \(Z=19\) potassium \(\text K\) given by \(1s^22s^22p^63s^23p^64s^1\) rather than \(1s^22s^22p^63s^23p^63d^1\)?

Solution: Due to screening, a \(3d\) electron would have higher energy than a \(4s\) electron, so as far as the ground state is concerned, the latter is preferable. Hence, the Aufbau principle says \(4s\) is occupied before \(3d\). This can sort of be seen in the Hartree equations where, approximating \(V_i=V_i(r)\) as isotropic, one can define an effective nuclear charge \(Z_{\text{eff}}(r)\) such that the effective potential experienced by the \(i^{\text{th}}\) electron looks like:

    \[-\frac{Z\alpha\hbar c}{r}+V_i(r)=-\frac{Z_{\text{eff}}(r)\alpha\hbar c}{r}\]

    where \(\lim_{r\to 0}Z_{\text{eff}}(r)=Z=19\) and \(\lim_{r\to \infty}Z_{\text{eff}}(r)=1\)

    Problem: The Hartree product ansatz \(\Psi(\mathbf x_1,…,\mathbf x_N)\) was purely a position space many-body wavefunction that did not involve any mention of spin (if it did, it would have canceled out in \(\langle\Psi|H|\Psi\rangle\) anyways because \(H\) is spin-blind). Even so, it didn’t live in \(\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\). These deficiencies can be remedied via the Hartree-Fock ansatz, which is simply an antisymmetrization of the Hartree ansatz:

    \[|\Psi\rangle=\sqrt{N!}\mathcal A_N|1\rangle\otimes…\otimes|N\rangle\]

    where each single-particle spin-orbital \(\langle\mathbf x|i\rangle=\psi_i(\mathbf x)|\sigma\rangle\) with \(\sigma\in\{\uparrow,\downarrow\}\). It is often also written as a Slater determinant of the \(N\) single-particle spin-orbitals, but here the assignment of which electron occupies which single-particle spin-orbital \(|i\rangle\) is implicit in the ordering of the tensor product. Repeat everything from the Hartree derivation to derive the analogous Hartree-Fock equations.

    Solution: One can show that, compared to the Hartree state, the expected energy of the Hartree-Fock state contains an additional (negative) exchange energy contribution for every pair of electrons with aligned spins \(\sigma_i=\sigma_j\):

    \[\langle\Psi|H|\Psi\rangle=-\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x\psi^{\dagger}_i(\mathbf x)\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_i(\mathbf x)-Z\alpha\hbar c\sum_{i=1}^N\int d^3\mathbf x\frac{|\psi_i(\mathbf x)|^2}{|\mathbf x|}\]

    \[+\alpha\hbar c\sum_{1\leq i<j\leq N}\int d^3\mathbf x d^3\mathbf x’\frac{|\psi_i(\mathbf x)\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}-\alpha\hbar c\sum_{1\leq i<j\leq N}\delta_{\sigma_i\sigma_j}\int d^3\mathbf x d^3\mathbf x’\frac{\psi^{\dagger}_i(\mathbf x)\psi^{\dagger}_j(\mathbf x’)\psi_i(\mathbf x’)\psi_j(\mathbf x)}{|\mathbf x-\mathbf x’|}\]

    (aside: the fact that the exchange energy lowers the total energy when spins are aligned means that the spins will prefer to align, and this is the essence of one of Hund’s rules).

    Doing the constrained minimization again now yields the Hartree-Fock equations which are basically the Hartree equations with an additional non-local exchange potential:

    \[\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2-\frac{Z\alpha\hbar c}{|\mathbf x|}+V_{\text{int}}(\mathbf x)\right)\psi_i(\mathbf x)-\int d^3\mathbf x’ V_{\text{ex},i}(\mathbf x,\mathbf x’)\psi_i(\mathbf x’)=E_i\psi_i(\mathbf x)\]

    but note that instead of \(\sum_{j\neq i}\), there is an apparent “self-energy” in the sum \(\sum_{j=1}^N\):

    \[V_{\text{int}}(\mathbf x)=\alpha\hbar c\sum_{j=1}^N\int d^3\mathbf x’\frac{|\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

    (so \(V_{\text{int}}\) is now the same for all electrons \(i=1,…,N\)). The self-energy is artificial though, and cancelled by a corresponding term from the exchange potential for each electron \(i\):

\[V_{\text{ex},i}(\mathbf x,\mathbf x’)=\alpha\hbar c\sum_{j=1}^N\delta_{\sigma_i\sigma_j}\frac{\psi^{\dagger}_j(\mathbf x’)\psi_j(\mathbf x)}{|\mathbf x-\mathbf x’|}\]
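
To make the SCF machinery concrete, here is a minimal numerical sketch (an illustration under stated assumptions, not the original post’s code): a restricted Hartree-Fock calculation for helium in a small, ad hoc even-tempered basis of s-type Gaussians centered on the nucleus, using the standard closed-form integrals for concentric s-Gaussians and atomic units (so energies come out in Hartree, i.e. units of \(\alpha^2mc^2\)). With a reasonable basis the converged energy should land in the vicinity of the helium Hartree-Fock limit of about \(-2.86\) Hartree \(\approx -77.9\text{ eV}\), though the precise number depends on the assumed exponents below.

```python
import numpy as np
from scipy.linalg import eigh

# Ad hoc even-tempered s-type Gaussian exponents for helium (an assumption, not an optimized basis)
alphas = np.array([0.25, 1.0, 4.0, 16.0, 64.0])
Z = 2.0
nb = len(alphas)

# One-electron integrals for s-Gaussians exp(-a r^2) sharing a common center (standard closed forms)
A = alphas[:, None] + alphas[None, :]
S = (np.pi / A) ** 1.5                                        # overlap
T = 3.0 * np.outer(alphas, alphas) / A * (np.pi / A) ** 1.5   # kinetic energy
V = -2.0 * np.pi * Z / A                                      # nuclear attraction
Hcore = T + V

# Two-electron integrals (ab|cd) for concentric s-Gaussians
ERI = np.zeros((nb, nb, nb, nb))
for a in range(nb):
    for b in range(nb):
        for c in range(nb):
            for d in range(nb):
                p, q = alphas[a] + alphas[b], alphas[c] + alphas[d]
                ERI[a, b, c, d] = 2.0 * np.pi**2.5 / (p * q * np.sqrt(p + q))

# Self-consistent field loop: build the Fock matrix from the current density, rediagonalize, repeat
P_dm = np.zeros((nb, nb))          # density matrix (initial guess: zero)
E_old = 0.0
for iteration in range(100):
    J = np.einsum('ls,mnsl->mn', P_dm, ERI)        # direct (Coulomb) term
    K = np.einsum('ls,mlsn->mn', P_dm, ERI)        # exchange term
    F = Hcore + J - 0.5 * K
    eps, C = eigh(F, S)                            # generalized eigenproblem F C = S C eps
    c = C[:, 0]                                    # doubly occupy the lowest orbital
    P_dm = 2.0 * np.outer(c, c)
    E = 0.5 * np.sum(P_dm * (Hcore + F))           # electronic energy
    if abs(E - E_old) < 1e-8:
        break
    E_old = E

print(f"converged after {iteration} iterations: E_HF = {E:.5f} Hartree")
```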

    Problem: What is the key limitation of the Hartree-Fock method?

Solution: The fact that the Hartree-Fock ansatz only uses a single Slater determinant. Of course, the Hartree ansatz suffers from the same issue; both basically assume that \(|\Psi\rangle\) is a non-interacting many-body state. In reality, the true ground state will of course be a superposition of Slater determinants, or in the \(2^{\text{nd}}\)-quantization view, a superposition of Fock states in the \(N\)-particle sector of the fermionic Fock space, and in that case the familiar language of “filling/occupying” shells/subshells becomes unsuitable.

    Posted in Blog | Leave a comment