PyTorch Fundamentals (Part \(2\))

Problem: Do an end-to-end walkthrough of the PyTorch machine learning workflow using the most basic univariate linear regression example. In particular, generate some linear data over a normalized feature space (whose slope \(w\) and intercept \(b\) would in practice be a priori unknown), split that linear data into training and testing subsets (no cross-validation dataset needed for this simple example), define the linear layer class, instantiate a model object of the class, and starting from random values of \(w\) and \(b\), use stochastic gradient descent with learning rate \(\alpha=0.01\) to minimize the training cost function \(C_{\text{train}}(w,b)\) based on an \(L^1\) loss. Iterate SGD for \(300\) epochs, and for every \(20\) epochs, record the current value of \(C_{\text{train}}(w,b)\) and the current value of \(C_{\text{test}}(w,b)\). Plot these cost function curves as a function of the epoch number \(0, 20, 40,…\). Save the final state dictionary of the model’s learned parameters \(w,b\) post-training, and load it back onto a new instance of the model class.

Solution:

pytorch_workflow
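
(The linked notebook presumably contains the full worked solution; what follows is a minimal sketch of the same workflow in PyTorch, with illustrative choices of the “unknown” \(w,b\), the data size, and the train/test split fraction.)

    import torch
    from torch import nn
    import matplotlib.pyplot as plt

    # Generate linear data y = w*x + b over a normalized feature space [0, 1)
    w_true, b_true = 0.7, 0.3                        # a priori "unknown" in practice
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)    # shape (50, 1)
    y = w_true * X + b_true

    # Train/test split (no cross-validation set needed for this simple example)
    split = int(0.8 * len(X))
    X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

    # Define the linear layer class and instantiate a model with random w, b
    class LinearRegressionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(in_features=1, out_features=1)
        def forward(self, x):
            return self.linear(x)

    torch.manual_seed(42)
    model = LinearRegressionModel()

    # L1 loss and stochastic gradient descent with learning rate alpha = 0.01
    loss_fn = nn.L1Loss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    epochs, train_costs, test_costs = [], [], []
    for epoch in range(300):
        model.train()
        loss = loss_fn(model(X_train), y_train)      # C_train(w, b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 20 == 0:                          # record every 20 epochs
            model.eval()
            with torch.inference_mode():
                test_loss = loss_fn(model(X_test), y_test)   # C_test(w, b)
            epochs.append(epoch)
            train_costs.append(loss.item())
            test_costs.append(test_loss.item())

    # Plot the cost curves as a function of the epoch number 0, 20, 40, ...
    plt.plot(epochs, train_costs, label="C_train")
    plt.plot(epochs, test_costs, label="C_test")
    plt.xlabel("epoch"); plt.legend(); plt.show()

    # Save the final state dict of learned w, b and load it onto a new instance
    torch.save(model.state_dict(), "linear_model.pth")
    model_reloaded = LinearRegressionModel()
    model_reloaded.load_state_dict(torch.load("linear_model.pth"))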

PyTorch Fundamentals (Part \(1\))

Problem: Illustrate some of the basic fundamentals involved in using the PyTorch deep learning library. In particular, discuss the attributes of PyTorch tensors (e.g. dtype, CPU/GPU devices, etc.), how to generate random PyTorch tensors with/without seeding, and operations that can be performed on and between PyTorch tensors.

Solution:

pytorch_fundamentals
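
(Again, the linked notebook is the full solution; a minimal sketch of the basics it covers:)

    import torch

    # Tensor attributes: dtype, shape, and CPU/GPU device
    x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
    print(x.dtype, x.shape, x.device)        # torch.float32, torch.Size([2, 2]), cpu
    if torch.cuda.is_available():
        x = x.to("cuda")                     # move the tensor onto the GPU if present

    # Random tensors, without and with seeding
    a = torch.rand(3, 4)                     # unseeded: different on every run
    torch.manual_seed(0)
    b = torch.rand(3, 4)                     # seeded: reproducible
    torch.manual_seed(0)
    c = torch.rand(3, 4)
    print(torch.equal(b, c))                 # True

    # Operations on and between tensors
    print(x + 10, x * x)                     # elementwise arithmetic
    print(x @ x.T)                           # matrix multiplication
    print(x.mean(), x.sum(), x.argmax())     # reductions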

Tokenization & Transformers

Problem: Let \(|\mathcal V|,N_c\in\mathbf Z^+\) be positive integers (where \(|\mathcal V|\) is the cardinality of an arbitrary set \(\mathcal V\) called the vocabulary and \(N_c\) will come to be seen as the number of codebooks), and let \(\mathcal T\) be a (finite or infinite) set whose elements are called tokens. What does it mean for a function \(\mathbf i\) to be a \((|\mathcal V|,N_c)\)-tokenizer on the token space \(\mathcal T\)? Give some examples of tokenizers.

Solution: It means that \(\mathbf i:\mathcal T\to\{1,…,|\mathcal V|\}^{N_c}\) quantizes each abstract token \(\tau\in\mathcal T\) as some concrete \(N_c\)-tuple of integers \(\mathbf i(\tau):=(i_1(\tau),…,i_{N_c}(\tau))\) where each of these \(N_c\) token IDs \(i_1(\tau),…,i_{N_c}(\tau)\in\mathbf N\) associated to the token \(\tau\in\mathcal T\) ranges from \(1\) to the vocabulary size \(|\mathcal V|\).

  • For \(\mathcal T:=\{a, b, …, z, @, \&, ., uh, th, …\}\) the set of natural language tokens (could be finite or infinite depending on how one defines \(\mathcal T\)), tokenization \(\mathbf i\) is typically done via a single (\(N_c=1\)) lookup inside a dictionary (“codebook/vocabulary”) of size \(|\mathcal V|\sim 10^5\) (indeed, one can simply take \(\mathcal V:=\mathcal T\)).
  • For \(\mathcal T\) the (infinite) set of \(20\text{ ms}\) Nyquist-downsampled audio waveform tokens, tokenization \(\mathbf i\) might be implemented by first passing an audio token \(\tau\in\mathcal T\) into some CNN encoder \(\mathbf e_{\text{CNN}}:\mathcal T\to\mathbf R^{d}\) (for some latent dimension \(d\)), thereby obtaining a hidden latent vector \(\mathbf e_{\text{CNN}}(\tau)\in\mathbf R^{d}\), followed by residual vector quantization \(\text{RVQ}:\mathbf R^{d}\to\{1,…,|\mathcal V|\}^{N_c}\) of \(\mathbf e_{\text{CNN}}(\tau)\) through a sequence of \(N_c\) ordered codebooks \(\mathcal V_1,…,\mathcal V_{N_c}\) each containing the same number \(|\mathcal V|:=|\mathcal V_1|=…=|\mathcal V_{N_c}|\) of vectors:
    \[\text{RVQ}(\mathbf e):=(i_1,…,i_{N_c}),\qquad i_c:=\text{argmin}_{1\leq i\leq |\mathcal V|}|\mathbf r_{c-1}-\mathbf e_{i}^{(c)}|,\qquad\mathbf r_0:=\mathbf e,\quad\mathbf r_c:=\mathbf r_{c-1}-\mathbf e_{i_c}^{(c)}\]
    (one could also consider codebooks of different sizes, though conceptually that would simply change the range of the tokenizer to \(\mathbf i:\mathcal T\to\{1,…,|\mathcal V_1|\}\times…\times\{1,…,|\mathcal V_{N_c}|\}\)).
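
As a toy illustration of the definition (an assumed character-level example, not any standard tokenizer), a \((|\mathcal V|,N_c=1)\)-tokenizer can be realized as a single dictionary lookup:

    # A (|V|, N_c = 1)-tokenizer: one codebook lookup over a small character vocabulary
    vocab = list("abcdefghijklmnopqrstuvwxyz @&.")             # the vocabulary V
    token_to_id = {tok: i + 1 for i, tok in enumerate(vocab)}  # token IDs range over 1..|V|

    def tokenize(token: str) -> tuple:
        """Quantize an abstract token as its N_c = 1 tuple of token IDs."""
        return (token_to_id[token],)

    print(tokenize("a"), tokenize("@"))   # (1,), (28,)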

Problem: Explain the transformer architecture as first described in Attention Is All You Need purely from the inference perspective of a forward pass on a model which is already well-trained (i.e. by articulating this, one thus defines what the objective of one’s model is, informing the choice of cost function during training).

Solution: Note that the tokenization process described above is not part of the transformer, but rather just some pre-processing of natural language into a format suitable to be input into the transformer.

  1. (One-Hot Encoding) Take the sequence of token IDs from tokenization, and one-hot encode them with respect to a vocabulary of size \(d_V\). Note this step is parameter-free, i.e. for a fixed vocabulary, there’s nothing to be learned here.
  2. (Embedding) Take each one-hot encoded vector \(\hat{\mathbf e}\in\{0,1\}^{d_V}\), and multiply it by the transformer’s (learnable) embedding matrix \(W_E\in\mathbf R^{d_E\times d_V}\), thereby obtaining some embedding vector \(\hat{\mathbf e}\mapsto\mathbf x:=W_E\hat{\mathbf e}\in\mathbf R^{d_E}\) that, after pre-training the transformer, should ideally learn a latent space representation of the vocabulary in which semantically similar words are close together and certain directions in the embedding space convey certain concepts.
  3. (Positional Encoding) Add something to each embedding vector in a way that signals where in the input natural language prompt it appears (e.g. sinusoidal encoding in the original paper).
  4. (Single Self-Attention Head) For each positional embedding vector \(\mathbf x_i\in\mathbf R^{n_e}\) (writing \(n_e:=d_E\) for the embedding dimension), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\) and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it (called a skip connection as typical in ResNets) according to a convex linear combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context (a code sketch of one such head is given after this list), specifically:

\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]

where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are key and value matrices associated to the inputted context (filled with column vectors here rather than the ML convention of row vectors). This map that takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x’_i=\mathbf x_i+\Delta\mathbf x_i\) is called a head of self-attention. The \(1/\sqrt{n_{qk}}\) scaling in the softmax temperature is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) will have mean \(0\) and variance \(n_{qk}\), hence the need to normalize by \(\sqrt{n_{qk}}\) to ensure \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\).

  5. (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel for several heads of self-attention; each head will propose a displacement update to each of the \(N\) original embeddings \(\mathbf x_i\); add up all of them.

  6. (Multilayer Perceptron) Essentially a position-wise Linear, ReLU, Linear block applied to each embedding independently. It is hypothesized that facts are stored in this part of the transformer.

  7. (Layers) Alternate between the multi-headed self-attention blocks and MLP blocks; finally, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x’_N\) of the last token \(\tau_N\) in the context, by applying an unembedding matrix \(W_{\mathbf u}\) to obtain logits \(\mathbf u=W_{\mathbf u}\mathbf x’_N\) and running them through a softmax \(\text{softmax}(\mathbf u)\).
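
As referenced in step \(4\), here is a minimal sketch of a single self-attention head (with illustrative dimensions \(n_e=64\), \(n_{qk}=16\), \(N=10\), no causal masking, and the low-rank value factorization omitted); it follows the column-vector formula above but stores the embeddings as rows, per the usual ML convention:

    import torch

    n_e, n_qk, N = 64, 16, 10                 # embedding dim, query/key dim, context length
    torch.manual_seed(0)
    X = torch.randn(N, n_e)                   # rows are the positional embeddings x_1, ..., x_N
    W_q = torch.randn(n_qk, n_e) / n_e**0.5   # query weight matrix
    W_k = torch.randn(n_qk, n_e) / n_e**0.5   # key weight matrix
    W_v = torch.randn(n_e, n_e) / n_e**0.5    # value weight matrix

    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T            # (N, n_qk), (N, n_qk), (N, n_e)
    attn = torch.softmax(Q @ K.T / n_qk**0.5, dim=-1)    # row i = convex weights for x_i
    delta_X = attn @ V                                   # row i = update Delta x_i
    X_contextualized = X + delta_X                       # skip connection x_i -> x_i + Delta x_i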

Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).

Solution: Essentially, since an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token, by seeding it with some corpus of text \(T\), the LLM can sample a token \(\tau\) according to the probability distribution it generates for the next token, and append it: \(T\mapsto T+\tau\). Then, simply repeat this, pretending that \(T+\tau\) was the seed all along. This is how generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.
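
Schematically, the autoregressive loop is just the following sketch (the model object and its interface of returning one row of next-token logits per context position are assumptions made for illustration):

    import torch

    def generate(model, token_ids, n_new_tokens):
        """Repeatedly sample the next token from the model's predicted distribution
        and append it to the running context (token_ids is a 1D tensor of token IDs)."""
        for _ in range(n_new_tokens):
            logits = model(token_ids)[-1]                       # next-token logits (assumed interface)
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)   # sample ~ predicted distribution
            token_ids = torch.cat([token_ids, next_id])         # T -> T + tau, then repeat
        return token_ids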


JAX Fundamentals (Part \(1\))

JAX_tutorial

Monte Carlo Methods

Problem: Distinguish between Las Vegas methods and Monte Carlo methods.

Solution: Both are umbrella terms referring to broad classes of methods that draw (repeatedly) from (not necessarily i.i.d.) random variables to compute the value of some deterministic variable. Here, “compute” means “find that deterministic value with certainty” in the case of a Las Vegas method whereas “compute” means “(point) estimate” in the case of a Monte Carlo method. Thus, their key difference lies in how they trade off time \(t\) vs. accuracy \(\alpha\): a Las Vegas method has random runtime \(t\) but certain accuracy, whereas a Monte Carlo method can run within a fixed budget \(t\) but has random accuracy \(\alpha\).

Problem: Given a scalar continuous random variable with probability density function \(p(x)\), how can one sample faithfully from such an \(x\sim p(x)\)?

Solution: Fundamentally, a computer can (approximately) draw from a uniform random variable \(u\in [0,1]\) (using seed-based pseudorandom number generators for instance). In order to convert this \(u\to x\), the standard way is to use the cumulative distribution function of \(x\):

\[\int_{-\infty}^xdx’p(x’):=u\]

Then, as the probability density of \(u\) is uniform \(p(u)=1\), the probability density of \(x\) will be \(p(x)=\frac{du}{dx}p(u)\). As the CDF is a monotonically increasing bijection onto \([0,1]\), there always exists an \(x=x(u)\) for each \(u\in[0,1]\). Intuitively, this whole sampling scheme should make a lot of sense: regions where \(p(x)\) is large are exactly where the CDF rises steeply, so a uniform draw of \(u\) is proportionally more likely to land in them.
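
As a quick numerical check of this inverse-CDF trick, consider the exponential density \(p(x)=e^{-x}\) for \(x\geq 0\), whose CDF \(u=1-e^{-x}\) inverts in closed form to \(x=-\ln(1-u)\):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    u = rng.uniform(size=100_000)   # draws from the uniform density p(u) = 1 on [0, 1]
    x = -np.log(1.0 - u)            # invert the CDF u = 1 - exp(-x)

    print(x.mean(), x.var())        # both should be close to 1 for the Exp(1) distribution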

Problem: So, it seems like the above “inverse CDF trick” has solved the problem of sampling from an arbitrary random variable \(x\) (and it generalizes to the case of a higher-dimensional random vector \(\mathbf x\in\mathbf R^n\) by sampling conditionally). So how come in many practical applications this “sampling” problem can still be difficult?

Solution: In practice, one may not have access to \(p(\mathbf x)\) itself, but only the general shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) of it (e.g. in Bayesian inference). One might object that \(Z\) is known, namely \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\) can be written explicitly as an \(n\)-dimensional integral, so wherein lies the difficulty? The difficulty lies in evaluating the integral, which suffers from the curse of dimensionality if \(n\gg 1\) is large (just imagine approximating the integral by a sum, and then realizing that the number of terms \(N\) in such a series would grow as \(N\sim O(\exp n)\)). The inverse CDF trick thus fails because it requires one to have \(p(\mathbf x)\) itself, not merely \(\tilde p(\mathbf x)\).

Problem: Describe the Monte Carlo method known as importance sampling for estimating the expectation \(\langle f(\mathbf x)\rangle\) of a random variable \(f(\mathbf x)\).

Solution: Importance sampling consists of a deliberate reframing of the integrand of the expectation inner product integral:

\[\langle f(\mathbf x)\rangle=\int d^n\mathbf x p(\mathbf x)f(\mathbf x)=\int d^n\mathbf x q(\mathbf x)\frac{p(\mathbf x)f(\mathbf x)}{q(\mathbf x)}\]

in other words, replacing \(p(\mathbf x)\mapsto q(\mathbf x)\) and \(f(\mathbf x)\mapsto p(\mathbf x)f(\mathbf x)/q(\mathbf x)\). There are \(2\) cases in which importance sampling can be useful. The first case is related to the previous problem, namely when sampling directly from \(p(\mathbf x)\) is too difficult, in which case \(q(\mathbf x)\) should be chosen to be easier to both sample from and evaluate than \(p(\mathbf x)\). The second case is that, even if sampling directly from \(p(\mathbf x)\) is feasible, it can still be useful to instead sample from \(q(\mathbf x)\) as a means of reducing the variance of the Monte Carlo estimator. That is, whereas the original Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(p(\mathbf x)\) is \(\frac{1}{N}\sum_{i=1}^Nf(\mathbf x_i)\) with variance \(\sigma^2_{f(\mathbf x)}/N\), the new Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws from \(q(\mathbf x)\) is \(\frac{1}{N}\sum_{i=1}^Np(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)\) with variance \(\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}/N\); for importance sampling to be useful, \(q(\mathbf x)\) should be chosen such that:

\[\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}<\sigma^2_{f(\mathbf x)}\]

One can explicitly find the function \(q(\mathbf x)\) that minimizes the variance functional of \(q(\mathbf x)\) (using a Lagrange multiplier to enforce normalization \(\int d^n\mathbf x q(\mathbf x)=1\)):

\[q(\mathbf x)\sim p(\mathbf x)|f(\mathbf x)|\]

In and of itself, this \(q(\mathbf x)\) cannot be used because it contains the difficult piece \(p(\mathbf x)\). Nevertheless, motivated by this result, one can heuristically importance sample using a \(q(\mathbf x)\) which ideally is large at \(\mathbf x\in\mathbf R^n\) where \(p(\mathbf x)|f(\mathbf x)|\) is large. Intuitively, this is because that’s the integrand in the expectation inner product integral, so locations where it’s large will contribute most to the integral, hence are the most “important” regions to sample from with \(q(\mathbf x)\).

One glaring flaw with the above discussion is that if \(Z\) is not known, then it will also be impossible to sample from \(p(\mathbf x)\). Instead, suppose as before that only the shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) is known with \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\). Thus, the expectation is:

\[\frac{\int d^n\mathbf x\tilde{p}(\mathbf x)f(\mathbf x)}{\int d^n\mathbf x\tilde{p}(\mathbf x)}\]

The key is to notice that the denominator is just the \(f(\mathbf x)=1\) special case of the numerator, hence the importance sampling trick can be applied to both numerator and denominator (using the same \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(q(\mathbf x)\)), thereby obtaining the (biased, but asymptotically unbiased) Monte Carlo estimator:

\[\frac{\sum_{i=1}^N\tilde p(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)}{\sum_{i=1}^N\tilde p(\mathbf x_i)/q(\mathbf x_i)}\]

for the expectation.
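
A minimal numerical sketch of this self-normalized importance sampling estimator, targeting the unnormalized \(\tilde p(x)=e^{-x^2/2}\) (so \(p\) is a standard normal) with a wider Gaussian proposal \(q\) and \(f(x)=x^2\), whose true expectation is \(1\):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    N = 100_000

    p_tilde = lambda x: np.exp(-x**2 / 2)   # shape of the target, normalization Z unknown
    f = lambda x: x**2                      # expectation under p is exactly 1

    sigma_q = 2.0                           # proposal q = N(0, sigma_q^2): easy to sample and evaluate
    x = rng.normal(0.0, sigma_q, size=N)
    q = np.exp(-x**2 / (2 * sigma_q**2)) / (sigma_q * np.sqrt(2 * np.pi))

    w = p_tilde(x) / q                      # unnormalized importance weights
    print(np.sum(w * f(x)) / np.sum(w))     # biased but asymptotically unbiased estimate of <f(x)>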

Problem: Describe what Markov chain Monte Carlo (MCMC) methods are about.

Solution: Whereas in “naive” Monte Carlo methods, the random vector draws \(\mathbf x_1,…,\mathbf x_N\) are often i.i.d., here the key innovation is that the sequence of random vectors is no longer i.i.d.; rather, \(\mathbf x_t\) depends on \(\mathbf x_{t-1}\) (and, conditional on \(\mathbf x_{t-1}\), is independent of all the earlier \(\mathbf x_{<t-1}\)). In other words, the sequence \(\mathbf x_1,…,\mathbf x_N\) forms a (discrete-time) Markov chain, i.e. there exists a transition kernel with “matrix elements” \(p(\mathbf x’|\mathbf x)\) for any pair of states \(\mathbf x,\mathbf x’\in\mathbf R^n\) (assuming the DTMC is \(t\)-independent/homogeneous). The goal of an MCMC method is to find a suitable \(p(\mathbf x’|\mathbf x)\) such that the stationary \(N\to\infty\) distribution of the chain matches \(p(\mathbf x)\). Thus, it’s basically an “MCMC sampler”.

Problem: What does it mean to burn in an MCMC run? What does it mean to thin an MCMC run?

Solution: Given a DTMC \(\mathbf x_1,…,\mathbf x_N\), the idea of burning-in the chain is to discard the first few samples \(\mathbf x_1,\mathbf x_2,…\) as they’re correlated with the (arbitrary) choice of starting \(\mathbf x_1\) and so are not representative of \(p(\mathbf x)\) (equivalently, one hopes to burn-in for the first \(N_b\) samples where \(N_b\) is around the mixing time of the Markov chain). Burn-in therefore gives the MCMC run a chance to find the high-probability regions of \(p(\mathbf x)\).

Thinning an MCMC run means keeping only e.g. every \(10^{\text{th}}\) point of the DTMC; this reduces the correlation between the retained samples, since the goal is to (approximately) sample independently from \(p(\mathbf x)\).
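
In practice, both burn-in and thinning amount to slicing the stored chain; a sketch (the stand-in array, burn-in length, and thinning stride are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    chain = rng.normal(size=(20_000, 2))   # stand-in for a stored MCMC chain of shape (N, n)
    N_b, stride = 1_000, 10                # burn-in length and thinning stride
    samples = chain[N_b::stride]           # drop the first N_b draws, keep every 10th thereafter
    print(samples.shape)                   # (1900, 2)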

Problem: The Metropolis-Hastings method is a general template for a fairly broad family of MCMC methods. Explain how it works.

Solution: In order for \(p(\mathbf x)\) to be the stationary distribution of a DTMC, by definition one requires it to be an eigenvector (with eigenvalue \(1\)) of the transition kernel:

\[\int d^n\mathbf xp(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x’)\]

A sufficient (though not necessary) condition for this is if the chain is in detailed balance:

\[p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x|\mathbf x’)p(\mathbf x’)\]

for all \(\mathbf x,\mathbf x’\in\mathbf R^n\). The idea of Metropolis-Hastings is to conceptually decompose the transition from state \(\mathbf x\to\mathbf x’\) into \(2\) simpler steps, namely:

  1. Propose the transition \(\mathbf x\to\mathbf x’\) with proposal probability \(p_p(\mathbf x’|\mathbf x)\) (this distribution \(p_p\) needs to be easy to sample \(\mathbf x’\) from for any \(\mathbf x\)!)
  2. Accept/reject the proposed transition \(\mathbf x\to\mathbf x’\) with an acceptance probability \(p_a(\mathbf x’|\mathbf x)\) (this distribution also needs to be easy to sample from!)

Thus, \(p(\mathbf x’|\mathbf x)=p_a(\mathbf x’|\mathbf x)p_p(\mathbf x’|\mathbf x)\). The ratio of acceptance probabilities from detailed balance is therefore:

\[\frac{p_a(\mathbf x’|\mathbf x)}{p_a(\mathbf x|\mathbf x’)}=\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\]

The ratio on the RHS is known (furthermore, because it’s a ratio, neither \(p_p\) nor \(p\) need be normalized! In other words, one can replace \(p_p\mapsto \tilde p_p\) and \(p\mapsto\tilde p\)). The task remains to specify the form of \(p_a(\mathbf x’|\mathbf x)\). The detailed balance condition doesn’t give a unique solution for \(p_a\), but heuristically one can get a unique solution by further insisting that the walker accept moves often. That is, w.l.o.g. suppose the RHS is computed to be \(>1\), so that \(p_a(\mathbf x’|\mathbf x)>p_a(\mathbf x|\mathbf x’)\). This implies that the transition from \(\mathbf x\to\mathbf x’\) is in some sense more favourable than the reverse transition \(\mathbf x’\to\mathbf x\), so it seems that if at time \(t\) the MCMC walker were at \(\mathbf x_t=\mathbf x\), then at time \(t+1\) the MCMC walker should definitely move to state \(\mathbf x_{t+1}:=\mathbf x’\). In other words, one would like to set \(p_a(\mathbf x’|\mathbf x):=1\) and hence \(p_a(\mathbf x|\mathbf x’)=\frac{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}\). But this therefore fixes the form of the function \(p_a\)! Specifically, to evaluate \(p_a(\mathbf x’|\mathbf x)\), one should set it to \(1\) if the RHS ratio is \(>1\), whereas if the RHS ratio is \(<1\), then \(p_a\) should be set to the RHS ratio. An elegant way to encode all of this into a single compact expression is just:

\[p_a(\mathbf x’|\mathbf x)=\min\left(1,\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\right)\]

Practically, to actually implement this Metropolis-Hastings acceptance probability \(p_a\), one can essentially flip a “biased” coin with probability \(p_a(\mathbf x’|\mathbf x)\) of landing “heads” (meaning to accept the transition \(\mathbf x\to\mathbf x’\)), e.g. randomly picking a number \(u\in[0,1]\) and declaring acceptance iff \(u\leq p_a(\mathbf x’|\mathbf x)\).

(aside: if the proposal probability \(p_p(\mathbf x’|\mathbf x)\sim e^{-|\mathbf x’-\mathbf x|^2/2\sigma^2}\) is taken to be normally and isotropically distributed about \(\mathbf x\), then it is symmetric \(p_p(\mathbf x’|\mathbf x)=p_p(\mathbf x|\mathbf x’)\) and the MH acceptance probability simplifies \(\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(\mathbf x’)}{p(\mathbf x)}\) (in which case this is sometimes just called the Metropolis method). Another tip is that the burn-in period of the MCMC walk can often be used to tune the variance \(\sigma^2\) of the above Gaussian \(p_p\) distribution, namely if the burn-in period lasts for \(N_b\) samples, then look at the number of times \(N_a\) the MCMC walker accepted a proposed state from \(p_p\); a rule-of-thumb is that in high-dimensions \(n\gg 1\), one should tune \(\sigma^2\) such that the fraction of accepted proposals \(N_a/N_b\approx 23.4\%\), see “Weak convergence and optimal scaling of random walk Metropolis algorithms” by Roberts, Gelman, and Gilks (\(1997\)). Finally, one more MCMC trick worth mentioning is that one can always run e.g. \(\sim 100\) MCMC sequences in parallel and sample randomly from them as an alternative to thinning).
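
A minimal random-walk Metropolis sketch targeting an unnormalized \(2\)D density \(\tilde p(\mathbf x)\) (the Gaussian proposal is symmetric, so the acceptance ratio reduces to \(\tilde p(\mathbf x’)/\tilde p(\mathbf x)\); the target and all tuning values are illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def p_tilde(x):
        """Unnormalized target density (shape of a correlated 2D Gaussian)."""
        return np.exp(-(x[0]**2 + x[1]**2 + x[0] * x[1]) / 2)

    n_steps, sigma = 50_000, 1.0
    x = np.zeros(2)                                       # arbitrary starting state
    chain, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + rng.normal(0.0, sigma, size=2)       # symmetric Gaussian proposal p_p
        p_a = min(1.0, p_tilde(x_prop) / p_tilde(x))      # Metropolis acceptance probability
        if rng.uniform() <= p_a:                          # flip the biased coin
            x, n_accept = x_prop, n_accept + 1
        chain.append(x.copy())

    chain = np.array(chain)[1_000::10]                    # burn in, then thin
    print(n_accept / n_steps)                             # acceptance fraction, used to tune sigma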

Problem: Explain how Gibbs sampling works.

Solution: Whereas a Metropolis-Hastings sampler moves around in state space like a queen in chess, a Gibbs sampler moves around like a rook. That is, it only ever moves along the \(n\) coordinate directions in \(\mathbf x=(x_1,…,x_n)\). Indeed, the Gibbs sampler can be viewed as a special case of a Metropolis-Hastings sampler in which the proposal distribution \(p_p\) is given by:

\[p_p(\mathbf x’|\mathbf x):=p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)\]

where \(i=1,…,n\) is cycled through the \(n\) coordinate directions. For each \(1\leq i\leq n\), the MH acceptance probability is:

\[\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(x_i|x’_1,…,x’_{i-1},x’_{i+1},…,x’_n)\delta(x_1-x’_1)…\delta(x_{i-1}-x’_{i-1})\delta(x_{i+1}-x’_{i+1})…\delta(x_n-x’_n)p(\mathbf x’)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)p(\mathbf x)}\]

\[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x’_i,…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_n)}\]

\[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}\]

\[=1\]

so \(\text{min}(1,1)=1\); in other words, the Gibbs sampler always accepts because the proposal is coming from the conditional probability slices themselves (assumed to be known even if the full joint distribution \(p(\mathbf x)\) is unknown), so they must be authentic samples in a sense. For a Gibbs sampler, only after cycling through all \(n\) coordinate directions \(i=1,…,n\) does it count as \(1\) MCMC iteration.

(aside: a common variant of the basic Gibbs sampling formalism described above is block Gibbs sampling in which one may choose to update \(2\) or more features simultaneously, conditioning on the distribution of those features given all other features fixed; this would enable diagonal movements within that subspace which, depending how correlated those features are, could significantly increase the effective sample size (ESS)).
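
A minimal Gibbs sampling sketch for a case where the conditional slices are known exactly: a bivariate normal with unit variances and correlation \(\rho\), for which \(p(x_1|x_2)\) is normal with mean \(\rho x_2\) and variance \(1-\rho^2\) (and vice versa); the value of \(\rho\) is illustrative:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    rho, n_iters = 0.8, 20_000
    x1, x2 = 0.0, 0.0                                    # arbitrary starting point
    chain = np.empty((n_iters, 2))
    for t in range(n_iters):
        # Sweep the coordinate directions using the exact conditional probability slices
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # draw from p(x1 | x2)
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # draw from p(x2 | x1)
        chain[t] = (x1, x2)                              # one full sweep = one MCMC iteration

    print(np.corrcoef(chain[1_000:].T))                  # empirical correlation ~ rho after burn-in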

Problem: Explain how the Hamiltonian Monte Carlo method works.

Solution:


Differential Geometry

Problem: What does it mean for a topological space \(X\) to be locally homeomorphic to a topological space \(Y\)? Hence, what does it mean for a topological space \(X\) to be locally Euclidean?

Solution: \(X\) is said to be locally homeomorphic to \(Y\) iff every point \(x\in X\) has a neighbourhood homeomorphic to an open subset of \(Y\). In particular, \(X\) is said to be locally Euclidean iff it is locally homeomorphic to Euclidean space \(Y=\mathbf R^n\) for some \(n\in\mathbf N\) called the dimension of \(X\).

Explicitly, this means every point \(x\in X\) is associated to at least one neighbourhood \(U_x\) along with at least one homeomorphism \(\phi_x:U_x\to\phi_x(U_x)\subset\mathbf R^n\); the ordered pair \((U_x,\phi_x)\) is an example of a chart on \(X\), and any collection of charts covering \(X\) (such as \(\{(U_x,\phi_x):x\in X\}\)) is called an atlas for \(X\). Hence, \(X\) is locally Euclidean iff there exists an atlas for \(X\).

(caution: there is a distinct concept of a local homeomorphism from \(X\to Y\); existence of such implies that \(X\) is locally homeomorphic to \(Y\), but the converse is false).

Problem: Give an example of topological spaces \(X\) and \(Y\) such that \(X\) is locally homeomorphic to \(Y\) but not vice versa.

Solution: \(X=(0,1)\) and \(Y=[0,1)\). Thus, be wary that local homeomorphicity is not a symmetric relation of topological spaces.

Problem: Let \(X\) be an \(n\)-dimensional locally Euclidean topological space. What additional topological constraints are typically placed on \(X\) in order for it to qualify as a topological \(n\)-manifold?

Solution:

  1. \(X\) is Hausdorff.
  2. \(X\) is second countable.

That being said, conceptually the most important criterion to remember is the local Euclideanity of \(X\). Topological manifolds are the most basic/general/fundamental of manifolds, onto which additional structures may be attached.

Problem: Let \(X\) be a topological \(n\)-manifold. What additional piece of structure should be attached to \(X\) so as to consider it a \(C^k\)-differentiable \(n\)-manifold?

Solution: The additional piece of structure is a \(C^k\)-atlas. A \(C^k\)-atlas for \(X\) is just an atlas for \(X\) with the property that all overlapping charts are compatible with each other. This means that if \((U_1,\phi_1),(U_2,\phi_2)\) are \(2\) charts of a \(C^k\)-atlas for \(X\) such that \(U_1\cap U_2\neq\emptyset\), then the transition map \(\phi_2\circ\phi_1^{-1}:\phi_1(U_1\cap U_2)\subset\mathbf R^n\to\phi_2(U_1\cap U_2)\subset\mathbf R^n\) is \(C^k\)-differentiable (equivalent to \(\phi_1\circ\phi_2^{-1}\) being \(C^k\)-differentiable).

An important point to mention is that a given topological \(n\)-manifold may have multiple different \(C^k\)-atlases. According to the above definition, that would seem to give rise to \(2\) distinct \(C^k\)-manifolds…but intuitively that would be undesirable; they should be viewed as really the same \(C^k\) manifold. This motivates imposing an equivalence relation on \(C^k\)-manifolds by asserting that two \(C^k\)-manifolds are equivalent iff their corresponding \(C^k\)-atlases are compatible…which just means that the union of their \(C^k\)-atlases is itself a \(C^k\)-atlas, or equivalently every chart in one \(C^k\)-atlas is compatible with every chart in the other \(C^k\)-atlas. This is sometimes formalized by introducing a so-called “maximal \(C^k\)-atlas”, but the basic idea should be clear.

Finally, note that topological manifolds are just \(C^0\)-manifolds because homeomorphisms are continuous. In physics, the most common case is to deal with \(C^{\infty}\)-manifolds (also called smooth manifolds), like spacetime in GR, classical configuration/phase spaces in CM, or state space in thermodynamics.

Problem: Let \(S^1\) be the circle, a \(1\)-dimensional smooth manifold which is regarded as being embedded in \(\mathbf R^2\) via the set of points \(x^2+y^2=1\). Explain why \(S^1\) is locally (but not globally) homeomorphic to the real line \(\mathbf R\). Furthermore, explain whether or not the pair \((S^1,\theta)\) (where \(\theta:(\cos\theta,\sin\theta)\in S^1\mapsto\theta\in[0,2\pi)\) is the obvious angular coordinate mapping) is a chart on \(S^1\) or not.

Solution: \(S^1\) is locally homeomorphic to \(\mathbf R\) because, roughly speaking, every little strip of open arc in \(S^1\), upon zooming in, looks like a strip of straight line in \(\mathbf R\). However, \(S^1\) is not globally homeomorphic to \(\mathbf R\) because \(S^1\) is compact whereas \(\mathbf R\) is not. The pair \((S^1,\theta)\) is not a chart on \(S^1\) because its image \([0,2\pi)\) isn’t open in \(\mathbf R\).

Problem: Hence, demonstrate an example of an atlas for \(S^1\).

Solution: To circumvent the problem above, one requires \(2\) charts to cover \(S^1\), thereby forming an atlas for \(S^1\). For example, a chart \(\theta_1:S^1-\{(1,0)\}\to (0,2\pi)\) similar to the above but with the point \((1,0)\) removed, and a second chart \(\theta_2:S^1-\{(-1,0)\}\to (-\pi,\pi)\) with the antipodal point \((-1,0)\) removed. The domains of these \(2\) charts then overlap on the upper and lower open semicircles, so the transition function on this overlap is:

\[\theta_2(\theta_1^{-1}(\theta))=\theta[\theta\in(0,\pi)]+(\theta-2\pi)[\theta\in(\pi,2\pi)]\]

which is indeed \(C^{\infty}\).

Problem: Define the real associative algebra \(C^k(X\to\mathbf R)\).

Solution: This is simply the space of all \(C^k\)-differentiable scalar fields on the \(C^k\)-manifold \(X\). To be precise, \(f\in C^k(X\to\mathbf R)\) iff for all charts \((U,\phi)\) in the \(C^k\)-atlas defining the differential structure of \(X\), the map \(f\circ\phi^{-1}:\phi(U)\to\mathbf R\) is \(C^k\)-differentiable.

Problem: Let \(X\) be a \(C^1\)-differentiable \(n\)-manifold and let \(x\in X\). What is the modern definition of tangent vectors used in differential geometry? How can one reconcile this with one’s prior intuition about tangent vectors in Euclidean space \(\mathbf R^n\)?

Solution: Imagine embedding a line \(\mathbf R\) inside a plane \(\mathbf R^2\). Clearly, the “tangent vectors” to the line, viewed within \(\mathbf R^2\), simply lie along the line itself, so in fact there was no need for the embedding in the first place. Similarly, if one embeds a plane \(\mathbf R^2\) inside \(\mathbf R^3\), the tangent vectors to the plane are confined within that plane. However, this no longer holds if one instead embeds a manifold like \(S^2\) within \(\mathbf R^3\); now the tangent vectors to the sphere “leak” outside \(S^2\) and into the embedding manifold \(\mathbf R^3\).

One would like to be able to define tangent vectors in a way that is intrinsic to whatever manifold \(X\) one is working with (i.e. without needing to reference an external embedding). It turns out one way differential geometers have gone about this (nb. not the only way) is to exploit the duality \(\mathbf v\leftrightarrow\mathbf v\cdot\frac{\partial}{\partial\mathbf x}\) between (tangent) vectors \(\mathbf v\in\mathbf R^n\) and their corresponding directional derivative operators \(\mathbf v\cdot\frac{\partial}{\partial\mathbf x}\) in the Euclidean case \(X=\mathbf R^n\). Since tangent vectors are physically associated with velocity vectors (e.g. \(\mathbf v=\dot{\mathbf x}\) is tangent to a particle’s trajectory \(\mathbf x(t)\)), the bijection \(\mathbf v\mapsto\mathbf v\cdot\frac{\partial}{\partial\mathbf x}\) converts a temporal rate of change to a spatial rate of change. These geometric considerations motivate the algebraic “\(1^{\text{st}}\)-order linear differential operator” definition of tangent vectors; \(v_x:C^{\infty}(X)\to\mathbf R\) is a tangent vector at \(x\in X\) iff it behaves derivative-like in the sense that it’s linear and obeys the product rule (technical term: derivation):

\[v_x(\alpha\phi+\beta\psi)=\alpha v_x(\phi)+\beta v_x(\psi)\]

\[v_x(\phi\psi)=\phi(x)v_x(\psi)+v_x(\phi)\psi(x)\]

for arbitrary scalars \(\alpha,\beta\in\mathbf R\) and smooth scalar fields \(\phi,\psi:X\to\mathbf R\); thus, this definition is \(X\)-intrinsic. To emphasize again, tangent vectors are just \(1^{\text{st}}\)-order linear differential operators (evaluated at some \(x\in X\)). The “linear” part comes from literally requiring linearity. The “first-order” part comes from the product rule; if it were e.g. \(2^{\text{nd}}\)-order, then the product rule would fail due to cross-terms. Finally, it’s worth emphasizing the parallels between this construction and Schwartz’s theory of distributions in which the tangent vector \(v\) plays the role of a distribution while the scalar field \(\phi\) plays the role of a test function; two tangent vectors \(v,v’\) are equal iff \(v(\phi)=v’(\phi)\) when evaluated on an arbitrary test “bump” \(\phi\).

(aside: it was mentioned above that “tangent vector” doesn’t have to be defined algebraically; there exists an equivalent formulation that’s more intuitive: consider trajectories \(x(t)\in X\) slithering across \(X\) which pass through a point \(x_0\in X\) at time \(t=0\). Now look at some Euclidean projection \(\mathbf x(x(t))\in\mathbf R^n\) where \(\mathbf x:U\to\mathbf R^n\) is some chart on a neighbourhood \(U\subset X\) of \(x_0\). One can sort the trajectories \(x(t)\) into equivalence classes based on the velocity vector \(\left(\frac{d}{dt}\right)_{t=0}\mathbf x(x(t))\in\mathbf R^n\) when \(x(t)\) is passing through \(x_0\). These equivalence classes are then taken to be the tangent vectors. Although more intuitive than the algebraic formulation, the drawback is that one has to check everything is indeed independent of the choice of chart \(\mathbf x\)).

Problem: What is the tangent (vector) space \(T_x(X)\) to a smooth \(n\)-manifold \(X\) at a point \(x\in X\)?

Solution: \(T_x(X)\) is simply the set of all tangent vectors at \(x\in X\) (informally, think tangent line or tangent plane, though remember that such conceptions implicitly require an embedding). By endowing it with the obvious notions of vector addition and scalar multiplication:

\[(v_x+v’_x)(\phi):=v_x(\phi)+v’_x(\phi)\]

\[(cv_x)(\phi):=cv_x(\phi)\]

it’s easy to check these operations are closed in \(T_x(X)\), thereby giving it the structure of a real, \(n\)-dimensional vector space (this justifies calling \(v_x\) a tangent vector in the first place). More precisely, to prove that \(\dim T_x(X)=n\) is indeed true for all \(x\in X\), one can construct an explicit coordinate basis for \(T_x(X)\) by pasting a chart \(\mathbf x=(x^0,…,x^{n-1})\) onto some neighbourhood of \(x\) and defining \(n\) basis tangent vectors \(\partial_{\mu}|_x:C^{\infty}(X)\to\mathbf R\) for \(\mu=0,1,…,n-1\) induced by the choice of chart \(\mathbf x\):

\[\partial_{\mu}|_x(\phi):=\frac{\partial\phi\circ\mathbf x^{-1}}{\partial x^{\mu}}\biggr|_{\mathbf x(x)}\]

where the RHS is just the standard partial derivative on \(\mathbf R^n\). It should be emphasized that this defines the meaning of expressions like \(\frac{\partial\phi}{\partial x^{\mu}}(x)\), so it’s not an abuse of notation.

Then, check that:

  1. These are genuinely tangent vectors \(\partial_{\mu}|_x\in T_x(X)\) in that they are linear and obey product rule.
  2. Check that they are linearly independent, i.e. \(\sum_{\mu=0}^{n-1}v_x^{\mu}\partial_{\mu}|_x=0\Rightarrow v_x^{\mu}=0\) (use \(\partial_{\mu}|_x(x^{\nu})=\delta^{\nu}_{\mu}\)).
  3. Check that \(\text{span}_{\mathbf R}\{\partial_{\mu}|_x:\mu=0,…,n-1\}=T_x(X)\) (use \(\partial_{\mu}|_x(x^{\nu})=\delta^{\nu}_{\mu}\) again to show \(v_x=\sum_{\mu=0}^{n-1}v_x(x^{\mu})\partial_{\mu}|_x\)).

Problem: A given tangent vector \(v_x\in T_x(X)\) may be written with contravariant components in \(2\) distinct coordinate bases:

\[v_x=v^{\mu}\partial_{\mu}|_x=v’^{\nu}\partial’_{\nu}|_x\]

simply from picking \(2\) different charts \((U,\phi),(U’,\phi’)\) in the \(C^1\)-atlas of \(X\) containing \(x\in U\cap U’\). Describe how the contravariant components \(v’^{\nu}\) may be obtained from the contravariant components \(v^{\mu}\).

Solution: Act on an arbitrary \(f\in C^1(X\to\mathbf R)\) to obtain:

\[v_x(f)=v^{\mu}\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=v’^{\nu}\frac{\partial (f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

Now insert a “resolution of the identity” \(f\circ\phi’^{-1}\circ\phi’\circ\phi^{-1}\) and because \(X\) is a \(C^1\)-manifold, the transition map \(\phi’\circ\phi^{-1}\) will be differentiable and in particular, by the chain rule:

\[\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}\frac{\partial(f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

where \(\phi'(x)=(x’^0(x),…,x’^{n-1}(x))\). This simple chain rule identity can by itself already be viewed as a passive change of \(T_x(X)\)-basis:

\[\frac{\partial}{\partial x’^{\nu}}\biggr|_{x}=\frac{\partial x^{\mu}}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\frac{\partial}{\partial x^{\mu}}\biggr|_x\]

Or equivalently, as an active change of the contravariant components via the Jacobian matrix:

\[v’^{\nu}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}v^{\mu}\]

Problem: What is a (tangent) vector field \(v\in\mathfrak{X}(X)\) on a smooth manifold \(X\)?

Solution: There are \(2\) equivalent definitions.

  1. (Intuitive Definition): A vector field \(v\) is a smooth assignment of a tangent vector \(v_x\in T_x(X)\) at each point \(x\in X\); formally this means it is a map \(v:X\to TX\) where the tangent bundle of \(X\) is simply the collection of all tangent vectors with a memory of where they are rooted \(TX:=\bigsqcup_{x\in X}T_x(X)\).
  2. (Algebraic Definition) A vector field \(v:C^{\infty}(X)\to C^{\infty}(X)\) is a derivation from scalar fields to scalar fields. Formally, for each \(\phi:X\to\mathbf R\), the scalar field \(v(\phi):X\to\mathbf R\) needs to behave like a directional derivative of \(\phi\) “along” \(v\) in the sense that (again!) it actually behaves like a \(1^{\text{st}}\)-order linear differential operator:
    \[v(\alpha\phi+\beta\psi)=\alpha v(\phi)+\beta v(\psi)\]
    \[v(\phi\psi)=v(\phi)\psi+\phi v(\psi)\]

Problem: Given \(2\) vector fields \(v,v’\in\mathfrak{X}(X)\) on the same smooth manifold \(X\), explain why their composition \(vv’:=v\circ v’\) (and hence also \(v’v:=v’\circ v\)) is not a vector field on \(X\).

Solution: Just think about composing two \(1^{\text{st}}\)-order linear differential operators together. In general, one expects a \(2^{\text{nd}}\)-order linear differential operator. Therefore, while \(vv’\) is still linear, it won’t pass the product rule test for “\(1^{\text{st}}\)-orderness”, hence \(vv’\notin\mathfrak{X}(X)\). One can explicitly compute:

\[vv'(\phi\psi)=vv'(\phi)\psi+\phi vv'(\psi)+v'(\phi)v(\psi)+v(\phi)v'(\psi)\]

to see that the \(2^{\text{nd}}\)-order cross-terms \(v'(\phi)v(\psi)+v(\phi)v'(\psi)\) prevent \(vv’\) from fulfilling the product rule.

Problem: By analyzing the cross terms, explain how one can recover the product rule!

Solution: Notice the cross terms are invariant under the interchange of vector fields \(v\leftrightarrow v’\). Therefore, to cancel them out, one might consider a commutator \([v,v’]:=vv’-v’v\). This now is not only linear but has also recovered its “\(1^{\text{st}}\)-orderness”:

\[[v,v’](\phi\psi)=([v,v’]\phi)\psi+\phi[v,v’](\psi)\]

Thus, \(\mathfrak{X}(X)\) had a real Lie algebra structure after all. It may still feel a bit strange that the commutator somehow manages to recover “\(1^{\text{st}}\)-orderness”. The clearest way to see this is to work in a suitable chart \(x^{\mu}\), expanding the vector fields \(v=v^{\mu}\partial_{\mu}\) and \(v’=v’^{\nu}\partial_{\nu}\) which leads to the commutator \([v,v’]=[v,v’]^{\mu}\partial_{\mu}\) with:

\[[v,v’]^{\mu}=v^{\nu}\partial_{\nu}v’^{\mu}-v’^{\nu}\partial_{\nu}v^{\mu}\]

In particular, it’s clear that \([v,v’]\) is \(1^{\text{st}}\)-order because it’s just a linear combination of \(1^{\text{st}}\)-order partial differential operators \(\partial_{\mu}\) weighted by scalar components \([v,v’]^{\mu}\). It would not have been possible to write, for instance, \(vv’\neq (vv’)^{\mu}\partial_{\mu}\) because it’s not a \(1^{\text{st}}\)-order differential operator.

Problem: Any vector field \(v\in\mathfrak{X}(X)\) on a smooth manifold \(X\) induces a corresponding flow \(v_t\) on \(X\). Explain this generation process \(v\rightarrow v_t\).

Solution: Classically, if one had a steady fluid velocity field \(\mathbf v(\mathbf x)\) on \(\mathbf x\in\mathbf R^n\), the streamlines/pathlines/streaklines coincide and are given by solving the system of \(1^{\text{st}}\)-order ODEs:

\[\frac{d\mathbf x(t)}{dt}=\mathbf v(\mathbf x(t))\]

By analogy, a flow \(v_t:X\to X\) generated by a vector field \(v\in\mathfrak{X}(X)\) is defined by requiring:

\[\frac{d\phi(v_t(x_0))}{dt}=v(\phi)(v_t(x_0))\]

for all test scalar fields \(\phi\in C^{\infty}(X)\) and initial conditions \(x_0\in X\). Just as the quantum mechanical time-evolution operator \(U_t\) causes an initial state \(|\psi(t=0)\rangle\in\mathcal H\) to flow to another state \(|\psi(t)\rangle=U_t|\psi(t=0)\rangle\in\mathcal H\), one can think of the flow \(v_t\) as a one-parameter family of diffeomorphisms (\(v_{t+t’}=v_t\circ v_{t’}\Rightarrow v_{t=0}=1,v^{-1}_t=v_{-t}\)) taking an initial point \(x_0\in X\) and time translating it to a new point \(v_t(x_0)\in X\).

A corollary of this is that \(v(\phi)(x_0)=\frac{d\phi(v_t(x_0))}{dt}\biggr|_{t=0}\), so one has the useful Maclaurin expansion about \(t=0\):

\[\phi(v_t(x_0))=\phi(x_0)+tv(\phi)(x_0)+O_{t\to 0}(t^2)\]

Problem: For the case \(X=\mathbf R\), let the vector field \(v:=x^2\frac{d}{dx}\). Compute the corresponding flow \(v_t(x_0)\) for an arbitrary initial condition \(x_0\in\mathbf R\), and comment on its behavior.

Solution: The flow is governed by the ODE \(\dot x=x^2\) which is solved by \(v_t(x_0)=x_0/(1-x_0t)\). But this flow is undefined at \(t=1/x_0\), so \(v\) is considered an incomplete vector field. In this case, the reason can be traced to the non-compact nature of the manifold \(X=\mathbf R\).

Problem: Let \(T\) be a tensor field on some smooth manifold \(X\). Suppose one would like to “transport” \(T\) from \(X\) onto some diffeomorphic manifold \(X’\cong X\) (this assumption is essential! Without it a lot of what is about to be said fails). Explain the \(2\) mechanisms whereby this transport can be achieved. Furthermore, explain how, despite the fact that these \(2\) transport methods always exist, depending on the type of \(T\), one method may be more “natural” than another.

Solution: The key is to clearly identify which direction (in this case \(X\to X’\)) one would like to transport \(T\). With that reference direction in mind:

  1. (Pushforward) Since \(X’\cong X\), there must exist an explicit diffeomorphism \(\cong\) between them. If the diffeomorphism is aligned in the same direction as one’s desired transport direction (i.e. \(\cong:X\to X’\)), then one can use this diffeomorphism to pushforward the tensor \(T\mapsto\cong_*T\) from \(X\to X’\).
  2. (Pullback) If instead one’s desired direction of transport (\(X\to X’\)) goes against the direction of the diffeomorphism (i.e. \(\cong:X’\to X\)), then one can still transport \(T\) from \(X\) to \(X’\) by using \(\cong\) to pullback \(T\mapsto\cong^*T\).

    Suppose \(T=\phi\) is a scalar field on \(X\), and suppose one would like to transport \(\phi\mapsto\phi’\) on \(X’\). The natural way to do this is to insist that \(\phi’(x’)=\phi(x)\). The question is whether one should take \(x’=\cong(x)\) (i.e. using a pushforward) or \(x=\cong(x’)\) (i.e. using a pullback). Since \(\cong\) is a diffeomorphism, and thus a bijection, both of these choices work, but if \(\cong^{-1}\) did not exist, then the pushforward \(\phi’=\phi\circ\cong^{-1}\) would also not exist. Thus, the pullback is more natural because it doesn’t actually rely on \(\cong\) being a diffeomorphism: any map \(\cong:X’\to X\) suffices to pull a scalar field back via \(\phi’=\phi\circ\cong\).

By contrast, for \(T=v\) a vector field on \(X\), one naturally demands \(v’\in\mathfrak{X}(X’)\) to obey \(v'(\phi’)(x’)=v(\phi)(x)\). Again, because \(\cong\) is assumed to be a diffeomorphism, \(v\) can be transported from \(X\to X’\) via either a pushforward or a pullback, but unlike for scalar fields, here it turns out (why?) to be more natural to use a pushforward \(\cong:X\to X’\) so that \(v’=\cong_*v\), \(\phi=\cong^*\phi’\), and \(x’=\cong(x)\).

If \(x^{\mu}\) is some chart on \(X\) with respect to which \(v=v^{\mu}\partial_{\mu}\), and \(x’^{\mu}\) is some chart on \(X’\) (note that generically \(x^{\mu}\neq\cong^*x’^{\mu}\)), then the components of the pushforward are given by:

\[(\cong_*v)^{\mu}(x’)=\frac{\partial x’^{\mu}}{\partial x^{\nu}}v^{\nu}(x)\]

Problem: Let \(X\) be a smooth manifold, let \(\phi\in C^{\infty}(X)\) be a scalar field on \(X\), and let \(v\in\mathfrak{X}(X)\) be a vector field flowing on \(X\). Explain how the scalar field \(\mathcal L_v(\phi)\in C^{\infty}(X)\) is defined (this is called the Lie derivative of \(\phi\) “along” \(v\)), and how it is calculated.

Solution: Recall that for a scalar field \(\phi(\mathbf x)\) on \(\mathbf x\in\mathbf R^n\), the directional derivative of \(\phi\) along a velocity vector \(\mathbf v\in\mathbf R^n\) is defined by the limit:

\[\lim_{t\to 0}\frac{\phi(\mathbf x+t\mathbf v)-\phi(\mathbf x)}{t}=\left(\frac{d\phi(\mathbf x+t\mathbf v)}{dt}\right)_{t=0}=\mathbf v\cdot\frac{\partial\phi}{\partial\mathbf x}\]

By analogy, one defines:

\[\mathcal L_v\phi(x):=\lim_{t\to 0}\frac{v_t^*\phi(x)-\phi(x)}{t}=\left(\frac{d\phi(v_t(x))}{dt}\right)_{t=0}=v(\phi)(x)\]

Thus, \(\mathcal L_v\phi=v(\phi)\), or more abstractly (remembering this is the Lie derivative \(\mathcal L_v:C^{\infty}(X)\to C^{\infty}(X)\) on scalar fields) \(\mathcal L_v=v\).

Problem: Repeat the above for the action of the Lie derivative \(\mathcal L_v\) on another vector field \(v’\in\mathfrak{X}(X)\) to produce the vector field \(\mathcal L_v(v’)\in\mathfrak{X}(X)\).

Solution: The basic idea is that tangent vectors living in distinct tangent spaces cannot be subtracted. So one has to pushforward the future tangent vector “back in time”:

\[\mathcal L_vv’:=\lim_{t\to 0}\frac{(v^{-1}_t)_*v’-v’}{t}\]

One way to proceed is to define an auxiliary scalar field \(f(t,t’):=v'(\phi\circ v_{-t})(v_{t’}(x))\) in which case one can use the chain rule to compute \(\mathcal L_vv'(\phi)(x)=\left(\frac{df(t,t)}{dt}\right)_{t=0}=\left(\frac{\partial f(t,0)}{\partial t}\right)_{t=0}+\left(\frac{\partial f(0,t’)}{\partial t’}\right)_{t’=0}\). However, here it will be fun to directly Maclaurin expand the numerator of the limit. Recall from earlier the fundamental property of infinitesimal flows \(\text{test scalar field}(v_t(x))\approx\text{test scalar field}(x)+tv(\text{test scalar field})(x)\). First, apply it to \(\text{test scalar field}=v'(\phi\circ v_{-t})\):

\[v'(\phi\circ v_{-t})(v_t(x))\approx v'(\phi\circ v_{-t})(x)+tv(v'(\phi\circ v_{-t}))(x)\]

Then apply it again for \(\text{test scalar field}=\phi\) and replace \(t\mapsto -t\):

\[\phi\circ v_{-t}\approx \phi-tv(\phi)\]

Distributing everything up to \(O(t)\), using linearity of vector fields, the limit reduces to \(\mathcal L_vv’=[v,v’]\). More abstractly, the Lie derivative is just the Lie bracket (hence the name!) on \(\mathfrak{X}(X)\), i.e. \(\mathcal L_v=[v,\cdot\,]\). Indeed, one can check that it is a Lie algebra representation because of the homomorphism property \(\mathcal L_{[v,v’]}=[\mathcal L_v,\mathcal L_{v’}]\) thanks to the Jacobi identity.

Problem: Define a covector at some point \(x\in X\) on a \(C^1\)-manifold \(X\). Hence, define a \(1\)-form on \(X\). Clearly emphasize the difference between a covector and a \(1\)-form.

Solution: At \(x\in X\), one of course has the tangent space \(T_x(X)\). Since \(T_x(X)\) is a vector space, it has an associated dual vector space \(T^*_x(X)\) which in this context is called the cotangent space at \(x\in X\) (cf. \(\tan\) vs. \(\cot\)). As the linear functionals \(v_x\in T_x(X)\) are called tangent vectors at \(x\in X\), so it makes sense that the linear functionals \(A_x\in T^*_x(X)\) are called cotangent vectors at \(x\in X\), or covectors for short.

Just as a (tangent) vector field \(v\) assigns a tangent vector \(v_x\in T_x(X)\) to each point \(x\in X\) across the manifold \(X\), a (cotangent) vector field \(A\) assigns a cotangent vector \(A_x\in T_x^*(X)\) to each point \(x\in X\). This “covector field” is called a differential \(1\)-form, or \(1\)-form for short.

Problem: Show how, by applying the exterior derivative \(d\) to any scalar field \(\phi\in C^{\infty}(X)\cong\Omega^0(X)\) (also called a differential \(0\)-form), the resulting object \(d\phi\in\Omega^1(X)\) is a \(1\)-form.

Solution: This is because \(d\phi\) is defined by its action on an arbitrary vector field \(v\) as:

\[d\phi(v):=v(\phi)=\mathcal L_v(\phi)\]

Problem: Explain why, when the manifold \(X\) is covered by a coordinate chart \((x^0,…,x^{n-1})\), any exact differential \(1\)-form is given by the familiar chain rule:

\[d\phi=\frac{\partial\phi}{\partial x^{\mu}}dx^{\mu}\]

Solution: The first thing is to unpack the meaning of the differentials \(dx^{\mu}\). This amounts to substituting \(\phi=x^{\mu}\) for the exterior derivative of a \(0\)-form, so for an arbitrary vector field \(v\), one has:

\[dx^{\mu}(v):=v(x^{\mu})\]

In particular, if \(v=\partial_{\nu}\) for some \(\nu\), then \(\partial x^{\mu}/\partial x^{\nu}=\delta^{\mu}_{\nu}\), so \(dx^{\mu}\) is the dual coordinate basis for \(\Omega^1(X)\) with respect to the coordinate basis \(\partial_{\mu}\) of \(\mathfrak{X}(X)\). The components of \(d\phi\) in the \(dx^{\mu}\) basis are thus as claimed:

\[d\phi=d\phi(\partial_{\mu})dx^{\mu}=\partial_{\mu}\phi dx^{\mu}\]

Problem: Let \(X\) be a \(C^1\)-manifold and let \(x^{\mu}\) and \(x’^{\mu}\) be two coordinate charts for \(X\). Compare how the vector field basis, the one-form basis, and the components of vectors and one-forms in their respective bases transform between these \(2\) coordinate charts.

Solution: Let \(v=v^{\mu}\partial_{\mu}=v’^{\mu}\partial’_{\mu}\) be a vector field and \(A=A_{\mu}dx^{\mu}=A’_{\mu}dx’^{\mu}\) be a \(1\)-form. In what follows, the key is to always remember that the underlying objects \(v, A\) are chart-invariant, so if the basis transforms under one particular Jacobian, then the components must transform under the inverse Jacobian.

One can start with the vector field basis \(\partial’_{\mu}=\partial’_{\mu}x^{\nu}\partial_{\nu}\) which is just the indisputable chain rule. With that as an anchoring point, one immediately concludes \(v’^{\mu}=\partial_{\nu}x’^{\mu}v^{\nu}\). Then, by enforcing that \(dx’^{\mu}(\partial’_{\nu})=\delta^{\mu}_{\nu}\) remains biorthogonal, this leads to the intuitive chain rule requirement \(dx’^{\mu}=\partial_{\nu}x’^{\mu}dx^{\nu}\), and hence \(A’_{\mu}=\partial’_{\mu}x^{\nu}A_{\nu}\).

Problem: Let \(X\) and \(X’\) be smooth, diffeomorphic manifolds. Earlier it was seen that scalar fields are naturally transported via pullback whereas vector fields are naturally transported via pushforward. What about for \(1\)-forms? Hence, define the Lie derivative \(\mathcal L_vA\) of a \(1\)-form \(A\) with respect to a vector field \(v\).

Solution: Just like scalar fields, \(1\)-forms naturally pullback (can remember this because they both map to \(\mathbf R\)). Specifically, if \(A\in\Omega^1(X)\) currently lives on \(X\), and one has a diffeomorphism \(\cong:X’\to X\), then:

\[\cong^*A(v):=A(\cong_*v)\]

Or in components (using the earlier pushforward formula, which here reads \((\cong_*v)^{\nu}(x)=\frac{\partial x^{\nu}}{\partial x’^{\mu}}v^{\mu}(x’)\) since the diffeomorphism now points \(\cong:X’\to X\), with \(x=\cong(x’)\)):

\[(\cong^*A)_{\mu}(x’)=\frac{\partial x^{\nu}}{\partial x’^{\mu}}A_{\nu}(x)\]

The Lie derivative is then given by:

\[\mathcal L_{v}A:=\lim_{t\to 0}\frac{v_t^*A-A}{t}\]

It turns out (how?) one can show that \(\mathcal L_v A=(\mathcal L_v A)_{\mu}dx^{\mu}\) where the components are:

\[(\mathcal L_v A)_{\mu}=v^{\nu}\partial_{\nu}A_{\mu}+A_{\nu}\partial_{\mu}v^{\nu}\]
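
One route, sketched in a chart: the flow acts on coordinates as \(x^{\mu}\mapsto x^{\mu}+tv^{\mu}+O_{t\to 0}(t^2)\), so the pullback of \(A\) by \(v_t\) has components

\[(v_t^*A)_{\mu}(x)=\frac{\partial(x^{\nu}+tv^{\nu})}{\partial x^{\mu}}A_{\nu}(x^{\lambda}+tv^{\lambda})=A_{\mu}(x)+t\left(v^{\nu}\partial_{\nu}A_{\mu}+A_{\nu}\partial_{\mu}v^{\nu}\right)(x)+O_{t\to 0}(t^2)\]

and subtracting \(A_{\mu}(x)\), dividing by \(t\), and taking \(t\to 0\) yields the claimed components.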

Problem: Let \(X\) be a smooth \(n\)-manifold, and let \(x\in X\). Define a tensor \(T_x\) of type \((k^*,k)\) (i.e. rank \(k^*+k\)) at \(x\). Define the components of the tensor \(T_x\) with respect to a suitable ordered basis. Give an example of a tensor field defined over any manifold \(X\).

Solution: Just as vector fields are smooth assignments of tangent vectors across \(X\), just as \(1\)-forms are smooth assignments of covectors across \(X\), tensor fields are smooth assignments of tensors across \(X\). In particular, it only makes sense to speak of a tensor \(T_x\) at a specific point \(x\in X\), and to view the object \(T\) itself as a tensor field. With that in mind, a tensor at \(x\in X\) of type \((k^*,k)\in\mathbf N^2\) is defined (at least in differential geometry) to be any multilinear map \(T_x:T^*_x(X)^{k^*}\times T_x(X)^k\to\mathbf R\) that eats in \(k^*\) covectors in the cotangent space \(T^*_x(X)\) and \(k\) tangent vectors in the tangent space \(T_x(X)\) and spits out a scalar. The corresponding tensor field can be viewed as a map \(T:\Omega^1(X)^{k^*}\times\mathfrak{X}(X)^k\to C^{\infty}(X)\). Given an ordered (possibly non-coordinate) basis \(\partial_{\mu}\) for \(\mathfrak{X}(X)\) with corresponding dual ordered (again! possibly non-coordinate) basis \(dx^{\mu}\) for \(\Omega^1(X)\), the \(n^{k^*+k}\) components of \(T\) are defined by:

\[T^{\mu_1,…,\mu_{k^*}}_{\nu_1,…,\nu_{k}}:=T(dx^{\mu_1},…,dx^{\mu_{k^*}},\partial_{\nu_1},…,\partial_{\nu_{k}})\]

Thus, covectors are type \((0,1)\) tensors while tangent vectors are type \((1,0)\) tensors. Any manifold \(X\) is equipped with a type \((1,1)\) tensor field \(\delta:\Omega^1(X)\times\mathfrak{X}(X)\to C^{\infty}(X)\) defined by:

\[\delta(A,v):=A(v)\Leftrightarrow\delta(dx^{\mu},\partial_{\nu})=\delta^{\mu}_{\nu}\]

Problem: (tensor component transformation)

Solution:

Problem: Let \(T,T’\) be respectively type \((k^*,k)\) and \((k’^*,k’)\) tensor fields defined over the same smooth \(n\)-manifold \(X\). Define the \((k^*+k’^*,k+k’)\) tensor field given by their tensor product \(T\otimes T’:\Omega^1(X)^{k^*+k’^*}\times\mathfrak{X}(X)^{k+k’}\to C^{\infty}(X)\).

Solution: It’s pretty much the obvious thing one can write down that sums the tensor ranks:

\[T\otimes T'(A_1,…,A_{k^*},A’_1,…,A’_{k’^*},v_1,…,v_k,v’_1,…,v’_{k’})\]

\[:=T(A_1,…,A_{k^*},v_1,…,v_k)T'(A’_1,…,A’_{k’^*},v’_1,…,v’_{k’})\]

Or, with respect to an ordered (possibly non-coordinate!) basis \(\{\partial_{\mu}\}\) of \(\mathfrak{X}(X)\):

\[(T\otimes T’)^{\mu_1,…,\mu_{k^*},\mu’_1,…,\mu’_{k’^*}}_{\nu_1,…,\nu_k,\nu’_1,…,\nu’_{k’}}=T^{\mu_1,…,\mu_{k^*}}_{\nu_1,…,\nu_k}T’^{\mu’_1,…,\mu’_{k’^*}}_{\nu’_1,…,\nu’_{k’}}\]

Problem: Show how to compute the pushforward \(\cong_*T\) of a tensor field \(T\), and hence show how to take the Lie derivative \(\mathcal L_{\partial}T\) of a tensor field \(T\) along a vector field \(\partial\).

Solution:

Problem: Define a differential \(k\)-form.

Solution: Differential \(k\)-forms are simply totally antisymmetric type \((0,k)\) tensor fields. At each point \(x\in X\), one can think of them as measuring devices that eat in \(k\) tangent vectors in \(T_x(X)\) and spit out a number. More globally, \(\omega\in\Omega^k(X)\) is said to be a differential \(k\)-form iff \(\omega:\mathfrak{X}(X)^k\to C^{\infty}(X)\) takes in \(k\) vector fields and spits out a scalar field, multilinearly and antisymmetrically.

Problem: Given a differential \(k\)-form \(\omega\in\Omega^k(X)\) and a differential \(k’\)-form \(\omega’\in\Omega^{k’}(X)\), define the differential \(k+k’\)-form given by their wedge product \(\omega\wedge\omega’\in\Omega^{k+k’}(X)\).

Solution:

Problem: Define the exterior derivative \(d:\Omega^k(X)\to\Omega^{k+1}(X)\) of a differential \(k\)-form, showing how it generalizes the earlier definition given for \(k=0\)-forms.

Solution: Recall that the wedge product \(\wedge\) of a differential \(k’\)-form and a differential \(k\)-form is a differential \(k’+k\)-form. In particular, the exterior derivative can loosely be viewed as a special case of the wedge product with \(k’=1\) and \(d=\partial_{\mu}dx^{\mu}\wedge\); for instance \(d\phi=\partial_{\mu}\phi\,dx^{\mu}\), and for a \(1\)-form \(A=A_{\mu}dx^{\mu}\) the exterior derivative \(F:=dA\) is given by:

\[F=\partial_{\mu}dx^{\mu}\wedge A_{\nu}dx^{\nu}=\partial_{\mu}A_{\nu}dx^{\mu}\wedge dx^{\nu}=\frac{1}{2}F_{\mu\nu}dx^{\mu}\wedge dx^{\nu}\]

with \(F_{\mu\nu}:=\partial_{\mu}A_{\nu}-\partial_{\nu}A_{\mu}\).

Problem: Prove the following properties of the wedge product \(\wedge\) and its interaction with the exterior derivative \(d\) (one could also ask about its interaction with the pullback, pushforward, tensor product, interior product, etc.):

  • (Graded commutativity; in particular odd-degree forms square to zero) \[\omega\wedge\omega’=(-1)^{kk’}\omega’\wedge\omega\]
  • (Graded product rule with respect to \(d\)) \[d(\omega\wedge\omega’)=(d\omega)\wedge\omega’+(-1)^k\omega\wedge d\omega’\]

Problem: Prove Cartan’s magic formula:

\[\mathcal L_{\partial}=\{d,\iota_{\partial}\}\]

Posted in Blog | Leave a comment

Support Vector Machines

Problem: Explain how a hard-margin support vector machine would perform binary classification.

Solution: Conceptually, it’s simple. Given a training set of \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) each associated with some binary target label \(y_1,…,y_N\in\{-1,1\}\) (notice the \(2\) binary classes are \(y=-1\) and \(y=1\) rather than the more conventional \(y=0\) and \(y=1\); this is simply a matter of convenience), the goal of an SVM is to compute the unique hyperplane in \(\mathbf R^n\) that will separate the \(y=-1\) and \(y=1\) classes as well as possible. For \(n=2\) this can be visualized as trying to find the “widest possible street” between \(2\) neighborhoods:

Mathematically, if the hyperplane decision boundary \(\mathbf w\cdot\mathbf x+b=0\) is defined by a normal vector \(\mathbf w\in\mathbf R^n\) such that the SVM classifies points with \(\mathbf w\cdot\mathbf x+b\geq 0\) as \(y=1\) and \(\mathbf w\cdot\mathbf x+b\leq 0\) as \(y=-1\) (more compactly, \(\hat y_{\text{SVM}}(\mathbf x|\mathbf w,b)=\text{sgn}(\mathbf w\cdot\mathbf x+b)\)), and the overall normalization of \(\mathbf w\) and \(b\) is fixed by stipulating that the “gutters” of the street are defined by the hyperplanes \(\mathbf w\cdot\mathbf x+b=\pm 1\) passing through the support (feature) vectors (drawn as bigger red/green dots in the diagram), then the “width of the street” is the distance between these \(2\) “gutter hyperplanes”, namely \(2/|\mathbf w|\). Since one would like to maximize this margin \(2/|\mathbf w|\), this is equivalent to minimizing the quadratic \(|\mathbf w|^2/2\). Thus, one can formulate this “hard-margin SVM” algorithm as the programming problem:

\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R:\forall i=1,…,N,y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1}\frac{|\mathbf w|^2}{2}\]

where the \(N\) constraints \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1\) (or vectorially \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1\)) are just saying that all \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\) in the training set should lie on the correct side of the decision boundary hyperplane \(\mathbf w\cdot\mathbf x+b=0\).

Problem: What is the glaring problem with a hard-margin SVM? How does a soft-margin SVM help to address this issue?

Solution: The basic problem is that the hard-margin SVM is way too sensitive to outliers, e.g. a single \(y=-1\) feature vector amidst the \(y=1\) neighborhood would cause the hard-margin SVM to be unable to find a solution (because indeed there would be no hyperplane that perfectly separates all the training data into their correct classes). Intuitively, it seems like in such a case the best way to deal with this bias-variance tradeoff is to shrug and ignore the outliers:

Mathematically, the usual way to encode this is by introducing \(N\) “slack variables” \(\xi_1,…,\xi_N\geq 0\) such that the earlier \(N\) constraints are relaxed to \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1-\xi_i\). However, to avoid cutting too much slack (i.e. incurring too much misclassification on the training set), their use should be penalized by the \(\ell^1\)-norm \(|\boldsymbol{\xi}|_1:=\boldsymbol{\xi}\cdot\mathbf{1}=\sum_{i=1}^N\xi_i\). To this effect, the programming problem is modified to that of soft-margin SVM:

\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R,\boldsymbol{\xi}\in\mathbf R^N:\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi},\boldsymbol{\xi}\geq\mathbf 0}\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1\]

where \(C>0\) is a “strictness” hyperparameter that must be selected ahead of time: a large \(C\) penalizes slack heavily (approaching the hard-margin limit), while a small \(C\) tolerates more margin violations in exchange for a wider street.
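
As a minimal illustrative sketch, one can fit such a soft-margin linear SVM on some made-up \(2\)D data with scikit-learn; the C argument of SVC plays exactly the role of the strictness hyperparameter above (a very large C approximates the hard-margin limit). All data and variable names below are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly-separable-ish data: two Gaussian "neighborhoods" in R^2.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])                  # rows are the feature vectors x_i
y = np.array([1] * 50 + [-1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)    # soft-margin linear SVM
print(svm.coef_, svm.intercept_)               # the learned w and b
print(svm.support_)                            # indices of the support vectors
```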

Problem: The above represents the primal formulation of the programming problem in terms of so-called primal variables \(\mathbf w,b,\boldsymbol{\xi}\). Show how the KKT conditions enable one to obtain a corresponding dual formulation of the programming problem in terms of a dual variable \(\boldsymbol{\lambda}\):

\[\text{argmax}_{\boldsymbol{\lambda}\in\mathbf R^N:\boldsymbol{\lambda}\cdot\mathbf y=0,\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1}|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)^TX^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

Solution: The idea is to combine the \(2\) constraints \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi}\) and \(\boldsymbol{\xi}\geq\mathbf 0\) using \(2\) KKT multipliers \(\boldsymbol{\lambda},\boldsymbol{\mu}\) into the Lagrangian:

\[L(\mathbf w,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})=\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1-\boldsymbol{\lambda}\cdot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})-\boldsymbol{\mu}\cdot\boldsymbol{\xi}\]

The KKT necessary conditions assert:

\[\frac{\partial L}{\partial\mathbf w}=\mathbf 0\Rightarrow\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\]

\[\frac{\partial L}{\partial b}=0\Rightarrow\boldsymbol{\lambda}\cdot\mathbf y=0\]

\[\frac{\partial L}{\partial\boldsymbol{\xi}}=\mathbf 0\Rightarrow C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\]

in addition to the \(2\) original constraints on the primal variables (primal feasibility), and dual feasibility \(\boldsymbol{\lambda},\boldsymbol{\mu}\geq\mathbf 0\) on both dual variables (as a corollary, \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)), and complementary slackness \(\boldsymbol{\lambda}\odot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})=\boldsymbol{\mu}\odot\boldsymbol{\xi}=\mathbf 0\). The idea of the support vectors is clearly seen in the \(1^{\text{st}}\) of these complementary slackness conditions.

By substituting the stationary conditions found back into the Lagrangian, one can eliminate all the primal variables (and even eliminate one of the dual variables \(\boldsymbol{\mu}\)) to obtain the claim:

\[L(\boldsymbol{\lambda})=|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)^TX^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

which is a standard quadratic programming exercise with efficient solutions for \(\boldsymbol{\lambda}\).

(Strictly speaking, one should also mention strong duality here: the primal is a convex quadratic program with affine constraints, so Slater’s condition holds, and this is what justifies interchanging the primal \(\min\) and the dual \(\max\).)

Problem: Although soft-margin SVM is a significant improvement over hard-margin SVM, it too has a glaring issue. What is that issue and how can it be addressed?

Solution: Whether hard-margin or soft-margin, SVMs are ultimately linear binary classifiers; they can only directly learn a hyperplane decision boundary. Therefore, if the data is simply linearly inseparable, then even soft-margin SVM would be unwise to apply directly.

However, there is a creative solution; just take the \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) and apply a nonlinear transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) from the current feature space \(\mathbf R^n\) to some higher-dimensional feature space \(\mathbf R^{>n}\) (essentially a kind of feature engineering), thereby obtaining \(N\) transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\). Ideally, if the nonlinear transformation \(\prime\) is well-chosen, then the transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\) will become linearly separable in the higher-dimensional feature space \(\mathbf R^{>n}\), in which case the usual soft-margin SVM may be applied in \(\mathbf R^{>n}\). Then, pulling the SVM (soft) maximal-margin hyperplane in \(\mathbf R^{>n}\) back to the original feature space \(\mathbf R^n\) (i.e. taking its preimage under \(\prime\)), one obtains some nonlinear decision boundary in \(\mathbf R^n\), thereby effectively extending the applicability of the SVM method to linearly inseparable data! See this YouTube video for an example.

Problem: Continuing off the previous problem, suppose one is already in the transformed space with feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\), and one has solved the dual programming problem in this space to obtain some optimal \(\boldsymbol{\lambda}\in\mathbf R^N\). How is binary classification thus performed?

Solution: Based on the previous analysis, the SVM binary classifier in this transformed feature space \(\mathbf R^{>n}\) may be written:

\[\hat y_{\text{SVM}}(\mathbf x’)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_i\mathbf x’\cdot\mathbf x’_i+y_S-\sum_{i=1}^N\lambda_iy_i\mathbf x’_S\cdot\mathbf x’_i\right)\]

where \(\mathbf x’_S\) is any support feature vector. Note by complementary slackness that most \(\lambda_i\) vanish (these correspond to the non-support feature vectors \(\mathbf x_i\)), so the sums \(\sum_{i=1}^N\) may also be written as sums over only the support vectors.

Problem: Explain how the expression above can be simplified by means of the kernel trick.

Solution: Basically, the feature space transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) is a complicated mapping of vectors to vectors. By contrast, only their scalar dot products \(\mathbf x’\cdot\mathbf x’_i\) appear in the SVM classifier \(\hat y_{\text{SVM}}(\mathbf x)\), and scalars are simpler than vectors. The kernel trick essentially invites one to forget about the details of the transformation \(\prime\) and just directly specify an analytical expression for the dot product in the transformed feature space \(K(\mathbf x,\tilde{\mathbf x}):=\mathbf x’\cdot\tilde{\mathbf x}’\), where here \(K:\mathbf R^n\times\mathbf R^n\to\mathbf R\) is called the kernel function. Effectively, this means one never has to actually visit the transformed feature space \(\mathbf R^{>n}\) since the kernel is perfectly happy to just take the feature vectors in the original space \(\mathbf R^n\). Thus, one can write:

\[\hat y_{\text{SVM}}(\mathbf x)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_iK(\mathbf x,\mathbf x_i)+y_S-\sum_{i=1}^N\lambda_iy_iK(\mathbf x_S,\mathbf x_i)\right)\]
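
As a rough sketch, the classifier formula above can be written directly in a few lines of numpy, assuming one already has the solved dual variables \(\boldsymbol{\lambda}\) (e.g. from some QP solver), the labels, the training feature vectors, a kernel function, and the index of some on-margin support vector; all names below are hypothetical placeholders:

```python
import numpy as np

def svm_predict(x, lambdas, y, X_train, kernel, S):
    """Evaluate sgn(sum_i lambda_i y_i K(x, x_i) + b) with b fixed by support vector S."""
    k_x = np.array([kernel(x, x_i) for x_i in X_train])           # K(x, x_i)
    k_S = np.array([kernel(X_train[S], x_i) for x_i in X_train])  # K(x_S, x_i)
    b = y[S] - np.sum(lambdas * y * k_S)                          # bias from the support vector
    return np.sign(np.sum(lambdas * y * k_x) + b)

linear_kernel = lambda a, b: a @ b   # the "no transformation" special case K(x, x~) = x . x~
```

Here the sums run over all \(N\) training points for simplicity; in practice one would restrict them to the support vectors, since the other \(\lambda_i\) vanish.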

Problem: What is the Gaussian (radial basis function) kernel \(K_{\sigma}(\mathbf x,\tilde{\mathbf x})\)? Sketch why it’s a valid kernel function.

Solution: This is by far the most popular kernel for nonlinear SVM classification, involving a hyperparameter \(\sigma\):

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x-\tilde{\mathbf x}|^2/2\sigma^2}\]

algebraically, it is clearly equivalent to:

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}e^{\mathbf x\cdot\tilde{\mathbf x}/\sigma^2}\]

Taylor expanding the last factor:

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}\sum_{d=0}^{\infty}\frac{(\mathbf x\cdot\tilde{\mathbf x})^d}{\sigma^{2d}d!}\]

so it’s clearly a (countably-infinite-dimensional!) inner product because of the presence of the degree-\(d\) polynomial kernels \(K_d(\mathbf x,\tilde{\mathbf x})=(\mathbf x\cdot\tilde{\mathbf x})^d\).
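
A quick numerical sanity check of this factorization/Taylor expansion (with arbitrary made-up vectors and \(\sigma\)):

```python
import numpy as np
from math import factorial

x, x_tilde, sigma = np.array([0.3, -0.7]), np.array([1.1, 0.4]), 1.5

lhs = np.exp(-np.sum((x - x_tilde) ** 2) / (2 * sigma**2))        # Gaussian kernel directly
rhs = np.exp(-x @ x / (2 * sigma**2)) * np.exp(-x_tilde @ x_tilde / (2 * sigma**2)) \
      * sum((x @ x_tilde) ** d / (sigma ** (2 * d) * factorial(d)) for d in range(30))
print(lhs, rhs)                                                   # agree to machine precision
```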

Problem: What were the main problems with SVMs that spurred the deep learning renaissance via the development of artificial neural networks?

Solution: The main issue with SVMs is their lack of scalability. Roughly speaking, the computational cost of training an SVM on a dataset of \(N\) training examples scales like \(O(N^2)\) in the best case (essentially because the Gram/kernel matrix \(X^TX\) is an \(N\times N\) matrix of all pairwise interactions between training examples); even with modern tricks like sequential minimal optimization (SMO), this quadratic bottleneck is hard to avoid. Another, smaller issue is that the kernel (i.e. the feature map) must be hand-designed ahead of time rather than learned from the data, in contrast to deep neural networks which learn their feature representations directly.

Problem: Write a short Python program using the scikit-learn library to train a support vector machine on the Kaggle Titanic competition dataset.

Solution:

titanic
Posted in Blog | Leave a comment

Numerical Computation

Problem: In numerical computation, what are the \(2\) main kinds of rounding error?

Solution: Overflow error (\(N\approx\infty\)) and, perhaps even more dangerous, underflow error (\(\varepsilon\approx 0\)); the two are in some sense inverses of each other:

\[0=\frac{1}{\infty}\]

Problem: Explain how rounding errors can affect evaluation of the softmax activation function and strategies to address it.

Solution: Given a vector \(\mathbf x\in\mathbf R^n\), the corresponding softmax probability \(\ell^1\)-unit vector is \(e^{\mathbf x}/|e^{\mathbf x}|_1\in S^{n-1}\). If one of the components of \(\mathbf x\) is very negative, then the corresponding component of \(e^{\mathbf x}\) can underflow, whereas if it is instead very positive, then overflow becomes a possibility.

To address these numerical stability issues, the trick is to notice that softmax is invariant under “diagonal” translations along \(\mathbf 1\in\mathbf R^n\), i.e. \(\mathbf x\mapsto\mathbf x-\lambda\mathbf 1\) leads to \(e^{\mathbf x-\lambda\mathbf 1}=e^{\mathbf x}\odot e^{-\lambda\mathbf 1}=e^{-\lambda}e^{\mathbf x}\) and hence the \(\ell^1\)-norm \(|e^{\mathbf x-\lambda\mathbf 1}|_1=e^{-\lambda}|e^{\mathbf x}|_1\) is also merely scaled by the same factor \(e^{-\lambda}\). Choosing \(\lambda:=\text{max}(\mathbf x)\) to be the maximum component of \(\mathbf x\) (which coincides with the \(\ell^{\infty}\) norm if all components of \(\mathbf x\) are non-negative) eliminates any overflow error that may have been present in evaluating the numerator \(e^{\mathbf x}\) by instead evaluating \(e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}\). For the same reason, the shifted denominator \(|e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}|_1\) is also immune to overflow error. Furthermore, the denominator is also resistant to underflow error because at least one term in the sum \(|e^{\mathbf x-\lambda\mathbf 1}|_1=\sum_{i}e^{x_i-\lambda}\) will be \(e^0=1\gg 0\), namely the term \(i=\text{argmax}(\mathbf x)\). The only remaining problem is that the modified numerator could still experience an underflow error; this would be bad if one subsequently evaluates the information content (loss function) of a softmax outcome, where taking \(\log(0)=-\infty\) would propagate as a nan. For this case, one can apply a similar trick (the “log-sum-exp” trick) to numerically stabilize the computation of this logarithm.
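
A minimal numpy sketch of the max-subtraction trick just described, together with the analogous log-sum-exp trick for the subsequent logarithm:

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)                    # subtract max(x) * 1 to prevent overflow
    exp = np.exp(shifted)
    return exp / exp.sum()                     # the denominator is >= 1, so it cannot underflow

def log_softmax(x):
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))   # avoids log(0) = -inf from an underflowed numerator

x = np.array([1000.0, 0.0, -1000.0])           # naive exp(x) would overflow/underflow here
print(softmax(x), log_softmax(x))
```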

Problem: Describe the numerical analysis phenomenon of catastrophic cancellation.

Solution: Subtraction \(f(x,y):=x-y\) is ill-conditioned when \(x\approx y\). This is because even if \(x\) and \(y\) have small relative errors, the relative error in their difference \(f(x,y)\) can still be substantial.
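
A tiny double-precision illustration:

```python
# Both inputs carry at most ~1e-16 relative error (machine epsilon),
# yet their difference has roughly 11% relative error.
x = 1.0 + 1e-15
y = 1.0
print(x - y)   # 1.1102230246251565e-15 rather than the exact 1e-15
```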

Problem: What does the condition number of a matrix \(X\) measure?

Solution: The condition number measures how numerically stable it is to invert \(X\mapsto X^{-1}\) (i.e. how much applying \(X^{-1}\) can amplify relative errors). For a normal matrix \(X\) (e.g. Hermitian), it is given by the ratio of the largest to smallest eigenvalues of \(X\) by absolute magnitude:

\[\frac{\text{max}|\text{spec}(X)|}{\text{min}|\text{spec}(X)|}\]

For instance, a non-invertible matrix would have \(\text{min}|\text{spec}(X)|=0\) and thus an infinite condition number. (For a general, not necessarily normal matrix \(X\), the condition number is instead defined as the ratio \(\sigma_{\text{max}}/\sigma_{\text{min}}\) of its largest to smallest singular values, which reduces to the eigenvalue ratio above in the normal case.)
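
For instance, numpy exposes this via np.linalg.cond (which by default computes the singular-value ratio \(\sigma_{\text{max}}/\sigma_{\text{min}}\)):

```python
import numpy as np

well_conditioned = np.array([[2.0, 0.0], [0.0, 1.0]])
ill_conditioned = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])   # nearly singular
print(np.linalg.cond(well_conditioned))   # 2.0
print(np.linalg.cond(ill_conditioned))    # ~4e10: inverting it amplifies rounding errors enormously
```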

Problem: Show that during gradient descent of a scalar field \(f(\mathbf x)\), if at some point \(\mathbf x\) in the optimization landscape one has \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) (e.g. if the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) is positive-definite, or if the gradient \(\frac{\partial f}{\partial\mathbf x}\) is an eigenvector of the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) with positive eigenvalue), then by a line search logic, the most effective local learning rate \(\alpha(\mathbf x)>0\) is given by:

\[\alpha(\mathbf x)=\frac{\left|\frac{\partial f}{\partial\mathbf x}\right|^2}{\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}}\]

Solution: Since gradient descent takes a step \(\mathbf x\mapsto\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\), and the goal is to decrease the value of the function \(f\) as much as possible to get closer to the minimum, one would like to maximize:

\[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\]

as a function of the learning rate \(\alpha>0\). Assuming that a quadratic approximation is valid:

\[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\approx \alpha\left|\frac{\partial f}{\partial\mathbf x}\right|^2-\frac{\alpha^2}{2}\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}\]

which is a downward-opening parabola in \(\alpha\) precisely when \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\); setting its \(\alpha\)-derivative to zero then yields the claimed optimal learning rate \(\alpha(\mathbf x)\).
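
A quick numerical check of this line-search formula on a quadratic bowl \(f(\mathbf x)=\frac{1}{2}\mathbf x^TH\mathbf x\) (for which the quadratic approximation is exact; the \(H\) and \(\mathbf x\) below are arbitrary choices):

```python
import numpy as np

H = np.array([[5.0, 0.0], [0.0, 1.0]])             # positive-definite Hessian
x = np.array([1.0, 2.0])
grad = H @ x                                        # df/dx for this f

alpha = grad @ grad / (grad @ H @ grad)             # the claimed optimal local learning rate

def f(p): return 0.5 * p @ H @ p
alphas = np.linspace(0.0, 0.5, 1001)
decreases = [f(x) - f(x - a * grad) for a in alphas]
print(alpha, alphas[np.argmax(decreases)])          # brute-force best alpha agrees with the formula
```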

Problem: Explain why gradient descent behaves like this in this case. What can be done to address that?

Solution: The problem is that the Hessian is ill-conditioned (in this case the condition number is \(5\) everywhere); the curvature in the \((1,1)\) direction is \(5\times\) greater than the curvature in the \((1,-1)\) direction. Gradient descent wastes time repeatedly descending “canyon walls” because it sees those as the steepest features. A \(2^{\text{nd}}\)-order optimization algorithm (e.g. Newton’s method) that uses information from the Hessian would do better.

Problem: Explain how the Karush-Kuhn-Tucker (KKT) conditions generalize the method of Lagrange multipliers.

Solution: The idea is, in addition to equality constraints like \(x+y=1\), one can also allow for inequality constraints like \(x^2+y^2\leq 1\) (indeed, an equality \(a=b\) is just \(2\) inequalities \(a\leq b\) and \(a\geq b\)). Then, constrained optimization of a scalar field \(f(\mathbf x)\) subject to a bunch of equality constraints \(g_1(\mathbf x)=g_2(\mathbf x)=…=0\) and a bunch of inequality constraints \(h_1(\mathbf x),h_2(\mathbf x),…\leq 0\) is equivalent to unconstrained optimization of the Lagrangian:

\[L(\mathbf x,\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h})=f(\mathbf x)+\boldsymbol{\lambda}_{\mathbf g}\cdot\mathbf g(\mathbf x)+\boldsymbol{\lambda}_{\mathbf h}\cdot\mathbf h(\mathbf x)\]

where \(\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h}\) are called KKT multipliers. The KKT necessary (but not sufficient!) conditions for optimality are then (ignoring technicalities about equating objects of possibly different dimensions):

  1. \(\frac{\partial L}{\partial\mathbf x}=\mathbf 0\) (stationarity)
  2. \(\mathbf g(\mathbf x)=\mathbf 0\geq \mathbf h(\mathbf x)\) (primal feasibility)
  3. \(\boldsymbol{\lambda}_{\mathbf h}\geq 0\) (if the inequalities were instead written as \(\mathbf h(\mathbf x)\geq\mathbf 0\) then this would instead require \(\boldsymbol{\lambda}_{\mathbf h}\leq 0\)) (dual feasibility)
  4. \(\boldsymbol{\lambda}_{\mathbf h}\odot\mathbf h(\mathbf x)=\mathbf 0\) (complementary slackness)
Posted in Blog | Leave a comment

Information Theory & Inference

Problem: Draw a schematic of a binary symmetric channel (BSC) with bit flip probability \(p_f\).

Solution: Classically, one has:

On the other hand, taking a more quantum perspective, in the computational basis \((|0\rangle,|1\rangle)\) one might define a binary symmetric channel as a kind of quantum logic gate:

\[\text{BSC}_{p_f}=\begin{pmatrix}\sqrt{1-p_f}&\sqrt{p_f}\\\sqrt{p_f}&\sqrt{1-p_f}\end{pmatrix}\]

The word “symmetric” in the name emphasizes that the bit flip probability from \(0\to 1\) is identical to the bit flip probability \(1\to 0\).

Problem: Consider the radio communication link between Galileo and Earth, and suppose that the noisy communication channel between them may be modelled as a BSC with bit flip probability \(p_f=0.01\). In a sequence of \(N=10^5\) bit transmissions, what is the probability that the error rate is less than \(1\%\)?

Solution: This means that one requires \(N’<0.01N=10^3\) bit flips. Each such \(N’\) has binomial probability \({{N}\choose{N’}}p_f^{N’}(1-p_f)^{N-N’}\) so the total probability is given by the cumulative distribution function:

\[\sum_{N’=0}^{999}{{N}\choose{N’}}p_f^{N’}(1-p_f)^{N-N’}\]
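
Numerically, this CDF is a scipy one-liner:

```python
from scipy.stats import binom

# P(fewer than 1000 bit flips) over N = 1e5 transmissions with p_f = 0.01
print(binom.cdf(999, 100000, 0.01))   # ~0.49, i.e. roughly a coin flip
```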

Problem: What are the \(2\) solutions for addressing noise in communication channels? Which one is preferable?

Solution:

  1. The physical solution: build more reliable hardware, cooling circuits, etc. (basically all the stuff one learns in a typical experimental physics class).
  2. The system solution: accept the noisy communication channel as is, and add communication systems to it so as to detect and correct errors. This is the goal of coding theory.

The system solution is preferable because it only comes at an increased computational cost whereas the former comes at an increased financial cost. But more importantly, the improvements arising from the system solution can be very dramatic, unlike the incremental improvements of the physical solution.

Problem: What is the distinction between information theory and coding theory?

Solution: The difference between information theory and coding theory is analogous to the distinction between theoretical physics and experimental physics; the former is concerned with the theoretical limitations of the aforementioned communication systems (e.g. “what is the best error-correcting performance possible over the space of all possible error-correcting algorithms?”) whereas the latter is interested in specifically designing practical such algorithms (also called codes) for error correction.

Problem: Draw a schematic that depicts the high-level structure of an error-correcting code.

Solution: Coding = Encoding + Decoding:

Here \(\mathbf s\) is the source message vector which is encoded as a transmitted message vector \(\mathbf t=\mathbf s+\textbf{redundancy}\). After transmission across the noisy communication channel, the transmitted message vector has been distorted into \(\tilde{\mathbf t}=\mathbf t+\textbf{noise}\). This is then decoded and received as a message vector \(\tilde{\mathbf s}\) with the hope that \(\tilde{\mathbf s}=\mathbf s\) with probability close to \(1\).

Problem: Consider a first naive attempt at designing an error-correcting code for a BSC, known as a repetition code. Explain how that works with an example.

Solution: A choice of error-correcting code basically amounts to a choice of encoder + decoder, since again coding = encoding + decoding. For the repetition code, the encoder is literally a repetition of each bit a predefined number of times, say \(3\) in the example below. The corresponding optimal decoder turns out to be the obvious “majority-vote” among the triplets (assuming \(p_f<1/2\); if it were the case that \(p_f>1/2\) then instead a “minority-vote” decoder would be better!)

(aside: “optimal” means the decoder should map each \(\tilde{\mathbf t}\) to \(\tilde{\mathbf s}=\text{argmax}_{\mathbf s}p(\mathbf s|\tilde{\mathbf t})\) and in particular as part of computing this conditional probability, another implicit assumption is that both \(p(0)=p(1)=1/2\) have equal prior probabilities)

Here \(\mathbf n\) is a sparse noise vector, each of whose bits is sampled from a \(p_f\)-Bernoulli distribution, where \(0\) means no bit flip and \(1\) means yes bit flip, and \(\tilde{\mathbf t}\equiv\mathbf t+\mathbf n\pmod{2}\), or using the xor notation, \(\tilde{\mathbf t}=\mathbf t\oplus\mathbf n\).

Problem: Show that, compared to no repetition for which the bit flip probability across the BSC is \(p_f<1/2\), the \(3\)-fold repetition code has an effective bit flip probability \(p^{\text{eff}}_f<p_f\) which is reduced compared to the original bit flip probability \(p_f\).

Solution: In order to get a bit flip, either \(2\) or all \(3\) bits in the \(3\)-fold repetition encoded vector \(\mathbf t\) have to be flipped. Thus, the effective bit flip probability \(p^{\text{eff}}_f\) is a cubic function of \(p_f\):

\[p^{\text{eff}}_f=3p_f^2(1-p_f)+p_f^3=3p_f^2-2p_f^3\]

In order to have \(p^{\text{eff}}_f<p_f\), this requires:

\[3p_f-2p_f^2<1\Rightarrow 2(p_f-1/2)(p_f-1)>0\]

which is satisfied for \(p_f<1/2\).

Problem: Instead of \(N=3\) repetitions, generalize the above expression for the effective bit flip probability \(p^{\text{eff}}_f\) to an arbitrary odd number \(N\) of repetitions.

Solution: The effective bit flip probability \(p^{\text{eff}}_f\) is now a degree-\(N\) polynomial in \(p_f\):

\[p^{\text{eff}}_f=\sum_{n=(N+1)/2}^N{N\choose n}p_f^n(1-p_f)^{N-n}\]

which for \(p_f\ll 0.5\) is dominated by the first term \(n=(N+1)/2\).
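
As a quick sketch, this tail sum is easy to evaluate with scipy for various odd \(N\) (taking \(p_f=0.01\) purely for illustration):

```python
from scipy.stats import binom

def p_eff(p_f, N):
    """Effective bit flip probability of the N-fold repetition code with majority-vote decoding."""
    # at least (N+1)/2 flips out of N; sf(k) = P(X > k) = P(X >= k+1)
    return binom.sf((N - 1) // 2, N, p_f)

p_f = 0.01
for N in (1, 3, 5, 7):
    print(N, p_eff(p_f, N))
# e.g. N=3 reproduces 3 p_f^2 - 2 p_f^3 ~ 3e-4
```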

Problem: Describe how the \((7,4)\) Hamming error correction code works, and explain why it’s an example of a linear block code.

Solution: The idea is that if the source vector \(\mathbf s\) has length \(L_{\mathbf s}\) and the encoded transmitted vector \(\mathbf t\) has some larger length \(L_{\mathbf t}>L_{\mathbf s}\), then one has the “rate” \(L_{\mathbf s}/L_{\mathbf t}=4/7\). Specifically, for every \(4\) source bits \(\mathbf s=s_1s_2s_3s_4\), the transmitted vector will contain \(7\) bits \(\mathbf t=t_1t_2t_3t_4t_5t_6t_7\) such that \(t_{1,2,3,4}:=s_{1,2,3,4}\) and \(t_{5,6,7}\) are uniquely specified by the requirements that \(t_1+t_2+t_3+t_5\equiv t_2+t_3+t_4+t_6\equiv t_1+t_3+t_4+t_7\equiv 0\pmod 2\); in other words, \(t_{5,6,7}\) are said to be parity-check bits in the sense that they check the parity of a certain subset of the source bits \(s_{1,2,3,4}\). This encoding can be depicted visually as such:

where the parity of each of the \(3\) circles has to be even. For example, if \(\mathbf s=1011\), then the Hamming encoding of this would be \(\mathbf t=1011001\).

As for the \((7,4)\) Hamming decoder, for generic \(p_f<1/2\) it can be a bit tricky but if one assumes \(p_f\ll 1/2\), then it is pretty safe to assume that there will be at most \(1\) bit flip during transmission across the noisy binary symmetric communication channel \(\mathbf t\mapsto\tilde{\mathbf t}\). In that case, one can take a given \(\tilde{\mathbf t}\) and compute its syndrome vector \(\mathbf z\); i.e. in each of the \(3\) circles, has the parity remained even \((0)\) or has it become odd \((1)\)? For instance, getting a syndrome \(\mathbf z=000\) means that most likely no bits were flipped and thus there are no errors! Or, if \(\mathbf z=011\), then under the premise that there was only a single bit flip, it must have been \(s_4\), etc. In general, \(\tilde{\mathbf t}\) can have \(8\) possible syndromes \(\mathbf z\), and each such syndrome \(\mathbf z\) maps uniquely onto the \(7\) possible single bit flips together with the possibility of zero bit flips.

The Hamming code is a block code because the parity-check bits look at whole blocks (or subsets) of the source bitstring \(\mathbf s\), rather than one bit of \(\mathbf s\) at a time (as was essentially done in repetition coding). The Hamming code is also said to be a linear code (mod \(2\)) because the encoding operation is linear in the sense that (assuming \(\mathbf s,\mathbf t\) are both column vectors):

\[\mathbf t=G^T\mathbf s\]

where the \(7\times 4\) generator matrix is given by:

\[G^T=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\\1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\]

Or transposing both sides, \(\mathbf t^T=\mathbf s^TG\), and in some coding theory texts one often takes \(\mathbf s,\mathbf t\) to be row vectors by default in which case this would just be written \(\mathbf t=\mathbf sG\). Of course, the columns of \(G^T\) represent the encodings \(\mathbf t\) of the “basis” bitstrings \(\mathbf s=1000,0100,0010,0001\).

Problem: The \((7,4)\) Hamming decoder \(\tilde{\mathbf t}\to\mathbf z\to\tilde{\mathbf s}\) uses a syndrome vector \(\mathbf z\) as an intermediate in the decoding process. Show that the map from \(\tilde{\mathbf t}\to\mathbf z\) is linear in the sense that:

\[\mathbf z=H\tilde{\mathbf t}\]

where the parity-check matrix \(H:=(P,1_3)\), and \(P=\begin{pmatrix}1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\) also appears in the generator matrix via:

\[G^T=\begin{pmatrix}1_4\\P\end{pmatrix}\]

Show that \(HG^T=0\) and hence if there were no errors such that \(\tilde{\mathbf t}=\mathbf t\), then the corresponding syndrome vector \(\mathbf z=\mathbf 0\Leftrightarrow\mathbf t\in\text{ker}H\). Hence, explain why a maximum-likelihood decoder is equivalent to finding the most probable noise vector \(\mathbf{n}\) for which \(\mathbf z=H\mathbf n\).

Solution: Just check it, it’s true. And clearly \(HG^T=2P=0\) because \(2\equiv 0\pmod 2\). It follows then that \(\mathbf t\in\text{ker}H\) because \(\mathbf t=G^T\mathbf s\). Thus, because \(\tilde{\mathbf t}=\mathbf t+\mathbf n\), acting on both sides with \(H\) results in \(\mathbf z=H\mathbf n\).
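
As a minimal numpy sketch (all arithmetic mod \(2\), using exactly the \(G^T\) and \(H\) quoted above, and reusing the \(\mathbf s=1011\) example):

```python
import numpy as np

P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]])
G_T = np.vstack([np.eye(4, dtype=int), P])        # 7x4 generator (transposed)
H = np.hstack([P, np.eye(3, dtype=int)])          # 3x7 parity-check matrix
assert not (H @ G_T % 2).any()                    # H G^T = 0 (mod 2)

def encode(s):
    return G_T @ np.asarray(s) % 2                # t = G^T s

def decode(t_received):
    z = H @ np.asarray(t_received) % 2            # syndrome z = H t~
    t_corrected = np.array(t_received)
    if z.any():                                   # nonzero syndrome: assume a single flip...
        j = np.flatnonzero((H.T == z).all(axis=1))[0]   # ...at the column of H matching z
        t_corrected[j] ^= 1                       # unflip that bit
    return t_corrected[:4]                        # s~ = first 4 bits

s = np.array([1, 0, 1, 1])
t = encode(s)                                     # -> [1 0 1 1 0 0 1], matching the example
t_noisy = t.copy(); t_noisy[5] ^= 1               # a single bit flip during "transmission"
print(t, decode(t_noisy))                         # the decoder recovers s = [1 0 1 1]
```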

Problem: As the \((7,4)\) Hamming code is a block code, given a possibly very long source message, the only way to implement the code is to chop up the message into blocks of length \(L_{\mathbf s}=4\); naturally, one can ask, for any given block in this “assembly line”, what’s the probability \(p(\tilde{\mathbf s}\neq\mathbf s)\) of it being decoded incorrectly? By contrast, a different question one can ask is what is the effective bit flip probability \(p_f^{\text{eff}}\) within one of these blocks? (give both to leading order in \(p_f\))

Solution: First, notice that without the Hamming code (i.e. just transmitting the \(4\) source bits directly), any nonzero number of bit flips leads to a decoding error \(\tilde{\mathbf s}\neq\mathbf s\); as \(p_f\ll 1/2\), the dominant process is having a single bit flip, so in this case \(p(\tilde{\mathbf s}\neq\mathbf s)={4\choose{1}}p_f(1-p_f)^3+…=4p_f+O(p_f^2)\).

By contrast, with the \((7,4)\) Hamming code, the block error probability is no longer linear in \(p_f\), but quadratic. This is simply because the \((7,4)\) Hamming code is designed to correct all \(1\)-bit flip errors, thereby eradicating the dominant process above. Unfortunately, any \(2\) bit flips (implicitly on distinct bits) in the \(7\)-bit encoding \(\mathbf t\) will not be corrected, leading to block error \(\tilde{\mathbf s}\neq\mathbf s\) (identical remarks apply to \(3,4,5,6,7\) bit flips); after all, no matter what syndrome vector \(\mathbf z\) one finds in \(\tilde{\mathbf t}\), the Hamming decoder tells one to unflip at most \(1\) bit because it operates under the assumption that there was only \(1\) bit flip, so it is literally impossible to fully correct a \(\tilde{\mathbf t}\) containing more than \(1\) bit flip. Thus:

\[p(\tilde{\mathbf s}\neq\mathbf s)=\sum_{n=2}^7{7\choose{n}}p_f^n(1-p_f)^{7-n}=21p_f^2+O(p_f^3)\]

By contrast, the effective bit flip probability is in general:

\[p_f^{\text{eff}}=\frac{1}{L_{\mathbf s}}\sum_{i=1}^{L_{\mathbf s}}p(\tilde s_i\neq s_i)\]

But here all \(L_{\mathbf s}=4\) source bits are on the same footing (indeed all \(L_{\mathbf t}=7\) transmitted bits are), so one can focus arbitrarily on e.g. \(i=1\) and ask what is \(p_f^{\text{eff}}=p(\tilde s_1\neq s_1)\). There are then \(2\) distinct ways to obtain the same answer (to leading order); the first is to explicitly enumerate all the possible \(2\)-bit-flip errors in \(\tilde{\mathbf t}\) for which the final decoded source vector \(\tilde{\mathbf s}\) ends up with an error in bit \(1\), i.e. \(\tilde s_1\neq s_1\). Clearly there are \(6\) such combinations in which \(s_1\) itself is flipped together with one of the other \(6\) bits (the decoder, operating under its single-flip assumption, then always flips some \(3^{\text{rd}}\) bit, leaving \(s_1\) in error). Meanwhile, one can check there are also \(3\) other configurations in which \(s_1\) is not initially flipped, but rather the decoder incorrectly flips it when going from \(\tilde{\mathbf t}\to\tilde{\mathbf s}\):

Since each of these are still \(2\) bit flip processes, one thus has:

\[p_f^{\text{eff}}=9p_f^2(1-p_f)^5+…=9p_f^2+O(p_f^3)\]

To leading order, one thus observes that \(p_f^{\text{eff}}=\frac{3}{7}p(\tilde{\mathbf s}\neq\mathbf s)\); the \(3/7\) prefactor has a simple interpretation: whenever there were \(2\) bit flips in \(\tilde{\mathbf t}\), the decoder inevitably ends up with \(3\) flipped bits in the corrected \(7\)-bit word (prior to truncating away the \(3\) parity-check bits), so \(3\) out of its \(7\) bits are in error, and all \(7\) bits are on the same footing. This thus has a very Bayesian interpretation where \(p(\tilde{\mathbf s}\neq\mathbf s)\) is the prior and \(3/7\) is the conditional probability of a bit like \(s_1\) being flipped given that the whole \(7\)-bit block has suffered a \(3\)-bit error.

Problem: How many distinct elements are there in \(\ker H\)? What is the interpretation of such an element?

Solution: There are \(2^{7-3}=16\) elements in \(\ker H\) (namely the \(16\) Hamming codewords themselves, including the all-zeros vector), of which \(15\) are nontrivial. Each such \(\mathbf n\in\ker H\) has the property that its syndrome \(\mathbf z=H\mathbf n=\mathbf 0\) vanishes, so if \(\mathbf n\) were interpreted as a pattern of bit flips/noise, that pattern would go completely undetected. For example:

Problem: The \((7,4)\) Hamming code is not the only one; what other possible \((L_{\mathbf t},L_{\mathbf s})\) Hamming codes are possible?

Solution: Any for which \(L_{\mathbf t}-L_{\mathbf s}=\log_2(L_{\mathbf t}+1)\) (e.g. \((3,1),(7,4),(15,11),(31,26),…\)), the RHS being the number of parity-check bits, which grows merely logarithmically with the total transmission size \(L_{\mathbf t}\)!

Problem: What are the \(2\) main interpretations of probability?

Solution: There are the frequentist and Bayesian interpretations of probability; the former interprets probabilities as long-run event frequencies while the latter interprets probabilities as degrees of belief (this is possible provided the concept of “degree of belief” satisfies certain reasonable Cox axioms). In general, probability theory is simply an extension of logic to account for uncertainty.

Problem: Whenever dealing with multiple random variables \(X,Y,…\), what is the fundamental object to look at? What can be derived from this fundamental object? (cf. the partition function in statistical mechanics as serving the same sort of “all-encompassing” role).

Solution: The fundamental object which one should prioritize figuring out is the joint probability distribution function \(p_{(X,Y,…)}(x,y,…)\) of all the random variables \((X,Y,…)\). Consider a simple visualization of this concept for just \(2\) real-valued random variables \(X, Y\) and the marginal as well as conditional probability distributions one can obtain from their joint probability distribution \(p_{(X,Y)}(x,y)\equiv p(x,y)\):

Problem: A certain disease is present among \(1\%\) of the world’s population, and a medical test for the disease has a \(95\%\) true positive rate and a \(90\%\) true negative rate. If Joe tests positive for the disease, what are his odds of actually having it?

Solution: For these sorts of problems, there is essentially a \(3\)-step recipe to solving them:

  1. Map out the complete marginal distribution of the prior.
  2. For each value of the prior, “unshovel” your way back to the joint distribution w.r.t. all values of the new data.
  3. Hence, compute whatever you need to.

In this case, the prior \(X\) may be modelled as a Bernoulli random variable taking on the value \(x=0\) if a person doesn’t have the disease and \(x=1\) if they do. Following the steps:
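
As a minimal numerical sketch of these \(3\) steps (all numbers taken from the problem statement):

```python
p_disease = 0.01                                   # step 1: marginal (prior) over x
p_pos_given_disease = 0.95                         # step 2: "unshovel" to the joint via
p_pos_given_healthy = 1 - 0.90                     #         the test's likelihoods
joint_pos_disease = p_disease * p_pos_given_disease
joint_pos_healthy = (1 - p_disease) * p_pos_given_healthy
p_disease_given_pos = joint_pos_disease / (joint_pos_disease + joint_pos_healthy)
print(p_disease_given_pos)                         # step 3: ~0.088
```

i.e. despite testing positive, Joe's probability of actually having the disease is only about \(8.8\%\), or odds of roughly \(1:10.4\).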

Problem: Suppose that on a certain online store, products can be rated either positively or negatively. There are \(3\) different sellers of the same product. Seller \(A\) has \(10\) reviews with \(100\%\) positive ratings, seller \(B\) has \(50\) reviews with \(96\%\) positive ratings, and seller \(C\) has \(200\) reviews with \(93\%\) positive ratings. Which seller should one buy from?

Solution: This is another Bayesian inference problem (draw a picture with \(p_+\) on the horizontal axis and \(N_+\) on the vertical axis!). The model is that the seller has some underlying fixed, but unknown probability \(p_+\) of delivering a positive experience and \(1-p_+\) of delivering a negative experience to a user. A priori, if one had no knowledge about the review information of other customers, then the prior \(p_+\) can be modelled by a uniform distribution \(p(p_+)=1\) for \(p_+\in[0,1]\). But then, equipped with the review info of \(N_+\) positive reviews out of \(N\), Bayesian updating says that this uniform prior should be updated to a beta distributed posterior:

\[p(p_+|N_+,N)=\frac{p(p_+)p(N_+,N|p_+)}{\int_0^1dp_+p(p_+)p(N_+,N|p_+)}=\frac{{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}{\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}=\frac{(N+1)!p_+^{N_+}(1-p_+)^{N-N_+}}{N_+!(N-N_+)!}\]

where the beta function \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\) normalization has been (optionally) used.

A reasonable goal would be to maximize one’s own probability of a positive experience. This means taking the posterior \(p(p_+|N_+,N)\) computed above and recycling that as one’s prior \(p(p_+|N_+,N)\mapsto p(p_+)\). Then, the probability \(p(+)\) of a positive experience coincides with the expected value of \(p_+\):

\[p(+)=\int_0^1 dp_+p(p_+)p_+=\frac{\textrm{B}(N_++2,N-N_++1)}{\textrm{B}(N_++1,N-N_++1)}=\frac{N_++1}{N+2}\]

this result is called Laplace’s rule of succession, and may be remembered by taking the original \(N\) reviews, and appending \(2\) new fictitious reviews in which one is positive and one is negative. For instance, if a product had no reviews so that \(N=N_+=0\), then reasonably enough Laplace’s rule of succession predicts one to have a \(p(+)=1/2\) probability of a positive experience. Applying it to each of the sellers in the question, one finds that seller \(B\) has the highest probability \(p(+)\), so one should buy from them.

Remember that Bayesian inference is always predicated on assumptions, which above meant a uniform equiprobable ignorance \(p(p_+)=1\) at the very beginning about \(p_+\), though it has been suggested by Jeffreys, Haldane, etc. to use a different prior like \(p(p_+)=\frac{1}{\pi\sqrt{p_+(1-p_+)}}\) or \(p(p_+)\sim\frac{1}{p_+(1-p_+)}\) to model one’s ignorance. Of course, this would modify how Laplace’s rule of succession looks with each of these priors.
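
A quick numerical check of Laplace's rule of succession applied to the \(3\) sellers (with the uniform prior assumed above):

```python
# (N_+, N) for each seller, as given in the problem statement
sellers = {"A": (10, 10), "B": (48, 50), "C": (186, 200)}
for name, (n_pos, n) in sellers.items():
    print(name, (n_pos + 1) / (n + 2))   # Laplace's rule p(+) = (N_+ + 1)/(N + 2)
# A: 0.917, B: 0.942, C: 0.926  ->  seller B maximizes p(+)
```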

Problem: (for fun!) Prove the earlier beta function relation \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\). Show that in the special case of positive integers \(z,w\in\mathbf Z^+\), the result may be derived combinatorially.

Solution:

(loosely speaking, this is reminiscent of the identity for the exponential function \(\exp(z)\exp(w)=\exp(z+w)\) except for the \(\Gamma\) function there is an “extra factor” in the form of the beta function \(B(z,w)\)). In the special case where \(z,w\in\mathbf Z^+\), the initial temptation would be to expand \((1-x)^{w-1}\) using the binomial theorem, but it turns out there is a much cuter way to evaluate the integral by telling a story.

Imagine seeing a random distribution of \(N+1\) balls on the unit interval \([0,1]\), of which \(1\) of them is pink while the other \(N\) are black. There are \(2\) ways they could’ve gotten there. On the one hand, it may be that there were initially \(N+1\) black balls, and then one of them was painted pink at random, and then the balls were thrown altogether onto \([0,1]\). Or it may be that the \(N+1\) black balls were first thrown altogether onto \([0,1]\), and subsequently one of them was painted pink at random. Clearly, these \(2\) processes are equivalent definitions of a discrete random variable \(N_+\) defined as the number of black balls to the left of the pink ball on \([0,1]\). This has probability mass function \(p(N_+)\) in the first story given by:

\[p(N_+)=\int_0^1dp_+p(\text{pink thrown at }p_+)p(N_+|\text{pink thrown at }p_+)\]

\[=\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}\]

On the other hand, according to the second story, \(N_+=0,1,…,N\) each with uniform probability!

\[p(N_+)=\frac{1}{N+1}\]

Hence it is established that:

\[\int_0^1 dp_+p_+^{N_+}(1-p_+)^{N-N_+}=\frac{N_+!(N-N_+)!}{(N+1)!}\]

which is precisely the earlier, more general result \(\textrm{B}(z,w)=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\) specialized to the positive integers \(z=N_++1\) and \(w=N-N_++1\).

Problem: Given a discrete random variable \(X\) with probability mass function \(p(x)\), define the information content \(I(x)\) associated to drawing outcome \(x\) from \(X\). Hence, define the entropy \(S_X\) of \(X\).

Solution: A picture is worth a thousand words:

where \(I(x):=-\log p(x)\) is the surprise associated with drawing \(x\) from \(X\), and \(S_X:=\sum_xp(x)I(x)=-\sum_x p(x)\log p(x)\) is the average surprise of the discrete random variable \(X\) (with base-\(2\) logarithms this can also be written \(S_X=\sum_x I(x)2^{-I(x)}\)). The choice of base in the \(\log\) defines the unit of information (e.g. bits, nats, etc.).

Problem: Show that for \(X\perp Y\) independent discrete random variables, their joint entropy \(S_{(X,Y)}=S_X+S_Y\).

Solution: This is really rooted in Shannon’s fundamental axioms that the joint information content for independent random variables should be additive \(I_{(X,Y)}(x,y)=I_X(x)+I_Y(y)\):

\[S_{(X,Y)}:=-\sum_{(x,y)}p(x,y)\log p(x,y)=-\sum_{(x,y)}p(x)p(y)\log p(x)-\sum_{(x,y)}p(x)p(y)\log p(y)=S_X+S_Y\]

Problem: Give the intuition behind the Kullback-Leibler divergence \(D_{\text{KL}}(\tilde X|X)\) of a random variable \(\tilde X\) with respect to another random variable \(X\) (both defined over the same sample space).

Solution: Any random variable \(X\) inherently contains some non-negative average level of surprise \(S_X\geq 0\) because of the word “random” in “random variable” (indeed, \(S_X=0\) iff \(p(x)=1\) for some outcome \(x\)). That’s one kind of surprise. But in practice, if one is under the illusion that the outcomes \(x\) are being drawn from a random variable \(\tilde X\) when in fact they are actually following a different random variable \(X\neq\tilde X\), then the perceived degree of surprise \(S_{\tilde X|X}=-\sum_xp_X(x)\log p_{\tilde X}(x)\) (called the cross-entropy of \(\tilde X\) with respect to \(X\)) would intuitively be larger than the average surprise \(S_X\) inherent in the intrinsic randomness of \(X\) itself. The KL divergence of \(\tilde X\) with respect to \(X\) thus measures how surprising the data is purely due to the wrong model \(\tilde X\) being used in lieu of the ground truth \(X\):

\[D_{\textrm{KL}}(\tilde X|X):=S_{\tilde X|X}-S_X\]

thus, it should be intuitively clear that \(D_{\textrm{KL}}(\tilde X|X)\geq 0\), a result known as the Gibbs inequality.
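
A small numpy sketch of entropy, cross-entropy and KL divergence (in bits) on made-up discrete distributions, illustrating the Gibbs inequality:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # adopt the convention 0 log 0 := 0
    return -np.sum(p * np.log2(p))

def cross_entropy(q, p):                  # outcomes drawn from p, surprise scored by the model q
    return -np.sum(np.asarray(p) * np.log2(q))

p = np.array([0.5, 0.25, 0.25])           # ground truth X
q = np.array([0.8, 0.1, 0.1])             # wrong model X~
print(entropy(p))                          # S_X = 1.5 bits
print(cross_entropy(q, p) - entropy(p))    # D_KL(X~|X) >= 0 (Gibbs inequality)
```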

Problem: What is the intuition behind Jensen’s inequality.

Solution: Take any “smiling” curve and hang a bunch of masses on the curve. Then Jensen’s inequality says that their center of mass will also lie above the curve.

More precisely, for a convex function \(f(x)\) (which means any secant line lies above the function’s graph, or algebraically \(f(\lambda x+(1-\lambda)x’)\leq\lambda f(x)+(1-\lambda)f(x’)\) for \(\lambda\in[0,1]\)), one has:

\[\langle f(x)\rangle\geq f(\langle x\rangle)\]

with equality \(\langle f(x)\rangle=f(\langle x\rangle)\) iff \(x\) is (almost surely) constant or \(f\) is linear on the support of \(x\). More explicitly, for any convex linear combination of the form \(p_1x_1+p_2x_2+…+p_Nx_N\) where each \(p_i\geq 0\) and \(\sum_{i=1}^Np_i=1\), one has:

\[\sum_i p_if(x_i)\geq f\left(\sum_ip_ix_i\right)\]

The inequality is reversed for concave functions \(f(x)\).

Problem: Unstable particles are emitted from a source and decay at a distance \(x\geq 0\), an exponentially distributed real random variable with characteristic length \(x_0\). Decay events can only be observed if they occur in a window extending from \(x=1\) to \(x=20\) (arbitrary units). If \(100\) decays are registered at some locations \(1\leq x_1,x_2,…,x_{100}\leq 20\), estimate the characteristic length \(x_0\).

Solution: The prior distribution of \(x\) is, as stated, an exponential decay:

\[p(x|x_0)=\frac{e^{-x/x_0}}{x_0}[x\geq 0]\]

which is normalized \(\int_{-\infty}^{\infty}dxp(x|x_0)=1\). After learning that \(x\in [1,20]\), the posterior distribution of \(x\) becomes:

\[p(x|x\in [1,20],x_0)=\frac{p(x|x_0)p(x\in[1,20]|x, x_0)}{p(x\in[1,20]|x_0)}\]

but if \(x\) is measured, and happens to lie in \(x\in[1,20]\), then \(p(x\in[1,20]|x, x_0)=1\), otherwise if \(x\notin [1,20]\) then \(p(x\in[1,20]|x, x_0)=0\); all in all one has the compact Iverson bracket expression \(p(x\in[1,20]|x, x_0)=[x\in [1,20]]\). The denominator is the integral of the numerator and indeed is just the usual way that probability density functions are meant to be used:

\[p(x\in[1,20]|x_0)=\int_{-\infty}^{\infty}dxp(x|x_0)p(x\in[1,20]|x,x_0)=\int_1^{20}dx\frac{e^{-x/x_0}}{x_0}=e^{-1/x_0}-e^{-20/x_0}\]

Now, using this updated prior, one has:

\[p(\{x_1,…,x_N\}|x\in[1,20],x_0)=p(x_1|x\in[1,20],x_0)…p(x_N|x\in[1,20],x_0)=\frac{e^{-(x_1+…+x_N)/x_0}}{x_0^N(e^{-1/x_0}-e^{-20/x_0})^N}\]

But one would instead like to Bayesian-infer \(p(x_0|\{x_1,…,x_N\},x\in[1,20])\) so as to find the most probable value of the characteristic length \(x_0\):

\[p(x_0|\{x_1,…,x_N\},x\in[1,20])=\frac{p(x_0|x\in[1,20])p(\{x_1,…,x_N\}|x\in[1,20],x_0)}{p(\{x_1,…,x_N\}|x\in[1,20])}\]

Clearly, the only missing puzzle piece in here is the prior distribution \(p(x_0|x\in[1,20])\) on the characteristic length itself. Even without specifying this, one can still gain a lot of insight by plotting \(p(\{x_1,…,x_N\}|x\in[1,20],x_0)\) as a function of \(x_0\), or more precisely, to get the key idea, plot the likelihood \(p(x|x\in[1,20],x_0)\) of \(x_0\) for various fixed \(x\in[1,20]\), and notice in particular a peak in \(x_0\) as \(x\) decreases.
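
As a rough numerical sketch, one can also just maximize the likelihood above over \(x_0\) directly (i.e. find the most probable \(x_0\) under a flat prior); the data below is made up, generated from a true \(x_0=10\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
samples = rng.exponential(scale=10.0, size=2000)          # true x_0 = 10 (unknown in practice)
x = samples[(samples >= 1) & (samples <= 20)][:100]       # keep 100 decays inside the window

def neg_log_likelihood(x0):
    # -log p({x_i} | x in [1,20], x0) = sum(x_i)/x0 + N log x0 + N log(e^{-1/x0} - e^{-20/x0})
    return x.sum() / x0 + len(x) * np.log(x0) + len(x) * np.log(np.exp(-1 / x0) - np.exp(-20 / x0))

result = minimize_scalar(neg_log_likelihood, bounds=(0.1, 1000), method="bounded")
print(result.x)                                            # roughly recovers x_0 ~ 10
```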

Posted in Blog | Leave a comment

SVD, Pseudoinverse, and PCA

Problem: State the form of the singular value decomposition (SVD) of an arbitrary linear operator \(X:\mathbf C^m\to\mathbf C^n\).

Solution: The SVD of \(X\) is given by:

\[X=U_2\Sigma U^{\dagger}_1\]

where \(U_1\in U(m)\) and \(U_2\in U(n)\) are unitary operators and \(\Sigma:\mathbf C^m\to\mathbf C^n\) is a “Hermitian” positive semi-definite diagonal operator. Strictly speaking, one should prove that such a factorization of “rotate-stretch-rotate” really does exist for any \(X\); the details are omitted here.

Problem: In practice, how does one find the SVD of a linear operator \(X\)?

Solution: The idea is to look at the \(2\) Hermitian, positive semi-definite operators constructed from \(X\):

\[XX^{\dagger}=U_2\Sigma\Sigma^{\dagger}U_2^{\dagger}\]

\[X^{\dagger}X=U_1\Sigma^{\dagger}\Sigma U_1^{\dagger}\]

And in particular, by the spectral theorem, both \(XX^{\dagger}\) and \(X^{\dagger}X\) have an orthonormal eigenbasis (of \(\mathbf C^n\), \(\mathbf C^m\) respectively) corresponding to real, non-negative eigenvalues \(\sigma^2_1\geq\sigma^2_2\geq…\geq\sigma^2_{\text{min}(n,m)}\) (padded with zeros for whichever of the two operators is larger), so since the above matches the standard eigendecomposition template, it’s clear that computing the SVD of \(X\) just amounts to diagonalizing \(XX^{\dagger}\) and \(X^{\dagger}X\) (actually just \(1\) has to be diagonalized, namely diagonalize \(XX^{\dagger}\) if \(n\leq m\) and diagonalize \(X^{\dagger}X\) if \(m\leq n\); then, because the squared singular values \(\sigma^2_1,\sigma^2_2,…,\sigma^2_{\text{min}(n,m)}\) are identical for both \(XX^{\dagger}\) and \(X^{\dagger}X\), it’s easy to find the eigenvectors of the other).
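
As a quick numpy sanity check of this recipe (on a random complex \(X\) for illustration), the squared singular values of \(X\) indeed coincide with the eigenvalues of \(XX^{\dagger}\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))   # n=3, m=5

U2, sigma, U1_dagger = np.linalg.svd(X)            # X = U2 @ diag(sigma) @ U1^dagger
eigvals = np.linalg.eigvalsh(X @ X.conj().T)       # X X^dagger is Hermitian PSD
print(np.sort(sigma**2), np.sort(eigvals))         # same numbers, up to ordering/roundoff
```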

Problem: The most important application of SVD is to estimate a complicated linear operator \(X:\mathbf C^m\to\mathbf C^n\) by a low-rank approximation. Explain how this is achieved.

Solution: Order the \(\text{min}(n,m)\) singular values of \(X:\mathbf C^m\to\mathbf C^n\) from greatest to least as before \(\sigma^2_1,\sigma^2_2,…\sigma^2_{\text{min}(n,m)}\). Write the column eigenvectors \(U_1=(\mathbf u_1^{(1)}, \mathbf u_1^{(2)},…,\mathbf u_1^{(m)})\) and \(U_2=(\mathbf u_2^{(1)}, \mathbf u_2^{(2)},…,\mathbf u_2^{(n)})\). Then the singular value decomposition may also be written:

\[X=U_2\Sigma U^{\dagger}_1=\sum_{i=1}^{\text{min}(n,m)}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

In particular, the Eckart-Young theorem states that the best rank-\(r\) approximation to \(X\) (for \(r\leq\text{rank}(X)\)) is given by truncating the above series at its first \(r\) terms:

\[X\approx \sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

Here, the word “best” is with respect to the Frobenius norm:

\[\text{argmin}_{\tilde X\in\mathbf C^{n\times m}:\text{rank}(\tilde X)=r}\sqrt{\text{Tr}((X-\tilde X)^{\dagger}(X-\tilde X))}=\sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

(insert proof here:)
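
Numerically, the truncation itself is a few lines of numpy (random matrix, arbitrary rank \(r\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 40))
U2, sigma, U1_dagger = np.linalg.svd(X, full_matrices=False)

r = 5
X_r = U2[:, :r] @ np.diag(sigma[:r]) @ U1_dagger[:r, :]     # rank-r truncation
print(np.linalg.matrix_rank(X_r))                           # r
print(np.linalg.norm(X - X_r))                              # Frobenius error = sqrt(sum of discarded sigma_i^2)
```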

Problem: Here is another application of SVD: given a diagonal operator \(\Sigma:\mathbf C^m\to\mathbf C^n\), explain how to compute its (Moore-Penrose) pseudoinverse \(\Sigma^+:\mathbf C^n\to\mathbf C^m\). Hence, how can the pseudoinverse of a general linear operator \(X:\mathbf C^m\to\mathbf C^n\) be computed?

Solution: \(\Sigma^+\) is almost like the usual inverse \(\Sigma^{-1}\) in that one simply inverts all the eigenvalues on the diagonal. But the exception is if \(\Sigma\) has a zero eigenvalue…then ordinarily \(\Sigma^{-1}\) would not exist, so for this case the pseudoinverse is created and the instruction is to simply leave the zeros be (rather than trying to invert them and get \(1/0=\infty\)). Finally, also don’t forget to transpose the result. For a general \(X\), the quickest way to compute the pseudoinverse is to use its SVD:

\[X=U_2\Sigma U^{\dagger}_1\]

Ordinarily trying to take the inverse (even if \(n\neq m\)) would appear to give:

\[X^{-1}=U_1\Sigma^{-1}U^{\dagger}_2\]

But since \(\Sigma^{-1}\) may not exist, the pseudoinverse of \(X\) is thus given by:

\[X^+=U_1\Sigma^+U_2^{\dagger}\]

(of course, if \(X^{-1}\) actually exists then the pseudoinverse is in fact not “pseudo” at all \(X^+=X^{-1}\)).

Problem: In general, given a vector \(\mathbf y\in\mathbf C^n\) and linear operator \(X:\mathbf C^m\to\mathbf C^n\), the equation \(\mathbf y=X\mathbf x\) only has a unique solution for \(\mathbf x\in\mathbf C^m\) provided that \(X\) is invertible, i.e. \(\mathbf x=X^{-1}\mathbf y\). If this isn’t the case however, one can still compute the pseudoinverse \(X^+\); what interpretation does \(X^+\mathbf y\in\mathbf C^m\) have?

Solution: There are \(2\) cases; if \(m\geq n\) (the underdetermined case, assuming \(X\) has full row rank), then there will be many solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\), and in particular \(X^+\mathbf y\) is also an element of this solution space (i.e. \(XX^+\mathbf y=\mathbf y\)) but has the additional property that it is closest to the origin:

\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m:\mathbf y=X\mathbf x}|\mathbf x|\]

By contrast, if \(m\leq n\) (the overdetermined case), then generically there will be no solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\) and in that case of course \(X^+\mathbf y\) is also not a solution, but nevertheless it is the closest one can come to an actual solution in the sense that:

\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m}|\mathbf y-X\mathbf x|\]
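
Both cases are easy to illustrate numerically with numpy's pinv (random, generically full-rank matrices assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined (many solutions): X^+ y is the minimum-norm solution of y = X x.
X_wide = rng.standard_normal((3, 6))               # n=3 equations, m=6 unknowns
y = rng.standard_normal(3)
x_min_norm = np.linalg.pinv(X_wide) @ y
print(np.allclose(X_wide @ x_min_norm, y))         # it really solves y = X x

# Overdetermined (no exact solution): X^+ y is the least-squares solution.
X_tall = rng.standard_normal((6, 3))               # n=6 equations, m=3 unknowns
y = rng.standard_normal(6)
x_ls = np.linalg.pinv(X_tall) @ y
print(np.allclose(x_ls, np.linalg.lstsq(X_tall, y, rcond=None)[0]))
```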

Problem: Explain the what and how of conducting principal component analysis (PCA) of a bunch of (say \(m\)) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\).

Solution: Earlier, \(X:\mathbf C^m\to\mathbf C^n\) simply had the interpretation of an arbitrary linear operator, in which case the interpretation of \(XX^{\dagger}\) and \(X^{\dagger}X\) were both likewise arbitrary. However, one intuitive interpretation is to view \(X\) as a data matrix; if one takes the \(m\) columns of \(X=(\mathbf x_1,…,\mathbf x_m)\) to represent the \(m\) feature vectors in \(\mathbf C^n\), then the Gram matrix \(X^{\dagger}X\) takes on a statistical interpretation as an autocorrelation matrix:

\[X^{\dagger}X=
\begin{pmatrix} |\mathbf{x}_1|^2 & \cdots & \mathbf{x}_1^{\dagger}\mathbf{x}_m \\
\vdots & \ddots & \vdots \\
\mathbf{x}_m^{\dagger}\mathbf{x}_1 & \cdots & |\mathbf{x}_m|^2
\end{pmatrix}\]

or alternatively, if \(X\mapsto X-\boldsymbol{\mu}\otimes\mathbf{1}_m\) has already been mean-subtracted (where \(\boldsymbol{\mu}=\sum_{i=1}^m\mathbf x_i/m\)) so as to have zero mean, then \(X^{\dagger}X\) is the Gram matrix of the centered data while \(XX^{\dagger}/m\) is its covariance matrix. Based on the earlier discussion of SVD, it follows that the columns of \(U_2\in U(n)\) (the orthonormal eigenvectors of \(XX^{\dagger}\)) give the principal axes of this covariance matrix, and that the corresponding singular values in \(\Sigma\) measure (up to an overall \(1/\sqrt m\) normalization) the standard deviation of the data along those principal axes (as variance is commonly denoted \(\sigma^2\), this also explains the choice of notation in the SVD). Note that in such applications, one is almost always in the regime \(m\gg n\) so that there will be \(n\) singular values \(\sigma_1,…,\sigma_n\). Together, each principal axis with its corresponding singular value is said to be a principal component (PC) of the data \(X\), hence the name. The “analysis” part comes from analyzing how the total variance \(\sum_{i=1}^n\sigma_i^2\) (up to the same normalization) of the data \(X\) is explained by each PC; specifically the fraction of it explained by the \(j^{\text{th}}\) PC is \(\sigma_j^2/\sum_{i=1}^n\sigma_i^2\). If the singular values are ordered \(\sigma_1\geq…\geq\sigma_n\), then PCA also provides a lossy data compression algorithm: in analogy to the low-rank approximation application of SVD, one can pick some \(r<n\) and project all \(m\) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\) onto the \(r\)-dimensional subspace spanned by the first \(r\) principal axes associated to the largest \(r\) singular values \(\sigma_1,…,\sigma_r\).

If the data matrix had instead contained the feature vectors as its rows, \(X=(\mathbf x_1,…,\mathbf x_m)^{\dagger}\), then one would repeat the above discussion with the roles of \(XX^{\dagger}\) and \(X^{\dagger}X\) (and correspondingly of \(U_2\) and \(U_1\)) interchanged.
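
Putting this together, a minimal PCA-via-SVD sketch in numpy, following the column convention above (synthetic \(2\)-dimensional data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 500
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.0], [1.0, 1.0]], size=m).T   # n x m data matrix

X_centered = X - X.mean(axis=1, keepdims=True)          # subtract the mean feature vector mu
U2, sigma, U1_dagger = np.linalg.svd(X_centered, full_matrices=False)

print(U2)                                 # columns = principal axes (eigenvectors of X X^dagger)
print(sigma**2 / np.sum(sigma**2))        # fraction of total variance explained by each PC

r = 1                                     # lossy compression: keep only the leading r axes
scores = U2[:, :r].T @ X_centered         # r x m compressed representation
X_reconstructed = U2[:, :r] @ scores + X.mean(axis=1, keepdims=True)
```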

Posted in Blog | Leave a comment