Problem: Let \(|\mathcal V|,N_c\in\mathbf Z^+\) be positive integers (where \(|\mathcal V|\) is the cardinality of an arbitrary set \(\mathcal V\) called the vocabulary and \(N_c\) will come to be seen as the number of codebooks), and let \(\mathcal T\) be a (finite or infinite) set whose elements are called tokens. What does it mean for a function \(\mathbf i\) to be a \((|\mathcal V|,N_c)\)-tokenizer on the token space \(\mathcal T\)? Give some examples of tokenizers.
Solution: It means that \(\mathbf i:\mathcal T\to\{1,…,|\mathcal V|\}^{N_c}\) quantizes each abstract token \(\tau\in\mathcal T\) as some concrete \(N_c\)-tuple of integers \(\mathbf i(\tau):=(i_1(\tau),…,i_{N_c}(\tau))\) where each of these \(N_c\) token IDs \(i_1(\tau),…,i_{N_c}(\tau)\in\mathbf N\) associated to the token \(\tau\in\mathcal T\) ranges from \(1\) to the vocabulary size \(|\mathcal V|\).
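To make the definition concrete, here is a minimal Python sketch (the names are illustrative, not from any particular library) of a \((|\mathcal V|,1)\)-tokenizer implemented as a plain dictionary lookup, the simplest case of the examples below:

```python
# A (|V|, N_c)-tokenizer is any function mapping a token tau in T
# to an N_c-tuple of IDs, each in {1, ..., |V|}.
from typing import Callable, Hashable, Tuple

Token = Hashable
Tokenizer = Callable[[Token], Tuple[int, ...]]  # returns an N_c-tuple of IDs

def make_lookup_tokenizer(vocab: list) -> Tokenizer:
    """Trivial (|V|, 1)-tokenizer: take V := T, a single codebook (N_c = 1)."""
    id_of = {tok: idx + 1 for idx, tok in enumerate(vocab)}  # IDs range over 1..|V|
    return lambda tau: (id_of[tau],)

# Example: a toy vocabulary of natural-language tokens.
i = make_lookup_tokenizer(["a", "b", "th", "uh", "@", "&", "."])
assert i("th") == (3,)
```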
- For \(\mathcal T:=\{a, b, …, z, @, \&, ., uh, th, …\}\) the set of natural language tokens (could be finite or infinite depending on how one defines \(\mathcal T\)), tokenization \(\mathbf i\) is typically done via a single (\(N_c=1\)) lookup inside a dictionary (“codebook/vocabulary”) of size \(|\mathcal V|\sim 10^5\) (indeed, one can simply take \(\mathcal V:=\mathcal T\)).
- For \(\mathcal T\) the (infinite) set of \(20\text{ ms}\) Nyquist-downsampled audio waveform tokens, tokenization \(\mathbf i\) might be implemented by first passing an audio token \(\tau\in\mathcal T\) into some CNN encoder \(\mathbf e_{\text{CNN}}:\mathcal T\to\mathbf R^{d}\), thereby obtaining a hidden latent vector \(\mathbf e_{\text{CNN}}(\tau)\in\mathbf R^{d}\) of some finite dimension \(d\), followed by residual vector quantization \(\text{RVQ}:\mathbf R^{d}\to\{1,…,|\mathcal V|\}^{N_c}\) of \(\mathbf e_{\text{CNN}}(\tau)\) through a sequence of \(N_c\) ordered codebooks \(\mathcal V_1,…,\mathcal V_{N_c}\subset\mathbf R^{d}\), each containing the same number \(|\mathcal V|:=|\mathcal V_1|=…=|\mathcal V_{N_c}|\) of vectors. Writing \(\mathbf e_i^{(c)}\) for the \(i\)-th vector of codebook \(\mathcal V_c\), each codebook greedily quantizes the residual left over by the previous ones (a code sketch follows below):
\[\mathbf r_0:=\mathbf e,\qquad i_c(\mathbf e):=\text{argmin}_{1\leq i\leq |\mathcal V|}|\mathbf r_{c-1}-\mathbf e_i^{(c)}|,\qquad \mathbf r_c:=\mathbf r_{c-1}-\mathbf e_{i_c(\mathbf e)}^{(c)},\qquad \text{RVQ}(\mathbf e):=(i_1(\mathbf e),…,i_{N_c}(\mathbf e))\]
(one could also consider codebooks of different sizes, though conceptually that would simply change the range of the tokenizer to \(\mathbf i:\mathcal T\to\{1,…,|\mathcal V_1|\}\times…\times\{1,…,|\mathcal V_{N_c}|\}\)).
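A minimal NumPy sketch of this RVQ step, under the assumption that the latent dimension \(d\) and the codebooks are already given (random placeholders below; in a trained codec the encoder and codebooks are learned):

```python
import numpy as np

def rvq(e: np.ndarray, codebooks: list[np.ndarray]) -> tuple[int, ...]:
    """Residual vector quantization of a latent vector e in R^d.

    codebooks: list of N_c arrays, each of shape (|V|, d); codebook c
    quantizes the residual left over by codebooks 1, ..., c-1.
    Returned IDs lie in {1, ..., |V|}, matching the convention above.
    """
    residual = e.copy()
    ids = []
    for cb in codebooks:                               # c = 1, ..., N_c
        i = int(np.argmin(np.linalg.norm(residual - cb, axis=1)))
        ids.append(i + 1)                              # 1-based token ID
        residual = residual - cb[i]                    # pass the residual onward
    return tuple(ids)

# Toy usage with random (untrained) codebooks: d = 8, N_c = 3, |V| = 16.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]
token_ids = rvq(rng.normal(size=8), codebooks)         # an N_c = 3 tuple of IDs in {1, ..., 16}
```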
Problem: Explain the transformer architecture, as first described in “Attention Is All You Need”, purely from the inference perspective of a forward pass through a model which is already well-trained (by articulating this, one also pins down what the objective of the model is, which in turn informs the choice of cost function during training).
Solution: Note that the tokenization process described above is not part of the transformer; it is just pre-processing of natural language into a format suitable for input to the transformer. The forward pass then proceeds as follows:
- (One-Hot Encoding) Take the sequence of token IDs from tokenization, and one-hot encode them with respect to a vocabulary of size \(d_V\). Note this step is parameter-free, i.e. for a fixed vocabulary, there’s nothing to be learned here.
- (Embedding) Take each one-hot encoded vector \(\hat{\mathbf e}\in\{0,1\}^{d_V}\), and multiply it by the transformer’s (learnable) embedding matrix \(W_E\in\mathbf R^{d_E\times d_V}\), thereby obtaining an embedding vector \(\hat{\mathbf e}\mapsto\mathbf x:=W_E\hat{\mathbf e}\in\mathbf R^{d_E}\). After pre-training the transformer, \(W_E\) should ideally encode a latent-space representation of the vocabulary in which semantically similar words are close together and certain directions in the embedding space convey certain concepts.
- (Positional Encoding) Add a position-dependent vector to each embedding in a way that signals where in the input natural language prompt the corresponding token appears (e.g. the fixed sinusoidal encoding of the original paper).
- (Single Self-Attention Head) For each positionally encoded embedding vector \(\mathbf x_i\in\mathbf R^{n_e}\) (writing \(n_e:=d_E\) for the embedding dimension), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\), and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it (this addition is the skip/residual connection typical of ResNets) as a convex combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context, specifically:
\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]
where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are the key and value matrices associated to the input context (filled with column vectors here rather than the ML convention of row vectors). In the decoder of the original paper the softmax is additionally causally masked, so that position \(i\) only attends to positions \(j\leq i\); this is what makes left-to-right next-token prediction well-defined. This map, which takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x’_i=\mathbf x_i+\Delta\mathbf x_i\), is called a head of self-attention. The \(1/\sqrt{n_{qk}}\) scaling of the softmax logits is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) has mean \(0\) and variance \(n_{qk}\), hence the normalization by \(\sqrt{n_{qk}}\) ensures \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\).
- (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel across several heads of self-attention; each head proposes its own displacement update to each of the \(N\) original embeddings \(\mathbf x_i\), and all of these updates are added to \(\mathbf x_i\).
- (Multilayer Perceptron) Each contextualized embedding is then passed, independently and in parallel, through a two-layer feed-forward network: Linear, ReLU, Linear, whose output is again added back via a skip connection. It is hypothesized that much of the factual knowledge a model memorizes is stored in these MLP blocks.
- (Layers) Alternate between multi-headed self-attention blocks and MLP blocks for several layers. Finally, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x’_N\) of the last token \(\tau_N\) in the context: apply an unembedding matrix \(W_{\mathbf u}\) to obtain a vector of logits \(\mathbf u=W_{\mathbf u}\mathbf x’_N\), then run it through a softmax to get a probability distribution \(\text{softmax}(\mathbf u)\) over the vocabulary (a code sketch of this forward pass follows below).
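Putting these steps together, here is a minimal NumPy sketch of a single-layer, single-head forward pass, with randomly initialized weights standing in for trained ones; it omits layer normalization, multiple heads/layers, and the multi-head output projection, and uses the row-vector convention (one token embedding per row of \(X\)) rather than the column-vector convention of the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_V, d_E, n_qk, d_mlp, N = 50, 16, 8, 64, 5      # toy sizes

# Learnable parameters (random here; trained in a real model).
W_E = rng.normal(size=(d_E, d_V)) * 0.02         # embedding
W_q = rng.normal(size=(n_qk, d_E)) * 0.02
W_k = rng.normal(size=(n_qk, d_E)) * 0.02
W_v = rng.normal(size=(d_E, d_E)) * 0.02         # value matrix (full-rank here)
W_1 = rng.normal(size=(d_mlp, d_E)) * 0.02       # MLP up-projection
W_2 = rng.normal(size=(d_E, d_mlp)) * 0.02       # MLP down-projection
W_u = rng.normal(size=(d_V, d_E)) * 0.02         # unembedding

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def positional_encoding(N, d):
    pos = np.arange(N)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / 10000 ** (2 * (i // 2) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (N, d)

token_ids = rng.integers(1, d_V + 1, size=N)     # pretend tokenizer output, IDs in 1..d_V

# One-hot encode and embed, then add sinusoidal positional encodings.
one_hot = np.eye(d_V)[token_ids - 1]             # (N, d_V)
X = one_hot @ W_E.T + positional_encoding(N, d_E)    # (N, d_E), rows are the x_i

# Single self-attention head with causal mask, applied as a residual update.
Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T        # (N, n_qk), (N, n_qk), (N, d_E)
logits = Q @ K.T / np.sqrt(n_qk)                 # (N, N) attention logits
mask = np.triu(np.full((N, N), -np.inf), k=1)    # position i attends only to j <= i
X = X + softmax(logits + mask) @ V               # x_i -> x_i + Delta x_i

# MLP block (Linear, ReLU, Linear), also as a residual update.
X = X + np.maximum(X @ W_1.T, 0) @ W_2.T

# Next-token distribution from the last position's contextualized embedding.
p_next = softmax(W_u @ X[-1])                    # (d_V,) vector summing to 1
print(p_next.argmax() + 1)                       # most likely next token ID
```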
Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).
Solution: Essentially, an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token. Seeding it with some prompt text \(T\), the LLM samples from the probability distribution it produces for the next token \(\tau\) and appends it, \(T\mapsto T+\tau\). Then it simply repeats this, pretending that \(T+\tau\) was the seed all along. This is how generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.
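A minimal sketch of this sampling loop, assuming a hypothetical `next_token_distribution` function (e.g. a wrapper around the forward pass sketched above, returning a NumPy probability vector over the vocabulary) and 1-based token IDs:

```python
import numpy as np

rng = np.random.default_rng()

def generate(prompt_ids: list[int], next_token_distribution,
             max_new_tokens: int = 50, temperature: float = 1.0) -> list[int]:
    """Autoregressive sampling: repeatedly predict, sample, and append.

    next_token_distribution: hypothetical callable mapping a list of token IDs
    to a probability vector over the vocabulary (not a real library function).
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        p = next_token_distribution(ids)             # distribution over the vocabulary
        p = p ** (1.0 / temperature)                 # optional temperature reshaping
        p = p / p.sum()
        nxt = int(rng.choice(len(p), p=p)) + 1       # sample a 1-based token ID
        ids.append(nxt)                              # pretend T + tau was the seed
    return ids
```

In practice one would also stop once an end-of-sequence token is sampled and prepend the token IDs of the system prompt to `prompt_ids`.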