Problem: Distinguish between a supervised data set and unsupervised data set.
Solution: A supervised data set is of the form \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\). An unsupervised data set is obtained by stripping away the target labels \(y_1,…,y_N\), leaving behind just the feature vectors \(\{\textbf x_1,…,\textbf x_N\}\).
Problem: \(k\)-means clustering is an unsupervised learning algorithm (not to be confused with \(k\)-nearest neighbors which is a supervised learning algorithm) which arguably is easier to explain in words than with math. Hence, explain it in words.
Solution: Start with an unsupervised data set \(\textbf x_1,…,\textbf x_N\in\textbf R^n\) of feature vectors, and initialize some number \(k\leq N\) of “cluster centroids” \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\in\textbf R^n\). Then, repeat the following \(2\)-step algorithm until the centroids stop moving (a code sketch follows the list below):
- Partition the unsupervised data set of feature vectors \(\{\textbf x_1,…,\textbf x_N\}\) into \(L^2\)-Voronoi clusters with respect to the cluster centroids \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\).
- Update the location of each cluster centroid \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\) to the mean of the feature vectors in its \(L^2\)-Voronoi cluster.
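Below is a minimal NumPy sketch of this \(2\)-step loop. The function name `kmeans`, the fixed iteration cap, and the initialization at randomly chosen feature vectors are illustrative choices, not part of the definition above.

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """Minimal k-means sketch. X has shape (N, n): the feature vectors x_1, ..., x_N."""
    rng = np.random.default_rng(rng)
    # Initialize the k cluster centroids at randomly chosen (distinct) feature vectors.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: L2-Voronoi partition -- assign each x_i to its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (N, k)
        labels = d2.argmin(axis=1)
        # Step 2: move each centroid to the mean of the feature vectors in its cluster.
        mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                       for j in range(k)])
    # Distortion cost C (see the next problem): mean squared distance to the assigned centroid.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    distortion = d2[np.arange(len(X)), labels].mean()
    return mu, labels, distortion
```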
Problem: What is the so-called distortion cost function \(C\) that \(k\)-means clustering seeks to minimize? Intuitively, why does it work?
Solution: There are \(2\) distinct theorems one can prove, which justify respectively the \(2\) steps of the \(k\)-means clustering algorithm described above. Informally, in the space \(\textbf R^n\) one has \(N+k\) characters, namely the \(N\) feature vectors \(\textbf x_1,…,\textbf x_N\) and the \(k\) cluster centroids \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\).
Theorem #\(1\): Fix both the locations of the \(N\) “households” \(\textbf x_1,…,\textbf x_N\) and the \(k\) “wells” \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\). Define the distortion cost functional \(C[\boldsymbol{\mu}^*]\) for a choice of household-to-well allocation \(\boldsymbol{\mu}^*:\{\textbf x_1,…,\textbf x_N\}\to\{\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\}\) by the formula:
\[C[\boldsymbol{\mu}^*]:=\frac{1}{N}\sum_{i=1}^N|\textbf x_i-\boldsymbol{\mu}^*(\textbf x_i)|^2\]
Then \(\delta C/\delta\boldsymbol{\mu}^*=0\) on the allocation:
\[\boldsymbol{\mu}^*(\textbf x):=\text{argmin}_{\boldsymbol{\mu}\in\{\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\}}|\textbf x-\boldsymbol\mu|\]
Theorem #\(2\): Fix only the locations of the \(N\) “households” \(\textbf x_1,…,\textbf x_N\), and fix the allocation \(\boldsymbol{\mu}^*\) from Theorem #\(1\). Define the distortion cost function:
\[C(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k):=\frac{1}{N}\sum_{j=1}^k\sum_{\textbf x\in\boldsymbol{\mu}^{*-1}(\boldsymbol{\mu}_j)}|\textbf x-\boldsymbol{\mu}_j|^2\]
Then, if a contractor were allowed to come and relocate the \(k\) “wells” \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\mapsto\boldsymbol{\mu}'_1,…,\boldsymbol{\mu}'_k\), then \(\partial C/\partial(\boldsymbol{\mu}'_1,…,\boldsymbol{\mu}'_k)=\textbf 0\) for:
\[\boldsymbol{\mu}'_j:=\frac{1}{\#\boldsymbol{\mu}^{*-1}(\boldsymbol{\mu}_j)}\sum_{\textbf x\in\boldsymbol{\mu}^{*-1}(\boldsymbol{\mu}_j)}\textbf x\]
where \(1\leq j\leq k\).
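For instance, Theorem #\(2\) is just elementary calculus: holding the allocation \(\boldsymbol{\mu}^*\) fixed and differentiating the quadratic cost with respect to the relocated well \(\boldsymbol{\mu}'_j\),
\[\frac{\partial C}{\partial\boldsymbol{\mu}'_j}=\frac{2}{N}\sum_{\textbf x\in\boldsymbol{\mu}^{*-1}(\boldsymbol{\mu}_j)}(\boldsymbol{\mu}'_j-\textbf x)=\textbf 0\iff\boldsymbol{\mu}'_j=\frac{1}{\#\boldsymbol{\mu}^{*-1}(\boldsymbol{\mu}_j)}\sum_{\textbf x\in\boldsymbol{\mu}^{*-1}(\boldsymbol{\mu}_j)}\textbf x\]
and since \(C\) is a convex quadratic in \(\boldsymbol{\mu}'_j\), this critical point is indeed the minimum.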
Problem: How should the cluster centroids \(\boldsymbol{\mu}_1,…,\boldsymbol{\mu}_k\) be initialized?
Solution: Do \(50\to 1000\) random initializations of the \(k\) means at randomly chosen feature vectors, and run the algorithm to convergence each time. Each run will give some configuration of clusters with an associated minimized distortion cost \(C_*\). Then, just keep the clustering with the lowest \(C_*\).
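A sketch of this multiple-restart strategy, reusing the illustrative `kmeans` helper sketched earlier:

```python
import numpy as np

def kmeans_best_of(X, k, n_restarts=100, rng=None):
    """Run k-means n_restarts times from random initializations; keep the lowest-distortion run."""
    rng = np.random.default_rng(rng)
    best = None
    for _ in range(n_restarts):                      # e.g. anywhere from 50 to 1000 restarts
        mu, labels, distortion = kmeans(X, k, rng=rng)
        if best is None or distortion < best[2]:
            best = (mu, labels, distortion)          # the clustering with the lowest C_*
    return best
```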
Problem: How should the number of clusters \(k\in\textbf Z^+\) be selected?
Solution: This is more of an art than a science. One heuristic is to plot the minimized distortion cost \(C_*\) against \(k\) and look for the “elbow”; alternatively, one can just vary \(k\) and, for each clustering that results, empirically see which one is best for whatever downstream application one has in mind.
Problem: Describe how anomaly detection works.
Solution: Given a bunch of unlabelled feature vectors \(\textbf x_1,…,\textbf x_N\in\textbf R^n\) each with \(n\) scalar features, the idea is to assume that they are i.i.d. draws from an underlying normal random vector \(\textbf x\) with mean \(\boldsymbol{\mu}\) and covariance matrix \(\sigma^2\), i.e.
\[\rho(\textbf x)=\frac{1}{(2\pi)^{n/2}\det\sigma}\exp\left(-\frac{1}{2}(\textbf x-\boldsymbol{\mu})^T\sigma^{-2}(\textbf x-\boldsymbol{\mu})\right)\]
Thus, for a given probability density threshold \(\varepsilon>0\), the isosurface \(\rho(\textbf x)=\varepsilon\) in \(\textbf R^n\) defines an ellipsoid centered at \(\boldsymbol{\mu}\). If a draw of \(\textbf x\) lies outside of this ellipsoid so that \(\rho(\textbf x)<\varepsilon\), then one would classify or “detect” this \(\textbf x\) as an “anomaly”.
Note that, typically, if the features are uncorrelated, then \(\sigma\) may be taken diagonal, which is equivalent to modelling the \(n\) scalar features as independent univariate normal random variables. If this is not initially the case, one can often artificially enforce it by performing some suitable transformation of the scalar features to make them “more Gaussian”. As usual, the techniques of error analysis and feature engineering apply here.
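A minimal sketch of the diagonal-covariance case (i.e. modelling the \(n\) features as independent Gaussians); the function names and the boolean-mask return value are illustrative:

```python
import numpy as np

def fit_gaussian(X):
    """Fit a diagonal-covariance Gaussian to unlabelled feature vectors X of shape (N, n)."""
    return X.mean(axis=0), X.var(axis=0)             # per-feature means and variances

def density(X, mu, var):
    """Evaluate the factorized Gaussian density rho(x) for each row of X."""
    norm = np.prod(np.sqrt(2 * np.pi * var))         # (2*pi)^(n/2) * det(sigma)
    return np.exp(-0.5 * ((X - mu) ** 2 / var).sum(axis=1)) / norm

def detect_anomalies(X, mu, var, eps):
    return density(X, mu, var) < eps                 # True exactly where rho(x) < epsilon
```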
Problem: How should the probability density threshold \(\varepsilon\) be chosen in anomaly detection?
Solution: By definition of an “anomaly”, it is typically the case that the vast majority of draws from \(\textbf x\) will be “normal”, i.e. not anomalous. Therefore, any given training set \(\textbf x_1,…,\textbf x_N\) will likely be highly skewed. For this reason, it is often a good idea to choose \(\varepsilon\) by trial-and-error, maximizing the \(F_1\)-score on a (labelled) cross-validation set as a function of \(\varepsilon\).
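A sketch of this trial-and-error search, assuming one has precomputed densities `p_cv` \(=\rho(\textbf x)\) and binary labels `y_cv` (\(1\) for anomalies) on the cross-validation set; the uniform grid of candidate thresholds is an illustrative choice:

```python
import numpy as np

def choose_epsilon(p_cv, y_cv, n_candidates=1000):
    """Pick the density threshold epsilon that maximizes the F1-score on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_candidates):
        pred = (p_cv < eps).astype(int)              # predicted anomalies
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                                  # precision/recall undefined; skip
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```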
Problem: The objective of anomaly detection is essentially identical to that of supervised binary classification; anomalous examples might be labelled \(1\) whereas normal examples are labelled \(0\). So when should one use anomaly detection vs. supervised binary classification?
Solution: One key idea lies again in the definition of the word “anomaly” as suggesting a deviation from the norm which is furthermore rare. Therefore, the idea is as follows: suppose one were initially planning to do supervised binary classification. Then, because this is supervised learning, in addition to the feature vectors, each training example would also need to be accompanied by its target label \(y\in\{0,1\}\). If one goes ahead and manually labels each feature vector \(\textbf x\) by its corresponding \(y\), and one finds that the number of \(y=0\) labels vastly outnumbers the \(y=1\) labels (or vice versa), then this suggests that it may be more prudent to interpret this as an anomaly detection problem rather than as supervised binary classification.
Another useful cue lies in asking the question “how many ways are there to be anomalous?”. If there are many ways by which an example could be anomalous, and one isn’t interested in exactly which way but only to have a large “wastebasket” to collect all anomalous examples…well then as one may anticipate, anomaly detection is the way to go.
Problem: Explain what the objective of collaborative filtering is.
Solution: The objective is loosely similar to filling out a Sudoku grid or, in a pure math context, to predicting missing entries of a matrix. The context in which collaborative filtering is typically discussed is recommender systems: one might have a matrix \(Y_{mu}\), with \(1\leq m\leq N_m\) indexing \(N_m\) movies and \(1\leq u\leq N_u\) indexing \(N_u\) users, where some entries of this \(N_m\times N_u\) matrix are filled with movie ratings (e.g. from \(0\to 5\)) given by certain users. Since not all users have rated all movies yet (because they haven’t watched them), the goal is to predict what they would rate each movie, hence whether they would like it or not, hence whether it should be recommended to them. Simply put, “filtering” and “recommending” are synonymous in this context, so actually I personally think a better name would have been “collaborative recommending”.
Problem: Write down the collaborative filtering cost function \(C(\textbf w_1,…,\textbf w_{N_u},b_1,…,b_{N_u},\textbf x_1,…,\textbf x_{N_m}|Y, \lambda)\).
Solution: Essentially, one just uses a (bi)linear regression model \(\hat Y_{mu}=\textbf w_u\cdot\textbf x_m+b_u\):
\[C(\textbf w_1,…,\textbf w_{N_u},b_1,…,b_{N_u},\textbf x_1,…,\textbf x_{N_m}|Y, \lambda)=\frac{1}{2}\sum_{u=1}^{N_u}\sum_{m=1}^{N_m}(\textbf w_u\cdot\textbf x_m+b_u-Y_{mu})^2+\frac{\lambda}{2}\left(\sum_{u=1}^{N_u}|\textbf w_u|^2+\sum_{m=1}^{N_m}|\textbf x_m|^2\right)\]
But note that, in the double sum, since \(Y_{mu}\) is only defined for movies \(m\) which user \(u\) has actually rated, the sum should only be taken over the matrix elements for which \(Y_{mu}\) exists.
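A NumPy sketch of this cost, with a boolean `mask` marking which entries \(Y_{mu}\) exist; the array names, shapes, and shared embedding dimension \(d\) are illustrative:

```python
import numpy as np

def collab_cost(W, b, X, Y, mask, lam):
    """Collaborative filtering cost.
    W: (N_u, d) user weight vectors, b: (N_u,) user biases,
    X: (N_m, d) movie feature vectors, Y: (N_m, N_u) ratings (any placeholder where unrated),
    mask: (N_m, N_u) boolean, True exactly where a rating Y[m, u] exists."""
    Y_hat = X @ W.T + b[None, :]                     # predictions w_u . x_m + b_u
    err = np.where(mask, Y_hat - Y, 0.0)             # unrated entries contribute nothing
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2))
```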
As an aside, if instead \(Y_{mu}\in\{0,1\}\) are restricted to binary ratings, then a binary cross-entropy loss function would be more appropriate:
\[C(\textbf w_1,…,\textbf w_{N_u},b_1,…,b_{N_u},\textbf x_1,…,\textbf x_{N_m}|Y, \lambda)=-\sum_{u=1}^{N_u}\sum_{m=1}^{N_m}\left[Y_{mu}\ln\left(\frac{1}{1+e^{-(\textbf w_u\cdot\textbf x_m+b_u)}}\right)+(1-Y_{mu})\ln\left(1-\frac{1}{1+e^{-(\textbf w_u\cdot\textbf x_m+b_u)}}\right)\right]\]
\[+\frac{\lambda}{2}\left(\sum_{u=1}^{N_u}|\textbf w_u|^2+\sum_{m=1}^{N_m}|\textbf x_m|^2\right)\]
Finally, an important point often glossed over is that, in order to be able to compute the dot product \(\textbf w_u\cdot\textbf x_m\), both the user weight vectors \(\textbf w_u\) and the movie feature vectors \(\textbf x_m\) need to be equidimensional, but the exact value of this dimension is another degree of freedom that one can play around with.
Problem: How is content-based filtering similar to and distinct from collaborative filtering? What weaknesses of collaborative filtering does content-based filtering address?
Solution: Both content-based filtering and collaborative filtering seek to recommend items (e.g. movies) to users; in other words, both are still recommender systems. They differ in what information they use to make their recommendations.
Whereas collaborative filtering uses other users’ ratings to predict unknown ratings, without taking into account any features of the users or the movies, content-based filtering assumes that one has access to this kind of additional information and so is able to make more tailored recommendations, even to e.g. a first-time user who has never rated any movies before (thus, content-based filtering solves the cold-start problem of collaborative filtering).
The idea is that each user \(1\leq u\leq N_u\) has an associated feature vector \(\textbf w_u\) (previously thought of as an abstract “weight vector”), and similarly each movie \(1\leq m\leq N_m\) has an associated feature vector \(\textbf x_m\). Content-based filtering trains \(2\) neural networks: one maps user features \(\textbf w_u\mapsto\hat{\textbf u}_u\), and the other maps movie features \(\textbf x_m\mapsto\hat{\textbf v}_m\). Here \(\textbf w_u\) and \(\textbf x_m\) need not have the same dimension, but \(\hat{\textbf u}_u\) and \(\hat{\textbf v}_m\) must be equidimensional unit vectors; this is because, ultimately, the prediction model will be a zero-bias form of linear regression, i.e. the cosine similarity
\[\hat Y_{mu}:=\hat{\textbf u}_u\cdot\hat{\textbf v}_m\]
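A schematic sketch of this “two-tower” setup. Here each “network” is just a single dense layer with ReLU (a stand-in for a real trained network, with illustrative weight matrices `A_user` and `A_movie`); the essential points are the \(L^2\)-normalization of the outputs and the final dot product:

```python
import numpy as np

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-12)

def user_tower(w_u, A_user):
    """Stand-in for the user network: maps user features w_u to a unit vector u_hat."""
    return l2_normalize(np.maximum(A_user @ w_u, 0.0))

def movie_tower(x_m, A_movie):
    """Stand-in for the movie network: maps movie features x_m to a unit vector v_hat."""
    return l2_normalize(np.maximum(A_movie @ x_m, 0.0))

def predict_rating(w_u, x_m, A_user, A_movie):
    # Cosine similarity of the two unit vectors (zero-bias linear regression).
    return user_tower(w_u, A_user) @ movie_tower(x_m, A_movie)
```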
Problem: Define a Markov decision process (MDP).
Solution: Formally, a Markov decision process is a \(4\)-tuple \((S,A,P,R)\), where \(S\) is called the state space, \(A\) is called the action space (the set of actions one can apply to states in \(S\)), \(P(\textbf s'|\textbf s,a)\) is a conditional probability function for ending up in a state \(\textbf s'\in S\) given one is in state \(\textbf s\in S\) and applies action \(a\in A\), and \(R:S\to\textbf R\) is a reward function specifying the reward \(R(\textbf s)\) of each state \(\textbf s\in S\).
Informally, the important point about an MDP is that it is memoryless: like a \(1^{\text{st}}\)-order ODE in \(t\) (e.g. Schrödinger’s equation), given an initial condition at \(t=0\), the subsequent time evolution is uniquely specified; it doesn’t matter what happened for \(t<0\).
Problem: Define the return functional \(R[\textbf s(t)]\) of a path \(\textbf s(t)\) in state space \(S\).
Solution: Informally, as a discounted “path integral” along the trajectory:
\[R[\textbf s(t)]:=\sum_{t\geq 0}\gamma^t R(\textbf s(t))\]
Because the path \(\textbf s(t)\) is stroboscopic (time \(t\) is discrete), this is just a discrete series rather than a genuine integral. Time starts at \(t=0\) and runs until some terminal state is reached, and the discount factor \(\gamma\in [0,1)\) is typically \(\approx 0.99\) or something like that, giving rise to delayed rewards.
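A tiny sketch of this discrete series (the reward list and \(\gamma\) are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Return of a stroboscopic path: rewards[t] = R(s(t)) for t = 0, 1, ..., T."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1 that is delayed by 10 time steps is only "worth" 0.99**10 ~ 0.904 today.
print(discounted_return([0] * 10 + [1]))
```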
Problem: What does it mean to say that \(\pi\) is a policy for a particular Markov decision process?
Solution: It means that \(\pi:S\to A\) is a function which, given the current state \(\textbf s\in S\) of the MDP, says which action \(\pi(\textbf s)\in A\) to apply (which would result in ending up at some other state \(\textbf s'\in S\) with probability \(P(\textbf s'|\textbf s,\pi(\textbf s))\)).
Problem: What is the goal of reinforcement learning?
Solution: To find the optimal policy \(\pi^*\) maximizing the expected return \(\langle R[\textbf s(t)]\rangle\), where the trajectory \(\textbf s(t)\) is generated by following the policy being evaluated (together with the transition probabilities \(P(\textbf s'|\textbf s,a)\)).
Problem: Define the state-action value function \(Q:S\times A\to\textbf R\), and state the \(2\) key theorems about it that motivate why it’s a useful construction.
Solution: Loosely speaking, it is defined by:
\[Q(\textbf s,a):=R[\textbf s\to a\cdot\textbf s\to\pi^*]\]
i.e. the return from starting in state \(\textbf s\), then doing action \(a\), and then acting optimally (following \(\pi^*\)) from there onward. The \(2\) key properties of the \(Q\)-function are:
Theorem #\(1\): The expected max return from any state \(\textbf s\in S\) is \(\langle \text{max}_{a\in A}Q(\textbf s,a)\rangle\).
Theorem #\(2\): \(\pi^*(\textbf s)=\text{argmax}_{a\in A}Q(\textbf s,a)\)
Problem: State the Bellman (optimality) equation.
Solution: Writing \(\textbf s':=a\cdot\textbf s\) for the (in general random) next state reached from \(\textbf s\) after applying action \(a\):
\[Q(\textbf s,a)=R(\textbf s)+\gamma\langle\text{max}_{a'\in A}Q(\textbf s',a')\rangle\]
which follows directly from the definition of the \(Q\)-function, more precisely from the \(\pi^*\) part of that definition; thus, in a way, the Bellman equation is a bit like a recurrence relation where the return is what one gets right away, \(R(\textbf s)\), plus the discounted rewards that one gets in the future.
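For a small discrete MDP this recurrence can be iterated to a fixed point directly (tabular \(Q\)-value iteration); the array layout below, with `P[s, a, s_next]` holding the transition probabilities and `R[s]` the rewards, is an illustrative assumption:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, n_iters=1000):
    """Iterate the Bellman equation Q(s,a) = R(s) + gamma * <max_a' Q(s',a')> to a fixed point."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # P @ Q.max(axis=1) is the expectation over s' of max_a' Q(s', a'), for each pair (s, a).
        Q = R[:, None] + gamma * P @ Q.max(axis=1)
    return Q
```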
Problem: Give a reasonable example of a state vector \(\textbf s\) for an autonomous truck driving application, and for an autonomous helicopter.
Solution: In both of these cases, one is working with a continuous-state MDP, since the state space \(S\) is uncountably infinite. For a truck, one might have:
\[\textbf s=(x,y,\theta,\dot x,\dot y,\dot\theta)^T\]
whereas for a helicopter, which can also move in the \(z\)-direction and rotate in all \(3\) dimensions, a suitable triplet of Euler angles (roll, pitch, yaw) is needed as well:
\[\textbf s=(x,y,z,\phi,\theta,\psi,\dot x,\dot y,\dot z,\dot\phi,\dot\theta,\dot\psi)^T\]
Problem: Tying together all the previous problems, describe the optimized version of deep reinforcement learning using a deep-\(Q\)-network (DQN).
Solution: The idea is to use a neural network which takes a state vector \(\textbf s\) in its input layer and, in the output layer, seeks to estimate the \(Q\)-function evaluated on \(\textbf s\), for all possible actions \(a\in A\), i.e. estimate a vector of length \(\#A\) whose components are \(Q(\textbf s,a)\) for \(a\in A\).
Then, as with any supervised regression problem, one needs to first train the DQN on some labelled training examples (in mini-batches for speed and with soft updates for reliability). This is where the Bellman equation comes in. Using, say, a physics engine simulator (or real life?), one can initialize the system in different states \(\textbf s\in S\), perform all possible actions \(a\in A\), and simply observe what state \(\textbf s'\in S\) the system is “scattered” into afterwards. Then, assuming the reward function \(R\) has been pre-specified so that \(R(\textbf s)\) is known, and the discount factor \(\gamma\) is also pre-set, and by randomly initializing the weights and biases of the neural network, one can get an (initially rather poor) estimate of \(Q(\textbf s',a')\) for all actions \(a'\in A\), thereby allowing one to compute \(\text{max}_{a'}Q(\textbf s',a')\), and hence estimate \(Q(\textbf s,a)\) from the Bellman equation (i.e. as a target label).
Then, just minimize the usual MSE cost function with gradient descent.
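A sketch of how the Bellman targets and the MSE cost fit together for one mini-batch; the `q_network` callable (assumed to return an array of shape `(batch, #A)`) and the transition-tuple layout are assumptions for illustration:

```python
import numpy as np

def dqn_targets(q_network, transitions, gamma=0.99):
    """Bellman target labels for a mini-batch of observed transitions (s, a, R(s), s', done)."""
    s, a, r, s_next, done = transitions
    q_next = q_network(s_next)                            # current (initially poor) estimate of Q(s', .)
    return r + gamma * (1.0 - done) * q_next.max(axis=1)  # no future rewards past a terminal state

def dqn_mse_cost(q_network, transitions, targets):
    """The usual MSE between the network's Q(s, a) and the Bellman targets."""
    s, a, _, _, _ = transitions
    q_sa = q_network(s)[np.arange(len(a)), a]             # predicted Q(s, a) for the actions taken
    return np.mean((q_sa - targets) ** 2)                 # minimize this with (mini-batch) gradient descent
```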
One more important tip during training is that, rather than always taking the greedy action with respect to the current estimate of \(Q\), it is a good idea to use an \(\varepsilon\)-greedy policy (a misnomer; a better name would be the \((1-\varepsilon)\)-greedy policy):
\[\pi_{\varepsilon}(\textbf s):=
\begin{cases}
\text{argmax}_{a\in A} Q(\textbf s,a), & \text{with probability } 1-\varepsilon \\
\text{random } a\in A, & \text{with probability } \varepsilon
\end{cases}\]
where the greedy action is also called exploitation while the random action is called exploration. When just beginning training, take \(\varepsilon\approx 1\), and decrease it gradually towards \(0\).
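A sketch of the \(\varepsilon\)-greedy action selection, where the `q_values` argument is the DQN’s output vector of \(Q(\textbf s,a)\) estimates for the current state:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Explore with probability epsilon, otherwise exploit the current Q estimates."""
    rng = np.random.default_rng(rng)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration: random action
    return int(np.argmax(q_values))               # exploitation: greedy action
```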