Supervised Machine Learning: Regression & Classification

Problem: Distinguish between supervised learning and unsupervised learning.

Solution: In supervised learning, the machine is trained on a labelled training set \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\) consisting of feature vectors \(\textbf x_i\) with their target values \(y_i\). The goal is then to fit a function \(\hat y(\textbf x)\) (historically called a hypothesis, also called a model) to this data in order to predict/estimate the target value of arbitrary feature vectors \(\textbf x\) beyond the training set.

In unsupervised learning, the machine is simply given a bunch of data \(\textbf x_1,…,\textbf x_N\) and asked to find patterns/structure within the data. Thus, the "supervision" refers precisely to the target labels \(y_i\) supplied alongside the training data.

Problem: What are the \(2\) most common types of supervised learning?

Solution: If the range of \(\hat y\) is at most countable (e.g. \(\{0,1\}\), \(\textbf N\), etc.), then this type of supervised learning is called classification. If instead the range of \(\hat y\) is uncountable (e.g. \(\textbf R\)), then this type of supervised learning is called regression (cf. the distinction between discrete and continuous random variables).

Problem: What are some kinds of unsupervised learning?

Solution: Clustering, dimensionality reduction, anomaly detection, etc.

Problem: (clarifying the theory of multivariate linear regression)

Problem: On using Python to do univariate linear regression, suggesting multiple ways/libraries to do it.
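For instance (a minimal sketch; the data below is made up purely for illustration), one can fit the same line with NumPy's polyfit, SciPy's linregress, or scikit-learn's LinearRegression:

```python
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import LinearRegression

# Made-up univariate training data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# 1) NumPy: least-squares polynomial fit of degree 1 (returns slope, then intercept)
w, b = np.polyfit(x, y, deg=1)

# 2) SciPy: linregress also returns the r-value, p-value, etc.
result = linregress(x, y)

# 3) scikit-learn: expects a 2D feature matrix of shape (N, n)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(w, b)
print(result.slope, result.intercept)
print(model.coef_[0], model.intercept_)
```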

Problem: Write down using mathematical notation the iterative formula for gradient descent minimization of an arbitrary function \(C(\textbf x)\), then write down the same formula using programming notation.

Solution: Mathematical version:

\[\textbf x_n=\textbf x_{n-1}-\alpha\frac{\partial C}{\partial\textbf x}(\textbf x_{n-1})\]

where \(\alpha>0\) is called the learning rate (here assumed to be fixed though in principle could also have \(\alpha=\alpha_n\)).

Programming version:

\[\textbf x=\textbf x-\alpha\frac{\partial C}{\partial\textbf x}(\textbf x)\]
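As a minimal Python sketch of this programming version (assuming the gradient \(\partial C/\partial\textbf x\) is supplied as a function; the names grad_C, x0, etc. below are purely illustrative):

```python
import numpy as np

def gradient_descent(grad_C, x0, alpha=0.1, n_iters=1000):
    """Minimize C by iterating x <- x - alpha * grad_C(x) with a fixed learning rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad_C(x)
    return x

# Example: C(x) = |x|^2 has gradient 2x and a global minimum at the origin
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0], alpha=0.1)
print(x_min)  # approaches [0, 0]
```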

Problem: How to choose \(\alpha\)?

Problem: For the typical case in machine learning where:

\[C(\textbf w,b):=\frac{1}{2N}\sum_{i=1}^N(\hat{y}(\textbf x_i|\textbf w,b)-y_i)^2\]

is a mean-square cost function for a linear regression model \(\hat y(\textbf x|\textbf w,b)=\textbf w\cdot\textbf x+b\) attempting to predict a training set \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\) (here \(\textbf w\) is a weight vector and \(b\) is called a bias), what do the update rules for gradient descent look like in mathematical notation (explicitly)?

Solution: Note that \(\textbf w_n\) and \(b_n\) should be updated simultaneously:

\[\textbf w_n=\textbf w_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\textbf w_{n-1}\cdot\textbf x_i+b_{n-1}-y_i)\textbf x_i\]

\[b_n=b_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\textbf w_{n-1}\cdot\textbf x_i+b_{n-1}-y_i)\]

Problem: What are the \(2\) key advantages of using NumPy?

Solution: Vectorization and broadcasting.

Vectorization is what it sounds like: do all your computations with NumPy arrays (i.e. ndarray; crudely speaking, these are Cartesian tensors in math/physics), so that elementwise operations are dispatched to universal functions (ufuncs), i.e. precompiled \(C\) loops. One typically never has to bother with explicit Python loops.

Broadcasting allows operations between arrays of different but compatible shapes, such as scalar multiplication \(\textbf x\mapsto c\textbf x\) or adding a scalar bias to every entry of an array, without explicitly replicating the smaller array.
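A small illustration of both ideas (the arrays below are random/made up):

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
X = np.random.rand(1000, 3)   # N = 1000 feature 3-vectors, one per row
b = 0.5

# Vectorization: the model w.x + b evaluated for all N examples at once,
# with no explicit Python loop (the work happens in precompiled C loops)
y_hat = X @ w + b             # broadcasting adds the scalar b to every entry

# Broadcasting also handles e.g. z-score renormalization column-by-column
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```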

Problem: Implement gradient descent in Python for the training set \(\{(1,1),(2,2),(3,3)\}\) using a linear regression model. In particular, for a fixed initial guess, determine the optimal learning rate \(\alpha\); also try different initial guesses, and compare batch gradient descent with mini-batch gradient descent.

Solution:

numpy_exercises
gradient_descent
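For reference, a minimal batch-gradient-descent sketch of the explicit update rules derived above, applied to this training set (the scanned learning-rate values are illustrative, not the full solution in the linked notebooks):

```python
import numpy as np

# Training set {(1,1), (2,2), (3,3)}: a single scalar feature per example
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def batch_gradient_descent(alpha, w0=0.0, b0=0.0, n_iters=1000):
    """Simultaneous updates of the scalar weight w and bias b."""
    w, b = w0, b0
    for _ in range(n_iters):
        error = w * x + b - y      # residuals (w x_i + b - y_i)
        dw = (error * x).mean()    # dC/dw = (1/N) sum (w x_i + b - y_i) x_i
        db = error.mean()          # dC/db = (1/N) sum (w x_i + b - y_i)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

# Crude scan over learning rates from the fixed initial guess (w, b) = (0, 0);
# for this data, learning rates much above roughly 0.35 make the iteration diverge
for alpha in (0.001, 0.01, 0.1, 0.3):
    w, b = batch_gradient_descent(alpha)
    cost = 0.5 * ((w * x + b - y) ** 2).mean()
    print(f"alpha = {alpha}: w = {w:.4f}, b = {b:.4f}, cost = {cost:.2e}")
```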

Problem: Explain how feature renormalization and feature engineering can help to speed up gradient descent and obtain a model with greater predictive power.

Solution: Because the learning rate \(\alpha\) in gradient descent is, roughly speaking, a dimensionless parameter, it makes sense to nondimensionalize all feature variables in some sense. This is the idea of feature renormalization. For instance, one common method is to simply perform mean renormalization:

\[x\mapsto\frac{x-\mu}{x^*-x_*}\]

where \(\mu\) is the mean of the feature over the training set and \(x^*,x_*\) are its maximum and minimum values.

Another method is \(z\)-score renormalization:

\[x\mapsto\frac{x-\mu}{\sigma}\]

where \(\sigma\) is the standard deviation of the feature over the training set.

Feature engineering is about using one’s intuition to design new features by transforming or combining original features. Typically, this means expanding the set of basis functions one works with, thus allowing one to curve fit even nonlinear functions. Thus, linear regression also works with nonlinear functions; the word “linear” in “linear regression” shouldn’t be thought of as “linear fit” but as “linear algebra” (the model is linear in the parameters, not necessarily in the original features).
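A short sketch of both ideas (the feature matrix below is made up for illustration, and the engineered features \(x_1x_2\) and \(x_1^2\) are just examples of hand-crafted combinations):

```python
import numpy as np

# Illustrative raw feature matrix: N = 4 examples, 2 features on very different scales
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])

# z-score renormalization: each column gets mean 0 and standard deviation 1
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Feature engineering: augment with hand-crafted nonlinear features, e.g. x1*x2 and x1^2;
# linear regression on the augmented features then yields a fit that is nonlinear in the originals
X_eng = np.column_stack([X_norm,
                         X_norm[:, 0] * X_norm[:, 1],
                         X_norm[:, 0] ** 2])
```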

Problem: Now consider the other type of supervised learning, namely classification, and specifically consider the method of logistic classification (commonly called by the misnomer of logistic “regression” even though it’s about classification). Write down a table comparing linear regression with logistic classification with regards to their:

i) Model function \(\hat y(\textbf x|\textbf w,b)\) to be fit to the training set \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\).

ii) Loss functions \(L(\hat y,y)\) appearing in the cost function \(C(\textbf w,b)=\frac{1}{N}\sum_{i=1}^NL(\hat y(\textbf x_i|\textbf w,b),y_i)\).

Solution:

i) Model function: for linear regression, \(\hat y(\textbf x|\textbf w,b)=\textbf w\cdot\textbf x+b\), while for logistic classification it is the logistic sigmoid \(\hat y(\textbf x|\textbf w,b)=\sigma(\textbf w\cdot\textbf x+b)=\frac{1}{1+e^{-(\textbf w\cdot\textbf x+b)}}\).

ii) Loss function: for linear regression, the squared loss \(L(\hat y,y)=\frac{1}{2}(\hat y-y)^2\), while for logistic classification it is the logarithmic loss \(L(\hat y,y)=-\ln\hat y\) if \(y=1\) and \(L(\hat y,y)=-\ln(1-\hat y)\) if \(y=0\),

where, as a minor aside, the loss function for logistic classification can also be written in the explicit “Pauli blocking” or “entropic” form:

\[L(\hat y,y)=-y\ln\hat y-(1-y)\ln(1-\hat y)\]

Indeed, it is possible to justify this choice of loss function more rigorously through maximum-likelihood arguments. For simplicity, one can simply think of this choice of \(L\) for logistic classification as ensuring that the corresponding cost function \(C\) is convex, so that gradient descent can be made to converge to a global minimum (which wouldn’t have been the case if one had simply stuck with the old quadratic cost function from linear regression). A remarkable fact (related to this?) is that the explicit gradient descent update formulas for each iteration look exactly the same for linear regression and logistic classification:

\[\textbf w_n=\textbf w_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\hat y(\textbf x_i|\textbf w_{n-1},b_{n-1})-y_i)\textbf x_i\]

\[b_n=b_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\hat y(\textbf x_i|\textbf w_{n-1},b_{n-1})-y_i)\]

just with the model function \(\hat y(\textbf x|\textbf w,b)\) specific to each case.
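A minimal from-scratch sketch of these shared update formulas in the logistic case (the toy training data below is made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """The logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Same update formulas as for linear regression, but with y_hat = sigmoid(X w + b)."""
    N, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)
        error = y_hat - y
        w -= (alpha / N) * (X.T @ error)
        b -= (alpha / N) * error.sum()
    return w, b

# Toy single-feature example: class 0 for small x, class 1 for large x
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = logistic_gradient_descent(X, y, alpha=0.5, n_iters=5000)
print(w, b)
```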

Problem: Given a supervised classification problem involving some training set to which a logistic sigmoid is fit with some optimal weights \(\textbf w\) and bias \(b\), how does the actual classification then arise?

Solution: One has to decide on some critical “activation energy”/threshold \(\hat y_c\in[0,1]\) such that the classification of a (possibly unseen) feature vector \(\textbf x\) is \(\Theta(\hat y(\textbf x|\textbf w,b)-\hat y_c)\in\{0,1\}\). Thus, the set of feature vectors \(\textbf x\) for which \(\hat y(\textbf x|\textbf w,b)=\hat y_c\) is called the decision boundary.
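As a small sketch, classification at an arbitrary threshold \(\hat y_c\) given already-fitted \(\textbf w\) and \(b\) (the helper name classify below is purely illustrative):

```python
import numpy as np

def classify(X, w, b, y_c=0.5):
    """Heaviside-threshold the fitted logistic sigmoid at the critical value y_c."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (y_hat >= y_c).astype(int)
```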

Problem: Using the scikit-learn library, show how to perform logistic classification given \(N\) training examples of feature \(n\)-vectors \(\textbf x\in\textbf R^n\) with \(\hat y_c=0.5\).

Solution:
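A minimal sketch using scikit-learn's LogisticRegression (the training set below is made up for illustration; predict thresholds the fitted sigmoid at \(\hat y_c=0.5\) by default, and note that scikit-learn also applies some \(\ell_2\) regularization by default via its C parameter):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative training set of N = 6 feature 2-vectors with binary labels
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5],
              [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # the fitted weights w and bias b
print(clf.predict(X))                # hard 0/1 classifications (threshold y_c = 0.5)
print(clf.predict_proba(X)[:, 1])    # the raw sigmoid outputs y_hat(x|w, b)
```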

Problem: Explain what it means to regularize a cost function \(C(\textbf w,b)\) and what the purpose of this is.

Solution: It means adding an isotropic convex paraboloid component in the weight \(\textbf w\) (analogous to a harmonic trap in ultracold atoms), i.e.

\[C(\textbf w,b)\mapsto C(\textbf w,b)+\frac{\lambda}{2N}|\textbf w|^2\]

(optionally though not typically one can also regularize the bias term \(b\) by adding \(\lambda b^2/2N\) to the cost function \(C\)). Here, \(\lambda\) like \(\alpha\) is another hyperparameter (sometimes called a regularization parameter). The purpose is to avoid overfitting (a.k.a. high variance) when one has a lot of features and a relatively small training set (naturally, other ways to address overfitting include having more training examples and feature selection).
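As a sketch in the linear-regression case, the only change to the gradient descent update is an extra \(\frac{\lambda}{N}\textbf w\) term in the weight gradient (the function name below is purely illustrative):

```python
import numpy as np

def regularized_gradient_step(w, b, X, y, alpha, lam):
    """One gradient-descent step on the regularized cost C + (lam / 2N)|w|^2
    for the linear model y_hat = X @ w + b; the bias b is left unregularized."""
    N = len(y)
    error = X @ w + b - y                      # residuals y_hat(x_i) - y_i
    dw = (X.T @ error) / N + (lam / N) * w     # extra (lam/N) w term from the regularizer
    db = error.mean()
    return w - alpha * dw, b - alpha * db
```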
