Problem: Distinguish between supervised learning and unsupervised learning.
Solution: In supervised learning, the machine is trained on a labelled training set \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\) consisting of feature vectors \(\textbf x_i\) with their target values \(y_i\). The goal is then to fit a function \(\hat y(\textbf x)\) (historically called a hypothesis, also called a model) to this data in order to predict/estimate the target value of arbitrary feature vectors \(\textbf x\) beyond the training set.
In unsupervised learning, the machine is simply given unlabelled data \(\textbf x_1,…,\textbf x_N\) and asked to find patterns/structure within the data. Thus, the "supervision" refers precisely to the target labels \(y_i\) attached to the training data.
Problem: What are the \(2\) most common types of supervised learning?
Solution: If the range of \(\hat y\) is at most countable (e.g. \(\{0,1\}\), \(\textbf N\), etc.) then this type of supervised learning is called classification. If instead the range of \(\hat y\) is uncountable (e.g. \(\textbf R\)) then this type of supervised learning is called regression (cf. the distinction between discrete and continuous random variables).
Problem: What are some kinds of unsupervised learning?
Solution: Clustering, dimensionality reduction, anomaly detection, etc.
Problem: (clarifying the theory of multivariate linear regression)
Problem: On using Python to do univariate linear regression, suggesting multiple ways/libraries to do it.
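Solution: (sketch) Three common ways are NumPy's polyfit, SciPy's stats.linregress, and scikit-learn's LinearRegression; below is a minimal illustration on made-up data.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8]) # illustrative data, roughly y = 2x + 1
w1, b1 = np.polyfit(x, y, deg=1) # 1) NumPy: degree-1 least-squares polynomial fit
res = stats.linregress(x, y) # 2) SciPy: slope/intercept plus r-value, p-value, stderr
model = LinearRegression().fit(x.reshape(-1, 1), y) # 3) scikit-learn estimator API
print(w1, b1, res.slope, res.intercept, model.coef_[0], model.intercept_)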
Problem: Write down using mathematical notation the iterative formula for gradient descent minimization of an arbitrary function \(C(\textbf x)\), then write down the same formula using programming notation.
Solution: Mathematical version:
\[\textbf x_n=\textbf x_{n-1}-\alpha\frac{\partial C}{\partial\textbf x}(\textbf x_{n-1})\]
where \(\alpha>0\) is called the learning rate (here assumed fixed, though in principle one could let it vary with the iteration, \(\alpha=\alpha_n\)).
Programming version (with \(=\) now denoting assignment, i.e. an in-place update):
\[\textbf x=\textbf x-\alpha\frac{\partial C}{\partial\textbf x}(\textbf x)\]
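As a minimal Python sketch of this iteration (the names gradient_descent and grad_C here are illustrative):
import numpy as np
def gradient_descent(grad_C, x0, alpha=0.1, n_iter=100):
    """Minimize C via x -> x - alpha * grad_C(x), starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - alpha * grad_C(x) # the update rule as an in-place assignment
    return x
print(gradient_descent(lambda x: 2*x, x0=[1.0, -2.0])) # C(x) = |x|^2, minimum at 0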
Problem: How to choose \(\alpha\)?
Problem: For the typical case in machine learning where:
\[C(\textbf w,b):=\frac{1}{2N}\sum_{i=1}^N(\hat{y}(\textbf x_i|\textbf w,b)-y_i)^2\]
is the mean-square cost function for a linear regression model \(\hat y(\textbf x|\textbf w,b)=\textbf w\cdot\textbf x+b\) being fit to a training set \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\) (here \(\textbf w\) is a weight vector and \(b\) is called a bias), what do the update rules for gradient descent look like explicitly in mathematical notation?
Solution: Note that \(\textbf w_n\) and \(b_n\) should be updated simultaneously:
\[\textbf w_n=\textbf w_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\textbf w_{n-1}\cdot\textbf x_i+b_{n-1}-y_i)\textbf x_i\]
\[b_n=b_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\textbf w_{n-1}\cdot\textbf x_i+b_{n-1}-y_i)\]
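As a minimal vectorized NumPy sketch of one such simultaneous step (here X is an assumed \(N\times n\) matrix whose rows are the \(\textbf x_i\)):
import numpy as np
def update(w, b, X, y, alpha):
    """One gradient-descent step for the mean-square cost of a linear model."""
    r = X @ w + b - y # residuals  w . x_i + b - y_i
    w_new = w - alpha * (X.T @ r) / len(y) # both updates use the old (w, b), ...
    b_new = b - alpha * np.mean(r) # ... i.e. they happen simultaneously
    return w_new, b_new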
Problem: What are the \(2\) key advantages of using NumPy?
Solution: Vectorization and broadcasting.
Vectorization means performing computations on entire NumPy arrays (i.e. ndarray objects, which crudely speaking are Cartesian tensors in math/physics) at once: elementwise operations dispatch to NumPy's universal functions (ufuncs), each of which is a precompiled \(C\) loop, so one typically never has to bother with Python loops.
Broadcasting automatically stretches arrays of compatible shapes against each other, allowing standard operations such as scalar multiplication \(\textbf x\mapsto c\textbf x\) to be written directly without tiling \(c\) by hand.
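For instance, a small sketch of vectorization, ufuncs, and broadcasting in action:
import numpy as np
x = np.arange(5.0) # ndarray: [0. 1. 2. 3. 4.]
print(3.0 * x) # broadcasting: the scalar is "stretched" to x's shape
print(np.exp(x)) # ufunc: a single precompiled C loop over the whole array
print(x[:, None] + x[None, :]) # broadcasting column vs. row gives a 5x5 addition table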
Problem: Implement gradient descent in Python for the training set \(\{(1,1),(2,2),(3,3)\}\) using a linear regression model. In particular, for a fixed initial guess, determine the optimal learning rate \(\alpha\). Also try different initial guesses? And batch gradient descent vs. mini-batch…
Solution:
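A minimal sketch of one possible approach (batch gradient descent, with the "optimal" \(\alpha\) found by trial over a small grid; the exact fit here is \(w=1,b=0\), and a mini-batch variant would simply average the residuals over a random subset of the data at each step):
import numpy as np
x = np.array([1.0, 2.0, 3.0]) # training set {(1,1),(2,2),(3,3)}
y = np.array([1.0, 2.0, 3.0])
def step(w, b, alpha): # one batch gradient-descent step (simultaneous update)
    r = w*x + b - y # residuals
    return w - alpha*np.mean(r*x), b - alpha*np.mean(r)
for alpha in [0.01, 0.1, 0.3]: # trial learning rates; too large a rate diverges
    w, b = 0.0, 0.0 # fixed initial guess
    for _ in range(1000):
        w, b = step(w, b, alpha)
    print(alpha, w, b) # the best alpha gets closest to (w, b) = (1, 0)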
Problem: Write Python code to generate two random vectors \(\textbf w,\textbf x\in\textbf R^{10^7}\) and compute their dot product \(\textbf w\cdot\textbf x\):
i) Without vectorization (i.e. using a for loop)
ii) With vectorization
Show that vectorization significantly speeds up the computation of \(\textbf w\cdot\textbf x\).
Solution:
import numpy as np # import the NumPy library
from numpy import random # import the random module from the NumPy library
from time import time # import the time function from the time module
n = int(1e7)
random.seed(62831853) # seed for "predictable" random vectors
w = random.rand(n) # random vector with n entries between 0 and 1
print(w)
x = random.rand(n) # random vector with n entries between 0 and 1
print(x)
# Without vectorization:
t_initial = time()
dot_product = 0
for i in np.arange(n):
    dot_product = dot_product + w[i]*x[i]
t_final = time()
print(f"Dot product value: {dot_product}")
print(f"Time taken: {t_final-t_initial} seconds")
# With vectorization:
t_initial = time()
dot_product = np.dot(w,x)
t_final = time()
print(f"Dot product value: {dot_product}")
print(f"Time taken: {t_final-t_initial} seconds")
[0.99139118 0.98259778 0.04074994 ... 0.37571015 0.46541306 0.60192845]
[0.02930771 0.23935712 0.46679807 ... 0.67633435 0.5376593 0.09256103]
Dot product value: 2499560.663731212
Time taken: 4.61395788192749 seconds
Dot product value: 2499560.663731507
Time taken: 0.003765106201171875 seconds
Problem:
i) Generate (from a fixed seed) \(1000\) random points in the \(xy\)-plane which are dispersed around the line \(y=3x+1\) with standard deviation \(\sigma_y=50\), in the range \(0\leq x\leq 100\).
ii) Using the univariate linear regression model \(\hat y=wx+b\) with initial guess \(w=b=0\), apply gradient descent with learning rates \(\alpha=10^{-6},10^{-5},10^{-4}\) for \(100\) iterations each and plot the corresponding learning curves.
iii) Show, for each of the previous values of \(\alpha\), the machine learning of \((w,b)\) in a suitable plane.
Solution:
import numpy as np
import matplotlib.pyplot as plt
import scienceplots # provides the 'science' matplotlib style used below
N = 1000
x_min = 0
x_max = 100
sigma_y = 50
rng = np.random.default_rng(seed=62831853) # fixed seed for reproducibility
x = rng.uniform(x_min, x_max, N) # abscissas uniform on [x_min, x_max]
y = 3*x + 1 + rng.normal(0, sigma_y, N) # ordinates dispersed around y = 3x + 1
with plt.style.context(['science']):
    plt.figure(figsize=(10, 5))
    plt.scatter(x, y, s=0.5)
    plt.title("Random Data")
    plt.xlabel(r"$x$")
    plt.ylabel(r"$y$")
    plt.show()
def y_hat(x, w, b): # linear model; w, b may be scalars or whole parameter histories
    w, b = np.asarray(w), np.asarray(b)
    return w[..., None] * x + b[..., None]

def C(x, y, w, b): # mean-square cost function
    return 0.5 * np.mean((y_hat(x, w, b) - y)**2, axis=-1)

def dC_dw(w, b): # partial derivative of C with respect to w (uses the global x, y, N)
    return 1/N * np.sum((y_hat(x, w, b) - y) * x, axis=-1)

def dC_db(w, b): # partial derivative of C with respect to b
    return 1/N * np.sum(y_hat(x, w, b) - y, axis=-1)

def grad_descent(w_init, b_init, alpha, n): # returns the full trajectory of (w, b)
    w = np.zeros(n)
    b = np.zeros(n)
    w[0] = w_init
    b[0] = b_init
    for i in range(1, n): # simultaneous update: both use the previous (w, b)
        w[i] = w[i-1] - alpha * dC_dw(w[i-1], b[i-1])
        b[i] = b[i-1] - alpha * dC_db(w[i-1], b[i-1])
    return w, b
n = 100
w_init = 0
b_init = 0
iterations = np.arange(n)
with plt.style.context(['science']):
    plt.figure(figsize=(10, 5))
    for alpha in [1e-6, 1e-5, 1e-4]:
        w, b = grad_descent(w_init, b_init, alpha, n)
        plt.plot(iterations, C(x, y, w, b), label=r"$\alpha=$"+str(alpha))
    plt.title("Learning Curves for Gradient Descent for Several Learning Rates")
    plt.xlabel("Number of Iterations")
    plt.ylabel("Cost Function")
    plt.legend()
    plt.show()
with plt.style.context(['science']):
    plt.figure(figsize=(10, 5))
    for alpha in [1e-6, 1e-5, 1e-4]:
        weights, biases = grad_descent(w_init, b_init, alpha, n)
        plt.scatter(weights, biases, label=r"$\alpha=$"+str(alpha), s=0.5)
    plt.title(r"Machine Learning the Linear Regression Model Parameters $w,b$")
    plt.xlabel(r"$w$")
    plt.ylabel(r"$b$")
    plt.legend()
    plt.show()
Problem: Explain how feature renormalization and feature engineering can help to speed up gradient descent and obtain a model with greater predictive power.
Solution: Because the learning rate \(\alpha\) in gradient descent is, roughly speaking, a dimensionless parameter, it makes sense to nondimensionalize all feature variables in some sense. This is the idea of feature renormalization. For instance, one common method is mean renormalization (here \(\mu\) is the feature's mean over the training set and \(x^*,x_*\) are its maximum and minimum):
\[x\mapsto\frac{x-\mu}{x^*-x_*}\]
Another method is \(z\)-score renormalization (with \(\sigma\) the feature's standard deviation):
\[x\mapsto\frac{x-\mu}{\sigma}\]
Feature engineering is about using one's intuition to design new features by transforming or combining original features. Typically, this means expanding the set of basis functions one works with, thus allowing one to fit curves even to nonlinear data. Thus, linear regression also works with nonlinear functions; the word "linear" in "linear regression" shouldn't be thought of as "linear fit" but as "linear algebra" (the model is linear in the parameters \(\textbf w,b\), not in the features).
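A small sketch of both ideas on made-up data:
import numpy as np
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 1000) # a raw feature with mean ~50 and range ~100
x_z = (x - x.mean()) / x.std() # z-score renormalization: mean 0, std 1
x_m = (x - x.mean()) / (x.max() - x.min()) # mean renormalization: divide by the range
# feature engineering: expand the basis so *linear* regression can fit a cubic
X = np.column_stack([x_z, x_z**2, x_z**3])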
Problem: Now consider the other type of supervised learning, namely classification, and specifically consider the method of logistic classification (commonly called by the misnomer of logistic “regression” even though it’s about classification). Write down a table comparing linear regression with logistic classification with regards to their:
i) Model function \(\hat y(\textbf x|\textbf w,b)\) to be fit to the training set \(\{(\textbf x_1,y_1),…,(\textbf x_N,y_N)\}\).
ii) Loss functions \(L(\hat y,y)\) appearing in the cost function \(C(\textbf w,b)=\frac{1}{N}\sum_{i=1}^NL(\hat y(\textbf x_i|\textbf w,b),y_i)\).
Solution:
Linear regression: model \(\hat y(\textbf x|\textbf w,b)=\textbf w\cdot\textbf x+b\), with quadratic loss \(L(\hat y,y)=\frac{1}{2}(\hat y-y)^2\).
Logistic classification: model \(\hat y(\textbf x|\textbf w,b)=\sigma(\textbf w\cdot\textbf x+b)\), where \(\sigma(z):=1/(1+e^{-z})\) is the logistic sigmoid, with the maximum-likelihood (cross-entropy) loss \(L(\hat y,y)=-\ln\hat y\) if \(y=1\) and \(L(\hat y,y)=-\ln(1-\hat y)\) if \(y=0\).
where, as a minor aside, the loss function for logistic classification can also be written in the explicit “Pauli blocking” or “entropic” form:
\[L(\hat y,y)=-y\ln\hat y-(1-y)\ln(1-\hat y)\]
Indeed, it is possible to more rigorously justify this choice of loss function precisely through such maximum-likelihood arguments. For simplicity, one can simply think of this choice of \(L\) for logistic classification as ensuring that the corresponding cost function \(C\) is convex so that gradient descent can be made to converge to a global minimum (which wouldn’t have been the case if one had simply stuck with the old quadratic cost function from linear regression). A remarkable fact (related to this?) is that the explicit gradient descent update formulas for each iteration look exactly the same for linear regression and logistic classification:
\[\textbf w_n=\textbf w_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\hat y(\textbf x_i|\textbf w_{n-1},b_{n-1})-y_i)\textbf x_i\]
\[b_n=b_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\hat y(\textbf x_i|\textbf w_{n-1},b_{n-1})-y_i)\]
just with the model function \(\hat y(\textbf x|\textbf w,b)\) specific to each case.
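As a minimal NumPy sketch making this explicit (here X is an assumed \(N\times n\) matrix whose rows are the feature vectors \(\textbf x_i\); only the model function differs between the two cases):
import numpy as np
def sigmoid(z): # logistic sigmoid
    return 1/(1 + np.exp(-z))
def gd_step(w, b, X, y, alpha, model):
    """One gradient-descent step; the update formula is identical for both cases."""
    r = model(X @ w + b) - y # residuals \hat y(x_i) - y_i
    return w - alpha*(X.T @ r)/len(y), b - alpha*np.mean(r)
# linear regression: model = lambda z: z; logistic classification: model = sigmoid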
Problem: Given a supervised classification problem involving some training set to which a logistic sigmoid is fit with some optimal weights \(\textbf w\) and bias \(b\), how does the actual classification then arise?
Solution: One has to decide on some critical "activation energy"/threshold \(\hat y_c\in[0,1]\) such that the classification of a (possibly unseen) feature vector \(\textbf x\) is \(\Theta(\hat y(\textbf x|\textbf w,b)-\hat y_c)\in\{0,1\}\), where \(\Theta\) is the Heaviside step function. The set of feature vectors \(\textbf x\) for which \(\hat y(\textbf x|\textbf w,b)=\hat y_c\) is called the decision boundary; for the logistic model this is the hyperplane \(\textbf w\cdot\textbf x+b=\ln\frac{\hat y_c}{1-\hat y_c}\), which for \(\hat y_c=1/2\) is just \(\textbf w\cdot\textbf x+b=0\).
Problem: Using the scikit-learn library, show how to perform logistic classification given \(N\) training examples of feature \(n\)-vectors \(\textbf x\in\textbf R^n\) with \(\hat y_c=0.5\).
Solution:
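A minimal sketch on made-up data (note that scikit-learn's LogisticRegression is \(L^2\)-regularized by default, and its predict method thresholds the predicted probability at \(\hat y_c=0.5\) in the binary case):
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(62831853)
N, n = 200, 3
X = rng.normal(size=(N, n)) # N feature n-vectors
y = (X @ np.ones(n) + 0.5 > 0).astype(int) # illustrative binary labels
clf = LogisticRegression() # L2-regularized by default (C = 1.0)
clf.fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0] # fitted weights and bias
probs = clf.predict_proba(X)[:, 1] # \hat y = sigma(w . x + b)
labels = clf.predict(X) # equivalent to (probs >= 0.5) for binary problems
print(w, b, labels[:10])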
Problem: Explain what it means to regularize a cost function \(C(\textbf w,b)\) and what the purpose of this is.
Solution: It means adding an isotropic convex paraboloid component in the weight \(\textbf w\) (analogous to a harmonic trap in ultracold atoms), i.e.
\[C(\textbf w,b)\mapsto C(\textbf w,b)+\frac{\lambda}{2N}|\textbf w|^2\]
(optionally, though not typically, one can also regularize the bias term \(b\) by adding \(\lambda b^2/2N\) to the cost function \(C\)). Here \(\lambda\), like \(\alpha\), is another hyperparameter (sometimes called the regularization parameter). The purpose is to avoid overfitting (a.k.a. high variance) when one has a lot of features and a relatively small training set (naturally, other ways to address overfitting include having more training examples and feature selection).
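Explicitly, since \(\frac{\partial}{\partial\textbf w}\left(\frac{\lambda}{2N}|\textbf w|^2\right)=\frac{\lambda}{N}\textbf w\), the weight update in gradient descent acquires a "weight decay" factor:
\[\textbf w_n=\left(1-\frac{\alpha\lambda}{N}\right)\textbf w_{n-1}-\frac{\alpha}{N}\sum_{i=1}^N(\hat y(\textbf x_i|\textbf w_{n-1},b_{n-1})-y_i)\textbf x_i\]
i.e. at each iteration the weights are first shrunk slightly toward \(\textbf 0\) before the usual data-driven correction (the update for \(b\) is unchanged).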