Problem: Explain how a hard-margin support vector machine would perform binary classification.
Solution: Conceptually, it’s simple. Given a training set of \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\), each associated with a binary target label \(y_1,…,y_N\in\{-1,1\}\) (notice the \(2\) binary classes are \(y=-1\) and \(y=1\) rather than the more conventional \(y=0\) and \(y=1\); this is simply a matter of convenience), the goal of a hard-margin SVM is to compute the unique maximum-margin hyperplane in \(\mathbf R^n\) that separates the \(y=-1\) and \(y=1\) classes as well as possible. For \(n=2\) this can be visualized as trying to find the “widest possible street” between \(2\) neighborhoods:

Mathematically, the hyperplane decision boundary \(\mathbf w\cdot\mathbf x+b=0\) is defined by a normal vector \(\mathbf w\in\mathbf R^n\) and offset \(b\in\mathbf R\), with the SVM classifying points having \(\mathbf w\cdot\mathbf x+b>0\) as \(y=1\) and those having \(\mathbf w\cdot\mathbf x+b<0\) as \(y=-1\) (more compactly, \(\hat y_{\text{SVM}}(\mathbf x|\mathbf w,b)=\text{sgn}(\mathbf w\cdot\mathbf x+b)\)). The overall normalization of \(\mathbf w\) and \(b\) is fixed by stipulating that the “gutters” of the street are the hyperplanes \(\mathbf w\cdot\mathbf x+b=\pm 1\) passing through the support (feature) vectors (drawn as bigger red/green dots in the diagram). The “width of the street” is then the distance between these \(2\) “gutter hyperplanes”, namely \(2/|\mathbf w|\) (for points \(\mathbf x_\pm\) on the two gutters, \(\mathbf w\cdot(\mathbf x_+-\mathbf x_-)=2\), so projecting onto the unit normal \(\mathbf w/|\mathbf w|\) gives a perpendicular distance of \(2/|\mathbf w|\)). Maximizing this margin \(2/|\mathbf w|\) is equivalent to minimizing the quadratic \(|\mathbf w|^2/2\), so the “hard-margin SVM” can be formulated as the quadratic programming problem:
\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R:\forall i=1,…,N,y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1}\frac{|\mathbf w|^2}{2}\]
where the \(N\) constraints \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1\) (or vectorially \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1\), with \(X\) the \(n\times N\) matrix whose columns are the \(\mathbf x_i\)) are just saying that every feature vector \(\mathbf x_1,…,\mathbf x_N\) in the training set should lie on or beyond its class’s gutter hyperplane \(\mathbf w\cdot\mathbf x+b=\pm 1\) (in particular, on the correct side of the decision boundary hyperplane \(\mathbf w\cdot\mathbf x+b=0\)).
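To make this concrete, here is a minimal sketch (my own illustration, not part of the original solution) that solves this hard-margin primal QP numerically with scipy’s SLSQP solver on a tiny linearly separable toy dataset; all names here are made up, and the toy X stores the feature vectors row-wise (so the document’s \(X^T\mathbf w\) becomes X @ w below).
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in R^2 with labels in {-1, +1}; rows are samples
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

def objective(params):
    w = params[:-1]
    return 0.5 * np.dot(w, w)                  # |w|^2 / 2

def margin_constraints(params):
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0               # y_i (w . x_i + b) - 1 >= 0, elementwise

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w_opt, b_opt = res.x[:-1], res.x[-1]
print("w =", w_opt, " b =", b_opt, " street width =", 2 / np.linalg.norm(w_opt))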
Problem: What is the glaring problem with a hard-margin SVM? How does a soft-margin SVM help to address this issue?
Solution: The basic problem is that the hard-margin SVM is far too sensitive to outliers: a single \(y=-1\) feature vector amidst the \(y=1\) neighborhood would leave the hard-margin SVM unable to find any solution at all, since no hyperplane would perfectly separate all the training data into their correct classes. Intuitively, the best way to handle this bias-variance tradeoff in such a case is to shrug and ignore the outliers:

Mathematically, the usual way to encode this is by introducing \(N\) “slack variables” \(\xi_1,…,\xi_N\geq 0\) such that the earlier \(N\) constraints are relaxed to \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1-\xi_i\). However, to avoid cutting too much slack (i.e. incurring too much misclassification on the training set), their use should be penalized by the \(\ell^1\)-norm \(|\boldsymbol{\xi}|_1:=\boldsymbol{\xi}\cdot\mathbf{1}=\sum_{i=1}^N\xi_i\). To this effect, the programming problem is modified to that of soft-margin SVM:
\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R,\boldsymbol{\xi}\in\mathbf R^N:\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi},\boldsymbol{\xi}\geq\mathbf 0}\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1\]
where \(C\geq 0\) is a “strictness” hyperparameter that must be selected ahead of time; it balances margin maximization against slack penalization (in the limit \(C\to\infty\) one recovers the hard-margin SVM, while smaller \(C\) tolerates more margin violations).
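As a quick illustration (again a sketch of my own, with made-up toy data), scikit-learn’s SVC with a linear kernel solves exactly this soft-margin problem; even with a mislabeled \(y=-1\) outlier planted inside the \(y=1\) neighborhood, a finite \(C\) lets the slack absorb it and the fit succeeds:
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # y = +1 neighborhood
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # y = -1 neighborhood
X = np.vstack([X_pos, X_neg, [[2.1, 1.9]]])                    # one y = -1 outlier amidst y = +1
y = np.r_[np.ones(50), -np.ones(50), [-1.0]]

clf = SVC(kernel="linear", C=1.0).fit(X, y)    # finite C: the slack variables absorb the outlier
print("training accuracy:", clf.score(X, y))   # ~0.99, i.e. the lone outlier is simply ignored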
Problem: The above represents the primal formulation of the programming problem in terms of so-called primal variables \(\mathbf w,b,\boldsymbol{\xi}\). Show how the KKT conditions enable one to obtain a corresponding dual formulation of the programming problem in terms of a dual variable \(\boldsymbol{\lambda}\):
\[\text{argmax}_{\boldsymbol{\lambda}\in\mathbf R^N:\boldsymbol{\lambda}\cdot\mathbf y=0,\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1}|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)\cdot X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]
Solution: The idea is to combine the \(2\) constraints \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi}\) and \(\boldsymbol{\xi}\geq\mathbf 0\) using \(2\) KKT multipliers \(\boldsymbol{\lambda},\boldsymbol{\mu}\) into the Lagrangian:
\[L(\mathbf w,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})=\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1-\boldsymbol{\lambda}\cdot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})-\boldsymbol{\mu}\cdot\boldsymbol{\xi}\]
The KKT necessary conditions assert:
\[\frac{\partial L}{\partial\mathbf w}=\mathbf 0\Rightarrow\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\]
\[\frac{\partial L}{\partial b}=0\Rightarrow\boldsymbol{\lambda}\cdot\mathbf y=0\]
\[\frac{\partial L}{\partial\boldsymbol{\xi}}=\mathbf 0\Rightarrow C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\]
in addition to the \(2\) original constraints on the primal variables (primal feasibility), dual feasibility \(\boldsymbol{\lambda},\boldsymbol{\mu}\geq\mathbf 0\) on both dual variables (which, combined with the stationarity condition \(C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\), gives \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)), and complementary slackness \(\boldsymbol{\lambda}\odot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})=\boldsymbol{\mu}\odot\boldsymbol{\xi}=\mathbf 0\). The \(1^{\text{st}}\) of these complementary slackness conditions makes the idea of support vectors explicit: \(\lambda_i>0\) is possible only for training points lying exactly on or inside their gutter, and since \(\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\), only these support vectors contribute to \(\mathbf w\).
By substituting the stationarity conditions found above back into the Lagrangian, one can eliminate all the primal variables (and even eliminate one of the dual variables, \(\boldsymbol{\mu}\)) to obtain the claim:
\[L(\boldsymbol{\lambda})=|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)\cdot X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]
which (together with the dual constraints \(\boldsymbol{\lambda}\cdot\mathbf y=0\) and \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)) is a standard quadratic programming problem with efficient solvers for \(\boldsymbol{\lambda}\).
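For the record, here is the substitution spelled out (my own expansion of the step above), writing \(Q:=(\boldsymbol{\lambda}\odot\mathbf y)\cdot X^TX(\boldsymbol{\lambda}\odot\mathbf y)\) for the quadratic form:
\[L=\frac{Q}{2}-Q-b(\boldsymbol{\lambda}\cdot\mathbf y)+\boldsymbol{\lambda}\cdot\mathbf 1+(C\mathbf 1-\boldsymbol{\lambda}-\boldsymbol{\mu})\cdot\boldsymbol{\xi}=|\boldsymbol{\lambda}|_1-\frac{Q}{2}\]
where the first equality uses \(\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\) (so that \(|\mathbf w|^2=Q\) and \(\boldsymbol{\lambda}\cdot(\mathbf y\odot X^T\mathbf w)=Q\)), and the second uses \(\boldsymbol{\lambda}\cdot\mathbf y=0\) and \(C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\).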
In fact, because the primal problem is convex (quadratic objective, affine constraints) and strictly feasible (just take \(\boldsymbol{\xi}\) large enough), Slater’s condition guarantees strong duality: the min-max of the Lagrangian over primal vs. dual variables may be interchanged, so maximizing the dual attains the same optimal value as the original primal minimization.
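For concreteness, here is a minimal self-contained sketch of my own (all names made up) that maximizes this dual with scipy by minimizing its negation subject to the box and equality constraints; the toy feature vectors are stored row-wise, so the document’s Gram matrix \(X^TX\) appears as X @ X.T below.
import numpy as np
from scipy.optimize import minimize

# Same tiny linearly separable toy dataset as before (rows are samples)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0
N = X.shape[0]

G = (X @ X.T) * np.outer(y, y)              # entries (x_i . x_j) y_i y_j

def neg_dual(lam):                          # scipy minimizes, so negate the dual objective
    return 0.5 * lam @ G @ lam - lam.sum()

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
lam = res.x
w = X.T @ (lam * y)                                # w = X (lambda ⊙ y) in the column convention
on_margin = (lam > 1e-6) & (lam < C - 1e-6)        # support vectors sitting exactly on their gutters
b = np.mean(y[on_margin] - X[on_margin] @ w)       # b = y_S - w . x_S, averaged for numerical stability
print("lambda =", np.round(lam, 3), " w =", w, " b =", b)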
Problem: Although soft-margin SVM is a significant improvement over hard-margin SVM, it too has a glaring issue. What is that issue and how can it be addressed?
Solution: Whether hard-margin or soft-margin, SVMs are ultimately linear binary classifiers; they can only directly learn a hyperplane decision boundary. Therefore, if the data is simply linearly inseparable, then even soft-margin SVM would be unwise to apply directly.

However, there is a creative solution: just take the \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) and apply a nonlinear transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) from the current feature space \(\mathbf R^n\) to some higher-dimensional feature space \(\mathbf R^{>n}\) (essentially a kind of feature engineering), thereby obtaining \(N\) transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\). Ideally, if the nonlinear transformation \(\prime\) is well-chosen, the transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\) become linearly separable in the higher-dimensional feature space \(\mathbf R^{>n}\), in which case the usual soft-margin SVM may be applied in \(\mathbf R^{>n}\). Pulling the resulting SVM (soft) maximal-margin hyperplane in \(\mathbf R^{>n}\) back to the original feature space \(\mathbf R^n\) (i.e. taking its preimage under \(\prime\), which need not even be invertible) yields a nonlinear decision boundary in \(\mathbf R^n\), thereby effectively extending the applicability of the SVM method to linearly inseparable data! See this YouTube video for an example.
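As a small self-contained sketch of my own (toy data and names made up): points in \(\mathbf R^2\) separated by a circle are not linearly separable, but appending the single engineered feature \(x_1^2+x_2^2\) already makes them linearly separable in \(\mathbf R^3\), where an ordinary (soft-margin) linear SVM then succeeds:
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
r = np.r_[rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)]   # inner disk vs. outer ring
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.r_[-np.ones(50), np.ones(50)]

X_lifted = np.c_[X, (X ** 2).sum(axis=1)]        # x' = (x1, x2, x1^2 + x2^2)
clf = SVC(kernel="linear").fit(X_lifted, y)      # a plane in R^3 now separates the classes
print("training accuracy after the lift:", clf.score(X_lifted, y))   # should be 1.0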
Problem: Continuing off the previous problem, suppose one is already in the transformed space with feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\), and one has solved the dual programming problem in this space to obtain some optimal \(\boldsymbol{\lambda}\in\mathbf R^N\). How is binary classification thus performed?
Solution: Based on the previous analysis, the SVM binary classifier in this transformed feature space \(\mathbf R^{>n}\) may be written:
\[\hat y_{\text{SVM}}(\mathbf x’)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_i\mathbf x’\cdot\mathbf x’_i+y_S-\sum_{i=1}^N\lambda_iy_i\mathbf x’_S\cdot\mathbf x’_i\right)\]
where \(\mathbf x’_S\) is any support vector lying exactly on its gutter, i.e. one with \(0<\lambda_S<C\) (so that \(\xi_S=0\) and hence \(b=y_S-\mathbf w\cdot\mathbf x’_S\)). Note that by complementary slackness most of the \(\lambda_i\) vanish (namely those corresponding to non-support feature vectors \(\mathbf x_i\)), so the sums \(\sum_{i=1}^N\) may also be written as sums over only the support vectors.
Problem: Explain how the expression above can be simplified by means of the kernel trick.
Solution: Basically, the feature space transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) is a complicated mapping of vectors to vectors. By contrast, only the scalar dot products \(\mathbf x’\cdot\mathbf x’_i\) appear in the SVM classifier \(\hat y_{\text{SVM}}(\mathbf x)\), and scalars are simpler than vectors. The kernel trick essentially invites one to forget about the details of the transformation \(\prime\) and just directly specify an analytical expression for the dot product in the transformed feature space, \(K(\mathbf x,\tilde{\mathbf x}):=\mathbf x’\cdot\tilde{\mathbf x}’\), where \(K:\mathbf R^n\times\mathbf R^n\to\mathbf R\) is called the kernel function (for this to be legitimate, \(K\) must actually be an inner product in some feature space, i.e. a symmetric positive-semidefinite Mercer kernel). Effectively, this means one never has to actually visit the transformed feature space \(\mathbf R^{>n}\), since the kernel is perfectly happy to take feature vectors in the original space \(\mathbf R^n\). Thus, one can write:
\[\hat y_{\text{SVM}}(\mathbf x)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_iK(\mathbf x,\mathbf x_i)+y_S-\sum_{i=1}^N\lambda_iy_iK(\mathbf x_S,\mathbf x_i)\right)\]
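Here is a minimal sketch of my own of this prediction rule (all names hypothetical), assuming a dual solution lam together with training data X (stored row-wise), labels y, and the index s of a support vector with \(0<\lambda_s<C\) are already in hand, with a Gaussian kernel standing in for the generic \(K\):
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def svm_predict(x_new, X, y, lam, s, kernel=rbf_kernel):
    # X holds the training feature vectors row-wise; lam is the dual solution;
    # s indexes a support vector with 0 < lam[s] < C (so it sits exactly on its gutter)
    decision = sum(l * yi * kernel(x_new, xi) for l, yi, xi in zip(lam, y, X))
    bias = y[s] - sum(l * yi * kernel(X[s], xi) for l, yi, xi in zip(lam, y, X))
    return np.sign(decision + bias)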
Problem: What is the Gaussian (radial basis function) kernel \(K_{\sigma}(\mathbf x,\tilde{\mathbf x})\)? Sketch why it’s a valid kernel function.
Solution: This is by far the most popular kernel for nonlinear SVM classification, involving a hyperparameter \(\sigma\):
\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x-\tilde{\mathbf x}|^2/2\sigma^2}\]
Algebraically, it is clearly equivalent to:
\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}e^{\mathbf x\cdot\tilde{\mathbf x}/\sigma^2}\]
Taylor expanding the last factor:
\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}\sum_{d=0}^{\infty}\frac{(\mathbf x\cdot\tilde{\mathbf x})^d}{\sigma^{2d}d!}\]
so it’s clearly a valid kernel: each degree-\(d\) polynomial kernel \(K_d(\mathbf x,\tilde{\mathbf x})=(\mathbf x\cdot\tilde{\mathbf x})^d\) is itself an inner product (of degree-\(d\) monomial feature maps), a convergent nonnegative combination of kernels is again a kernel, and multiplying by the factorized prefactor \(f(\mathbf x)f(\tilde{\mathbf x})\) with \(f(\mathbf x)=e^{-|\mathbf x|^2/2\sigma^2}\) preserves this property, so \(K_\sigma\) is the inner product of a (countably-infinite-dimensional!) feature map.
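A quick numerical sanity check of this factorization and its truncated Taylor expansion (a throwaway sketch of my own):
import numpy as np
from math import factorial

x, z, sigma = np.array([0.3, -0.7]), np.array([1.1, 0.4]), 1.5
lhs = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))                      # the RBF kernel itself
prefactor = np.exp(-x @ x / (2 * sigma ** 2)) * np.exp(-z @ z / (2 * sigma ** 2))
series = sum((x @ z) ** d / (sigma ** (2 * d) * factorial(d)) for d in range(20))
print(lhs, prefactor * series)   # the two values agree to many decimal places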
Problem: What were the main problems with SVMs that spurred the deep learning renaissance via the development of artificial neural networks?
Solution: The main issue with SVMs is their lack of scalability. Roughly speaking, the computational cost of training an SVM on a dataset of \(N\) training examples scales like \(O(N^2)\) in the best case (essentially because the Gram matrix \(X^TX\) is an \(N\times N\) matrix of all pairwise dot products between training feature vectors); even with modern tricks like sequential minimal optimization, this quadratic scaling is essentially unavoidable. Another, smaller issue is that the kernel (and hence the implicit feature representation) must be chosen by hand for each problem, whereas artificial neural networks learn their feature representations directly from the data.
Problem: Write a short Python program using the scikit-learn library to train a support vector machine on the Kaggle Titanic competition dataset.
Solution:
Titanic – Machine Learning from Disaster (Kaggle Competition)
Problem Type: supervised binary classification
Training data: $N=891$ feature vectors (corresponding to $N=891$ passengers on the Titanic), each with $n=11$ features; some features for certain passengers are NaN. Each feature vector is associated with a binary target label, either $y=0$ (passenger died) or $y=1$ (passenger survived).
Test data: $N’=418$ feature vectors (each with $n=11$ features), again with some NaN features.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
Load the labelled training data in “train.csv” (stored in the same folder as this Jupyter Notebook) into a Pandas DataFrame (representing a feature matrix $X_{\text{train}}$ together with the labelled targets $\mathbf y_{\text{train}}$).
def df_processed(csv_loc):
    """
    Loads and processes the Titanic dataset from a CSV location.

    This function handles data cleaning and feature engineering to prepare
    the data for machine learning. It addresses missing values (NaNs) and
    converts categorical features into numerical format.

    Args:
        csv_loc (str): The file path to the training or test CSV file.

    Returns:
        pd.DataFrame: A processed DataFrame ready for modeling.

    Notes:
        Data Cleaning & Imputation:
        - 'Age': Replaces NaN entries with the median age of the column.
        - 'Fare': Replaces NaN entries with the median fare.
        - 'Embarked': Replaces NaN entries with the mode (most frequent port).

        Feature Engineering & Dropping:
        - 'Cabin', 'Ticket', 'Name': These columns are dropped. 'Cabin' has
          many NaN values, while 'Ticket' and 'Name' are non-numeric object
          types with high cardinality, which are dropped for simplicity.
        - 'Sex' & 'Embarked': These categorical columns are converted into
          numerical dummy variables using one-hot encoding. `drop_first=True`
          is used to create k-1 dummies, which helps avoid
          multicollinearity (e.g., 'Sex' becomes 'Sex_male' with 0/1 values).
    """
    df = pd.read_csv(csv_loc)
    df = df.drop(["Ticket", "Cabin", "Name"], axis=1)
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # .mode() returns a Series, so take its first entry
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())  # 1 passenger in the test set has a NaN fare
    df = pd.get_dummies(df, columns=["Embarked", "Sex"], drop_first=True)
    return df
Now, separate out the feature matrix $X_{\text{train}}$ and the target label vector $\mathbf y_{\text{train}}$ from the DataFrame, and apply identical $z$-score normalization to both $X_{\text{train}}$ and $X_{\text{test}}$, using the mean and variance derived from $X_{\text{train}}$.
df_train = df_processed("train.csv")
X_tr = df_train.drop(["Survived"], axis=1)
y_tr = df_train["Survived"]
X_ts = df_processed("test.csv")

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_ts_sc = scaler.transform(X_ts)
Fit a soft-margin support vector machine to this training data, with strictness hyperparameter $C=0.9$ (default is $C=1$) and the (default) Gaussian RBF kernel $K_{\gamma}(\mathbf x,\tilde{\mathbf x})=e^{-\gamma|\mathbf x-\tilde{\mathbf x}|^2}$, where $\gamma=1/(n\sigma^2_{\hat X_{\text{train}}})$, with $\sigma^2_{\hat X_{\text{train}}}$ the variance over all entries of the scaled training feature matrix, is also scikit-learn’s default gamma='scale' (cf. Bienaymé’s identity).
svm_model = SVC(C=0.9, kernel='rbf', gamma=1/(X_tr_sc.shape[1]*X_tr_sc.var()))  # this gamma value reproduces the default gamma='scale'
svm_model.fit(X_tr_sc, y_tr)
y_ts_hat = svm_model.predict(X_ts_sc)
PassengerIDs = pd.read_csv("test.csv")["PassengerId"].to_numpy()
df_submission = pd.DataFrame({"PassengerId": PassengerIDs,
                              "Survived": y_ts_hat})
df_submission.to_csv('SVM_submission.csv', index=False)
This approach turns out to yield a test set accuracy of $79.665\%$. One can use scikit-learn’s GridSearchCV class to try different values of the hyperparameter $C$ (and even different choices of kernel), using cross-validation to determine which combination does best.
grid = {"C": np.arange(0.8, 1.1, 0.01),
        "kernel": ["linear", "rbf"]}

grid_search = GridSearchCV(
    estimator=SVC(probability=True),
    param_grid=grid,
    cv=5,
    n_jobs=-1,
    verbose=2
)
grid_search.fit(X_tr_sc, y_tr)
print("\n--- GridSearch Results ---")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
Fitting 5 folds for each of 62 candidates, totalling 310 fits
--- GridSearch Results ---
Best parameters found: {'C': 0.9200000000000002, 'kernel': 'rbf'}
Best cross-validation score: 0.8260