PyTorch Fundamentals (Part \(2\))

Problem: Do an end-to-end walkthrough of the PyTorch machine learning workflow using the most basic univariate linear regression example. In particular, generate some linear data over a normalized feature space (whose slope \(w\) and intercept \(b\) would in practice be a priori unknown), split that linear data into training and testing subsets (no cross-validation dataset needed for this simple example), define the linear layer class, instantiate a model object of the class, and starting from random values of \(w\) and \(b\), use stochastic gradient descent with learning rate \(\alpha=0.01\) to minimize the training cost function \(C_{\text{train}}(w,b)\) based on an \(L^1\) loss. Iterate SGD for \(300\) epochs, and for every \(20\) epochs, record the current value of \(C_{\text{train}}(w,b)\) and the current value of \(C_{\text{test}}(w,b)\). Plot these cost function curves as a function of the epoch number \(0, 20, 40,…\). Save the final state dictionary of the model’s learned parameters \(w,b\) post-training, and load it back onto a new instance of the model class.

Solution:

pytorch_workflow
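
(The linked notebook presumably contains the full worked solution; what follows is a minimal sketch of the same workflow in PyTorch, with illustrative choices of the “unknown” \(w,b\), the data size, and the train/test split fraction.)

    import torch
    from torch import nn
    import matplotlib.pyplot as plt

    # Generate linear data y = w*x + b over a normalized feature space [0, 1)
    w_true, b_true = 0.7, 0.3                        # a priori "unknown" in practice
    X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)    # shape (50, 1)
    y = w_true * X + b_true

    # Train/test split (no cross-validation set needed for this simple example)
    split = int(0.8 * len(X))
    X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

    # Define the linear layer class and instantiate a model with random w, b
    class LinearRegressionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(in_features=1, out_features=1)
        def forward(self, x):
            return self.linear(x)

    torch.manual_seed(42)
    model = LinearRegressionModel()

    # L1 loss and stochastic gradient descent with learning rate alpha = 0.01
    loss_fn = nn.L1Loss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    epochs, train_costs, test_costs = [], [], []
    for epoch in range(300):
        model.train()
        loss = loss_fn(model(X_train), y_train)      # C_train(w, b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 20 == 0:                          # record every 20 epochs
            model.eval()
            with torch.inference_mode():
                test_loss = loss_fn(model(X_test), y_test)   # C_test(w, b)
            epochs.append(epoch)
            train_costs.append(loss.item())
            test_costs.append(test_loss.item())

    # Plot the cost curves as a function of the epoch number 0, 20, 40, ...
    plt.plot(epochs, train_costs, label="C_train")
    plt.plot(epochs, test_costs, label="C_test")
    plt.xlabel("epoch"); plt.legend(); plt.show()

    # Save the final state dict of learned w, b and load it onto a new instance
    torch.save(model.state_dict(), "linear_model.pth")
    model_reloaded = LinearRegressionModel()
    model_reloaded.load_state_dict(torch.load("linear_model.pth"))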

PyTorch Fundamentals (Part \(1\))

Problem: Illustrate some of the basic fundamentals involved in using the PyTorch deep learning library. In particular, discuss the attributes of PyTorch tensors (e.g. dtype, CPU/GPU devices, etc.), how to generate random PyTorch tensors with/without seeding, and operations that can be performed on and between PyTorch tensors.

Solution:

pytorch_fundamentals
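
(Again, the linked notebook is the full solution; a minimal sketch of the basics it covers:)

    import torch

    # Tensor attributes: dtype, shape, and CPU/GPU device
    x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
    print(x.dtype, x.shape, x.device)        # torch.float32, torch.Size([2, 2]), cpu
    if torch.cuda.is_available():
        x = x.to("cuda")                     # move the tensor onto the GPU if present

    # Random tensors, without and with seeding
    a = torch.rand(3, 4)                     # unseeded: different on every run
    torch.manual_seed(0)
    b = torch.rand(3, 4)                     # seeded: reproducible
    torch.manual_seed(0)
    c = torch.rand(3, 4)
    print(torch.equal(b, c))                 # True

    # Operations on and between tensors
    print(x + 10, x * x)                     # elementwise arithmetic
    print(x @ x.T)                           # matrix multiplication
    print(x.mean(), x.sum(), x.argmax())     # reductions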

Tokenization & Transformers

Problem: Let \(|\mathcal V|,N_c\in\mathbf Z^+\) be positive integers (where \(|\mathcal V|\) is the cardinality of an arbitrary set \(\mathcal V\) called the vocabulary and \(N_c\) will come to be seen as the number of codebooks), and let \(\mathcal T\) be a (finite or infinite) set whose elements are called tokens. What does it mean for a function \(\mathbf i\) to be a \((|\mathcal V|,N_c)\)-tokenizer on the token space \(\mathcal T\)? Give some examples of tokenizers.

Solution: It means that \(\mathbf i:\mathcal T\to\{1,…,|\mathcal V|\}^{N_c}\) quantizes each abstract token \(\tau\in\mathcal T\) as some concrete \(N_c\)-tuple of integers \(\mathbf i(\tau):=(i_1(\tau),…,i_{N_c}(\tau))\) where each of these \(N_c\) token IDs \(i_1(\tau),…,i_{N_c}(\tau)\in\mathbf N\) associated to the token \(\tau\in\mathcal T\) ranges from \(1\) to the vocabulary size \(|\mathcal V|\).

  • For \(\mathcal T:=\{a, b, …, z, @, \&, ., uh, th, …\}\) the set of natural language tokens (could be finite or infinite depending on how one defines \(\mathcal T\)), tokenization \(\mathbf i\) is typically done via a single (\(N_c=1\)) lookup inside a dictionary (“codebook/vocabulary”) of size \(|\mathcal V|\sim 10^5\) (indeed, one can simply take \(\mathcal V:=\mathcal T\)).
  • For \(\mathcal T\) the (infinite) set of \(20\text{ ms}\) Nyquist-downsampled audio waveform tokens, tokenization \(\mathbf i\) might be implemented by first passing an audio token \(\tau\in\mathcal T\) into some CNN encoder \(\mathbf e_{\text{CNN}}:\mathcal T\to\mathbf R^{d}\) (for some latent dimension \(d\)), thereby obtaining a hidden latent vector \(\mathbf e_{\text{CNN}}(\tau)\in\mathbf R^{d}\), followed by residual vector quantization \(\text{RVQ}:\mathbf R^{d}\to\{1,…,|\mathcal V|\}^{N_c}\) of \(\mathbf e_{\text{CNN}}(\tau)\) through a sequence of \(N_c\) ordered codebooks \(\mathcal V_1,…,\mathcal V_{N_c}\) each containing the same number \(|\mathcal V|:=|\mathcal V_1|=…=|\mathcal V_{N_c}|\) of vectors:
    \[\text{RVQ}(\mathbf e):=(i_1,…,i_{N_c}),\qquad i_c:=\text{argmin}_{1\leq i\leq |\mathcal V|}|\mathbf r_{c-1}-\mathbf e_{i}^{(c)}|,\qquad\mathbf r_0:=\mathbf e,\quad\mathbf r_c:=\mathbf r_{c-1}-\mathbf e_{i_c}^{(c)}\]
    (one could also consider codebooks of different sizes, though conceptually that would simply change the range of the tokenizer to \(\mathbf i:\mathcal T\to\{1,…,|\mathcal V_1|\}\times…\times\{1,…,|\mathcal V_{N_c}|\}\)).
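
As a toy illustration of the definition (an assumed character-level example, not any standard tokenizer), a \((|\mathcal V|,N_c=1)\)-tokenizer can be realized as a single dictionary lookup:

    # A (|V|, N_c = 1)-tokenizer: one codebook lookup over a small character vocabulary
    vocab = list("abcdefghijklmnopqrstuvwxyz @&.")             # the vocabulary V
    token_to_id = {tok: i + 1 for i, tok in enumerate(vocab)}  # token IDs range over 1..|V|

    def tokenize(token: str) -> tuple:
        """Quantize an abstract token as its N_c = 1 tuple of token IDs."""
        return (token_to_id[token],)

    print(tokenize("a"), tokenize("@"))   # (1,), (28,)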

Problem: Explain the transformer architecture as first described in Attention Is All You Need purely from the inference perspective of a forward pass on a model which is already well-trained (i.e. by articulating this, one thus defines what the objective of one’s model is, informing the choice of cost function during training).

Solution: Note that the tokenization process described above is not part of the transformer, but rather just some pre-processing of natural language into a format suitable to be input into the transformer.

  1. (One-Hot Encoding) Take the sequence of token IDs from tokenization, and one-hot encode them with respect to a vocabulary of size \(d_V\). Note this step is parameter-free, i.e. for a fixed vocabulary, there’s nothing to be learned here.
  2. (Embedding) Take each one-hot encoded vector \(\hat{\mathbf e}\in\{0,1\}^{d_V}\), and multiply it by the transformer’s (learnable) embedding matrix \(W_E\in\mathbf R^{d_E\times d_V}\), thereby obtaining some embedding vector \(\hat{\mathbf e}\mapsto\mathbf x:=W_E\hat{\mathbf e}\in\mathbf R^{d_E}\) that, after pre-training the transformer, should ideally learn a latent space representation of the vocabulary in which semantically similar words are close together and certain directions in the embedding space convey certain concepts.
  3. (Positional Encoding) Add something to each embedding vector in a way that signals where in the input natural language prompt it appears (e.g. sinusoidal encoding in the original paper).
  4. (Single Self-Attention Head) For each positional embedding vector \(\mathbf x_i\in\mathbf R^{n_e}\) (writing \(n_e:=d_E\) for the embedding dimension), compute its query vector \(\mathbf q_i=W_{\mathbf q}\mathbf x_i\), its key vector \(\mathbf k_i=W_{\mathbf k}\mathbf x_i\), and its value vector \(\mathbf v_i=W_{\mathbf v}\mathbf x_i\). Here, \(W_{\mathbf q},W_{\mathbf k}\in\mathbf R^{n_{qk}\times n_e}\) are weight matrices that map from the embedding space \(\mathbf R^{n_e}\) to the query/key space \(\mathbf R^{n_{qk}}\) of dimension \(n_{qk}\) and \(W_{\mathbf v}\in\mathbf R^{n_e\times n_e}\) is the weight matrix of values (which in practice is decomposed into a low-rank approximation \(W_{\mathbf v}=W_{\mathbf v\uparrow}W_{\mathbf v\downarrow}\) where typically \(W_{\mathbf v\downarrow}\in\mathbf R^{n_{qk}\times n_e}\) and \(W_{\mathbf v\uparrow}\in\mathbf R^{n_e\times n_{qk}}\)). For each \(\mathbf x_i\), one computes an update vector \(\Delta\mathbf x_i\) to be added to it (called a skip connection as typical in ResNets) according to a convex linear combination of the value vectors \(\mathbf v_1,…,\mathbf v_N\) of all the embeddings \(\mathbf x_1,…,\mathbf x_N\) in the context (a code sketch of one such head is given after this list), specifically:

\[\Delta\mathbf x_i=V\text{softmax}\left(\frac{K^T\mathbf q_i}{\sqrt{n_{qk}}}\right)\]

where \(K=(\mathbf k_1,…,\mathbf k_N)\in\mathbf R^{n_{qk}\times N}\) and \(V=(\mathbf v_1,…,\mathbf v_N)\in\mathbf R^{n_e\times N}\) are key and value matrices associated to the inputted context (filled with column vectors here rather than the ML convention of row vectors). This map that takes the initial, generic token embeddings \(\mathbf x_i\) and nudges them towards more contextualized embeddings \(\mathbf x_i\mapsto\mathbf x’_i=\mathbf x_i+\Delta\mathbf x_i\) is called a head of self-attention. The \(1/\sqrt{n_{qk}}\) scaling in the softmax temperature is justified on the grounds that if \(\mathbf k\) and \(\mathbf q\) are random vectors whose independent components each have mean \(0\) and variance \(1\), then \(\mathbf k\cdot\mathbf q\) will have mean \(0\) and variance \(n_{qk}\), hence the need to normalize by \(\sqrt{n_{qk}}\) to ensure \(\mathbf k\cdot\mathbf q/\sqrt{n_{qk}}\) continues to have variance \(1\).

  5. (Multi-Headed Self-Attention) Since context can influence meaning in different ways, repeat the above procedure in parallel for several heads of self-attention; each head will propose a displacement update to each of the \(N\) original embeddings \(\mathbf x_i\); add up all of them.

  6. (Multilayer Perceptron) Essentially a position-wise Linear, ReLU, Linear block applied to each embedding independently. It is hypothesized that facts are stored in this part of the transformer.

  7. (Layers) Alternate between the multi-headed self-attention blocks and MLP blocks; finally, make a probabilistic prediction of the next token \(\hat{\tau}_{N+1}\) using only the final, context-rich, modified embedding \(\mathbf x’_N\) of the last token \(\tau_N\) in the context, by applying an unembedding matrix \(W_{\mathbf u}\) to obtain logits \(\mathbf u=W_{\mathbf u}\mathbf x’_N\) and running them through a softmax \(\text{softmax}(\mathbf u)\).
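
As referenced in step \(4\), here is a minimal sketch of a single self-attention head (with illustrative dimensions \(n_e=64\), \(n_{qk}=16\), \(N=10\), no causal masking, and the low-rank value factorization omitted); it follows the column-vector formula above but stores the embeddings as rows, per the usual ML convention:

    import torch

    n_e, n_qk, N = 64, 16, 10                 # embedding dim, query/key dim, context length
    torch.manual_seed(0)
    X = torch.randn(N, n_e)                   # rows are the positional embeddings x_1, ..., x_N
    W_q = torch.randn(n_qk, n_e) / n_e**0.5   # query weight matrix
    W_k = torch.randn(n_qk, n_e) / n_e**0.5   # key weight matrix
    W_v = torch.randn(n_e, n_e) / n_e**0.5    # value weight matrix

    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T            # (N, n_qk), (N, n_qk), (N, n_e)
    attn = torch.softmax(Q @ K.T / n_qk**0.5, dim=-1)    # row i = convex weights for x_i
    delta_X = attn @ V                                   # row i = update Delta x_i
    X_contextualized = X + delta_X                       # skip connection x_i -> x_i + Delta x_i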

Problem: Based on the above discussion of the transformer architecture, explain how a large language model (LLM) like Gemini, ChatGPT, Claude, Grok, DeepSeek, etc. works (at a high level).

Solution: Essentially, since an LLM is a neural network which takes as input some string of text and probabilistically predicts the next token, by seeding it with some corpus of text \(T\), the LLM can sample a token \(\tau\) according to the probability distribution it generates for the next token, and append it: \(T\mapsto T+\tau\). Then, simply repeat this, pretending that \(T+\tau\) was the seed all along. This is how generative AI models such as ChatGPT (where GPT stands for generative pre-trained transformer) work. In practice, it is helpful to also provide some system prompt like “What follows is a conversation between a user and a knowledgeable AI assistant:”.
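
Schematically, the autoregressive loop is just the following sketch (the model object and its interface of returning one row of next-token logits per context position are assumptions made for illustration):

    import torch

    def generate(model, token_ids, n_new_tokens):
        """Repeatedly sample the next token from the model's predicted distribution
        and append it to the running context (token_ids is a 1D tensor of token IDs)."""
        for _ in range(n_new_tokens):
            logits = model(token_ids)[-1]                       # next-token logits (assumed interface)
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)   # sample ~ predicted distribution
            token_ids = torch.cat([token_ids, next_id])         # T -> T + tau, then repeat
        return token_ids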


JAX Fundamentals (Part \(1\))

JAX_tutorial

Monte Carlo Methods

Problem: Distinguish between Las Vegas methods and Monte Carlo methods.

Solution: Both are umbrella terms referring to broad classes of methods that draw (repeatedly) from (not necessarily i.i.d.) random variables to compute the value of some deterministic variable. Here, “compute” means “find that deterministic value with certainty” in the case of a Las Vegas method whereas “compute” means “(point) estimate” in the case of a Monte Carlo method. Thus, their key difference lies in how they trade off time \(t\) vs. accuracy \(\alpha\): a Las Vegas method has random runtime \(t\) but certain accuracy, whereas a Monte Carlo method can run within a fixed budget \(t\) but has random accuracy \(\alpha\).

Problem: Given a scalar continuous random variable with probability density function \(p(x)\), how can one sample faithfully from such an \(x\sim p(x)\)?

Solution: Fundamentally, a computer can (approximately) draw from a uniform random variable \(u\in [0,1]\) (using seed-based pseudorandom number generators for instance). In order to convert this \(u\to x\), the standard way is to use the cumulative distribution function of \(x\):

\[\int_{-\infty}^xdx’p(x’):=u\]

Then, as the probability density of \(u\) is uniform \(p(u)=1\), the probability density of \(x\) will be \(p(x)=\frac{du}{dx}p(u)\). As the CDF is a monotonically increasing bijection onto \([0,1]\), there always exists an \(x=x(u)\) for each \(u\in[0,1]\). Intuitively, this whole sampling scheme should make a lot of sense: regions where \(p(x)\) is large are exactly where the CDF rises steeply, so a uniform draw of \(u\) is proportionally more likely to land in them.
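
As a quick numerical check of this inverse-CDF trick, consider the exponential density \(p(x)=e^{-x}\) for \(x\geq 0\), whose CDF \(u=1-e^{-x}\) inverts in closed form to \(x=-\ln(1-u)\):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    u = rng.uniform(size=100_000)   # draws from the uniform density p(u) = 1 on [0, 1]
    x = -np.log(1.0 - u)            # invert the CDF u = 1 - exp(-x)

    print(x.mean(), x.var())        # both should be close to 1 for the Exp(1) distribution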

Problem: So, it seems like the above “inverse CDF trick” has solved the problem of sampling from an arbitrary random variable \(x\) (and it generalizes to the case of a higher-dimensional random vector \(\mathbf x\in\mathbf R^n\) by sampling conditionally). So how come in many practical applications this “sampling” problem can still be difficult?

Solution: In practice, one may not have access to \(p(\mathbf x)\) itself, but only the general shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) of it (e.g. in Bayesian inference). One might object that \(Z\) is known, namely \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\) can be written explicitly as an \(n\)-dimensional integral, so wherein lies the difficulty? The difficulty lies in evaluating the integral, which suffers from the curse of dimensionality if \(n\gg 1\) is large (just imagine approximating the integral by a sum, and then realizing that the number of terms \(N\) in such a series would grow as \(N\sim O(\exp n)\)). The inverse CDF trick thus fails because it requires one to have \(p(\mathbf x)\) itself, not merely \(\tilde p(\mathbf x)\).

Problem: Describe the Monte Carlo method known as importance sampling for estimating the expectation \(\langle f(\mathbf x)\rangle\) of a random variable \(f(\mathbf x)\).

Solution: Importance sampling consists of a deliberate reframing of the integrand of the expectation inner product integral:

\[\langle f(\mathbf x)\rangle=\int d^n\mathbf x p(\mathbf x)f(\mathbf x)=\int d^n\mathbf x q(\mathbf x)\frac{p(\mathbf x)f(\mathbf x)}{q(\mathbf x)}\]

in other words, replacing \(p(\mathbf x)\mapsto q(\mathbf x)\) and \(f(\mathbf x)\mapsto p(\mathbf x)f(\mathbf x)/q(\mathbf x)\). There are \(2\) cases in which importance sampling can be useful. The first case is related to the previous problem, namely when sampling directly from \(p(\mathbf x)\) is too difficult, in which case \(q(\mathbf x)\) should be chosen to be easier to both sample from and evaluate than \(p(\mathbf x)\). The second case is that, even if sampling directly from \(p(\mathbf x)\) is feasible, it can still be useful to instead sample from \(q(\mathbf x)\) as a means of reducing the variance of the Monte Carlo estimator. That is, whereas the original Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(p(\mathbf x)\) is \(\frac{1}{N}\sum_{i=1}^Nf(\mathbf x_i)\) with variance \(\sigma^2_{f(\mathbf x)}/N\), the new Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws from \(q(\mathbf x)\) is \(\frac{1}{N}\sum_{i=1}^Np(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)\) with variance \(\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}/N\); for importance sampling to be useful, \(q(\mathbf x)\) should be chosen such that:

\[\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}<\sigma^2_{f(\mathbf x)}\]

One can explicitly find the function \(q(\mathbf x)\) that minimizes the variance functional of \(q(\mathbf x)\) (using a Lagrange multiplier to enforce normalization \(\int d^n\mathbf x q(\mathbf x)=1\)):

\[q(\mathbf x)\sim p(\mathbf x)|f(\mathbf x)|\]

In and of itself, this \(q(\mathbf x)\) cannot be used because it contains the difficult piece \(p(\mathbf x)\). Nevertheless, motivated by this result, one can heuristically importance sample using a \(q(\mathbf x)\) which ideally is large at \(\mathbf x\in\mathbf R^n\) where \(p(\mathbf x)|f(\mathbf x)|\) is large. Intuitively, this is because that’s the integrand in the expectation inner product integral, so locations where it’s large will contribute most to the integral, hence are the most “important” regions to sample from with \(q(\mathbf x)\).

One glaring flaw with the above discussion is that if \(Z\) is not known, then it will also be impossible to sample from \(p(\mathbf x)\). Instead, suppose as before that only the shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) is known with \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\). Thus, the expectation is:

\[\frac{\int d^n\mathbf x\tilde{p}(\mathbf x)f(\mathbf x)}{\int d^n\mathbf x\tilde{p}(\mathbf x)}\]

The key is to notice that the denominator is just the \(f(\mathbf x)=1\) special case of the numerator, hence the importance sampling trick can be applied to both numerator and denominator (using the same \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(q(\mathbf x)\)), thereby obtaining the (biased, but asymptotically unbiased) Monte Carlo estimator:

\[\frac{\sum_{i=1}^N\tilde p(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)}{\sum_{i=1}^N\tilde p(\mathbf x_i)/q(\mathbf x_i)}\]

for the expectation.
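
A minimal numerical sketch of this self-normalized importance sampling estimator, targeting the unnormalized \(\tilde p(x)=e^{-x^2/2}\) (so \(p\) is a standard normal) with a wider Gaussian proposal \(q\) and \(f(x)=x^2\), whose true expectation is \(1\):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    N = 100_000

    p_tilde = lambda x: np.exp(-x**2 / 2)   # shape of the target, normalization Z unknown
    f = lambda x: x**2                      # expectation under p is exactly 1

    sigma_q = 2.0                           # proposal q = N(0, sigma_q^2): easy to sample and evaluate
    x = rng.normal(0.0, sigma_q, size=N)
    q = np.exp(-x**2 / (2 * sigma_q**2)) / (sigma_q * np.sqrt(2 * np.pi))

    w = p_tilde(x) / q                      # unnormalized importance weights
    print(np.sum(w * f(x)) / np.sum(w))     # biased but asymptotically unbiased estimate of <f(x)>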

Problem: Describe what Markov chain Monte Carlo (MCMC) methods are about.

Solution: Whereas in “naive” Monte Carlo methods, the random vector draws \(\mathbf x_1,…,\mathbf x_N\) are often i.i.d., here the key innovation is that the sequence of random vectors is no longer i.i.d.; rather, \(\mathbf x_t\) depends on \(\mathbf x_{t-1}\) (and, conditional on \(\mathbf x_{t-1}\), is independent of all the earlier \(\mathbf x_{<t-1}\)). In other words, the sequence \(\mathbf x_1,…,\mathbf x_N\) forms a (discrete-time) Markov chain, i.e. there exists a transition kernel with “matrix elements” \(p(\mathbf x’|\mathbf x)\) for any pair of states \(\mathbf x,\mathbf x’\in\mathbf R^n\) (assuming the DTMC is \(t\)-independent/homogeneous). The goal of an MCMC method is to find a suitable \(p(\mathbf x’|\mathbf x)\) such that the stationary \(N\to\infty\) distribution of the chain matches \(p(\mathbf x)\). Thus, it’s basically an “MCMC sampler”.

Problem: What does it mean to burn in an MCMC run? What does it mean to thin an MCMC run?

Solution: Given a DTMC \(\mathbf x_1,…,\mathbf x_N\), the idea of burning-in the chain is to discard the first few samples \(\mathbf x_1,\mathbf x_2,…\) as they’re correlated with the (arbitrary) choice of starting \(\mathbf x_1\) and so are not representative of \(p(\mathbf x)\) (equivalently, one hopes to burn-in for the first \(N_b\) samples where \(N_b\) is around the mixing time of the Markov chain). Burn-in therefore gives the MCMC run a chance to find the high-probability regions of \(p(\mathbf x)\).

Thinning an MCMC run means keeping only e.g. every \(10^{\text{th}}\) point of the DTMC; this reduces the correlation between the retained samples, since the goal is to (approximately) sample independently from \(p(\mathbf x)\).
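
In practice, both burn-in and thinning amount to slicing the stored chain; a sketch (the stand-in array, burn-in length, and thinning stride are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    chain = rng.normal(size=(20_000, 2))   # stand-in for a stored MCMC chain of shape (N, n)
    N_b, stride = 1_000, 10                # burn-in length and thinning stride
    samples = chain[N_b::stride]           # drop the first N_b draws, keep every 10th thereafter
    print(samples.shape)                   # (1900, 2)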

Problem: The Metropolis-Hastings method is a general template for a fairly broad family of MCMC methods. Explain how it works.

Solution: In order for \(p(\mathbf x)\) to be the stationary distribution of a DTMC, by definition one requires it to be an eigenvector (with eigenvalue \(1\)) of the transition kernel:

\[\int d^n\mathbf xp(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x’)\]

A sufficient (though not necessary) condition for this is if the chain is in detailed balance:

\[p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x|\mathbf x’)p(\mathbf x’)\]

for all \(\mathbf x,\mathbf x’\in\mathbf R^n\). The idea of Metropolis-Hastings is to conceptually decompose the transition from state \(\mathbf x\to\mathbf x’\) into \(2\) simpler steps, namely:

  1. Propose the transition \(\mathbf x\to\mathbf x’\) with proposal probability \(p_p(\mathbf x’|\mathbf x)\) (this distribution \(p_p\) needs to be easy to sample \(\mathbf x’\) from for any \(\mathbf x\)!)
  2. Accept/reject the proposed transition \(\mathbf x\to\mathbf x’\) with an acceptance probability \(p_a(\mathbf x’|\mathbf x)\) (this distribution also needs to be easy to sample from!)

Thus, \(p(\mathbf x’|\mathbf x)=p_a(\mathbf x’|\mathbf x)p_p(\mathbf x’|\mathbf x)\). The ratio of acceptance probabilities from detailed balance is therefore:

\[\frac{p_a(\mathbf x’|\mathbf x)}{p_a(\mathbf x|\mathbf x’)}=\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\]

The ratio on the RHS is known (furthermore, because it’s a ratio, neither \(p_p\) nor \(p\) need be normalized! In other words, one can replace \(p_p\mapsto \tilde p_p\) and \(p\mapsto\tilde p\)). The task remains to specify the form of \(p_a(\mathbf x’|\mathbf x)\). The detailed balance condition doesn’t give a unique solution for \(p_a\), but heuristically one can get a unique solution by further insisting that the walker accept moves often. That is, w.l.o.g. suppose the RHS is computed to be \(>1\), so that \(p_a(\mathbf x’|\mathbf x)>p_a(\mathbf x|\mathbf x’)\). This implies that the transition from \(\mathbf x\to\mathbf x’\) is in some sense more favourable than the reverse transition \(\mathbf x’\to\mathbf x\), so it seems that if at time \(t\) the MCMC walker were at \(\mathbf x_t=\mathbf x\), then at time \(t+1\) the MCMC walker should definitely move to state \(\mathbf x_{t+1}:=\mathbf x’\). In other words, one would like to set \(p_a(\mathbf x’|\mathbf x):=1\) and hence \(p_a(\mathbf x|\mathbf x’)=\frac{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}\). But this therefore fixes the form of the function \(p_a\)! Specifically, to evaluate \(p_a(\mathbf x’|\mathbf x)\), one should set it to \(1\) if the RHS ratio is \(>1\), whereas if the RHS ratio is \(<1\), then \(p_a\) should be set to the RHS ratio. An elegant way to encode all of this into a single compact expression is just:

\[p_a(\mathbf x’|\mathbf x)=\min\left(1,\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\right)\]

Practically, to actually implement this Metropolis-Hastings acceptance probability \(p_a\), one can essentially flip a “biased” coin with probability \(p_a(\mathbf x’|\mathbf x)\) of landing “heads” (meaning to accept the transition \(\mathbf x\to\mathbf x’\)), e.g. randomly picking a number \(u\in[0,1]\) and declaring acceptance iff \(u\leq p_a(\mathbf x’|\mathbf x)\).

(aside: if the proposal probability \(p_p(\mathbf x’|\mathbf x)\sim e^{-|\mathbf x’-\mathbf x|^2/2\sigma^2}\) is taken to be normally and isotropically distributed about \(\mathbf x\), then it is symmetric \(p_p(\mathbf x’|\mathbf x)=p_p(\mathbf x|\mathbf x’)\) and the MH acceptance probability simplifies \(\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(\mathbf x’)}{p(\mathbf x)}\) (in which case this is sometimes just called the Metropolis method). Another tip is that the burn-in period of the MCMC walk can often be used to tune the variance \(\sigma^2\) of the above Gaussian \(p_p\) distribution, namely if the burn-in period lasts for \(N_b\) samples, then look at the number of times \(N_a\) the MCMC walker accepted a proposed state from \(p_p\); a rule-of-thumb is that in high-dimensions \(n\gg 1\), one should tune \(\sigma^2\) such that the fraction of accepted proposals \(N_a/N_b\approx 23.4\%\), see “Weak convergence and optimal scaling of random walk Metropolis algorithms” by Roberts, Gelman, and Gilks (\(1997\)). Finally, one more MCMC trick worth mentioning is that one can always run e.g. \(\sim 100\) MCMC sequences in parallel and sample randomly from them as an alternative to thinning).
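
A minimal random-walk Metropolis sketch targeting an unnormalized \(2\)D density \(\tilde p(\mathbf x)\) (the Gaussian proposal is symmetric, so the acceptance ratio reduces to \(\tilde p(\mathbf x’)/\tilde p(\mathbf x)\); the target and all tuning values are illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def p_tilde(x):
        """Unnormalized target density (shape of a correlated 2D Gaussian)."""
        return np.exp(-(x[0]**2 + x[1]**2 + x[0] * x[1]) / 2)

    n_steps, sigma = 50_000, 1.0
    x = np.zeros(2)                                       # arbitrary starting state
    chain, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + rng.normal(0.0, sigma, size=2)       # symmetric Gaussian proposal p_p
        p_a = min(1.0, p_tilde(x_prop) / p_tilde(x))      # Metropolis acceptance probability
        if rng.uniform() <= p_a:                          # flip the biased coin
            x, n_accept = x_prop, n_accept + 1
        chain.append(x.copy())

    chain = np.array(chain)[1_000::10]                    # burn in, then thin
    print(n_accept / n_steps)                             # acceptance fraction, used to tune sigma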

Problem: Explain how Gibbs sampling works.

Solution: Whereas a Metropolis-Hastings sampler moves around in state space like a queen in chess, a Gibbs sampler moves around like a rook. That is, it only ever moves along the \(n\) coordinate directions in \(\mathbf x=(x_1,…,x_n)\). Indeed, the Gibbs sampler can be viewed as a special case of a Metropolis-Hastings sampler in which the proposal distribution \(p_p\) is given by:

\[p_p(\mathbf x’|\mathbf x):=p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)\]

where \(i=1,…,n\) is cycled through the \(n\) coordinate directions. For each \(1\leq i\leq n\), the MH acceptance probability is:

\[\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(x_i|x’_1,…,x’_{i-1},x’_{i+1},…,x’_n)\delta(x_1-x’_1)…\delta(x_{i-1}-x’_{i-1})\delta(x_{i+1}-x’_{i+1})…\delta(x_n-x’_n)p(\mathbf x’)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)p(\mathbf x)}\]

\[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x’_i,…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_n)}\]

\[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}\]

\[=1\]

so \(\text{min}(1,1)=1\); in other words, the Gibbs sampler always accepts because the proposal is coming from the conditional probability slices themselves (assumed to be known even if the full joint distribution \(p(\mathbf x)\) is unknown), so they must be authentic samples in a sense. For a Gibbs sampler, only after cycling through all \(n\) coordinate directions \(i=1,…,n\) does it count as \(1\) MCMC iteration.

(aside: a common variant of the basic Gibbs sampling formalism described above is block Gibbs sampling in which one may choose to update \(2\) or more features simultaneously, conditioning on the distribution of those features given all other features fixed; this would enable diagonal movements within that subspace which, depending how correlated those features are, could significantly increase the effective sample size (ESS)).
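
A minimal Gibbs sampling sketch for a case where the conditional slices are known exactly: a bivariate normal with unit variances and correlation \(\rho\), for which \(p(x_1|x_2)\) is normal with mean \(\rho x_2\) and variance \(1-\rho^2\) (and vice versa); the value of \(\rho\) is illustrative:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    rho, n_iters = 0.8, 20_000
    x1, x2 = 0.0, 0.0                                    # arbitrary starting point
    chain = np.empty((n_iters, 2))
    for t in range(n_iters):
        # Sweep the coordinate directions using the exact conditional probability slices
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # draw from p(x1 | x2)
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # draw from p(x2 | x1)
        chain[t] = (x1, x2)                              # one full sweep = one MCMC iteration

    print(np.corrcoef(chain[1_000:].T))                  # empirical correlation ~ rho after burn-in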

Problem: Explain how the Hamiltonian Monte Carlo method works.

Solution:


Differential Geometry

Problem: What does it mean for a topological space \(X\) to be locally homeomorphic to a topological space \(Y\)? Hence, what does it mean for a topological space \(X\) to be locally Euclidean?

Solution: \(X\) is said to be locally homeomorphic to \(Y\) iff every point \(x\in X\) has a neighbourhood homeomorphic to an open subset of \(Y\). In particular, \(X\) is said to be locally Euclidean iff it is locally homeomorphic to Euclidean space \(Y=\mathbf R^n\) for some \(n\in\mathbf N\) called the dimension of \(X\).

Explicitly, this means every point \(x\in X\) is associated to at least one neighbourhood \(U_x\) along with at least one homeomorphism \(\phi_x:U_x\to\phi_x(U_x)\subset\mathbf R^n\); the ordered pair \((U_x,\phi_x)\) is an example of a chart on \(X\), and any collection of charts covering \(X\) (such as \(\{(U_x,\phi_x):x\in X\}\)) is called an atlas for \(X\). Hence, \(X\) is locally Euclidean iff there exists an atlas for \(X\).

(caution: there is a distinct concept of a local homeomorphism from \(X\to Y\); existence of such implies that \(X\) is locally homeomorphic to \(Y\), but the converse is false).

Problem: Give an example of topological spaces \(X\) and \(Y\) such that \(X\) is locally homeomorphic to \(Y\) but not vice versa.

Solution: \(X=(0,1)\) and \(Y=[0,1)\). Thus, be wary that local homeomorphicity is not a symmetric relation of topological spaces.

Problem: Let \(X\) be an \(n\)-dimensional locally Euclidean topological space. What additional topological constraints are typically placed on \(X\) in order for it to qualify as a topological \(n\)-manifold?

Solution:

  1. \(X\) is Hausdorff.
  2. \(X\) is second countable.

That being said, conceptually the most important criterion to remember is the local Euclideanity of \(X\). Topological manifolds are the most basic/general/fundamental of manifolds, onto which additional structures may be attached.

Problem: Let \(X\) be a topological \(n\)-manifold. What additional piece of structure should be attached to \(X\) so as to consider it a \(C^k\)-differentiable \(n\)-manifold?

Solution: The additional piece of structure is a \(C^k\)-atlas. A \(C^k\)-atlas for \(X\) is just an atlas for \(X\) with the property that all overlapping charts are compatible with each other. This means that if \((U_1,\phi_1),(U_2,\phi_2)\) are \(2\) charts of a \(C^k\)-atlas for \(X\) such that \(U_1\cap U_2\neq\emptyset\), then the transition map \(\phi_2\circ\phi_1^{-1}:\phi_1(U_1\cap U_2)\subset\mathbf R^n\to\phi_2(U_1\cap U_2)\subset\mathbf R^n\) is \(C^k\)-differentiable (equivalent to \(\phi_1\circ\phi_2^{-1}\) being \(C^k\)-differentiable).

An important point to mention is that a given topological \(n\)-manifold may have multiple different \(C^k\)-atlases. According to the above definition, that would seem to give rise to \(2\) distinct \(C^k\)-manifolds…but intuitively that would be undesirable; they should be viewed as really the same \(C^k\) manifold. This motivates imposing an equivalence relation on \(C^k\)-manifolds by asserting that two \(C^k\)-manifolds are equivalent iff their corresponding \(C^k\)-atlases are compatible…which just means that the union of their \(C^k\)-atlases is itself a \(C^k\)-atlas, or equivalently every chart in one \(C^k\)-atlas is compatible with every chart in the other \(C^k\)-atlas. This is sometimes formalized by introducing a so-called “maximal \(C^k\)-atlas”, but the basic idea should be clear.

Finally, note that topological manifolds are just \(C^0\)-manifolds because homeomorphisms are continuous. In physics, the most common case is to deal with \(C^{\infty}\)-manifolds (also called smooth manifolds), like spacetime in GR, classical configuration/phase spaces in CM, or state space in thermodynamics.

Problem: Let \(S^1\) be the circle, a \(1\)-dimensional smooth manifold which is regarded as being embedded in \(\mathbf R^2\) via the set of points \(x^2+y^2=1\). Explain why \(S^1\) is locally (but not globally) homeomorphic to the real line \(\mathbf R\). Furthermore, explain whether or not the pair \((S^1,\theta)\) (where \(\theta:(\cos\theta,\sin\theta)\in S^1\mapsto\theta\in[0,2\pi)\) is the obvious angular coordinate mapping) is a chart on \(S^1\) or not.

Solution: \(S^1\) is locally homeomorphic to \(\mathbf R\) because, roughly speaking, every little strip of open arc in \(S^1\), upon zooming in, looks like a strip of straight line in \(\mathbf R\). However, \(S^1\) is not globally homeomorphic to \(\mathbf R\) because \(S^1\) is compact whereas \(\mathbf R\) is not. The pair \((S^1,\theta)\) is not a chart on \(S^1\) because its image \([0,2\pi)\) isn’t open in \(\mathbf R\).

Problem: Hence, demonstrate an example of an atlas for \(S^1\).

Solution: To circumvent the problem above, one requires \(2\) charts to cover \(S^1\), thereby forming an atlas for \(S^1\). For example, a chart \(\theta_1:S^1-\{(1,0)\}\to (0,2\pi)\) similar to the above but with the point \((1,0)\) removed, and a second chart \(\theta_2:S^1-\{(-1,0)\}\to (-\pi,\pi)\) with the antipodal point \((-1,0)\) removed. The domains of these \(2\) charts then overlap on the upper and lower open semicircles, so the transition function on this overlap is:

\[\theta_2(\theta_1^{-1}(\theta))=\theta[\theta\in(0,\pi)]+(\theta-2\pi)[\theta\in(\pi,2\pi)]\]

which is indeed \(C^{\infty}\).

Problem: Define the real associative algebra \(C^k(X\to\mathbf R)\).

Solution: This is simply the space of all \(C^k\)-differentiable scalar fields on the \(C^k\)-manifold \(X\). To be precise, \(f\in C^k(X\to\mathbf R)\) iff for all charts \((U,\phi)\) in the \(C^k\)-atlas defining the differential structure of \(X\), the map \(f\circ\phi^{-1}:\phi(U)\to\mathbf R\) is \(C^k\)-differentiable.

Problem: Let \(X\) be a \(C^1\)-differentiable \(n\)-manifold and let \(x\in X\). What is the modern definition of tangent vectors used in differential geometry? How can one reconcile this with one’s prior intuition about tangent vectors in Euclidean space \(\mathbf R^n\)?

Solution: Imagine embedding a line \(\mathbf R\) inside a plane \(\mathbf R^2\). Clearly, the “tangent vectors” to the line, viewed within \(\mathbf R^2\), simply lie along the line itself, so in fact there was no need for the embedding in the first place. Similarly, if one embeds a plane \(\mathbf R^2\) inside \(\mathbf R^3\), the tangent vectors to the plane are confined within that plane. However, this no longer holds if one instead embeds a manifold like \(S^2\) within \(\mathbf R^3\); now the tangent vectors to the sphere “leak” outside \(S^2\) and into the embedding manifold \(\mathbf R^3\).

One would like to be able to define tangent vectors in a way that is intrinsic to whatever manifold \(X\) one is working with (i.e. without needing to reference an external embedding). It turns out one way differential geometers have gone about this (nb. not the only way) is to exploit the duality \(\mathbf v\leftrightarrow\mathbf v\cdot\frac{\partial}{\partial\mathbf x}\) between (tangent) vectors \(\mathbf v\in\mathbf R^n\) and their corresponding directional derivative operators \(\mathbf v\cdot\frac{\partial}{\partial\mathbf x}\) in the Euclidean case \(X=\mathbf R^n\). Since tangent vectors are physically associated with velocity vectors (e.g. \(\mathbf v=\dot{\mathbf x}\) is tangent to a particle’s trajectory \(\mathbf x(t)\)), the bijection \(\mathbf v\mapsto\mathbf v\cdot\frac{\partial}{\partial\mathbf x}\) converts a temporal rate of change to a spatial rate of change. These geometric considerations motivate the algebraic “\(1^{\text{st}}\)-order linear differential operator” definition of tangent vectors; \(v_x:C^{\infty}(X)\to\mathbf R\) is a tangent vector at \(x\in X\) iff it behaves derivative-like in the sense that it’s linear and obeys the product rule (technical term: derivation):

\[v_x(\alpha\phi+\beta\psi)=\alpha v_x(\phi)+\beta v_x(\psi)\]

\[v_x(\phi\psi)=\phi(x)v_x(\psi)+v_x(\phi)\psi(x)\]

for arbitrary scalars \(\alpha,\beta\in\mathbf R\) and smooth scalar fields \(\phi,\psi:X\to\mathbf R\); thus, this definition is \(X\)-intrinsic. To emphasize again, tangent vectors are just \(1^{\text{st}}\)-order linear differential operators (evaluated at some \(x\in X\)). The “linear” part comes from literally requiring linearity. The “first-order” part comes from the product rule; if it were e.g. \(2^{\text{nd}}\)-order, then the product rule would fail due to cross-terms. Finally, it’s worth emphasizing the parallels between this construction and Schwartz’s theory of distributions in which the tangent vector \(v\) plays the role of a distribution while the scalar field \(\phi\) plays the role of a test function; two tangent vectors \(v,v’\) are equal iff \(v(\phi)=v’(\phi)\) when evaluated on an arbitrary test “bump” \(\phi\).

(aside: it was mentioned above that “tangent vector” doesn’t have to be defined algebraically; there exists an equivalent formulation that’s more intuitive: consider trajectories \(x(t)\in X\) slithering across \(X\) which pass through a point \(x_0\in X\) at time \(t=0\). Now look at some Euclidean projection \(\mathbf x(x(t))\in\mathbf R^n\) where \(\mathbf x:U\to\mathbf R^n\) is some chart on a neighbourhood \(U\subset X\) of \(x_0\). One can sort the trajectories \(x(t)\) into equivalence classes based on the velocity vector \(\left(\frac{d}{dt}\right)_{t=0}\mathbf x(x(t))\in\mathbf R^n\) when \(x(t)\) is passing through \(x_0\). These equivalence classes are then taken to be the tangent vectors. Although more intuitive than the algebraic formulation, the drawback is that one has to check everything is indeed independent of the choice of chart \(\mathbf x\)).

Problem: What is the tangent (vector) space \(T_x(X)\) to a smooth \(n\)-manifold \(X\) at a point \(x\in X\)?

Solution: \(T_x(X)\) is simply the set of all tangent vectors at \(x\in X\) (informally, think tangent line or tangent plane, though remember that such conceptions implicitly require an embedding). By endowing it with the obvious notions of vector addition and scalar multiplication:

\[(v_x+v’_x)(\phi):=v_x(\phi)+v’_x(\phi)\]

\[(cv_x)(\phi):=cv_x(\phi)\]

it’s easy to check these operations are closed in \(T_x(X)\), thereby giving it the structure of a real, \(n\)-dimensional vector space (this justifies calling \(v_x\) a tangent vector in the first place). More precisely, to prove that \(\dim T_x(X)=n\) is indeed true for all \(x\in X\), one can construct an explicit coordinate basis for \(T_x(X)\) by pasting a chart \(\mathbf x=(x^0,…,x^{n-1})\) onto some neighbourhood of \(x\) and defining \(n\) basis tangent vectors \(\partial_{\mu}|_x:C^{\infty}(X)\to\mathbf R\) for \(\mu=0,1,…,n-1\) induced by the choice of chart \(\mathbf x\):

\[\partial_{\mu}|_x(\phi):=\frac{\partial\phi\circ\mathbf x^{-1}}{\partial x^{\mu}}\biggr|_{\mathbf x(x)}\]

where the RHS is just the standard partial derivative on \(\mathbf R^n\). It should be emphasized that this defines the meaning of expressions like \(\frac{\partial\phi}{\partial x^{\mu}}(x)\), so it’s not an abuse of notation.

Then, check that:

  1. These are genuinely tangent vectors \(\partial_{\mu}|_x\in T_x(X)\) in that they are linear and obey product rule.
  2. Check that they are linearly independent, i.e. \(\sum_{\mu=0}^{n-1}v_x^{\mu}\partial_{\mu}|_x=0\Rightarrow v_x^{\mu}=0\) (use \(\partial_{\mu}|_x(x^{\nu})=\delta^{\nu}_{\mu}\)).
  3. Check that \(\text{span}_{\mathbf R}\{\partial_{\mu}|_x:\mu=0,…,n-1\}=T_x(X)\) (use \(\partial_{\mu}|_x(x^{\nu})=\delta^{\nu}_{\mu}\) again to show \(v_x=\sum_{\mu=0}^{n-1}v_x(x^{\mu})\partial_{\mu}|_x\)).

Problem: A given tangent vector \(v_x\in T_x(X)\) may be written with contravariant components in \(2\) distinct coordinate bases:

\[v_x=v^{\mu}\partial_{\mu}|_x=v’^{\nu}\partial’_{\nu}|_x\]

simply from picking \(2\) different charts \((U,\phi),(U’,\phi’)\) in the \(C^1\)-atlas of \(X\) containing \(x\in U\cap U’\). Describe how the contravariant components \(v’^{\nu}\) may be obtained from the contravariant components \(v^{\mu}\).

Solution: Act on an arbitrary \(f\in C^1(X\to\mathbf R)\) to obtain:

\[v_x(f)=v^{\mu}\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=v’^{\nu}\frac{\partial (f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

Now insert a “resolution of the identity” \(f\circ\phi’^{-1}\circ\phi’\circ\phi^{-1}\) and because \(X\) is a \(C^1\)-manifold, the transition map \(\phi’\circ\phi^{-1}\) will be differentiable and in particular, by the chain rule:

\[\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}\frac{\partial(f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

where \(\phi'(x)=(x’^0(x),…,x’^{n-1}(x))\). This simple chain rule identity can by itself already be viewed as a passive change of \(T_x(X)\)-basis:

\[\frac{\partial}{\partial x’^{\nu}}\biggr|_{x}=\frac{\partial x^{\mu}}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\frac{\partial}{\partial x^{\mu}}\biggr|_x\]

Or equivalently, as an active change of the contravariant components via the Jacobian matrix:

\[v’^{\nu}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}v^{\mu}\]

Problem: What is a (tangent) vector field \(v\in\mathfrak{X}(X)\) on a smooth manifold \(X\)?

Solution: There are \(2\) equivalent definitions.

  1. (Intuitive Definition): A vector field \(v\) is a smooth assignment of a tangent vector \(v_x\in T_x(X)\) at each point \(x\in X\); formally this means it is a map \(v:X\to TX\) where the tangent bundle of \(X\) is simply the collection of all tangent vectors with a memory of where they are rooted \(TX:=\bigsqcup_{x\in X}T_x(X)\).
  2. (Algebraic Definition) A vector field \(v:C^{\infty}(X)\to C^{\infty}(X)\) is a derivation from scalar fields to scalar fields. Formally, for each \(\phi:X\to\mathbf R\), the scalar field \(v(\phi):X\to\mathbf R\) needs to behave like a directional derivative of \(\phi\) “along” \(v\) in the sense that (again!) it actually behaves like a \(1^{\text{st}}\)-order linear differential operator:
    \[v(\alpha\phi+\beta\psi)=\alpha v(\phi)+\beta v(\psi)\]
    \[v(\phi\psi)=v(\phi)\psi+\phi v(\psi)\]

Problem: Given \(2\) vector fields \(v,v’\in\mathfrak{X}(X)\) on the same smooth manifold \(X\), explain why their composition \(vv’:=v\circ v’\) (and hence also \(v’v:=v’\circ v\)) is not a vector field on \(X\).

Solution: Just think about composing two \(1^{\text{st}}\)-order linear differential operators together. In general, one expects a \(2^{\text{nd}}\)-order linear differential operator. Therefore, while \(vv’\) is still linear, it won’t pass the product rule test for “\(1^{\text{st}}\)-orderness”, hence \(vv’\notin\mathfrak{X}(X)\). One can explicitly compute:

\[vv'(\phi\psi)=vv'(\phi)\psi+\phi vv'(\psi)+v'(\phi)v(\psi)+v(\phi)v'(\psi)\]

to see that the \(2^{\text{nd}}\)-order cross-terms \(v'(\phi)v(\psi)+v(\phi)v'(\psi)\) prevent \(vv’\) from fulfilling the product rule.

Problem: By analyzing the cross terms, explain how one can recover the product rule!

Solution: Notice the cross terms are invariant under the interchange of vector fields \(v\leftrightarrow v’\). Therefore, to cancel them out, one might consider a commutator \([v,v’]:=vv’-v’v\). This now is not only linear but has also recovered its “\(1^{\text{st}}\)-orderness”:

\[[v,v’](\phi\psi)=([v,v’]\phi)\psi+\phi[v,v’](\psi)\]

Thus, \(\mathfrak{X}(X)\) had a real Lie algebra structure after all. It may still feel a bit strange that the commutator somehow manages to recover “\(1^{\text{st}}\)-orderness”. The clearest way to see this is to work in a suitable chart \(x^{\mu}\), expanding the vector fields \(v=v^{\mu}\partial_{\mu}\) and \(v’=v’^{\nu}\partial_{\nu}\) which leads to the commutator \([v,v’]=[v,v’]^{\mu}\partial_{\mu}\) with:

\[[v,v’]^{\mu}=v^{\nu}\partial_{\nu}v’^{\mu}-v’^{\nu}\partial_{\nu}v^{\mu}\]

In particular, it’s clear that \([v,v’]\) is \(1^{\text{st}}\)-order because it’s just a linear combination of \(1^{\text{st}}\)-order partial differential operators \(\partial_{\mu}\) weighted by scalar components \([v,v’]^{\mu}\). It would not have been possible to write, for instance, \(vv’\neq (vv’)^{\mu}\partial_{\mu}\) because it’s not a \(1^{\text{st}}\)-order differential operator.

Problem: Any vector field \(v\in\mathfrak{X}(X)\) on a smooth manifold \(X\) induces a corresponding flow \(v_t\) on \(X\). Explain this generation process \(v\rightarrow v_t\).

Solution: Classically, if one had a steady fluid velocity field \(\mathbf v(\mathbf x)\) on \(\mathbf x\in\mathbf R^n\), the streamlines/pathlines/streaklines coincide and are given by solving the system of \(1^{\text{st}}\)-order ODEs:

\[\frac{d\mathbf x(t)}{dt}=\mathbf v(\mathbf x(t))\]

By analogy, a flow \(v_t:X\to X\) generated by a vector field \(v\in\mathfrak{X}(X)\) is defined by requiring:

\[\frac{d\phi(v_t(x_0))}{dt}=v(\phi)(v_t(x_0))\]

for all test scalar fields \(\phi\in C^{\infty}(X)\) and initial conditions \(x_0\in X\). Just as the quantum mechanical time-evolution operator \(U_t\) causes an initial state \(|\psi(t=0)\rangle\in\mathcal H\) to flow to another state \(|\psi(t)\rangle=U_t|\psi(t=0)\rangle\in\mathcal H\), one can think of the flow \(v_t\) as a one-parameter family of diffeomorphisms (\(v_{t+t’}=v_t\circ v_{t’}\Rightarrow v_{t=0}=1,v^{-1}_t=v_{-t}\)) taking an initial point \(x_0\in X\) and time translating it to a new point \(v_t(x_0)\in X\).

A corollary of this is that \(v(\phi)(x_0)=\frac{d\phi(v_t(x_0))}{dt}\biggr|_{t=0}\), so one has the useful Maclaurin expansion about \(t=0\):

\[\phi(v_t(x_0))=\phi(x_0)+tv(\phi)(x_0)+O_{t\to 0}(t^2)\]

Problem: For the case \(X=\mathbf R\), let the vector field \(v:=x^2\frac{d}{dx}\). Compute the corresponding flow \(v_t(x_0)\) for an arbitrary initial condition \(x_0\in\mathbf R\), and comment on its behavior.

Solution: The flow is governed by the ODE \(\dot x=x^2\) which is solved by \(v_t(x_0)=x_0/(1-x_0t)\). But this flow is undefined at \(t=1/x_0\), so \(v\) is considered an incomplete vector field. In this case, the reason can be traced to the non-compact nature of the manifold \(X=\mathbf R\).

Problem: Let \(T\) be a tensor field on some smooth manifold \(X\). Suppose one would like to “transport” \(T\) from \(X\) onto some diffeomorphic manifold \(X’\cong X\) (this assumption is essential! Without it a lot of what is about to be said fails). Explain the \(2\) mechanisms whereby this transport can be achieved. Furthermore, explain how, despite the fact that these \(2\) transport methods always exist, depending on the type of \(T\), one method may be more “natural” than another.

Solution: The key is to clearly identify which direction (in this case \(X\to X’\)) one would like to transport \(T\). With that reference direction in mind:

  1. (Pushforward) Since \(X’\cong X\), there must exist an explicit diffeomorphism \(\cong\) between them. If the diffeomorphism is aligned in the same direction as one’s desired transport direction (i.e. \(\cong:X\to X’\)), then one can use this diffeomorphism to pushforward the tensor \(T\mapsto\cong_*T\) from \(X\to X’\).
  2. (Pullback) If instead one’s desired direction of transport (\(X\to X’\)) goes against the direction of the diffeomorphism (i.e. \(\cong:X’\to X\)), then one can still transport \(T\) from \(X\) to \(X’\) by using \(\cong\) to pullback \(T\mapsto\cong^*T\).

    Suppose \(T=\phi\) is a scalar field on \(X\), and suppose one would like to transport \(\phi\mapsto\phi’\) on \(X’\). The natural way to do this is to insist that \(\phi’(x’)=\phi(x)\). The question is whether one should take \(x’=\cong(x)\) (i.e. using a pushforward) or \(x=\cong(x’)\) (i.e. using a pullback). Since \(\cong\) is a diffeomorphism, and thus a bijection, both of these choices work, but if \(\cong^{-1}\) did not exist, then the pushforward \(\phi’=\phi\circ\cong^{-1}\) would also not exist. Thus, the pullback is more natural because it doesn’t actually rely on \(\cong\) being a diffeomorphism: any map \(\cong:X’\to X\) suffices to pull a scalar field back via \(\phi’=\phi\circ\cong\).

By contrast, for \(T=v\) a vector field on \(X\), one naturally demands \(v’\in\mathfrak{X}(X’)\) to obey \(v'(\phi’)(x’)=v(\phi)(x)\). Again, because \(\cong\) is assumed to be a diffeomorphism, \(v\) can be transported from \(X\to X’\) via either a pushforward or a pullback, but unlike for scalar fields, here it turns out (why?) to be more natural to use a pushforward \(\cong:X\to X’\) so that \(v’=\cong_*v\), \(\phi=\cong^*\phi’\), and \(x’=\cong(x)\).

If \(x^{\mu}\) is some chart on \(X\) with respect to which \(v=v^{\mu}\partial_{\mu}\), and \(x’^{\mu}\) is some chart on \(X’\) (note that generically \(x^{\mu}\neq\cong^*x’^{\mu}\)), then the components of the pushforward are given by:

\[(\cong_*v)^{\mu}(x’)=\frac{\partial x’^{\mu}}{\partial x^{\nu}}v^{\nu}(x)\]

Problem: Let \(X\) be a smooth manifold, let \(\phi\in C^{\infty}(X)\) be a scalar field on \(X\), and let \(v\in\mathfrak{X}(X)\) be a vector field flowing on \(X\). Explain how the scalar field \(\mathcal L_v(\phi)\in C^{\infty}(X)\) is defined (this is called the Lie derivative of \(\phi\) “along” \(v\)), and how it is calculated.

Solution: Recall that for a scalar field \(\phi(\mathbf x)\) on \(\mathbf x\in\mathbf R^n\), the directional derivative of \(\phi\) along a velocity vector \(\mathbf v\in\mathbf R^n\) is defined by the limit:

\[\lim_{t\to 0}\frac{\phi(\mathbf x+t\mathbf v)-\phi(\mathbf x)}{t}=\left(\frac{d\phi(\mathbf x+t\mathbf v)}{dt}\right)_{t=0}=\mathbf v\cdot\frac{\partial\phi}{\partial\mathbf x}\]

By analogy, one defines:

\[\mathcal L_v\phi(x):=\lim_{t\to 0}\frac{v_t^*\phi(x)-\phi(x)}{t}=\left(\frac{d\phi(v_t(x))}{dt}\right)_{t=0}=v(\phi)(x)\]

Thus, \(\mathcal L_v\phi=v(\phi)\), or more abstractly (remembering this is the Lie derivative \(\mathcal L_v:C^{\infty}(X)\to C^{\infty}(X)\) on scalar fields) \(\mathcal L_v=v\).

Problem: Repeat the above for the action of the Lie derivative \(\mathcal L_v\) on another vector field \(v’\in\mathfrak{X}(X)\) to produce the vector field \(\mathcal L_v(v’)\in\mathfrak{X}(X)\).

Solution: The basic idea is that tangent vectors living in distinct tangent spaces cannot be subtracted. So one has to pushforward the future tangent vector “back in time”:

\[\mathcal L_vv’:=\lim_{t\to 0}\frac{(v^{-1}_t)_*v’-v’}{t}\]

One way to proceed is to define an auxiliary scalar field \(f(t,t’):=v'(\phi\circ v_{-t})(v_{t’}(x))\) in which case one can use the chain rule to compute \(\mathcal L_vv'(\phi)(x)=\left(\frac{df(t,t)}{dt}\right)_{t=0}=\left(\frac{\partial f(t,0)}{\partial t}\right)_{t=0}+\left(\frac{\partial f(0,t’)}{\partial t’}\right)_{t’=0}\). However, here it will be fun to directly Maclaurin expand the numerator of the limit. Recall from earlier the fundamental property of infinitesimal flows \(\text{test scalar field}(v_t(x))\approx\text{test scalar field}(x)+tv(\text{test scalar field})(x)\). First, apply it to \(\text{test scalar field}=v'(\phi\circ v_{-t})\):

\[v'(\phi\circ v_{-t})(v_t(x))\approx v'(\phi\circ v_{-t})(x)+tv(v'(\phi\circ v_{-t}))(x)\]

Then apply it again for \(\text{test scalar field}=\phi\) and replace \(t\mapsto -t\):

\[\phi\circ v_{-t}\approx \phi-tv(\phi)\]

Distributing everything up to \(O(t)\), using linearity of vector fields, the limit reduces to \(\mathcal L_vv’=[v,v’]\). More abstractly, the Lie derivative is just the Lie bracket (hence the name!) on \(\mathfrak{X}(X)\), i.e. \(\mathcal L_v=[v,\cdot\,]\). Indeed, one can check that it is a Lie algebra representation because of the homomorphism property \(\mathcal L_{[v,v’]}=[\mathcal L_v,\mathcal L_{v’}]\) thanks to the Jacobi identity.

Problem: Define a covector at some point \(x\in X\) on a \(C^1\)-manifold \(X\). Hence, define a \(1\)-form on \(X\). Clearly emphasize the difference between a covector and a \(1\)-form.

Solution: At \(x\in X\), one of course has the tangent space \(T_x(X)\). Since \(T_x(X)\) is a vector space, it has an associated dual vector space \(T^*_x(X)\) which in this context is called the cotangent space at \(x\in X\) (cf. \(\tan\) vs. \(\cot\)). As the linear functionals \(v_x\in T_x(X)\) are called tangent vectors at \(x\in X\), so it makes sense that the linear functionals \(A_x\in T^*_x(X)\) are called cotangent vectors at \(x\in X\), or covectors for short.

Just as a (tangent) vector field \(v\) assigns a tangent vector \(v_x\in T_x(X)\) to each point \(x\in X\) across the manifold \(X\), a (cotangent) vector field \(A\) assigns a cotangent vector \(A_x\in T_x^*(X)\) to each point \(x\in X\). This “covector field” is called a differential \(1\)-form, or \(1\)-form for short.

Problem: Show how, by applying the exterior derivative \(d\) to any scalar field \(\phi\in C^{\infty}(X)\cong\Omega^0(X)\) (also called a differential \(0\)-form), the resulting object \(d\phi\in\Omega^1(X)\) is a \(1\)-form.

Solution: This is because \(d\phi\) is defined by its action on an arbitrary vector field \(v\) as:

\[d\phi(v):=v(\phi)=\mathcal L_v(\phi)\]

Problem: Explain why, when the manifold \(X\) is covered by a coordinate chart \((x^0,…,x^{n-1})\), any exact differential \(1\)-form is given by the familiar chain rule:

\[d\phi=\frac{\partial\phi}{\partial x^{\mu}}dx^{\mu}\]

Solution: The first thing is to unpack the meaning of the differentials \(dx^{\mu}\). This amounts to substituting \(\phi=x^{\mu}\) for the exterior derivative of a \(0\)-form, so for an arbitrary vector field \(v\), one has:

\[dx^{\mu}(v):=v(x^{\mu})\]

In particular, if \(v=\partial_{\nu}\) for some \(\nu\), then \(\partial x^{\mu}/\partial x^{\nu}=\delta^{\mu}_{\nu}\), so \(dx^{\mu}\) is the dual coordinate basis for \(\Omega^1(X)\) with respect to the coordinate basis \(\partial_{\mu}\) of \(\mathfrak{X}(X)\). The components of \(d\phi\) in the \(dx^{\mu}\) basis are thus as claimed:

\[d\phi=d\phi(\partial_{\mu})dx^{\mu}=\partial_{\mu}\phi dx^{\mu}\]

Problem: Let \(X\) be a \(C^1\)-manifold and let \(x^{\mu}\) and \(x’^{\mu}\) be two coordinate charts for \(X\). Compare how the vector field basis, the one-form basis, and the components of vectors and one-forms in their respective bases transform between these \(2\) coordinate charts.

Solution: Let \(v=v^{\mu}\partial_{\mu}=v’^{\mu}\partial’_{\mu}\) be a vector field and \(A=A_{\mu}dx^{\mu}=A’_{\mu}dx’^{\mu}\) be a \(1\)-form. In what follows, the key is to always remember that the underlying objects \(v, A\) are chart-invariant, so if the basis transforms under one particular Jacobian, then the components must transform under the inverse Jacobian.

One can start with the vector field basis \(\partial’_{\mu}=\partial’_{\mu}x^{\nu}\partial_{\nu}\) which is just the indisputable chain rule. With that as an anchoring point, one immediately concludes \(v’^{\mu}=\partial_{\nu}x’^{\mu}v^{\nu}\). Then, by enforcing that \(dx’^{\mu}(\partial’_{\nu})=\delta^{\mu}_{\nu}\) remains biorthogonal, this leads to the intuitive chain rule requirement \(dx’^{\mu}=\partial_{\nu}x’^{\mu}dx^{\nu}\), and hence \(A’_{\mu}=\partial’_{\mu}x^{\nu}A_{\nu}\).

Problem: Let \(X\) and \(X’\) be smooth, diffeomorphic manifolds. Earlier it was seen that scalar fields are naturally transported via pullback whereas vector fields are naturally transported via pushforward. What about for \(1\)-forms? Hence, define the Lie derivative \(\mathcal L_vA\) of a \(1\)-form \(A\) with respect to a vector field \(v\).

Solution: Just like scalar fields, \(1\)-forms naturally pullback (can remember this because they both map to \(\mathbf R\)). Specifically, if \(A\in\Omega^1(X)\) currently lives on \(X\), and one has a diffeomorphism \(\cong:X’\to X\), then:

\[\cong^*A(v):=A(\cong_*v)\]

Or in components (using the earlier pushforward formula, which here reads \((\cong_*v)^{\nu}(x)=\frac{\partial x^{\nu}}{\partial x’^{\mu}}v^{\mu}(x’)\) since the diffeomorphism now points \(\cong:X’\to X\), with \(x=\cong(x’)\)):

\[(\cong^*A)_{\mu}(x’)=\frac{\partial x^{\nu}}{\partial x’^{\mu}}A_{\nu}(x)\]

The Lie derivative is then given by:

\[\mathcal L_{v}A:=\lim_{t\to 0}\frac{v_t^*A-A}{t}\]

It turns out (how?) one can show that \(\mathcal L_v A=(\mathcal L_v A)_{\mu}dx^{\mu}\) where the components are:

\[(\mathcal L_v A)_{\mu}=v^{\nu}\partial_{\nu}A_{\mu}+A_{\nu}\partial_{\mu}v^{\nu}\]
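
One route, sketched in a chart: the flow acts on coordinates as \(x^{\mu}\mapsto x^{\mu}+tv^{\mu}+O_{t\to 0}(t^2)\), so the pullback of \(A\) by \(v_t\) has components

\[(v_t^*A)_{\mu}(x)=\frac{\partial(x^{\nu}+tv^{\nu})}{\partial x^{\mu}}A_{\nu}(x^{\lambda}+tv^{\lambda})=A_{\mu}(x)+t\left(v^{\nu}\partial_{\nu}A_{\mu}+A_{\nu}\partial_{\mu}v^{\nu}\right)(x)+O_{t\to 0}(t^2)\]

and subtracting \(A_{\mu}(x)\), dividing by \(t\), and taking \(t\to 0\) yields the claimed components.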

Problem: Let \(X\) be a smooth \(n\)-manifold, and let \(x\in X\). Define a tensor \(T_x\) of type \((k^*,k)\) (i.e. rank \(k^*+k\)) at \(x\). Define the components of the tensor \(T_x\) with respect to a suitable ordered basis. Give an example of a tensor field defined over any manifold \(X\).

Solution: Just as vector fields are smooth assignments of tangent vectors across \(X\), just as \(1\)-forms are smooth assignments of covectors across \(X\), tensor fields are smooth assignments of tensors across \(X\). In particular, it only makes sense to speak of a tensor \(T_x\) at a specific point \(x\in X\), and to view the object \(T\) itself as a tensor field. With that in mind, a tensor at \(x\in X\) of type \((k^*,k)\in\mathbf N^2\) is defined (at least in differential geometry) to be any multilinear map \(T_x:T^*_x(X)^{k^*}\times T_x(X)^k\to\mathbf R\) that eats in \(k^*\) covectors in the cotangent space \(T^*_x(X)\) and \(k\) tangent vectors in the tangent space \(T_x(X)\) and spits out a scalar. The corresponding tensor field can be viewed as a map \(T:\Omega^1(X)^{k^*}\times\mathfrak{X}(X)^k\to C^{\infty}(X)\). Given an ordered (possibly non-coordinate) basis \(\partial_{\mu}\) for \(\mathfrak{X}(X)\) with corresponding dual ordered (again! possibly non-coordinate) basis \(dx^{\mu}\) for \(\Omega^1(X)\), the \(n^{k^*+k}\) components of \(T\) are defined by:

\[T^{\mu_1,…,\mu_{k^*}}_{\nu_1,…,\nu_{k}}:=T(dx^{\mu_1},…,dx^{\mu_{k^*}},\partial_{\nu_1},…,\partial_{\nu_{k}})\]

Thus, covectors are type \((0,1)\) tensors while tangent vectors are type \((1,0)\) tensors. Any manifold \(X\) is equipped with a type \((1,1)\) tensor field \(\delta:\Omega^1(X)\times\mathfrak{X}(X)\to C^{\infty}(X)\) defined by:

\[\delta(A,v):=A(v)\Leftrightarrow\delta(dx^{\mu},\partial_{\nu})=\delta^{\mu}_{\nu}\]

Problem: (tensor component transformation)

Solution:

Problem: Let \(T,T’\) be respectively type \((k^*,k)\) and \((k’^*,k’)\) tensor fields defined over the same smooth \(n\)-manifold \(X\). Define the \((k^*+k’^*,k+k’)\) tensor field given by their tensor product \(T\otimes T’:\Omega^1(X)^{k^*+k’^*}\times\mathfrak{X}(X)^{k+k’}\to C^{\infty}(X)\).

Solution: It’s pretty much the obvious thing one can write down that sums the tensor ranks:

\[T\otimes T'(A_1,…,A_{k^*},A’_1,…,A’_{k’^*},v_1,…,v_k,v’_1,…,v’_{k’})\]

\[:=T(A_1,…,A_{k^*},v_1,…,v_k)T'(A’_1,…,A’_{k’^*},v’_1,…,v’_{k’})\]

Or, with respect to an ordered (possibly non-coordinate!) basis \(\{\partial_{\mu}\}\) of \(\mathfrak{X}(X)\):

\[(T\otimes T’)^{\mu_1,…,\mu_{k^*},\mu’_1,…,\mu’_{k’^*}}_{\nu_1,…,\nu_k,\nu’_1,…,\nu’_{k’}}=T^{\mu_1,…,\mu_{k^*}}_{\nu_1,…,\nu_k}T’^{\mu’_1,…,\mu’_{k’^*}}_{\nu’_1,…,\nu’_{k’}}\]

Problem: Show how to compute the pushforward \(\cong_*T\) of a tensor field \(T\), and hence show how to take the Lie derivative \(\mathcal L_{\partial}T\) of a tensor field \(T\) along a vector field \(\partial\).

Solution:

Problem: Define a differential \(k\)-form.

Solution: Differential \(k\)-forms are simply totally antisymmetric type \((0,k)\) tensor fields. At each point \(x\in X\), one can think of them as measuring devices that eat in \(k\) tangent vectors in \(T_x(X)\) and spit out a number. More globally, \(\omega\in\Omega^k(X)\) is said to be a differential \(k\)-form iff \(\omega:\mathfrak{X}(X)^k\to C^{\infty}(X)\) takes in \(k\) vector fields and spits out a scalar field, multilinearly and antisymmetrically.

Problem: Given a differential \(k\)-form \(\omega\in\Omega^k(X)\) and a differential \(k’\)-form \(\omega’\in\Omega^{k’}(X)\), define the differential \(k+k’\)-form given by their wedge product \(\omega\wedge\omega’\in\Omega^{k+k’}(X)\).

Solution:

Problem: Define the exterior derivative \(d:\Omega^k(X)\to\Omega^{k+1}(X)\) of a differential \(k\)-form, showing how it generalizes the earlier definition given for \(k=0\)-forms.

Solution: Recall that the wedge product \(\wedge\) of a differential \(k’\)-form and a differential \(k\)-form is a differential \(k’+k\)-form. In particular, the exterior derivative can loosely be viewed as a special case of the wedge product with \(k’=1\) and \(d=\partial_{\mu}dx^{\mu}\wedge\); for instance \(d\phi=\partial_{\mu}\phi\,dx^{\mu}\), and for a \(1\)-form \(A=A_{\mu}dx^{\mu}\) the exterior derivative \(F:=dA\) is given by:

\[F=\partial_{\mu}dx^{\mu}\wedge A_{\nu}dx^{\nu}=\partial_{\mu}A_{\nu}dx^{\mu}\wedge dx^{\nu}=\frac{1}{2}F_{\mu\nu}dx^{\mu}\wedge dx^{\nu}\]

with \(F_{\mu\nu}:=\partial_{\mu}A_{\nu}-\partial_{\nu}A_{\mu}\).

Problem: Prove the following properties of the wedge product \(\wedge\) and its interaction with the exterior derivative \(d\) (one could also ask about its interaction with the pullback, pushforward, tensor product, interior product, etc.):

  • (Graded commutativity; in particular odd-degree forms square to zero) \[\omega\wedge\omega’=(-1)^{kk’}\omega’\wedge\omega\]
  • (Graded product rule with respect to \(d\)) \[d(\omega\wedge\omega’)=(d\omega)\wedge\omega’+(-1)^k\omega\wedge d\omega’\]

Problem: Prove Cartan’s magic formula:

\[\mathcal L_{\partial}=\{d,\iota_{\partial}\}\]

Posted in Blog | Leave a comment

Support Vector Machines

Problem: Explain how a hard-margin support vector machine would perform binary classification.

Solution: Conceptually, it’s simple. Given a training set of \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) each associated with some binary target label \(y_1,…,y_N\in\{-1,1\}\) (notice the \(2\) binary classes are \(y=-1\) and \(y=1\) rather than the more conventional \(y=0\) and \(y=1\); this is simply a matter of convenience), the goal of an SVM is to compute the unique hyperplane in \(\mathbf R^n\) that will separate the \(y=-1\) and \(y=1\) classes as well as possible. For \(n=2\) this can be visualized as trying to find the “widest possible street” between \(2\) neighborhoods:

Mathematically, if the hyperplane decision boundary \(\mathbf w\cdot\mathbf x+b=0\) is defined by a normal vector \(\mathbf w\in\mathbf R^n\) such that the SVM classifies points with \(\mathbf w\cdot\mathbf x+b\geq 0\) as \(y=1\) and \(\mathbf w\cdot\mathbf x+b\leq 0\) as \(y=-1\) (more compactly, \(\hat y_{\text{SVM}}(\mathbf x|\mathbf w,b)=\text{sgn}(\mathbf w\cdot\mathbf x+b)\)), and the overall normalization of \(\mathbf w\) and \(b\) is fixed by stipulating that the “gutters” of the street are defined by the hyperplanes \(\mathbf w\cdot\mathbf x+b=\pm 1\) passing through the support (feature) vectors (drawn as bigger red/green dots in the diagram), then the “width of the street” is the distance between these \(2\) “gutter hyperplanes”, namely \(2/|\mathbf w|\). Since one would like to maximize this margin \(2/|\mathbf w|\), this is equivalent to minimizing the quadratic \(|\mathbf w|^2/2\). Thus, one can formulate this “hard-margin SVM” algorithm as the programming problem:

\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R:\forall i=1,…,N,y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1}\frac{|\mathbf w|^2}{2}\]

where the \(N\) constraints \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1\) (or vectorially \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1\)) are just saying that all \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\) in the training set should lie on the correct side of the decision boundary hyperplane \(\mathbf w\cdot\mathbf x+b=0\).

Problem: What is the glaring problem with a hard-margin SVM? How does a soft-margin SVM help to address this issue?

Solution: The basic problem is that the hard-margin SVM is way too sensitive to outliers, e.g. a single \(y=-1\) feature vector amidst the \(y=1\) neighborhood would cause the hard-margin SVM to be unable to find a solution (because indeed there would be no hyperplane that perfectly separates all the training data into their correct classes). Intuitively, it seems like in such a case the best way to deal with this bias-variance tradeoff is to shrug and ignore the outliers:

Mathematically, the usual way to encode this is by introducing \(N\) “slack variables” \(\xi_1,…,\xi_N\geq 0\) such that the earlier \(N\) constraints are relaxed to \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1-\xi_i\). However, to avoid cutting too much slack (i.e. incurring too much misclassification on the training set), their use should be penalized by the \(\ell^1\)-norm \(|\boldsymbol{\xi}|_1:=\boldsymbol{\xi}\cdot\mathbf{1}=\sum_{i=1}^N\xi_i\). To this effect, the programming problem is modified to that of soft-margin SVM:

\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R,\boldsymbol{\xi}\in\mathbf R^N:\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi},\boldsymbol{\xi}\geq\mathbf 0}\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1\]

where \(C>0\) is a “strictness” hyperparameter that must be selected ahead of time: a large \(C\) penalizes slack heavily (approaching the hard-margin limit), while a small \(C\) tolerates more margin violations in exchange for a wider street.
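
As a minimal illustrative sketch, one can fit such a soft-margin linear SVM on some made-up \(2\)D data with scikit-learn; the C argument of SVC plays exactly the role of the strictness hyperparameter above (a very large C approximates the hard-margin limit). All data and variable names below are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly-separable-ish data: two Gaussian "neighborhoods" in R^2.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])                  # rows are the feature vectors x_i
y = np.array([1] * 50 + [-1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)    # soft-margin linear SVM
print(svm.coef_, svm.intercept_)               # the learned w and b
print(svm.support_)                            # indices of the support vectors
```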

Problem: The above represents the primal formulation of the programming problem in terms of so-called primal variables \(\mathbf w,b,\boldsymbol{\xi}\). Show how the KKT conditions enable one to obtain a corresponding dual formulation of the programming problem in terms of a dual variable \(\boldsymbol{\lambda}\):

\[\text{argmax}_{\boldsymbol{\lambda}\in\mathbf R^N:\boldsymbol{\lambda}\cdot\mathbf y=0,\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1}|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)^TX^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

Solution: The idea is to combine the \(2\) constraints \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi}\) and \(\boldsymbol{\xi}\geq\mathbf 0\) using \(2\) KKT multipliers \(\boldsymbol{\lambda},\boldsymbol{\mu}\) into the Lagrangian:

\[L(\mathbf w,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})=\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1-\boldsymbol{\lambda}\cdot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})-\boldsymbol{\mu}\cdot\boldsymbol{\xi}\]

The KKT necessary conditions assert:

\[\frac{\partial L}{\partial\mathbf w}=\mathbf 0\Rightarrow\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\]

\[\frac{\partial L}{\partial b}=0\Rightarrow\boldsymbol{\lambda}\cdot\mathbf y=0\]

\[\frac{\partial L}{\partial\boldsymbol{\xi}}=\mathbf 0\Rightarrow C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\]

in addition to the \(2\) original constraints on the primal variables (primal feasibility), and dual feasibility \(\boldsymbol{\lambda},\boldsymbol{\mu}\geq\mathbf 0\) on both dual variables (as a corollary, \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)), and complementary slackness \(\boldsymbol{\lambda}\odot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})=\boldsymbol{\mu}\odot\boldsymbol{\xi}=\mathbf 0\). The idea of the support vectors is clearly seen in the \(1^{\text{st}}\) of these complementary slackness conditions.

By substituting the stationary conditions found back into the Lagrangian, one can eliminate all the primal variables (and even eliminate one of the dual variables \(\boldsymbol{\mu}\)) to obtain the claim:

\[L(\boldsymbol{\lambda})=|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)^TX^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

which is a standard quadratic programming exercise with efficient solutions for \(\boldsymbol{\lambda}\).

(Strictly speaking, one should also mention strong duality here: the primal is a convex quadratic program with affine constraints, so Slater’s condition holds, and this is what justifies interchanging the primal \(\min\) and the dual \(\max\).)

Problem: Although soft-margin SVM is a significant improvement over hard-margin SVM, it too has a glaring issue. What is that issue and how can it be addressed?

Solution: Whether hard-margin or soft-margin, SVMs are ultimately linear binary classifiers; they can only directly learn a hyperplane decision boundary. Therefore, if the data is simply linearly inseparable, then even soft-margin SVM would be unwise to apply directly.

However, there is a creative solution; just take the \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) and apply a nonlinear transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) from the current feature space \(\mathbf R^n\) to some higher-dimensional feature space \(\mathbf R^{>n}\) (essentially a kind of feature engineering), thereby obtaining \(N\) transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\). Ideally, if the nonlinear transformation \(\prime\) is well-chosen, then the transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\) will become linearly separable in the higher-dimensional feature space \(\mathbf R^{>n}\), in which case the usual soft-margin SVM may be applied in \(\mathbf R^{>n}\). Then, pulling the SVM (soft) maximal-margin hyperplane in \(\mathbf R^{>n}\) back to the original feature space \(\mathbf R^n\) (i.e. taking its preimage under \(\prime\)), one obtains some nonlinear decision boundary in \(\mathbf R^n\), thereby effectively extending the applicability of the SVM method to linearly inseparable data! See this YouTube video for an example.

Problem: Continuing off the previous problem, suppose one is already in the transformed space with feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\), and one has solved the dual programming problem in this space to obtain some optimal \(\boldsymbol{\lambda}\in\mathbf R^N\). How is binary classification thus performed?

Solution: Based on the previous analysis, the SVM binary classifier in this transformed feature space \(\mathbf R^{>n}\) may be written:

\[\hat y_{\text{SVM}}(\mathbf x’)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_i\mathbf x’\cdot\mathbf x’_i+y_S-\sum_{i=1}^N\lambda_iy_i\mathbf x’_S\cdot\mathbf x’_i\right)\]

where \(\mathbf x’_S\) is any support feature vector. Note by complementary slackness that most \(\lambda_i\) vanish (these correspond to the non-support feature vectors \(\mathbf x_i\)), so the sums \(\sum_{i=1}^N\) may also be written as sums over only the support vectors.

Problem: Explain how the expression above can be simplified by means of the kernel trick.

Solution: Basically, the feature space transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) is a complicated mapping of vectors to vectors. By contrast, only their scalar dot products \(\mathbf x’\cdot\mathbf x’_i\) appear in the SVM classifier \(\hat y_{\text{SVM}}(\mathbf x)\), and scalars are simpler than vectors. The kernel trick essentially invites one to forget about the details of the transformation \(\prime\) and just directly specify an analytical expression for the dot product in the transformed feature space \(K(\mathbf x,\tilde{\mathbf x}):=\mathbf x’\cdot\tilde{\mathbf x}’\), where here \(K:\mathbf R^n\times\mathbf R^n\to\mathbf R\) is called the kernel function. Effectively, this means one never has to actually visit the transformed feature space \(\mathbf R^{>n}\) since the kernel is perfectly happy to just take the feature vectors in the original space \(\mathbf R^n\). Thus, one can write:

\[\hat y_{\text{SVM}}(\mathbf x)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_iK(\mathbf x,\mathbf x_i)+y_S-\sum_{i=1}^N\lambda_iy_iK(\mathbf x_S,\mathbf x_i)\right)\]
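
As a rough sketch, the classifier formula above can be written directly in a few lines of numpy, assuming one already has the solved dual variables \(\boldsymbol{\lambda}\) (e.g. from some QP solver), the labels, the training feature vectors, a kernel function, and the index of some on-margin support vector; all names below are hypothetical placeholders:

```python
import numpy as np

def svm_predict(x, lambdas, y, X_train, kernel, S):
    """Evaluate sgn(sum_i lambda_i y_i K(x, x_i) + b) with b fixed by support vector S."""
    k_x = np.array([kernel(x, x_i) for x_i in X_train])           # K(x, x_i)
    k_S = np.array([kernel(X_train[S], x_i) for x_i in X_train])  # K(x_S, x_i)
    b = y[S] - np.sum(lambdas * y * k_S)                          # bias from the support vector
    return np.sign(np.sum(lambdas * y * k_x) + b)

linear_kernel = lambda a, b: a @ b   # the "no transformation" special case K(x, x~) = x . x~
```

Here the sums run over all \(N\) training points for simplicity; in practice one would restrict them to the support vectors, since the other \(\lambda_i\) vanish.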

Problem: What is the Gaussian (radial basis function) kernel \(K_{\sigma}(\mathbf x,\tilde{\mathbf x})\)? Sketch why it’s a valid kernel function.

Solution: This is by far the most popular kernel for nonlinear SVM classification, involving a hyperparameter \(\sigma\):

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x-\tilde{\mathbf x}|^2/2\sigma^2}\]

algebraically, it is clearly equivalent to:

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}e^{\mathbf x\cdot\tilde{\mathbf x}/\sigma^2}\]

Taylor expanding the last factor:

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}\sum_{d=0}^{\infty}\frac{(\mathbf x\cdot\tilde{\mathbf x})^d}{\sigma^{2d}d!}\]

so it’s clearly a (countably-infinite-dimensional!) inner product because of the presence of the degree-\(d\) polynomial kernels \(K_d(\mathbf x,\tilde{\mathbf x})=(\mathbf x\cdot\tilde{\mathbf x})^d\).
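
A quick numerical sanity check of this factorization/Taylor expansion (with arbitrary made-up vectors and \(\sigma\)):

```python
import numpy as np
from math import factorial

x, x_tilde, sigma = np.array([0.3, -0.7]), np.array([1.1, 0.4]), 1.5

lhs = np.exp(-np.sum((x - x_tilde) ** 2) / (2 * sigma**2))        # Gaussian kernel directly
rhs = np.exp(-x @ x / (2 * sigma**2)) * np.exp(-x_tilde @ x_tilde / (2 * sigma**2)) \
      * sum((x @ x_tilde) ** d / (sigma ** (2 * d) * factorial(d)) for d in range(30))
print(lhs, rhs)                                                   # agree to machine precision
```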

Problem: What were the main problems with SVMs that spurred the deep learning renaissance via the development of artificial neural networks?

Solution: The main issue with SVMs is their lack of scalability. Roughly speaking, the computational cost of training an SVM on a dataset of \(N\) training examples scales like \(O(N^2)\) in the best case (essentially because the Gram/kernel matrix \(X^TX\) is an \(N\times N\) matrix of all pairwise interactions between training examples); even with modern tricks like sequential minimal optimization (SMO), this quadratic bottleneck is hard to avoid. Another, smaller issue is that the kernel (i.e. the feature map) must be hand-designed ahead of time rather than learned from the data, in contrast to deep neural networks which learn their feature representations directly.

Problem: Write a short Python program using the scikit-learn library to train a support vector machine on the Kaggle Titanic competition dataset.

Solution:

titanic
Posted in Blog | Leave a comment

Numerical Computation

Problem: In numerical computation, what are the \(2\) main kinds of rounding error?

Solution: Overflow error (\(N\approx\infty\)) and, perhaps even more dangerous, underflow error (\(\varepsilon\approx 0\)); the two are in some sense inverses of each other:

\[0=\frac{1}{\infty}\]

Problem: Explain how rounding errors can affect evaluation of the softmax activation function and strategies to address it.

Solution: Given a vector \(\mathbf x\in\mathbf R^n\), the corresponding softmax probability \(\ell^1\)-unit vector is \(e^{\mathbf x}/|e^{\mathbf x}|_1\in S^{n-1}\). If one of the components of \(\mathbf x\) is very negative, then the corresponding component of \(e^{\mathbf x}\) can underflow, whereas if it is instead very positive, then overflow becomes a possibility.

To address these numerical stability issues, the trick is to notice that softmax is invariant under “diagonal” translations along \(\mathbf 1\in\mathbf R^n\), i.e. \(\mathbf x\mapsto\mathbf x-\lambda\mathbf 1\) leads to \(e^{\mathbf x-\lambda\mathbf 1}=e^{\mathbf x}\odot e^{-\lambda\mathbf 1}=e^{-\lambda}e^{\mathbf x}\) and hence the \(\ell^1\)-norm \(|e^{\mathbf x-\lambda\mathbf 1}|_1=e^{-\lambda}|e^{\mathbf x}|_1\) is also merely scaled by the same factor \(e^{-\lambda}\). Choosing \(\lambda:=\text{max}(\mathbf x)\) to be the maximum component of \(\mathbf x\) (which coincides with the \(\ell^{\infty}\) norm if all components of \(\mathbf x\) are non-negative) eliminates any overflow error that may have been present in evaluating the numerator \(e^{\mathbf x}\) by instead evaluating \(e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}\). For the same reason, the shifted denominator \(|e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}|_1\) is also immune to overflow error. Furthermore, the denominator is also resistant to underflow error because at least one term in the sum \(|e^{\mathbf x-\lambda\mathbf 1}|_1=\sum_{i}e^{x_i-\lambda}\) will be \(e^0=1\gg 0\), namely the term \(i=\text{argmax}(\mathbf x)\). The only remaining problem is that the modified numerator could still experience an underflow error; this would be bad if one subsequently evaluates the information content (loss function) of a softmax outcome, where taking \(\log(0)=-\infty\) would propagate as a nan. For this case, one can apply a similar trick (the “log-sum-exp” trick) to numerically stabilize the computation of this logarithm.
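
A minimal numpy sketch of the max-subtraction trick just described, together with the analogous log-sum-exp trick for the subsequent logarithm:

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)                    # subtract max(x) * 1 to prevent overflow
    exp = np.exp(shifted)
    return exp / exp.sum()                     # the denominator is >= 1, so it cannot underflow

def log_softmax(x):
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))   # avoids log(0) = -inf from an underflowed numerator

x = np.array([1000.0, 0.0, -1000.0])           # naive exp(x) would overflow/underflow here
print(softmax(x), log_softmax(x))
```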

Problem: Describe the numerical analysis phenomenon of catastrophic cancellation.

Solution: Subtraction \(f(x,y):=x-y\) is ill-conditioned when \(x\approx y\). This is because even if \(x\) and \(y\) have small relative errors, the relative error in their difference \(f(x,y)\) can still be substantial.
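
A tiny double-precision illustration:

```python
# Both inputs carry at most ~1e-16 relative error (machine epsilon),
# yet their difference has roughly 11% relative error.
x = 1.0 + 1e-15
y = 1.0
print(x - y)   # 1.1102230246251565e-15 rather than the exact 1e-15
```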

Problem: What does the condition number of a matrix \(X\) measure?

Solution: The condition number measures how numerically stable it is to invert \(X\mapsto X^{-1}\) (i.e. how much applying \(X^{-1}\) can amplify relative errors). For a normal matrix \(X\) (e.g. Hermitian), it is given by the ratio of the largest to smallest eigenvalues of \(X\) by absolute magnitude:

\[\frac{\text{max}|\text{spec}(X)|}{\text{min}|\text{spec}(X)|}\]

For instance, a non-invertible matrix would have \(\text{min}|\text{spec}(X)|=0\) and thus an infinite condition number. (For a general, not necessarily normal matrix \(X\), the condition number is instead defined as the ratio \(\sigma_{\text{max}}/\sigma_{\text{min}}\) of its largest to smallest singular values, which reduces to the eigenvalue ratio above in the normal case.)
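
For instance, numpy exposes this via np.linalg.cond (which by default computes the singular-value ratio \(\sigma_{\text{max}}/\sigma_{\text{min}}\)):

```python
import numpy as np

well_conditioned = np.array([[2.0, 0.0], [0.0, 1.0]])
ill_conditioned = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])   # nearly singular
print(np.linalg.cond(well_conditioned))   # 2.0
print(np.linalg.cond(ill_conditioned))    # ~4e10: inverting it amplifies rounding errors enormously
```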

Problem: Show that during gradient descent of a scalar field \(f(\mathbf x)\), if at some point \(\mathbf x\) in the optimization landscape one has \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) (e.g. if the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) is positive-definite, or if the gradient \(\frac{\partial f}{\partial\mathbf x}\) is an eigenvector of the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) with positive eigenvalue), then by a line search logic, the most effective local learning rate \(\alpha(\mathbf x)>0\) is given by:

\[\alpha(\mathbf x)=\frac{\left|\frac{\partial f}{\partial\mathbf x}\right|^2}{\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}}\]

Solution: Since gradient descent takes a step \(\mathbf x\mapsto\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\), and the goal is to decrease the value of the function \(f\) as much as possible to get closer to the minimum, one would like to maximize:

\[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\]

as a function of the learning rate \(\alpha>0\). Assuming that a quadratic approximation is valid:

\[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\approx \alpha\left|\frac{\partial f}{\partial\mathbf x}\right|^2-\frac{\alpha^2}{2}\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}\]

which is a downward-opening parabola in \(\alpha\) precisely when \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\); setting its \(\alpha\)-derivative to zero then yields the claimed optimal learning rate \(\alpha(\mathbf x)\).
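
A quick numerical check of this line-search formula on a quadratic bowl \(f(\mathbf x)=\frac{1}{2}\mathbf x^TH\mathbf x\) (for which the quadratic approximation is exact; the \(H\) and \(\mathbf x\) below are arbitrary choices):

```python
import numpy as np

H = np.array([[5.0, 0.0], [0.0, 1.0]])             # positive-definite Hessian
x = np.array([1.0, 2.0])
grad = H @ x                                        # df/dx for this f

alpha = grad @ grad / (grad @ H @ grad)             # the claimed optimal local learning rate

def f(p): return 0.5 * p @ H @ p
alphas = np.linspace(0.0, 0.5, 1001)
decreases = [f(x) - f(x - a * grad) for a in alphas]
print(alpha, alphas[np.argmax(decreases)])          # brute-force best alpha agrees with the formula
```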

Problem: Explain why gradient descent behaves like this in this case. What can be done to address that?

Solution: The problem is that the Hessian is ill-conditioned (in this case the condition number is \(5\) everywhere); the curvature in the \((1,1)\) direction is \(5\times\) greater than the curvature in the \((1,-1)\) direction. Gradient descent wastes time repeatedly descending “canyon walls” because it sees those as the steepest features. A \(2^{\text{nd}}\)-order optimization algorithm (e.g. Newton’s method) that uses information from the Hessian would do better.

Problem: Explain how the Karush-Kuhn-Tucker (KKT) conditions generalize the method of Lagrange multipliers.

Solution: The idea is, in addition to equality constraints like \(x+y=1\), one can also allow for inequality constraints like \(x^2+y^2\leq 1\) (indeed, an equality \(a=b\) is just \(2\) inequalities \(a\leq b\) and \(a\geq b\)). Then, constrained optimization of a scalar field \(f(\mathbf x)\) subject to a bunch of equality constraints \(g_1(\mathbf x)=g_2(\mathbf x)=…=0\) and a bunch of inequality constraints \(h_1(\mathbf x),h_2(\mathbf x),…\leq 0\) is equivalent to unconstrained optimization of the Lagrangian:

\[L(\mathbf x,\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h})=f(\mathbf x)+\boldsymbol{\lambda}_{\mathbf g}\cdot\mathbf g(\mathbf x)+\boldsymbol{\lambda}_{\mathbf h}\cdot\mathbf h(\mathbf x)\]

where \(\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h}\) are called KKT multipliers. The KKT necessary (but not sufficient!) conditions for optimality are then (ignoring technicalities about equating objects of possibly different dimensions):

  1. \(\frac{\partial L}{\partial\mathbf x}=\mathbf 0\) (stationarity)
  2. \(\mathbf g(\mathbf x)=\mathbf 0\geq \mathbf h(\mathbf x)\) (primal feasibility)
  3. \(\boldsymbol{\lambda}_{\mathbf h}\geq 0\) (if the inequalities were instead written as \(\mathbf h(\mathbf x)\geq\mathbf 0\) then this would instead require \(\boldsymbol{\lambda}_{\mathbf h}\leq 0\)) (dual feasibility)
  4. \(\boldsymbol{\lambda}_{\mathbf h}\odot\mathbf h(\mathbf x)=\mathbf 0\) (complementary slackness)
Posted in Blog | Leave a comment

Information Theory & Inference

Problem: Draw a schematic of a binary symmetric channel (BSC) with bit flip probability \(p_f\).

Solution: Classically, one has:

On the other hand, taking a more quantum perspective, in the computational basis \((|0\rangle,|1\rangle)\) one might define a binary symmetric channel as a kind of quantum logic gate:

\[\text{BSC}_{p_f}=\begin{pmatrix}\sqrt{1-p_f}&\sqrt{p_f}\\\sqrt{p_f}&\sqrt{1-p_f}\end{pmatrix}\]

The word “symmetric” in the name emphasizes that the bit flip probability from \(0\to 1\) is identical to the bit flip probability \(1\to 0\).

Problem: Consider the radio communication link between Galileo and Earth, and suppose that the noisy communication channel between them may be modelled as a BSC with bit flip probability \(p_f=0.01\). In a sequence of \(N=10^5\) bit transmissions, what is the probability that the error rate is less than \(1\%\)?

Solution: This means that one requires \(N’<0.01N=10^3\) bit flips. Each such \(N’\) has binomial probability \({{N}\choose{N’}}p_f^{N’}(1-p_f)^{N-N’}\) so the total probability is given by the cumulative distribution function:

\[\sum_{N’=0}^{999}{{N}\choose{N’}}p_f^{N’}(1-p_f)^{N-N’}\]
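
Numerically, this CDF is a scipy one-liner:

```python
from scipy.stats import binom

# P(fewer than 1000 bit flips) over N = 1e5 transmissions with p_f = 0.01
print(binom.cdf(999, 100000, 0.01))   # ~0.49, i.e. roughly a coin flip
```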

Problem: What are the \(2\) solutions for addressing noise in communication channels? Which one is preferable?

Solution:

  1. The physical solution: build more reliable hardware, cooling circuits, etc. (basically all the stuff one learns in a typical experimental physics class).
  2. The system solution: accept the noisy communication channel as is, and add communication systems to it so as to detect and correct errors. This is the goal of coding theory.

The system solution is preferable because it only comes at an increased computational cost whereas the former comes at an increased financial cost. But more importantly, the improvements arising from the system solution can be very dramatic, unlike the incremental improvements of the physical solution.

Problem: What is the distinction between information theory and coding theory?

Solution: The difference between information theory and coding theory is analogous to the distinction between theoretical physics and experimental physics; the former is concerned with the theoretical limitations of the aforementioned communication systems (e.g. “what is the best error-correcting performance possible over the space of all possible error-correcting algorithms?”) whereas the latter is interested in specifically designing practical such algorithms (also called codes) for error correction.

Problem: Draw a schematic that depicts the high-level structure of an error-correcting code.

Solution: Coding = Encoding + Decoding:

Here \(\mathbf s\) is the source message vector which is encoded as a transmitted message vector \(\mathbf t=\mathbf s+\textbf{redundancy}\). After transmission across the noisy communication channel, the transmitted message vector has been distorted into \(\tilde{\mathbf t}=\mathbf t+\textbf{noise}\). This is then decoded and received as a message vector \(\tilde{\mathbf s}\) with the hope that \(\tilde{\mathbf s}=\mathbf s\) with probability close to \(1\).

Problem: Consider a first naive attempt at designing an error-correcting code for a BSC, known as a repetition code. Explain how that works with an example.

Solution: A choice of error-correcting code basically amounts to a choice of encoder + decoder, since again coding = encoding + decoding. For the repetition code, the encoder is literally a repetition of each bit a predefined number of times, say \(3\) in the example below. The corresponding optimal decoder turns out to be the obvious “majority-vote” among the triplets (assuming \(p_f<1/2\); if it were the case that \(p_f>1/2\) then instead a “minority-vote” decoder would be better!)

(aside: “optimal” means the decoder should map each \(\tilde{\mathbf t}\) to \(\tilde{\mathbf s}=\text{argmax}_{\mathbf s}p(\mathbf s|\tilde{\mathbf t})\) and in particular as part of computing this conditional probability, another implicit assumption is that both \(p(0)=p(1)=1/2\) have equal prior probabilities)

Here \(\mathbf n\) is a sparse noise vector, each of whose bits is sampled from a \(p_f\)-Bernoulli distribution, where \(0\) means no bit flip and \(1\) means yes bit flip, and \(\tilde{\mathbf t}\equiv\mathbf t+\mathbf n\pmod{2}\), or using the xor notation, \(\tilde{\mathbf t}=\mathbf t\oplus\mathbf n\).

Problem: Show that, compared to no repetition for which the bit flip probability across the BSC is \(p_f<1/2\), the \(3\)-fold repetition code has an effective bit flip probability \(p^{\text{eff}}_f<p_f\) which is reduced compared to the original bit flip probability \(p_f\).

Solution: In order to get a bit flip, either \(2\) or all \(3\) bits in the \(3\)-fold repetition encoded vector \(\mathbf t\) have to be flipped. Thus, the effective bit flip probability \(p^{\text{eff}}_f\) is a cubic function of \(p_f\):

\[p^{\text{eff}}_f=3p_f^2(1-p_f)+p_f^3=3p_f^2-2p_f^3\]

In order to have \(p^{\text{eff}}_f<p_f\), this requires:

\[3p_f-2p_f^2<1\Rightarrow 2(p_f-1/2)(p_f-1)>0\]

which is satisfied for \(p_f<1/2\).

Problem: Instead of \(N=3\) repetitions, generalize the above expression for the effective bit flip probability \(p^{\text{eff}}_f\) to an arbitrary odd number \(N\) of repetitions.

Solution: The effective bit flip probability \(p^{\text{eff}}_f\) is now a degree-\(N\) polynomial in \(p_f\):

\[p^{\text{eff}}_f=\sum_{n=(N+1)/2}^N{N\choose n}p_f^n(1-p_f)^{N-n}\]

which for \(p_f\ll 0.5\) is dominated by the first term \(n=(N+1)/2\).
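
As a quick sketch, this tail sum is easy to evaluate with scipy for various odd \(N\) (taking \(p_f=0.01\) purely for illustration):

```python
from scipy.stats import binom

def p_eff(p_f, N):
    """Effective bit flip probability of the N-fold repetition code with majority-vote decoding."""
    # at least (N+1)/2 flips out of N; sf(k) = P(X > k) = P(X >= k+1)
    return binom.sf((N - 1) // 2, N, p_f)

p_f = 0.01
for N in (1, 3, 5, 7):
    print(N, p_eff(p_f, N))
# e.g. N=3 reproduces 3 p_f^2 - 2 p_f^3 ~ 3e-4
```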

Problem: Describe how the \((7,4)\) Hamming error correction code works, and explain why it’s an example of a linear block code.

Solution: The idea is that if the source vector \(\mathbf s\) has length \(L_{\mathbf s}\) and the encoded transmitted vector \(\mathbf t\) has some larger length \(L_{\mathbf t}>L_{\mathbf s}\), then one has the “rate” \(L_{\mathbf s}/L_{\mathbf t}=4/7\). Specifically, for every \(4\) source bits \(\mathbf s=s_1s_2s_3s_4\), the transmitted vector will contain \(7\) bits \(\mathbf t=t_1t_2t_3t_4t_5t_6t_7\) such that \(t_{1,2,3,4}:=s_{1,2,3,4}\) and \(t_{5,6,7}\) are uniquely specified by the requirements that \(t_1+t_2+t_3+t_5\equiv t_2+t_3+t_4+t_6\equiv t_1+t_3+t_4+t_7\equiv 0\pmod 2\); in other words, \(t_{5,6,7}\) are said to be parity-check bits in the sense that they check the parity of a certain subset of the source bits \(s_{1,2,3,4}\). This encoding can be depicted visually as such:

where the parity of each of the \(3\) circles has to be even. For example, if \(\mathbf s=1011\), then the Hamming encoding of this would be \(\mathbf t=1011001\).

As for the \((7,4)\) Hamming decoder, for generic \(p_f<1/2\) it can be a bit tricky but if one assumes \(p_f\ll 1/2\), then it is pretty safe to assume that there will be at most \(1\) bit flip during transmission across the noisy binary symmetric communication channel \(\mathbf t\mapsto\tilde{\mathbf t}\). In that case, one can take a given \(\tilde{\mathbf t}\) and compute its syndrome vector \(\mathbf z\); i.e. in each of the \(3\) circles, has the parity remained even \((0)\) or has it become odd \((1)\)? For instance, getting a syndrome \(\mathbf z=000\) means that most likely no bits were flipped and thus there are no errors! Or, if \(\mathbf z=011\), then under the premise that there was only a single bit flip, it must have been \(s_4\), etc. In general, \(\tilde{\mathbf t}\) can have \(8\) possible syndromes \(\mathbf z\), and each such syndrome \(\mathbf z\) maps uniquely onto the \(7\) possible single bit flips together with the possibility of zero bit flips.

The Hamming code is a block code because the parity-check bits look at whole blocks (or subsets) of the source bitstring \(\mathbf s\), rather than one bit of \(\mathbf s\) at a time (as was essentially done in repetition coding). The Hamming code is also said to be a linear code (mod \(2\)) because the encoding operation is linear in the sense that (assuming \(\mathbf s,\mathbf t\) are both column vectors):

\[\mathbf t=G^T\mathbf s\]

where the \(7\times 4\) generator matrix is given by:

\[G^T=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\\1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\]

Or transposing both sides, \(\mathbf t^T=\mathbf s^TG\), and in some coding theory texts one often takes \(\mathbf s,\mathbf t\) to be row vectors by default in which case this would just be written \(\mathbf t=\mathbf sG\). Of course, the columns of \(G^T\) represent the encodings \(\mathbf t\) of the “basis” bitstrings \(\mathbf s=1000,0100,0010,0001\).

Problem: The \((7,4)\) Hamming decoder \(\tilde{\mathbf t}\to\mathbf z\to\tilde{\mathbf s}\) uses a syndrome vector \(\mathbf z\) as an intermediate in the decoding process. Show that the map from \(\tilde{\mathbf t}\to\mathbf z\) is linear in the sense that:

\[\mathbf z=H\tilde{\mathbf t}\]

where the parity-check matrix \(H:=(P,1_3)\), and \(P=\begin{pmatrix}1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\) also appears in the generator matrix via:

\[G^T=\begin{pmatrix}1_4\\P\end{pmatrix}\]

Show that \(HG^T=0\) and hence if there were no errors such that \(\tilde{\mathbf t}=\mathbf t\), then the corresponding syndrome vector \(\mathbf z=\mathbf 0\Leftrightarrow\mathbf t\in\text{ker}H\). Hence, explain why a maximum-likelihood decoder is equivalent to finding the most probable noise vector \(\mathbf{n}\) for which \(\mathbf z=H\mathbf n\).

Solution: Just check it, it’s true. And clearly \(HG^T=2P=0\) because \(2\equiv 0\pmod 2\). It follows then that \(\mathbf t\in\text{ker}H\) because \(\mathbf t=G^T\mathbf s\). Thus, because \(\tilde{\mathbf t}=\mathbf t+\mathbf n\), acting on both sides with \(H\) results in \(\mathbf z=H\mathbf n\).
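
As a minimal numpy sketch (all arithmetic mod \(2\), using exactly the \(G^T\) and \(H\) quoted above, and reusing the \(\mathbf s=1011\) example):

```python
import numpy as np

P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]])
G_T = np.vstack([np.eye(4, dtype=int), P])        # 7x4 generator (transposed)
H = np.hstack([P, np.eye(3, dtype=int)])          # 3x7 parity-check matrix
assert not (H @ G_T % 2).any()                    # H G^T = 0 (mod 2)

def encode(s):
    return G_T @ np.asarray(s) % 2                # t = G^T s

def decode(t_received):
    z = H @ np.asarray(t_received) % 2            # syndrome z = H t~
    t_corrected = np.array(t_received)
    if z.any():                                   # nonzero syndrome: assume a single flip...
        j = np.flatnonzero((H.T == z).all(axis=1))[0]   # ...at the column of H matching z
        t_corrected[j] ^= 1                       # unflip that bit
    return t_corrected[:4]                        # s~ = first 4 bits

s = np.array([1, 0, 1, 1])
t = encode(s)                                     # -> [1 0 1 1 0 0 1], matching the example
t_noisy = t.copy(); t_noisy[5] ^= 1               # a single bit flip during "transmission"
print(t, decode(t_noisy))                         # the decoder recovers s = [1 0 1 1]
```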

Problem: As the \((7,4)\) Hamming code is a block code, given a possibly very long source message, the only way to implement the code is to chop up the message into blocks of length \(L_{\mathbf s}=4\); naturally, one can ask, for any given block in this “assembly line”, what’s the probability \(p(\tilde{\mathbf s}\neq\mathbf s)\) of it being decoded incorrectly? By contrast, a different question one can ask is what is the effective bit flip probability \(p_f^{\text{eff}}\) within one of these blocks? (give both to leading order in \(p_f\))

Solution: First, notice that without the Hamming code (i.e. just transmitting the \(4\) source bits directly), any nonzero number of bit flips leads to a decoding error \(\tilde{\mathbf s}\neq\mathbf s\); as \(p_f\ll 1/2\), the dominant process is having a single bit flip, so in this case \(p(\tilde{\mathbf s}\neq\mathbf s)={4\choose{1}}p_f(1-p_f)^3+…=4p_f+O(p_f^2)\).

By contrast, with the \((7,4)\) Hamming code, the block error probability is no longer linear in \(p_f\), but quadratic. This is simply because the \((7,4)\) Hamming code is designed to correct all \(1\)-bit flip errors, thereby eradicating the dominant process above. Unfortunately, any \(2\) bit flips (implicitly on distinct bits) in the \(7\)-bit encoding \(\mathbf t\) will not be corrected, leading to block error \(\tilde{\mathbf s}\neq\mathbf s\) (identical remarks apply to \(3,4,5,6,7\) bit flips); after all, no matter what syndrome vector \(\mathbf z\) one finds in \(\tilde{\mathbf t}\), the Hamming decoder tells one to unflip at most \(1\) bit because it operates under the assumption that there was only \(1\) bit flip, so it is literally impossible to fully correct a \(\tilde{\mathbf t}\) containing more than \(1\) bit flip. Thus:

\[p(\tilde{\mathbf s}\neq\mathbf s)=\sum_{n=2}^7{7\choose{n}}p_f^n(1-p_f)^{7-n}=21p_f^2+O(p_f^3)\]

By contrast, the effective bit flip probability is in general:

\[p_f^{\text{eff}}=\frac{1}{L_{\mathbf s}}\sum_{i=1}^{L_{\mathbf s}}p(\tilde s_i\neq s_i)\]

But here all \(L_{\mathbf s}=4\) source bits are on the same footing (indeed all \(L_{\mathbf t}=7\) transmitted bits are), so one can focus arbitrarily on e.g. \(i=1\) and ask what is \(p_f^{\text{eff}}=p(\tilde s_1\neq s_1)\). There are then \(2\) distinct ways to obtain the same answer (to leading order); the first is to explicitly enumerate all the possible \(2\)-bit-flip errors in \(\tilde{\mathbf t}\) for which the final decoded source vector \(\tilde{\mathbf s}\) ends up with an error in bit \(1\), i.e. \(\tilde s_1\neq s_1\). Clearly there are \(6\) such combinations in which \(s_1\) itself is flipped together with one of the other \(6\) bits (the decoder, operating under its single-flip assumption, then always flips some \(3^{\text{rd}}\) bit, leaving \(s_1\) in error). Meanwhile, one can check there are also \(3\) other configurations in which \(s_1\) is not initially flipped, but rather the decoder incorrectly flips it when going from \(\tilde{\mathbf t}\to\tilde{\mathbf s}\):

Since each of these are still \(2\) bit flip processes, one thus has:

\[p_f^{\text{eff}}=9p_f^2(1-p_f)^5+…=9p_f^2+O(p_f^3)\]

To leading order, one thus observes that \(p_f^{\text{eff}}=\frac{3}{7}p(\tilde{\mathbf s}\neq\mathbf s)\); the \(3/7\) prefactor has a simple interpretation: whenever there were \(2\) bit flips in \(\tilde{\mathbf t}\), the decoder inevitably ends up with \(3\) flipped bits in the corrected \(7\)-bit word (prior to truncating away the \(3\) parity-check bits), so \(3\) out of its \(7\) bits are in error, and all \(7\) bits are on the same footing. This thus has a very Bayesian interpretation where \(p(\tilde{\mathbf s}\neq\mathbf s)\) is the prior and \(3/7\) is the conditional probability of a bit like \(s_1\) being flipped given that the whole \(7\)-bit block has suffered a \(3\)-bit error.

Problem: How many distinct elements are there in \(\ker H\)? What is the interpretation of such an element?

Solution: There are \(2^{7-3}=16\) elements in \(\ker H\) (namely the \(16\) Hamming codewords themselves, including the all-zeros vector), of which \(15\) are nontrivial. Each such \(\mathbf n\in\ker H\) has the property that its syndrome \(\mathbf z=H\mathbf n=\mathbf 0\) vanishes, so if \(\mathbf n\) were interpreted as a pattern of bit flips/noise, that pattern would go completely undetected. For example:

Problem: The \((7,4)\) Hamming code is not the only one; what other possible \((L_{\mathbf t},L_{\mathbf s})\) Hamming codes are possible?

Solution: Any for which \(L_{\mathbf t}-L_{\mathbf s}=\log_2(L_{\mathbf t}+1)\) (e.g. \((3,1),(7,4),(15,11),(31,26),…\)), the RHS being the number of parity-check bits, which grows merely logarithmically with the total transmission size \(L_{\mathbf t}\)!

Problem: What are the \(2\) main interpretations of probability?

Solution: There are the frequentist and Bayesian interpretations of probability; the former interprets probabilities as long-run event frequencies while the latter interprets probabilities as degrees of belief (this is possible provided the concept of “degree of belief” satisfies certain reasonable Cox axioms). In general, probability theory is simply an extension of logic to account for uncertainty.

Problem: Whenever dealing with multiple random variables \(X,Y,…\), what is the fundamental object to look at? What can be derived from this fundamental object? (cf. the partition function in statistical mechanics as serving the same sort of “all-encompassing” role).

Solution: The fundamental object which one should prioritize figuring out is the joint probability distribution function \(p_{(X,Y,…)}(x,y,…)\) of all the random variables \((X,Y,…)\). Consider a simple visualization of this concept for just \(2\) real-valued random variables \(X, Y\) and the marginal as well as conditional probability distributions one can obtain from their joint probability distribution \(p_{(X,Y)}(x,y)\equiv p(x,y)\):

Problem: A certain disease is present among \(1\%\) of the world’s population, and a medical test for the disease has a \(95\%\) true positive rate and a \(90\%\) true negative rate. If Joe tests positive for the disease, what are his odds of actually having it?

Solution: For these sorts of problems, there is essentially a \(3\)-step recipe to solving them:

  1. Map out the complete marginal distribution of the prior.
  2. For each value of the prior, “unshovel” your way back to the joint distribution w.r.t. all values of the new data.
  3. Hence, compute whatever you need to.

In this case, the prior \(X\) may be modelled as a Bernoulli random variable taking on the value \(x=0\) if a person doesn’t have the disease and \(x=1\) if they do. Following the steps:
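
As a minimal numerical sketch of these \(3\) steps (all numbers taken from the problem statement):

```python
p_disease = 0.01                                   # step 1: marginal (prior) over x
p_pos_given_disease = 0.95                         # step 2: "unshovel" to the joint via
p_pos_given_healthy = 1 - 0.90                     #         the test's likelihoods
joint_pos_disease = p_disease * p_pos_given_disease
joint_pos_healthy = (1 - p_disease) * p_pos_given_healthy
p_disease_given_pos = joint_pos_disease / (joint_pos_disease + joint_pos_healthy)
print(p_disease_given_pos)                         # step 3: ~0.088
```

i.e. despite testing positive, Joe's probability of actually having the disease is only about \(8.8\%\), or odds of roughly \(1:10.4\).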

Problem: Suppose that on a certain online store, products can be rated either positively or negatively. There are \(3\) different sellers of the same product. Seller \(A\) has \(10\) reviews with \(100\%\) positive ratings, seller \(B\) has \(50\) reviews with \(96\%\) positive ratings, and seller \(C\) has \(200\) reviews with \(93\%\) positive ratings. Which seller should one buy from?

Solution: This is another Bayesian inference problem (draw a picture with \(p_+\) on the horizontal axis and \(N_+\) on the vertical axis!). The model is that the seller has some underlying fixed, but unknown probability \(p_+\) of delivering a positive experience and \(1-p_+\) of delivering a negative experience to a user. A priori, if one had no knowledge about the review information of other customers, then the prior \(p_+\) can be modelled by a uniform distribution \(p(p_+)=1\) for \(p_+\in[0,1]\). But then, equipped with the review info of \(N_+\) positive reviews out of \(N\), Bayesian updating says that this uniform prior should be updated to a beta distributed posterior:

\[p(p_+|N_+,N)=\frac{p(p_+)p(N_+,N|p_+)}{\int_0^1dp_+p(p_+)p(N_+,N|p_+)}=\frac{{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}{\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}=\frac{(N+1)!p_+^{N_+}(1-p_+)^{N-N_+}}{N_+!(N-N_+)!}\]

where the beta function \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\) normalization has been (optionally) used.

A reasonable goal would be to maximize one’s own probability of a positive experience. This means taking the posterior \(p(p_+|N_+,N)\) computed above and recycling that as one’s prior \(p(p_+|N_+,N)\mapsto p(p_+)\). Then, the probability \(p(+)\) of a positive experience coincides with the expected value of \(p_+\):

\[p(+)=\int_0^1 dp_+p(p_+)p_+=\frac{\textrm{B}(N_++2,N-N_++1)}{\textrm{B}(N_++1,N-N_++1)}=\frac{N_++1}{N+2}\]

this result is called Laplace’s rule of succession, and may be remembered by taking the original \(N\) reviews, and appending \(2\) new fictitious reviews in which one is positive and one is negative. For instance, if a product had no reviews so that \(N=N_+=0\), then reasonably enough Laplace’s rule of succession predicts one to have a \(p(+)=1/2\) probability of a positive experience. Applying it to each of the sellers in the question, one finds that seller \(B\) has the highest probability \(p(+)\), so one should buy from them.

Remember that Bayesian inference is always predicated on assumptions, which above meant a uniform equiprobable ignorance \(p(p_+)=1\) at the very beginning about \(p_+\), though it has been suggested by Jeffreys, Haldane, etc. to use a different prior like \(p(p_+)=\frac{1}{\pi\sqrt{p_+(1-p_+)}}\) or \(p(p_+)\sim\frac{1}{p_+(1-p_+)}\) to model one’s ignorance. Of course, this would modify how Laplace’s rule of succession looks with each of these priors.
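
A quick numerical check of Laplace's rule of succession applied to the \(3\) sellers (with the uniform prior assumed above):

```python
# (N_+, N) for each seller, as given in the problem statement
sellers = {"A": (10, 10), "B": (48, 50), "C": (186, 200)}
for name, (n_pos, n) in sellers.items():
    print(name, (n_pos + 1) / (n + 2))   # Laplace's rule p(+) = (N_+ + 1)/(N + 2)
# A: 0.917, B: 0.942, C: 0.926  ->  seller B maximizes p(+)
```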

Problem: (for fun!) Prove the earlier beta function relation \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\). Show that in the special case of positive integers \(z,w\in\mathbf Z^+\), the result may be derived combinatorially.

Solution:

(loosely speaking, this is reminiscent of the identity for the exponential function \(\exp(z)\exp(w)=\exp(z+w)\) except for the \(\Gamma\) function there is an “extra factor” in the form of the beta function \(B(z,w)\)). In the special case where \(z,w\in\mathbf Z^+\), the initial temptation would be to expand \((1-x)^{w-1}\) using the binomial theorem, but it turns out there is a much cuter way to evaluate the integral by telling a story.

Imagine seeing a random distribution of \(N+1\) balls on the unit interval \([0,1]\), of which \(1\) of them is pink while the other \(N\) are black. There are \(2\) ways they could’ve gotten there. On the one hand, it may be that there were initially \(N+1\) black balls, and then one of them was painted pink at random, and then the balls were thrown altogether onto \([0,1]\). Or it may be that the \(N+1\) black balls were first thrown altogether onto \([0,1]\), and subsequently one of them was painted pink at random. Clearly, these \(2\) processes are equivalent definitions of a discrete random variable \(N_+\) defined as the number of black balls to the left of the pink ball on \([0,1]\). This has probability mass function \(p(N_+)\) in the first story given by:

\[p(N_+)=\int_0^1dp_+p(\text{pink thrown at }p_+)p(N_+|\text{pink thrown at }p_+)\]

\[=\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}\]

On the other hand, according to the second story, \(N_+=0,1,…,N\) each with uniform probability!

\[p(N_+)=\frac{1}{N+1}\]

Hence it is established that:

\[\int_0^1 dp_+p_+^{N_+}(1-p_+)^{N-N_+}=\frac{N_+!(N-N_+)!}{(N+1)!}\]

which is precisely the earlier, more general result \(\textrm{B}(z,w)=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\) specialized to the positive integers \(z=N_++1\) and \(w=N-N_++1\).

Problem: Given a discrete random variable \(X\) with probability mass function \(p(x)\), define the information content \(I(x)\) associated to drawing outcome \(x\) from \(X\). Hence, define the entropy \(S_X\) of \(X\).

Solution: A picture is worth a thousand words:

where \(I(x):=-\log p(x)\) is the surprise associated with drawing \(x\) from \(X\), and \(S_X:=\sum_xp(x)I(x)=-\sum_x p(x)\log p(x)\) is the average surprise of the discrete random variable \(X\) (with base-\(2\) logarithms this can also be written \(S_X=\sum_x I(x)2^{-I(x)}\)). The choice of base in the \(\log\) defines the unit of information (e.g. bits, nats, etc.).

Problem: Show that for \(X\perp Y\) independent discrete random variables, their joint entropy \(S_{(X,Y)}=S_X+S_Y\).

Solution: This is really rooted in Shannon’s fundamental axioms that the joint information content for independent random variables should be additive \(I_{(X,Y)}(x,y)=I_X(x)+I_Y(y)\):

\[S_{(X,Y)}:=-\sum_{(x,y)}p(x,y)\log p(x,y)=-\sum_{(x,y)}p(x)p(y)\log p(x)-\sum_{(x,y)}p(x)p(y)\log p(y)=S_X+S_Y\]

Problem: Give the intuition behind the Kullback-Leibler divergence \(D_{\text{KL}}(\tilde X|X)\) of a random variable \(\tilde X\) with respect to another random variable \(X\) (both defined over the same sample space).

Solution: Any random variable \(X\) inherently contains some non-negative average level of surprise \(S_X\geq 0\) because of the word “random” in “random variable” (indeed, \(S_X=0\) iff \(p(x)=1\) for some outcome \(x\)). That’s one kind of surprise. But in practice, if one is under the illusion that the outcomes \(x\) are being drawn from a random variable \(\tilde X\) when in fact they are actually following a different random variable \(X\neq\tilde X\), then the perceived degree of surprise \(S_{\tilde X|X}=-\sum_xp_X(x)\log p_{\tilde X}(x)\) (called the cross-entropy of \(\tilde X\) with respect to \(X\)) would intuitively be larger than the average surprise \(S_X\) inherent in the intrinsic randomness of \(X\) itself. The KL divergence of \(\tilde X\) with respect to \(X\) thus measures how surprising the data is purely due to the wrong model \(\tilde X\) being used in lieu of the ground truth \(X\):

\[D_{\textrm{KL}}(\tilde X|X):=S_{\tilde X|X}-S_X\]

thus, it should be intuitively clear that \(D_{\textrm{KL}}(\tilde X|X)\geq 0\), a result known as the Gibbs inequality.
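
A small numpy sketch of entropy, cross-entropy and KL divergence (in bits) on made-up discrete distributions, illustrating the Gibbs inequality:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # adopt the convention 0 log 0 := 0
    return -np.sum(p * np.log2(p))

def cross_entropy(q, p):                  # outcomes drawn from p, surprise scored by the model q
    return -np.sum(np.asarray(p) * np.log2(q))

p = np.array([0.5, 0.25, 0.25])           # ground truth X
q = np.array([0.8, 0.1, 0.1])             # wrong model X~
print(entropy(p))                          # S_X = 1.5 bits
print(cross_entropy(q, p) - entropy(p))    # D_KL(X~|X) >= 0 (Gibbs inequality)
```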

Problem: What is the intuition behind Jensen’s inequality.

Solution: Take any “smiling” curve and hang a bunch of masses on the curve. Then Jensen’s inequality says that their center of mass will also lie above the curve.

More precisely, for a convex function \(f(x)\) (which means any secant line lies above the function’s graph, or algebraically \(f(\lambda x+(1-\lambda)x’)\leq\lambda f(x)+(1-\lambda)f(x’)\) for \(\lambda\in[0,1]\)), one has:

\[\langle f(x)\rangle\geq f(\langle x\rangle)\]

with equality \(\langle f(x)\rangle=f(\langle x\rangle)\) iff \(x\) is (almost surely) constant or \(f\) is linear on the support of \(x\). More explicitly, for any convex linear combination of the form \(p_1x_1+p_2x_2+…+p_Nx_N\) where each \(p_i\geq 0\) and \(\sum_{i=1}^Np_i=1\), one has:

\[\sum_i p_if(x_i)\geq f\left(\sum_ip_ix_i\right)\]

The inequality is reversed for concave functions \(f(x)\).

Problem: Unstable particles are emitted from a source and decay at a distance \(x\geq 0\), an exponentially distributed real random variable with characteristic length \(x_0\). Decay events can only be observed if they occur in a window extending from \(x=1\) to \(x=20\) (arbitrary units). If \(100\) decays are registered at some locations \(1\leq x_1,x_2,…,x_{100}\leq 20\), estimate the characteristic length \(x_0\).

Solution: The prior distribution of \(x\) is, as stated, an exponential decay:

\[p(x|x_0)=\frac{e^{-x/x_0}}{x_0}[x\geq 0]\]

which is normalized \(\int_{-\infty}^{\infty}dxp(x|x_0)=1\). After learning that \(x\in [1,20]\), the posterior distribution of \(x\) becomes:

\[p(x|x\in [1,20],x_0)=\frac{p(x|x_0)p(x\in[1,20]|x, x_0)}{p(x\in[1,20]|x_0)}\]

but if \(x\) is measured, and happens to lie in \(x\in[1,20]\), then \(p(x\in[1,20]|x, x_0)=1\), otherwise if \(x\notin [1,20]\) then \(p(x\in[1,20]|x, x_0)=0\); all in all one has the compact Iverson bracket expression \(p(x\in[1,20]|x, x_0)=[x\in [1,20]]\). The denominator is the integral of the numerator and indeed is just the usual way that probability density functions are meant to be used:

\[p(x\in[1,20]|x_0)=\int_{-\infty}^{\infty}dxp(x|x_0)p(x\in[1,20]|x,x_0)=\int_1^{20}dx\frac{e^{-x/x_0}}{x_0}=e^{-1/x_0}-e^{-20/x_0}\]

Now, using this updated prior, one has:

\[p(\{x_1,…,x_N\}|x\in[1,20],x_0)=p(x_1|x\in[1,20],x_0)…p(x_N|x\in[1,20],x_0)=\frac{e^{-(x_1+…+x_N)/x_0}}{x_0^N(e^{-1/x_0}-e^{-20/x_0})^N}\]

But one would instead like to Bayesian-infer \(p(x_0|\{x_1,…,x_N\},x\in[1,20])\) so as to find the most probable value of the characteristic length \(x_0\):

\[p(x_0|\{x_1,…,x_N\},x\in[1,20])=\frac{p(x_0|x\in[1,20])p(\{x_1,…,x_N\}|x\in[1,20],x_0)}{p(\{x_1,…,x_N\}|x\in[1,20])}\]

Clearly, the only missing puzzle piece in here is the prior distribution \(p(x_0|x\in[1,20])\) on the characteristic length itself. Even without specifying this, one can still gain a lot of insight by plotting \(p(\{x_1,…,x_N\}|x\in[1,20],x_0)\) as a function of \(x_0\), or more precisely, to get the key idea, plot the likelihood \(p(x|x\in[1,20],x_0)\) of \(x_0\) for various fixed \(x\in[1,20]\), and notice in particular a peak in \(x_0\) as \(x\) decreases.
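
As a rough numerical sketch, one can also just maximize the likelihood above over \(x_0\) directly (i.e. find the most probable \(x_0\) under a flat prior); the data below is made up, generated from a true \(x_0=10\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
samples = rng.exponential(scale=10.0, size=2000)          # true x_0 = 10 (unknown in practice)
x = samples[(samples >= 1) & (samples <= 20)][:100]       # keep 100 decays inside the window

def neg_log_likelihood(x0):
    # -log p({x_i} | x in [1,20], x0) = sum(x_i)/x0 + N log x0 + N log(e^{-1/x0} - e^{-20/x0})
    return x.sum() / x0 + len(x) * np.log(x0) + len(x) * np.log(np.exp(-1 / x0) - np.exp(-20 / x0))

result = minimize_scalar(neg_log_likelihood, bounds=(0.1, 1000), method="bounded")
print(result.x)                                            # roughly recovers x_0 ~ 10
```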

Posted in Blog | Leave a comment

SVD, Pseudoinverse, and PCA

Problem: State the form of the singular value decomposition (SVD) of an arbitrary linear operator \(X:\mathbf C^m\to\mathbf C^n\).

Solution: The SVD of \(X\) is given by:

\[X=U_2\Sigma U^{\dagger}_1\]

where \(U_1\in U(m)\) and \(U_2\in U(n)\) are unitary operators and \(\Sigma:\mathbf C^m\to\mathbf C^n\) is a “Hermitian” positive semi-definite diagonal operator. Strictly speaking, one should prove that such a factorization of “rotate-stretch-rotate” really does exist for any \(X\); the details are omitted here.

Problem: In practice, how does one find the SVD of a linear operator \(X\)?

Solution: The idea is to look at the \(2\) Hermitian, positive semi-definite operators constructed from \(X\):

\[XX^{\dagger}=U_2\Sigma\Sigma^{\dagger}U_2^{\dagger}\]

\[X^{\dagger}X=U_1\Sigma^{\dagger}\Sigma U_1^{\dagger}\]

And in particular, by the spectral theorem, both \(XX^{\dagger}\) and \(X^{\dagger}X\) have an orthonormal eigenbasis (of \(\mathbf C^n\), \(\mathbf C^m\) respectively) corresponding to real, non-negative eigenvalues \(\sigma^2_1\geq\sigma^2_2\geq…\geq\sigma^2_{\text{min}(n,m)}\) (padded with zeros for whichever of the two operators is larger), so since the above matches the standard eigendecomposition template, it’s clear that computing the SVD of \(X\) just amounts to diagonalizing \(XX^{\dagger}\) and \(X^{\dagger}X\) (actually just \(1\) has to be diagonalized, namely diagonalize \(XX^{\dagger}\) if \(n\leq m\) and diagonalize \(X^{\dagger}X\) if \(m\leq n\); then, because the squared singular values \(\sigma^2_1,\sigma^2_2,…,\sigma^2_{\text{min}(n,m)}\) are identical for both \(XX^{\dagger}\) and \(X^{\dagger}X\), it’s easy to find the eigenvectors of the other).
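
As a quick numpy sanity check of this recipe (on a random complex \(X\) for illustration), the squared singular values of \(X\) indeed coincide with the eigenvalues of \(XX^{\dagger}\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))   # n=3, m=5

U2, sigma, U1_dagger = np.linalg.svd(X)            # X = U2 @ diag(sigma) @ U1^dagger
eigvals = np.linalg.eigvalsh(X @ X.conj().T)       # X X^dagger is Hermitian PSD
print(np.sort(sigma**2), np.sort(eigvals))         # same numbers, up to ordering/roundoff
```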

Problem: The most important application of SVD is to estimate a complicated linear operator \(X:\mathbf C^m\to\mathbf C^n\) by a low-rank approximation. Explain how this is achieved.

Solution: Order the \(\text{min}(n,m)\) singular values of \(X:\mathbf C^m\to\mathbf C^n\) from greatest to least as before \(\sigma^2_1,\sigma^2_2,…\sigma^2_{\text{min}(n,m)}\). Write the column eigenvectors \(U_1=(\mathbf u_1^{(1)}, \mathbf u_1^{(2)},…,\mathbf u_1^{(m)})\) and \(U_2=(\mathbf u_2^{(1)}, \mathbf u_2^{(2)},…,\mathbf u_2^{(n)})\). Then the singular value decomposition may also be written:

\[X=U_2\Sigma U^{\dagger}_1=\sum_{i=1}^{\text{min}(n,m)}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

In particular, the Eckart-Young theorem states that the best rank-\(r\) approximation to \(X\) (for \(r\leq\text{rank}(X)\)) is given by truncating the above series at its first \(r\) terms:

\[X\approx \sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

Here, the word “best” is with respect to the Frobenius norm:

\[\text{argmin}_{\tilde X\in\mathbf C^{n\times m}:\text{rank}(\tilde X)=r}\sqrt{\text{Tr}((X-\tilde X)^{\dagger}(X-\tilde X))}=\sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

(insert proof here:)
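
Numerically, the truncation itself is a few lines of numpy (random matrix, arbitrary rank \(r\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 40))
U2, sigma, U1_dagger = np.linalg.svd(X, full_matrices=False)

r = 5
X_r = U2[:, :r] @ np.diag(sigma[:r]) @ U1_dagger[:r, :]     # rank-r truncation
print(np.linalg.matrix_rank(X_r))                           # r
print(np.linalg.norm(X - X_r))                              # Frobenius error = sqrt(sum of discarded sigma_i^2)
```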

Problem: Here is another application of SVD: given a diagonal operator \(\Sigma:\mathbf C^m\to\mathbf C^n\), explain how to compute its (Moore-Penrose) pseudoinverse \(\Sigma^+:\mathbf C^n\to\mathbf C^m\). Hence, how can the pseudoinverse of a general linear operator \(X:\mathbf C^m\to\mathbf C^n\) be computed?

Solution: \(\Sigma^+\) is almost like the usual inverse \(\Sigma^{-1}\) in that one simply inverts all the eigenvalues on the diagonal. But the exception is if \(\Sigma\) has a zero eigenvalue…then ordinarily \(\Sigma^{-1}\) would not exist, so for this case the pseudoinverse is created and the instruction is to simply leave the zeros be (rather than trying to invert them and get \(1/0=\infty\)). Finally, also don’t forget to transpose the result. For a general \(X\), the quickest way to compute the pseudoinverse is to use its SVD:

\[X=U_2\Sigma U^{\dagger}_1\]

Ordinarily trying to take the inverse (even if \(n\neq m\)) would appear to give:

\[X^{-1}=U_1\Sigma^{-1}U^{\dagger}_2\]

But since \(\Sigma^{-1}\) may not exist, the pseudoinverse of \(X\) is thus given by:

\[X^+=U_1\Sigma^+U_2^{\dagger}\]

(of course, if \(X^{-1}\) actually exists then the pseudoinverse is in fact not “pseudo” at all \(X^+=X^{-1}\)).

Problem: In general, given a vector \(\mathbf y\in\mathbf C^n\) and linear operator \(X:\mathbf C^m\to\mathbf C^n\), the equation \(\mathbf y=X\mathbf x\) only has a unique solution for \(\mathbf x\in\mathbf C^m\) provided that \(X\) is invertible, i.e. \(\mathbf x=X^{-1}\mathbf y\). If this isn’t the case however, one can still compute the pseudoinverse \(X^+\); what interpretation does \(X^+\mathbf y\in\mathbf C^m\) have?

Solution: There are \(2\) cases; if \(m\geq n\) (the underdetermined case, assuming \(X\) has full row rank), then there will be many solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\), and in particular \(X^+\mathbf y\) is also an element of this solution space (i.e. \(XX^+\mathbf y=\mathbf y\)) but has the additional property that it is closest to the origin:

\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m:\mathbf y=X\mathbf x}|\mathbf x|\]

By contrast, if \(m\leq n\) (the overdetermined case), then generically there will be no solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\) and in that case of course \(X^+\mathbf y\) is also not a solution, but nevertheless it is the closest one can come to an actual solution in the sense that:

\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m}|\mathbf y-X\mathbf x|\]
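
Both cases are easy to illustrate numerically with numpy's pinv (random, generically full-rank matrices assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined (many solutions): X^+ y is the minimum-norm solution of y = X x.
X_wide = rng.standard_normal((3, 6))               # n=3 equations, m=6 unknowns
y = rng.standard_normal(3)
x_min_norm = np.linalg.pinv(X_wide) @ y
print(np.allclose(X_wide @ x_min_norm, y))         # it really solves y = X x

# Overdetermined (no exact solution): X^+ y is the least-squares solution.
X_tall = rng.standard_normal((6, 3))               # n=6 equations, m=3 unknowns
y = rng.standard_normal(6)
x_ls = np.linalg.pinv(X_tall) @ y
print(np.allclose(x_ls, np.linalg.lstsq(X_tall, y, rcond=None)[0]))
```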

Problem: Explain the what and how of conducting principal component analysis (PCA) of a bunch of (say \(m\)) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\).

Solution: Earlier, \(X:\mathbf C^m\to\mathbf C^n\) simply had the interpretation of an arbitrary linear operator, in which case the interpretation of \(XX^{\dagger}\) and \(X^{\dagger}X\) were both likewise arbitrary. However, one intuitive interpretation is to view \(X\) as a data matrix; if one takes the \(m\) columns of \(X=(\mathbf x_1,…,\mathbf x_m)\) to represent the \(m\) feature vectors in \(\mathbf C^n\), then the Gram matrix \(X^{\dagger}X\) takes on a statistical interpretation as an autocorrelation matrix:

\[X^{\dagger}X=
\begin{pmatrix} |\mathbf{x}_1|^2 & \cdots & \mathbf{x}_1^{\dagger}\mathbf{x}_m \\
\vdots & \ddots & \vdots \\
\mathbf{x}_m^{\dagger}\mathbf{x}_1 & \cdots & |\mathbf{x}_m|^2
\end{pmatrix}\]

or alternatively, if \(X\mapsto X-\boldsymbol{\mu}\otimes\mathbf{1}_m\) has already been mean-subtracted (where \(\boldsymbol{\mu}=\sum_{i=1}^m\mathbf x_i/m\)) so as to have zero mean, then \(X^{\dagger}X\) is the Gram matrix of the centered data while \(XX^{\dagger}/m\) is its covariance matrix. Based on the earlier discussion of SVD, it follows that the columns of \(U_2\in U(n)\) (the orthonormal eigenvectors of \(XX^{\dagger}\)) give the principal axes of this covariance matrix, and that the corresponding singular values in \(\Sigma\) measure (up to an overall \(1/\sqrt m\) normalization) the standard deviation of the data along those principal axes (as variance is commonly denoted \(\sigma^2\), this also explains the choice of notation in the SVD). Note that in such applications, one is almost always in the regime \(m\gg n\) so that there will be \(n\) singular values \(\sigma_1,…,\sigma_n\). Together, each principal axis with its corresponding singular value is said to be a principal component (PC) of the data \(X\), hence the name. The “analysis” part comes from analyzing how the total variance \(\sum_{i=1}^n\sigma_i^2\) (up to the same normalization) of the data \(X\) is explained by each PC; specifically the fraction of it explained by the \(j^{\text{th}}\) PC is \(\sigma_j^2/\sum_{i=1}^n\sigma_i^2\). If the singular values are ordered \(\sigma_1\geq…\geq\sigma_n\), then PCA also provides a lossy data compression algorithm: in analogy to the low-rank approximation application of SVD, one can pick some \(r<n\) and project all \(m\) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\) onto the \(r\)-dimensional subspace spanned by the first \(r\) principal axes associated to the largest \(r\) singular values \(\sigma_1,…,\sigma_r\).

If the data matrix had instead contained the feature vectors as its rows, \(X=(\mathbf x_1,…,\mathbf x_m)^{\dagger}\), then one would repeat the above discussion with the roles of \(XX^{\dagger}\) and \(X^{\dagger}X\) (and correspondingly of \(U_2\) and \(U_1\)) interchanged.
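
Putting this together, a minimal PCA-via-SVD sketch in numpy, following the column convention above (synthetic \(2\)-dimensional data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 500
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.0], [1.0, 1.0]], size=m).T   # n x m data matrix

X_centered = X - X.mean(axis=1, keepdims=True)          # subtract the mean feature vector mu
U2, sigma, U1_dagger = np.linalg.svd(X_centered, full_matrices=False)

print(U2)                                 # columns = principal axes (eigenvectors of X X^dagger)
print(sigma**2 / np.sum(sigma**2))        # fraction of total variance explained by each PC

r = 1                                     # lossy compression: keep only the leading r axes
scores = U2[:, :r].T @ X_centered         # r x m compressed representation
X_reconstructed = U2[:, :r] @ scores + X.mean(axis=1, keepdims=True)
```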

Posted in Blog | Leave a comment