Problem: How does the paradigm of reinforcement learning (RL) fit within the broader context of machine learning?
Solution: It is instructive to compare and contrast reinforcement learning with supervised learning (SL); in doing so, it will be seen that RL can in fact be viewed as a generalization of SL:
- In SL, one has training feature vectors \(\mathbf x_i\); in RL, these manifest as the state vector \(\mathbf x_t\) of an agent (just as it is harmless to conflate the position vector \(\mathbf x(t)\) of a particle in classical mechanics with the particle itself, there is no harm in conflating the state vector \(\mathbf x_t\) with the agent itself; see the map-territory relation). The difference is purely a perspective shift, reflected in the choice of subscript variable \(i=1,2,3,\dots\) vs. \(t=0,1,2,\dots\).
- In SL, the feature vectors \(\mathbf x_i\) are i.i.d. draws from the same underlying data-generating distribution. In RL, the initial state vector \(\mathbf x_{t=0}\) is drawn from an initial distribution, and subsequent state vectors \(\mathbf x_t\) are dependent on the previous state \(\mathbf x_{t-1}\) (discrete-time Markov chain).
- In SL, the model output \(\hat y(\mathbf x|\boldsymbol{\theta})\) is analogous to an action \(\mathbf f\) sampled from the agent’s policy \(\pi(\mathbf f|\mathbf x,t)\).
- In SL, each training feature vector \(\mathbf x_i\) is labelled by its correct output \(y_i\). This concept has no analog in reinforcement learning as there is no notion of a “correct” action for the agent to take.
- In SL, the feedback mechanism is provided by a loss function \(L(\hat y,y)\). By contrast, in RL the feedback mechanism is provided by a reward signal \(r_t\).
- In SL, the objective is to find optimal parameters \(\boldsymbol{\theta}^*=\text{argmin}_{\boldsymbol{\theta}}\sum_{i=1}^{N_{\text{train}}}L(\hat y(\mathbf x_i|\boldsymbol{\theta}),y_i)\) minimizing the training cost over the \(N_{\text{train}}\) training examples. In RL, the objective is to find an optimal agent policy \(\pi^*=\text{argmax}_{\pi}\langle\sum_{t=0}^T\gamma^tr_t|\pi\rangle\) maximizing the expected return over an episode of horizon \(T\leq\infty\) (which depends on if/when the agent reaches a terminal state \(\mathbf x_T\)). A toy side-by-side sketch of these two objectives is given after this list.
- In SL, one typically distinguishes between training, cross-validation, and testing datasets, the golden rule being that the model should never get to see any of the test data during training. In RL, it is socially acceptable to train on your test set 🙂
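To make the contrast between the two objectives concrete, here is a minimal, self-contained Python sketch; every number, the toy model \(\hat y=\theta x\), and the one-state MDP below are made up purely for illustration. It brute-forces both optimizations: a grid search for \(\boldsymbol{\theta}^*\) minimizing the training cost, and a comparison of two candidate policies via Monte Carlo estimates of their expected returns.

```python
import random

# --- SL: theta* = argmin_theta sum_i L(yhat(x_i|theta), y_i) ---
xs = [0.0, 1.0, 2.0, 3.0]             # toy training inputs (made up)
ys = [0.1, 2.1, 3.9, 6.2]             # toy labels, roughly y = 2x
thetas = [i / 10 for i in range(41)]  # candidate slopes 0.0, 0.1, ..., 4.0

def train_cost(theta):
    # squared-error loss summed over the training examples
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

theta_star = min(thetas, key=train_cost)

# --- RL: pi* = argmax_pi < sum_t gamma^t r_t | pi > ---
# A made-up one-state MDP: action "safe" always gives reward 1, while action
# "risky" gives reward 5 with probability 0.3 and reward -1 otherwise.
def step(action, rng):
    if action == "safe":
        return 1.0
    return 5.0 if rng.random() < 0.3 else -1.0

def expected_return(action, gamma=0.9, episodes=2000, horizon=5):
    rng = random.Random(0)
    total = 0.0
    for _ in range(episodes):
        ret = 0.0
        for t in range(horizon):
            ret += gamma ** t * step(action, rng)   # discounted return of one episode
        total += ret
    return total / episodes

policies = {"always safe": "safe", "always risky": "risky"}
pi_star = max(policies, key=lambda name: expected_return(policies[name]))

print(f"theta* = {theta_star:.1f}   (SL: minimize the training cost)")
print(f"pi*    = {pi_star}   (RL: maximize the expected return)")
```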
Problem: Imagine the set of all states \(\mathbf x\), i.e. the agent’s state space. On this state space, one can impose a scalar field \(v_{\pi}(\mathbf x,t)\); what is the meaning of this field? Give a simple example thereof.
Solution: The scalar field \(v_{\pi}(\mathbf x,t)\) can roughly be thought of as describing how “valuable” the state \(\mathbf x\) is at time \(t\). More precisely, if one were to initialize an agent \(\pi\) at state \(\mathbf x_t:=\mathbf x\), then \(v_{\pi}(\mathbf x,t)\) is the expected return:
\[v_{\pi}(\mathbf x,t):=\biggl\langle\sum_{t'=t+1}^T\gamma^{t'-t-1}r_{t'}\biggm|\mathbf x,\pi\biggr\rangle\]
Consider a \(\pi\)-creature (of \(3\)Blue\(1\)Brown fame) symbolizing an agent with policy \(\pi\) that has to complete a maze. Within the RL framework, this can be modelled as an MDP in which each square is one possible state \(\mathbf x\) of the agent, the maze structure defines the allowed actions \(\mathbf f\) from each state \(\mathbf x\), and one can work with an undiscounted \(\gamma=1\) return in which the reward is \(r_t=-1\) for each action taken. Thus, the optimal policy \(\pi^*\) is a time-independent, deterministic policy \(\mathbf f=\pi^*(\mathbf x)\) that gets the \(\pi\)-creature from its initial state to the terminal state in the fewest moves, and the optimal value of a square is simply minus the number of moves needed to escape from it. With that in mind, one can label on top of each square \(\mathbf x\) its corresponding optimal value \(v_{\pi^*}(\mathbf x)\) (note: this image was generated using Nano Banana Pro; some of the calculated values are wrong, but the idea is clear):

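To make the maze example concrete, here is a minimal Python sketch (the maze layout below is a made-up stand-in, not the one in the image above): with \(\gamma=1\) and reward \(-1\) per move, the optimal value of a square is minus the fewest moves needed to reach the exit from it, which a breadth-first search outward from the exit recovers directly.

```python
from collections import deque

# Made-up maze: '#' = wall, '.' = free square, 'G' = terminal (exit) square.
MAZE = [
    "#######",
    "#..#.G#",
    "#.##..#",
    "#.....#",
    "#######",
]

def optimal_values(maze):
    """Return {(row, col): v*} with v*(x) = -(fewest moves from square x to G)."""
    goal = next((r, c) for r, row in enumerate(maze)
                for c, ch in enumerate(row) if ch == "G")
    values = {goal: 0}                    # the terminal state has value 0
    frontier = deque([goal])
    while frontier:                       # BFS outward from the exit
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if maze[nr][nc] != "#" and (nr, nc) not in values:
                values[(nr, nc)] = values[(r, c)] - 1   # one more move is needed
                frontier.append((nr, nc))
    return values

for (r, c), v in sorted(optimal_values(MAZE).items()):
    print(f"square ({r},{c}): v* = {v}")
```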
Problem: The value function \(v_{\pi}(\mathbf x,t)\) is passive; it simply assesses how good/bad it is to be at \(\mathbf x_t:=\mathbf x\). In order to remedy this, one can look at the quality function \(q_{\pi}(\mathbf x,\mathbf f,t)\); explain how this takes on a more active role compared to the passive nature of the value function \(v_{\pi}(\mathbf x,t)\).
Solution: Because \(q_{\pi}(\mathbf x,\mathbf f,t)\) assesses the quality of the agent choosing the action \(\mathbf f_t:=\mathbf f\) while in state \(\mathbf x_t:=\mathbf x\). That is, it is a more refined conditional expected return:
\[q_{\pi}(\mathbf x,\mathbf f,t)=\biggl\langle\sum_{t'=t+1}^T\gamma^{t'-t-1}r_{t'}\biggm|\mathbf x,\mathbf f,\pi\biggr\rangle\]
Problem: How come the agent’s policy, value function and quality function are sometimes respectively written as \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) without the \(t\) argument?
Solution: There could be \(3\) reasons:
- Similar to the above maze example, sometimes it just doesn’t depend on \(t\).
- If one is seeking to maximize the agent’s expected return over an infinite horizon \(T=\infty\), then it is both intuitively and mathematically clear that \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) should all be invariant with respect to time translations.
- If the horizon is finite \(T<\infty\), then it is standard to package \(\mathbf x\) and \(t\) together, thereby working with an “augmented” state vector \(\mathbf x’:=(\mathbf x,t)\). Then everything can be made to look identical to the infinite horizon \(T=\infty\) case by replacing \(\mathbf x\mapsto\mathbf x’\).
(cf. distinction between state variables and path variables in thermodynamics).
Problem: State and prove the law of iterated expectation.
Solution: If \(X,Y\) are any random variables, then:
\[\langle X\rangle=\langle\langle X|Y\rangle\rangle\]
The proof is easy as long as one is careful to interpret all the expectations correctly. For instance, \(\langle X|Y\rangle\) is not a scalar but a random variable with respect to \(Y\):
\[\langle X|Y\rangle=\int dx p(x|Y) x\]
Thus, it is clear that the outer expectation must also be with respect to \(Y\) alone:
\[\langle\langle X|Y\rangle\rangle=\int dy p(y)\langle X|y\rangle=\int dxdy p(y)p(x|y)x\]
Rewriting in terms of the joint distribution \(p(x,y)=p(y)p(x|y)\) and integrating out \(\int dy p(x,y)=p(x)\), one finally obtains:
\[=\int dx p(x) x=\langle X\rangle\]
Of course, this also generalizes easily to identities such as the following equality of random variables (w.r.t. \(Z\)):
\[\langle X|Z\rangle=\langle\langle X|Y,Z\rangle|Z\rangle\]
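For a quick numerical sanity check of the law, here is a minimal Python sketch with an arbitrarily chosen joint distribution: \(Y\) is uniform on \(\{0,1,2\}\) and \(X|Y\) is Gaussian with mean \(Y\), so the conditional expectation \(\langle X|Y\rangle\) is just the random variable \(Y\) itself; averaging it over \(Y\) should reproduce \(\langle X\rangle\approx 1\).

```python
import random

rng = random.Random(0)

# Arbitrary joint distribution: Y ~ Uniform{0, 1, 2}, then X | Y ~ Normal(mean=Y, sd=1),
# so the conditional expectation <X|Y> is simply the random variable Y itself.
ys = [rng.randrange(3) for _ in range(200_000)]
xs = [rng.gauss(y, 1.0) for y in ys]

mean_x = sum(xs) / len(xs)       # direct Monte Carlo estimate of <X>
mean_cond = sum(ys) / len(ys)    # Monte Carlo estimate of <<X|Y>> = <Y>

print(f"<X>     ≈ {mean_x:.3f}")
print(f"<<X|Y>> ≈ {mean_cond:.3f}   (law of iterated expectation: both ≈ 1.0)")
```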
Problem: Show that the value function \(v_{\pi}(\mathbf x)\) satisfies the following identity (to be used as a lemma later):
\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]
Solution: Denote the return random variable by:
\[R_t:=\sum_{t'=t+1}^T\gamma^{t'-t-1}r_{t'}\]
The fundamental observation is that \(R_t\) obeys a simple recurrence relation:
\[R_t=r_{t+1}+\gamma R_{t+1}\]
Taking suitable conditional expectations thereof:
\[\langle R_t|\mathbf x_t,\pi\rangle=\langle r_{t+1}|\mathbf x_t,\pi\rangle+\gamma\langle R_{t+1}|\mathbf x_t,\pi\rangle\]
The LHS is the definition of \(v_{\pi}(\mathbf x_t)\). Applying the law of iterated expectation to the \(2^{\text{nd}}\) term on the RHS:
\[\langle R_{t+1}|\mathbf x_t,\pi\rangle=\langle\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle|\mathbf x_t,\pi\rangle\]
The Markov property ensures that \(\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle=\langle R_{t+1}|\mathbf x_{t+1},\pi\rangle=v_{\pi}(\mathbf x_{t+1})\). One thus obtains the desired result:
\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]
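As an aside, the recurrence \(R_t=r_{t+1}+\gamma R_{t+1}\) is also exactly how one computes the returns of a sampled episode in practice, by sweeping backwards from the terminal step. A minimal sketch with a made-up reward sequence:

```python
# Rewards r_1, r_2, ..., r_T observed along a single made-up episode;
# rewards[t] plays the role of r_{t+1} in the notation above.
rewards = [-1.0, -1.0, 0.5, -1.0, 10.0]
gamma = 0.9

# Sweep backwards: R_T = 0 at the terminal step, then R_t = r_{t+1} + gamma * R_{t+1}.
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

print(returns[:-1])   # returns[t] is the return R_t from time step t
```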
Problem: How are \(v_{\pi}(\mathbf x_t)\) and \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) related to each other?
Solution: Apply the law of iterated expectations:
\[v_{\pi}(\mathbf x_t):=\langle R_t|\mathbf x_t,\pi\rangle=\langle\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle|\mathbf x_t,\pi\rangle=\langle q_{\pi}(\mathbf x_t,\mathbf f_t)|\mathbf x_t,\pi\rangle\]
Since \(\mathbf x_t\) is being conditioned upon, the expectation must be over \(\mathbf f_t\) so one can explicitly write it as:
\[=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)q_{\pi}(\mathbf x_t,\mathbf f_t)\]
But trivially \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\). Thus, one has an expression for \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). Now one would like to proceed the other way, expressing \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) in terms of \(v_{\pi}\). This can be achieved by fleshing out the expectation from the earlier lemma:
\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]
\[=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]
Further expand \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)\), recognizing again that \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\) and that \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)=p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) since, once the action \(\mathbf f_t\) is fixed, \(\pi\) carries no further information about the immediate transition. Moving the summation \(\sum_{\mathbf f_t}\) to the outside and comparing with the earlier identity expressing \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), one can pattern-match:
\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]
(which in hindsight is manifest…)
Problem: Hence, deduce the Bellman equations for the value and quality functions.
Solution: To obtain the Bellman equation for \(v_{\pi}(\mathbf x_t)\), substitute the above expression for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) into the expression for \(v_{\pi}(\mathbf x_t)\):
\[v_{\pi}(\mathbf x_t)=\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]
Similarly, to get the Bellman equation for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), substitute vice versa:
\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\left(r_{t+1}+\gamma\sum_{\mathbf f_{t+1}}p(\mathbf f_{t+1}|\mathbf x_{t+1},\pi)q_{\pi}(\mathbf x_{t+1},\mathbf f_{t+1})\right)\]
Intuitively, the Bellman equations relate the value of every state with every other state, thereby providing a massive system of simultaneous equations that in principle can be solved to deduce the state values provided one has complete knowledge of the environment dynamics \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) (similar remarks apply to the qualities of different state-action pairs).
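To see what “in principle can be solved” means in practice, note that the Bellman equation for \(v_{\pi}\) is linear in the state values, so for a finite MDP with known dynamics one can stack the values into a vector and solve \(\mathbf v=\mathbf r_{\pi}+\gamma P_{\pi}\mathbf v\) directly. Below is a minimal NumPy sketch on a made-up \(3\)-state, \(2\)-action MDP with a made-up stochastic policy; all numbers are arbitrary, and the reward distribution is summarized by its expected value \(R[\mathbf x,\mathbf f]\), which is all that the value computation needs.

```python
import numpy as np

n_states, gamma = 3, 0.9

# Made-up environment dynamics: P[x, f, x'] = p(x'|x, f), R[x, f] = expected reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([
    [0.0, -1.0],
    [1.0,  0.0],
    [0.0,  0.0],
])

# Made-up stochastic policy pi[x, f] = pi(f|x).
pi = np.array([
    [0.5, 0.5],
    [0.2, 0.8],
    [1.0, 0.0],
])

# Policy-averaged dynamics and rewards:
#   P_pi[x, x'] = sum_f pi(f|x) p(x'|x, f),   r_pi[x] = sum_f pi(f|x) R[x, f].
P_pi = np.einsum("xf,xfy->xy", pi, P)
r_pi = np.einsum("xf,xf->x", pi, R)

# Bellman equation v = r_pi + gamma * P_pi v  <=>  (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Recover q_pi(x, f) = R[x, f] + gamma * sum_{x'} p(x'|x, f) v(x').
q = R + gamma * np.einsum("xfy,y->xf", P, v)

print("v_pi:", np.round(v, 3))
print("q_pi:\n", np.round(q, 3))
```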
Problem: The above Bellman equations apply to a generic agent policy \(\pi\); how do they specialize to the case of an optimal policy \(\pi^*\)?
Solution: Reverting to the form of the “pre-Bellman” equations and using the key insight that the optimal policy \(\pi^*(\mathbf f_t|\mathbf x_t)=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)}\) should only assign non-zero probability to actions of the best quality, one has:
\[v_{\pi^*}(\mathbf x_t)=\text{max}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)\]
\[q_{\pi^*}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi^*}(\mathbf x_{t+1}))\]
which, upon substituting one into the other, yield the Bellman optimality equations.
Problem: In light of Bellman optimality, discuss how policy evaluation and policy improvement work. Hence, explain the notion of generalized policy iteration (GPI).
Solution: Imagine the space of all \((\pi,v)\) pairs, where \(\pi(\mathbf f_t|\mathbf x_t)\) is any policy distribution and \(v(\mathbf x_t)\) is any candidate value function over the state space. Within this space, there are \(2\) canonical subspaces. One is the set of all pairs \((\pi,v_{\pi})\) where \(v_{\pi}\) is specifically the value function associated to the policy \(\pi\); the other is the set of pairs \((\pi_v,v)\) where \(\pi_v(\mathbf f_t|\mathbf x_t):=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\) is \(\varepsilon=0\)-greedy with respect to the value function \(v\), with \(q(\mathbf x_t,\mathbf f_t):=\langle r_{t+1}+\gamma v(\mathbf x_{t+1})|\mathbf x_t,\mathbf f_t\rangle\) (in particular, notice \(q\neq q_{\pi}\); the policy \(\pi\) does not appear in the definition of \(q\)).
- Policy evaluation is an algorithm for projecting \((\pi,v)\mapsto (\pi,v_{\pi})\). The idea is to view \(v\) as a random initial guess for the underlying true policy value function \(v_{\pi}\) (though any terminal states \(\mathbf x_T\) should be initialized \(v(\mathbf x_T):=0\)). Then, sweeping through each state \(\mathbf x_t\), one updates \(v(\mathbf x_t)\) using the Bellman equation evaluated at the current estimates of the other states’ values:
\[v(\mathbf x_t)\mapsto\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{\mathbf x_{t+1},r_{t+1}}p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v(\mathbf x_{t+1}))\]
With sufficiently many sweeps over the state space, general theorems guarantee convergence \(v\to v_{\pi}\).
- Policy improvement is an algorithm for projecting onto the other canonical subspace \((\pi,v)\mapsto (\pi_v,v)\). Essentially, just act \(\varepsilon=0\)-greedy with respect to the current value estimate \(v\):
\[\pi(\mathbf f_t|\mathbf x_t)\mapsto\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\]
This update is rigorously justified by the policy improvement theorem.
Policy iteration roughly amounts to alternating between the \(2\) steps of policy evaluation and policy improvement until the policy stops changing. However, there is some leeway in how one interleaves the two steps: for instance, one need not run the policy evaluation step to completion, but can instead perform just a single sweep over the state space before each improvement step (this particular truncation is known as value iteration). Generalized policy iteration (GPI) refers to this broader idea of letting evaluation and improvement interact at whatever granularity is convenient; any such scheme can be shown to eventually funnel/converge onto the optimal policy \(\pi^*\), as one can prove that \(\pi\) is \(\varepsilon=0\)-greedy with respect to its own value function \(v_{\pi}\) iff \(\pi=\pi^*\).
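Here is a minimal Python sketch of policy iteration on the same kind of made-up tabular MDP as before (all dynamics, rewards and the discount are arbitrary): the inner loop is policy evaluation by repeated Bellman sweeps, and the outer step is the \(\varepsilon=0\)-greedy policy improvement. Capping the evaluation at a single sweep would instead give value iteration.

```python
import numpy as np

n_states, gamma = 3, 0.9

# Made-up dynamics p(x'|x, f) and expected rewards R[x, f] (same structure as before).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, -1.0], [1.0, 0.0], [0.0, 0.0]])

def evaluate(policy, v, sweeps=200, tol=1e-8):
    """Policy evaluation: repeatedly apply the Bellman backup for a fixed deterministic policy."""
    for _ in range(sweeps):
        # v(x) <- R[x, pi(x)] + gamma * sum_{x'} p(x'|x, pi(x)) v(x')
        v_new = np.array([R[x, policy[x]] + gamma * P[x, policy[x]] @ v
                          for x in range(n_states)])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

def improve(v):
    """Policy improvement: act epsilon=0-greedy w.r.t. q(x, f) = R[x, f] + gamma * sum_{x'} p(x'|x, f) v(x')."""
    q = R + gamma * np.einsum("xfy,y->xf", P, v)
    return np.argmax(q, axis=1)

# Policy iteration: alternate evaluation and improvement until the policy is stable.
policy, v = np.zeros(n_states, dtype=int), np.zeros(n_states)
while True:
    v = evaluate(policy, v)
    new_policy = improve(v)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy pi*(x):", policy)
print("optimal values v*(x): ", np.round(v, 3))
```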
Problem: Distinguish between model-free vs. model-based methods/algorithms in reinforcement learning.
Solution: In both cases, the fundamental limitation (which previously was taken for granted) is incomplete knowledge of the environment dynamics \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\); instead, one can only sample trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=1},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=2},\dots\) by running a policy \(\pi\) through the MDP. Recalling that the fundamental objective of RL is to compute \(\pi^*=\text{argmax}_{\pi}\langle R_{t=0}|\pi\rangle\), a model-free method is one which does not attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (here \(p\) is what the word “model” refers to, i.e. a model of the environment dynamics). By contrast, model-based methods do attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\), and can thereby obtain \(\pi^*\) through GPI as outlined earlier (see the OpenAI Spinning Up documentation for a nice diagram of this taxonomy and further comments).
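To make the distinction concrete, here is a minimal Python sketch of the model-based route; everything below (the environment, the behaviour policy, the episode counts) is made up for illustration. The true dynamics are hidden inside `step()`, so the agent can only sample transitions \((\mathbf x_t,\mathbf f_t,r_{t+1},\mathbf x_{t+1})\); a model-based method tallies them into an empirical estimate \(\hat p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) on which GPI could then be run as before, whereas a model-free method would skip this estimation step and work with the sampled rewards/returns directly.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)

def step(state, action):
    """Hidden, made-up environment dynamics; the agent only ever sees samples from it."""
    if action == 0:
        return (state, 0.0) if rng.random() < 0.7 else ((state + 1) % 3, 1.0)
    return ((state + 1) % 3, -1.0) if rng.random() < 0.5 else ((state + 2) % 3, 2.0)

# Sample trajectories with a uniformly random behaviour policy.
counts = defaultdict(Counter)        # counts[(x, f)][(x', r)] = number of occurrences
for _ in range(200):                 # episodes
    state = 0
    for _ in range(50):              # steps per episode
        action = rng.randrange(2)
        next_state, reward = step(state, action)
        counts[(state, action)][(next_state, reward)] += 1
        state = next_state

# Empirical model: p_hat(x', r | x, f) = count / total count for that (x, f) pair.
p_hat = {
    (x, f): {xr: n / sum(c.values()) for xr, n in c.items()}
    for (x, f), c in counts.items()
}

for (x, f), dist in sorted(p_hat.items()):
    print(f"state {x}, action {f}:",
          {xr: round(prob, 2) for xr, prob in sorted(dist.items())})
```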