Problem: Consider placing a fictitious open surface in an equilibrium ideal gas at temperature \(T\); although the net particle current density through such a surface is \(\textbf J=\textbf 0\), show that if one counts only the particles crossing the surface from one side to the other, the resulting unidirectional particle current density \(J\) is non-zero and given by:
\[J=\frac{1}{4}n\langle v\rangle\]
where \(n=p/k_BT\) is the number density and \(\langle v\rangle=\sqrt{8k_BT/\pi m}\) the average speed.
Solution:
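As a sanity check on the claimed result, here is a Monte Carlo sketch in natural units \(m=k_BT=1\) (an assumption for illustration), so each velocity component is a unit Gaussian and \(\langle v\rangle=\sqrt{8/\pi}\); the one-way flux is \(J=n\langle v_z\,\Theta(v_z)\rangle\):

```python
import math, random

random.seed(0)
N = 200_000
n = 1.0  # number density (natural units m = k_B*T = 1)

# Each velocity component is Gaussian with variance k_B*T/m = 1.
# Only particles with v_z > 0 cross the surface from below.
vz_positive = [max(random.gauss(0.0, 1.0), 0.0) for _ in range(N)]
J_mc = n * sum(vz_positive) / N

J_theory = 0.25 * n * math.sqrt(8.0 / math.pi)  # (1/4) n <v>
print(J_mc, J_theory)  # both ≈ 0.399
```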
Problem: By an analogous calculation, show that the unidirectional kinetic energy current density \(S\) for an ideal gas (which one might also think of as a heat flux \(S=q\)) is given by:
\[S=\frac{1}{2}nk_BT\langle v\rangle\]
And hence, show that the average kinetic energy of particles hitting a wall is enhanced by a factor of \(4/3\) compared to the bulk kinetic energy \(\frac{3}{2}k_BT\) per particle.
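A similar Monte Carlo sketch (same illustrative natural units \(m=k_BT=1\)) checks that the flux-weighted average kinetic energy of particles hitting a wall is \(2k_BT\), i.e. \(4/3\) times the bulk \(\frac{3}{2}k_BT\):

```python
import random

random.seed(1)
N = 200_000
# Natural units: m = 1, k_B*T = 1, so each velocity component is a unit Gaussian.
num = den = 0.0
for _ in range(N):
    vx, vy, vz = (random.gauss(0.0, 1.0) for _ in range(3))
    if vz > 0:  # particle heading toward the wall
        ke = 0.5 * (vx * vx + vy * vy + vz * vz)
        num += vz * ke  # flux-weighted kinetic energy
        den += vz
print(num / den)  # ≈ 2 (i.e. 2 k_B*T, which is (4/3) * (3/2) k_B*T)
```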
Problem #\(1\): Describe how the classical Hall coefficient \(\rho^{-1}\) is defined and explain why it’s “causally intuitive”.
Solution #\(1\): In the classical Hall effect, the “cause” is both an applied current density \(J\) together with an applied perpendicular magnetic field \(B\). The “effect” is an induced transverse electric field \(E\) whose magnitude and direction are such as to ensure a velocity selector steady state. So it seems reasonable to define the Hall coefficient by:
\[E=\rho^{-1}JB\]
where the notation \(\rho^{-1}\) is deliberately suggestive: the Hall coefficient turns out to be the reciprocal of the charge density, which is precisely what makes the Hall effect useful. Note that here \(E\) only represents the transverse component of the electric field, i.e. \(E=-\textbf E\cdot(\textbf J\times\textbf B)/JB\), as there may also be a longitudinal component, e.g. to compensate for scattering and other resistances.
The simplest way to derive this is to just set the Lorentz force density \(\textbf f\) on the charge carriers to zero:
\[\textbf f=\rho\textbf E+\textbf J\times\textbf B=\textbf 0\]
Since \(J,B\) are applied by the experimentalist, they are readily known, and \(E\) can be obtained by measuring a suitable Hall voltage \(\Delta\phi_H=-\int d\textbf x\cdot\textbf E\) in the transverse direction (voltages are always experimentally accessible). The classical Hall effect therefore provides a simple way to directly measure the charge density \(\rho\) (via the Hall coefficient \(\rho^{-1}\)), and hence the number density of charge carriers \(n=\rho/(\pm e)\). Strictly speaking this assumes a single charge carrier species; for semiconductors it would be a bit more complicated, so in fact perhaps one shouldn’t denote the Hall coefficient by \(\rho^{-1}\) but just by \(R_H\).
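As an illustrative (hypothetical) numerical example of such a measurement, using the textbook carrier density of copper \(n\approx 8.5\times 10^{28}\text{ m}^{-3}\): the Hall voltage across a thin strip is \(V_H=IB/net\), and inverting it recovers \(n\):

```python
e = 1.602176634e-19  # elementary charge (C)
n = 8.5e28           # carrier density of copper (m^-3), textbook value

# A hypothetical Hall-bar measurement: current I through a strip of
# thickness t in a perpendicular field B produces a Hall voltage V_H.
I, B, t = 1.0, 1.0, 1e-4  # A, T, m
V_H = I * B / (n * e * t)  # ≈ 0.73 microvolts: tiny, as expected for a metal

# Inverting the measurement recovers the carrier density (via rho^{-1} = E/JB):
n_measured = I * B / (e * t * V_H)
print(V_H, n_measured)
```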
Problem #\(2\): Delve into the quantum Hall effect.
In sufficiently symmetric geometries, the method of images provides a way to solve Poisson’s equation \(|\partial_{\textbf x}|^2\phi=-\rho/\varepsilon_0\) in a domain \(V\) subject to either Dirichlet or Neumann boundary conditions (required for the uniqueness theorem to hold) by strategically placing charges in the “unphysical region” \(\textbf R^3-V\) such as to ensure the boundary conditions are met. It works because of linearity and the fact that by placing image charges outside the physical region \(V\), one isn’t tampering with \(\rho\) in that region, so Poisson’s equation truly is solved.
In the following problems, the goal is to compute (in the suggested order):
The electrostatic potential \(\phi(\textbf x)\) everywhere (i.e. both in regions of free space and inside materials).
The electrostatic field \(\textbf E(\textbf x)=-\frac{\partial\phi}{\partial\textbf x}\)
The induced charge density \(\sigma\) on any conducting surfaces, along with the total charge \(Q\) on such surfaces.
The force \(\textbf F\) between any conductors.
The internal fields (\(\textbf D,\textbf E,\textbf P,\phi\)) and bound charge distributions \(\rho_b,\sigma_b\) for any dielectrics.
The resistance/self-capacitance/self-inductance/mutual capacitance/mutual inductance of any conductors? (although that isn’t really electrostatics anymore…)
Problem: Consider placing a point charge \(q\) at the point \((0,0,z)\) a distance \(z\) from an infinite planar conductor at \(z=0\).
Solution: Place an image point charge \(-q\) at \((0,0,-z)\).
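One consequence worth checking numerically: the induced surface charge density on the plane is \(\sigma(\rho)=-qd/2\pi(\rho^2+d^2)^{3/2}\) (the standard result from the normal component of the superposed real-plus-image field; here \(d\) denotes the charge’s height and \(\rho\) the in-plane radius), and integrating it over the plane should give total induced charge exactly \(-q\). A sketch:

```python
import math

q, d = 1.0, 1.0  # point charge and its height above the grounded plane (natural units)

def sigma(rho):
    """Induced surface charge density on the plane z = 0 (standard image result)."""
    return -q * d / (2.0 * math.pi * (rho**2 + d**2) ** 1.5)

# Integrate over the plane in rings: dQ = sigma(rho) * 2*pi*rho * drho.
drho, R = 0.01, 400.0
Q = 0.0
for i in range(int(R / drho)):
    rho = (i + 0.5) * drho  # midpoint rule
    Q += sigma(rho) * 2.0 * math.pi * rho * drho
print(Q)  # ≈ -1, i.e. total induced charge -q (tail beyond R contributes ~ d/R)
```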
Problem: Now instead of a point charge, consider a line charge with linear charge density \(\chi\).
Solution:
Problem: Instead of a line charge, place an infinite cylinder of charge of radius \(a\).
So eliminating the \(\cos\phi\), one finds that it is indeed possible to isolate solely for the ratio \(\rho_1/\rho_2\) as a function of constant parameters, confirming that it is an equipotential surface as required.
Aside: this is nothing more than Apollonius’s construction of a circle as the set of all points whose distances \(\rho,\rho'\) from \(2\) “foci” are in a fixed ratio \(\rho'/\rho\). Indeed, if the two foci are separated by a “semi-axis” \(a\) (thus their full separation is \(2a\)), then the distance \(d\) from the midpoint of the two foci to the center of the Apollonian circle and its radius \(R\) satisfy (using the extreme points on the circle):
Problem: Consider a magnetic dipole \(\boldsymbol{\mu}\) suspended above a superconducting surface so that on this surface all magnetic fields are expelled.
Problem: An electrostatic dipole \(\boldsymbol{\pi}\) at some distance from a conducting plane?
Problem: Consider a conducting sphere in an asymptotically uniform background electrostatic field \(\textbf E_0\).
Problem: Replace the conducting sphere by an insulating sphere (aka a linear dielectric sphere) of permittivity \(\varepsilon\) (comment on how this relates to the Clausius-Mossotti relation).
Problem: Instead of a linear dielectric sphere, consider a linear diamagnetic sphere in a uniform magnetic field \(\textbf B_0\).
Problem: Consider an \(N\)-gon of conducting sheets (quadrupole, octupole, etc.)
Problem: A point charge in a conducting spherical cavity (Green’s function for that domain).
Problem: A point charge outside the sphere.
Problem: (example with infinitely many image point charges?)
These ideas extend immediately to potential flows in fluid mechanics…describe all the analogous situations and analogous results without doing all the work again. Similarly for steady-state temperature distributions, and anywhere that Laplace’s equation with suitable boundary conditions shows up.
Problem: Distinguish between the terms “intrinsic semiconductor” and “extrinsic semiconductor”.
Solution: An intrinsic semiconductor is pretty much what it sounds like, i.e. a “pure” semiconductor material like \(\text{Si}\) that is undoped with any impurity dopants. An extrinsic semiconductor is then basically the negation of an intrinsic semiconductor, i.e. one which is doped with impurity dopants, although conceptually one can think of it as being doped with charge carriers (either holes \(h^+\) in a \(p\)-type extrinsic semiconductor or electrons \(e^-\) in an \(n\)-type extrinsic semiconductor).
Problem: In the phrases \(p\)-type semiconductor and \(n\)-type semiconductor, what do the \(p\) and \(n\) represent?
Solution: In both cases, the extrinsic semiconductor (isolated from anything else) is neutral, even when doped. Rather, the \(p\) and \(n\) refer to the majority mobile/free charge carriers in the corresponding semiconductor, i.e. holes in the valence band and electrons in the conduction band respectively.
Problem: Show that the equilibrium number density \(n_{e^-}\) of mobile conduction electrons (i.e. not including the immobile core/valence electrons) thermally excited into the conduction band at temperature \(T\) is exponentially related to the gap \(E_C-\mu\) between the energy \(E_C\) at the base of the conduction band and the Fermi level \(\mu\):
\[n_{e^-}=n_Ce^{-\beta(E_C-\mu)}\]
where the so-called effective density of states:
\[n_C:=\frac{2g_v}{\lambda^{*3}_T}\]
is \(\approx\) the number density of available conduction band states at temperature \(T\) (here \(g_v\) is the valley degeneracy and \(\lambda^*_T\) is the thermal de Broglie wavelength computed with the electron’s effective mass \(m^*\)).
Solution: To clarify some of the approximations used in that line with the \(\approx\), the upper bound on the conduction band \(E_{C,\text{max}}\to\infty\) can be safely taken to infinity because of the exponential suppression of the integrand by the Fermi-Dirac distribution for \(E\gg\mu\) (in fact, using Fermi-Dirac statistics in the first place assumes the electrons interact solely through Pauli blocking). In addition, the density of states \(g_C(E)\) is approximated by that of a free particle in the neighbourhood of the conduction band valley (with the usual \(\sqrt{E}\mapsto \sqrt{E-E_C}\) because \(g_C(E)=0\) in the \(E\in[E_V,E_C]\) band gap) and with \(m\mapsto m^*\) to reflect the local curvature of the conduction band which is inherited from the strength of the lattice’s periodic potential. Finally, to strengthen the earlier claim that \(E\gg\mu\), indeed, \(E\geq E_C\) is the range of the integral, and so a sufficient condition for \(E\gg\mu\) is \(E_C\gg\mu\) (in practice a few \(k_BT\) is sufficient). This is assumed to be the case, and constitutes the assumption of a non-degenerate semiconductor (cf. non-degenerate Fermi gas). In this case, the Fermi-Dirac distribution boils down to just its “Boltzmann tail” \(\frac{1}{e^{\beta(E-\mu)}+1}\approx e^{-\beta(E-\mu)}\):
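To get a feel for the magnitude of \(n_C\), here is a sketch with illustrative silicon-like numbers (the per-valley density-of-states effective mass \(m^*\approx 0.32\,m_e\) and valley degeneracy \(g_v=6\) are commonly quoted textbook values, assumed here rather than derived):

```python
import math

hbar = 1.054571817e-34  # J*s
k_B = 1.380649e-23      # J/K
m_e = 9.1093837015e-31  # kg

# Illustrative silicon-like parameters (textbook values, not derived here):
m_eff, g_v, T = 0.32 * m_e, 6, 300.0

# Thermal de Broglie wavelength with the effective mass.
lam = math.sqrt(2.0 * math.pi * hbar**2 / (m_eff * k_B * T))  # m
n_C = 2.0 * g_v / lam**3   # effective density of states (m^-3)
n_C_cm3 = n_C * 1e-6
print(f"{n_C_cm3:.2e} cm^-3")  # ~3e19, the right ballpark for Si at 300 K
```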
Problem: Repeat the above problem for holes to derive an analogous result for the equilibrium number density \(n_{h^+}\) of free conduction holes excited into the valence band at temperature \(T\):
\[n_{h^+}=n_Ve^{-\beta(\mu-E_V)}\]
where \(n_V\) is almost the same as \(n_C\) except that it’s derived from the effective mass \(m^*\) of the holes at the top of the valence band.
Solution: A few comments: if \(f(E)\) is the Fermi-Dirac distribution for electrons, then by the very definition of a hole as a vacancy/absence of an electron, the analog of the Fermi-Dirac distribution for holes (which can be considered fermionic quasiparticles) is \(1-f(E)\). In addition, a hole is considered to have more energy when it goes “downward” on a typical band diagram where the vertical axis \(E\) is really referring to the electron’s energy. This explains the counterintuitive limits on the integral:
Problem: If an intrinsic semiconductor is doped with impurity dopants to create an extrinsic semiconductor, say with a number density \(n_{d^+}\) of cationized donor dopants and \(n_{a^-}\) of anionized acceptor dopants, what constraint does charge neutrality of the semiconductor impose among the concentrations \(n_{e^-},n_{h^+},n_{d^+},n_{a^-}\)?
Solution:
\[-en_{e^-}+en_{h^+}+en_{d^+}-en_{a^-}=0\]
\[n_{h^+}+n_{d^+}=n_{e^-}+n_{a^-}\]
Conceptually, for every electron excited into the conduction band, the corresponding donor atom becomes cationized; similarly, every hole excited into the valence band is really an acceptor atom anionizing as it accepts an electron from the valence band, so in the equation it is conceptually meaningful to pair up \((n_{e^-},n_{d^+})\) and \((n_{h^+},n_{a^-})\). Note however that this is not to say the paired concentrations are equal, though they approach equality the more heavily one dopes.
Problem: Show that, in an intrinsic semiconductor, the Fermi level \(\mu\) lies almost (but not exactly) at the midpoint \(\frac{E_V+E_C}{2}\) of the band gap.
Solution: An intrinsic semiconductor is undoped so \(n_{d^+}=n_{a^-}=0\). This implies from the charge neutrality argument above that \(n_{e^-}=n_{h^+}\) (i.e. every electron excited into the conduction band leaves a hole in the valence band). The rest of the argument is then just plugging in the earlier equilibrium free charge carrier concentrations and algebra:
In what follows, it will be useful to call this particular value of \(\mu\) the intrinsic Fermi level \(\mu_i\) since it is the Fermi level of an intrinsic semiconductor, prior to any extrinsic doping.
Problem: Define the intrinsic charge carrier concentration by \(n_i:=n_{e^-}=n_{h^+}\) for an intrinsic semiconductor, so that one has the so-called law of mass action \(n_{e^-}n_{h^+}=n_i^2\) (i.e. \(n_i^2\) is just a \(T\)-dependent equilibrium constant for the dissociation reaction \(0\to e^-+h^+\)). Show that the precise \(T\)-dependence of \(n_i\) is given by:
\[n_i\sim T^{3/2}e^{-E_g/2k_BT}\]
where the band gap \(E_g:=E_C-E_V\) (this result is sometimes also presented as \(n_i=n_Se^{-\beta E_g/2}\) where \(n_S:=\sqrt{n_Cn_V}\) is the geometric mean of the effective densities of states of the conduction and valence bands).
Solution:
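Plugging in illustrative textbook values for \(\text{Si}\) at \(300\text{ K}\) (assumed here, not derived) reproduces the familiar order of magnitude \(n_i\sim 10^{10}\text{ cm}^{-3}\):

```python
import math

# Illustrative textbook values for Si at T = 300 K (assumptions):
n_C, n_V = 2.8e19, 1.0e19  # effective densities of states (cm^-3)
E_g = 1.12                  # band gap (eV)
kT = 0.02585                # k_B * T at 300 K (eV)

# Law of mass action form: n_i = sqrt(n_C * n_V) * exp(-E_g / 2kT)
n_i = math.sqrt(n_C * n_V) * math.exp(-E_g / (2.0 * kT))
print(f"{n_i:.1e} cm^-3")  # ~1e10 cm^-3, the familiar intrinsic density of Si
```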
Problem: Keeping temperature \(T\) fixed, consider \(n\)-type doping an intrinsic semiconductor with donor dopants, thus creating an \(n\)-type extrinsic semiconductor. The effect of this will be to raise the Fermi level from the intrinsic Fermi level \(\mu_i\) (approximately in the middle of the band gap as shown earlier) to a new value \(\mu\) much closer to the base of the conduction band \(E_C\). Show that the precise amount of this raising can be quantified by:
\[\mu-\mu_i=k_BT\ln\frac{n_d}{n_i}\]
where \(n_d\) is the (directly manipulable!) concentration of donor dopants doped into the intrinsic semiconductor, stating the \(2\) key assumptions underlying this.
(note, an analogous line of reasoning for a \(p\)-type semiconductor shows that the Fermi level is lowered by an amount:
\[\mu_i-\mu=k_BT\ln\frac{n_a}{n_i}\]
where \(n_a\) is the (also directly manipulable) concentration of acceptor dopants).
Solution: At equilibrium, one has for an intrinsic semiconductor:
\[n_i=n_Ce^{-\beta(E_C-\mu_i)}\]
and for an \(n\)-type doped extrinsic semiconductor:
\[n_{e^-}=n_Ce^{-\beta(E_C-\mu)}\]
Taking the ratio yields:
\[\mu-\mu_i=k_BT\ln\frac{n_{e^-}}{n_i}\]
At this point, the goal is to justify why \(n_{e^-}\approx n_{d}\). This proceeds in \(2\) stages.
First, justify that \(n_{e^-}\approx n_{d^+}\), the concentration of cationized donor dopants. This follows by setting \(n_{a^-}=0\) in the earlier charge neutrality constraint (since there are no acceptor dopants added, \(n_a=0\Rightarrow n_{a^-}=0\)) and using the law of mass action to replace \(n_{h^+}=n_i^2/n_{e^-}\), obtaining a quadratic equation for \(n_{e^-}\) whose physical solution is:
\[n_{e^-}=\frac{n_{d^+}+\sqrt{n_{d^+}^2+4n_i^2}}{2}\]
At this point, one assumes that the semiconductor is fairly heavily doped, in particular \(n_{d^+}\gg n_i\) (typical values in \(\text{Si}\) are \(n_d\sim 10^{16}\text{ cm}^{-3}\) while \(n_i\sim 10^{10}\text{ cm}^{-3}\)). This allows one to approximate \(n_{e^-}\approx n_{d^+}\).
Second, to justify why \(n_{d^+}\approx n_d\), one has to assume that the donor dopants are shallow in the sense that the binding energy of their extra valence electron is comparable to \(k_BT\), so it is easily excited into the conduction band. In other words, one assumes almost complete cationization of the donor dopants, which is just the statement that \(n_{d^+}\approx n_d\), as desired.
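As a quick numerical sketch of the size of this effect (illustrative values: \(n_d=10^{16}\text{ cm}^{-3}\) and \(n_i=10^{10}\text{ cm}^{-3}\), as quoted above for Si):

```python
import math

kT = 0.02585   # k_B * T at 300 K (eV)
n_d = 1e16     # donor concentration (cm^-3), an illustrative doping level
n_i = 1e10     # intrinsic carrier density of Si (cm^-3), textbook value

shift = kT * math.log(n_d / n_i)  # mu - mu_i
print(f"{shift:.3f} eV")  # ≈ 0.36 eV: the Fermi level moves well up toward E_C
```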
Problem: In what regime is the non-degenerate semiconductor approximation valid?
Solution: For an \(n\)-type extrinsic semiconductor, a rule of thumb is that the Fermi level cannot rise to within \(2k_BT\) of the base of the conduction band \(E_C\):
\[E_C-\mu\geq 2k_BT\]
Inserting \(n_d=n_Ce^{-\beta (E_C-\mu)}\) (where the approximation \(n_{e^-}\approx n_d\) has been employed), one arrives at the rule of thumb that the donor dopant concentration cannot exceed:
\[n_d\leq e^{-2}n_C\approx 0.14 n_C\]
Similarly, for \(p\)-type doping, the acceptor dopant concentration cannot exceed:
\[n_a\leq 0.14n_V\]
(the more important takeaway here is not the exact numerical prefactors, but the fact that the Fermi level should stay a few \(k_BT\) away from \(E_C\) or \(E_V\) for the semiconductor to be considered non-degenerate; indeed these estimates came from using the Boltzmann formula that was derived from this assumption, so it should be taken with a grain of salt as one is using a theory to predict its own demise).
Problem: A \(p\)-\(n\) junction is formed by putting a \(p\)-type extrinsic semiconductor in contact with an \(n\)-type extrinsic semiconductor. Starting from a simple “top-hat” distribution of the free charge density \(\rho_f(x)\) in the depletion region \(-x_p<x<x_n\), sketch \(\rho_f(x)\), \(E(x)\) and \(\phi(x)\).
Solution: Assuming an abrupt junction and sharp cutoff for the depletion region at \(x_p, x_n\) respectively, one has:
Problem: Make the sketches more quantitative. In particular, calculate the width \(x_n+x_p\) of the depletion region and the maximum strength \(E_{\text{max}}\) of the electrostatic field at the junction \(x=0\).
Solution:
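A numeric sketch under the standard abrupt-junction results (assumed here, not derived): integrating the top-hat \(\rho_f\) twice gives the depletion width \(W=x_n+x_p=\sqrt{2\varepsilon V_{\text{bi}}(n_a+n_d)/en_an_d}\), and since \(E(x)\) is triangular, \(V_{\text{bi}}=\frac{1}{2}E_{\text{max}}W\). With illustrative Si-like numbers:

```python
import math

e = 1.602176634e-19
eps0 = 8.8541878128e-12
eps = 11.7 * eps0          # permittivity of Si (illustrative)
V_bi = 0.7                 # built-in potential (V), a typical Si value
n_a = n_d = 1e22           # doping (m^-3), i.e. 1e16 cm^-3 on each side

# Standard abrupt-junction results (from integrating the top-hat rho_f twice):
W = math.sqrt(2.0 * eps * V_bi * (n_a + n_d) / (e * n_a * n_d))
E_max = 2.0 * V_bi / W     # field profile is triangular, so V_bi = (1/2) E_max W
print(f"W = {W*1e6:.2f} um, E_max = {E_max:.2e} V/m")  # sub-micron width, ~MV/m field
```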
Problem: Explain qualitatively how a depletion region forms at a \(p\)-\(n\) junction.
Solution: Intuitively, the electrons and holes cannot just keep diffusing indefinitely across the \(p\)-\(n\) junction because at some point too much like charge will clump on either side during recombination, preventing any further diffusion. Put another way, as the charge separation gets bigger and bigger, the induced \(\textbf E\)-field pointing from \(n\to p\) exerts an electric force on the electrons and holes that prevents them from crossing the junction; at equilibrium, this forms a depletion region where there are no mobile charge carriers.
Problem: Just as a harmonic oscillator can be free or driven, so a \(p\)-\(n\) junction can also be “free” (just sitting there with its built-in potential \(V_{\text{bi}}\)) or it can be “driven” as well in a sense, more precisely by applying an external voltage \(V\) across it. However, unlike say with resistors/capacitors/inductors where the polarity of this voltage doesn’t really matter, here the asymmetry of the \(p\)-type vs. \(n\)-type semiconductors on either side, and thus the corresponding asymmetry of \(V_{\text{bi}}\), means that the polarity of \(V\) matters. Sketch qualitative band diagrams to show how the \(p\)-\(n\) junction’s bands change in the case of both forward bias \(V>0\) or reverse bias \(V<0\). This underlies the principle of operation for some (though not all) kinds of diodes, sometimes called a \(p\)-\(n\) semiconductor diode.
Solution: Some words of explanation: forward biasing a \(p\)-\(n\) junction lowers the effective built-in potential from \(V_{\text{bi}}\mapsto V_{\text{bi}}-V\). This clearly increases the conductivity of both electrons and holes across the junction now that the energy barrier is reduced. By contrast, reverse biasing a \(p\)-\(n\) junction instead raises the effective built-in potential \(V_{\text{bi}}\mapsto V_{\text{bi}}+|V|\), reducing the conductivity of electrons and holes as the depletion region gets bigger.
When the \(p\)-\(n\) junction is initially unbiased so that \(V=0\):
After forward biasing \(V>0\):
The reverse-biased case is just opposite to the forward-biased case, and not shown. Note also that this is not an instance of quantum tunnelling: it is not a simple top-hat potential barrier, and there is no probability current across the depletion region. The current that does flow is a diffusion current, as elaborated later.
Another way to put it is that forward-biasing the \(p\)-\(n\) junction encourages the majority charge carriers on each side to diffuse across the depletion region (and discourages the minority carriers, but that doesn’t matter anyways because they are minority), while reverse-bias is the opposite.
Problem: Recall that in an intrinsic semiconductor, at finite \(T>0\), the very few charge carriers in the conduction band are purely thermal electrons excited from their corresponding thermal holes in the valence band. Then, respectively \(p\)-type or \(n\)-type doping the intrinsic semiconductor, the creation of hydrogenic acceptor states just above the valence band or donor states just below the conduction band causes respectively holes to become the majority charge carrier (in the valence band) and electrons to become the minority charge carrier (in the conduction band) in the \(p\)-type extrinsic semiconductor, and vice versa for \(n\)-type (that isn’t to say that the thermal electrons and thermal holes \(n_{e^-,i}=n_{h^+,i}=n_i\sim 1.5\times 10^{10}\text{ cm}^{-3}\) aren’t still there, just they become negligible).
Solution:
Problem: Calculate the reverse saturation current \(I_{\text{sat}}:=-\lim_{V\to-\infty}I(V)\) in the Shockley diode law \(I(V)=I_{\text{sat}}(e^{V/V_T}-1)\) for a \(p\)-\(n\) junction semiconductor diode with the following parameters:
\[n_a=\]
To sort out later:
At T=0 K, mu is not really well-defined (b/c g(E)=0 in the band gap, so mu could be put anywhere in there..) but for T>0 K it is well-defined…
For n-type doping, by putting extra atoms near the bottom of the conduction band, will increase the chemical potential…(all this comes from interpreting mu as a silly fit parameter that needs to be tuned to get integral g(E)f(E) = number density of conduction electrons in the system = number density of donor dopants).
For p-type doping, add strongly electronegative atoms, they rip off electrons from the valence band. leaving additional holes in the valence band.
w.r.t. intrinsic concentrations of electrons and holes at T=300 K,
something about asymmetry of the densities of states…the presence of the donor and acceptor states in the band gap influences g(E) there…
Problem: A vibrating string with displacement profile \(y(x,t)\), non-uniform mass per unit length \(\mu(x)\), and non-uniform tension \(T(x)\) experiences both an internal restoring force due to \(T(x)\) and a linear “Hooke’s law” restoring force \(-k(x)y(x)\) everywhere, so that its equation of motion is:
\[\mu(x)\frac{\partial^2y}{\partial t^2}=\frac{\partial}{\partial x}\left(T(x)\frac{\partial y}{\partial x}\right)-k(x)y\]
Problem: Make the “Noetherian interpretation” above more concrete by showing that the eigenvalue \(\omega^2\) can be expressed as a Rayleigh-Ritz quotient:
\[\omega^2[\psi]=\frac{\int dx\left(T(x)\psi'^2(x)+k(x)\psi^2(x)\right)}{\int dx\,\mu(x)\psi^2(x)}\]
making it look like the usual Rayleigh-Ritz quotient employed in the quantum mechanical variational principle (although this glosses over the subtlety about boundary terms in the integration by parts).
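A quick numerical check of the quotient \(\omega^2[\psi]=\int dx\,(T\psi'^2+k\psi^2)/\int dx\,\mu\psi^2\) (fixed ends assumed, so the boundary terms vanish): for a uniform string with \(T=\mu=1\), \(k=0\) on \([0,1]\), evaluating it on the exact ground state \(\psi=\sin\pi x\) should return the eigenvalue \(\pi^2\):

```python
import math

T = mu = 1.0   # uniform tension and mass density
k = 0.0        # no Hooke's-law restoring force
N = 100_000
dx = 1.0 / N

num = den = 0.0
for i in range(N):
    x = (i + 0.5) * dx  # midpoint rule on [0, 1]
    psi = math.sin(math.pi * x)
    dpsi = math.pi * math.cos(math.pi * x)
    num += (T * dpsi**2 + k * psi**2) * dx
    den += mu * psi**2 * dx
omega2 = num / den
print(omega2, math.pi**2)  # both ≈ 9.8696
```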
Problem: Conversely, show that if one considers \(\omega^2=\omega^2[\psi]\) as a functional of \(\psi(x)\), then the functional is stationary on eigenstates \(\psi\) of the Sturm-Liouville operator \(H\) and with eigenvalue \(\omega^2[\psi]\).
Solution:
Problem: Show that eigenfunctions of the Sturm-Liouville operator with distinct eigenvalues are \(\mu\)-orthogonal.
Solution: (this proof assumes the eigenvalues have already been shown to be real; the proof of that essentially mirrors this proof, except setting the two eigenfunction indices equal, \(1=2\)):
Problem: Solve the following inhomogeneous \(2^{\text{nd}}\)-order ODE:
Problem: Why is it important to learn how to learn?
Solution: In a post-AGI world where each day feels like swallowing a fire hose of new (mis!)information (especially as a researcher), the most important arrow one can have in the quiver is the ability to learn deeply, rapidly and with long shelf life. Equivalently, each piece of knowledge is a vector \(\mathbf k:=(\mu,\nu,\rho)\in (0,\infty)^3\) along \(3\) orthogonal axes (understanding, velocity, and retention), and the goal is for all \(\mathbf k\)-vectors to flow along the vector field of enlightenment:
Equivalently, in the time domain, it is essential to recognize that learning is just fighting an uphill battle against the entropy of the Ebbinghaus \(\mu(t)\) forgetting curve:
Problem: Describe a general (i.e. field-independent) method for learning any topic in any field deeply.
Solution: The ACADEMIA learning method involves a sequence of \(8\) nonlinear steps (i.e. one will often bounce back and forth among them). For each step, one should only invoke LLM assistance after a sincere personal endeavour has been made.
Ask: questions to anchor one’s razor focus on as one seeks the answers through learning.
Consolidate: compress/condense/compact raw material as rigorous, concise bullets, exploiting field-specific tricks.
Analogies: connect/bridge/one-to-one dictionary mappings onto one’s prior knowledge graph as much as possible.
Diagrams: graphs (both \(\{(x,f(x))\}\) and \((V,E)\)), matrices/tables, geometric constructions, heuristic representations, etc.
Examples: concrete numbers to clarify abstract scaffolding.
Mentor: Feynman technique of explaining without jargon as if to a \(12\)-year-old.
Interrogate: play devil’s advocate, adversarially seek to find flaws or inefficiencies in assumptions and conventional wisdom.
Acronym: summarize the new understanding gained with a memorable mnemonic.
Problem: Demonstrate a walk-through of the ACADEMIA learning method applied to the financial topic of stocks and bonds.
Solution:
A:
Problem: While the above demonstrates how to learn deeply (i.e. along the \(\mu\)-axis), it does not fully address the issues of how to learn quickly \(\nu\) or how to ensure retention \(\rho\).
Solution:
Forcing function: being in the right external environment with pressure and impetus to keep one motivated to learn (e.g. a library/office/coffee shop with the right incentive structures/deadlines).
Preparation: get into the right frame of mind by taking deep breaths until one feels sufficiently calm and steady to begin a deeply focused session of learning. Earmuffs on.
Reinforce: solve or answer (possibly LLM-generated) problems/exercises/quizzes/hands-on interactives/coding programs with (possibly LLM-generated) solutions.
Problem: How about high velocity? And high retention?
Solution: Listening to podcasts, popular talks, YouTube videos, etc. while instructive in their own right are not a substitute for doing it oneself.
Problem: What is the attitude with which one should approach learning?
Solution: One should only ever compare oneself to who one was yesterday, and ask whether one has improved since then. With consistent routines, growth compounds exponentially: \(1.01^{365}\approx 38\). Finally, practice what you preach! Learning for the joy of learning is best, though not always possible. The systematic method described above can seem a bit intimidating, but the little bit of extra time spent is hopefully well worth it in the long run, and over time one will get better/faster at it (autopilot) and also improve upon it through feedback loops.
(also, the problem of how to structure one’s breaks? Pomodoro seems a bit too rigid, can interrupt flow state/groove. )
(also, comments on how to get into the groove state in the first place, I think getting over that initial activation energy barrier is key)
(another comment: perhaps meditation before working? or some form of deep breaths/emotional control to get primed for the right state of mind?)
(keeping feet stationary on the floor, no shaking, perhaps can build some kind of mechanical lock-in for our feet? And also about standing desks…)
(also, about making tasks as modular as possible, ideally can be completed in <30 minutes? So before starting a session of work, write up the to-do list to sharpen scope and prevent distractions)
(allowing downtime for the mind to wander/think about something from earlier but deliberately try to find a crack/new perspective? Point is research is by definition trying to discover out-of-distribution stuff, so need to do this exercise. Actually, how about we just play the connection game, us adversarial vs. Gemini, and also Gemini adversarial vs. us, flesh out as many connections as you can think of in that time, and write up any interesting ones somewhere for later thought? You should just vibe code this. Also comment on the Poincare theory of 4 hours/day?)
Problem: Describe some physics-specific learning tricks.
Solution: (to do: acronym for all of this as one should practice what one preaches!)
personal cognitively consonant notation
Technique #\(1\): resist simplification temptation, group variables into dimensionally meaningful groups each with a clear interpretation, e.g. density of states in \(k\)-space for a free \(3\)D electron gas is \(g(k)=2\times V/(2\pi)^3\times 4\pi k^2=Vk^2/\pi^2\), impedance of a non-dispersive string is \(Z=T/\sqrt{T/\mu}=\sqrt{T\mu}\), energy levels of a particle in a box \(E_n=\hbar^2(n\pi/L)^2/2m=n^2\pi^2\hbar^2/2mL^2\), etc.
Technique #\(2\): dimensional analysis, find how quantities scale without regards to numerical pre-factors, e.g. from Navier-Stokes equations, inertial forces (e.g. air resistance) scale like \(F_{\rho}/a^3\sim\rho v^2/a\Rightarrow F_{\rho}\sim \rho a^2v^2\) whereas viscous forces (e.g. Stokes drag) scale like \( F_{\eta}/a^3\sim\eta v/a^2\Rightarrow F_{\eta}\sim\eta av\), so the Reynolds numbers is \(\text{Re}:=F_{\rho}/F_{\eta}\sim\rho av/\eta\).
Technique #\(3\): black box thinking, view complex systems solely in terms of inputs and outputs, e.g. Fabry-Perot interferometers as tunable diffraction gratings, transformers as sequence-to-sequence models, operational amplifiers, etc.
Technique #\(4\): exploit isomorphisms, understand a less intuitive concept \(X\) using the existence of an isomorphism \(X\cong Y\) with a more intuitive concept \(Y\), e.g. driven \(RLC\) circuits as damped, driven harmonic oscillators.
Technique #\(5\): toy models, approximate complex systems by their simplest non-trivial irreducibles, e.g. a ferromagnet as an Ising/Heisenberg lattice, etc.
Technique #\(6\): remember numerical values, one can sanity-check and remember formulas simply from the order of magnitudes of the quantities (this can be aided by software such as Anki), e.g. since \(N_A\sim 10^{23}\) and \(k_B\sim 10^{-23}\) (both in SI units), one expects \(N_Ak_B\sim O(1)\). But \(R\approx 8.3\sim O(1)\), hence \(N_Ak_B=R\).
Technique #\(7\): systematize, find general recipes/frameworks for approaching problems in a field, e.g. in quantum mechanics, once one has a clear idea of \((\mathcal H,H)\), the rest is just diagonalization, in ML one has the PRAC-DTDT-ID algorithm, etc.
Technique #\(8\): draw diagrams, annotate and label the diagram as a concept graph, e.g. the unit circle complex phase plane diagram for the \(1\)D harmonic oscillator creation/annihilation operators (this one also nicely combines Technique #\(2\)).
Technique #\(9\): acronyms/checklists, e.g. PRAC-DTDT-ID
Technique #\(10\): teach it, blog or present to someone else.
Technique #\(11\): latch onto key ideas, anchor everything else to it as a kind of knowledge graph extension/branch-off point, e.g. linear transformations.
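Technique #\(6\) above can be checked in a couple of lines:

```python
# Technique #6 in action: N_A ~ 1e23 and k_B ~ 1e-23, so N_A * k_B ~ O(1),
# and indeed it reproduces the gas constant R.
N_A = 6.02214076e23   # Avogadro's number (mol^-1), exact SI value
k_B = 1.380649e-23    # Boltzmann constant (J/K), exact SI value
R = N_A * k_B
print(R)  # 8.314... J/(mol K)
```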
Problem: What are the key points to keep in mind for machine learning?
Solution: (to do: acronym that’s better/simpler than PRAC-DTDT-ID)
To check if one’s model \(\hat y_{\boldsymbol{\theta}}\) has any hope of training/learning successfully on some labelled dataset \(\mathcal D\), first choose a very small subset \(\mathcal D'\subset\mathcal D\) with \(|\mathcal D'|\ll |\mathcal D|\) and deliberately overfit \(\hat y_{\boldsymbol{\theta}}\) on \(\mathcal D'\), then run inference to check that \(\hat y_{\boldsymbol{\theta}}(\mathbf x_i)\approx y_i\) for all \((\mathbf x_i,y_i)\in\mathcal D'\).
It is better to spend \(2\) days to carefully think through and implement an experiment with 95% probability of success than to spend \(2\) hours fumbling together an experiment with 5% probability of success.
ACTUALLY INSPECT/LOOK AT YOUR DATA!
Be meticulous, plot losses on all outputs.
Run short (e.g. 10000-step) hyperparameter ablations (e.g. learning rate, weight decay, max_grad_norm, etc.) before a long training run. That said, hyperparameters usually just need to be within \(3\) orders of magnitude or so of some good value; don’t fuss too much over them, as a well-built model is quite robust.
Remember to save checkpoints, but also get rid of old checkpoints or move them off disk onto some filesystem (if the disk fills up, some IDEs will prevent one from SSHing onto those VMs; one can get around this by SSHing via terminal instead of the IDE). It’s a good idea to include in one’s CLAUDE.md an instruction about first checking the current disk space before committing new checkpoints, etc.
Also, when running a new experiment/training, consider whether it’s possible to start from the checkpoint of a previous experiment.
Nebius, RunPod, Andromeda, AWS, and Azure are all VM/server providers where one pays to rent GPUs (among other resources).
Useful commands: nvidia-smi, watch -t nvidia-smi, claude --dangerously-skip-permissions, df -h, du -sh.
WandB (Weights and Biases) is for logging data.
Hugging Face is for storing/viewing large datasets in repos and for storing model weights. Save important checkpoints to HF; otherwise just rm -rf the ones you don’t need.
Have a hypothesis about the outcome of each experiment you do, and understand why you are doing it.
Use Claude to aid understanding, but not to replace it.
Attention to detail is often what makes or breaks one’s model.
Write pseudocode implementations.
Take time to carefully inspect and understand one’s dataset exactly.
Problem: What is an estimator \(\hat{\theta}\) for some parameter \(\theta\) of a random variable? Distinguish between point estimators and interval estimators.
Solution: In the broadest sense, an estimator is any function of \(N\) i.i.d. draws of the random variable \(\hat{\theta}(X_1,…,X_N)\). First, notice that the presence of the integer \(N\) is a smoking gun that all of the estimation theory that follows is strictly a frequentist approach rather than a Bayesian one.
Because \(X_1,…,X_N\) are random variables, it follows that \(\hat{\theta}\) is also a random variable. This definition doesn’t say anything about how good or bad the estimator \(\hat{\theta}\) needs to be with respect to actually trying to estimate the property \(\theta\) of the random variables.
Although the parameter \(\theta\) is fixed (in the frequentist interpretation), \(\hat{\theta}\) can either attempt to estimate \(\theta\) by directly giving a value (in which case \(\hat{\theta}\) is called a point estimator for \(\theta\)) or giving a range of values such that one has e.g. “\(95\%\) confidence” that \(\theta\) lies in that range (in which case \(\hat{\theta}\) is called an interval estimator for \(\theta\)).
Problem: Show that, if \(X_1,X_2,…,X_N\) are i.i.d. random variables, hence all having the same mean \(\mu=\langle X_1\rangle=…=\langle X_N\rangle\), then the random variable (called the sample mean):
\[\hat{\mu}:=\frac{X_1+X_2+…+X_N}{N}\]
is an unbiased point estimator for \(\mu\).
Solution: The purpose of this question is to emphasize what it means to be an unbiased point estimator; specifically, one should compute the expectation:
\[\langle\hat{\mu}\rangle=\frac{\langle X_1\rangle+\langle X_2\rangle+…+\langle X_N\rangle}{N}=\frac{N\mu}{N}=\mu\]
so it is indeed unbiased, or equivalently, there’s no systematic error \(\langle\hat{\mu}\rangle=\mu\).
Problem: Same setup as above, but now suppose one estimates the mean \(\mu\) using the estimator:
\[\hat{\mu}:=X_1\]
Explain why this is a bad choice of estimator.
Solution: First, notice that despite being “obviously bad”, this estimator is in fact unbiased \(\langle\hat{\mu}\rangle=\langle X_1\rangle=\mu\). However, as more and more instances of the random variable are instantiated \(N\to\infty\), this estimator \(\hat{\mu}\) remains the same forever (namely whatever the first draw happened to be). Thus, here the problem isn’t about bias, it’s about the inconsistency of the estimator (see: https://en.wikipedia.org/wiki/Consistent_estimator).
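A quick Monte Carlo sketch of this inconsistency (numpy, illustrative numbers): the variance of the sample mean shrinks like \(\sigma^2/N\), while the variance of the \(\hat{\mu}:=X_1\) estimator stays stuck at \(\sigma^2\) no matter how large \(N\) gets:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0  # true variance of the standard normal draws

for N in (10, 1000):
    # 10000 independent repetitions of an N-draw experiment
    draws = rng.normal(size=(10_000, N))
    var_sample_mean = draws.mean(axis=1).var()  # ≈ sigma2 / N: shrinks with N
    var_first_draw = draws[:, 0].var()          # ≈ sigma2: never improves
    assert abs(var_sample_mean - sigma2 / N) < 0.01
    assert abs(var_first_draw - sigma2) < 0.05
```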
Problem: Show that, if \(X_1,X_2,…,X_N\) are i.i.d. random variables, hence all having the same mean \(\mu=\langle X_1\rangle=…=\langle X_N\rangle\) and same variance \(\sigma^2=\langle (X_1-\mu)^2\rangle=…=\langle (X_N-\mu)^2\rangle\), then the random variable (called the sample variance):
\[\hat{\sigma}^2:=\frac{1}{N-1}\sum_{i=1}^N(X_i-\hat{\mu})^2\]
is an unbiased point estimator for \(\sigma^2\).
Solution: This is a very fun and instructive exercise. The key is to explicitly write out \(\hat{\mu}=\sum_{i=1}^NX_i/N\), use the parallel axis theorem \(\langle X^2_1\rangle=…=\langle X^2_N\rangle=\mu^2+\sigma^2\), and the independence assumption in i.i.d. so that in particular \(\langle X_iX_j\rangle=\mu^2+\delta_{ij}\sigma^2\).
Heuristically, this arises because \(\hat{\mu}\) has been estimated from the data \(X_1,…,X_N\) itself (as the proof above makes transparent) rather than an external source, thus representing a reduction in the number of degrees of freedom by \(1\), hence the so-called Bessel’s correction \(N\mapsto N-1\).
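Bessel’s correction can be checked numerically; numpy’s ddof argument switches between the two conventions (a sketch with standard normal draws, so the true \(\sigma^2=1\)):

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 5, 200_000
draws = rng.normal(size=(reps, N))  # true variance sigma^2 = 1

biased = draws.var(axis=1, ddof=0).mean()    # divide by N: biased low by (N-1)/N
unbiased = draws.var(axis=1, ddof=1).mean()  # Bessel-corrected: divide by N-1
assert abs(biased - (N - 1) / N) < 0.01      # ≈ 0.8, systematically too small
assert abs(unbiased - 1.0) < 0.01            # ≈ 1, unbiased
```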
Problem: Even though \(\hat{\sigma}^2\) is an unbiased point estimator for the variance \(\sigma^2\), show that \(\sqrt{\hat{\sigma}^2}\) is a biased point estimator for the standard deviation \(\sigma\) (in fact, \(\sqrt{\hat{\sigma}^2}\) will systematically underestimate the actual standard deviation \(\sigma\)).
Solution: Because \(\langle\hat{\sigma}^2\rangle=\sigma^2\) is an unbiased point estimator of the variance, it follows that \(\sigma=\sqrt{\langle\hat{\sigma}^2\rangle}\). The result is thus obtained by applying Jensen’s inequality to the concave square root function: \(\langle\sqrt{\hat{\sigma}^2}\rangle\leq\sqrt{\langle\hat{\sigma}^2\rangle}=\sigma\), with equality only if \(\hat{\sigma}^2\) has zero variance.
Problem: Show that the standard deviation of the sample mean point estimator \(\hat{\mu}\) is given by (called the standard error on \(\hat{\mu}\)):
\[\sigma_{\hat{\mu}}=\frac{\sigma}{\sqrt N}\]
In practice of course, the standard deviation \(\sigma\) in that numerator of the standard error \(\sigma_{\hat{\mu}}\) isn’t known, and even worse the previous exercise just established that \(\langle\sqrt{\hat{\sigma}^2}\rangle\leq\sigma\) is a biased underestimate of the true standard deviation \(\sigma\). Nevertheless, if \(N\gg 1\), one expects \(\langle\sqrt{\hat{\sigma}^2}\rangle\to\sigma\) to be asymptotically unbiased, and hence the standard error estimator \(\hat{\sigma}_{\hat{\mu}}=\sqrt{\hat{\sigma}^2/N}\) is commonly used to estimate the true standard error \(\sigma_{\hat{\mu}}=\sigma/\sqrt{N}\) even if strictly a bit biased.
Problem: A poll is conducted to estimate the proportion of people in some country who support candidate \(A\) vs. candidate \(B\). A total of \(N=1000\) people are surveyed, and it is found that \(N_A=520\) support candidate \(A\). Report a \(95\%\) confidence interval for the actual proportion of people in the country who support candidate \(A\).
Solution: Each person can be taken as an i.i.d. \(p\)-Bernoulli random variable with value \(0\) if they support candidate \(B\) and \(1\) if they support candidate \(A\). Since \(\mu=p\) for a Bernoulli random variable, one has the unbiased point estimator:
\[\hat p=\frac{N_A}{N}=0.52\]
In order to report a \(95\%\) confidence interval, there are \(2\) additional assumptions that will be made. First, the central limit theorem asserts that \(\hat p\) (an average of \(N=1000\gg 1\) i.i.d. Bernoulli trials) will be approximately normally distributed about \(p\) with standard error \(\sigma_{\hat p}=\sigma/\sqrt{N}=\sqrt{p(1-p)/N}\). Because \(p\) is unknown, the standard error can be estimated by \(\hat{\sigma}_{\hat p}=\sqrt{\hat p(1-\hat p)/N}\approx 0.016\), and from this one would report the confidence interval as:
\[p\approx\hat p\pm z\hat{\sigma}_{\hat p}\approx 0.52\pm 0.03\]
with the \(z\)-score \(z\approx 1.96\) for the normal distribution (recalling the familiar \(68-95-99.7\) rule for normal distributions, it makes sense that \(z\approx 2\)).
Finally, a subtlety: this standard error estimator \(\hat{\sigma}_{\hat p}=\sqrt{\hat p(1-\hat p)/N}\) is actually not the same as the one mentioned earlier \(\hat{\sigma}_{\hat{\mu}}=\sqrt{\hat{\sigma}^2/N}\). One can explicitly check this and find that ultimately Bessel’s correction leads instead to the standard error estimator \(\hat{\sigma}_{\hat p}=\sqrt{\hat p(1-\hat p)/(N-1)}\) (in fact, in this case there’s also another subtlety: instead of using a \(z\)-score, one has to use a \(t\)-score associated to \(N-1=999\) degrees of freedom). For \(N\gg 1\), these \(2\) methods are essentially indistinguishable, so no need to lose sleep.
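The arithmetic of this poll in a few lines (numbers from the problem, math stdlib only):

```python
import math

N, N_A = 1000, 520
p_hat = N_A / N                              # point estimate 0.52
se_hat = math.sqrt(p_hat * (1 - p_hat) / N)  # estimated standard error ≈ 0.016
z = 1.96                                     # two-sided 95% z-score
lo, hi = p_hat - z * se_hat, p_hat + z * se_hat
# 95% CI ≈ (0.489, 0.551): candidate A's lead is within the margin of error
assert abs(se_hat - 0.0158) < 1e-3
assert 0.48 < lo < 0.50 and 0.54 < hi < 0.56
```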
Problem: Above, the estimators were simply presented out of thin air, and properties such as their bias/variance/consistency were analyzed as an afterthought; however, show that they all arise within the single unifying framework of the maximum likelihood principle.
Solution: Having observed \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from some underlying random variable parameterized by \(\boldsymbol{\theta}\) (e.g. for a univariate normal random variable one could take \(\boldsymbol{\theta}=(\mu,\sigma^2)\)), the maximum likelihood estimator \(\hat{\boldsymbol{\theta}}_{\text{ML}}=\hat{\boldsymbol{\theta}}_{\text{ML}}(\mathbf x_1,…,\mathbf x_N)\) is given by maximizing the likelihood function of \(\boldsymbol{\theta}\):
\[\hat{\boldsymbol{\theta}}_{\text{ML}}:=\text{argmax}_{\boldsymbol{\theta}}\,p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})\]
Aside: at first glance, this seems a bit weird; any sane person’s first reaction would’ve been to instead maximize the reverse conditional probability!
\[\hat{\boldsymbol{\theta}}_{\text{MAP}}:=\text{argmax}_{\boldsymbol{\theta}}\,p(\boldsymbol{\theta}|\mathbf x_1,…,\mathbf x_N)\]
where “MAP” stands for maximum a posteriori. This highlights very clearly the core difference between the frequentist school of thought (in which \(\boldsymbol{\theta}\) is simply treated as a fixed, background constant) and the Bayesian school of thought (in which there is a notion of \(\boldsymbol{\theta}\) being a random variable and hence it being well-defined to talk about its prior \(p(\boldsymbol{\theta})\)).
Back to the frequentist MLE approach, because \(\mathbf x_1,…,\mathbf x_N\) are i.i.d., one can factorize the joint probability as a product of marginals:
\[p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})=\prod_{i=1}^Np(\mathbf x_i|\boldsymbol{\theta})\]
To avoid possible numerical underflow instabilities, one would like to convert this product into a sum. The usual way to do this is with a \(\log\) (arbitrary base). Since \(\log\) is moreover monotonically increasing, the MLE is thus equivalently obtained by maximizing the log-likelihood function:
\[\hat{\boldsymbol{\theta}}_{\text{ML}}=\text{argmax}_{\boldsymbol{\theta}}\sum_{i=1}^N\log p(\mathbf x_i|\boldsymbol{\theta})\]
But now notice that, instead of summing over all \(i=1,…,N\), one can instead sum over all distinct instantiations of the random variable, weighted by its empirical probability \(p_0(\mathbf x)\) based on the \(N\) training examples \(\mathbf x_1,…,\mathbf x_N\):
\[\hat{\boldsymbol{\theta}}_{\text{ML}}=\text{argmax}_{\boldsymbol{\theta}}\sum_{\mathbf x}p_0(\mathbf x)\log p(\mathbf x|\boldsymbol{\theta})\]
which, by instead considering the negative log-likelihood, can be reformulated as a problem of minimizing the cross-entropy \(S_{p(\boldsymbol{\theta})|p_0}\) of the model distribution \(p(\mathbf x|\boldsymbol{\theta})\) with respect to the empirical distribution \(p_0(\mathbf x)\):
\[\hat{\boldsymbol{\theta}}_{\text{ML}}=\text{argmin}_{\boldsymbol{\theta}}\,S_{p(\boldsymbol{\theta})|p_0},\qquad S_{p(\boldsymbol{\theta})|p_0}:=-\sum_{\mathbf x}p_0(\mathbf x)\log p(\mathbf x|\boldsymbol{\theta})\]
Ideally, MLE would seek to choose \(\boldsymbol{\theta}\) so as to make \(p(\mathbf x|\boldsymbol{\theta})\) match the actual data-generating distribution \(p(\mathbf x)\), but since one doesn’t have access to this, the best one can hope for is to have \(p(\mathbf x|\boldsymbol{\theta})\) match the empirical distribution \(p_0(\mathbf x)\) as a proxy for the true \(p(\mathbf x)\). Success would then mean finding a \(\boldsymbol{\theta}\) such that \(p(\boldsymbol{\theta})=p_0\), since in this case the KL-divergence \(D_{\text{KL}}(p(\boldsymbol{\theta})|p_0)=0\) attains its minimum possible value by virtue of the Gibbs inequality.
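A sketch of this MLE-as-cross-entropy-minimization picture for a Bernoulli model, where the MLE is known in closed form to be the sample mean (the grid search is purely illustrative):

```python
import numpy as np

x = np.array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])       # N=10 Bernoulli draws (7 ones)
p0 = np.array([(x == 0).mean(), (x == 1).mean()])  # empirical distribution over {0,1}

thetas = np.linspace(0.001, 0.999, 999)
# model distribution p(x|theta) over outcomes {0,1} for each candidate theta
log_p = np.stack([np.log(1 - thetas), np.log(thetas)])
cross_entropy = -(p0[:, None] * log_p).sum(axis=0)

theta_ml = thetas[np.argmin(cross_entropy)]
# minimizing cross-entropy w.r.t. p0 recovers the sample mean 0.7
assert abs(theta_ml - x.mean()) < 2e-3
```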
Problem: In what sense can minimizing an MSE cost function of the parameters be justified from the perspective of maximum likelihood estimation?
Solution: The key assumption is that the possible target labels \(y\) associated to a given feature vector \(\mathbf x\) are normally distributed around some \(\mathbf x\)-dependent mean \(\hat y(\mathbf x|\boldsymbol{\theta})\) with some \(\mathbf x\)-independent variance \(\sigma^2\), i.e. \(p(y|\mathbf x,\boldsymbol{\theta})=(2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y-\hat y(\mathbf x|\boldsymbol{\theta}))^2}{2\sigma^2}\right)\). In practice, it’s hard to know this in advance; only after fitting \(\hat y(\mathbf x|\boldsymbol{\theta}_{\text{ML}})\) and plotting a histogram of the residuals \(\hat y(\mathbf x_i|\boldsymbol{\theta}_{\text{ML}})-y_i\) can one check a posteriori if the histogram looks Gaussian or not.
With this, the conditional log-likelihood \(-N\ln\sigma-\frac{N}{2}\ln(2\pi)-\frac{1}{2\sigma^2}\sum_{i=1}^N(\hat y(\mathbf x_i|\boldsymbol{\theta})-y_i)^2\) is clearly maximized by the same \(\boldsymbol{\theta}\) that minimizes the mean square error:
\[C_{\text{MSE}}(\boldsymbol{\theta}):=\frac{1}{N}\sum_{i=1}^N(\hat y(\mathbf x_i|\boldsymbol{\theta})-y_i)^2\]
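A numerical sketch of this equivalence (synthetic data with a made-up slope and noise level; the grid search over slopes is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)  # true slope 2, Gaussian noise

slopes = np.linspace(1.5, 2.5, 1001)
resid = y[None, :] - slopes[:, None] * x[None, :]
mse = (resid ** 2).mean(axis=1)
sigma = 0.1
nll = x.size * np.log(sigma * np.sqrt(2 * np.pi)) + (resid ** 2).sum(axis=1) / (2 * sigma ** 2)

# maximizing the Gaussian likelihood and minimizing MSE pick the same slope
assert np.argmin(mse) == np.argmin(nll)
assert abs(slopes[np.argmin(mse)] - 2.0) < 0.1
```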
Problem: Define the score random vector \(\mathbf S(\boldsymbol{\theta})=\mathbf S(\boldsymbol{\theta}|\mathbf x_1,…,\mathbf x_N):=\frac{\partial}{\partial\boldsymbol{\theta}}\ln p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})\). Show that the covariance matrix of the score \(\mathbf S(\boldsymbol{\theta})\), called the Fisher information matrix \(I(\boldsymbol{\theta}):=\langle(\mathbf S(\boldsymbol{\theta})-\langle\mathbf S(\boldsymbol{\theta})\rangle)^{\otimes 2}\rangle\), may be written (using the fact that \(\langle\mathbf S(\boldsymbol{\theta})\rangle=\textbf 0\)):
\[I(\boldsymbol{\theta})=\langle\mathbf S(\boldsymbol{\theta})^{\otimes 2}\rangle=-\left\langle\frac{\partial^2\ln p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^{\otimes 2}}\right\rangle\]
(all expectations \(\langle\space\rangle\) are of course over the joint distribution of the observed data \(\mathbf x_1,…,\mathbf x_N\) defined by the likelihood function \(p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})\))
Solution:
Problem: Calculate explicitly the score \(S(\theta|x)\) and Fisher information \(I(\theta)\) for a \(\theta\)-Bernoulli random variable with outcome \(x\in\{0,1\}\). Hence, what can be said about the score \(S(\theta|x_1,…,x_N)\) and Fisher information \(I_N(\theta)\) for the corresponding \(N\)-trial binomial random variable where each i.i.d. trial is \(\theta\)-Bernoulli?
In the more general case where \(\langle\hat{\boldsymbol{\theta}}\rangle=\boldsymbol{\theta}\) is an unbiased point estimator for more than \(1\) parameter, a natural generalization (the Cramer-Rao bound) holds:
\[\text{cov}(\hat{\boldsymbol{\theta}},\hat{\boldsymbol{\theta}})\geq I^{-1}(\boldsymbol{\theta})\]
where \(I^{-1}(\boldsymbol{\theta})\) is the inverse of the Fisher information matrix, and the meaning of this matrix inequality is that \(\text{cov}(\hat{\boldsymbol{\theta}},\hat{\boldsymbol{\theta}})-I^{-1}(\boldsymbol{\theta})\) is positive semi-definite.
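A Monte Carlo sketch for the \(\theta\)-Bernoulli case: the variance of the score indeed matches the standard closed form \(I(\theta)=\frac{1}{\theta(1-\theta)}\) (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.random(1_000_000) < theta  # one Bernoulli draw per experiment

# score S(theta|x) = d/dtheta log(theta^x (1-theta)^(1-x)) = x/theta - (1-x)/(1-theta)
score = np.where(x, 1 / theta, -1 / (1 - theta))
assert abs(score.mean()) < 0.01      # the score has zero mean

fisher_mc = score.var()
fisher_exact = 1 / (theta * (1 - theta))  # ≈ 4.76 for theta = 0.3
assert abs(fisher_mc - fisher_exact) < 0.05
# and N i.i.d. trials simply add: I_N(theta) = N * I(theta)
```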
Problem: Combining the results of the previous \(2\) problems, what can be deduced?
Problem: How can random error be reduced in a scientific experiment? What about systematic error?
Solution: The standard error \(\sigma_{\hat{\mu}}=\sigma/\sqrt{N}\) can be taken as a quantitative proxy for the effective random error from \(N\) i.i.d. measurements. This reveals that there are \(2\) ways to reduce the effective random error:
Reduce \(\sigma\) (e.g. cooling down electronics to mitigate Johnson noise).
Increase \(N\) (make more measurements).
Reducing systematic error is a completely different beast, and requires techniques such as calibration against a “gold standard”, differential measurements, etc.
Problem: Explain how the \(\chi^2\) goodness-of-fit test arises from maximizing a log-likelihood function subject to normally distributed errors between the data and the model.
Solution: The \(\chi^2\) statistic
\[\chi^2:=\sum_{i=1}^N\frac{(y_i-\hat y(\mathbf x_i|\boldsymbol{\theta}))^2}{\sigma_i^2}\]
is basically just a rearrangement of the usual formula for the variance, normalized so that \(\langle\chi^2\rangle\approx N\) when the errors really are normally distributed with the stated \(\sigma_i\). Indeed, comparing \(\chi^2\ll N,\chi^2\approx N,\chi^2\gg N\) basically is the whole point of the test. The subtraction of the number of constraints from \(N\) is also reminiscent of the Bessel correction, and in fact the \(2\) are there for conceptually the same reason.
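A numerical sketch (common \(\sigma_i=\sigma\) for simplicity, numbers made up): residuals genuinely drawn from the assumed normal error model give \(\chi^2\approx N\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 1000, 0.5
residuals = rng.normal(scale=sigma, size=N)  # model-minus-data errors, known sigma

chi2 = ((residuals / sigma) ** 2).sum()
# for a good fit, chi^2 is comparable to N (more precisely N minus #constraints);
# the standard deviation of a chi^2_N variable is sqrt(2N)
assert abs(chi2 - N) < 5 * np.sqrt(2 * N)
```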
Just as the fundamental theorem of single-variable calculus \(\int_{x_1}^{x_2}f'(x)dx=f(x_2)-f(x_1)\) is the key insight on which the entire subject of single-variable calculus rests, there is an analogous sense in which one can consider a fundamental theorem of classical computing to be the key insight on which the entire field of classical computing rests. This is:
Fundamental Theorem of Classical Computing: The set \(\textbf N\) forms a vector space over the binary Galois field \(\textbf Z/2\textbf Z\) admitting the countably infinite basis \(\textbf N=\text{span}_{\textbf Z/2\textbf Z}\{2^n:n\in\textbf N\}\), so that \(\text{dim}_{\textbf Z/2\textbf Z}(\textbf N)=\aleph_0\). Put differently, the binary representation map \(n\in\textbf N\mapsto [n]^{(1,2,4,8,…)}\in(\textbf Z/2\textbf Z)^{\oplus\infty}\) is a bijection.
From a physics perspective, one can loosely think of \(\textbf N\cong(\textbf Z/2\textbf Z)^{\oplus\infty}\) as the direct sum of infinitely many copies of the binary “vector space” \(\textbf Z/2\textbf Z\). By contrast, in the field of quantum computing, one instead has a vector space of the form \((\textbf C^2)^{\otimes N}\) for \(N\) qubits…and the multiplicative structure of the tensor product is much richer than the additive structure of the direct sum, hence the interest in quantum computing.
Knowing all this, it follows that any data \(X\) (e.g. an image file, an audio file, etc.) which can simply be “reduced to numbers” by some injection \(f:X\to\textbf N\) is also in principle just reduced to some bit string \((b_k)_{k=0}^\infty\), simply by taking each \(f(x)\in\textbf N\) for \(x\in X\) and writing out the binary representation of \(f(x)\). For instance, when \(X=\{a,b,c,…,x,y,z\}\) is the alphabet, then one possible injection (or encoding) is called the American Standard Code for Information Interchange \(\text{ASCII}:\{a,b,c,…,x,y,z\}\to\textbf N\), which maps for instance \(\text{ASCII}(a):=97\), \(\text{ASCII}(b):=98\), and so forth. (Strictly speaking, ASCII uses \(7\) bits to represent \(2^7=128\) characters, of which \(26\) are the usual lowercase English letters and another \(26\) are the uppercase English letters. There was even an extended ASCII which used \(8\) bits to represent \(2^8=256\) characters, but ultimately that was clearly insufficient for things like the Chinese language, and so nowadays Unicode (with the UTF-8 encoding, which stands for Unicode Transformation Format – 8-bit, or UTF-16, UTF-32, etc.) tends to be used in lieu of ASCII.)
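These encodings can be poked at directly in Python:

```python
# Encoding text to numbers to bits: ord gives the code point, format gives the bits
assert ord("a") == 97 and ord("b") == 98
assert format(ord("a"), "07b") == "1100001"  # the 7-bit ASCII representation of 'a'

# ASCII is a strict subset of UTF-8: one byte per ASCII character...
assert "ab".encode("utf-8") == b"\x61\x62"
# ...but characters beyond ASCII need multiple bytes under UTF-8
assert len("中".encode("utf-8")) == 3
```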
When storing numbers in memory (e.g. RAM, HDD, SSD), a computer can experience overflow error (most computers have a \(64\)-bit architecture, so can safely store unsigned integers only up to \(2^{64}-1\)), roundoff error (this is especially relevant to floating-point arithmetic), and precision errors.
An analog-to-digital converter (ADC) is just an abstraction for any function that maps an analog signal \(V(t)\), \(I(x,y)\), etc. to a digital signal \(\bar V_i,\bar I_{ij}\), etc. More precisely, one can consider an ADC to be a pair \(\text{ADC}=(f_s,\rho)\) where \(f_s\) is the sampling rate of the ADC in samples/second and \(\rho\) is the bit depth/resolution/precision at which the ADC quantizes data in bits/sample, thus the bit rate \(\dot b\) of the ADC is \(\dot b=f_s\rho\) and this is in general distinct from the baud rate \(\dot{Bd}\) of serial communication with the ADC by a factor \(\lambda:=\frac{\dot b}{\dot{Bd}}\geq 1\) which describes the number of bits per baud (see this Stack Overflow Q&A for an idea of the distinction).
For \(V(t)\) an analog signal in the time domain which is bandlimited by some bandwidth \(\Delta f\), the Nyquist-Shannon sampling theorem asserts that in order to avoid aliasing distortions when sampling \(V(t)\), one has to use \(f_s>\Delta f\). Equivalently, if \(f^*=\Delta f/2\) is the largest frequency present in \(V(t)\), then the sampling frequency needs to obey \(f_s>2f^*\). For instance, humans can hear audio up to \(f^*=20\text{ kHz}\), so audio ADCs (e.g. in digital microphones) sample at \(f_s=48\text{ kHz}\). Cameras are just image ADCs, where now “samples” is replaced by “pixels” and so \(f_s\) might be better called “pixel frequency” (with units of pixels/meter rather than samples/second?). The use of the RGB color space is fundamentally based on the biology of the human eye and its \(3\) types of cone cells, and conventionally each R, G, B channel has \(256\) levels (or \(1\) byte) of intensity quantization simply because that was empirically found to be sufficient (so the total bit depth of an RGB image is \(\rho=3\) bytes/pixel, or \(24\) bits/px).
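A sketch of aliasing with exactly these numbers: a \(30\text{ kHz}\) tone sampled at \(f_s=48\text{ kHz}\) (i.e. above the Nyquist frequency \(f_s/2=24\text{ kHz}\)) yields literally the same samples, up to a sign, as an \(18\text{ kHz}\) tone:

```python
import numpy as np

fs = 48_000               # sampling rate in samples/s
n = np.arange(256)        # sample indices
f_high = 30_000           # tone above the Nyquist frequency fs/2 = 24 kHz
f_alias = fs - f_high     # aliases down to 18 kHz

high = np.sin(2 * np.pi * f_high * n / fs)
alias = np.sin(2 * np.pi * f_alias * n / fs)
# the 30 kHz tone is indistinguishable (up to a sign flip) from an 18 kHz tone
assert np.allclose(high, -alias, atol=1e-9)
```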
In practice, such data would likely be further compressed (via either lossless or lossy data compression algorithms). For instance: JPEG (lossy), PNG (lossless), and run-length encoding (lossless) for digital/bitmap/raster images; Lempel-Ziv-Welch (LZW) compression (lossless), Huffman encoding (lossless), and byte pair encoding for text files; and perceptual audio encoding (lossy), which exploits psychoacoustic quirks of the human auditory system such as auditory masking and high-frequency limits.
Computers & Logic Circuits
One abstract paradigm for understanding how a computer works is: input\(\to\)storage + processing\(\to\)output. Input is typically taken from sensors (e.g. keyboards, mice, touchscreens, microphones, cameras); memory is handled by RAM; storage is handled by HDD/SSD; processing is done by the central processing unit (CPU) (an integrated circuit (IC)), where CPU = control unit + arithmetic logic unit (ALU) (both storage and processing use logic circuits made of many logic gates combined together); and output goes to a monitor, a speaker, an electric motor, an LED, etc. Processing + memory are heavily interdependent, connected by a memory bus.
This excellent YouTube demonstration shows how to implement standard logic gates (e.g. buffers, NOT gates, AND gates, OR gates, XOR gates, NAND gates, NOR gates, etc.) using standard hardware on a solderless breadboard, notably transistors. So really, when it comes to processing data, a “computer” is an abstraction over “logic circuits”, which is itself an abstraction over “logic gates”, which is itself an abstraction over “bits”, which is itself an abstraction over transistors and physical hardware that one can actually touch and feel in the real world.
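In the same spirit of stacking abstractions, one can sketch in code how all the standard gates are themselves abstractions over a single universal gate (here NAND, which is functionally complete):

```python
# Build every standard gate out of NAND alone
def NAND(a: int, b: int) -> int:
    return 1 - (a & b)

def NOT(a):    return NAND(a, a)
def AND(a, b): return NOT(NAND(a, b))
def OR(a, b):  return NAND(NOT(a), NOT(b))
def XOR(a, b): return AND(OR(a, b), NAND(a, b))

# Verify all truth tables against Python's native bitwise operators
for a in (0, 1):
    for b in (0, 1):
        assert AND(a, b) == a & b
        assert OR(a, b) == a | b
        assert XOR(a, b) == a ^ b
```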
Our current computer (Microsoft Surface Studio \(2\)) has \(\sigma_{\text{RAM}}=32\text{ GB}\) and \(\sigma_{\text{SSD or C-Drive}}=1\text{ TB}\) (with Microsoft OneDrive providing an additional \(\sigma_{\text{OneDrive}}=1\text{ TB}\) of storage space).
The Internet
A computer network is topologically any connected undirected graph where nodes represent computing devices (e.g. computers, phones, etc.) and edges are communication channels between a pair of computing devices. Common network topologies include the ring, star, mesh, bus, and tree topologies. Examples of computer networks include local area networks (LAN), wide area networks (WAN), data center networks (DCN), etc., with the Internet being a distributed packet-switched WAN. When designing the architecture of a computer network, one is interested in minimizing the distance (with respect to a suitable metric) that any piece of data \(D\) must travel to get from one computer to another.
At the level of physical hardware, data can be communicated between computers via copper category 5 (CAT5) twisted-pair cables adhering to Ethernet standards; fiber optic cables can also be used with Ethernet standards. Wi-Fi or Bluetooth instead communicate data via radio waves, which suffer attenuation. Regardless, for all \(3\) of these physical transmission schemes (for the wired media, a line code maps abstract bits to a digital signal in the real world; radio links modulate a carrier instead), we need to thank James Clerk Maxwell.
The informal notion of the “speed” of an internet connection (i.e. one of the communication channels mentioned earlier) between two computing devices \(X, Y\) is made precise by the bit rate \(\dot b_{(X,Y)}\) between the computing devices \(X\) and \(Y\). The bandwidth of that communication channel is then just the maximum bit rate \(\dot b^*_{(X,Y)}\) between \(X\) and \(Y\) (not to be confused with the signal processing notion of the bandwidth \(\Delta f\) of an analog signal). Another important factor is the latency \(\Delta t_{(X,Y)}\) of a given communication channel (i.e. just the delay). Running an Internet speed test for my computer with the measurement lab (M-lab) yields \(\dot b^{\text{downloads}}_{\text{(computer,M-lab)}}=650.7\frac{\text{Mb}}{\text s}\) and \(\dot b^{\text{uploads}}_{\text{(computer,M-lab)}}=732.0\frac{\text{Mb}}{\text s}\) and a latency of \(\Delta t_{\text{(computer,M-lab)}}=4\text{ ms}\).
Just like every house \(H\) has a physical address \(A_H\), in the WAN that we call the Internet, every computing device \(X\) has an Internet Protocol (IP) address \(\text{IP}_X\). When a computing device \(X\) transmits data packets \(D\) across a communication channel to another computing device \(Y\), \(X\) must specify the IP address \(\text{IP}_Y\) of \(Y\) in addition to providing its own IP address \(\text{IP}_X\) so that \(Y\) can reply to it. There are actually \(2\) common IP address protocols: IPv4 (a string of \(4\) bytes, leading to \(2^{32}\) possible IPv4 addresses) and IPv6 (a string of \(8\) groups of \(4\) hexadecimal digits, each group up to \(\text{0xFFFF}\), for a total of \(2^{128}\) possible IPv6 addresses). IP addresses of computing devices may also be dynamic, meaning that one’s Internet service provider changes it over time \(t\). Or, if one connects to a disjoint Wi-Fi network, then this will usually also mean a different IP address, as each Wi-Fi provider (Internet service provider) has a range of IP addresses it is allowed to allocate. By contrast, computing devices acting as servers (e.g. Google’s computers) often have static IP addresses (e.g. \(\text{IPv4}_{\text{Google computers}}=74.125.20.113\)) to make it easier for client computing devices to communicate with them quickly.
In terms of actually interpreting what the numbers in an IP address mean, it turns out that it doesn’t have to be the case that (say in an IPv4 address) each byte corresponds to some piece of information. Rather, one can decide how one wishes to impose a hierarchy of subnetworks (subnets) on the IP address, that is, how many bits to represent a given piece of information. This is sometimes known as “octet splitting” where “octet” = “byte” in an IPv4 address.
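Python’s ipaddress module makes this subnet hierarchy concrete (the example networks are made up):

```python
import ipaddress

# A /24 subnet: the first 24 bits ("three octets") identify the network,
# the remaining 8 bits identify hosts within it
net = ipaddress.ip_network("192.168.1.0/24")
assert net.num_addresses == 2 ** 8
assert ipaddress.ip_address("192.168.1.42") in net
assert ipaddress.ip_address("192.168.2.42") not in net

# "octet splitting" need not respect byte boundaries: a /20 uses 20 network bits,
# leaving 12 host bits
net20 = ipaddress.ip_network("10.0.16.0/20")
assert net20.num_addresses == 2 ** 12
```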
The Domain Name System (DNS) is essentially a map \(\text{DNS}:\{\text{URLs}\}\to\{\text{IP addresses}\}\) and indeed anytime one uses a browser application (e.g. Chrome) to search for a website URL (e.g. www.youtube.com), DNS servers need to first find the IP address \(\text{DNS}\)(www.youtube.com) associated with that URL.
Problem: Given a linear operator \(H\) on some vector space, define the resolvent operator \(G_H(E)\) associated to \(H\).
Solution: The resolvent \(G_H(E)\) of \(H\) is the operator-valued Mobius transformation of a complex variable \(E\in\textbf C\) defined by the inverse:
\[G_H(E):=\frac{1}{E1-H}\]
(this notation \(A/B\) is only unambiguous when \([A,B^{-1}]=0\) which it is in this case).
Problem: What is the domain for \(E\in\textbf C\) of the resolvent \(G_H(E)\)?
Solution: Any value of \(E\in\textbf C\) for which the matrix \(E1-H\) is invertible leads to a well-defined resolvent. But invertibility is equivalent to a non-vanishing determinant \(\det(E1-H)\neq 0\). However, when \(\det(E1-H)=0\), then \(E\) is an eigenvalue of \(H\). So the domain of \(G_H(E)\) is \(E\in\textbf C-\text{spec}(H)\).
Problem: To see the conclusion of Solution #\(2\) another way, assume \(H\) is Hermitian so that it admits an orthonormal eigenbasis \(H|n\rangle=E_n|n\rangle\) with real eigenvalues. Show that the resolvent \(G_H(E)\) of \(H\) may be expressed as a linear combination of projectors onto its eigenspaces:
\[G_H(E)=\sum_n\frac{|n\rangle\langle n|}{E-E_n}\]
where the matrix element is \(\langle n|\frac{1}{E1-H}|m\rangle=\frac{\delta_{nm}}{E-E_n}\). Thus, the resolvent has a simple pole whenever \(E=E_n\) for some \(H\)-eigenstate \(|n\rangle\), and its residue at that simple pole is given by the corresponding projector \(|n\rangle\langle n|\).
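This spectral form of the resolvent is easy to verify numerically on a toy \(2\times 2\) Hermitian matrix (numbers made up):

```python
import numpy as np

# Hermitian H with spectral decomposition H = sum_n E_n |n><n|
H = np.array([[2.0, 1.0], [1.0, 2.0]])
E_n, V = np.linalg.eigh(H)  # eigenvalues [1, 3]; columns of V are the |n>

E = 0.5 + 0.5j  # any E not in spec(H)
G = np.linalg.inv(E * np.eye(2) - H)

# resolvent as a linear combination of eigenprojectors |n><n| / (E - E_n)
G_spectral = sum(np.outer(V[:, n], V[:, n].conj()) / (E - E_n[n]) for n in range(2))
assert np.allclose(G, G_spectral)
```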
Problem: Define the retarded resolvent \(G_H^+(E)\) and the advanced resolvent \(G_H^-(E)\).
Solution: Killing \(2\) birds with \(1\) stone:
\[G_H^{\pm}(E):=G_H(E\pm i0^+)\]
so if \(H\) were Hermitian, then normally the poles of \(G_H(E)\) would all lie on the real axis; the retarded and advanced resolvents therefore exist to shift these poles slightly above or below the real axis.
Problem: Define the Fourier transform of the time evolution operator \(U_H(t)=e^{-iHt/\hbar}\) to a new operator-valued function of \(E\) given by the convention:
\[U_H(E):=\int_{-\infty}^{\infty}\frac{dt}{2\pi\hbar}e^{iEt/\hbar}U_H(t)\]
Show that \(U_H(E)=-\frac{1}{\pi}\Im G_H^+(E)=\frac{1}{\pi}\Im G_H^-(E)\), where the imaginary part of an operator is defined in the obvious manner.
Solution: One way is to proceed by direct evaluation, which gives \(U_H(E)=\delta(E1-H)\). On the other hand, any delta function can be written via a partial fraction expansion of a nascent delta given by eliminating the principal value of the Sokhotski-Plemelj theorem:
\[\frac{1}{x\pm i0^+}=\mathcal P\frac{1}{x}\mp i\pi\delta(x)\Rightarrow\delta(x)=\mp\frac{1}{\pi}\Im\frac{1}{x\pm i0^+}\]
Problem: Starting from the decomposition \(H=H_0+V\), conclude that the resolvents \(G_H,G_{H_0}\) of \(H\) and \(H_0\) are related by:
\[G_H=\Omega_HG_{H_0}\]
where the Moller scattering operator \(\Omega_H\) of \(H\) depends on the choice of decomposition \(H=H_0+V\):
\[\Omega_H:=\frac{1}{1-G_{H_0}V}\]
Solution:
Problem: Starting from the Schrodinger equation \(H|\psi\rangle=E|\psi\rangle\) in the unusual form:
\[G^{-1}_H(E)|\psi\rangle=0\]
use the result above to deduce the Lippmann-Schwinger equation:
\[|\psi\rangle=\Omega_H|\psi_0\rangle\]
where \(H=H_0+V\) can be decomposed in any way one pleases, so long as this is reflected in the Moller scattering operator \(\Omega_H\) and the “free state” \(|\psi_0\rangle\) with commensurate energy \(H_0|\psi_0\rangle=E|\psi_0\rangle\) (thus, \(\Omega_H\) is “energy-conserving”).
Solution:
Problem: How is this form of the Lippmann-Schwinger equation typically applied to quantum mechanical scattering?
Solution: The decomposition is chosen such that \(H_0=|\textbf P|^2/2\mu\) is a purely kinetic Hamiltonian and \(V\) is a corresponding \(2\)-body scattering potential for two particles of reduced mass \(\mu\) (working in the ZMF). In this case, one typically takes \(|\psi_0\rangle:=|\textbf k\rangle\) to be a plane wave \(H_0\)-eigenstate with energy \(E_{\textbf k}=\hbar^2|\textbf k|^2/2\mu\). Furthermore, one typically chooses the retarded Moller scattering operator \(\Omega_H^+(E_{\textbf k})=\frac{1}{1-G_{H_0}^+(E_{\textbf k})V}\) as an ad hoc way of cherrypicking the physical, outward-propagating solution \(|\psi\rangle:=|\psi^+_{\textbf k}\rangle\). Finally, it is convenient to work in a specific basis, typically the \(\textbf X\)-eigenbasis. Altogether then, the useful form of the Lippmann-Schwinger equation for scattering theory is:
\[\psi^+_{\textbf k}(\textbf x)=\langle\textbf x|\textbf k\rangle+\langle\textbf x|G^+_{H_0}(E_{\textbf k})V|\psi^+_{\textbf k}\rangle\]
To actually evaluate this, one ultimately has to unwrap the earlier geometric Neumann series for \(\Omega^+_H(E_{\textbf k})=1+G^+_{H_0}(E_{\textbf k})V+…\) and perhaps truncate at some partial sum to yield a Born approximation to \(\psi_{\textbf k}^+(\textbf x)\):
\[\psi^+_{\textbf k}(\textbf x)\approx\langle\textbf x|\left(1+G^+_{H_0}(E_{\textbf k})V+…\right)|\textbf k\rangle\]
Gauge-fixing the \(\textbf X\)-eigenstates to be an orthonormal basis in the sense that \(\langle\textbf x|\textbf x'\rangle=\delta^3(\textbf x-\textbf x')\) and \(\int d^3\textbf x|\textbf x\rangle\langle\textbf x|=1\), and having implicitly assumed the normalization \(\langle\textbf x|\textbf k\rangle = e^{i\textbf k\cdot\textbf x}\), then one has \(\langle\textbf k|\textbf k'\rangle=(2\pi)^3\delta^3(\textbf k-\textbf k')\) and \(\int\frac{d^3\textbf k}{(2\pi)^3}|\textbf k\rangle\langle\textbf k|=1\). Hence, using the retarded free-particle Green’s function \(\langle\textbf x|G^+_{H_0}(E_{\textbf k})|\textbf x'\rangle=-\frac{\mu}{2\pi\hbar^2}\frac{e^{ik|\textbf x-\textbf x'|}}{|\textbf x-\textbf x'|}\):
\[\psi^+_{\textbf k}(\textbf x)=e^{i\textbf k\cdot\textbf x}-\frac{\mu}{2\pi\hbar^2}\int d^3\textbf x'\,\frac{e^{ik|\textbf x-\textbf x'|}}{|\textbf x-\textbf x'|}V(\textbf x')\psi^+_{\textbf k}(\textbf x')\]
Problem: Is this perturbation theory?
Solution: It’s not really perturbation theory in the sense that the energies are just taken to be \(E=\hbar^2|\textbf k|^2/2\mu\); rather, one is much more interested in the eigenstates (especially asymptotically!), from which other kinds of data like scattering amplitudes, cross sections, etc. can be extracted. So the goals of the \(2\) programs are different.
Problem: Consider some \(H_0\)-eigenstate \(|0\rangle\) with energy \(H_0|0\rangle=E_0|0\rangle\). Evaluate the expectation \(\langle 0|G_{H_0}(E)|0\rangle\). Hence, for \(H=H_0+V\), show that \(\langle 0|G_H(E)|0\rangle\) looks the same as \(\langle 0|G_{H_0}(E)|0\rangle\) except with \(E_0\mapsto E_0+\Sigma_{|0\rangle,V}(E)\), where the self-energy \(\Sigma_{|0\rangle,V}(E)\in\textbf C\) of the \(H_0\)-eigenstate \(|0\rangle\) due to the perturbation \(V\) is given by the usual:
\[\Sigma_{|0\rangle,V}(E)=V_{00}+\sum_{n\neq 0}\frac{V_{0n}V_{n0}}{E-E_n}+\sum_{n,m\neq 0}\frac{V_{0n}V_{nm}V_{m0}}{(E-E_n)(E-E_m)}+…\]
where \(V_{nm}:=\langle n|V|m\rangle\) are the matrix elements of the perturbation \(V\) in the unperturbed \(H_0\)-eigenbasis.
Problem: What are the interpretations of \(\Re\Sigma_{|0\rangle,V}(E)\) and \(\Im\Sigma_{|0\rangle,V}(E)\)?
Solution: The real part \(\Re\Sigma_{|0\rangle,V}(E)\) shifts the pole along the real axis, i.e. it renormalizes the energy level \(E_0\mapsto E_0+\Re\Sigma_{|0\rangle,V}\). The imaginary part \(\Im\Sigma_{|0\rangle,V}(E)\) pushes the pole off the real axis, which in the time domain corresponds to exponential decay of the state \(|0\rangle\) at rate \(\Gamma/\hbar\) with \(\Gamma=-2\Im\Sigma_{|0\rangle,V}\), i.e. a finite lifetime \(\tau=\hbar/\Gamma\) (equivalently, a Lorentzian broadening of width \(\Gamma\) in the energy domain).
Problem: Show that, by tracking the movement of a simple pole \(E_n\) of \(G_H(E)\) in the complex \(E\)-plane, one can recover the eigenvalue and eigenstate corrections of perturbation theory.
Solution: Shine the spotlight on some eigenstate \(|n\rangle\) and its associated energy \(E_n\) by separating the unperturbed resolvent as:
\[G_{H_0}(E)=\frac{|n\rangle\langle n|}{E-E_n}+\sum_{m\neq n}\frac{|m\rangle\langle m|}{E-E_m}\]
so that the second-order term \(G_{H_0}VG_{H_0}\) of the Neumann series splits as:
\[G_{H_0}VG_{H_0}=\frac{V_{nn}|n\rangle\langle n|}{(E-E_n)^2}+\sum_{m\neq n}\left(\frac{V_{mn}|m\rangle\langle n|}{(E-E_m)(E-E_n)}+\frac{V_{nm}|n\rangle\langle m|}{(E-E_n)(E-E_m)}\right)+\sum_{m,\ell\neq n}\frac{V_{m\ell}|m\rangle\langle\ell|}{(E-E_m)(E-E_\ell)}\]
The first term in the sum turns \(E=E_n\) from a simple pole into a double pole. It turns out this is what’s responsible for shifting the location of the pole away from \(E=E_n\), in other words, perturbing the eigenvalue. Meanwhile the series in the middle contributes to the residue at \(E=E_n\) because \(m\neq n\); clearly it must be responsible for perturbing the eigenstate. Finally, because \(m,\ell\neq n\) in the last sum, it will be analytic in a neighbourhood of \(E=E_n\); in other words, it contributes nothing to either the pole location or the residue there.
Strictly speaking, after including the \(G_{H_0}VG_{H_0}\) term from the geometric Neumann-Born-Laurent series, the pole still sits at \(E=E_n\), just that its order has increased from \(1\to 2\). But consider as an example the geometric series \(1+1/x+1/x^2+…\). For any finite partial sum truncation, the pole sits at \(x=0\). But for \(|x|>1\) this converges absolutely to \(x/(x-1)\) where now the pole has been displaced to \(x=1\). Or just take any function like \(\tan(x)\) and Taylor expand it around \(x=0\) say; although all the terms in that Taylor series are analytic, the limiting behavior must be non-analytic at \(x=\pm\pi/2\). It’s basically a more extreme version of a phase transition, since singularities are more extreme than discontinuities.
Anyways, the fact that it’s the expectation \(\langle n|V|n\rangle\) which is sitting in the numerator of the double pole term means that this is the \(1^{\text{st}}\)-order correction to the energy. Similarly, decomposing into partial fractions:
only the first term \(\sim(E-E_n)^{-1}\) contributes to the residue at \(E=E_n\), and moreover this eigenstate contribution is just \(\sum_{m\neq n}\frac{\langle m|V|n\rangle}{E_n-E_m}|m\rangle\). Continuing to higher-order terms in the expansion reproduces the familiar higher-order formulas of perturbation theory.
Problem #\(6\): Let \(H=H(\lambda)\) be a non-degenerate Hamiltonian depending on a parameter \(\lambda\) (not necessarily infinitesimal), and let \(|n\rangle=|n(\lambda)\rangle\) be a normalized \(H\)-eigenstate with energy \(E_n=E_n(\lambda)\). By differentiating the spectral equation:
\[H|n\rangle=E_n|n\rangle\]
with respect to \(\lambda\), prove the Hellmann-Feynman theorems for the rate of change of the eigenvalue \(E_n\) and the \(H\)-eigenstate \(|n\rangle\):
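\[\frac{\partial E_n}{\partial\lambda}=\langle n|\frac{\partial H}{\partial\lambda}|n\rangle\qquad\text{and}\qquad\frac{\partial|n\rangle}{\partial\lambda}=\sum_{m\neq n}\frac{\langle m|\frac{\partial H}{\partial\lambda}|n\rangle}{E_n-E_m}|m\rangle\]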
Solution #\(6\): Differentiating yields \(\frac{\partial H}{\partial\lambda}|n\rangle+H\frac{\partial|n\rangle}{\partial\lambda}=\frac{\partial E_n}{\partial\lambda}|n\rangle+E_n\frac{\partial|n\rangle}{\partial\lambda}\). This is like having \(2\) vectors \((a,b,c)=(d,e,f)\); naturally one’s instinct would be to equate components \(a=d,b=e,c=f\). In this context, what that looks like is projecting both sides onto an arbitrary \(H\)-eigenstate \(|m\rangle\) to equate the scalar components of all vectors:
the Hellmann-Feynman theorems then arise by considering the \(2\) cases \(m=n\) and \(m\neq n\). There is a priori also a component of the rate of change \(\partial|n\rangle/\partial\lambda\) of the \(H\)-eigenstate \(|n\rangle\) along itself with amplitude \(\langle n|\frac{\partial|n\rangle}{\partial\lambda}\), but due to normalization \(\langle n|n\rangle=1\Rightarrow\frac{\partial\langle n|}{\partial\lambda}|n\rangle+\langle n|\frac{\partial|n\rangle}{\partial\lambda}=0\Rightarrow\Re\langle n|\frac{\partial|n\rangle}{\partial\lambda}=0\), and the residual imaginary part can be gauged away by a suitable \(\lambda\)-dependent choice of phase for \(|n\rangle\), so one may take \(\langle n|\frac{\partial|n\rangle}{\partial\lambda}=0\); this is like saying that an ant crawling on a sphere \(|\textbf x|^2=\text{const}\) must have orthogonal position and velocity \(\textbf x\cdot\dot{\textbf x}=0\).
The Hellmann-Feynman theorems are reminiscent of formulas such as:
in which one takes \(A=H\) and \(\lambda=t\), as well as the special case \(\phi=\psi\) of Ehrenfest’s theorem.
Problem #\(7\): Hence, by applying the Hellmann-Feynman theorems to a linearly perturbed Hamiltonian \(H=H_0+\lambda V\), deduce the \(O(\lambda^2)\) corrections to both the eigenvalues \(E_n\) and eigenstates \(|n\rangle\) of the unperturbed Hamiltonian \(H_0\) in the presence of a perturbation \(V\).
Solution #\(7\): Specialized to the case of this particular linearly perturbed Hamiltonian, one has trivially \(\frac{\partial H}{\partial\lambda}=V\). With this in mind, one simply takes the Hellmann-Feynman formulas and differentiates them again with respect to \(\lambda\):
For fun, here is a (failed!) attempt to compute the \(O(\lambda^3)\) eigenvalue correction (what’s the mistake?):
And for the \(O(\lambda^2)\) eigenstate correction, refer to this document.
TO DO: extend/generalize all the above discussion to degenerate and time dependent perturbation theory! Also find where to put the following snippets:
Problem: In the rotating frame of a Rabi drive \((\omega,\Omega)\) detuned by a small amount \(\delta:=\omega-\omega_0\) from a bare resonance, within the rotating wave approximation one has \(H=H_0+V\) with \(H_0=\frac{\hbar\delta}{2}\sigma_z\) and \(V=\frac{\hbar\Omega}{2}\sigma_x\). Evaluate the self-energy \(\Sigma_{|0\rangle,V}(E)\) in the one-dimensional subspace spanned by the ground state \(|0\rangle\), and show that it reproduces the \(2^{nd}\)-order AC Stark shift when evaluated on-shell (i.e. for \(E=-\hbar\delta/2\)).
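For reference, the only matrix element coupling \(|0\rangle\) to the excited state is \(\langle 1|V|0\rangle=\hbar\Omega/2\), so the self-energy should come out to:
\[\Sigma_{|0\rangle,V}(E)=\frac{(\hbar\Omega/2)^2}{E-\hbar\delta/2}\overset{E=-\hbar\delta/2}{=}-\frac{\hbar\Omega^2}{4\delta}\]
which is indeed the familiar \(2^{nd}\)-order AC Stark shift of the ground state.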
Although Feynman diagrams are often first encountered in statistical/quantum field theory contexts where they are employed in perturbative calculations of partition/correlation functions based on Wick’s theorem, there is a lot of “fluff” in these cases that obscures their underlying simplicity. The purpose of this post is therefore to build up to a simpler, intuitive view of what Feynman diagrams are really about that hopefully demystifies them.
Problem #\(1\): Evaluate the moments \(\langle x^n\rangle\) of a univariate normally distributed random variable \(x\) with zero mean \(\langle x\rangle=0\) (the choice of zero mean is motivated by the fact that in practice one only cares about central moments of the distribution, so to avoid writing \(x-\langle x\rangle\) everywhere it is convenient to just set \(\langle x\rangle:=0\)).
Solution #\(1\): It is clear that for odd \(n=1,3,5,…\), the integrand is an odd function so not only is \(\langle x\rangle=0\) by construction, but all higher odd moments also vanish \(\langle x^3\rangle=\langle x^5\rangle=…=0\). As for even \(n=0,2,4,…\), there are several ways:
Way #\(1\): Start with the \(n=0\) normalization (obtained in the usual Poissonian manner):
and differentiate the equation by \(\frac{\partial}{\partial(-1/2\sigma^2)}\) to pull down arbitrarily many factors of \(x^2\). One finds for instance:
\[\langle x^2\rangle=\sigma^2\]
\[\langle x^4\rangle=3\sigma^4\]
\[\langle x^6\rangle=15\sigma^6\]
\[\langle x^8\rangle=105\sigma^8\]
and so forth, in general following the rule \(\langle x^{2m}\rangle=(2m-1)!!\sigma^{2m}\) for even \(n=2m\), where the double factorial can also be written in terms of single factorials as:
\[(2m-1)!!=\frac{(2m)!}{2^mm!}\]
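These moments are easy to sanity-check numerically. A minimal sketch (the function names, the choice \(\sigma=1.3\), and the midpoint-rule grid are all arbitrary choices made here):

```python
from math import exp, sqrt, pi, factorial

def gaussian_moment(m, sigma=1.3, half_width=12.0, n=200_000):
    """Midpoint-rule estimate of <x^(2m)> for a zero-mean normal with std sigma."""
    a = half_width * sigma          # integrate over [-a, a]; the tails beyond are negligible
    h = 2 * a / n
    total = 0.0
    for i in range(n):
        x = -a + (i + 0.5) * h
        total += x ** (2 * m) * exp(-x * x / (2 * sigma * sigma))
    return total * h / (sigma * sqrt(2 * pi))

def predicted(m, sigma=1.3):
    # <x^(2m)> = (2m-1)!! sigma^(2m), with (2m-1)!! = (2m)!/(2^m m!)
    return factorial(2 * m) / (2 ** m * factorial(m)) * sigma ** (2 * m)

for m in range(5):
    assert abs(gaussian_moment(m) - predicted(m)) < 1e-5 * predicted(m)
```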
Way #\(2\): Substitute \(u:=x^2/2\sigma^2\) to recast the integral in terms of a gamma function:
Problem #\(2\): From Solution #\(1\), the presence of factorials suggests a combinatorial interpretation of the result; what is this interpretation?
Solution #\(2\): Suppose one has \(6\) people that need to be paired up for a dance; how many pairings can be formed? There are \(2\) ways to think about this.
Way #\(1\): The first person can be paired with \(5\) other people. Then, after they’ve been paired, the next person can only pair up with \(3\) more people. And after they’ve paired, the next person can only pair with the \(1\) other person that’s left. So the answer is \(5!!=5\times 3\times 1=15\) pairs.
Way #\(2\): There are \(6!\) permutations of the \(6\) people. However, they are going to form \(3\) pairs which can be permuted in \(3!\) ways. And within each of the \(3\) pairs, there are a further \(2!=2\) permutations. So in total there will be \(\frac{6!}{2^33!}=15\) pairs.
The fact that Way #\(1\) and Way #\(2\) give the same result is just a restatement of the earlier identity \((2m-1)!!=(2m)!/2^mm!\).
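This counting is also easy to verify by brute force; here is a minimal recursive sketch (the helper name `count_pairings` is made up here):

```python
from math import factorial

def count_pairings(people):
    # Pair up the first person with each possible partner, then recurse on the rest.
    if len(people) == 0:
        return 1
    rest = people[1:]
    return sum(count_pairings(rest[:i] + rest[i + 1:]) for i in range(len(rest)))

# 6 people -> 5!! = 15 pairings, matching (2m)!/(2^m m!) for m = 3
assert count_pairings(list(range(6))) == 15
assert all(count_pairings(list(range(2 * m))) == factorial(2 * m) // (2 ** m * factorial(m))
           for m in range(6))
```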
In this case however, the “people” are the factors of \(x\) in \(x^{2m}\)! Because all factors of \(x\) are indistinguishable, all \((2m-1)!!\) pairings of the \(2m\) factors of \(x\) in \(x^{2m}\) into \(m\) pairs of \(x^2\) are equivalent. The factor of \(\sigma^{2m}\) then follows on dimensional analysis grounds (it’s the only length scale of the normal distribution), and the numerical coefficient takes on this combinatorial pairing interpretation.
Problem #\(3\): Estimate the expectation \(\langle\cos(x/\sigma)\rangle\) in a univariate normal random variable \(x\) with variance \(\sigma^2\) and zero mean \(\langle x\rangle=0\).
Solution #\(3\): The integral defining \(\langle\cos(x/\sigma)\rangle\) will mostly receive contributions from small \(|x|\lesssim\sigma\), so one can hope to get a rough estimate of it by Maclaurin-expanding \(\cos\theta=1-\theta^2/2+\theta^4/24-\theta^6/720+…\):
(or one could have just recognized the earlier Maclaurin series for \(e^{-1/2}\)). So just taking the \(4\)th partial sum \(1-\frac{1}{2}+\frac{1}{8}-\frac{1}{48}=\frac{29}{48}\approx 0.60417\) already gets within \(0.4\%\) of the true answer. The moral: monomials/powers of \(x\) form a basis of analytic functions and expectation is linear, so by computing all the moments of a distribution, one in principle has access to the expectation of any analytic function with respect to that distribution.
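Numerically, the partial sums of \(\sum_m(-1)^m(2m-1)!!/(2m)!=\sum_m(-1/2)^m/m!\) can be checked in a couple of lines:

```python
from math import exp, factorial

def partial_sum(N):
    # sum_{m<N} (-1)^m (2m-1)!!/(2m)!  =  sum_{m<N} (-1/2)^m / m!
    return sum((-0.5) ** m / factorial(m) for m in range(N))

assert abs(partial_sum(4) - 29 / 48) < 1e-12    # 1 - 1/2 + 1/8 - 1/48
assert abs(partial_sum(4) - exp(-0.5)) < 0.004  # already within ~0.4% of e^{-1/2}
assert abs(partial_sum(20) - exp(-0.5)) < 1e-14 # rapidly convergent
```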
Problem #\(4\): Evaluate the cumulant generating function \(\ln\langle e^{\kappa x}\rangle\) of a univariate normal random variable \(x\) with variance \(\sigma^2\) and zero mean \(\langle x\rangle=0\).
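Solution #\(4\): Completing the square in the exponent of the Gaussian integral gives:
\[\ln\langle e^{\kappa x}\rangle=\ln\int_{-\infty}^{\infty}\frac{dx}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/2\sigma^2+\kappa x}=\ln e^{\sigma^2\kappa^2/2}=\frac{\sigma^2\kappa^2}{2}\]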
So it is a parabola in \(\kappa\) with curvature \(\sigma^2\) at its vertex. The point therefore is that, besides the \(2\)nd cumulant \(\sigma^2\), all other cumulants of the normal distribution vanish! For instance, the \(3\)rd cumulant (“skewness”) is \(\langle x^3\rangle=0\), the \(4\)th cumulant (“excess kurtosis”) is \(\langle x^4\rangle-3\langle x^2\rangle^2=0\), etc.
Problem #\(4.5\): Another fun application of these ideas: define the \(n\)-th (probabilist’s) Hermite polynomial \(\text{He}_n(x)\) to be the unique degree-\(n\) monic polynomial which is orthogonal to all lower-degree Hermite polynomials with respect to the Gaussian weight function \(e^{-x^2/2}\) over the real line \(\textbf R\). Hence, calculate the first \(5\) Hermite polynomials \(\text{He}_0(x),\text{He}_1(x),\text{He}_2(x),\text{He}_3(x),\text{He}_4(x)\).
Solution #\(4.5\): From the definition given above, \(\text{He}_0(x)\) must just be a constant, and the monic requirement fixes this constant to be \(1\); thus \(\text{He}_0(x)=1\). The next Hermite polynomial must have the form \(\text{He}_1(x)=x+c_0\). To fix \(c_0\), one thus requires that (using the fact that inner products with respect to a weight function are identical to expectations of products with respect to the weight function viewed as a probability distribution):
Enforcing \(\langle\text{He}_1\text{He}_0\rangle=\langle x+c_0\rangle=c_0=0\) gives \(\text{He}_1(x)=x\). Next, writing \(\text{He}_2(x)=x^2+c_1x+c_0\), the orthogonality conditions \(\langle\text{He}_2\text{He}_0\rangle=1+c_0=0\) and \(\langle\text{He}_2\text{He}_1\rangle=c_1=0\) force \(\text{He}_2(x)=x^2-1\). A similar procedure gives \(\text{He}_3(x)=x^3-3x\). At this point, to speed oneself up a bit, one could recognize that the Hermite polynomials alternate in parity \(\text{He}_n(-x)=(-1)^n\text{He}_n(x)\), so powers of \(x\) hop by \(2\). This motivates the more intelligent ansatz \(\text{He}_4(x)=x^4+c_2x^2+c_0\), automatically ensuring orthogonality with \(\text{He}_1(x)\) and \(\text{He}_3(x)\). Enforcing orthogonality with \(\text{He}_0(x)\) and \(\text{He}_2(x)\) gives the system of linear equations \(3+c_2+c_0=0\) and \(15+3c_2+c_0=0\), so \(c_0=3\) and \(c_2=-6\), which gives \(\text{He}_4(x)=x^4-6x^2+3\).
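These orthogonality claims can be checked directly against the Gaussian moments computed earlier, representing each polynomial by its coefficient list (a minimal sketch; the helper names are made up here):

```python
from math import factorial

def moment(n):
    # Standard-normal moments: 0 for odd n, (n-1)!! = n!/(2^(n/2) (n/2)!) for even n.
    return 0 if n % 2 else factorial(n) // (2 ** (n // 2) * factorial(n // 2))

def inner(p, q):
    # <p(x) q(x)> under the Gaussian weight, with p, q as coefficient lists [c0, c1, ...].
    return sum(a * b * moment(i + j) for i, a in enumerate(p) for j, b in enumerate(q))

# He_0 .. He_4 as computed above
He = [[1], [0, 1], [-1, 0, 1], [0, -3, 0, 1], [3, 0, -6, 0, 1]]
assert all(inner(He[i], He[j]) == 0 for i in range(5) for j in range(i))
assert inner(He[4], He[4]) == factorial(4)  # consistent with the standard norm <He_n^2> = n!
```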
(mention the exponential generating function of the Hermite polynomials, and the operator representation, any connections?)
Problem #\(5\): Consider generalizing the prior discussion of a univariate normal random variable \(x\) with variance \(\sigma^2\) and zero mean \(\langle x\rangle=0\) to a \(d\)-dimensional multivariate normal random vector \(\textbf x\in\textbf R^d\) with covariance matrix \(\sigma^2\) and zero mean \(\langle\textbf x\rangle=\textbf 0\). Write down the appropriate normalized probability density function \(\rho(\textbf x)\) for \(\textbf x\).
Solution #\(5\): In analogy with the \(d=1\) univariate normal distribution, one has:
(prove by diagonalizing the covariance matrix \(\sigma^2\) of \(\textbf x\)).
Problem #\(6\): What are the moment and cumulant generating functions of a \(d\)-dimensional multivariate normal random vector \(\textbf x\in\textbf R^d\) with covariance matrix \(\sigma^2\) and zero mean \(\langle\textbf x\rangle=\textbf 0\)?
Solution #\(6\): The phrase “moment generating function” is only really appropriate in \(d=1\); this is because in \(d\geq 2\), the generator \(\langle e^{\boldsymbol{\kappa}\cdot\textbf x}\rangle\) for the random vector \(\textbf x=(x_1,x_2,…,x_d)\) generates more than just moments along a given axis like \(\langle x_1^2\rangle, \langle x_2^4\rangle\) but also correlators such as \(\langle x_1x_2^3\rangle,\langle x_1x_2x_3\rangle\), etc. which obviously don’t exist in \(d=1\). Similar to the univariate case, the even \(\textbf Z_2\) symmetry of the multivariate generator means that only correlators with even powers of \(x_i\) survive, so for instance \(\langle x_1^2x_2x_3\rangle=0\). To compute such strictly even correlators, the quickest way is typically to just compute the relevant term in the Maclaurin expansion of the generator:
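These even correlators are exactly what the Isserlis/Wick theorem computes: sum over all pairings of the factors, with each pair contributing a covariance matrix element. A minimal recursive sketch (the function name and the example covariance matrix are made up here):

```python
def wick(indices, cov):
    # Isserlis/Wick theorem: <x_{i1} x_{i2} ...> = sum over pairings of products of covariances.
    if len(indices) == 0:
        return 1.0
    if len(indices) % 2:
        return 0.0  # odd correlators vanish by the Z_2 symmetry
    first, rest = indices[0], indices[1:]
    # Pair the first factor with each remaining factor, then recurse on what's left.
    return sum(cov[first][rest[k]] * wick(rest[:k] + rest[k + 1:], cov)
               for k in range(len(rest)))

cov = [[1.0, 0.5], [0.5, 2.0]]  # an example covariance matrix sigma^2
assert abs(wick((0, 0, 1, 1), cov) - 2.5) < 1e-12  # s11 s22 + 2 s12^2
assert abs(wick((0, 1, 1, 1), cov) - 3.0) < 1e-12  # 3 s12 s22
assert wick((0, 0, 1), cov) == 0.0                 # odd correlator vanishes
```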
Problem #\(7\): Explain why, for an arbitrary analytic random function \(f(\textbf x)\) of an arbitrary (i.e. not necessarily normal) random vector \(\textbf x\), the expectation is: