Problem: Consider placing a fictitious open surface in an equilibrium ideal gas at temperature \(T\); although the net particle current density through such a surface is \(\textbf J=\textbf 0\), show that if one counts only the particles crossing the surface from one side to the other, the resulting unidirectional particle current density \(J\) is non-zero and given by:
\[J=\frac{1}{4}n\langle v\rangle\]
where \(n=p/k_BT\) is the number density and \(\langle v\rangle=\sqrt{8k_BT/\pi m}\) the average speed.
Solution:
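As a sanity check on the claimed result, here is a Monte Carlo sketch in natural units \(m=k_BT=1\) (an assumption for illustration), so each velocity component is a unit Gaussian and \(\langle v\rangle=\sqrt{8/\pi}\); the one-way flux is \(J=n\langle v_z\,\Theta(v_z)\rangle\):

```python
import math, random

random.seed(0)
N = 200_000
n = 1.0  # number density (natural units m = k_B*T = 1)

# Each velocity component is Gaussian with variance k_B*T/m = 1.
# Only particles with v_z > 0 cross the surface from below.
vz_positive = [max(random.gauss(0.0, 1.0), 0.0) for _ in range(N)]
J_mc = n * sum(vz_positive) / N

J_theory = 0.25 * n * math.sqrt(8.0 / math.pi)  # (1/4) n <v>
print(J_mc, J_theory)  # both ≈ 0.399
```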
Problem: By an analogous calculation, show that the unidirectional kinetic energy current density \(S\) for an ideal gas (which one might also think of as a heat flux \(S=q\)) is given by:
\[S=\frac{1}{2}nk_BT\langle v\rangle\]
And hence, show that the average kinetic energy of particles hitting a wall is enhanced by a factor of \(4/3\) compared to the bulk kinetic energy \(\frac{3}{2}k_BT\) per particle.
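A similar Monte Carlo sketch (same illustrative natural units \(m=k_BT=1\)) checks that the flux-weighted average kinetic energy of particles hitting a wall is \(2k_BT\), i.e. \(4/3\) times the bulk \(\frac{3}{2}k_BT\):

```python
import random

random.seed(1)
N = 200_000
# Natural units: m = 1, k_B*T = 1, so each velocity component is a unit Gaussian.
num = den = 0.0
for _ in range(N):
    vx, vy, vz = (random.gauss(0.0, 1.0) for _ in range(3))
    if vz > 0:  # particle heading toward the wall
        ke = 0.5 * (vx * vx + vy * vy + vz * vz)
        num += vz * ke  # flux-weighted kinetic energy
        den += vz
print(num / den)  # ≈ 2 (i.e. 2 k_B*T, which is (4/3) * (3/2) k_B*T)
```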
Problem #\(1\): Describe how the classical Hall coefficient \(\rho^{-1}\) is defined and explain why it’s “causally intuitive”.
Solution #\(1\): In the classical Hall effect, the “cause” is both an applied current density \(J\) together with an applied perpendicular magnetic field \(B\). The “effect” is an induced transverse electric field \(E\) whose magnitude and direction are such as to ensure a velocity selector steady state. So it seems reasonable to define the Hall coefficient by:
\[E=\rho^{-1}JB\]
where the notation \(\rho^{-1}\) is deliberately suggestive: the Hall coefficient turns out to be the reciprocal of the charge density, which is precisely what makes the Hall effect useful. Note that here \(E\) only represents the transverse component of the electric field, i.e. \(E=-\textbf E\cdot(\textbf J\times\textbf B)/JB\), as there may also be a longitudinal component, e.g. to compensate for scattering and other resistances.
The simplest way to derive this is to just set the Lorentz force density \(\textbf f\) on the charge carriers to zero:
\[\textbf f=\rho\textbf E+\textbf J\times\textbf B=\textbf 0\]
Since \(J,B\) are applied by the experimentalist, they are readily known, and \(E\) can be obtained by measuring a suitable Hall voltage \(\Delta\phi_H=-\int d\textbf x\cdot\textbf E\) in the transverse direction (voltages are always experimentally accessible). The classical Hall effect therefore provides a simple way to directly measure the charge density \(\rho\) (via the Hall coefficient \(\rho^{-1}\)), and hence the number density of charge carriers \(n=\rho/(\pm e)\). Strictly speaking this assumes a single charge carrier species; for semiconductors it would be a bit more complicated, so in fact perhaps one shouldn’t denote the Hall coefficient by \(\rho^{-1}\) but just by \(R_H\).
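As an illustrative (hypothetical) numerical example of such a measurement, using the textbook carrier density of copper \(n\approx 8.5\times 10^{28}\text{ m}^{-3}\): the Hall voltage across a thin strip is \(V_H=IB/net\), and inverting it recovers \(n\):

```python
e = 1.602176634e-19  # elementary charge (C)
n = 8.5e28           # carrier density of copper (m^-3), textbook value

# A hypothetical Hall-bar measurement: current I through a strip of
# thickness t in a perpendicular field B produces a Hall voltage V_H.
I, B, t = 1.0, 1.0, 1e-4  # A, T, m
V_H = I * B / (n * e * t)  # ≈ 0.73 microvolts: tiny, as expected for a metal

# Inverting the measurement recovers the carrier density (via rho^{-1} = E/JB):
n_measured = I * B / (e * t * V_H)
print(V_H, n_measured)
```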
Problem #\(2\): Delve into the quantum Hall effect.
In sufficiently symmetric geometries, the method of images provides a way to solve Poisson’s equation \(|\partial_{\textbf x}|^2\phi=-\rho/\varepsilon_0\) in a domain \(V\) subject to either Dirichlet or Neumann boundary conditions (required for the uniqueness theorem to hold) by strategically placing charges in the “unphysical region” \(\textbf R^3-V\) such as to ensure the boundary conditions are met. It works because of linearity and the fact that by placing image charges outside the physical region \(V\), one isn’t tampering with \(\rho\) in that region, so Poisson’s equation truly is solved.
In the following problems, the goal is to compute (in the suggested order):
The electrostatic potential \(\phi(\textbf x)\) everywhere (i.e. both in regions of free space and inside materials).
The electrostatic field \(\textbf E(\textbf x)=-\frac{\partial\phi}{\partial\textbf x}\)
The induced charge density \(\sigma\) on any conducting surfaces, along with the total charge \(Q\) on such surfaces.
The force \(\textbf F\) between any conductors.
The internal fields (\(\textbf D,\textbf E,\textbf P,\phi\)) and bound charge distributions \(\rho_b,\sigma_b\) for any dielectrics.
The resistance/self-capacitance/self-inductance/mutual capacitance/mutual inductance of any conductors? (although that isn’t really electrostatics anymore…)
Problem: Consider placing a point charge \(q\) at the point \((0,0,z)\) a distance \(z\) from an infinite planar conductor at \(z=0\).
Solution: Place an image point charge \(-q\) at \((0,0,-z)\).
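One consequence worth checking numerically: the induced surface charge density on the plane is \(\sigma(\rho)=-qd/2\pi(\rho^2+d^2)^{3/2}\) (the standard result from the normal component of the superposed real-plus-image field; here \(d\) denotes the charge’s height and \(\rho\) the in-plane radius), and integrating it over the plane should give total induced charge exactly \(-q\). A sketch:

```python
import math

q, d = 1.0, 1.0  # point charge and its height above the grounded plane (natural units)

def sigma(rho):
    """Induced surface charge density on the plane z = 0 (standard image result)."""
    return -q * d / (2.0 * math.pi * (rho**2 + d**2) ** 1.5)

# Integrate over the plane in rings: dQ = sigma(rho) * 2*pi*rho * drho.
drho, R = 0.01, 400.0
Q = 0.0
for i in range(int(R / drho)):
    rho = (i + 0.5) * drho  # midpoint rule
    Q += sigma(rho) * 2.0 * math.pi * rho * drho
print(Q)  # ≈ -1, i.e. total induced charge -q (tail beyond R contributes ~ d/R)
```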
Problem: Now instead of a point charge, consider a line charge with linear charge density \(\chi\).
Solution:
Problem: Instead of a line charge, place an infinite cylinder of charge of radius \(a\).
So eliminating the \(\cos\phi\), one finds that it is indeed possible to isolate solely for the ratio \(\rho_1/\rho_2\) as a function of constant parameters, confirming that it is an equipotential surface as required.
Aside: this is nothing more than Apollonius’s construction of a circle as the set of all points whose distances \(\rho,\rho'\) from \(2\) “foci” are in a fixed ratio \(\rho'/\rho\). Indeed, if the two foci are separated by a “semi-axis” \(a\) (thus their full separation is \(2a\)), then the distance \(d\) from the midpoint of the two foci to the center of the Apollonian circle and its radius \(R\) satisfy (using the extreme points on the circle):
Problem: Consider a magnetic dipole \(\boldsymbol{\mu}\) suspended above a superconducting surface so that on this surface all magnetic fields are expelled.
Problem: An electrostatic dipole \(\boldsymbol{\pi}\) at some distance from a conducting plane?
Problem: Consider a conducting sphere in an asymptotically uniform background electrostatic field \(\textbf E_0\).
Problem: Replace the conducting sphere by an insulating sphere (aka a linear dielectric sphere) of permittivity \(\varepsilon\) (comment on how this relates to the Clausius-Mossotti relation).
Problem: Instead of a linear dielectric sphere, consider a linear diamagnetic sphere in a uniform magnetic field \(\textbf B_0\).
Problem: Consider an \(N\)-gon of conducting sheets (quadrupole, octupole, etc.)
Problem: A point charge in a conducting spherical cavity (Green’s function for that domain).
Problem: A point charge outside the sphere.
Problem: (example with infinitely many image point charges?)
These ideas extend immediately to potential flows in fluid mechanics…describe all the analogous situations and analogous results without doing all the work again. Similarly for steady-state temperature distributions, and anywhere that Laplace’s equation with suitable boundary conditions shows up.
Problem: Distinguish between the terms “intrinsic semiconductor” and “extrinsic semiconductor”.
Solution: An intrinsic semiconductor is pretty much what it sounds like, i.e. a “pure” semiconductor material like \(\text{Si}\) that is undoped with any impurity dopants. An extrinsic semiconductor is then basically the negation of an intrinsic semiconductor, i.e. one which is doped with impurity dopants, although conceptually one can think of it as being doped with charge carriers (either holes \(h^+\) in a \(p\)-type extrinsic semiconductor or electrons \(e^-\) in an \(n\)-type extrinsic semiconductor).
Problem: In the phrases \(p\)-type semiconductor and \(n\)-type semiconductor, what do the \(p\) and \(n\) represent?
Solution: In both cases, the extrinsic semiconductor (isolated from anything else) is neutral, even when doped. Rather, the \(p\) and \(n\) refer to the majority mobile/free charge carriers in the corresponding semiconductor, i.e. holes in the valence band and electrons in the conduction band respectively.
Problem: Show that the equilibrium number density \(n_{e^-}\) of mobile conduction electrons (i.e. not including the immobile core/valence electrons) thermally excited into the conduction band at temperature \(T\) is exponentially related to the gap \(E_C-\mu\) between the energy \(E_C\) at the base of the conduction band and the Fermi level \(\mu\):
\[n_{e^-}=n_Ce^{-\beta(E_C-\mu)}\]
where the so-called effective density of states:
\[n_C:=\frac{2g_v}{\lambda^{*3}_T}\]
is \(\approx\) the number density of available conduction band states at temperature \(T\) (here \(g_v\) is the valley degeneracy and \(\lambda^*_T\) is the thermal de Broglie wavelength computed with the electron’s effective mass \(m^*\)).
Solution: To clarify some of the approximations used in that line with the \(\approx\), the upper bound on the conduction band \(E_{C,\text{max}}\to\infty\) can be safely taken to infinity because of the exponential suppression of the integrand by the Fermi-Dirac distribution for \(E\gg\mu\) (in fact, using Fermi-Dirac statistics in the first place assumes the electrons interact solely through Pauli blocking). In addition, the density of states \(g_C(E)\) is approximated by that of a free particle in the neighbourhood of the conduction band valley (with the usual \(\sqrt{E}\mapsto \sqrt{E-E_C}\) because \(g_C(E)=0\) in the \(E\in[E_V,E_C]\) band gap) and with \(m\mapsto m^*\) to reflect the local curvature of the conduction band which is inherited from the strength of the lattice’s periodic potential. Finally, to strengthen the earlier claim that \(E\gg\mu\), indeed, \(E\geq E_C\) is the range of the integral, and so a sufficient condition for \(E\gg\mu\) is \(E_C\gg\mu\) (in practice a few \(k_BT\) is sufficient). This is assumed to be the case, and constitutes the assumption of a non-degenerate semiconductor (cf. non-degenerate Fermi gas). In this case, the Fermi-Dirac distribution boils down to just its “Boltzmann tail” \(\frac{1}{e^{\beta(E-\mu)}+1}\approx e^{-\beta(E-\mu)}\):
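To get a feel for the magnitude of \(n_C\), here is a sketch with illustrative silicon-like numbers (the per-valley density-of-states effective mass \(m^*\approx 0.32\,m_e\) and valley degeneracy \(g_v=6\) are commonly quoted textbook values, assumed here rather than derived):

```python
import math

hbar = 1.054571817e-34  # J*s
k_B = 1.380649e-23      # J/K
m_e = 9.1093837015e-31  # kg

# Illustrative silicon-like parameters (textbook values, not derived here):
m_eff, g_v, T = 0.32 * m_e, 6, 300.0

# Thermal de Broglie wavelength with the effective mass.
lam = math.sqrt(2.0 * math.pi * hbar**2 / (m_eff * k_B * T))  # m
n_C = 2.0 * g_v / lam**3   # effective density of states (m^-3)
n_C_cm3 = n_C * 1e-6
print(f"{n_C_cm3:.2e} cm^-3")  # ~3e19, the right ballpark for Si at 300 K
```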
Problem: Repeat the above problem for holes to derive an analogous result for the equilibrium number density \(n_{h^+}\) of free conduction holes excited into the valence band at temperature \(T\):
\[n_{h^+}=n_Ve^{-\beta(\mu-E_V)}\]
where \(n_V\) is almost the same as \(n_C\) except that it’s derived from the effective mass \(m^*\) of the holes at the top of the valence band.
Solution: A few comments: if \(f(E)\) is the Fermi-Dirac distribution for electrons, then by the very definition of a hole as a vacancy/absence of an electron, the analog of the Fermi-Dirac distribution for holes (which can be considered fermionic quasiparticles) is \(1-f(E)\). In addition, a hole is considered to have more energy when it goes “downward” on a typical band diagram where the vertical axis \(E\) is really referring to the electron’s energy. This explains the counterintuitive limits on the integral:
Problem: If an intrinsic semiconductor is doped with impurity dopants to create an extrinsic semiconductor, say with a number density \(n_{d^+}\) of cationized donor dopants and \(n_{a^-}\) of anionized acceptor dopants, what constraint does charge neutrality of the semiconductor impose among the concentrations \(n_{e^-},n_{h^+},n_{d^+},n_{a^-}\)?
Solution:
\[-en_{e^-}+en_{h^+}+en_{d^+}-en_{a^-}=0\]
\[n_{h^+}+n_{d^+}=n_{e^-}+n_{a^-}\]
Conceptually, for every electron excited into the conduction band, the corresponding donor atom becomes cationized; similarly, every hole excited into the valence band is really an acceptor atom anionizing as it accepts an electron from the valence band, so in the equation it is conceptually meaningful to pair up \((n_{e^-},n_{d^+})\) and \((n_{h^+},n_{a^-})\). Note however that this is not to say the paired concentrations are equal, though they approach equality the more heavily one dopes.
Problem: Show that, in an intrinsic semiconductor, the Fermi level \(\mu\) lies almost (but not exactly) at the midpoint \(\frac{E_V+E_C}{2}\) of the band gap.
Solution: An intrinsic semiconductor is undoped so \(n_{d^+}=n_{a^-}=0\). This implies from the charge neutrality argument above that \(n_{e^-}=n_{h^+}\) (i.e. every electron excited into the conduction band leaves a hole in the valence band). The rest of the argument is then just plugging in the earlier equilibrium free charge carrier concentrations and algebra:
In what follows, it will be useful to call this particular value of \(\mu\) the intrinsic Fermi level \(\mu_i\) since it is the Fermi level of an intrinsic semiconductor, prior to any extrinsic doping.
Problem: Define the intrinsic charge carrier concentration by \(n_i:=n_{e^-}=n_{h^+}\) for an intrinsic semiconductor, so that one has the so-called law of mass action \(n_{e^-}n_{h^+}=n_i^2\) (i.e. \(n_i^2\) is just a \(T\)-dependent equilibrium constant for the dissociation reaction \(0\to e^-+h^+\)). Show that the precise \(T\)-dependence of \(n_i\) is given by:
\[n_i\sim T^{3/2}e^{-E_g/2k_BT}\]
where the band gap \(E_g:=E_C-E_V\) (this result is sometimes also presented as \(n_i=n_Se^{-\beta E_g/2}\) where \(n_S:=\sqrt{n_Cn_V}\) is the geometric mean of the effective densities of states of the conduction and valence bands).
Solution:
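Plugging in illustrative textbook values for \(\text{Si}\) at \(300\text{ K}\) (assumed here, not derived) reproduces the familiar order of magnitude \(n_i\sim 10^{10}\text{ cm}^{-3}\):

```python
import math

# Illustrative textbook values for Si at T = 300 K (assumptions):
n_C, n_V = 2.8e19, 1.0e19  # effective densities of states (cm^-3)
E_g = 1.12                  # band gap (eV)
kT = 0.02585                # k_B * T at 300 K (eV)

# Law of mass action form: n_i = sqrt(n_C * n_V) * exp(-E_g / 2kT)
n_i = math.sqrt(n_C * n_V) * math.exp(-E_g / (2.0 * kT))
print(f"{n_i:.1e} cm^-3")  # ~1e10 cm^-3, the familiar intrinsic density of Si
```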
Problem: Keeping temperature \(T\) fixed, consider \(n\)-type doping an intrinsic semiconductor with donor dopants, thus creating an \(n\)-type extrinsic semiconductor. The effect of this will be to raise the Fermi level from the intrinsic Fermi level \(\mu_i\) (approximately in the middle of the band gap as shown earlier) to a new value \(\mu\) much closer to the base of the conduction band \(E_C\). Show that the precise amount of this raising can be quantified by:
\[\mu-\mu_i=k_BT\ln\frac{n_d}{n_i}\]
where \(n_d\) is the (directly manipulable!) concentration of donor dopants doped into the intrinsic semiconductor, stating the \(2\) key assumptions underlying this.
(note, an analogous line of reasoning for a \(p\)-type semiconductor shows that the Fermi level is lowered by an amount:
\[\mu_i-\mu=k_BT\ln\frac{n_a}{n_i}\]
where \(n_a\) is the (also directly manipulable) concentration of acceptor dopants).
Solution: At equilibrium, one has for an intrinsic semiconductor:
\[n_i=n_Ce^{-\beta(E_C-\mu_i)}\]
and for an \(n\)-type doped extrinsic semiconductor:
\[n_{e^-}=n_Ce^{-\beta(E_C-\mu)}\]
Taking the ratio yields:
\[\mu-\mu_i=k_BT\ln\frac{n_{e^-}}{n_i}\]
At this point, the goal is to justify why \(n_{e^-}\approx n_{d}\). This proceeds in \(2\) stages.
First, justify that \(n_{e^-}\approx n_{d^+}\), the concentration of cationized donor dopants. This follows by setting \(n_{a^-}=0\) in the earlier charge neutrality constraint (since there are no acceptor dopants added, \(n_a=0\Rightarrow n_{a^-}=0\)) and using the law of mass action to replace \(n_{h^+}=n_i^2/n_{e^-}\), obtaining a quadratic equation for \(n_{e^-}\) whose physical solution is:
\[n_{e^-}=\frac{n_{d^+}+\sqrt{n_{d^+}^2+4n_i^2}}{2}\]
At this point, one assumes that the semiconductor is fairly heavily doped, in particular \(n_{d^+}\gg n_i\) (typical values in \(\text{Si}\) are \(n_d\sim 10^{16}\text{ cm}^{-3}\) while \(n_i\sim 10^{10}\text{ cm}^{-3}\)). This allows one to approximate \(n_{e^-}\approx n_{d^+}\).
Second, to justify why \(n_{d^+}\approx n_d\), one has to assume that the donor dopants are shallow in the sense that the binding energy of their extra valence electron is comparable to \(k_BT\), so it is easily excited into the conduction band. In other words, one assumes almost complete cationization of the donor dopants, which is just the statement that \(n_{d^+}\approx n_d\), as desired.
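As a quick numerical sketch of the size of this effect (illustrative values: \(n_d=10^{16}\text{ cm}^{-3}\) and \(n_i=10^{10}\text{ cm}^{-3}\), as quoted above for Si):

```python
import math

kT = 0.02585   # k_B * T at 300 K (eV)
n_d = 1e16     # donor concentration (cm^-3), an illustrative doping level
n_i = 1e10     # intrinsic carrier density of Si (cm^-3), textbook value

shift = kT * math.log(n_d / n_i)  # mu - mu_i
print(f"{shift:.3f} eV")  # ≈ 0.36 eV: the Fermi level moves well up toward E_C
```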
Problem: In what regime is the non-degenerate semiconductor approximation valid?
Solution: For an \(n\)-type extrinsic semiconductor, a rule of thumb is that the Fermi level cannot rise to within \(2k_BT\) of the base of the conduction band \(E_C\):
\[E_C-\mu\geq 2k_BT\]
Inserting \(n_d=n_Ce^{-\beta (E_C-\mu)}\) (where the approximation \(n_{e^-}\approx n_d\) has been employed), one arrives at the rule of thumb that the donor dopant concentration cannot exceed:
\[n_d\leq e^{-2}n_C\approx 0.14 n_C\]
Similarly, for \(p\)-type doping, the acceptor dopant concentration cannot exceed:
\[n_a\leq 0.14n_V\]
(the more important takeaway here is not the exact numerical prefactors, but the fact that the Fermi level should stay a few \(k_BT\) away from \(E_C\) or \(E_V\) for the semiconductor to be considered non-degenerate; indeed these estimates came from using the Boltzmann formula that was derived from this assumption, so it should be taken with a grain of salt as one is using a theory to predict its own demise).
Problem: A \(p\)-\(n\) junction is formed by putting a \(p\)-type extrinsic semiconductor in contact with an \(n\)-type extrinsic semiconductor. Starting from a simple “top-hat” distribution of the free charge density \(\rho_f(x)\) in the depletion region \(-x_p<x<x_n\), sketch \(\rho_f(x)\), \(E(x)\) and \(\phi(x)\).
Solution: Assuming an abrupt junction and sharp cutoff for the depletion region at \(x_p, x_n\) respectively, one has:
Problem: Make the sketches more quantitative. In particular, calculate the width \(x_n+x_p\) of the depletion region and the maximum strength \(E_{\text{max}}\) of the electrostatic field at the junction \(x=0\).
Solution:
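A numeric sketch under the standard abrupt-junction results (assumed here, not derived): integrating the top-hat \(\rho_f\) twice gives the depletion width \(W=x_n+x_p=\sqrt{2\varepsilon V_{\text{bi}}(n_a+n_d)/en_an_d}\), and since \(E(x)\) is triangular, \(V_{\text{bi}}=\frac{1}{2}E_{\text{max}}W\). With illustrative Si-like numbers:

```python
import math

e = 1.602176634e-19
eps0 = 8.8541878128e-12
eps = 11.7 * eps0          # permittivity of Si (illustrative)
V_bi = 0.7                 # built-in potential (V), a typical Si value
n_a = n_d = 1e22           # doping (m^-3), i.e. 1e16 cm^-3 on each side

# Standard abrupt-junction results (from integrating the top-hat rho_f twice):
W = math.sqrt(2.0 * eps * V_bi * (n_a + n_d) / (e * n_a * n_d))
E_max = 2.0 * V_bi / W     # field profile is triangular, so V_bi = (1/2) E_max W
print(f"W = {W*1e6:.2f} um, E_max = {E_max:.2e} V/m")  # sub-micron width, ~MV/m field
```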
Problem: Explain qualitatively how a depletion region forms at a \(p\)-\(n\) junction.
Solution: Intuitively, the electrons and holes cannot just keep diffusing indefinitely across the \(p\)-\(n\) junction because at some point too much like charge will clump on either side during recombination, preventing any further diffusion. Put another way, as the charge separation gets bigger and bigger, the induced \(\textbf E\)-field pointing from \(n\to p\) exerts an electric force on the electrons and holes that prevents them from crossing the junction; at equilibrium, this forms a depletion region where there are no mobile charge carriers.
Problem: Just as a harmonic oscillator can be free or driven, so a \(p\)-\(n\) junction can also be “free” (just sitting there with its built-in potential \(V_{\text{bi}}\)) or it can be “driven” as well in a sense, more precisely by applying an external voltage \(V\) across it. However, unlike say with resistors/capacitors/inductors where the polarity of this voltage doesn’t really matter, here the asymmetry of the \(p\)-type vs. \(n\)-type semiconductors on either side, and thus the corresponding asymmetry of \(V_{\text{bi}}\), means that the polarity of \(V\) matters. Sketch qualitative band diagrams to show how the \(p\)-\(n\) junction’s bands change in the case of both forward bias \(V>0\) or reverse bias \(V<0\). This underlies the principle of operation for some (though not all) kinds of diodes, sometimes called a \(p\)-\(n\) semiconductor diode.
Solution: Some words of explanation: forward biasing a \(p\)-\(n\) junction lowers the effective built-in potential from \(V_{\text{bi}}\mapsto V_{\text{bi}}-V\). This clearly increases the conductivity of both electrons and holes across the junction now that the energy barrier is reduced. By contrast, reverse biasing a \(p\)-\(n\) junction instead raises the effective built-in potential \(V_{\text{bi}}\mapsto V_{\text{bi}}+|V|\), reducing the conductivity of electrons and holes as the depletion region gets bigger.
When the \(p\)-\(n\) junction is initially unbiased so that \(V=0\):
After forward biasing \(V>0\):
The reverse-biased case is just opposite to the forward-biased case, and not shown. Note also that this is not an instance of quantum tunnelling: it is not a simple top-hat potential barrier, and there is no probability current across the depletion region. The current that does flow is a diffusion current, as elaborated later.
Another way to put it is that forward-biasing the \(p\)-\(n\) junction encourages the majority charge carriers on each side to diffuse across the depletion region (and discourages the minority carriers, but that doesn’t matter anyways because they are minority), while reverse-bias is the opposite.
Problem: Recall that in an intrinsic semiconductor, at finite \(T>0\), the very few charge carriers in the conduction band are purely thermal electrons excited from their corresponding thermal holes in the valence band. Then, respectively \(p\)-type or \(n\)-type doping the intrinsic semiconductor, the creation of hydrogenic acceptor states just above the valence band or donor states just below the conduction band causes respectively holes to become the majority charge carrier (in the valence band) and electrons to become the minority charge carrier (in the conduction band) in the \(p\)-type extrinsic semiconductor, and vice versa for \(n\)-type (that isn’t to say that the thermal electrons and thermal holes \(n_{e^-,i}=n_{h^+,i}=n_i\sim 1.5\times 10^{10}\text{ cm}^{-3}\) aren’t still there, just they become negligible).
Solution:
Problem: Calculate the reverse saturation current \(I_{\text{sat}}:=-\lim_{V\to-\infty}I(V)\) in the Shockley diode law \(I(V)=I_{\text{sat}}(e^{V/V_T}-1)\) for a \(p\)-\(n\) junction semiconductor diode with the following parameters:
\[n_a=\]
To sort out later:
At T=0 K, mu is not really well-defined (b/c g(E)=0 in the band gap, so mu could be put anywhere in there..) but for T>0 K it is well-defined…
For n-type doping, by putting extra atoms near the bottom of the conduction band, will increase the chemical potential…(all this comes from interpreting mu as a silly fit parameter that needs to be tuned to get integral g(E)f(E) = number density of conduction electrons in the system = number density of donor dopants).
For p-type doping, add strongly electronegative atoms, they rip off electrons from the valence band. leaving additional holes in the valence band.
w.r.t. intrinsic concentrations of electrons and holes at T=300 K,
something about asymmetry of the densities of states…the presence of the donor and acceptor states in the band gap influences g(E) there…
Problem: A vibrating string with displacement profile \(y(x,t)\), non-uniform mass per unit length \(\mu(x)\), and non-uniform tension \(T(x)\) experiences both an internal restoring force due to \(T(x)\) and a linear “Hooke’s law” restoring force \(-k(x)y(x)\) everywhere, so that its equation of motion is:
\[\mu(x)\frac{\partial^2y}{\partial t^2}=\frac{\partial}{\partial x}\left(T(x)\frac{\partial y}{\partial x}\right)-k(x)y\]
Problem: Make the “Noetherian interpretation” above more concrete by showing that the eigenvalue \(\omega^2\) can be expressed as a Rayleigh-Ritz quotient:
\[\omega^2[\psi]=\frac{\int dx\left(T(x)\psi'^2(x)+k(x)\psi^2(x)\right)}{\int dx\,\mu(x)\psi^2(x)}\]
making it look like the usual Rayleigh-Ritz quotient employed in the quantum mechanical variational principle (although this glosses over the subtlety about boundary terms in the integration by parts).
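A quick numerical check of the quotient \(\omega^2[\psi]=\int dx\,(T\psi'^2+k\psi^2)/\int dx\,\mu\psi^2\) (fixed ends assumed, so the boundary terms vanish): for a uniform string with \(T=\mu=1\), \(k=0\) on \([0,1]\), evaluating it on the exact ground state \(\psi=\sin\pi x\) should return the eigenvalue \(\pi^2\):

```python
import math

T = mu = 1.0   # uniform tension and mass density
k = 0.0        # no Hooke's-law restoring force
N = 100_000
dx = 1.0 / N

num = den = 0.0
for i in range(N):
    x = (i + 0.5) * dx  # midpoint rule on [0, 1]
    psi = math.sin(math.pi * x)
    dpsi = math.pi * math.cos(math.pi * x)
    num += (T * dpsi**2 + k * psi**2) * dx
    den += mu * psi**2 * dx
omega2 = num / den
print(omega2, math.pi**2)  # both ≈ 9.8696
```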
Problem: Conversely, show that if one considers \(\omega^2=\omega^2[\psi]\) as a functional of \(\psi(x)\), then the functional is stationary on eigenstates \(\psi\) of the Sturm-Liouville operator \(H\) and with eigenvalue \(\omega^2[\psi]\).
Solution:
Problem: Show that eigenfunctions of the Sturm-Liouville operator with distinct eigenvalues are \(\mu\)-orthogonal.
Solution: (this proof assumes the eigenvalues have already been shown to be real; the proof of that essentially mirrors this proof, except setting the two eigenfunction indices equal, \(1=2\)):
Problem: Solve the following inhomogeneous \(2^{\text{nd}}\)-order ODE:
Problem: Why is it important to learn how to learn?
Solution: In a post-AGI world where each day feels like swallowing a fire hose of new (mis!)information (especially as a researcher), the most important arrow one can have in the quiver is the ability to learn deeply, rapidly and with long shelf life. Equivalently, each piece of knowledge is a vector \(\mathbf k:=(\mu,\nu,\rho)\in (0,\infty)^3\) along \(3\) orthogonal axes (understanding, velocity, and retention), and the goal is for all \(\mathbf k\)-vectors to flow along the vector field of enlightenment:
Equivalently, in the time domain, it is essential to recognize that learning is just fighting an uphill battle against the entropy of the Ebbinghaus \(\mu(t)\) forgetting curve:
Problem: Describe a general (i.e. field-independent) method for learning any topic in any field deeply.
Solution: The ACADEMIA learning method involves a sequence of \(8\) nonlinear steps (i.e. one will often bounce back and forth among them). For each step, one should only invoke LLM assistance after a sincere personal endeavour has been made.
Ask: questions to anchor one’s razor focus on as one seeks the answers through learning.
Consolidate: compress/condense/compact raw material as rigorous, concise bullets, exploiting field-specific tricks.
Analogies: connect/bridge/one-to-one dictionary mappings onto one’s prior knowledge graph as much as possible.
Diagrams: graphs (both \(\{(x,f(x))\}\) and \((V,E)\)), matrices/tables, geometric constructions, heuristic representations, etc.
Examples: concrete numbers to clarify abstract scaffolding.
Mentor: Feynman technique of explaining without jargon as if to a \(12\)-year-old.
Interrogate: play devil’s advocate, adversarially seek to find flaws or inefficiencies in assumptions and conventional wisdom.
Acronym: summarize the new understanding gained with a memorable mnemonic.
Problem: Demonstrate a walk-through of the ACADEMIA learning method applied to the financial topic of stocks and bonds.
Solution:
A:
Problem: While the above demonstrates how to learn deeply (i.e. along the \(\mu\)-axis), it does not fully address the issues of how to learn quickly \(\nu\) or how to ensure retention \(\rho\).
Solution:
Forcing function: being in the right external environment with pressure and impetus to keep one motivated to learn (e.g. a library/office/coffee shop with the right incentive structures/deadlines).
Preparation: get into the right frame of mind by taking deep breaths until one feels sufficiently calm and steady to begin a deeply focused session of learning. Earmuffs on.
Reinforce: solve or answer (possibly LLM-generated) problems/exercises/quizzes/hands-on interactives/coding programs with (possibly LLM-generated) solutions.
Problem: How about high velocity? And high retention?
Solution: Listening to podcasts, popular talks, YouTube videos, etc. while instructive in their own right are not a substitute for doing it oneself.
Problem: What is the attitude with which one should approach learning?
Solution: One should only ever compare oneself to who one was yesterday, and ask whether one has improved since then. With consistent routines, growth compounds exponentially: \(1.01^{365}\approx 38\). Finally, practice what you preach! Learning for the joy of learning is best, though not always possible. The systematic method described above can seem a bit intimidating, but the little bit of extra time spent is hopefully well worth it in the long run, and over time one will get better/faster at it (autopilot) and also improve upon it through feedback loops.
(also, the problem of how to structure one’s breaks? Pomodoro seems a bit too rigid, can interrupt flow state/groove. )
(also, comments on how to get into the groove state in the first place, I think getting over that initial activation energy barrier is key)
(another comment: perhaps meditation before working? or some form of deep breaths/emotional control to get primed for the right state of mind?)
(keeping feet stationary on the floor, no shaking, perhaps can build some kind of mechanical lock-in for our feet? And also about standing desks…)
(also, about making tasks as modular as possible, ideally can be completed in <30 minutes? So before starting a session of work, write up the to-do list to sharpen scope and prevent distractions)
(allowing downtime for the mind to wander/think about something from earlier but deliberately try to find a crack/new perspective? Point is research is by definition trying to discover out-of-distribution stuff, so need to do this exercise. Actually, how about we just play the connection game, us adversarial vs. Gemini, and also Gemini adversarial vs. us, flesh out as many connections as you can think of in that time, and write up any interesting ones somewhere for later thought? You should just vibe code this. Also comment on the Poincare theory of 4 hours/day?)
Problem: Describe some physics-specific learning tricks.
Solution: (to do: acronym for all of this as one should practice what one preaches!)
personal cognitively consonant notation
Technique #\(1\): resist simplification temptation, group variables into dimensionally meaningful groups each with a clear interpretation, e.g. density of states in \(k\)-space for a free \(3\)D electron gas is \(g(k)=2\times V/(2\pi)^3\times 4\pi k^2=Vk^2/\pi^2\), impedance of a non-dispersive string is \(Z=T/\sqrt{T/\mu}=\sqrt{T\mu}\), energy levels of a particle in a box \(E_n=\hbar^2(n\pi/L)^2/2m=n^2\pi^2\hbar^2/2mL^2\), etc.
Technique #\(2\): dimensional analysis, find how quantities scale without regards to numerical pre-factors, e.g. from Navier-Stokes equations, inertial forces (e.g. air resistance) scale like \(F_{\rho}/a^3\sim\rho v^2/a\Rightarrow F_{\rho}\sim \rho a^2v^2\) whereas viscous forces (e.g. Stokes drag) scale like \( F_{\eta}/a^3\sim\eta v/a^2\Rightarrow F_{\eta}\sim\eta av\), so the Reynolds numbers is \(\text{Re}:=F_{\rho}/F_{\eta}\sim\rho av/\eta\).
Technique #\(3\): black box thinking, view complex systems solely in terms of inputs and outputs, e.g. Fabry-Perot interferometers as tunable diffraction gratings, transformers as sequence-to-sequence models, operational amplifiers, etc.
Technique #\(4\): exploit isomorphisms, understand a less intuitive concept \(X\) using the existence of an isomorphism \(X\cong Y\) with a more intuitive concept \(Y\), e.g. driven \(RLC\) circuits as damped, driven harmonic oscillators.
Technique #\(5\): toy models, approximate complex systems by their simplest non-trivial irreducibles, e.g. a ferromagnet as an Ising/Heisenberg lattice, etc.
Technique #\(6\): remember numerical values, one can sanity-check and remember formulas simply from the order of magnitudes of the quantities (this can be aided by software such as Anki), e.g. since \(N_A\sim 10^{23}\) and \(k_B\sim 10^{-23}\) (both in SI units), one expects \(N_Ak_B\sim O(1)\). But \(R\approx 8.3\sim O(1)\), hence \(N_Ak_B=R\).
Technique #\(7\): systematize, find general recipes/frameworks for approaching problems in a field, e.g. in quantum mechanics, once one has a clear idea of \((\mathcal H,H)\), the rest is just diagonalization, in ML one has the PRAC-DTDT-ID algorithm, etc.
Technique #\(8\): draw diagrams, annotate and label the diagram as a concept graph, e.g. the unit circle complex phase plane diagram for the \(1\)D harmonic oscillator creation/annihilation operators (this one also nicely combines Technique #\(2\)).
Technique #\(9\): acronyms/checklists, e.g. PRAC-DTDT-ID
Technique #\(10\): teach it, blog or present to someone else.
Technique #\(11\): latch onto key ideas, anchor everything else to it as a kind of knowledge graph extension/branch-off point, e.g. linear transformations.
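Technique #\(6\) above can be checked in a couple of lines:

```python
# Technique #6 in action: N_A ~ 1e23 and k_B ~ 1e-23, so N_A * k_B ~ O(1),
# and indeed it reproduces the gas constant R.
N_A = 6.02214076e23   # Avogadro's number (mol^-1), exact SI value
k_B = 1.380649e-23    # Boltzmann constant (J/K), exact SI value
R = N_A * k_B
print(R)  # 8.314... J/(mol K)
```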
Problem: What are the key points to keep in mind for machine learning?
Solution: (to do: acronym that’s better/simpler than PRAC-DTDT-ID)
To check if one’s model \(\hat y_{\boldsymbol{\theta}}\) has any hope of training/learning successfully on some labelled dataset \(\mathcal D\), first choose a very small subset \(\mathcal D'\subset\mathcal D\) with \(|\mathcal D'|\ll |\mathcal D|\) and deliberately overfit \(\hat y_{\boldsymbol{\theta}}\) on \(\mathcal D'\), then run inference to check that \(\hat y_{\boldsymbol{\theta}}(\mathbf x_i)\approx y_i\) for all \((\mathbf x_i,y_i)\in\mathcal D'\).
It is better to spend \(2\) days to carefully think through and implement an experiment with 95% probability of success than to spend \(2\) hours fumbling together an experiment with 5% probability of success.
ACTUALLY INSPECT/LOOK AT YOUR DATA!
Be meticulous, plot losses on all outputs.
Run short (e.g. 10000-step) hyperparameter ablations (e.g. learning rate, weight decay, max_grad_norm, etc.) before a long training run. That said, hyperparameters usually just need to be within \(3\) orders of magnitude or so of some good value; don’t fuss too much over them, as a well-built model is quite robust.
Remember to save checkpoints, but also get rid of old checkpoints or move them off disk onto some filesystem (if the disk fills up, some IDEs will prevent one from SSHing onto those VMs; one can get around this by SSHing via terminal instead of the IDE). It’s a good idea to include in one’s CLAUDE.md an instruction about first checking the current disk space before committing new checkpoints, etc.
Also, when running a new experiment/training, consider whether it’s possible to start from the checkpoint of a previous experiment.
Nebius, RunPod, Andromeda, AWS, and Azure are all VM/server providers where one pays to rent GPUs (among other resources).
Useful commands: nvidia-smi, watch -t nvidia-smi, claude --dangerously-skip-permissions, df -h, du -sh.
WandB (Weights and Biases) is for logging data.
Hugging Face is for storing/viewing large datasets in repos and for storing model weights. Save important checkpoints to HF; otherwise just rm -rf the ones you don’t need.
Have a hypothesis about the outcome of each experiment you do, and understand why you are doing it.
Use Claude to aid understanding, but not to replace it.
Attention to detail is often what makes or breaks one’s model.
Write pseudocode implementations.
Take time to carefully inspect and understand one’s dataset exactly.
Problem: What is an estimator \(\hat{\theta}\) for some parameter \(\theta\) of a random variable? Distinguish between point estimators and interval estimators.
Solution: In the broadest sense, an estimator is any function of \(N\) i.i.d. draws of the random variable \(\hat{\theta}(X_1,…,X_N)\). First, notice that the presence of the integer \(N\) is a smoking gun that all of the estimation theory that follows is strictly a frequentist approach rather than a Bayesian one.
Because \(X_1,…,X_N\) are random variables, it follows that \(\hat{\theta}\) is also a random variable. This definition doesn’t say anything about how good or bad the estimator \(\hat{\theta}\) needs to be with respect to actually trying to estimate the property \(\theta\) of the random variables.
Although the parameter \(\theta\) is fixed (in the frequentist interpretation), \(\hat{\theta}\) can either attempt to estimate \(\theta\) by directly giving a value (in which case \(\hat{\theta}\) is called a point estimator for \(\theta\)) or giving a range of values such that one has e.g. “\(95\%\) confidence” that \(\theta\) lies in that range (in which case \(\hat{\theta}\) is called an interval estimator for \(\theta\)).
Problem: Show that, if \(X_1,X_2,…,X_N\) are i.i.d. random variables, hence all having the same mean \(\mu=\langle X_1\rangle=…=\langle X_N\rangle\), then the random variable (called the sample mean):
\[\hat{\mu}:=\frac{X_1+X_2+…+X_N}{N}\]
is an unbiased point estimator for \(\mu\).
Solution: The purpose of this question is to emphasize what it means to be an unbiased point estimator; specifically, one should compute the expectation:
\[\langle\hat{\mu}\rangle=\frac{\langle X_1\rangle+\langle X_2\rangle+…+\langle X_N\rangle}{N}=\frac{N\mu}{N}=\mu\]
so it is indeed unbiased, or equivalently, there’s no systematic error \(\langle\hat{\mu}\rangle=\mu\).
Problem: Same setup as above, but now suppose one estimates the mean \(\mu\) using the estimator:
\[\hat{\mu}:=X_1\]
Explain why this is a bad choice of estimator.
Solution: First, notice that despite being “obviously bad”, this estimator is in fact unbiased \(\langle\hat{\mu}\rangle=\langle X_1\rangle=\mu\). However, as more and more instances of the random variable are instantiated \(N\to\infty\), this estimator \(\hat{\mu}\) remains the same forever (namely whatever the first draw happened to be). Thus, here the problem isn’t about bias, it’s about the inconsistency of the estimator (see: https://en.wikipedia.org/wiki/Consistent_estimator).
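A quick Monte Carlo sketch of this inconsistency (numpy, illustrative numbers): the variance of the sample mean shrinks like \(\sigma^2/N\), while the variance of the \(\hat{\mu}:=X_1\) estimator stays stuck at \(\sigma^2\) no matter how large \(N\) gets:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0  # true variance of the standard normal draws

for N in (10, 1000):
    # 10000 independent repetitions of an N-draw experiment
    draws = rng.normal(size=(10_000, N))
    var_sample_mean = draws.mean(axis=1).var()  # ≈ sigma2 / N: shrinks with N
    var_first_draw = draws[:, 0].var()          # ≈ sigma2: never improves
    assert abs(var_sample_mean - sigma2 / N) < 0.01
    assert abs(var_first_draw - sigma2) < 0.05
```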
Problem: Show that, if \(X_1,X_2,…,X_N\) are i.i.d. random variables, hence all having the same mean \(\mu=\langle X_1\rangle=…=\langle X_N\rangle\) and same variance \(\sigma^2=\langle (X_1-\mu)^2\rangle=…=\langle (X_N-\mu)^2\rangle\), then the random variable (called the sample variance):
\[\hat{\sigma}^2:=\frac{1}{N-1}\sum_{i=1}^N(X_i-\hat{\mu})^2\]
is an unbiased point estimator for \(\sigma^2\).
Solution: This is a very fun and instructive exercise. The key is to explicitly write out \(\hat{\mu}=\sum_{i=1}^NX_i/N\), use the parallel axis theorem \(\langle X^2_1\rangle=…=\langle X^2_N\rangle=\mu^2+\sigma^2\), and the independence assumption in i.i.d. so that in particular \(\langle X_iX_j\rangle=\mu^2+\delta_{ij}\sigma^2\).
Heuristically, this arises because \(\hat{\mu}\) has been estimated from the data \(X_1,…,X_N\) itself (as the proof above makes transparent) rather than an external source, thus representing a reduction in the number of degrees of freedom by \(1\), hence the so-called Bessel’s correction \(N\mapsto N-1\).
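Bessel’s correction can be checked numerically; numpy’s ddof argument switches between the two conventions (a sketch with standard normal draws, so the true \(\sigma^2=1\)):

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 5, 200_000
draws = rng.normal(size=(reps, N))  # true variance sigma^2 = 1

biased = draws.var(axis=1, ddof=0).mean()    # divide by N: biased low by (N-1)/N
unbiased = draws.var(axis=1, ddof=1).mean()  # Bessel-corrected: divide by N-1
assert abs(biased - (N - 1) / N) < 0.01      # ≈ 0.8, systematically too small
assert abs(unbiased - 1.0) < 0.01            # ≈ 1, unbiased
```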
Problem: Even though \(\hat{\sigma}^2\) is an unbiased point estimator for the variance \(\sigma^2\), show that \(\sqrt{\hat{\sigma}^2}\) is a biased point estimator for the standard deviation \(\sigma\) (in fact, \(\sqrt{\hat{\sigma}^2}\) will systematically underestimate the actual standard deviation \(\sigma\)).
Solution: Because \(\langle\hat{\sigma}^2\rangle=\sigma^2\) is an unbiased point estimator of the variance, it follows that \(\sigma=\sqrt{\langle\hat{\sigma}^2\rangle}\). The result is thus obtained by applying Jensen’s inequality to the concave square root function: \(\langle\sqrt{\hat{\sigma}^2}\rangle\leq\sqrt{\langle\hat{\sigma}^2\rangle}=\sigma\), with equality only if \(\hat{\sigma}^2\) has zero variance.
Problem: Show that the standard deviation of the sample mean point estimator \(\hat{\mu}\) is given by (called the standard error on \(\hat{\mu}\)):
\[\sigma_{\hat{\mu}}=\frac{\sigma}{\sqrt N}\]
In practice of course, the standard deviation \(\sigma\) in that numerator of the standard error \(\sigma_{\hat{\mu}}\) isn’t known, and even worse the previous exercise just established that \(\langle\sqrt{\hat{\sigma}^2}\rangle\leq\sigma\) is a biased underestimate of the true standard deviation \(\sigma\). Nevertheless, if \(N\gg 1\), one expects \(\langle\sqrt{\hat{\sigma}^2}\rangle\to\sigma\) to be asymptotically unbiased, and hence the standard error estimator \(\hat{\sigma}_{\hat{\mu}}=\sqrt{\hat{\sigma}^2/N}\) is commonly used to estimate the true standard error \(\sigma_{\hat{\mu}}=\sigma/\sqrt{N}\) even if strictly a bit biased.
Problem: A poll is conducted to estimate the proportion of people in some country who support candidate \(A\) vs. candidate \(B\). A total of \(N=1000\) people are surveyed, and it is found that \(N_A=520\) support candidate \(A\). Report a \(95\%\) confidence interval for the actual proportion of people in the country who support candidate \(A\).
Solution: Each person can be taken as an i.i.d. \(p\)-Bernoulli random variable with value \(0\) if they support candidate \(B\) and \(1\) if they support candidate \(A\). Since \(\mu=p\) for a Bernoulli random variable, one has the unbiased point estimator:
\[\hat p=\frac{N_A}{N}=0.52\]
In order to report a \(95\%\) confidence interval, there are \(2\) additional assumptions that will be made. First, the central limit theorem asserts that \(\hat p\) (an average of \(N=1000\gg 1\) i.i.d. Bernoulli trials) will be approximately normally distributed about \(p\) with standard error \(\sigma_{\hat p}=\sigma/\sqrt{N}=\sqrt{p(1-p)/N}\). Because \(p\) is unknown, the standard error can be estimated by \(\hat{\sigma}_{\hat p}=\sqrt{\hat p(1-\hat p)/N}\approx 0.016\), and from this one would report the confidence interval as:
\[p\approx\hat p\pm z\hat{\sigma}_{\hat p}\approx 0.52\pm 0.03\]
with the \(z\)-score \(z\approx 1.96\) for the normal distribution (recalling the familiar \(68-95-99.7\) rule for normal distributions, it makes sense that \(z\approx 2\)).
Finally, a subtlety: this standard error estimator \(\hat{\sigma}_{\hat p}=\sqrt{\hat p(1-\hat p)/N}\) is actually not the same as the one mentioned earlier \(\hat{\sigma}_{\hat{\mu}}=\sqrt{\hat{\sigma}^2/N}\). One can explicitly check this and find that ultimately Bessel’s correction leads instead to the standard error estimator \(\hat{\sigma}_{\hat p}=\sqrt{\hat p(1-\hat p)/(N-1)}\) (in fact, in this case there’s also another subtlety: instead of using a \(z\)-score, one has to use a \(t\)-score associated to \(N-1=999\) degrees of freedom). For \(N\gg 1\), these \(2\) methods are essentially indistinguishable, so no need to lose sleep.
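The arithmetic of this poll in a few lines (numbers from the problem, math stdlib only):

```python
import math

N, N_A = 1000, 520
p_hat = N_A / N                              # point estimate 0.52
se_hat = math.sqrt(p_hat * (1 - p_hat) / N)  # estimated standard error ≈ 0.016
z = 1.96                                     # two-sided 95% z-score
lo, hi = p_hat - z * se_hat, p_hat + z * se_hat
# 95% CI ≈ (0.489, 0.551): candidate A's lead is within the margin of error
assert abs(se_hat - 0.0158) < 1e-3
assert 0.48 < lo < 0.50 and 0.54 < hi < 0.56
```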
Problem: Above, the estimators were simply presented out of thin air, and properties such as their bias/variance/consistency were analyzed as an afterthought; however, show that they all arise within the single unifying framework of the maximum likelihood principle.
Solution: Having observed \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from some underlying random variable parameterized by \(\boldsymbol{\theta}\) (e.g. for a univariate normal random variable one could take \(\boldsymbol{\theta}=(\mu,\sigma^2)\)), the maximum likelihood estimator \(\hat{\boldsymbol{\theta}}_{\text{ML}}=\hat{\boldsymbol{\theta}}_{\text{ML}}(\mathbf x_1,…,\mathbf x_N)\) is given by maximizing the likelihood function of \(\boldsymbol{\theta}\):
\[\hat{\boldsymbol{\theta}}_{\text{ML}}:=\text{argmax}_{\boldsymbol{\theta}}\,p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})\]
Aside: at first glance, this seems a bit weird; any sane person’s first reaction would’ve been to instead maximize the reverse conditional probability!
\[\hat{\boldsymbol{\theta}}_{\text{MAP}}:=\text{argmax}_{\boldsymbol{\theta}}\,p(\boldsymbol{\theta}|\mathbf x_1,…,\mathbf x_N)\]
where “MAP” stands for maximum a posteriori. This highlights very clearly the core difference between the frequentist school of thought (in which \(\boldsymbol{\theta}\) is simply treated as a fixed, background constant) and the Bayesian school of thought (in which there is a notion of \(\boldsymbol{\theta}\) being a random variable and hence it being well-defined to talk about its prior \(p(\boldsymbol{\theta})\)).
Back to the frequentist MLE approach, because \(\mathbf x_1,…,\mathbf x_N\) are i.i.d., one can factorize the joint probability as a product of marginals:
\[p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})=\prod_{i=1}^Np(\mathbf x_i|\boldsymbol{\theta})\]
To avoid possible numerical underflow instabilities, one would like to convert this product into a sum. The usual way to do this is with a \(\log\) (arbitrary base). Since \(\log\) is moreover monotonically increasing, the MLE is thus equivalently obtained by maximizing the log-likelihood function:
\[\hat{\boldsymbol{\theta}}_{\text{ML}}=\text{argmax}_{\boldsymbol{\theta}}\sum_{i=1}^N\log p(\mathbf x_i|\boldsymbol{\theta})\]
But now notice that, instead of summing over all \(i=1,…,N\), one can instead sum over all distinct instantiations of the random variable, weighted by its empirical probability \(p_0(\mathbf x)\) based on the \(N\) training examples \(\mathbf x_1,…,\mathbf x_N\):
\[\hat{\boldsymbol{\theta}}_{\text{ML}}=\text{argmax}_{\boldsymbol{\theta}}\sum_{\mathbf x}p_0(\mathbf x)\log p(\mathbf x|\boldsymbol{\theta})\]
which, by instead considering the negative log-likelihood, can be reformulated as a problem of minimizing the cross-entropy \(S_{p(\boldsymbol{\theta})|p_0}\) of the model distribution \(p(\mathbf x|\boldsymbol{\theta})\) with respect to the empirical distribution \(p_0(\mathbf x)\):
\[\hat{\boldsymbol{\theta}}_{\text{ML}}=\text{argmin}_{\boldsymbol{\theta}}\,S_{p(\boldsymbol{\theta})|p_0},\qquad S_{p(\boldsymbol{\theta})|p_0}:=-\sum_{\mathbf x}p_0(\mathbf x)\log p(\mathbf x|\boldsymbol{\theta})\]
Ideally, MLE would seek to choose \(\boldsymbol{\theta}\) so as to make \(p(\mathbf x|\boldsymbol{\theta})\) match the actual data-generating distribution \(p(\mathbf x)\), but since one doesn’t have access to this, the best one can hope for is to have \(p(\mathbf x|\boldsymbol{\theta})\) match the empirical distribution \(p_0(\mathbf x)\) as a proxy for the true \(p(\mathbf x)\). Success would then mean finding a \(\boldsymbol{\theta}\) such that \(p(\boldsymbol{\theta})=p_0\), since in this case the KL-divergence \(D_{\text{KL}}(p(\boldsymbol{\theta})|p_0)=0\) attains its minimum possible value by virtue of the Gibbs inequality.
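A sketch of this MLE-as-cross-entropy-minimization picture for a Bernoulli model, where the MLE is known in closed form to be the sample mean (the grid search is purely illustrative):

```python
import numpy as np

x = np.array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])       # N=10 Bernoulli draws (7 ones)
p0 = np.array([(x == 0).mean(), (x == 1).mean()])  # empirical distribution over {0,1}

thetas = np.linspace(0.001, 0.999, 999)
# model distribution p(x|theta) over outcomes {0,1} for each candidate theta
log_p = np.stack([np.log(1 - thetas), np.log(thetas)])
cross_entropy = -(p0[:, None] * log_p).sum(axis=0)

theta_ml = thetas[np.argmin(cross_entropy)]
# minimizing cross-entropy w.r.t. p0 recovers the sample mean 0.7
assert abs(theta_ml - x.mean()) < 2e-3
```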
Problem: In what sense can minimizing an MSE cost function of the parameters be justified from the perspective of maximum likelihood estimation?
Solution: The key assumption is that the possible target labels \(y\) associated to a given feature vector \(\mathbf x\) are normally distributed around some \(\mathbf x\)-dependent mean \(\hat y(\mathbf x|\boldsymbol{\theta})\) with some \(\mathbf x\)-independent variance \(\sigma^2\), i.e. \(p(y|\mathbf x,\boldsymbol{\theta})=(2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y-\hat y(\mathbf x|\boldsymbol{\theta}))^2}{2\sigma^2}\right)\). In practice, it’s hard to know this in advance; only after fitting \(\hat y(\mathbf x|\boldsymbol{\theta}_{\text{ML}})\) and plotting a histogram of the residuals \(\hat y(\mathbf x_i|\boldsymbol{\theta}_{\text{ML}})-y_i\) can one check a posteriori if the histogram looks Gaussian or not.
With this, the conditional log-likelihood \(-N\ln\sigma-\frac{N}{2}\ln(2\pi)-\frac{1}{2\sigma^2}\sum_{i=1}^N(\hat y(\mathbf x_i|\boldsymbol{\theta})-y_i)^2\) is clearly maximized by the same \(\boldsymbol{\theta}\) that minimizes the mean square error:
\[C_{\text{MSE}}(\boldsymbol{\theta}):=\frac{1}{N}\sum_{i=1}^N(\hat y(\mathbf x_i|\boldsymbol{\theta})-y_i)^2\]
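A numerical sketch of this equivalence (synthetic data with a made-up slope and noise level; the grid search over slopes is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)  # true slope 2, Gaussian noise

slopes = np.linspace(1.5, 2.5, 1001)
resid = y[None, :] - slopes[:, None] * x[None, :]
mse = (resid ** 2).mean(axis=1)
sigma = 0.1
nll = x.size * np.log(sigma * np.sqrt(2 * np.pi)) + (resid ** 2).sum(axis=1) / (2 * sigma ** 2)

# maximizing the Gaussian likelihood and minimizing MSE pick the same slope
assert np.argmin(mse) == np.argmin(nll)
assert abs(slopes[np.argmin(mse)] - 2.0) < 0.1
```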
Problem: Define the score random vector \(\mathbf S(\boldsymbol{\theta})=\mathbf S(\boldsymbol{\theta}|\mathbf x_1,…,\mathbf x_N):=\frac{\partial}{\partial\boldsymbol{\theta}}\ln p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})\). Show that the covariance matrix of the score \(\mathbf S(\boldsymbol{\theta})\), called the Fisher information matrix \(I(\boldsymbol{\theta}):=\langle(\mathbf S(\boldsymbol{\theta})-\langle\mathbf S(\boldsymbol{\theta})\rangle)^{\otimes 2}\rangle\), may be written (using the fact that \(\langle\mathbf S(\boldsymbol{\theta})\rangle=\textbf 0\)):
\[I(\boldsymbol{\theta})=\langle\mathbf S(\boldsymbol{\theta})^{\otimes 2}\rangle=-\left\langle\frac{\partial^2\ln p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^{\otimes 2}}\right\rangle\]
(all expectations \(\langle\space\rangle\) are of course over the joint distribution of the observed data \(\mathbf x_1,…,\mathbf x_N\) defined by the likelihood function \(p(\mathbf x_1,…,\mathbf x_N|\boldsymbol{\theta})\))
Solution:
Problem: Calculate explicitly the score \(S(\theta|x)\) and Fisher information \(I(\theta)\) for a \(\theta\)-Bernoulli random variable with outcome \(x\in\{0,1\}\). Hence, what can be said about the score \(S(\theta|x_1,…,x_N)\) and Fisher information \(I_N(\theta)\) for the corresponding \(N\)-trial binomial random variable where each i.i.d. trial is \(\theta\)-Bernoulli?
In the more general case where \(\langle\hat{\boldsymbol{\theta}}\rangle=\boldsymbol{\theta}\) is an unbiased point estimator for more than \(1\) parameter, a natural generalization (the Cramer-Rao bound) holds:
\[\text{cov}(\hat{\boldsymbol{\theta}},\hat{\boldsymbol{\theta}})\geq I^{-1}(\boldsymbol{\theta})\]
where \(I^{-1}(\boldsymbol{\theta})\) is the inverse of the Fisher information matrix, and the meaning of this matrix inequality is that \(\text{cov}(\hat{\boldsymbol{\theta}},\hat{\boldsymbol{\theta}})-I^{-1}(\boldsymbol{\theta})\) is positive semi-definite.
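A Monte Carlo sketch for the \(\theta\)-Bernoulli case: the variance of the score indeed matches the standard closed form \(I(\theta)=\frac{1}{\theta(1-\theta)}\) (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.random(1_000_000) < theta  # one Bernoulli draw per experiment

# score S(theta|x) = d/dtheta log(theta^x (1-theta)^(1-x)) = x/theta - (1-x)/(1-theta)
score = np.where(x, 1 / theta, -1 / (1 - theta))
assert abs(score.mean()) < 0.01      # the score has zero mean

fisher_mc = score.var()
fisher_exact = 1 / (theta * (1 - theta))  # ≈ 4.76 for theta = 0.3
assert abs(fisher_mc - fisher_exact) < 0.05
# and N i.i.d. trials simply add: I_N(theta) = N * I(theta)
```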
Problem: Combining the results of the previous \(2\) problems, what can be deduced?
Problem: How can random error be reduced in a scientific experiment? What about systematic error?
Solution: The standard error \(\sigma_{\hat{\mu}}=\sigma/\sqrt{N}\) can be taken as a quantitative proxy for the effective random error from \(N\) i.i.d. measurements. This reveals that there are \(2\) ways to reduce the effective random error:
Reduce \(\sigma\) (e.g. cooling down electronics to mitigate Johnson noise).
Increase \(N\) (make more measurements).
Reducing systematic error is a completely different beast, and requires techniques such as calibration against a “gold standard”, differential measurements, etc.
Problem: Explain how the \(\chi^2\) goodness-of-fit test arises from maximizing a log-likelihood function subject to normally distributed errors between the data and the model.
Solution: The \(\chi^2\) statistic
\[\chi^2:=\sum_{i=1}^N\frac{(y_i-\hat y(\mathbf x_i|\boldsymbol{\theta}))^2}{\sigma_i^2}\]
is basically just a rearrangement of the usual formula for the variance, normalized so that \(\langle\chi^2\rangle\approx N\) when the errors really are normally distributed with the stated \(\sigma_i\). Indeed, comparing \(\chi^2\ll N,\chi^2\approx N,\chi^2\gg N\) basically is the whole point of the test. The subtraction of the number of constraints from \(N\) is also reminiscent of the Bessel correction, and in fact the \(2\) are there for conceptually the same reason.
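A numerical sketch (common \(\sigma_i=\sigma\) for simplicity, numbers made up): residuals genuinely drawn from the assumed normal error model give \(\chi^2\approx N\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 1000, 0.5
residuals = rng.normal(scale=sigma, size=N)  # model-minus-data errors, known sigma

chi2 = ((residuals / sigma) ** 2).sum()
# for a good fit, chi^2 is comparable to N (more precisely N minus #constraints);
# the standard deviation of a chi^2_N variable is sqrt(2N)
assert abs(chi2 - N) < 5 * np.sqrt(2 * N)
```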
Just as the fundamental theorem of single-variable calculus \(\int_{x_1}^{x_2}f'(x)dx=f(x_2)-f(x_1)\) is the key insight on which the entire subject of single-variable calculus rests, there is an analogous sense in which one can consider a fundamental theorem of classical computing to be the key insight on which the entire field of classical computing rests. This is:
Fundamental Theorem of Classical Computing: The set \(\textbf N\) forms a vector space over the binary Galois field \(\textbf Z/2\textbf Z\) admitting the countably infinite basis \(\textbf N=\text{span}_{\textbf Z/2\textbf Z}\{2^n:n\in\textbf N\}\), so that \(\text{dim}_{\textbf Z/2\textbf Z}(\textbf N)=\aleph_0\). Put differently, the binary representation map \(n\in\textbf N\mapsto [n]^{(1,2,4,8,…)}\in(\textbf Z/2\textbf Z)^{\oplus\infty}\) is a bijection.
From a physics perspective, one can loosely think of \(\textbf N\cong(\textbf Z/2\textbf Z)^{\oplus\infty}\) as the direct sum of infinitely many copies of the binary “vector space” \(\textbf Z/2\textbf Z\). By contrast, in the field of quantum computing, one instead has a vector space of the form \((\textbf C^2)^{\otimes N}\) for \(N\) qubits…and the multiplicative structure of the tensor product is much richer than the additive structure of the direct sum, hence the interest in quantum computing.
Knowing all this, it follows that any data \(X\) (e.g. an image file, an audio file, etc.) which can simply be “reduced to numbers” by some injection \(f:X\to\textbf N\) is also in principle just reduced to some bit string \((b_k)_{k=0}^\infty\), simply by taking each \(f(x)\in\textbf N\) for \(x\in X\) and writing out the binary representation of \(f(x)\). For instance, when \(X=\{a,b,c,…,x,y,z\}\) is the alphabet, then one possible injection (or encoding) is called the American Standard Code for Information Interchange \(\text{ASCII}:\{a,b,c,…,x,y,z\}\to\textbf N\), which maps for instance \(\text{ASCII}(a):=97\), \(\text{ASCII}(b):=98\), and so forth. (Strictly speaking, ASCII uses \(7\) bits to represent \(2^7=128\) characters, of which \(26\) are the usual lowercase English letters and another \(26\) are the uppercase English letters. There was even an extended ASCII which used \(8\) bits to represent \(2^8=256\) characters, but ultimately that was clearly insufficient for things like the Chinese language, and so nowadays Unicode (with the UTF-8 encoding, which stands for Unicode Transformation Format – 8-bit, or UTF-16, UTF-32, etc.) tends to be used in lieu of ASCII.)
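These encodings can be poked at directly in Python:

```python
# Encoding text to numbers to bits: ord gives the code point, format gives the bits
assert ord("a") == 97 and ord("b") == 98
assert format(ord("a"), "07b") == "1100001"  # the 7-bit ASCII representation of 'a'

# ASCII is a strict subset of UTF-8: one byte per ASCII character...
assert "ab".encode("utf-8") == b"\x61\x62"
# ...but characters beyond ASCII need multiple bytes under UTF-8
assert len("中".encode("utf-8")) == 3
```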
When storing numbers in memory (e.g. RAM, HDD, SSD), a computer can experience overflow error (most computers have a \(64\)-bit architecture, so can safely store unsigned integers only up to \(2^{64}-1\)), roundoff error (this is especially relevant to floating-point arithmetic), and precision errors.
An analog-to-digital converter (ADC) is just an abstraction for any function that maps an analog signal \(V(t)\), \(I(x,y)\), etc. to a digital signal \(\bar V_i,\bar I_{ij}\), etc. More precisely, one can consider an ADC to be a pair \(\text{ADC}=(f_s,\rho)\) where \(f_s\) is the sampling rate of the ADC in samples/second and \(\rho\) is the bit depth/resolution/precision at which the ADC quantizes data in bits/sample, thus the bit rate \(\dot b\) of the ADC is \(\dot b=f_s\rho\) and this is in general distinct from the baud rate \(\dot{Bd}\) of serial communication with the ADC by a factor \(\lambda:=\frac{\dot b}{\dot{Bd}}\geq 1\) which describes the number of bits per baud (see this Stack Overflow Q&A for an idea of the distinction).
For \(V(t)\) an analog signal in the time domain which is bandlimited by some bandwidth \(\Delta f\), the Nyquist-Shannon sampling theorem asserts that in order to avoid aliasing distortions when sampling \(V(t)\), one has to use \(f_s>\Delta f\). Equivalently, if \(f^*=\Delta f/2\) is the largest frequency present in \(V(t)\), then the sampling frequency needs to obey \(f_s>2f^*\). For instance, humans can hear audio up to \(f^*=20\text{ kHz}\), so audio ADCs (e.g. in digital microphones) sample at \(f_s=48\text{ kHz}\). Cameras are just image ADCs, where now “samples” is replaced by “pixels” and so \(f_s\) might be better called “pixel frequency” (with units of pixels/meter rather than samples/second?). The use of the RGB color space is fundamentally based on the biology of the human eye and its \(3\) types of cone cells, and conventionally each R, G, B channel has \(256\) levels (or \(1\) byte) of intensity quantization simply because that was empirically found to be sufficient (so the total bit depth of an RGB image is \(\rho=3\) bytes/pixel, or \(24\) bits/px).
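A sketch of aliasing with exactly these numbers: a \(30\text{ kHz}\) tone sampled at \(f_s=48\text{ kHz}\) (i.e. above the Nyquist frequency \(f_s/2=24\text{ kHz}\)) yields literally the same samples, up to a sign, as an \(18\text{ kHz}\) tone:

```python
import numpy as np

fs = 48_000               # sampling rate in samples/s
n = np.arange(256)        # sample indices
f_high = 30_000           # tone above the Nyquist frequency fs/2 = 24 kHz
f_alias = fs - f_high     # aliases down to 18 kHz

high = np.sin(2 * np.pi * f_high * n / fs)
alias = np.sin(2 * np.pi * f_alias * n / fs)
# the 30 kHz tone is indistinguishable (up to a sign flip) from an 18 kHz tone
assert np.allclose(high, -alias, atol=1e-9)
```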
In practice, such data would likely be further compressed (via either lossless or lossy data compression algorithms). For instance: JPEG (lossy), PNG (lossless), and run-length encoding (lossless) for digital/bitmap/raster images; Lempel-Ziv-Welch (LZW) compression (lossless), Huffman encoding (lossless), and byte pair encoding for text files; and perceptual audio encoding (lossy), which exploits psychoacoustic quirks of the human auditory system such as auditory masking and high-frequency limits.
Computers & Logic Circuits
One abstract paradigm for understanding how a computer works is: input\(\to\)storage + processing\(\to\)output. Input is typically taken from sensors (e.g. keyboards, mice, touchscreens, microphones, cameras); memory is handled by RAM; storage is handled by HDD/SSD; processing is done by the central processing unit (CPU) (an integrated circuit (IC)), where CPU = control unit + arithmetic logic unit (ALU) (both storage and processing use logic circuits made of many logic gates combined together); and output goes to a monitor, a speaker, an electric motor, an LED, etc. Processing + memory are heavily interdependent, connected by a memory bus.
This excellent YouTube demonstration shows how to implement standard logic gates (e.g. buffers, NOT gates, AND gates, OR gates, XOR gates, NAND gates, NOR gates, etc.) using standard hardware on a solderless breadboard, notably transistors. So really, when it comes to processing data, a “computer” is an abstraction over “logic circuits”, which is itself an abstraction over “logic gates”, which is itself an abstraction over “bits”, which is itself an abstraction over transistors and physical hardware that one can actually touch and feel in the real world.
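In the same spirit of stacking abstractions, one can sketch in code how all the standard gates are themselves abstractions over a single universal gate (here NAND, which is functionally complete):

```python
# Build every standard gate out of NAND alone
def NAND(a: int, b: int) -> int:
    return 1 - (a & b)

def NOT(a):    return NAND(a, a)
def AND(a, b): return NOT(NAND(a, b))
def OR(a, b):  return NAND(NOT(a), NOT(b))
def XOR(a, b): return AND(OR(a, b), NAND(a, b))

# Verify all truth tables against Python's native bitwise operators
for a in (0, 1):
    for b in (0, 1):
        assert AND(a, b) == a & b
        assert OR(a, b) == a | b
        assert XOR(a, b) == a ^ b
```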
Our current computer (Microsoft Surface Studio \(2\)) has \(\sigma_{\text{RAM}}=32\text{ GB}\) and \(\sigma_{\text{SSD or C-Drive}}=1\text{ TB}\) (with Microsoft OneDrive providing an additional \(\sigma_{\text{OneDrive}}=1\text{ TB}\) of storage space).
The Internet
A computer network is topologically any connected undirected graph where nodes represent computing devices (e.g. computers, phones, etc.) and edges are communication channels between a pair of computing devices. Common network topologies include the ring, star, mesh, bus, and tree topologies. Examples of computer networks include local area networks (LAN), wide area networks (WAN), data center networks (DCN), etc., with the Internet being a distributed packet-switched WAN. When designing the architecture of a computer network, one is interested in minimizing the distance (with respect to a suitable metric) that any piece of data \(D\) must travel to get from one computer to another.
At the level of physical hardware, data can be communicated between computers via copper category 5 (CAT5) twisted-pair cables adhering to Ethernet standards; fiber optic cables can also be used with Ethernet standards. Wi-Fi or Bluetooth instead communicate data via radio waves, which suffer attenuation. Regardless, for all \(3\) of these physical transmission schemes (for the wired media, a line code maps abstract bits to a digital signal in the real world; radio links modulate a carrier instead), we need to thank James Clerk Maxwell.
The informal notion of the “speed” of an internet connection (i.e. one of the communication channels mentioned earlier) between two computing devices \(X, Y\) is made precise by the bit rate \(\dot b_{(X,Y)}\) between the computing devices \(X\) and \(Y\). The bandwidth of that communication channel is then just the maximum bit rate \(\dot b^*_{(X,Y)}\) between \(X\) and \(Y\) (not to be confused with the signal processing notion of the bandwidth \(\Delta f\) of an analog signal). Another important factor is the latency \(\Delta t_{(X,Y)}\) of a given communication channel (i.e. just the delay). Running an Internet speed test for my computer with the measurement lab (M-lab) yields \(\dot b^{\text{downloads}}_{\text{(computer,M-lab)}}=650.7\frac{\text{Mb}}{\text s}\) and \(\dot b^{\text{uploads}}_{\text{(computer,M-lab)}}=732.0\frac{\text{Mb}}{\text s}\) and a latency of \(\Delta t_{\text{(computer,M-lab)}}=4\text{ ms}\).
Just like every house \(H\) has a physical address \(A_H\), in the WAN that we call the Internet, every computing device \(X\) has an Internet Protocol (IP) address \(\text{IP}_X\). When a computing device \(X\) transmits data packets \(D\) across a communication channel to another computing device \(Y\), \(X\) must specify the IP address \(\text{IP}_Y\) of \(Y\) in addition to providing its own IP address \(\text{IP}_X\) so that \(Y\) can reply to it. There are actually \(2\) common IP address protocols: IPv4 (a string of \(4\) bytes, leading to \(2^{32}\) possible IPv4 addresses) and IPv6 (a string of \(8\) groups of \(4\) hexadecimal digits, each group up to \(\text{0xFFFF}\), for a total of \(2^{128}\) possible IPv6 addresses). IP addresses of computing devices may also be dynamic, meaning that one’s Internet service provider changes it over time \(t\). Or, if one connects to a disjoint Wi-Fi network, then this will usually also mean a different IP address, as each Wi-Fi provider (Internet service provider) has a range of IP addresses it is allowed to allocate. By contrast, computing devices acting as servers (e.g. Google’s computers) often have static IP addresses (e.g. \(\text{IPv4}_{\text{Google computers}}=74.125.20.113\)) to make it easier for client computing devices to communicate with them quickly.
In terms of actually interpreting what the numbers in an IP address mean, it turns out that it doesn’t have to be the case that (say in an IPv4 address) each byte corresponds to some piece of information. Rather, one can decide how one wishes to impose a hierarchy of subnetworks (subnets) on the IP address, that is, how many bits to represent a given piece of information. This is sometimes known as “octet splitting” where “octet” = “byte” in an IPv4 address.
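Python’s ipaddress module makes this subnet hierarchy concrete (the example networks are made up):

```python
import ipaddress

# A /24 subnet: the first 24 bits ("three octets") identify the network,
# the remaining 8 bits identify hosts within it
net = ipaddress.ip_network("192.168.1.0/24")
assert net.num_addresses == 2 ** 8
assert ipaddress.ip_address("192.168.1.42") in net
assert ipaddress.ip_address("192.168.2.42") not in net

# "octet splitting" need not respect byte boundaries: a /20 uses 20 network bits,
# leaving 12 host bits
net20 = ipaddress.ip_network("10.0.16.0/20")
assert net20.num_addresses == 2 ** 12
```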
The Domain Name System (DNS) is essentially a map \(\text{DNS}:\{\text{URLs}\}\to\{\text{IP addresses}\}\) and indeed anytime one uses a browser application (e.g. Chrome) to search for a website URL (e.g. www.youtube.com), DNS servers need to first find the IP address \(\text{DNS}\)(www.youtube.com) associated with that URL.
Problem: Given a linear operator \(H\) on some vector space, define the resolvent operator \(G_H(E)\) associated to \(H\).
Solution: The resolvent \(G_H(E)\) of \(H\) is the operator-valued Mobius transformation of a complex variable \(E\in\textbf C\) defined by the inverse:
\[G_H(E):=\frac{1}{E1-H}\]
(this notation \(A/B\) is only unambiguous when \([A,B^{-1}]=0\) which it is in this case).
Problem: What is the domain for \(E\in\textbf C\) of the resolvent \(G_H(E)\)?
Solution: Any value of \(E\in\textbf C\) for which the matrix \(E1-H\) is invertible leads to a well-defined resolvent. But invertibility is equivalent to a non-vanishing determinant \(\det(E1-H)\neq 0\). However, when \(\det(E1-H)=0\), then \(E\) is an eigenvalue of \(H\). So the domain of \(G_H(E)\) is \(E\in\textbf C-\text{spec}(H)\).
Problem: To see the conclusion of Solution #\(2\) another way, assume \(H\) is Hermitian so that it admits an orthonormal eigenbasis \(H|n\rangle=E_n|n\rangle\) with real eigenvalues. Show that the resolvent \(G_H(E)\) of \(H\) may be expressed as a linear combination of projectors onto its eigenspaces:
\[G_H(E)=\sum_n\frac{|n\rangle\langle n|}{E-E_n}\]
where the matrix element is \(\langle n|\frac{1}{E1-H}|m\rangle=\frac{\delta_{nm}}{E-E_n}\). Thus, the resolvent has a simple pole whenever \(E=E_n\) for some \(H\)-eigenstate \(|n\rangle\), and its residue at that simple pole is given by the corresponding projector \(|n\rangle\langle n|\).
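This spectral form of the resolvent is easy to verify numerically on a toy \(2\times 2\) Hermitian matrix (numbers made up):

```python
import numpy as np

# Hermitian H with spectral decomposition H = sum_n E_n |n><n|
H = np.array([[2.0, 1.0], [1.0, 2.0]])
E_n, V = np.linalg.eigh(H)  # eigenvalues [1, 3]; columns of V are the |n>

E = 0.5 + 0.5j  # any E not in spec(H)
G = np.linalg.inv(E * np.eye(2) - H)

# resolvent as a linear combination of eigenprojectors |n><n| / (E - E_n)
G_spectral = sum(np.outer(V[:, n], V[:, n].conj()) / (E - E_n[n]) for n in range(2))
assert np.allclose(G, G_spectral)
```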
Problem: Define the retarded resolvent \(G_H^+(E)\) and the advanced resolvent \(G_H^-(E)\).
Solution: Killing \(2\) birds with \(1\) stone:
\[G_H^{\pm}(E):=G_H(E\pm i0^+)\]
so if \(H\) were Hermitian, then normally the poles of \(G_H(E)\) would all lie on the real axis; the retarded and advanced resolvents therefore exist to shift these poles slightly above or below the real axis.
Problem: Define the Fourier transform of the time evolution operator \(U_H(t)=e^{-iHt/\hbar}\) to a new operator-valued function of \(E\) given by the convention:
\[U_H(E):=\int_{-\infty}^{\infty}\frac{dt}{2\pi\hbar}e^{iEt/\hbar}U_H(t)\]
Show that \(U_H(E)=-\frac{1}{\pi}\Im G_H^+(E)=\frac{1}{\pi}\Im G_H^-(E)\), where the imaginary part of an operator is defined in the obvious manner.
Solution: One way is to proceed by direct evaluation, which gives \(U_H(E)=\delta(E1-H)\). On the other hand, any delta function can be written via a partial fraction expansion of a nascent delta given by eliminating the principal value of the Sokhotski-Plemelj theorem:
\[\frac{1}{x\pm i0^+}=\mathcal P\frac{1}{x}\mp i\pi\delta(x)\Rightarrow\delta(x)=\mp\frac{1}{\pi}\Im\frac{1}{x\pm i0^+}\]
Problem: Starting from the decomposition \(H=H_0+V\), conclude that the resolvents \(G_H,G_{H_0}\) of \(H\) and \(H_0\) are related by:
\[G_H=\Omega_HG_{H_0}\]
where the Moller scattering operator \(\Omega_H\) of \(H\) depends on the choice of decomposition \(H=H_0+V\):
\[\Omega_H:=\frac{1}{1-G_{H_0}V}\]
Solution:
Problem: Starting from the Schrodinger equation \(H|\psi\rangle=E|\psi\rangle\) in the unusual form:
\[G^{-1}_H(E)|\psi\rangle=0\]
use the result above to deduce the Lippmann-Schwinger equation:
\[|\psi\rangle=\Omega_H|\psi_0\rangle\]
where \(H=H_0+V\) can be decomposed in any way one pleases, so long as this is reflected in the Moller scattering operator \(\Omega_H\) and the “free state” \(|\psi_0\rangle\) with commensurate energy \(H_0|\psi_0\rangle=E|\psi_0\rangle\) (thus, \(\Omega_H\) is “energy-conserving”).
Solution:
Problem: How is this form of the Lippmann-Schwinger equation typically applied to quantum mechanical scattering?
Solution: The decomposition is chosen such that \(H_0=|\textbf P|^2/2\mu\) is a purely kinetic Hamiltonian and \(V\) is a corresponding \(2\)-body scattering potential for two particles of reduced mass \(\mu\) (working in the ZMF). In this case, one typically takes \(|\psi_0\rangle:=|\textbf k\rangle\) to be a plane wave \(H_0\)-eigenstate with energy \(E_{\textbf k}=\hbar^2|\textbf k|^2/2\mu\). Furthermore, one typically chooses the retarded Moller scattering operator \(\Omega_H^+(E_{\textbf k})=\frac{1}{1-G_{H_0}^+(E_{\textbf k})V}\) as an ad hoc way of cherrypicking the physical, outward-propagating solution \(|\psi\rangle:=|\psi^+_{\textbf k}\rangle\). Finally, it is convenient to work in a specific basis, typically the \(\textbf X\)-eigenbasis. Altogether then, the useful form of the Lippmann-Schwinger equation for scattering theory is:
\[\psi^+_{\textbf k}(\textbf x)=\langle\textbf x|\textbf k\rangle+\langle\textbf x|G^+_{H_0}(E_{\textbf k})V|\psi^+_{\textbf k}\rangle\]
To actually evaluate this, one ultimately has to unwrap the earlier geometric Neumann series for \(\Omega^+_H(E_{\textbf k})=1+G^+_{H_0}(E_{\textbf k})V+…\) and perhaps truncate at some partial sum to yield a Born approximation to \(\psi_{\textbf k}^+(\textbf x)\):
\[\psi^+_{\textbf k}(\textbf x)\approx\langle\textbf x|\left(1+G^+_{H_0}(E_{\textbf k})V+…\right)|\textbf k\rangle\]
Gauge-fixing the \(\textbf X\)-eigenstates to be an orthonormal basis in the sense that \(\langle\textbf x|\textbf x'\rangle=\delta^3(\textbf x-\textbf x')\) and \(\int d^3\textbf x|\textbf x\rangle\langle\textbf x|=1\), and having implicitly assumed the normalization \(\langle\textbf x|\textbf k\rangle = e^{i\textbf k\cdot\textbf x}\), then one has \(\langle\textbf k|\textbf k'\rangle=(2\pi)^3\delta^3(\textbf k-\textbf k')\) and \(\int\frac{d^3\textbf k}{(2\pi)^3}|\textbf k\rangle\langle\textbf k|=1\). Hence, using the retarded free-particle Green’s function \(\langle\textbf x|G^+_{H_0}(E_{\textbf k})|\textbf x'\rangle=-\frac{\mu}{2\pi\hbar^2}\frac{e^{ik|\textbf x-\textbf x'|}}{|\textbf x-\textbf x'|}\):
\[\psi^+_{\textbf k}(\textbf x)=e^{i\textbf k\cdot\textbf x}-\frac{\mu}{2\pi\hbar^2}\int d^3\textbf x'\,\frac{e^{ik|\textbf x-\textbf x'|}}{|\textbf x-\textbf x'|}V(\textbf x')\psi^+_{\textbf k}(\textbf x')\]
Problem: Is this perturbation theory?
Solution: It’s not really perturbation theory in the sense that the energies are just taken to be \(E=\hbar^2|\textbf k|^2/2\mu\); rather, one is much more interested in the eigenstates (especially asymptotically!), from which other kinds of data like scattering amplitudes, cross sections, etc. can be extracted. So the goals of the \(2\) programs are different.
Problem: Consider some \(H_0\)-eigenstate \(|0\rangle\) with energy \(H_0|0\rangle=E_0|0\rangle\). Evaluate the expectation \(\langle 0|G_{H_0}(E)|0\rangle\). Hence, for \(H=H_0+V\), show that \(\langle 0|G_H(E)|0\rangle\) looks the same as \(\langle 0|G_{H_0}(E)|0\rangle\) except with \(E_0\mapsto E_0+\Sigma_{|0\rangle,V}(E)\), where the self-energy \(\Sigma_{|0\rangle,V}(E)\in\textbf C\) of the \(H_0\)-eigenstate \(|0\rangle\) due to the perturbation \(V\) is given by the usual:
\[\Sigma_{|0\rangle,V}(E)=V_{00}+\sum_{n\neq 0}\frac{V_{0n}V_{n0}}{E-E_n}+\sum_{n,m\neq 0}\frac{V_{0n}V_{nm}V_{m0}}{(E-E_n)(E-E_m)}+…\]
where \(V_{nm}:=\langle n|V|m\rangle\) are the matrix elements of the perturbation \(V\) in the unperturbed \(H_0\)-eigenbasis.
Problem: What are the interpretations of \(\Re\Sigma_{|0\rangle,V}(E)\) and \(\Im\Sigma_{|0\rangle,V}(E)\)?
Solution: The real part \(\Re\Sigma_{|0\rangle,V}(E)\) shifts the pole along the real axis, i.e. it renormalizes the energy level \(E_0\mapsto E_0+\Re\Sigma_{|0\rangle,V}\). The imaginary part \(\Im\Sigma_{|0\rangle,V}(E)\) pushes the pole off the real axis, which in the time domain corresponds to exponential decay of the state \(|0\rangle\) at rate \(\Gamma/\hbar\) with \(\Gamma=-2\Im\Sigma_{|0\rangle,V}\), i.e. a finite lifetime \(\tau=\hbar/\Gamma\) (equivalently, a Lorentzian broadening of width \(\Gamma\) in the energy domain).
Problem: Show that, by tracking the movement of a simple pole \(E_n\) of \(G_H(E)\) in the complex \(E\)-plane, one can recover the eigenvalue and eigenstate corrections of perturbation theory.
Solution: Shine the spotlight on some eigenstate \(|n\rangle\) and its associated energy \(E_n\) by separating the unperturbed resolvent as:
\[G_{H_0}(E)=\frac{|n\rangle\langle n|}{E-E_n}+\sum_{m\neq n}\frac{|m\rangle\langle m|}{E-E_m}\]
so that the second-order term \(G_{H_0}VG_{H_0}\) of the Neumann series splits as:
\[G_{H_0}VG_{H_0}=\frac{V_{nn}|n\rangle\langle n|}{(E-E_n)^2}+\sum_{m\neq n}\left(\frac{V_{mn}|m\rangle\langle n|}{(E-E_m)(E-E_n)}+\frac{V_{nm}|n\rangle\langle m|}{(E-E_n)(E-E_m)}\right)+\sum_{m,\ell\neq n}\frac{V_{m\ell}|m\rangle\langle\ell|}{(E-E_m)(E-E_\ell)}\]
The first term in the sum turns \(E=E_n\) from a simple pole into a double pole. It turns out this is what’s responsible for shifting the location of the pole away from \(E=E_n\), in other words, perturbing the eigenvalue. Meanwhile the series in the middle contributes to the residue at \(E=E_n\) because \(m\neq n\); clearly it must be responsible for perturbing the eigenstate. Finally, because \(m,\ell\neq n\) in the last sum, it will be analytic in a neighbourhood of \(E=E_n\); in other words, it contributes nothing to either the pole location or the residue there.
Strictly speaking, after including the \(G_{H_0}VG_{H_0}\) term from the geometric Neumann-Born-Laurent series, the pole still sits at \(E=E_n\), just that its order has increased from \(1\to 2\). But consider as an example the geometric series \(1+1/x+1/x^2+…\). For any finite partial sum truncation, the pole sits at \(x=0\). But for \(|x|>1\) this converges absolutely to \(x/(x-1)\) where now the pole has been displaced to \(x=1\). Or just take any function like \(\tan(x)\) and Taylor expand it around \(x=0\) say; although all the terms in that Taylor series are analytic, the limiting behavior must be non-analytic at \(x=\pm\pi/2\). It’s basically a more extreme version of a phase transition, since singularities are more extreme than discontinuities.
Anyways, the fact that it’s the expectation \(\langle n|V|n\rangle\) which is sitting in the numerator of the double pole term means that this is the \(1^{\text{st}}\)-order correction to the energy. Similarly, decomposing into partial fractions:
only the first term \(\sim(E-E_n)^{-1}\) contributes to the residue at \(E=E_n\), and moreover this eigenstate contribution is just \(\sum_{m\neq n}\frac{\langle m|V|n\rangle}{E_n-E_m}|m\rangle\). Continuing to higher-order terms in the expansion reproduces the familiar higher-order formulas of perturbation theory.
Problem #\(6\): Let \(H=H(\lambda)\) be a non-degenerate Hamiltonian depending on a parameter \(\lambda\) (not necessarily infinitesimal), and let \(|n\rangle=|n(\lambda)\rangle\) be a normalized \(H\)-eigenstate with energy \(E_n=E_n(\lambda)\). By differentiating the spectral equation:
\[H|n\rangle=E_n|n\rangle\]
with respect to \(\lambda\), prove the Hellmann-Feynman theorems for the rate of change of the eigenvalue \(E_n\) and the \(H\)-eigenstate \(|n\rangle\):
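\[\frac{\partial E_n}{\partial\lambda}=\langle n|\frac{\partial H}{\partial\lambda}|n\rangle\qquad\text{and}\qquad\frac{\partial|n\rangle}{\partial\lambda}=\sum_{m\neq n}\frac{\langle m|\frac{\partial H}{\partial\lambda}|n\rangle}{E_n-E_m}|m\rangle\]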
Solution #\(6\): Differentiating yields \(\frac{\partial H}{\partial\lambda}|n\rangle+H\frac{\partial|n\rangle}{\partial\lambda}=\frac{\partial E_n}{\partial\lambda}|n\rangle+E_n\frac{\partial|n\rangle}{\partial\lambda}\). This is like having \(2\) vectors \((a,b,c)=(d,e,f)\); naturally one’s instinct would be to equate components \(a=d,b=e,c=f\). In this context, what that looks like is projecting both sides onto an arbitrary \(H\)-eigenstate \(|m\rangle\) to equate the scalar components of all vectors:
the Hellmann-Feynman theorems then arise by considering the \(2\) cases \(m=n\) and \(m\neq n\). There is a priori also a component of the rate of change \(\partial|n\rangle/\partial\lambda\) of the \(H\)-eigenstate \(|n\rangle\) along itself with amplitude \(\langle n|\frac{\partial|n\rangle}{\partial\lambda}\), but due to normalization \(\langle n|n\rangle=1\Rightarrow\frac{\partial\langle n|}{\partial\lambda}|n\rangle+\langle n|\frac{\partial|n\rangle}{\partial\lambda}=0\Rightarrow\Re\langle n|\frac{\partial|n\rangle}{\partial\lambda}=0\), and the residual imaginary part can be gauged away by a suitable \(\lambda\)-dependent choice of phase for \(|n\rangle\), so one may take \(\langle n|\frac{\partial|n\rangle}{\partial\lambda}=0\); this is like saying that an ant crawling on a sphere \(|\textbf x|^2=\text{const}\) must have orthogonal position and velocity \(\textbf x\cdot\dot{\textbf x}=0\).
The Hellmann-Feynman theorems are reminiscent of formulas such as:
in which one takes \(A=H\) and \(\lambda=t\), as well as the special case \(\phi=\psi\) of Ehrenfest’s theorem.
Problem #\(7\): Hence, by applying the Hellmann-Feynman theorems to a linearly perturbed Hamiltonian \(H=H_0+\lambda V\), deduce the \(O(\lambda^2)\) corrections to both the eigenvalues \(E_n\) and eigenstates \(|n\rangle\) of the unperturbed Hamiltonian \(H_0\) in the presence of a perturbation \(V\).
Solution #\(7\): Specialized to the case of this particular linearly perturbed Hamiltonian, one has trivially \(\frac{\partial H}{\partial\lambda}=V\). With this in mind, one simply takes the Hellmann-Feynman formulas and differentiates them again with respect to \(\lambda\):
For fun, here is a (failed!) attempt to compute the \(O(\lambda^3)\) eigenvalue correction (what’s the mistake?):
And for the \(O(\lambda^2)\) eigenstate correction, refer to this document.
TO DO: extend/generalize all the above discussion to degenerate and time dependent perturbation theory! Also find where to put the following snippets:
Problem: In the rotating frame of a Rabi drive \((\omega,\Omega)\) detuned by a small amount \(\delta:=\omega-\omega_0\) from a bare resonance, within the rotating wave approximation one has \(H=H_0+V\) with \(H_0=\frac{\hbar\delta}{2}\sigma_z\) and \(V=\frac{\hbar\Omega}{2}\sigma_x\). Evaluate the self-energy \(\Sigma_{|0\rangle,V}(E)\) in the one-dimensional subspace spanned by the ground state \(|0\rangle\), and show that it reproduces the \(2^{nd}\)-order AC Stark shift when evaluated on-shell (i.e. for \(E=-\hbar\delta/2\)).
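For reference, the only matrix element coupling \(|0\rangle\) to the excited state is \(\langle 1|V|0\rangle=\hbar\Omega/2\), so the self-energy should come out to:
\[\Sigma_{|0\rangle,V}(E)=\frac{(\hbar\Omega/2)^2}{E-\hbar\delta/2}\overset{E=-\hbar\delta/2}{=}-\frac{\hbar\Omega^2}{4\delta}\]
which is indeed the familiar \(2^{nd}\)-order AC Stark shift of the ground state.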
Although Feynman diagrams are often first encountered in statistical/quantum field theory contexts where they are employed in perturbative calculations of partition/correlation functions based on Wick’s theorem, there is a lot of “fluff” in these cases that obscures their underlying simplicity. The purpose of this post is therefore to build up to a simpler, intuitive view of what Feynman diagrams are really about that hopefully demystifies them.
Problem #\(1\): Evaluate the moments \(\langle x^n\rangle\) of a univariate normally distributed random variable \(x\) with zero mean \(\langle x\rangle=0\) (the choice of zero mean is motivated by the fact that in practice one only cares about central moments of the distribution, so to avoid writing \(x-\langle x\rangle\) everywhere it is convenient to just set \(\langle x\rangle:=0\)).
Solution #\(1\): It is clear that for odd \(n=1,3,5,…\), the integrand is an odd function so not only is \(\langle x\rangle=0\) by construction, but all higher odd moments also vanish \(\langle x^3\rangle=\langle x^5\rangle=…=0\). As for even \(n=0,2,4,…\), there are several ways:
Way #\(1\): Start with the \(n=0\) normalization (obtained in the usual Poissonian manner):
and differentiate the equation by \(\frac{\partial}{\partial(-1/2\sigma^2)}\) to pull down arbitrarily many factors of \(x^2\). One finds for instance:
\[\langle x^2\rangle=\sigma^2\]
\[\langle x^4\rangle=3\sigma^4\]
\[\langle x^6\rangle=15\sigma^6\]
\[\langle x^8\rangle=105\sigma^8\]
and so forth, in general following the rule \(\langle x^{2m}\rangle=(2m-1)!!\sigma^{2m}\) for even \(n=2m\), where the double factorial can also be written in terms of single factorials as:
\[(2m-1)!!=\frac{(2m)!}{2^mm!}\]
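These moments are easy to sanity-check numerically. A minimal sketch (the function names, the choice \(\sigma=1.3\), and the midpoint-rule grid are all arbitrary choices made here):

```python
from math import exp, sqrt, pi, factorial

def gaussian_moment(m, sigma=1.3, half_width=12.0, n=200_000):
    """Midpoint-rule estimate of <x^(2m)> for a zero-mean normal with std sigma."""
    a = half_width * sigma          # integrate over [-a, a]; the tails beyond are negligible
    h = 2 * a / n
    total = 0.0
    for i in range(n):
        x = -a + (i + 0.5) * h
        total += x ** (2 * m) * exp(-x * x / (2 * sigma * sigma))
    return total * h / (sigma * sqrt(2 * pi))

def predicted(m, sigma=1.3):
    # <x^(2m)> = (2m-1)!! sigma^(2m), with (2m-1)!! = (2m)!/(2^m m!)
    return factorial(2 * m) / (2 ** m * factorial(m)) * sigma ** (2 * m)

for m in range(5):
    assert abs(gaussian_moment(m) - predicted(m)) < 1e-5 * predicted(m)
```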
Way #\(2\): Substitute \(u:=x^2/2\sigma^2\) to recast the integral in terms of a gamma function:
Problem #\(2\): From Solution #\(1\), the presence of factorials suggests a combinatorial interpretation of the result; what is this interpretation?
Solution #\(2\): Suppose one has \(6\) people that need to be paired up for a dance; how many pairings can be formed? There are \(2\) ways to think about this.
Way #\(1\): The first person can be paired with \(5\) other people. Then, after they’ve been paired, the next person can only pair up with \(3\) more people. And after they’ve paired, the next person can only pair with the \(1\) other person that’s left. So the answer is \(5!!=5\times 3\times 1=15\) pairs.
Way #\(2\): There are \(6!\) permutations of the \(6\) people. However, they are going to form \(3\) pairs which can be permuted in \(3!\) ways. And within each of the \(3\) pairs, there are a further \(2!=2\) permutations. So in total there will be \(\frac{6!}{2^33!}=15\) pairs.
The fact that Way #\(1\) and Way #\(2\) give the same result is just a restatement of the earlier identity \((2m-1)!!=(2m)!/2^mm!\).
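This counting is also easy to verify by brute force; here is a minimal recursive sketch (the helper name `count_pairings` is made up here):

```python
from math import factorial

def count_pairings(people):
    # Pair up the first person with each possible partner, then recurse on the rest.
    if len(people) == 0:
        return 1
    rest = people[1:]
    return sum(count_pairings(rest[:i] + rest[i + 1:]) for i in range(len(rest)))

# 6 people -> 5!! = 15 pairings, matching (2m)!/(2^m m!) for m = 3
assert count_pairings(list(range(6))) == 15
assert all(count_pairings(list(range(2 * m))) == factorial(2 * m) // (2 ** m * factorial(m))
           for m in range(6))
```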
In this case however, the “people” are the factors of \(x\) in \(x^{2m}\)! Because all factors of \(x\) are indistinguishable, all \((2m-1)!!\) pairings of the \(2m\) factors of \(x\) in \(x^{2m}\) into \(m\) pairs of \(x^2\) are equivalent. The factor of \(\sigma^{2m}\) then follows on dimensional analysis grounds (it’s the only length scale of the normal distribution), and the numerical coefficient takes on this combinatorial pairing interpretation.
Problem #\(3\): Estimate the expectation \(\langle\cos(x/\sigma)\rangle\) in a univariate normal random variable \(x\) with variance \(\sigma^2\) and zero mean \(\langle x\rangle=0\).
Solution #\(3\): The integral defining \(\langle\cos(x/\sigma)\rangle\) will mostly receive contributions from small \(|x|\lesssim\sigma\), so one can hope to get a rough estimate of it by Maclaurin-expanding \(\cos\theta=1-\theta^2/2+\theta^4/24-\theta^6/720+…\):
(or one could have just recognized the earlier Maclaurin series for \(e^{-1/2}\)). So just taking the \(4\)th partial sum \(1-\frac{1}{2}+\frac{1}{8}-\frac{1}{48}=\frac{29}{48}\approx 0.60417\) already gets within \(0.4\%\) of the true answer. The moral: monomials/powers of \(x\) form a basis of analytic functions and expectation is linear, so by computing all the moments of a distribution, one in principle has access to the expectation of any analytic function with respect to that distribution.
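Numerically, the partial sums of \(\sum_m(-1)^m(2m-1)!!/(2m)!=\sum_m(-1/2)^m/m!\) can be checked in a couple of lines:

```python
from math import exp, factorial

def partial_sum(N):
    # sum_{m<N} (-1)^m (2m-1)!!/(2m)!  =  sum_{m<N} (-1/2)^m / m!
    return sum((-0.5) ** m / factorial(m) for m in range(N))

assert abs(partial_sum(4) - 29 / 48) < 1e-12    # 1 - 1/2 + 1/8 - 1/48
assert abs(partial_sum(4) - exp(-0.5)) < 0.004  # already within ~0.4% of e^{-1/2}
assert abs(partial_sum(20) - exp(-0.5)) < 1e-14 # rapidly convergent
```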
Problem #\(4\): Evaluate the cumulant generating function \(\ln\langle e^{\kappa x}\rangle\) of a univariate normal random variable \(x\) with variance \(\sigma^2\) and zero mean \(\langle x\rangle=0\).
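Solution #\(4\): Completing the square in the exponent of the Gaussian integral gives:
\[\ln\langle e^{\kappa x}\rangle=\ln\int_{-\infty}^{\infty}\frac{dx}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/2\sigma^2+\kappa x}=\ln e^{\sigma^2\kappa^2/2}=\frac{\sigma^2\kappa^2}{2}\]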
So it is a parabola in \(\kappa\) with curvature \(\sigma^2\) at its vertex. The point therefore is that, besides the \(2\)nd cumulant \(\sigma^2\), all other cumulants of the normal distribution vanish! For instance, the \(3\)rd cumulant (“skewness”) is \(\langle x^3\rangle=0\), the \(4\)th cumulant (“excess kurtosis”) is \(\langle x^4\rangle-3\langle x^2\rangle^2=0\), etc.
Problem #\(4.5\): Another fun application of these ideas: define the \(n\)-th (probabilist’s) Hermite polynomial \(\text{He}_n(x)\) to be the unique degree-\(n\) monic polynomial which is orthogonal to all lower-degree Hermite polynomials with respect to the Gaussian weight function \(e^{-x^2/2}\) over the real line \(\textbf R\). Hence, calculate the first \(5\) Hermite polynomials \(\text{He}_0(x),\text{He}_1(x),\text{He}_2(x),\text{He}_3(x),\text{He}_4(x)\).
Solution #\(4.5\): From the definition given above, \(\text{He}_0(x)\) must just be a constant, and the monic requirement fixes this constant to be \(1\); thus \(\text{He}_0(x)=1\). The next Hermite polynomial must have the form \(\text{He}_1(x)=x+c_0\). To fix \(c_0\), one thus requires that (using the fact that inner products with respect to a weight function are identical to expectations of products with respect to the weight function viewed as a probability distribution):
Enforcing \(\langle\text{He}_1\text{He}_0\rangle=\langle x+c_0\rangle=c_0=0\) gives \(\text{He}_1(x)=x\). Next, writing \(\text{He}_2(x)=x^2+c_1x+c_0\), the orthogonality conditions \(\langle\text{He}_2\text{He}_0\rangle=1+c_0=0\) and \(\langle\text{He}_2\text{He}_1\rangle=c_1=0\) force \(\text{He}_2(x)=x^2-1\). A similar procedure gives \(\text{He}_3(x)=x^3-3x\). At this point, to speed oneself up a bit, one could recognize that the Hermite polynomials alternate in parity \(\text{He}_n(-x)=(-1)^n\text{He}_n(x)\), so powers of \(x\) hop by \(2\). This motivates the more intelligent ansatz \(\text{He}_4(x)=x^4+c_2x^2+c_0\), automatically ensuring orthogonality with \(\text{He}_1(x)\) and \(\text{He}_3(x)\). Enforcing orthogonality with \(\text{He}_0(x)\) and \(\text{He}_2(x)\) gives the system of linear equations \(3+c_2+c_0=0\) and \(15+3c_2+c_0=0\), so \(c_0=3\) and \(c_2=-6\), which gives \(\text{He}_4(x)=x^4-6x^2+3\).
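These orthogonality claims can be checked directly against the Gaussian moments computed earlier, representing each polynomial by its coefficient list (a minimal sketch; the helper names are made up here):

```python
from math import factorial

def moment(n):
    # Standard-normal moments: 0 for odd n, (n-1)!! = n!/(2^(n/2) (n/2)!) for even n.
    return 0 if n % 2 else factorial(n) // (2 ** (n // 2) * factorial(n // 2))

def inner(p, q):
    # <p(x) q(x)> under the Gaussian weight, with p, q as coefficient lists [c0, c1, ...].
    return sum(a * b * moment(i + j) for i, a in enumerate(p) for j, b in enumerate(q))

# He_0 .. He_4 as computed above
He = [[1], [0, 1], [-1, 0, 1], [0, -3, 0, 1], [3, 0, -6, 0, 1]]
assert all(inner(He[i], He[j]) == 0 for i in range(5) for j in range(i))
assert inner(He[4], He[4]) == factorial(4)  # consistent with the standard norm <He_n^2> = n!
```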
(mention the exponential generating function of the Hermite polynomials, and the operator representation, any connections?)
Problem #\(5\): Consider generalizing the prior discussion of a univariate normal random variable \(x\) with variance \(\sigma^2\) and zero mean \(\langle x\rangle=0\) to a \(d\)-dimensional multivariate normal random vector \(\textbf x\in\textbf R^d\) with covariance matrix \(\sigma^2\) and zero mean \(\langle\textbf x\rangle=\textbf 0\). Write down the appropriate normalized probability density function \(\rho(\textbf x)\) for \(\textbf x\).
Solution #\(5\): In analogy with the \(d=1\) univariate normal distribution, one has:
(prove by diagonalizing the covariance matrix \(\sigma^2\) of \(\textbf x\)).
Problem #\(6\): What are the moment and cumulant generating functions of a \(d\)-dimensional multivariate normal random vector \(\textbf x\in\textbf R^d\) with covariance matrix \(\sigma^2\) and zero mean \(\langle\textbf x\rangle=\textbf 0\)?
Solution #\(6\): The phrase “moment generating function” is only really appropriate in \(d=1\); this is because in \(d\geq 2\), the generator \(\langle e^{\boldsymbol{\kappa}\cdot\textbf x}\rangle\) for the random vector \(\textbf x=(x_1,x_2,…,x_d)\) generates more than just moments along a given axis like \(\langle x_1^2\rangle, \langle x_2^4\rangle\) but also correlators such as \(\langle x_1x_2^3\rangle,\langle x_1x_2x_3\rangle\), etc. which obviously don’t exist in \(d=1\). Similar to the univariate case, the even \(\textbf Z_2\) symmetry of the multivariate generator means that only correlators with even powers of \(x_i\) survive, so for instance \(\langle x_1^2x_2x_3\rangle=0\). To compute such strictly even correlators, the quickest way is typically to just compute the relevant term in the Maclaurin expansion of the generator:
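These even correlators are exactly what the Isserlis/Wick theorem computes: sum over all pairings of the factors, with each pair contributing a covariance matrix element. A minimal recursive sketch (the function name and the example covariance matrix are made up here):

```python
def wick(indices, cov):
    # Isserlis/Wick theorem: <x_{i1} x_{i2} ...> = sum over pairings of products of covariances.
    if len(indices) == 0:
        return 1.0
    if len(indices) % 2:
        return 0.0  # odd correlators vanish by the Z_2 symmetry
    first, rest = indices[0], indices[1:]
    # Pair the first factor with each remaining factor, then recurse on what's left.
    return sum(cov[first][rest[k]] * wick(rest[:k] + rest[k + 1:], cov)
               for k in range(len(rest)))

cov = [[1.0, 0.5], [0.5, 2.0]]  # an example covariance matrix sigma^2
assert abs(wick((0, 0, 1, 1), cov) - 2.5) < 1e-12  # s11 s22 + 2 s12^2
assert abs(wick((0, 1, 1, 1), cov) - 3.0) < 1e-12  # 3 s12 s22
assert wick((0, 0, 1), cov) == 0.0                 # odd correlator vanishes
```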
Problem #\(7\): Explain why, for an arbitrary analytic random function \(f(\textbf x)\) of an arbitrary (i.e. not necessarily normal) random vector \(\textbf x\), the expectation is: