Problem: How does the paradigm of reinforcement learning (RL) fit within the broader context of machine learning?
Solution: It is instructive to compare and contrast reinforcement learning with supervised learning (SL); in doing so, it will be seen that RL can in fact be viewed as a generalization of SL:
- In SL, one has training feature vectors \(\mathbf x_i\); in RL, these manifest as the state vector \(\mathbf x_t\) of an agent (just as it is harmless to conflate the position vector \(\mathbf x(t)\) of a particle in classical mechanics with the particle itself, there is no harm in conflating the state vector \(\mathbf x_t\) with the agent itself; see the map-territory relation). The difference is purely a perspective shift, reflected in the choice of subscript variable \(i=1,2,3,\dots\) vs. \(t=0,1,2,\dots\).
- In SL, the feature vectors \(\mathbf x_i\) are i.i.d. draws from the same underlying data-generating distribution. In RL, the initial state vector \(\mathbf x_{t=0}\) is drawn from an initial distribution, and subsequent state vectors \(\mathbf x_t\) are dependent on the previous state \(\mathbf x_{t-1}\) (discrete-time Markov chain).
- In SL, the model output \(\hat y(\mathbf x|\boldsymbol{\theta})\) is analogous to an action \(\mathbf f\) sampled from the agent’s policy \(\pi(\mathbf f|\mathbf x,t)\).
- In SL, each training feature vector \(\mathbf x_i\) is labelled by its correct output \(y_i\). This concept has no analog in reinforcement learning as there is no notion of a “correct” action for the agent to take.
- In SL, the feedback mechanism is provided by a loss function \(L(\hat y,y)\). By contrast, in RL the feedback mechanism is provided by a reward signal \(r_t\).
- In SL, the objective is to find optimal parameters \(\boldsymbol{\theta}^*=\text{argmin}_{\boldsymbol{\theta}}\sum_{i=1}^{N_{\text{train}}}L(\hat y(\mathbf x_i|\boldsymbol{\theta}),y_i)\) minimizing the training cost over the \(N_{\text{train}}\) training examples. In RL, the objective is to find an optimal agent policy \(\pi^*=\text{argmax}_{\pi}\langle\sum_{t=0}^T\gamma^tr_t|\pi\rangle\) maximizing the expected return over an episode of horizon \(T\leq\infty\) (which depends on if/when the agent reaches a terminal state \(\mathbf x_T\)). A toy side-by-side sketch of these two objectives is given after this list.
- In SL, one typically distinguishes between training, cross-validation, and testing datasets, the golden rule being that the model should never get to see any of the test data during training. In RL, it is socially acceptable to train on your test set 🙂
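To make the contrast between the two objectives concrete, here is a minimal, self-contained Python sketch; every number, the toy model \(\hat y=\theta x\), and the one-state MDP below are made up purely for illustration. It brute-forces both optimizations: a grid search for \(\boldsymbol{\theta}^*\) minimizing the training cost, and a comparison of two candidate policies via Monte Carlo estimates of their expected returns.

```python
import random

# --- SL: theta* = argmin_theta sum_i L(yhat(x_i|theta), y_i) ---
xs = [0.0, 1.0, 2.0, 3.0]             # toy training inputs (made up)
ys = [0.1, 2.1, 3.9, 6.2]             # toy labels, roughly y = 2x
thetas = [i / 10 for i in range(41)]  # candidate slopes 0.0, 0.1, ..., 4.0

def train_cost(theta):
    # squared-error loss summed over the training examples
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

theta_star = min(thetas, key=train_cost)

# --- RL: pi* = argmax_pi < sum_t gamma^t r_t | pi > ---
# A made-up one-state MDP: action "safe" always gives reward 1, while action
# "risky" gives reward 5 with probability 0.3 and reward -1 otherwise.
def step(action, rng):
    if action == "safe":
        return 1.0
    return 5.0 if rng.random() < 0.3 else -1.0

def expected_return(action, gamma=0.9, episodes=2000, horizon=5):
    rng = random.Random(0)
    total = 0.0
    for _ in range(episodes):
        ret = 0.0
        for t in range(horizon):
            ret += gamma ** t * step(action, rng)   # discounted return of one episode
        total += ret
    return total / episodes

policies = {"always safe": "safe", "always risky": "risky"}
pi_star = max(policies, key=lambda name: expected_return(policies[name]))

print(f"theta* = {theta_star:.1f}   (SL: minimize the training cost)")
print(f"pi*    = {pi_star}   (RL: maximize the expected return)")
```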
Problem: Imagine the set of all states \(\mathbf x\), i.e. the agent’s state space. On this state space, one can impose a scalar field \(v_{\pi}(\mathbf x,t)\); what is the meaning of this field? Give a simple example thereof.
Solution: The scalar field \(v_{\pi}(\mathbf x,t)\) can roughly be thought of as describing how “valuable” the state \(\mathbf x\) is at time \(t\). More precisely, if one were to initialize an agent \(\pi\) at state \(\mathbf x_t:=\mathbf x\), then \(v_{\pi}(\mathbf x,t)\) is the expected return:
\[v_{\pi}(\mathbf x,t):=\biggl\langle\sum_{t'=t+1}^T\gamma^{t'-t-1}r_{t'}\biggm|\mathbf x,\pi\biggr\rangle\]
Consider a \(\pi\)-creature (of \(3\)Blue\(1\)Brown fame) symbolizing an agent with policy \(\pi\) that has to complete a maze. Within the RL framework, this can be modelled as an MDP in which each square is one possible state \(\mathbf x\) of the agent, the maze structure defines the allowed actions \(\mathbf f\) from each state \(\mathbf x\), and one can work with an undiscounted \(\gamma=1\) return in which the reward is \(r_t=-1\) for each action taken. Thus, the optimal policy \(\pi^*\) is a time-independent, deterministic policy \(\mathbf f=\pi^*(\mathbf x)\) that gets the \(\pi\)-creature from its initial state to the terminal state in the fewest moves, and the optimal value of a square is simply minus the number of moves needed to escape from it. With that in mind, one can label on top of each square \(\mathbf x\) its corresponding optimal value \(v_{\pi^*}(\mathbf x)\) (note: this image was generated using Nano Banana Pro; some of the calculated values are wrong, but the idea is clear):

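To make the maze example concrete, here is a minimal Python sketch (the maze layout below is a made-up stand-in, not the one in the image above): with \(\gamma=1\) and reward \(-1\) per move, the optimal value of a square is minus the fewest moves needed to reach the exit from it, which a breadth-first search outward from the exit recovers directly.

```python
from collections import deque

# Made-up maze: '#' = wall, '.' = free square, 'G' = terminal (exit) square.
MAZE = [
    "#######",
    "#..#.G#",
    "#.##..#",
    "#.....#",
    "#######",
]

def optimal_values(maze):
    """Return {(row, col): v*} with v*(x) = -(fewest moves from square x to G)."""
    goal = next((r, c) for r, row in enumerate(maze)
                for c, ch in enumerate(row) if ch == "G")
    values = {goal: 0}                    # the terminal state has value 0
    frontier = deque([goal])
    while frontier:                       # BFS outward from the exit
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if maze[nr][nc] != "#" and (nr, nc) not in values:
                values[(nr, nc)] = values[(r, c)] - 1   # one more move is needed
                frontier.append((nr, nc))
    return values

for (r, c), v in sorted(optimal_values(MAZE).items()):
    print(f"square ({r},{c}): v* = {v}")
```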
Problem: The value function \(v_{\pi}(\mathbf x,t)\) is passive; it simply assesses how good/bad it is to be at \(\mathbf x_t:=\mathbf x\). In order to remedy this, one can look at the quality function \(q_{\pi}(\mathbf x,\mathbf f,t)\); explain how this takes on a more active role compared to the passive nature of the value function \(v_{\pi}(\mathbf x,t)\).
Solution: Because \(q_{\pi}(\mathbf x,\mathbf f,t)\) assesses the quality of the agent choosing the action \(\mathbf f_t:=\mathbf f\) while in state \(\mathbf x_t:=\mathbf x\). That is, it is a more refined conditional expected return:
\[q_{\pi}(\mathbf x,\mathbf f,t)=\biggl\langle\sum_{t'=t+1}^T\gamma^{t'-t-1}r_{t'}\biggm|\mathbf x,\mathbf f,\pi\biggr\rangle\]
Problem: How come the agent’s policy, value function and quality function are sometimes respectively written as \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) without the \(t\) argument?
Solution: There could be \(3\) reasons:
- Similar to the above maze example, sometimes it just doesn’t depend on \(t\).
- If one is seeking to maximize the agent’s expected return over an infinite horizon \(T=\infty\), then it is both intuitively and mathematically clear that \(\pi(\mathbf f|\mathbf x),v_{\pi}(\mathbf x)\) and \(q_{\pi}(\mathbf x,\mathbf f)\) should all be invariant with respect to time translations.
- If the horizon is finite \(T<\infty\), then it is standard to package \(\mathbf x\) and \(t\) together, thereby working with an “augmented” state vector \(\mathbf x’:=(\mathbf x,t)\). Then everything can be made to look identical to the infinite horizon \(T=\infty\) case by replacing \(\mathbf x\mapsto\mathbf x’\).
(cf. distinction between state variables and path variables in thermodynamics).
Problem: State and prove the law of iterated expectation.
Solution: If \(X,Y\) are any random variables, then:
\[\langle X\rangle=\langle\langle X|Y\rangle\rangle\]
The proof is easy as long as one is careful to interpret all the expectations correctly. For instance, \(\langle X|Y\rangle\) is not a scalar but a random variable with respect to \(Y\):
\[\langle X|Y\rangle=\int dx p(x|Y) x\]
Thus, it is clear that the outer expectation must also be with respect to \(Y\) alone:
\[\langle\langle X|Y\rangle\rangle=\int dy p(y)\langle X|y\rangle=\int dxdy p(y)p(x|y)x\]
Rewriting in terms of the joint distribution \(p(x,y)=p(y)p(x|y)\) and integrating out \(\int dy p(x,y)=p(x)\), one finally obtains:
\[=\int dx p(x) x=\langle X\rangle\]
Of course, this also generalizes easily to identities such as the following equality of random variables (w.r.t. \(Z\)):
\[\langle X|Z\rangle=\langle\langle X|Y,Z\rangle|Z\rangle\]
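For a quick numerical sanity check of the law, here is a minimal Python sketch with an arbitrarily chosen joint distribution: \(Y\) is uniform on \(\{0,1,2\}\) and \(X|Y\) is Gaussian with mean \(Y\), so the conditional expectation \(\langle X|Y\rangle\) is just the random variable \(Y\) itself; averaging it over \(Y\) should reproduce \(\langle X\rangle\approx 1\).

```python
import random

rng = random.Random(0)

# Arbitrary joint distribution: Y ~ Uniform{0, 1, 2}, then X | Y ~ Normal(mean=Y, sd=1),
# so the conditional expectation <X|Y> is simply the random variable Y itself.
ys = [rng.randrange(3) for _ in range(200_000)]
xs = [rng.gauss(y, 1.0) for y in ys]

mean_x = sum(xs) / len(xs)       # direct Monte Carlo estimate of <X>
mean_cond = sum(ys) / len(ys)    # Monte Carlo estimate of <<X|Y>> = <Y>

print(f"<X>     ≈ {mean_x:.3f}")
print(f"<<X|Y>> ≈ {mean_cond:.3f}   (law of iterated expectation: both ≈ 1.0)")
```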
Problem: Show that the value function \(v_{\pi}(\mathbf x)\) satisfies the following identity (to be used as a lemma later):
\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]
Solution: Denote the return random variable by:
\[R_t:=\sum_{t'=t+1}^T\gamma^{t'-t-1}r_{t'}\]
The fundamental observation is that \(R_t\) obeys a simple recurrence relation:
\[R_t=r_{t+1}+\gamma R_{t+1}\]
Taking suitable conditional expectations thereof:
\[\langle R_t|\mathbf x_t,\pi\rangle=\langle r_{t+1}|\mathbf x_t,\pi\rangle+\gamma\langle R_{t+1}|\mathbf x_t,\pi\rangle\]
The LHS is the definition of \(v_{\pi}(\mathbf x_t)\). Applying the law of iterated expectation to the \(2^{\text{nd}}\) term on the RHS:
\[\langle R_{t+1}|\mathbf x_t,\pi\rangle=\langle\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle|\mathbf x_t,\pi\rangle\]
The Markov property ensures that \(\langle R_{t+1}|\mathbf x_{t+1},\mathbf x_t,\pi\rangle=\langle R_{t+1}|\mathbf x_{t+1},\pi\rangle=v_{\pi}(\mathbf x_{t+1})\). One thus obtains the desired result:
\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]
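As an aside, the recurrence \(R_t=r_{t+1}+\gamma R_{t+1}\) is also exactly how one computes the returns of a sampled episode in practice, by sweeping backwards from the terminal step. A minimal sketch with a made-up reward sequence:

```python
# Rewards r_1, r_2, ..., r_T observed along a single made-up episode;
# rewards[t] plays the role of r_{t+1} in the notation above.
rewards = [-1.0, -1.0, 0.5, -1.0, 10.0]
gamma = 0.9

# Sweep backwards: R_T = 0 at the terminal step, then R_t = r_{t+1} + gamma * R_{t+1}.
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

print(returns[:-1])   # returns[t] is the return R_t from time step t
```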
Problem: How are \(v_{\pi}(\mathbf x_t)\) and \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) related to each other?
Solution: Apply the law of iterated expectations:
\[v_{\pi}(\mathbf x_t):=\langle R_t|\mathbf x_t,\pi\rangle=\langle\langle R_t|\mathbf x_t,\mathbf f_t,\pi\rangle|\mathbf x_t,\pi\rangle=\langle q_{\pi}(\mathbf x_t,\mathbf f_t)|\mathbf x_t,\pi\rangle\]
Since \(\mathbf x_t\) is being conditioned upon, the expectation must be over \(\mathbf f_t\) so one can explicitly write it as:
\[=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)q_{\pi}(\mathbf x_t,\mathbf f_t)\]
But trivially \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\). Thus, one has an expression for \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\). Now one would like to proceed the other way, expressing \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) in terms of \(v_{\pi}\). This can be achieved by fleshing out the expectation from the earlier lemma:
\[v_{\pi}(\mathbf x_t)=\langle r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1})|\mathbf x_t,\pi\rangle\]
\[=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]
Further expand \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\pi)=\sum_{\mathbf f_t}p(\mathbf f_t|\mathbf x_t,\pi)p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)\), recognizing again that \(p(\mathbf f_t|\mathbf x_t,\pi)=\pi(\mathbf f_t|\mathbf x_t)\) and that \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t,\pi)=p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) since, once the action \(\mathbf f_t\) is fixed, \(\pi\) carries no further information about the immediate transition. Moving the summation \(\sum_{\mathbf f_t}\) to the outside and comparing with the earlier identity expressing \(v_{\pi}(\mathbf x_t)\) in terms of \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), one can pattern-match:
\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]
(which in hindsight is manifest…)
Problem: Hence, deduce the Bellman equations for the value and quality functions.
Solution: To obtain the Bellman equation for \(v_{\pi}(\mathbf x_t)\), substitute the above expression for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\) into the expression for \(v_{\pi}(\mathbf x_t)\):
\[v_{\pi}(\mathbf x_t)=\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi}(\mathbf x_{t+1}))\]
Similarly, to get the Bellman equation for \(q_{\pi}(\mathbf x_t,\mathbf f_t)\), substitute vice versa:
\[q_{\pi}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\left(r_{t+1}+\gamma\sum_{\mathbf f_{t+1}}p(\mathbf f_{t+1}|\mathbf x_{t+1},\pi)q_{\pi}(\mathbf x_{t+1},\mathbf f_{t+1})\right)\]
Intuitively, the Bellman equations relate the value of every state with every other state, thereby providing a massive system of simultaneous equations that in principle can be solved to deduce the state values provided one has complete knowledge of the environment dynamics \(p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)\) (similar remarks apply to the qualities of different state-action pairs).
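To see what “in principle can be solved” means in practice, note that the Bellman equation for \(v_{\pi}\) is linear in the state values, so for a finite MDP with known dynamics one can stack the values into a vector and solve \(\mathbf v=\mathbf r_{\pi}+\gamma P_{\pi}\mathbf v\) directly. Below is a minimal NumPy sketch on a made-up \(3\)-state, \(2\)-action MDP with a made-up stochastic policy; all numbers are arbitrary, and the reward distribution is summarized by its expected value \(R[\mathbf x,\mathbf f]\), which is all that the value computation needs.

```python
import numpy as np

n_states, gamma = 3, 0.9

# Made-up environment dynamics: P[x, f, x'] = p(x'|x, f), R[x, f] = expected reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
R = np.array([
    [0.0, -1.0],
    [1.0,  0.0],
    [0.0,  0.0],
])

# Made-up stochastic policy pi[x, f] = pi(f|x).
pi = np.array([
    [0.5, 0.5],
    [0.2, 0.8],
    [1.0, 0.0],
])

# Policy-averaged dynamics and rewards:
#   P_pi[x, x'] = sum_f pi(f|x) p(x'|x, f),   r_pi[x] = sum_f pi(f|x) R[x, f].
P_pi = np.einsum("xf,xfy->xy", pi, P)
r_pi = np.einsum("xf,xf->x", pi, R)

# Bellman equation v = r_pi + gamma * P_pi v  <=>  (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Recover q_pi(x, f) = R[x, f] + gamma * sum_{x'} p(x'|x, f) v(x').
q = R + gamma * np.einsum("xfy,y->xf", P, v)

print("v_pi:", np.round(v, 3))
print("q_pi:\n", np.round(q, 3))
```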
Problem: The above Bellman equations apply to a generic agent policy \(\pi\); how do they specialize to the case of an optimal policy \(\pi^*\)?
Solution: Reverting to the form of the “pre-Bellman” equations and using the key insight that the optimal policy \(\pi^*(\mathbf f_t|\mathbf x_t)=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)}\) should only assign non-zero probability to actions of the best quality, one has:
\[v_{\pi^*}(\mathbf x_t)=\text{max}_{\mathbf f_t}q_{\pi^*}(\mathbf x_t,\mathbf f_t)\]
\[q_{\pi^*}(\mathbf x_t,\mathbf f_t)=\sum_{r_{t+1},\mathbf x_{t+1}}p(r_{t+1},\mathbf x_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v_{\pi^*}(\mathbf x_{t+1}))\]
which, upon substituting one into the other, yield the Bellman optimality equations.
Problem: In light of Bellman optimality, discuss how policy evaluation and policy improvement work. Hence, explain the notion of generalized policy iteration (GPI).
Solution: Imagine the space of all \((\pi,v)\) pairs, where \(\pi(\mathbf f_t|\mathbf x_t)\) is any policy distribution and \(v(\mathbf x_t)\) is any candidate value function over the state space. Within this space, there are \(2\) canonical subspaces. One is the set of all pairs \((\pi,v_{\pi})\) where \(v_{\pi}\) is specifically the value function associated to the policy \(\pi\); the other is the set of pairs \((\pi_v,v)\) where \(\pi_v(\mathbf f_t|\mathbf x_t):=\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\) is \(\varepsilon=0\)-greedy with respect to the value function \(v\), with \(q(\mathbf x_t,\mathbf f_t):=\langle r_{t+1}+\gamma v(\mathbf x_{t+1})|\mathbf x_t,\mathbf f_t\rangle\) (in particular, notice \(q\neq q_{\pi}\); the policy \(\pi\) does not appear in the definition of \(q\)).
- Policy evaluation is an algorithm for projecting \((\pi,v)\mapsto (\pi,v_{\pi})\). The idea is to view \(v\) as a random initial guess for the underlying true policy value function \(v_{\pi}\) (though any terminal states \(\mathbf x_T\) should be initialized \(v(\mathbf x_T):=0\)). Then, sweeping through each state \(\mathbf x_t\), one updates \(v(\mathbf x_t)\) using the Bellman equation evaluated at the current estimates of the other states’ values:
\[v(\mathbf x_t)\mapsto\sum_{\mathbf f_t}\pi(\mathbf f_t|\mathbf x_t)\sum_{\mathbf x_{t+1},r_{t+1}}p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)(r_{t+1}+\gamma v(\mathbf x_{t+1}))\]
With sufficiently many sweeps over the state space, general theorems guarantee convergence \(v\to v_{\pi}\).
- Policy improvement is an algorithm for projecting onto the other canonical subspace \((\pi,v)\mapsto (\pi_v,v)\). Essentially, just act \(\varepsilon=0\)-greedy with respect to the current value estimate \(v\):
\[\pi(\mathbf f_t|\mathbf x_t)\mapsto\delta_{\mathbf f_t,\text{argmax}_{\mathbf f_t}q(\mathbf x_t,\mathbf f_t)}\]
This update is rigorously justified by the policy improvement theorem.
Policy iteration roughly amounts to alternating between the \(2\) steps of policy evaluation and policy improvement until the policy stops changing. However, there is some leeway in how one interleaves the two steps: for instance, one need not run the policy evaluation step to completion, but can instead perform just a single sweep over the state space before each improvement step (this particular truncation is known as value iteration). Generalized policy iteration (GPI) refers to this broader idea of letting evaluation and improvement interact at whatever granularity is convenient; any such scheme can be shown to eventually funnel/converge onto the optimal policy \(\pi^*\), as one can prove that \(\pi\) is \(\varepsilon=0\)-greedy with respect to its own value function \(v_{\pi}\) iff \(\pi=\pi^*\).
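Here is a minimal Python sketch of policy iteration on the same kind of made-up tabular MDP as before (all dynamics, rewards and the discount are arbitrary): the inner loop is policy evaluation by repeated Bellman sweeps, and the outer step is the \(\varepsilon=0\)-greedy policy improvement. Capping the evaluation at a single sweep would instead give value iteration.

```python
import numpy as np

n_states, gamma = 3, 0.9

# Made-up dynamics p(x'|x, f) and expected rewards R[x, f] (same structure as before).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, -1.0], [1.0, 0.0], [0.0, 0.0]])

def evaluate(policy, v, sweeps=200, tol=1e-8):
    """Policy evaluation: repeatedly apply the Bellman backup for a fixed deterministic policy."""
    for _ in range(sweeps):
        # v(x) <- R[x, pi(x)] + gamma * sum_{x'} p(x'|x, pi(x)) v(x')
        v_new = np.array([R[x, policy[x]] + gamma * P[x, policy[x]] @ v
                          for x in range(n_states)])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

def improve(v):
    """Policy improvement: act epsilon=0-greedy w.r.t. q(x, f) = R[x, f] + gamma * sum_{x'} p(x'|x, f) v(x')."""
    q = R + gamma * np.einsum("xfy,y->xf", P, v)
    return np.argmax(q, axis=1)

# Policy iteration: alternate evaluation and improvement until the policy is stable.
policy, v = np.zeros(n_states, dtype=int), np.zeros(n_states)
while True:
    v = evaluate(policy, v)
    new_policy = improve(v)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy pi*(x):", policy)
print("optimal values v*(x): ", np.round(v, 3))
```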
Problem: Distinguish between model-free vs. model-based methods/algorithms in reinforcement learning.
Solution: In both cases, the fundamental limitation (which previously was taken for granted) is incomplete knowledge of the environment dynamics \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\); instead, one can only sample trajectories \(\mathbf x_{t=0},\mathbf f_{t=0},r_{t=1},\mathbf x_{t=1},\mathbf f_{t=1},r_{t=2},\dots\) by running a policy \(\pi\) through the MDP. Recalling that the fundamental objective of RL is to compute \(\pi^*=\text{argmax}_{\pi}\langle R_{t=0}|\pi\rangle\), a model-free method is one which does not attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) (here \(p\) is what the word “model” refers to, i.e. a model of the environment dynamics). By contrast, model-based methods do attempt to estimate \(p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\), and can thereby obtain \(\pi^*\) through GPI as outlined earlier (see the OpenAI Spinning Up documentation for a nice diagram of this taxonomy and further comments).
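To make the distinction concrete, here is a minimal Python sketch of the model-based route; everything below (the environment, the behaviour policy, the episode counts) is made up for illustration. The true dynamics are hidden inside `step()`, so the agent can only sample transitions \((\mathbf x_t,\mathbf f_t,r_{t+1},\mathbf x_{t+1})\); a model-based method tallies them into an empirical estimate \(\hat p(\mathbf x_{t+1},r_{t+1}|\mathbf x_t,\mathbf f_t)\) on which GPI could then be run as before, whereas a model-free method would skip this estimation step and work with the sampled rewards/returns directly.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)

def step(state, action):
    """Hidden, made-up environment dynamics; the agent only ever sees samples from it."""
    if action == 0:
        return (state, 0.0) if rng.random() < 0.7 else ((state + 1) % 3, 1.0)
    return ((state + 1) % 3, -1.0) if rng.random() < 0.5 else ((state + 2) % 3, 2.0)

# Sample trajectories with a uniformly random behaviour policy.
counts = defaultdict(Counter)        # counts[(x, f)][(x', r)] = number of occurrences
for _ in range(200):                 # episodes
    state = 0
    for _ in range(50):              # steps per episode
        action = rng.randrange(2)
        next_state, reward = step(state, action)
        counts[(state, action)][(next_state, reward)] += 1
        state = next_state

# Empirical model: p_hat(x', r | x, f) = count / total count for that (x, f) pair.
p_hat = {
    (x, f): {xr: n / sum(c.values()) for xr, n in c.items()}
    for (x, f), c in counts.items()
}

for (x, f), dist in sorted(p_hat.items()):
    print(f"state {x}, action {f}:",
          {xr: round(prob, 2) for xr, prob in sorted(dist.items())})
```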