<aside> 💡
This is a series of notes associated with Steve Brunton’s Excellent Book “Data-Driven Science and Engineering”. I highly recommend his series of lectures available on YouTube as well as the full book itself. These notes only cover the RL sections (Chapter 11) of the book. Relatedly, some people might find my breakdown of Monte Carlo Tree Search helpful.
</aside>
The policy gives the probability of taking action $a$ in state $s$; the goal is to choose a policy that maximizes the total future reward.
$$ \pi(s, a) = Pr(a_k = a | s_k = s) $$
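To make this concrete, here is a minimal sketch (my own illustration, not from the book) of a stochastic policy on a made-up two-state, two-action problem, stored as a simple probability table:

```python
import random

# pi[s][a] = Pr(a_k = a | s_k = s); each row sums to 1 (made-up numbers).
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, s):
    """Draw an action from the policy's distribution over actions in state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "left" about 80% of the time
```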
The state is a partial measurement of a higher-dimensional environment state (the environment itself is generally a stochastic, nonlinear dynamical system).
For simplicity, we assume that the state evolution is a Markov decision process (MDP): the probability of the next state depends only on the current state and action, not on the full history.
$$ P(s', s, a) = Pr(s_{k+1} = s' | s_k = s, a_k = a) $$
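As a sketch (again with made-up numbers), the transition model of a small discrete MDP can be stored as a nested table $P[s][a][s']$ and sampled directly:

```python
import random

# P[s][a][s'] = Pr(s_{k+1} = s' | s_k = s, a_k = a); made-up numbers.
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

def step(P, s, a):
    """Sample the next state s' given the current state s and action a."""
    next_states, probs = zip(*P[s][a].items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(step(P, "s0", "right"))  # "s1" about 80% of the time
```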
Even with this simplifying assumption, the state evolution can be hard to model. If it cannot be modeled, model-free RL strategies are used; if there is sufficient data to learn the MDP, model-based RL strategies can be used.
The reward is another partial measurement, also assumed to be Markovian: the reward distribution depends only on the current state, the action taken, and the resulting next state.
$$ R(s', s, a) = Pr(r_{k+1}|s_{k+1} = s', s_k = s, a_k = a) $$
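Continuing the toy example (still my own illustration), a common special case is a deterministic reward that depends only on the transition $(s, a, s')$:

```python
def reward(s, a, s_next):
    """Deterministic reward for the transition s --a--> s' (made-up values)."""
    return 1.0 if s_next == "s1" else 0.0
```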
The value function measures the desirability of being in a given state under a policy $\pi$. It is the expected sum of future rewards at time steps $k$, discounted by a factor $\gamma$.
$$ V_\pi(s) = E(\sum_{k} \gamma^kr_k | s_0 = s) $$
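As a sketch under the toy MDP and policy above (all numbers invented), $V_\pi(s)$ can be estimated by Monte Carlo: roll the policy forward many times from $s$ and average the discounted return.

```python
import random

pi = {"s0": {"left": 0.8, "right": 0.2}, "s1": {"left": 0.1, "right": 0.9}}
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

def reward(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0

def sample(dist):
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

def estimate_value(s0, gamma=0.9, horizon=100, n_rollouts=2000):
    """Average the discounted return over many rollouts started from s0."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for k in range(horizon):         # truncate the infinite sum
            a = sample(pi[s])            # a ~ pi(s, .)
            s_next = sample(P[s][a])     # s' ~ P(s', s, a)
            ret += gamma**k * reward(s, a, s_next)
            s = s_next
        total += ret
    return total / n_rollouts

print(estimate_value("s0"))  # approximately V_pi(s0)
```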
Note that the value function can be written recursively; in particular, the optimal value function satisfies
$$ V(s) = \max_\pi E(r_0 + \gamma V(s')) $$
So $V$ can be thought of as the value of a state, assuming the best possible policy $\pi$ is followed from then on.
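Because the maximization over policies reduces to choosing the best immediate action in each state, this recursion can be swept to convergence, which is value iteration. A minimal sketch on the same toy MDP (made-up numbers):

```python
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

def reward(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0

gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(1000):
    # One Bellman sweep: V(s) <- max_a sum_s' P(s'|s,a) [r + gamma V(s')]
    V_new = {
        s: max(
            sum(p * (reward(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in P
    }
    delta = max(abs(V_new[s] - V[s]) for s in P)
    V = V_new
    if delta < 1e-8:   # stop once the values have converged
        break

print(V)  # approximately the optimal value of each state
```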
Given a well-calibrated value function, we can extract the optimal policy as:
$$ \pi = \argmax_\pi E(r_0 + \gamma V(s')) $$
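Continuing the value-iteration sketch above (with `P`, `reward`, `gamma`, and the converged `V` still in scope), the greedy policy just picks, in each state, the action with the largest expected immediate reward plus discounted next-state value:

```python
def greedy_policy(V):
    """Deterministic policy: in each state, take the action that maximizes
    the expected immediate reward plus discounted next-state value."""
    return {
        s: max(
            P[s],
            key=lambda a: sum(
                p * (reward(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a].items()
            ),
        )
        for s in P
    }

print(greedy_policy(V))  # {"s0": "right", "s1": "right"} for these numbers
```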