<aside> 💡
This is a series of notes associated with Steve Brunton’s Excellent Book “Data-Driven Science and Engineering”. I highly recommend his series of lectures available on YouTube as well as the full book itself. These notes only cover the RL sections (Chapter 11) of the book. Relatedly, some people might find my breakdown of Monte Carlo Tree Search helpful.
</aside>
The policy gives the probability of taking action $a$ in state $s$; the goal is to choose a policy that maximizes the total future reward.
$$ \pi(s, a) = Pr(a_k = a | s_k = s) $$
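To make this concrete, here is a minimal sketch (my own illustration, not from the book) of a stochastic policy on a made-up two-state, two-action problem, stored as a simple probability table:

```python
import random

# pi[s][a] = Pr(a_k = a | s_k = s); each row sums to 1 (made-up numbers).
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, s):
    """Draw an action from the policy's distribution over actions in state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "left" about 80% of the time
```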
The state is a partial measurement of a higher-dimensional environment state (the environment itself is generally a stochastic, nonlinear dynamical system).
For simplicity, we assume that the state evolution is a Markov decision process (MDP): the probability of the next state depends only on the current state and action, not on the full history.
$$ P(s', s, a) = Pr(s_{k+1} = s' | s_k = s, a_k = a) $$
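As a sketch (again with made-up numbers), the transition model of a small discrete MDP can be stored as a nested table $P[s][a][s']$ and sampled directly:

```python
import random

# P[s][a][s'] = Pr(s_{k+1} = s' | s_k = s, a_k = a); made-up numbers.
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

def step(P, s, a):
    """Sample the next state s' given the current state s and action a."""
    next_states, probs = zip(*P[s][a].items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(step(P, "s0", "right"))  # "s1" about 80% of the time
```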
Even with this simplifying assumption, the state evolution can be hard to model. If it cannot be modeled, model-free RL strategies are used; if there is sufficient data to learn the MDP, model-based RL strategies can be used.
The reward is another partial measurement, also assumed to be Markovian: the reward distribution depends only on the current state, the action taken, and the resulting next state.
$$ R(s', s, a) = Pr(r_{k+1}|s_{k+1} = s', s_k = s, a_k = a) $$
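Continuing the toy example (still my own illustration), a common special case is a deterministic reward that depends only on the transition $(s, a, s')$:

```python
def reward(s, a, s_next):
    """Deterministic reward for the transition s --a--> s' (made-up values)."""
    return 1.0 if s_next == "s1" else 0.0
```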
The value function measures the desirability of being in a given state under a policy $\pi$. It is the expected sum of future rewards at time steps $k$, discounted by a factor $\gamma$.
$$ V_\pi(s) = E(\sum_{k} \gamma^kr_k | s_0 = s) $$
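As a sketch under the toy MDP and policy above (all numbers invented), $V_\pi(s)$ can be estimated by Monte Carlo: roll the policy forward many times from $s$ and average the discounted return.

```python
import random

pi = {"s0": {"left": 0.8, "right": 0.2}, "s1": {"left": 0.1, "right": 0.9}}
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

def reward(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0

def sample(dist):
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

def estimate_value(s0, gamma=0.9, horizon=100, n_rollouts=2000):
    """Average the discounted return over many rollouts started from s0."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for k in range(horizon):         # truncate the infinite sum
            a = sample(pi[s])            # a ~ pi(s, .)
            s_next = sample(P[s][a])     # s' ~ P(s', s, a)
            ret += gamma**k * reward(s, a, s_next)
            s = s_next
        total += ret
    return total / n_rollouts

print(estimate_value("s0"))  # approximately V_pi(s0)
```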
Note that the value function can be written recursively; in particular, the optimal value function satisfies
$$ V(s) = \max_\pi E(r_0 + \gamma V(s')) $$
So $V$ can be thought of as the value of a state, assuming the best possible policy $\pi$ is followed from then on.
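Because the maximization over policies reduces to choosing the best immediate action in each state, this recursion can be swept to convergence, which is value iteration. A minimal sketch on the same toy MDP (made-up numbers):

```python
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

def reward(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0

gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(1000):
    # One Bellman sweep: V(s) <- max_a sum_s' P(s'|s,a) [r + gamma V(s')]
    V_new = {
        s: max(
            sum(p * (reward(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in P
    }
    delta = max(abs(V_new[s] - V[s]) for s in P)
    V = V_new
    if delta < 1e-8:   # stop once the values have converged
        break

print(V)  # approximately the optimal value of each state
```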
Given a well-calibrated value function, we can extract the optimal policy as:
$$ \pi = \argmax_\pi E(r_0 + \gamma V(s')) $$
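Continuing the value-iteration sketch above (with `P`, `reward`, `gamma`, and the converged `V` still in scope), the greedy policy just picks, in each state, the action with the largest expected immediate reward plus discounted next-state value:

```python
def greedy_policy(V):
    """Deterministic policy: in each state, take the action that maximizes
    the expected immediate reward plus discounted next-state value."""
    return {
        s: max(
            P[s],
            key=lambda a: sum(
                p * (reward(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a].items()
            ),
        )
        for s in P
    }

print(greedy_policy(V))  # {"s0": "right", "s1": "right"} for these numbers
```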