There is a wide range of reinforcement learning algorithms, each slightly different from the last: value iteration, policy iteration, PPO, TRPO, REINFORCE, A2C, A3C… If these blur together for you into one blob of ideas, that is completely expected! Four decades of the proud tradition of categorization and naming have overloaded terms and letters, making it hard for newcomers to understand what's going on.
The goal of this article is not to explain each of these terms (we would be stuck here forever!), but to give you a mental model of the constraints that give rise to these categories of algorithms. My hope is that, by the end of this, you can look at an algorithm, place it in a set of categories, and understand its trade-offs and which domains it suits best.
Below is a flowchart of how to think about RL algorithms. A warning before you peruse it: it is a simplification. In truth, categories can be fuzzy (we'll see an example later), and the distinctions don't form a perfectly clean hierarchy. But for our first venture into this, the reduction will do:
```mermaid
flowchart TD
Start([Reinforcement Learning Algorithm]) --> ModelQ{"Does the algorithm learn or use
a model of the environment?
(i.e., can it predict next states
and rewards?)"}
ModelQ -->|Yes| ModelBased["Model-Based RL
(Uses environment predictions)"]
ModelBased --> PI["Policy Iteration"]
ModelBased --> VI["Value Iteration"]
ModelQ -->|No| ModelFree["Model-Free RL
(Learns directly from experience)"]
ModelFree --> PolicyQ{"How does the algorithm
learn what actions to take?
(Policy Learning Method)"}
PolicyQ -->|"Learns state/action values
(How good is each action?)"| ValueBased["Value-Based Methods"]
ValueBased --> QL["Q-Learning"]
ValueBased --> DQN["DQN"]
ValueBased --> DDQN["Double DQN"]
PolicyQ -->|"Learns policy directly"| PolicyBased["Policy-Based Methods"]
PolicyBased --> OnOffQ{"Does it require actions to be
from current policy only?"}
OnOffQ -->|"Yes (On-Policy)"| OnPolicy["On-Policy Methods"]
OnPolicy --> REINFORCE["REINFORCE"]
OnPolicy --> TRPO["TRPO"]
OnPolicy --> PPO["PPO"]
OnOffQ -->|"No (Off-Policy)"| OffPolicy["Off-Policy Methods"]
OffPolicy --> DDPG["DDPG"]
OffPolicy --> TD3["TD3"]
OffPolicy --> SAC["SAC"]
PolicyQ -->|"Jointly optimizes action selection
and value estimation"| ActorCritic["Actor-Critic Methods"]
ActorCritic --> ACTypes{"How are updates performed?
(Sequential vs Parallel)"}
ACTypes -->|"Sequential"| Sync["Synchronous Methods"]
Sync --> A2C["A2C"]
ACTypes -->|"Parallel"| Async["Asynchronous Methods"]
Async --> A3C["A3C"]
Async --> AsyncDDPG["Async DDPG"]
%% Styling
style Start fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style ModelQ fill:#3182CE,stroke:#2C5282,color:#FFFFFF
style PolicyQ fill:#3182CE,stroke:#2C5282,color:#FFFFFF
style OnOffQ fill:#3182CE,stroke:#2C5282,color:#FFFFFF
style ACTypes fill:#3182CE,stroke:#2C5282,color:#FFFFFF
%% Category Nodes
style ModelBased fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style ModelFree fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style ValueBased fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style PolicyBased fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style ActorCritic fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style OnPolicy fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style OffPolicy fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style Sync fill:#4A5568,stroke:#2D3748,color:#FFFFFF
style Async fill:#4A5568,stroke:#2D3748,color:#FFFFFF
%% Algorithm Nodes
style PI fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style VI fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style QL fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style DQN fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style DDQN fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style REINFORCE fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style TRPO fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style PPO fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style DDPG fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style TD3 fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style SAC fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style A2C fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style A3C fill:#2D3748,stroke:#1A202C,color:#FFFFFF
style AsyncDDPG fill:#2D3748,stroke:#1A202C,color:#FFFFFF
```
Since the leaves of this flowchart cover quite a few algorithms (some familiar faces might include PPO, Q-Learning, and value iteration), I would like to focus instead on the branch points, and on how the constraints of our systems have shaped the kinds of algorithms we have developed.
<aside> 💡
Do not confuse the policy models we train with the “model” in model-based/model-free. The term is overloaded: when we talk about a model here, we mean a model of the environment and the rewards it produces.
</aside>
The core distinction between model-based and model-free is whether you have a learned model of the environment, or whether you are interacting with the environment directly and gleaning rewards from those experiences.
Imagine you are training a robot to play basketball. You can take as many shots at the basket as you’d like. You are not in a data-constrained regime, but simulating all the interactions between a basketball and the basket is not a trivial task. This is an excellent domain for model-free RL: training data is easy to collect directly.
But instead imagine you couldn’t actually collect that many training shots. You need to be more sample efficient and get more bang for your buck every time your robot shoots. So instead you’d teach the robot about gravity, acceleration, and air resistance. The robot can then build a mental model of what will happen when it throws the ball, which will hopefully help it land more good shots. This is model-based RL.
Note that a lot of model-free methods still interact with simulated environments; in self-driving, for example, policies are trained in simulation. This is still model-free learning, since the agent is not leveraging a model of the environment to predict future states and rewards.
Here is the split: if the only way to learn the reward of an action is to take that action, the algorithm is model-free. If instead you can query an environment model for the predicted outcome of an action, it is model-based.
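To make the split concrete, here is a minimal sketch in Python. The tabular setup, the `predict` function standing in for an environment model, and all the sizes and hyperparameters are illustrative assumptions on my part, not part of any specific library or algorithm implementation.

```python
import numpy as np

# Illustrative tabular setup (all sizes and hyperparameters are assumptions).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def model_free_update(state, action, reward, next_state):
    """Model-free (Q-learning style): the only information available is the
    transition we actually experienced, so we must take the action to learn."""
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])

def model_based_update(state, action, predict):
    """Model-based: `predict(state, action)` is a stand-in for an environment
    model that returns an expected (next_state, reward). We can back up values
    without ever touching the real environment."""
    next_state, reward = predict(state, action)
    Q[state, action] = reward + gamma * Q[next_state].max()
```

The trade-off is the one from the basketball example: the model-based backup is cheap and sample efficient once you have a model, but it is only as accurate as the model itself.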
In model-free training, you can really learn two things:
- A value function: an estimate of how good each state or action is, from which a policy can be derived (the value-based branch above).
- A policy: a direct mapping from states to actions (the policy-based branch above).
And you can also learn both simultaneously, which is exactly what actor-critic methods do.
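As a rough sketch of what “learning a value function” versus “learning a policy directly” looks like in code, here is a hypothetical tabular example. The REINFORCE-style update, the softmax policy parameterization, and all names and constants are assumptions chosen for illustration.

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99

# Option 1: learn values ("how good is each action in each state?")
# and act greedily with respect to them.
Q = np.zeros((n_states, n_actions))

def value_update(s, a, r, s_next):
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Option 2: learn a policy directly, REINFORCE-style: increase the
# log-probability of actions in proportion to the return G they produced.
theta = np.zeros((n_states, n_actions))  # softmax policy logits

def policy_update(s, a, G):
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0  # d/d(theta) log softmax = one_hot(a) - probs
    theta[s] += alpha * G * grad_log_pi

# Actor-critic methods do both at once: the actor is a policy updated as in
# Option 2, but using the critic's value estimates in place of the raw return.
```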