Introduction

Reinforcement learning comes with a wide range of closely related algorithms, each slightly different from the last: value iteration, policy iteration, PPO, TRPO, REINFORCE, A2C, A3C… If these confuse you and muddle together into one blob of ideas, that is completely expected! Four decades of the proud tradition of categorization and naming have overloaded terms and letters, making it genuinely hard for newcomers to understand what's going on.

The goal of this article is not to explain each of these terms (we would be stuck here forever!), but to give you a mental model of the constraints that give rise to these categories of algorithms. By the end, my hope is that you can look at an algorithm, place it in a set of categories, and understand its trade-offs and which domains it works best in.

Categorizing RL Algorithms

Below is a flowchart of how to think about RL algorithms. A warning before you peruse it: it is a simplification. In truth, the categories can be fuzzy (we'll see an example later), and the distinctions don't follow a perfectly clean hierarchy. But for a first venture into the topic, this reduction will do:

```mermaid
flowchart TD
    Start([Reinforcement Learning Algorithm]) --> ModelQ{"Does the algorithm learn or use
    a model of the environment?
    (i.e., can it predict next states 
    and rewards?)"}
    
    ModelQ -->|Yes| ModelBased["Model-Based RL
    (Uses environment predictions)"]
    ModelBased --> PI["Policy Iteration"]
    ModelBased --> VI["Value Iteration"]
    
    ModelQ -->|No| ModelFree["Model-Free RL
    (Learns directly from experience)"]
    ModelFree --> PolicyQ{"How does the algorithm
    learn what actions to take?
    (Policy Learning Method)"}
    
    PolicyQ -->|"Learns state/action values
    (How good is each action?)"| ValueBased["Value-Based Methods"]
    ValueBased --> QL["Q-Learning"]
    ValueBased --> DQN["DQN"]
    ValueBased --> DDQN["Double DQN"]
    
    PolicyQ -->|"Learns policy directly"| PolicyBased["Policy-Based Methods"]
    PolicyBased --> OnOffQ{"Does it require actions to be
    from current policy only?"}
    
    OnOffQ -->|"Yes (On-Policy)"| OnPolicy["On-Policy Methods"]
    OnPolicy --> REINFORCE["REINFORCE"]
    OnPolicy --> TRPO["TRPO"]
    OnPolicy --> PPO["PPO"]
    
    OnOffQ -->|"No (Off-Policy)"| OffPolicy["Off-Policy Methods"]
    OffPolicy --> DDPG["DDPG"]
    OffPolicy --> TD3["TD3"]
    OffPolicy --> SAC["SAC"]
    
    PolicyQ -->|"Jointly optimizes action selection
    and value estimation"| ActorCritic["Actor-Critic Methods"]
    ActorCritic --> ACTypes{"How are updates performed?
    (Sequential vs Parallel)"}
    
    ACTypes -->|"Sequential"| Sync["Synchronous Methods"]
    Sync --> A2C["A2C"]
    
    ACTypes -->|"Parallel"| Async["Asynchronous Methods"]
    Async --> A3C["A3C"]
    Async --> AsyncDDPG["Async DDPG"]
    
    %% Styling
    style Start fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style ModelQ fill:#3182CE,stroke:#2C5282,color:#FFFFFF
    style PolicyQ fill:#3182CE,stroke:#2C5282,color:#FFFFFF
    style OnOffQ fill:#3182CE,stroke:#2C5282,color:#FFFFFF
    style ACTypes fill:#3182CE,stroke:#2C5282,color:#FFFFFF
    
    %% Category Nodes
    style ModelBased fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style ModelFree fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style ValueBased fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style PolicyBased fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style ActorCritic fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style OnPolicy fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style OffPolicy fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style Sync fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    style Async fill:#4A5568,stroke:#2D3748,color:#FFFFFF
    
    %% Algorithm Nodes
    style PI fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style VI fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style QL fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style DQN fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style DDQN fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style REINFORCE fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style TRPO fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style PPO fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style DDPG fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style TD3 fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style SAC fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style A2C fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style A3C fill:#2D3748,stroke:#1A202C,color:#FFFFFF
    style AsyncDDPG fill:#2D3748,stroke:#1A202C,color:#FFFFFF
```

Since the leaves of this flowchart cover quite a few algorithms (some familiar faces include PPO, Q-Learning, and value iteration), I would rather focus on the branching points, and on how the constraints of our systems have shaped the kinds of algorithms we have developed.

Model-Based vs Model-Free

<aside> 💡

Do not confuse the policy models we train with the “model” referred to here; the term is overloaded. When we talk about models in this section, we mean a model of the environment: something that predicts the next states and rewards that follow from an action.

</aside>

The core distinction between model-based and model-free RL is whether you have a model of the environment (learned or given), or whether you interact with the environment directly and learn only from the rewards those interactions yield.

Imagine you are training a robot to play basketball. You can take as many shots at the basket as you'd like, so you are not in a data-constrained regime, but simulating all the interactions between a basketball and the basket is not a trivial task. This is an excellent domain for model-free RL: training data is easy to collect directly.

Now imagine instead that you couldn't collect that many training data points directly. You need to be more sample efficient and get more bang for your buck every time your robot shoots. So you teach the robot about gravity, acceleration, and air resistance, and it builds an internal model of what will happen when it throws the ball. That model lets it evaluate shots without physically taking them, which hopefully leads to more good shots. This is model-based RL.

Note that many model-free methods still interact with simulated environments; in self-driving, for example, policies are routinely trained in simulation. This is still model-free learning, because the algorithm is not using an environment model to predict future states and rewards: the simulator simply stands in for the real world, and the agent still has to act in it to learn.

Here is the split: if the only way to learn the reward of an action is to actually take that action, the algorithm is model-free. If instead it can query an environment model to predict the outcome of an action before taking it, it is model-based.
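To make the split concrete, here is a minimal sketch of both sides in Python. The `env` and `dynamics_model` objects and their methods are hypothetical stand-ins, not from any particular library: the model-free update can only learn from a transition it actually experiences, while the model-based planner scores actions by asking the model what would happen.

```python
import random

# --- Model-free: the only way to learn about an action is to take it. ---
# One tabular Q-learning update from a single real interaction.
# Assumes Q is a dict mapping (state, action) -> float, e.g. defaultdict(float).
def model_free_update(Q, env, state, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Choose an action (epsilon-greedy) and actually execute it in the environment.
    if random.random() < epsilon:
        action = env.sample_action()
    else:
        action = max(env.actions, key=lambda a: Q[(state, a)])
    next_state, reward, done = env.step(action)  # real experience, real reward

    # Update the action-value estimate from the observed reward.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state

# --- Model-based: ask a model of the environment, without acting. ---
# Score each candidate action by rolling out an imagined trajectory.
def model_based_action(dynamics_model, state, actions, horizon=5, gamma=0.99):
    def imagined_return(action):
        total, s, a = 0.0, state, action
        for t in range(horizon):
            s, r = dynamics_model.predict(s, a)  # predicted next state and reward
            total += (gamma ** t) * r
            a = dynamics_model.greedy_action(s)  # continue the imagined rollout
        return total
    return max(actions, key=imagined_return)
```

The key difference is where the reward comes from: `env.step` returns a reward that was actually experienced, while `dynamics_model.predict` returns a reward that was only predicted, so the model-based agent can evaluate many candidate actions for every real shot it takes.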

Learning Values vs Policies

In model-free training, you can really learn two things:

  1. What is the value of the current state I am in? That is, how close is this state to my goal?
  2. What is the value of the action I am considering? Will it get me closer to my goal?

And you can also learn both simultaneously.
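In code, these two quantities usually show up as a state-value function V(s) and an action-value function Q(s, a). Below is a minimal tabular sketch of one temporal-difference update for each, assuming discrete states and actions and a single observed transition (s, a, r, s_next); the names are illustrative, not taken from a specific library.

```python
from collections import defaultdict

V = defaultdict(float)  # V[s]      -> how good is it to be in state s?
Q = defaultdict(float)  # Q[(s, a)] -> how good is it to take action a in state s?

def v_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # (1) Value of the current state: nudge V(s) toward the observed reward
    #     plus the discounted value of the state we ended up in.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # (2) Value of a candidate action: nudge Q(s, a) toward the observed reward
    #     plus the best value we currently believe is reachable from s_next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```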