✍️ @Arushi Somani July 7, 2024 5:35 PM (PDT)

🤖 reinforcement learning setup

let’s start by discussing what a standard reinforcement learning setup looks like.

First, there is an environment. The environment is the space in which our agent can act. Then, there is state. The state is the observable subset of the environment that the agent is given as input.

Each agent is defined by a policy, often denoted $\pi$, which defines what action it takes in each state. The goal of the policy is to maximize the reward accumulated from the environment, and thus to learn to take “good actions”.

Each produced action is judged by some sort of reward model to produce a reward. The reward model can be a neural network, but it can also be a combination of heuristics (e.g., 1 if the agent reached the banana in the grid, else 0).


To summarize, the agent learns better policies by interacting with an environment and accumulating rewards. We’ll keep using this general idea in the rest of this post.
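To make this loop concrete, here is a minimal Python sketch of one episode of agent–environment interaction. The `GridWorld` environment, its banana-at-the-end reward heuristic, and the `random_policy` are all hypothetical placeholders for illustration, not any particular library’s API.

```python
import random

class GridWorld:
    """Hypothetical 1-D grid: the agent starts at cell 0, the banana sits in the last cell."""
    def __init__(self, size=5):
        self.size = size
        self.position = 0  # state: the agent's current cell

    def step(self, action):
        # action is -1 (left) or +1 (right); clamp to the grid boundaries
        self.position = max(0, min(self.size - 1, self.position + action))
        reward = 1 if self.position == self.size - 1 else 0  # heuristic reward
        done = reward == 1
        return self.position, reward, done

def random_policy(state):
    """A stand-in policy pi: picks an action uniformly at random."""
    return random.choice([-1, +1])

# one episode of the agent-environment loop
env = GridWorld()
state, total_reward, done = env.position, 0, False
while not done:
    action = random_policy(state)           # pi(state) -> action
    state, reward, done = env.step(action)  # environment transitions and emits a reward
    total_reward += reward
print("accumulated reward:", total_reward)
```

A learning algorithm would replace `random_policy` with something that improves from the accumulated rewards; the loop itself stays the same.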

🌲 introduction to game trees

The core idea behind a game tree is that every node is a game state and every edge is an action that transitions that state into an adjacent state.


the game tree and game state form the basis of all the game-playing algorithms we’ll discuss below.

we’ll use tic-tac-toe as our motivating example. let’s say we’re trying to make an agent that can beat a human at tic-tac-toe. how will we go about this?
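To make the game-tree picture concrete, here is a minimal Python sketch of a tic-tac-toe state: each node is a board, and each legal move produces a child node. The class and method names here are illustrative choices, not from any library.

```python
class TicTacToe:
    """A game-tree node: the board is the state, legal moves are the outgoing edges."""
    def __init__(self, board=None, player="X"):
        self.board = board or [" "] * 9  # cells 0..8, row-major
        self.player = player             # whose turn it is at this node

    def legal_moves(self):
        return [i for i, cell in enumerate(self.board) if cell == " "]

    def play(self, move):
        """An edge in the game tree: returns the adjacent state after `move`."""
        child = self.board.copy()
        child[move] = self.player
        next_player = "O" if self.player == "X" else "X"
        return TicTacToe(child, next_player)

    def winner(self):
        lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
        for a, b, c in lines:
            if self.board[a] != " " and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

# the children of the empty board are the 9 states reachable in one move
root = TicTacToe()
children = [root.play(m) for m in root.legal_moves()]
print(len(children))  # 9
```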

🔄 enumerative search: minimax

the core idea behind minimax is to play out all possible scenarios before making a move: you choose the branch that minimizes your opponent’s best possible score while maximizing your own, and make your move based on that evaluation. we won’t go further into the minimax algorithm here, but [1] is a good resource to learn more.

With enough resources, one can play out every possible branch and play completely optimally.

but even for a game as simple as tic-tac-toe, this can get extremely complicated!
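Here is a minimal, self-contained sketch of minimax for tic-tac-toe in Python. The board is just a list of 9 cells, and the function names (`minimax`, `best_move`) are our own illustrative choices rather than a standard API; the point is only to show the “play out every branch” idea.

```python
WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Best achievable score for X (+1 win, 0 draw, -1 loss),
    assuming both sides play optimally from this position."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0  # board full with no winner: draw
    scores = []
    for m in moves:
        child = board.copy()
        child[m] = player
        scores.append(minimax(child, "O" if player == "X" else "X"))
    # X maximizes the score, O minimizes it
    return max(scores) if player == "X" else min(scores)

def best_move(board, player):
    """Choose the move whose subtree has the best minimax value for `player`."""
    moves = [i for i, cell in enumerate(board) if cell == " "]
    def value(m):
        child = board.copy()
        child[m] = player
        return minimax(child, "O" if player == "X" else "X")
    return max(moves, key=value) if player == "X" else min(moves, key=value)

# playing out the whole tree from the empty board: optimal tic-tac-toe is a draw
print(minimax([" "] * 9, "X"))  # 0
```

Even for tic-tac-toe, this fully enumerates hundreds of thousands of positions, which hints at the scaling problem discussed next.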

⚖️ why enumeration doesn’t scale

readers might already have intuition for why enumerative search doesn’t scale; let’s formalize it. Imagine it takes an average of $d$ moves to complete a game, and that at each step there are an average of $b$ possible moves. This means the size of the search space is $O(b^d)$!
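To put rough numbers on this: tic-tac-toe never has more than $b = 9$ choices or $d = 9$ plies, so the full tree holds at most $9! = 362{,}880$ complete games, already a lot for such a simple game but still within a computer’s reach. For chess, commonly cited estimates are $b \approx 35$ and $d \approx 80$, giving on the order of $35^{80} \approx 10^{123}$ lines of play, far beyond anything enumerative search can handle.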