📝 @Arushi Somani May 26, 2024 2:04 PM (PDT)

I’ve been extremely surprised by the inability of language models to play tic-tac-toe. GPT-4o and Claude Opus clearly have an understanding of how tic-tac-toe works, they understand the end goal, and they are able to analyze intermediate states properly.

Then why have I won (not even drawn, won) every game I’ve played against one of these models? [1][2][3]

Motivation

There are a couple of reasons this might be:

  1. Misunderstanding Goal or Player Intent: The model is often unable to win, even when the user plays suboptimally and leaves a winning move open. [4] This suggests that the model cannot keep the goal state in mind. Similarly, the model often fails to recognize when the user is one move away from winning and to block it.
  2. Hallucinating or Misunderstanding State: As an extension of the previous point, the model often fails to recognize when a win condition has been reached and the game is over. Similarly, the model is sometimes unable to map a given number (0-9) to the correct position on the board [4].
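To make the two behaviors in item 1 concrete (taking an open win, and blocking the opponent's), here is a minimal sketch of the check a player would need to perform. The board encoding (a 9-character string indexed 0 to 8, row by row, with a space for an empty cell) and the function names are assumptions made for illustration, not something taken from the games above.

```python
# All eight three-in-a-row lines on a board indexed 0..8, row by row.
# The 0-based indexing is an assumption for this sketch.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def completing_moves(board, player):
    """Return the empty cells that would give `player` three in a row."""
    moves = set()
    for line in WIN_LINES:
        marks = [board[i] for i in line]
        # Two of the player's marks plus one empty cell means one move wins.
        if marks.count(player) == 2 and marks.count(" ") == 1:
            moves.add(line[marks.index(" ")])
    return sorted(moves)


def choose_move(board, me, opponent):
    """Take a win if one exists, otherwise block; return None if neither applies."""
    wins = completing_moves(board, me)
    if wins:
        return wins[0]
    blocks = completing_moves(board, opponent)
    if blocks:
        return blocks[0]
    return None
```

For example, on the board `"XX OO    "` with O to move, `choose_move(board, "O", "X")` prefers completing O's own row over blocking X.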

Board Validation Eval to Measure Hallucinations

To play tic-tac-toe, a language model must be able to reason about the state of the board at any point in the game. To test this, I created a small dataset of 100 randomized tic-tac-toe boards. There are four possible states for a board: Winner=O, Winner=X, Draw, and N/A (game is incomplete). The model is asked to label each board with one of these states.
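The ground-truth label for a board can be computed mechanically. The following is a minimal sketch of such a labeler, not the code used to build the dataset. It assumes a board is a 9-character string read row by row, with a space for an empty cell, and it assumes at most one player has a completed line (for fully random boards where both do, it simply reports the first line it finds).

```python
# All eight winning lines for cells indexed 0..8, row by row (assumed encoding).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def label_board(board):
    """Return one of 'Winner=X', 'Winner=O', 'Draw', or 'N/A' for a board."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return f"Winner={board[a]}"
    # No completed line: the game is unfinished if any cell is still empty.
    if " " in board:
        return "N/A"
    return "Draw"


print(label_board("XXXOO    "))  # top row is all X
print(label_board("XOXXOOOXX"))  # full board, no completed line
```

Checking for a completed line before checking for empty cells matters: a game can be won on a board that still has open cells.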

Note that a flaw of this series of experiments is that we didn’t have any draw states in the eval set. TODO create one with a better distribution!

OpenAI GPT-4o

On this task, GPT-4o gets an accuracy of 92% total, distributed as follows:

[Figure: Untitled]

Here is the distribution of the answers versus the labels:

[Figure: Untitled]
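The answers-versus-labels distribution is effectively a confusion matrix. As a rough sketch of how the overall accuracy and that distribution could be tallied, assuming the labels and the model's answers are available as two parallel lists of strings (the function name `score` and the sample data are hypothetical, not the actual eval results):

```python
from collections import Counter


def score(labels, answers):
    """Return (accuracy, confusion), where confusion counts (label, answer) pairs."""
    if not labels or len(labels) != len(answers):
        raise ValueError("labels and answers must be non-empty and the same length")
    confusion = Counter(zip(labels, answers))
    # Correct predictions are the pairs where the answer matches the label.
    correct = sum(n for (label, answer), n in confusion.items() if label == answer)
    return correct / len(labels), confusion


# Made-up data purely to show the shape of the output.
labels = ["Winner=X", "Winner=O", "N/A", "N/A"]
answers = ["Winner=X", "N/A", "N/A", "Winner=X"]
accuracy, confusion = score(labels, answers)
for (label, answer), n in sorted(confusion.items()):
    print(f"label={label:9s} answer={answer:9s} count={n}")
```

Because `Counter` returns 0 for missing keys, a pair such as `("N/A", "Draw")` can be looked up directly to confirm that the model never gave that answer for that label.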

It’s encouraging to note that the model never calls a board a draw, which matters because no board in the eval set is a draw.

Anthropic Opus

Opus does quite a bit worse on this task, scoring 53%. The distribution looks as follows:

[Figure: Untitled]