📝 @Arushi Somani May 26, 2024 2:04 PM (PDT)
I’ve been extremely surprised by the inability of language models to play tic-tac-toe. GPT-4o and Claude Opus clearly understand how tic-tac-toe works: they understand the end goal, and they can analyze intermediate states properly.
Then why have I won (not even drawn, won) every game I’ve played against one of these models? [1][2][3]
There are a couple of reasons this might be the case:
To play tic-tac-toe, a language model needs to be able to reason about the state of the board at any point in the game. To test this, I created a small dataset of 100 randomized tic-tac-toe boards. Each board falls into one of four possible states: Winner=O, Winner=X, Draw, and N/A (game is incomplete). The model is asked to label each board with one of these states.
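For reference, here’s a minimal sketch of how randomized boards like these could be generated and labeled. This is my own reconstruction in Python, not the actual script behind the 100-example set:

```python
# Illustrative sketch: generate random-but-valid tic-tac-toe boards and label
# each one as Winner=X, Winner=O, Draw, or N/A. Not the actual dataset script.
import random

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def label_board(board):
    """Return one of the four states: 'Winner=X', 'Winner=O', 'Draw', or 'N/A'."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return f"Winner={board[a]}"
    if " " not in board:
        return "Draw"
    return "N/A"  # game is still incomplete

def random_board():
    """Play a random sequence of alternating X/O moves on empty squares,
    stopping early if someone wins, so each board is a reachable position."""
    board = [" "] * 9
    for i, square in enumerate(random.sample(range(9), random.randint(3, 9))):
        board[square] = "X" if i % 2 == 0 else "O"
        if label_board(board).startswith("Winner"):
            break
    return board

dataset = [(board, label_board(board)) for board in (random_board() for _ in range(100))]
```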
Note that a flaw of this series of experiments is that the eval set contains no draw states. TODO: create one with a better distribution!
On this task, GPT-4o gets an overall accuracy of 92%, distributed as follows:
And here is the distribution of the model’s answers versus the gold labels:
It’s quite exciting to note that the model never calls the board a draw, which is relevant since none of the boards in the eval set are actually draws.
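For anyone who wants to reproduce this kind of breakdown, here’s a rough sketch of how I’d tabulate model answers against gold labels, assuming two parallel lists of state strings (the names and table format are mine, not the actual eval code):

```python
# Rough sketch: compute overall accuracy and an answers-vs-labels table.
# Assumes `golds` and `preds` are parallel lists of the four state strings.
from collections import Counter

STATES = ["Winner=X", "Winner=O", "Draw", "N/A"]

def accuracy(golds, preds):
    """Fraction of boards where the model's answer matches the gold label."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def confusion(golds, preds):
    """Plain-text table with gold labels as rows and model answers as columns."""
    counts = Counter(zip(golds, preds))
    lines = ["gold \\ pred".ljust(14) + "".join(s.rjust(10) for s in STATES)]
    for g in STATES:
        lines.append(g.ljust(14) + "".join(str(counts[(g, p)]).rjust(10) for p in STATES))
    return "\n".join(lines)

# Usage: print(accuracy(golds, preds)); print(confusion(golds, preds))
```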
Opus does quite a bit worse on this task, scoring 53% instead. The distribution looks as follows: