📝 @Arushi Somani May 20, 2024 10:43 PM (PDT)
<aside> 💡 This is a work in progress! You will find a lot of TODOs and open questions within the document. Dataset Files: ‣
</aside>
A commonly observed weakness of large language models is their inability to perform basic arithmetic operations. In our experiments, frontier models like GPT-4o solved only ~60% of the problems correctly without calculator use, despite what seems like fine-tuning data that compels the model to perform long multiplication and teaches it to think step-by-step in this domain. At the same time, it is demonstrably easy to overfit even small transformers to perform arithmetic operations up to 20 digits.
Arithmetic itself is not a critical ability, since the option to call out to a calculator bridges most of the gap for simple math use cases. However, the inability to perform arithmetic highlights a weakness in long-distance reasoning in language models, where each step strongly depends on previous steps.
In this report, we create a systematic evaluation for arithmetic to measure frontier models on this domain.
We want large language models to be able to perform multi-step reasoning, and we treat the ability to perform arithmetic as a simple expression of this capability. Using arithmetic to measure multi-step reasoning has three benefits. First, the answer is easily verifiable, which means we can objectively measure models against each other. Second, the task is conceptually simple, which controls for models not knowing how to perform the task at all and lets us measure true errors: the bottleneck in performance is errors such as not following directions, hallucinations, or forgetfulness. Finally, reasoning steps in arithmetic are strongly conditioned on each other in a chain-of-thought regime, making it a good test of multi-step reasoning.
There has been prior work on measuring the mathematical capabilities of models [1][2][3], but these works focus more on solving word problems, complex equations and calculus, or writing graduate-level proofs. There has also been work on improving models' ability to perform arithmetic [4][5][6][7], but no work has yet comprehensively compared a set of models against each other on these tasks.
Our contributions are the following:
The dataset is created by generating ten random samples for every combination of $(length_1, length_2)$, where $length_1$ and $length_2$ are the digit lengths of the two numbers. These numbers are shared between all three of our evaluations: addition, subtraction, and multiplication. This means that, for the same pair of numbers, the difficulty of the problem can differ considerably between evaluations. For example, the problem $109 - 9$ is much easier than $109 \times 9$ even though they share the same digit composition. We expect these variations to even out across the size of the dataset.
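The sampling procedure described above could be sketched roughly as follows. This is a minimal illustration and not the actual generation script: the function names, the `max_len` bound, the fixed seed, and the choice to disallow leading zeros on multi-digit numbers are all assumptions made for the example.

```python
import random
from itertools import product

# Hypothetical operator table; the three tasks share the same operand pairs.
OPS = {
    "addition": lambda a, b: a + b,
    "subtraction": lambda a, b: a - b,
    "multiplication": lambda a, b: a * b,
}

def random_number(n_digits, rng):
    """Return a random integer with exactly n_digits digits.

    Assumption: multi-digit numbers have no leading zero.
    """
    if n_digits == 1:
        return rng.randint(0, 9)
    return rng.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)

def build_operand_pairs(max_len=5, samples_per_pair=10, seed=0):
    """Sample `samples_per_pair` operand pairs for every (length_1, length_2)."""
    rng = random.Random(seed)
    pairs = []
    for len1, len2 in product(range(1, max_len + 1), repeat=2):
        for _ in range(samples_per_pair):
            pairs.append((random_number(len1, rng), random_number(len2, rng)))
    return pairs

def build_datasets(pairs):
    """Reuse the same operand pairs for each task, with ground-truth answers."""
    return {
        name: [{"a": a, "b": b, "answer": op(a, b)} for a, b in pairs]
        for name, op in OPS.items()
    }

pairs = build_operand_pairs(max_len=3)
datasets = build_datasets(pairs)
```

With `max_len=3` there are nine length combinations, so each of the three tasks gets 90 problems over identical operands; only the operator and the expected answer differ.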
TODO: Should each of the tasks share a dataset? Are 10 samples for each pair enough?
On addition, with 10 instances for each length combination, Llama 8B has a mean score of 36.8. The breakdown looks something like this: