📝 @Arushi Somani June 20, 2024 12:39 PM (PDT)

<aside> 💡 This is a work in progress; a reader will find TODOs and incomplete sections. WIP Repo: https://github.com/somaniarushi/evaluations

</aside>

Abstract

Evaluations for large language models have become increasingly important over the last few years as frontier models rapidly increase in ability [1][2][3]. These canonical evaluations are often run “few-shot”: the model is given several worked examples of the task in context before being asked the actual question. Not only do different evaluations use different shot counts, but different organizations also report different shot counts for the same evaluation when comparing models [4]. This research report attempts to answer the following question: how does the number of in-context shots affect a model's measured performance on these canonical evaluations (RQ1), and do the specific shots chosen matter (RQ2)?

Our findings for RQ1 suggest that large frontier models should not continue to be reported at varied shot counts such as 5-shot, 8-shot, or 11-shot. Instead, scores should be reported zero-shot, or 2-shot at most when extra context is needed to properly elicit a response. Smaller models, however, still benefit from multiple shots and should continue to be evaluated few-shot.

These findings extend beyond academic evaluations to production settings, and motivate further research into optimizing retrieval and few-shot augmented generation in large language models.

Introduction

With the rise of many large language models, the need to evaluate their abilities properly has become critically important. A set of knowledge-based evaluations has gained popularity in recent times. These include knowledge-recall evaluations like MMLU [5], ARC [6], and Hellaswag [7]; mathematical and scientific reasoning evaluations like GSM8K [8], MATH [9], and GPQA [10]; code-writing evaluations like HumanEval [11]; and reading-comprehension evaluations like DROP [12]. There also exist many multimodal evaluations like MMMU [16] and ChartQA [17], but we exclude these from the scope of this report.

These evaluations are scored in different ways: by matching the final answer extracted from the generation (as in MMLU or MATH), by token-overlap F1 score (as in DROP), or by executing the generated code against a series of test cases (as in coding evaluations like HumanEval).
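
As a rough illustration of these scoring styles, the sketch below implements exact-match accuracy and a simplified bag-of-words F1 in plain Python. The function names and the whitespace tokenization are our own simplifications, not the official MMLU or DROP scorers.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """Exact-match scoring (MMLU/MATH-style): 1.0 if the normalized
    prediction equals the gold answer, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def bag_of_words_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 (DROP-style, simplified): harmonic mean of
    precision and recall over whitespace-separated tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("(B)", "(b)"))                         # 1.0
print(bag_of_words_f1("forty two people", "42 people"))  # 0.4, partial credit
```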

Over time, we've also seen LLM-as-a-judge evaluations like MT-Bench [13] and AlpacaEval [14], which use a large model such as GPT-4 as the critic for a generated answer. This allows for more flexibility when the evaluated model does not follow instructions exactly, and is well-suited to problem domains with many correct answers or many possible rephrasings of a correct answer. We also exclude these from the scope of this report.

Many of these evaluations are conducted in a "few-shot" manner [15]. This involves placing several example question-answer pairs in context before the actual question, with the goal of biasing the model's output distribution towards the evaluation's question-answer space. The shots show the model what format is expected (which is especially helpful for pre-trained models) and what tone the answers should take, such as thorough or terse.
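
To make the mechanics concrete, here is a minimal sketch of how a k-shot prompt might be assembled. The Q:/A: template and the dictionary fields are assumptions for illustration, not the exact format of any particular evaluation harness.

```python
def build_few_shot_prompt(shots: list[dict], question: str) -> str:
    """Assemble a k-shot prompt: each shot is a dict with "question" and
    "answer" keys, followed by the target question with an empty answer slot."""
    blocks = [f"Q: {shot['question']}\nA: {shot['answer']}" for shot in shots]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# A 2-shot prompt biases the model towards short, final-answer-only
# completions before it sees the actual evaluation question.
shots = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 7 * 6?", "answer": "42"},
]
print(build_few_shot_prompt(shots, "What is 12 - 5?"))
```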

However, there is no consistency in the number of shots used in model evaluation. MMLU is evaluated at 5-shot [2], MGSM at 8-shot [2], HumanEval at 0-shot [1], DROP at 3-shot [2], GSM8K at 11-shot [3], and MATH at 4-shot [3]. Moreover, different frontier model reports [1][2][3] use different numbers of shots for the same evaluation. For instance, the Gemini Pro and Claude 3 families of models report MMLU at 5-shot, but GPT-4 reports MMLU at 0-shot.

This report evaluates the benefit of using specific shot counts on specific evaluations, as well as whether shots are needed at all.

We contribute the following:

  1. Evaluate a popular set of canonical evaluations across different shot counts, including the canonically reported few-shot number, and report our findings (a minimal sketch of this sweep follows this list).
  2. Assess whether some shots are better than others, such that two runs of an evaluation with the same number of shots can differ from each other.
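
Below is a minimal sketch of the shot-count sweep referenced in contribution 1, reusing the build_few_shot_prompt and exact_match helpers sketched earlier. The generate argument is a hypothetical placeholder for an arbitrary model call, and the dataset structure is assumed for illustration.

```python
from typing import Callable

def sweep_shot_counts(
    dataset: list[dict],             # each item: {"question": ..., "answer": ...}
    shot_pool: list[dict],           # held-out examples used as in-context shots
    generate: Callable[[str], str],  # hypothetical model call: prompt -> completion
    shot_counts: tuple[int, ...] = (0, 1, 2, 5, 8),
) -> dict[int, float]:
    """Run the same evaluation at several shot counts and report exact-match
    accuracy per shot count, using the helpers from the sketches above."""
    scores: dict[int, float] = {}
    for k in shot_counts:
        correct = 0.0
        for item in dataset:
            prompt = build_few_shot_prompt(shot_pool[:k], item["question"])
            correct += exact_match(generate(prompt), item["answer"])
        scores[k] = correct / len(dataset)
    return scores
```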