TL;DR: Large Language Models (LLMs) enable humans to easily transfer knowledge into them through language instructions. Due to the nature of human language, it is challenging to automatically determine the amount of human knowledge encoded in an instruction. This introduces complexities and misconceptions when assessing LLM capabilities, leading to overestimation of their potential and ultimately to catastrophic failures when these models are deployed in high-stakes applications. A new evaluation paradigm is needed to ensure the robustness of these systems.

Types of evaluation

First, I will formalize the typical evaluation setups in machine learning.

Let \(M\) be a model before training. I call this model an “infant” model. During training, this model consumes human knowledge \(K\) whose size is \(\vert K \vert\). In supervised learning, \(K\) is a labeled dataset. Denote by \(M \oplus K\) the model after training, which I call the “mature” model. Let \(E(M \oplus K)\) be the performance of \(M \oplus K\) on the test set.

Type IA (efficient-learner evaluation): we feed \(M_1\) and \(M_2\) the same knowledge \(K\) and compare \(E(M_1 \oplus K)\) with \(E(M_2 \oplus K)\). The model \(M_i\) that yields the better performance is called the more efficient learner, since it better utilizes the given amount of human knowledge.

Type IB (efficient-learner evaluation): the goal is similar to type IA, but we feed the models different amounts of knowledge \(K_1\) and \(K_2\) so that they reach the same level of performance \(E(M_1 \oplus K_1) = E(M_2 \oplus K_2)\). If \(\vert K_i \vert = \min (\vert K_1 \vert, \vert K_2 \vert)\), we claim that \(M_i\) is the more efficient learner, since it uses less human knowledge to reach the same performance.

Type II (capable-model evaluation): the goal is different from type I evaluation. We do not care what human knowledge each model consumes. We simply compare \(E(M_1 \oplus K_1)\) with \(E(M_2 \oplus K_2)\) and whichever model performs better is the more capable (mature) model.
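To make these setups concrete, here is a minimal Python sketch. The `train` and `evaluate` functions and the efficiency numbers are hypothetical placeholders standing in for a real training run and a real test-set metric; only the comparison logic matters.

```python
# Toy illustration of the evaluation types. Nothing here touches a real model:
# `train` and `evaluate` are stand-ins for training and test-set scoring.

def train(infant_model, knowledge):
    """Return the 'mature' model M ⊕ K (placeholder: just pair them up)."""
    return (infant_model, knowledge)

def evaluate(mature_model):
    """Return the test-set performance E(M ⊕ K) (placeholder score)."""
    infant_model, knowledge = mature_model
    return infant_model["learning_efficiency"] * len(knowledge)

M1 = {"name": "M1", "learning_efficiency": 0.8}   # more efficient learner
M2 = {"name": "M2", "learning_efficiency": 0.5}   # less efficient learner
K = list(range(1000))                             # a labeled dataset, |K| = 1000
K1, K2 = K[:500], K                               # different amounts of knowledge

# Type IA: same knowledge for everyone; the better score marks the more
# efficient learner.
ia_winner = max([M1, M2], key=lambda M: evaluate(train(M, K)))

# Type IB: fix a target performance and ask how much knowledge each model needs
# to reach it; the model needing the smaller |K| is the more efficient learner
# (omitted here, since it requires searching over dataset sizes).

# Type II: knowledge is unconstrained; the better score marks the more capable
# mature model.
ii_winner = max([(M1, K1), (M2, K2)], key=lambda pair: evaluate(train(*pair)))[0]

print(ia_winner["name"], ii_winner["name"])  # -> M1 M2
```

With these toy numbers, the type IA winner (\(M_1\)) and the type II winner (\(M_2\)) already differ: the more efficient learner loses the capability comparison simply because it was given less knowledge, which is exactly the divergence discussed below.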

Can we evaluate LLMs the same way?

I regard an “infant” LLM \(M\) as a model that has been pre-trained and then fine-tuned with SFT and RLHF. This model can be prompted with a prompt \(P\) to become a mature LLM \(M \oplus P\). Here, \(P\) represents the human knowledge \(K\) infused into the model.

In this setting, we can conduct type IA and type II evaluations, but not type IB. Why? Because type IB requires the ability to quantify the amount of human knowledge consumed by each model (\(\vert K \vert\)). In supervised learning, where \(K\) is a dataset, we can do so by counting the examples in the dataset. The same approach is not applicable to a prompt if part of it consists of language instructions. Due to the symbolic and pragmatic nature of human language, humans can transfer an arbitrary amount of information within a single language utterance. Moreover, this amount varies depending on the common ground between the interlocutors and their reasoning capabilities. This basically means we cannot define \(\vert P \vert\) for a prompt.

Another thing to note is that type IA and type II evaluations can draw different conclusions. In the general case, the most capable model is not necessarily the most efficient learner. A more efficient learner, if deprived of knowledge (e.g., trained with less labeled data), can be outperformed by a less efficient learner infused with more knowledge. Suppose \(M_1\) is more capable than \(M_2\) but is a less efficient learner. Which one is the more “intelligent” model? Now imagine \(M_1\) is GPT-4 and \(M_2\) is a human. I think one can find arguments to support each of them. This means that when a model’s performance metric increases in a type II evaluation, it is still debatable to claim that the model has become more “intelligent”.

What is “novel” to an LLM?

In machine learning evaluation, the test set must be “novel” to evaluated models in order to evaluate their generalizability. In supervised learning, the relationship between the training and test examples can be mathematically characterized. For example, the test examples can be drawn from the same distribution as the training examples (in-distribution evaluation), or from a different distribution (out-of-distribution evaluation).

But how do we determine how novel a set of examples is with respect to the knowledge embedded in a prompt? Suppose I concatenate all examples in the test set into a string, reverse it, and input the result as the prompt. I then append to that prompt the following instruction: “Reverse the above input and use it to answer the following questions.” The test examples are clearly “out-of-distribution” with respect to the final prompt, but if the LLM understands the last instruction, the prompt reveals all information about the test examples, so they are not novel at all.
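To see how little effort such a prompt takes to construct, here is a minimal sketch; the test examples are made up for illustration.

```python
# Hypothetical illustration: a prompt that is far from the test examples in
# surface form (e.g., by edit distance) yet leaks them completely.
test_examples = [
    "Q: What is 2 + 2? A: 4",
    "Q: What is the capital of France? A: Paris",
]

concatenated = "\n".join(test_examples)
reversed_blob = concatenated[::-1]  # reverse the entire string

prompt = (
    reversed_blob
    + "\nReverse the above input and use it to answer the following questions."
)

# By any surface-level measure the prompt looks nothing like the test set,
# yet a model that follows the final instruction recovers every example.
print(prompt)
```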

If we cannot characterize how novel the test examples are to a model, we cannot determine how strong its generalizability is.

What do we seek?

When someone says “AI system”, they could mean two things: the “infant” form of a system before training, or the “mature” form after training. When comparing the infant forms of AI systems, one aims to identify the most efficient learner—the system that best utilizes a given amount of human knowledge. In this type of evaluation, one must ensure that all compared systems consume the same amount of human knowledge (e.g., by training them with the same dataset). Meanwhile, when comparing the mature forms, one seeks the most capable system—the system that yields the highest performance metric on the test set, regardless of the amount of human knowledge it has consumed.

In the classical supervised-learning (SL) framework, the best infant and mature systems are often the same because all systems are trained with the same labeled dataset. The system that performs best on the test set is both the most capable (mature) system and the most efficient (infant) learner. However, when systems consume varying amounts of human knowledge, the best infant system and the best mature system can diverge. A better infant system, if deprived of knowledge (e.g., trained with less labeled data), can be outperformed by a worse infant system infused with more knowledge.

In any type of evaluation, the test set must be “novel” to the evaluated systems. In the SL framework, this requirement is satisfied by either collecting examples that do not appear in the training set or sampling examples from a distribution that differs from the training data distribution. A system that scores high on the test set demonstrates generalizability—the ability to learn abstract representations and decision-making rules that correctly describe the latent data-generating mechanism.

Infant-system evaluation is impossible with LLMs

LLMs allow for the infusion of human knowledge through language instructions, i.e., prompting. Due to the symbolic and pragmatic nature of human language, humans can transfer an arbitrary amount of information within a single language utterance. Moreover, this amount varies depending on the common ground between the interlocutors and their reasoning capabilities.

Because of this characteristic of human language, it is challenging to quantify the amount of human knowledge conveyed through prompting. The length of the prompt is a poor indicator of how much knowledge is embedded in it. One can dramatically increase the amount of information without substantially changing the prompt length by using more abstract words, making references to knowledge already present in the LLM, or adding links to external knowledge sources.

A common yet flawed approach to quantifying the amount of human knowledge in a prompt is to use the number of few-shot examples as a proxy for this amount. This method plainly ignores knowledge conveyed through linguistic expressions. An example illustrating a reasoning pattern might encode more knowledge than ten examples showing only inputs and outputs.

Since we cannot measure the amount of human knowledge conveyed by a prompt, we cannot control for it. Therefore, infant-system evaluation is generally impossible, unless the compared systems share the same base LLM and are prompted with only few-shot examples that conform to the same format.

An obsession with performance?

Most existing evaluations of LLMs focus on mature-system evaluation, striving to identify the model that excels on a test set. The amount of data consumed by this model is often overlooked. This preference for mature-system evaluation likely stems from the impracticality of conducting infant-system evaluation. However, I believe the deeper reason lies in the evolving nature of the field of AI. Since the advent of LLMs, AI has transitioned into a more pragmatic discipline. There seems to be a prevailing belief that these models are on the brink of becoming practical tools that significantly enhance human productivity and generate substantial economic value. Consequently, the primary goal is to elevate their performance to a satisfactory level, irrespective of the cost. This cost is seen as a modest investment compared to the immense future benefits these models promise.

What is considered “novel”?

In the long run, solely striving for the best performance while neglecting efficiency would result in the development of costly, eco-unfriendly systems. However, there is an even more imminent issue: the current way of conducting mature-system evaluation has a serious drawback. Specifically, the methods of creating “novel” examples in supervised learning do not work for LLM evaluation. One can design prompts that resemble the test examples only remotely (e.g., in terms of string edit distance) yet reveal substantial information about them. As an overly exaggerated example, one can input the characters of the test examples in reverse order (“Example 1: x y” -> “y x :1 elpmaxE”) and tell the model to “reverse the input string before giving your answer” at the end of the prompt.

In general, it is unclear what is considered “novel” with respect to a prompt, as we do not have a way to precisely characterize what is “seen” by an LLM in a prompt. Therefore, this evaluation method does not guarantee the identification of a system with strong generalizability.

The risk of overfitting

An even more dangerous risk is overfitting due to human developers distilling information about the test distribution through prompting. Fundamentally, a prompted LLM represents an LLM-human team. A robust test set must be novel to all members of the team, meaning that it must also be unseen by the human prompter. However, most of the time, the test set is public, and the human knows the task definition and has an idea about the test distribution. Intentionally or not, that knowledge will influence their final prompt choice. To make matters worse, the powerful language-understanding capabilities of LLMs make it all too convenient for humans to program complex strategies into them.

If the test set actually captures real-world conditions, fitting it is fine. But this is often not the case. Hence, the inevitable outcomes would be overfitting, the promotion of brittle systems, and catastrophic failures in the real world. There are already looming signs: AI companies continuously break records on academic benchmarks, yet their systems are tricked by users within hours after being released.

Solving this problem is far from simple. To tackle it, we need to be able to either restrict the knowledge contained in a prompt or make tests that are always “novel” to human developers. Both are difficult. First, we lack methods to automatically decide which prompts are appropriate and which are not. Even human experts would struggle to reach a consensus. While prompts like “Let’s think step-by-step” seem reasonable, what about designing an LLM agent consisting of five modules, each with its own prompt specifying its role and communication protocol? Is this too much human knowledge? I believe so, but others may not. On the other hand, maintaining the novelty of a test means keeping it permanently confidential from humans or updating it frequently. The former has transparency issues, whereas the latter incurs substantial cost. Either approach requires a credible agency entrusted to develop, maintain, and update the test. How do we establish such an agency? This problem extends beyond the realm of pure research.

Amplifiers

Several practices are exacerbating the aforementioned overfitting risk.

Companies managing the most advanced LLMs continue to keep their methodologies opaque. While OpenAI once offered a glimpse into their prompting strategy, this gesture, though commendable, remains a vague overview of their actual techniques. Peer pressure compels AI companies to hastily launch new products. In their rush to outshine competitors, they prioritize showcasing new capabilities, flashy demonstrations, and improvements on well-known but potentially already overfitted academic benchmarks. This strategy gradually misleads users into believing that the displayed metrics and features signify a robust system. However, the most relevant metric for users—performance in a “clinical trial” with a diverse group of real humans—is seldom highlighted.

In academia, I frequently encounter reviewers who insist on using GPT-4 as a baseline, often without considering the specific evaluation goals or the prompts used to generate results. If GPT-4 outperforms your system, would your paper still be worthy of publication? It is crucial to remember that when the objective is infant-system evaluation, comparing with GPT-4, which receives much more human knowledge through pre-training and prompting, violates the very first principle of this type of evaluation—ensuring all systems consume the same amount of human knowledge. Another troubling trend is the undue admiration for real-world-mimicking benchmarks and results that demonstrate only weak, in-distribution generalization, while undervaluing those benchmarks and results that, despite appearing artificial and simplistic, reveal strong compositional generalization.

Solution Ideas

As previously mentioned, there is no simple solution to this complex issue. However, it is imperative to address it, even partially, to develop robust AI systems that can enhance our society—or at the very least, do no harm. Below I suggest some potential remedies. While none of these can resolve the problem entirely on their own, and some may be challenging to implement immediately, I hope that they at least inspire more people to start thinking about the solution.

Let us begin with some general principles:

P1: Test a model in conditions that closely resemble its intended use.

P2: Evaluate a model in scenarios unfamiliar to the human prompter.

P3: Remain vigilant against overfitting when developing and evaluating LLM-based systems.

Here are some concrete suggestions:

For Government:

Establish an agency to audit all commercial LLM-based products. Treat these models like cars. Require them to pass a “smog test” before deployment and periodically re-evaluate them. The government is uniquely positioned to create, maintain, and update these rigorous and confidential evaluations.

For AI Companies:

Invest in thorough evaluation processes. Develop your own benchmarks and share them with the research community instead of solely relying on academic benchmarks. Despite their expertise, researchers often lack the resources to create the highest quality benchmarks. Regularly update your benchmarks to ensure they remain challenging. Conduct clinical trials with diverse user groups before releasing any product, and highlight robust performance in these trials rather than in artificial benchmarks.

Be transparent about your methodologies. Revealing model prompts (at least to authenticated researchers) is crucial for accurately measuring current models’ capabilities. This practice would also benefit the company, as it will ultimately improve its products.

For Researchers:

Assess the same LLM with various prompts and test the same prompt on multiple LLMs. The rationale is that the performance of a prompted LLM is a function of two random variables: the LLM and the prompt. To compute the expected performance of the LLM (or the prompt), we marginalize out the prompt (or the LLM).
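Concretely, assuming a distribution \(\mathcal{P}\) over prompts and a distribution \(\mathcal{M}\) over LLMs (both of which must be chosen by the evaluator), these expectations can be written as

\[
\bar{E}(M) = \mathbb{E}_{P \sim \mathcal{P}}\big[E(M \oplus P)\big],
\qquad
\bar{E}(P) = \mathbb{E}_{M \sim \mathcal{M}}\big[E(M \oplus P)\big],
\]

and estimated in practice by averaging over the sampled prompts and LLMs.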

Create benchmarks that closely mirror real-world challenges, making it difficult for humans to describe solution strategies through prompting. Rank the participating systems according to their performance on the development set first, and then by their performance on the test set. This approach reduces the tendency of developers to overfit the test set, as they need to overfit the development set first to obtain a high rank. Note that for this approach to be effective, the benchmark must be challenging.
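A minimal sketch of this ranking scheme, with hypothetical system names and scores:

```python
# Hypothetical leaderboard entries: (system name, dev-set score, test-set score).
entries = [
    ("system_a", 0.81, 0.74),
    ("system_b", 0.77, 0.88),
    ("system_c", 0.81, 0.69),
]

# Rank primarily by dev-set performance; break ties by test-set performance.
ranking = sorted(entries, key=lambda e: (e[1], e[2]), reverse=True)

for rank, (name, dev, test) in enumerate(ranking, start=1):
    print(rank, name, dev, test)
```

In this toy leaderboard, system_b has the best test score but ranks last because its development-set score is lower, which is exactly what removes the incentive to tune against the test set.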

Consider working with synthetic, abstract environments (e.g., minigrid, blockworld), which are generally out-of-distribution for current LLMs.

Acknowledge the extent of human knowledge embedded in your prompts and qualitatively compare it with baseline methods. If your prompts contain significant human knowledge, recognize this as a limitation of your work.

For Reviewers:

Avoid insisting that authors compare state-of-the-art LLMs when the objective is to identify the best infant system.

Performance metrics are not everything. Consider the amount of human knowledge in the prompts.

Do not undervalue tests that seem “too synthetic”; researchers are striving towards true AI, not commercial products. Tests that require strong generalizability are extremely valuable for this goal. Refrain from calling an environment “toy-ish” just because it is visually simple. Realistic-looking environments can also be “toy-ish” if they result in the selection of brittle systems that easily fall apart in the real world.